Hi list, I ran into an issue which I think could be a bug.
I have a Hive table stored as Parquet files; let's say it's called testtable. I found that the code below gets stuck forever in spark-shell with a local master or driver/executor:

    sqlContext.sql("select * from testtable").rdd.cache.zipWithIndex().count

But it works if I use a standalone master. I also tried several variants:

Don't cache the RDD (works):

    sqlContext.sql("select * from testtable").rdd.zipWithIndex().count

Cache the RDD after zipWithIndex (works):

    sqlContext.sql("select * from testtable").rdd.zipWithIndex().cache.count

Use the Parquet file reader (doesn't work):

    sqlContext.read.parquet("hdfs://localhost:8020/user/hive/warehouse/testtable").rdd.cache.zipWithIndex().count

Use Parquet files on the local file system (works):

    sqlContext.read.parquet("/tmp/testtable").rdd.cache.zipWithIndex().count

I read the code of zipWithIndex() and looked at the DAG visualization. I think the function causes Spark to first retrieve and cache the first n-1 partitions of the target table, and then fetch the last partition. Something must go wrong when the driver/executor tries to read that last partition from HDFS.

I am using spark-1.5.2-bin-hadoop-2.6 on the Cloudera QuickStart VM 5.4.2.

--
Kai Wei
Big Data Developer
Pythian - love your data

w...@pythian.com
Tel: +1 613 565 8696 x1579
Mobile: +61 403 572 456
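For anyone unfamiliar with why zipWithIndex needs two passes: Spark first runs a counting job over the first n-1 partitions, then uses the cumulative counts as the starting index for each partition. Below is a minimal sketch of that offset logic using plain Scala collections in place of RDD partitions (the object name and the representation of partitions as Seq[Seq[T]] are my own illustration, not Spark's actual ZippedWithIndexRDD code):

    // Sketch: how per-partition start offsets are derived for zipWithIndex.
    object ZipWithIndexSketch {
      def zipWithIndex[T](partitions: Seq[Seq[T]]): Seq[Seq[(T, Long)]] = {
        // "Counting pass": only the first n-1 partitions need to be counted,
        // which is why Spark launches a job over partitions 0..n-2 first.
        val counts = partitions.dropRight(1).map(_.size.toLong)
        // Running totals give the start offset of every partition.
        val startOffsets = counts.scanLeft(0L)(_ + _)
        // Second pass: tag each element with (start offset + local index).
        partitions.zip(startOffsets).map { case (part, offset) =>
          part.zipWithIndex.map { case (elem, i) => (elem, offset + i) }
        }
      }

      def main(args: Array[String]): Unit = {
        val parts = Seq(Seq("a", "b"), Seq("c"), Seq("d", "e"))
        println(zipWithIndex(parts))
        // Partitions 0 and 1 are counted first (2 and 1 elements),
        // so partition 2's elements get indices starting at 3.
      }
    }

This is consistent with what the DAG shows: the counting job touches all partitions except the last, and with .cache in front it also materializes those partitions, which is where the hang appears to differ between local and standalone masters.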