Can you capture one or two stack traces of the local master process and pastebin them?
Thanks

On Thu, Jan 14, 2016 at 6:01 AM, Kai Wei <w...@pythian.com> wrote:
> Hi list,
>
> I ran into an issue which I think could be a bug.
>
> I have a Hive table stored as Parquet files. Let's say it's called
> testtable. I found that the code below gets stuck forever in spark-shell
> with a local master or driver/executor:
>
> sqlContext.sql("select * from testtable").rdd.cache.zipWithIndex().count
>
> But it works if I use a standalone master.
>
> I also tried several variants:
>
> Don't cache the RDD (works):
> sqlContext.sql("select * from testtable").rdd.zipWithIndex().count
>
> Cache the RDD after zipWithIndex (works):
> sqlContext.sql("select * from testtable").rdd.zipWithIndex().cache.count
>
> Use the Parquet file reader (doesn't work):
> sqlContext.read.parquet("hdfs://localhost:8020/user/hive/warehouse/testtable").rdd.cache.zipWithIndex().count
>
> Use Parquet files on the local file system (works):
> sqlContext.read.parquet("/tmp/testtable").rdd.cache.zipWithIndex().count
>
> I read the code of zipWithIndex() and the DAG visualization. I think the
> function causes Spark to first retrieve and cache n-1 partitions of the
> target table, then the last partition. Something must go wrong when the
> driver/executor tries to read the last partition from HDFS.
>
> I am using spark-1.5.2-bin-hadoop-2.6 on the Cloudera QuickStart VM 5.4.2.
>
> --
> Kai Wei
> Big Data Developer
>
> Pythian - love your data
>
> w...@pythian.com
> Tel: +1 613 565 8696 x1579
> Mobile: +61 403 572 456
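For context on why zipWithIndex touches the first n-1 partitions eagerly: it has to know how many records each earlier partition holds before it can assign a global index to records in a later partition, so it runs an extra job to count partitions 0 through n-2 and builds cumulative start offsets from those counts. A minimal sketch of that offset computation in plain Scala (no Spark; names are illustrative, not Spark's actual source):

```scala
// Sketch of zipWithIndex's offset bookkeeping: given the record count of
// each partition, compute the global index at which each partition starts.
object ZipWithIndexSketch {
  // scanLeft accumulates running sums; .init drops the last partition's
  // count, since no partition starts after it.
  def startOffsets(partitionSizes: Seq[Long]): Seq[Long] =
    partitionSizes.init.scanLeft(0L)(_ + _)

  def main(args: Array[String]): Unit = {
    // Three partitions holding 3, 5 and 2 records respectively:
    // partition 0 starts at index 0, partition 1 at 3, partition 2 at 8.
    println(startOffsets(Seq(3L, 5L, 2L)).mkString(","))  // prints 0,3,8
  }
}
```

This is why the count job in the hanging example materializes (and, with `.cache` before `zipWithIndex()`, caches) every partition except the last one first, and only then reads the final partition.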