Can you capture one or two stack traces of the local master process and
pastebin them?
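
If the shell is completely stuck, running jstack <pid> against the
spark-shell JVM from another terminal works well. If the REPL still
responds, something like this Scala snippet (a rough sketch that prints to
stdout) gives the same information:

import scala.collection.JavaConverters._
// dump every live thread's name, state and stack, similar to jstack output
Thread.getAllStackTraces.asScala.foreach { case (t, frames) =>
  println(s"Thread ${t.getName} (${t.getState}):")
  frames.foreach(f => println(s"  at $f"))
}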

Thanks

On Thu, Jan 14, 2016 at 6:01 AM, Kai Wei <w...@pythian.com> wrote:

> Hi list,
>
> I ran into an issue which I think could be a bug.
>
> I have a Hive table stored as Parquet files; let's say it's called
> testtable. I found that the code below hangs forever in spark-shell with a
> local master (driver and executor in one JVM):
> sqlContext.sql("select * from testtable").rdd.cache.zipWithIndex().count
>
> But it works if I use a standalone master.
>
> I also tried several different variants:
> don't cache the RDD (works):
> sqlContext.sql("select * from testtable").rdd.zipWithIndex().count
>
> cache the RDD after zipWithIndex() (works):
> sqlContext.sql("select * from testtable").rdd.zipWithIndex().cache.count
>
> use the Parquet file reader directly (doesn't work):
>
> sqlContext.read.parquet("hdfs://localhost:8020/user/hive/warehouse/testtable").rdd.cache.zipWithIndex().count
>
> use Parquet files on the local file system (works):
> sqlContext.read.parquet("/tmp/testtable").rdd.cache.zipWithIndex().count
>
> I read the code of zipWithIndex() and looked at the DAG visualization. I
> think the function causes Spark to first retrieve and cache the first n-1
> partitions of the target table, and only then the last partition.
> Something must go wrong when the driver/executor tries to read that last
> partition from HDFS.
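>
> For what it's worth, here is a minimal Scala sketch of the mechanism I
> suspect (it mirrors the start-index computation in Spark's
> ZippedWithIndexRDD; the variable names are mine): zipWithIndex runs an
> eager job over the first n-1 partitions to count their elements, so with
> cache() those partitions are materialized before the last one is touched:
>
> val rdd = sc.parallelize(1 to 10, 4)
> val n = rdd.partitions.length
> // count the first n-1 partitions with an extra, eagerly-run Spark job
> val counts = sc.runJob(rdd, (it: Iterator[Int]) => it.size, 0 until n - 1)
> // prefix sums give the starting index of each partition
> val startIndices = counts.scanLeft(0L)(_ + _)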
>
> I am using spark-1.5.2-bin-hadoop-2.6 on the Cloudera QuickStart VM 5.4.2.
>
> --
> Kai Wei
> Big Data Developer
>
> Pythian - love your data
>
> w...@pythian.com
> Tel: +1 613 565 8696 x1579
> Mobile: +61 403 572 456
>