Re: sqlCtx.read.parquet yields lots of small tasks

2016-05-10 Thread Johnny W.
* num-cores), however if you use a huge amount of data then you will see more tasks, which means it has some kind of lower bound on num-tasks. It may require some digging; other formats did not seem to have this issue. On Sun, May 8, 2016 at 12:10 AM, Johnny W. <jzw.ser...@g
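
A hedged sketch of how one might check that relationship and shrink the small tasks, assuming a Spark 1.6 shell with sc and sqlCtx and a placeholder path; coalesce is a generic workaround for a small input, not a fix prescribed in this thread:

    // On YARN or standalone, defaultParallelism is roughly
    // num-executors * num-cores, the lower bound on task count discussed above.
    println(s"defaultParallelism = ${sc.defaultParallelism}")

    // Placeholder path; the data source here holds a single small Parquet file.
    val df = sqlCtx.read.parquet("/tmp/events.parquet")
    println(s"read partitions = ${df.rdd.partitions.length}")

    // Collapsing to a few partitions avoids scheduling many near-empty
    // tasks in downstream stages.
    val compact = df.coalesce(4)
    println(s"after coalesce = ${compact.rdd.partitions.length}")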

Re: sqlCtx.read.parquet yields lots of small tasks

2016-05-08 Thread Johnny W.
Ashish Dubey <ashish@gmail.com> wrote: How big is your file, and can you also share the code snippet? On Saturday, May 7, 2016, Johnny W. <jzw.ser...@gmail.com> wrote: hi spark-user, I am using Spark 1.6.0. When I call sqlCtx.read.parq

sqlCtx.read.parquet yields lots of small tasks

2016-05-07 Thread Johnny W.
hi spark-user, I am using Spark 1.6.0. When I call sqlCtx.read.parquet to create a DataFrame from a Parquet data source with a single Parquet file, it yields a stage with lots of small tasks. The number of tasks seems to depend on how many executors I have instead of how many parquet
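
A minimal way to observe the behavior described above, assuming a Spark 1.6 spark-shell where sqlCtx is a SQLContext and the path is hypothetical, is to read the single file and check how many partitions (and therefore scan tasks) the resulting DataFrame has:

    // Sketch only: "/tmp/events.parquet" is a placeholder for a data source
    // containing a single Parquet file.
    val df = sqlCtx.read.parquet("/tmp/events.parquet")

    // One scan task is launched per partition of the underlying RDD, so this
    // count is what shows up as "lots of small tasks" in the Spark UI.
    println(s"scan partitions = ${df.rdd.partitions.length}")

    // Any action makes the scan stage (and its task count) visible in the UI.
    df.count()

Re-running the same snippet with a different number of executors should show the partition count tracking the cluster size rather than the file layout, which is the symptom reported in this thread.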

High GC time when setting custom input partitions

2016-04-10 Thread Johnny W.
Hi spark-user, I am using Spark 1.6 to build a reverse index for one month of Twitter data (~50GB). The HDFS split size is 1GB, so by default sc.textFile creates 50 partitions. I'd like to increase the parallelism by increasing the number of input partitions, so I use textFile(..., 200) to
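
A small sketch, assuming a Spark 1.6 SparkContext named sc and a hypothetical HDFS path, of the two common ways to end up with roughly 200 input partitions; the GC-logging conf at the end uses standard JVM flags and is just one way to dig into the high GC time, not a fix from this thread:

    // minPartitions is a hint passed to the Hadoop InputFormat, so each of the
    // ~50 1GB splits is divided further to reach roughly 200 partitions.
    val hinted = sc.textFile("hdfs:///data/tweets/2016-03", 200)
    println(s"hinted partitions = ${hinted.partitions.length}")

    // Alternative: keep the default HDFS splits and shuffle into 200 partitions.
    val reshuffled = sc.textFile("hdfs:///data/tweets/2016-03").repartition(200)

    // To see where the reported GC time goes, executor GC logging can be
    // enabled at submit time with standard JVM options:
    //   --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails"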