Re: SparkSQL with large result size

2016-05-10 Thread Buntu Dev
Thanks Chris for pointing out the issue. I think I was able to get over this issue by:
- repartitioning to increase the number of partitions (about 6k partitions)
- applying sort() on the resulting DataFrame to coalesce into a single sorted partition file
- reading the sorted file and then adding just
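
A minimal sketch of the steps described above, assuming a hypothetical input path and using the ordering column c1 from the original question later in the thread; the 6000-partition figure is the one mentioned above.

    // Rough sketch of the workaround described above; paths are illustrative.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("large-sorted-result").getOrCreate()

    val df = spark.read.parquet("/path/to/input")   // hypothetical input dataset

    // Repartition so no single task holds too much data (~6k partitions, as above).
    val repartitioned = df.repartition(6000)

    // Sort on the ordering column; Spark range-partitions the data for a total sort,
    // so the written files are globally ordered when read back in file order.
    val sorted = repartitioned.sort("c1")

    // Write the sorted result to storage instead of collecting it on the driver.
    sorted.write.parquet("/path/to/sorted-output")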

Re: SparkSQL with large result size

2016-05-10 Thread Christophe Préaud
Hi, You may be hitting this bug: SPARK-9879. In other words: did you try without the LIMIT clause? Regards, Christophe.

Re: SparkSQL with large result size

2016-05-02 Thread Gourav Sengupta
Hi, I have worked on 300 GB of data by querying it from CSV (using Spark CSV), writing it to Parquet format, and then querying the Parquet data to partition it and write out individual CSV files, without any issues on a single-node Spark cluster installation. Are you trying to
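
A rough sketch of the CSV-to-Parquet flow described above; paths, options, and the partition column are illustrative, and the CSV reader here is the built-in one from Spark 2.0+ (the spark-csv package plays the same role on 1.x).

    // Sketch of the flow described above: CSV -> Parquet -> partitioned CSV output.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("csv-parquet-flow").getOrCreate()

    // Read the raw CSV (spark-csv package on Spark 1.x; built in since 2.0).
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/raw/*.csv")

    // Persist as Parquet so repeated queries don't re-parse the CSV.
    raw.write.parquet("/data/parquet")

    // Query the Parquet copy and write out partitioned CSV files.
    spark.read.parquet("/data/parquet")
      .write
      .partitionBy("some_key")        // illustrative partition column
      .option("header", "true")
      .csv("/data/csv-out")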

Re: SparkSQL with large result size

2016-05-02 Thread Buntu Dev
Thanks Ted, I thought the avg. block size was already low and less than the usual 128 MB. If I need to reduce it further via parquet.block.size, it would mean an increase in the number of blocks, and that should increase the number of tasks/executors. Is that the correct way to interpret this?
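
A minimal sketch of lowering parquet.block.size at write time, assuming an existing SparkSession named spark and a DataFrame df; the 32 MB value and output path are purely illustrative.

    // Sketch: lower the Parquet row-group ("block") size at write time so the same
    // data is split into more, smaller blocks -- and hence more read tasks.
    // Assumes an existing SparkSession `spark` and DataFrame `df`; 32 MB is illustrative.
    spark.sparkContext.hadoopConfiguration
      .setInt("parquet.block.size", 32 * 1024 * 1024)

    df.write.parquet("/path/to/output-small-blocks")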

Re: SparkSQL with large result size

2016-05-02 Thread Ted Yu
Please consider decreasing the block size. Thanks

Re: SparkSQL with large result size

2016-05-02 Thread ayan guha
How many executors are you running? Does your partitioning scheme ensure the data is distributed evenly? It is possible that your data is skewed and one of the executors is failing. Maybe you can try reducing per-executor memory and increasing the number of partitions.
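
A sketch of the kind of checks suggested above, assuming an existing SparkSession named spark and a DataFrame df, with c1 (from the original question) as the column of interest; the partition count is illustrative.

    // Sketch of the checks suggested above (names and values illustrative).
    import org.apache.spark.sql.functions.{count, desc}

    // Rough skew check: a handful of keys with far more rows than the rest
    // suggests skewed data landing on a few executors.
    df.groupBy("c1")
      .agg(count("*").as("rows"))
      .orderBy(desc("rows"))
      .show(20)

    // Spread the data over more (and smaller) partitions so each task holds less.
    spark.conf.set("spark.sql.shuffle.partitions", "2000")
    val spread = df.repartition(2000)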

SparkSQL with large result size

2016-05-01 Thread Buntu Dev
I have a 10g limit on the executors and am operating on a Parquet dataset with a block size of 70 MB and 200 blocks. I keep hitting the memory limits when doing a 'select * from t1 order by c1 limit 1000000' (i.e., 1M). It works if I limit to, say, 100k. What are the options to save a large dataset without
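
To make the setup concrete, here is a hedged sketch of the failing pattern and one way to save a large result without pulling it through a huge LIMIT; table and column names come from the question, while paths and the explanatory comments are assumptions.

    // Sketch of the situation described above; table/column names from the question,
    // paths illustrative. Executors are capped at 10g (e.g. --executor-memory 10g).

    // Hits the memory limit: the large global ORDER BY ... LIMIT concentrates the
    // whole limited result in very few places (a single task or the driver).
    // val big = spark.sql("SELECT * FROM t1 ORDER BY c1 LIMIT 1000000")

    // Works: a much smaller limit stays within the memory cap.
    val small = spark.sql("SELECT * FROM t1 ORDER BY c1 LIMIT 100000")

    // One option for saving the full result: drop the LIMIT and write the ordered
    // data out in a distributed way (the direction the rest of the thread takes).
    spark.sql("SELECT * FROM t1 ORDER BY c1").write.parquet("/path/to/t1-ordered")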