Re: DataFrame Sort gives Cannot allocate a page with more than 17179869176 bytes

2016-10-06 Thread amarouni
You can get some more insight by using the Spark history server (http://spark.apache.org/docs/latest/monitoring.html); it can show you which task is failing, along with other information that might help you debug the issue. On 05/10/2016 19:00, Babak Alipour wrote: > The issue seems to lie in
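
For reference, a minimal sketch of turning on event logging so the history server has per-task data to show (PySpark; the app name and log directory are assumptions, and the directory must exist beforehand):

    # Sketch: enable event logging for the Spark history server.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("sort-debug")  # hypothetical app name
             .config("spark.eventLog.enabled", "true")
             .config("spark.eventLog.dir", "file:/tmp/spark-events")
             .getOrCreate())

    # Then point the history server at the same directory, e.g.:
    #   SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=file:/tmp/spark-events" \
    #     sbin/start-history-server.sh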

Re: DataFrame Sort gives Cannot allocate a page with more than 17179869176 bytes

2016-10-02 Thread Babak Alipour
Thanks Vadim for sharing your experience, but I have tried a multi-JVM setup (2 workers), various sizes for spark.executor.memory (8g, 16g, 20g, 32g, 64g) and spark.executor.cores (2-4), and got the same error all along. As for the files, these are all .snappy.parquet files, resulting from inserting some data

Re: DataFrame Sort gives Cannot allocate a page with more than 17179869176 bytes

2016-10-01 Thread Vadim Semenov
oh, and try running even smaller executors, i.e. with `spark.executor.memory` <= 16GiB. I wonder what result you're going to get. On Sun, Oct 2, 2016 at 1:24 AM, Vadim Semenov wrote: > > Do you mean running a multi-JVM 'cluster' on a single machine? > Yes, that's

Re: DataFrame Sort gives Cannot allocate a page with more than 17179869176 bytes

2016-10-01 Thread Vadim Semenov
> Do you mean running a multi-JVM 'cluster' on a single machine? Yes, that's what I suggested. You can get some information here: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ > How would that affect performance/memory-consumption? If a multi-JVM setup can

Re: DataFrame Sort gives Cannot allocate a page with more than 17179869176 bytes

2016-10-01 Thread Babak Alipour
To add one more note: I tried running more, smaller executors, each with 32-64g of memory and executor.cores set to 2-4 (with 2 workers as well), and I'm still getting the same exception: java.lang.IllegalArgumentException: Cannot allocate a page with more than 17179869176 bytes at

Re: DataFrame Sort gives Cannot allocate a page with more than 17179869176 bytes

2016-10-01 Thread Babak Alipour
Do you mean running a multi-JVM 'cluster' on a single machine? How would that affect performance and memory consumption? If a multi-JVM setup can handle such a large input, then why can't a single JVM break the job down into smaller tasks? I also found that SPARK-9411 mentions making the page_size
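
As an aside, the odd-looking limit in the error message appears to be the largest page Spark's TaskMemoryManager can address: 2^31 - 1 word offsets times 8 bytes per word. A quick check (plain Python, not from the thread):

    # 17179869176 appears to be (2**31 - 1) * 8, the maximum page size.
    >>> (2**31 - 1) * 8
    17179869176

SPARK-9411, mentioned above, is about making this page size configurable.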

Re: DataFrame Sort gives Cannot allocate a page with more than 17179869176 bytes

2016-09-30 Thread Vadim Semenov
Run more, smaller executors: change `spark.executor.memory` to 32g and `spark.executor.cores` to 2-4, for example. Changing the driver's memory won't help because the driver doesn't participate in execution. On Fri, Sep 30, 2016 at 2:58 PM, Babak Alipour wrote: > Thank you for your
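
A sketch of what that could look like when building the session (memory and core values taken from the suggestion above; the master URL is an assumption):

    # Sketch: more, smaller executors instead of a few huge ones.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("spark://master:7077")       # assumed standalone master
             .config("spark.executor.memory", "32g")
             .config("spark.executor.cores", "2")
             .getOrCreate())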

Re: DataFrame Sort gives Cannot allocate a page with more than 17179869176 bytes

2016-09-30 Thread Babak Alipour
Thank you for your replies. @Mich, using LIMIT 100 in the query prevents the exception, but given that there's enough memory, I don't think this should happen even without LIMIT. @Vadim, here's the full stack trace: Caused by: java.lang.IllegalArgumentException: Cannot allocate a page

Re: DataFrame Sort gives Cannot allocate a page with more than 17179869176 bytes

2016-09-30 Thread Vadim Semenov
Can you post the whole exception stack trace? What are your executor memory settings? Right now I assume that it happens in UnsafeExternalRowSorter -> UnsafeExternalSorter:insertRecord. Running more executors with lower `spark.executor.memory` should help. On Fri, Sep 30, 2016 at 12:57 PM,

Re: DataFrame Sort gives Cannot allocate a page with more than 17179869176 bytes

2016-09-30 Thread Mich Talebzadeh
What will happen if you LIMIT the result set to 100 rows only -- select <field> from <table> order by <field> LIMIT 100. Will that work? How about running the whole query WITHOUT the order by? HTH
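
A sketch of the two probes being suggested here (table and column names are placeholders):

    # Probe 1: does the sorted query survive with a small result set?
    spark.sql("SELECT field FROM my_table ORDER BY field DESC LIMIT 100").show()

    # Probe 2: does the full query run once ORDER BY is removed?
    spark.sql("SELECT field FROM my_table").show()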

DataFrame Sort gives Cannot allocate a page with more than 17179869176 bytes

2016-09-30 Thread Babak Alipour
Greetings everyone, I'm trying to read a single field of a Hive table stored as Parquet in Spark (~140GB for the entire table; this single field should be just a few GB) and look at the sorted output using the following: sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC") But
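
For anyone reproducing this, a sketch of the same single-column sort via the DataFrame API (the column name is a placeholder; spark.table assumes the Hive table is visible to the session):

    # Sketch: equivalent DataFrame-API form of the query above.
    from pyspark.sql.functions import desc

    field = "my_field"  # placeholder; the post does not name the column
    spark.table("MY_TABLE").select(field).sort(desc(field)).show()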

Dataframe sort

2016-07-05 Thread tan shai
Hi, I need to sort a dataframe and retrieve the bounds of each partition. dataframe.sort() uses range partitioning in the physical plan, and I need to retrieve the partition bounds. Many thanks for your help.
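
One way to get at those bounds (a sketch, not from the thread): since sort() range-partitions the data, each partition's first and last keys are its bounds, which mapPartitionsWithIndex can collect. This assumes the sort key is the first column of each row:

    # Sketch: per-partition (index, min_key, max_key) after a sort.
    def bounds(index, rows):
        first = last = None
        for row in rows:
            if first is None:
                first = row[0]
            last = row[0]
        if first is not None:      # skip empty partitions
            yield (index, first, last)

    sorted_df = df.sort("key")     # "key" is a placeholder column name
    partition_bounds = sorted_df.rdd.mapPartitionsWithIndex(bounds).collect()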

Re: pyspark dataframe sort issue

2016-05-08 Thread Buntu Dev
Thanks Davies, after I did a coalesce(1) to save as a single parquet file, I was able to get head() to return the correct order. On Sun, May 8, 2016 at 12:29 AM, Davies Liu wrote: > When you have multiple parquet files, the order of all the rows in > them is not defined.
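
For completeness, a sketch of the workaround described here (the output path is a placeholder; 'value' is the sort column named later in the thread):

    # Sketch: coalesce to one partition so a single parquet file is written
    # and the sorted order survives the round trip.
    df.sort(df.value.desc()).coalesce(1).write.parquet("/tmp/sorted_out")
    spark.read.parquet("/tmp/sorted_out").head(5)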

Re: pyspark dataframe sort issue

2016-05-08 Thread Davies Liu
When you have multiple parquet files, the order of all the rows in them is not defined. On Sat, May 7, 2016 at 11:48 PM, Buntu Dev wrote: > I'm using the pyspark dataframe API to sort by a specific column and then saving > the dataframe as a parquet file. But the resulting parquet

pyspark dataframe sort issue

2016-05-08 Thread Buntu Dev
I'm using the pyspark dataframe API to sort by a specific column and then saving the dataframe as a parquet file. But the resulting parquet file doesn't seem to be sorted. Applying sort and doing a head() on the results shows the correct results sorted by the 'value' column in desc order, as shown below: