[ https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695956#comment-15695956 ]
holdenk commented on SPARK-17788:
---------------------------------

This is semi-expected behaviour: the range partitioner (and really all Spark partitioners) doesn't support creating a split within a single key (e.g. if 70% of your data has the same key and you are partitioning on that key, 70% of that data is going to end up in the same partition). We could try to fix this in a few ways: either by having Spark SQL do something special in this case, or by having Spark's sortBy automatically add "noise" to the key when the sampling indicates there is too much data for a given key, or by allowing partitioners to be non-deterministic and updating the general sortBy logic in Spark. I think this would be something good for us to consider, but it's probably going to take a while (and certainly not in time for 2.1.0).

> RangePartitioner results in few very large tasks and many small to empty tasks
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-17788
>                 URL: https://issues.apache.org/jira/browse/SPARK-17788
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.0.0
>        Environment: Ubuntu 14.04 64bit
>                     Java 1.8.0_101
>            Reporter: Babak Alipour
>
> Greetings everyone,
>
> I was trying to read a single field of a Hive table stored as Parquet in
> Spark (~140GB for the entire table; this single field is a Double, ~1.4B
> records) and look at the sorted output using the following:
>
> sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC")
>
> But this simple line of code gives:
>
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with
> more than 17179869176 bytes
>
> Same error for:
>
> sql("SELECT " + field + " FROM MY_TABLE").sort(field)
>
> and:
>
> sql("SELECT " + field + " FROM MY_TABLE").orderBy(field)
>
> After doing some searching, the issue seems to lie in the RangePartitioner
> trying to create equal ranges. [1]
>
> [1]
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html
>
> The Double values I'm trying to sort are mostly in the range [0,1] (~70% of
> the data, which roughly equates to 1 billion records); other numbers in the
> dataset are as high as 2000. With the RangePartitioner trying to create equal
> ranges, some tasks are becoming almost empty while others are extremely
> large, due to the heavily skewed distribution.
>
> This is either a bug in Apache Spark or a major limitation of the framework.
> I hope one of the devs can help solve this issue.
>
> P.S. Email thread on Spark user mailing list:
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E
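For anyone hitting this before a proper fix lands, here is a minimal sketch of the manual "noise"/salting workaround along the lines holdenk describes above. This is not Spark's implementation, just one way to spread a heavily repeated key across partitions by sorting on a composite (value, salt) key; `MY_TABLE`, `my_field`, and `numPartitions` are placeholder assumptions, not names from the original report.

import org.apache.spark.sql.SparkSession
import scala.util.Random

object SkewedSortSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("skewed-sort-sketch").getOrCreate()
    import spark.implicits._

    // Placeholder table/column standing in for the reporter's Double field.
    val values = spark.sql("SELECT my_field FROM MY_TABLE").as[Double].rdd

    val numPartitions = 200  // arbitrary; pick based on cluster size

    // Sort on (value, salt) instead of the raw value. Ties on the value are
    // broken by the random salt, so a run of identical values can be split
    // across adjacent partitions instead of landing in one giant partition.
    // The output is still globally ordered by value (descending), because the
    // salt only affects ordering within ties.
    val sorted = values
      .map(v => ((v, Random.nextInt(numPartitions)), v))
      .sortByKey(ascending = false, numPartitions = numPartitions)
      .values

    sorted.take(10).foreach(println)
    spark.stop()
  }
}

Note the trade-off: the salt is non-deterministic, so rerunning the job can place equal values in different partitions, which is exactly why a built-in fix would need the "allow partitioners to be non-deterministic" change mentioned above.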