[ https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Herman van Hovell updated SPARK-17788:
--------------------------------------
    Target Version/s: 2.1.0

> RangePartitioner results in few very large tasks and many small to empty
> tasks
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-17788
>                 URL: https://issues.apache.org/jira/browse/SPARK-17788
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.0.0
>         Environment: Ubuntu 14.04 64bit
>                      Java 1.8.0_101
>            Reporter: Babak Alipour
>
> Greetings everyone,
> I was trying to read a single field of a Hive table stored as Parquet in
> Spark (~140GB for the entire table, this single field is a Double, ~1.4B
> records) and look at the sorted output using the following:
> sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC")
> But this simple line of code gives:
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with
> more than 17179869176 bytes
> Same error for:
> sql("SELECT " + field + " FROM MY_TABLE").sort(field)
> and:
> sql("SELECT " + field + " FROM MY_TABLE").orderBy(field)
> After doing some searching, the issue seems to lie in the RangePartitioner
> trying to create equal ranges. [1]
> [1]
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html
>
> The Double values I'm trying to sort are mostly in the range [0,1] (~70% of
> the data, which roughly equates to 1 billion records); other numbers in the
> dataset are as high as 2000. With the RangePartitioner trying to create equal
> ranges, some tasks are becoming almost empty while others are extremely
> large, due to the heavily skewed distribution.
> This is either a bug in Apache Spark or a major limitation of the framework.
> I hope one of the devs can help solve this issue.
> P.S.
> Email thread on Spark user mailing list:
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)