[ https://issues.apache.org/jira/browse/SPARK-6009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-6009.
------------------------------
    Resolution: Duplicate

> IllegalArgumentException thrown by TimSort when SQL ORDER BY RAND ()
> --------------------------------------------------------------------
>
>                 Key: SPARK-6009
>                 URL: https://issues.apache.org/jira/browse/SPARK-6009
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.4.0
>         Environment: Centos 7, Hadoop 2.6.0, Hive 0.15.0
> java version "1.7.0_75"
> OpenJDK Runtime Environment (rhel-2.5.4.2.el7_0-x86_64 u75-b13)
> OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)
>            Reporter: Paul Barber
>
> Running the following SparkSQL query over JDBC:
> {noformat}
> SELECT *
>     FROM FAA
>     WHERE Year >= 1998 AND Year <= 1999
>     ORDER BY RAND () LIMIT 100000
> {noformat}
> This results in one or more workers throwing the following exception,
> with variations for {{mergeLo}} and {{mergeHi}}.
> {noformat}
> :java.lang.IllegalArgumentException: Comparison method violates its general contract!
> - at java.util.TimSort.mergeHi(TimSort.java:868)
> - at java.util.TimSort.mergeAt(TimSort.java:485)
> - at java.util.TimSort.mergeCollapse(TimSort.java:410)
> - at java.util.TimSort.sort(TimSort.java:214)
> - at java.util.Arrays.sort(Arrays.java:727)
> - at org.spark-project.guava.common.collect.Ordering.leastOf(Ordering.java:708)
> - at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
> - at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1138)
> - at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1135)
> - at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
> - at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
> - at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> - at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> - at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
> - at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> - at org.apache.spark.scheduler.Task.run(Task.scala:56)
> - at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> - at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> - at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> - at java.lang.Thread.run(Thread.java:745)
> {noformat}
> We have tested with both Spark 1.2.0 and Spark 1.2.1 and have seen the
> same error in both. The query sometimes succeeds, but fails more often
> than not.
> Whilst this sounds similar to bugs 3032 and 3656, we believe it is not
> the same.
> The {{ORDER BY RAND ()}} is using TimSort to produce the random ordering
> by sorting a list of random values. Having spent some time looking at
> the issue with jdb, it appears that the problem is triggered by the
> random values being changed during the sort. The code which triggers
> this is in
> {{sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Row.scala}}
> (class RowOrdering, function compare, line 250), where a new random
> number is taken for the same row.
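The failure mode described in the report, a sort key that is redrawn every time two rows are compared, can be reproduced outside Spark. The following is a minimal plain-Java sketch, not Spark's code: the class `RandomOrderSketch` and its methods are hypothetical names chosen for illustration. It shows (a) a comparator that draws fresh random numbers on each call, so the same pair can compare differently from one call to the next, and (b) one way to avoid this by drawing a single random key per row before sorting. Whether Spark's eventual fix took this shape is not stated in the report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

final class RandomOrderSketch {

    // Hypothetical stand-in for a comparator that re-evaluates RAND() on every
    // call: the result ignores the rows and depends only on fresh random draws.
    static <T> Comparator<T> unstableRandomComparator(Random rng) {
        return (a, b) -> Double.compare(rng.nextDouble(), rng.nextDouble());
    }

    // Counts how often compare(a, b) disagrees in sign with its first answer.
    // Any non-zero count means the comparator is not a consistent ordering.
    static <T> int countSignFlips(Comparator<T> cmp, T a, T b, int trials) {
        int first = Integer.signum(cmp.compare(a, b));
        int flips = 0;
        for (int i = 0; i < trials; i++) {
            if (Integer.signum(cmp.compare(a, b)) != first) {
                flips++;
            }
        }
        return flips;
    }

    // Draws exactly one random key per row up front, then sorts by that fixed
    // key, so every comparison of the same two rows gives the same answer.
    static <T> List<T> orderByPrecomputedRandomKey(List<T> rows, Random rng) {
        double[] keys = new double[rows.size()];
        Integer[] index = new Integer[rows.size()];
        for (int i = 0; i < rows.size(); i++) {
            keys[i] = rng.nextDouble();
            index[i] = i;
        }
        Arrays.sort(index, Comparator.comparingDouble((Integer i) -> keys[i]));
        List<T> out = new ArrayList<>(rows.size());
        for (int i : index) {
            out.add(rows.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        Integer[] rows = new Integer[10000];
        for (int i = 0; i < rows.length; i++) {
            rows[i] = i;
        }

        Comparator<Integer> unstable = unstableRandomComparator(new Random());
        System.out.println("sign flips for one pair: "
                + countSignFlips(unstable, 1, 2, 1000));

        // TimSort may or may not detect the inconsistency on a given run.
        try {
            Arrays.sort(rows.clone(), unstable);
            System.out.println("unstable sort finished without an exception");
        } catch (IllegalArgumentException e) {
            System.out.println("unstable sort failed: " + e.getMessage());
        }

        List<Integer> shuffled =
                orderByPrecomputedRandomKey(Arrays.asList(rows), new Random());
        System.out.println("first rows with fixed keys: " + shuffled.subList(0, 5));
    }
}
```

With the unstable comparator the sort is not guaranteed to fail on every run, which fits the observation that the query sometimes succeeds. With fixed keys the comparator is an ordinary ordering on doubles, so TimSort's consistency check has nothing to trip over.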