[ https://issues.apache.org/jira/browse/SPARK-19037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787733#comment-15787733 ]
J.P Feng commented on SPARK-19037:
----------------------------------

Error logs when running dropDuplicates with a sub-query in spark-shell:

scala> spark.sql("select * from mytest limit 10").dropDuplicates("name").show

120.073: [GC [PSYoungGen: 233234K->12801K(282112K)] 378713K->165495K(624128K), 1.8045200 secs] [Times: user=6.52 sys=7.43, real=1.80 secs]
[Stage 0:> (0 + 8) / 16]
124.182: [GC [PSYoungGen: 227841K->45026K(279552K)] 380535K->202214K(621568K), 0.9970190 secs] [Times: user=2.87 sys=4.96, real=1.00 secs]
[Stage 0:> (0 + 16) / 16]
16/12/30 21:58:21 ERROR (Executor): Exception in task 0.0 in stage 1.0 (TID 16)
java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
16/12/30 21:58:21 WARN (TaskSetManager): Lost task 0.0 in stage 1.0 (TID 16, localhost, executor driver): java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
	... (same stack trace as the executor error above)

> Run count(distinct x) from sub query found some errors
> ------------------------------------------------------
>
>                 Key: SPARK-19037
>                 URL: https://issues.apache.org/jira/browse/SPARK-19037
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Shell, SQL
>    Affects Versions: 2.1.0
>         Environment: spark 2.1.0, scala 2.11
>            Reporter: J.P Feng
>              Labels: distinct, sparkSQL, sub-query
>
> When I use spark-shell or spark-sql to execute count(distinct name) over a sub-query, errors occur:
>
>     select count(distinct name) from (select * from mytest limit 10) as a
>
> If I run the same query in HiveServer2, I get the correct result.
> If I instead execute select count(name) from (select * from mytest limit 10) as a, I also get the right result.
> Besides, I found the same errors when I use distinct() or groupBy() with a sub-query.
> I think there may be a bug in key-based aggregation (reduce-by-key) jobs over a sub-query.
> I will add the errors in a new comment.
> I also tested dropDuplicates in spark-shell:
> 1. spark.sql("select * from mytest limit 10").dropDuplicates("name").show
>    throws the exceptions
> 2. spark.table("mytest").dropDuplicates("name").show
>    returns the right result

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
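The failing and succeeding cases above differ only in whether a sub-query (with limit) feeds the aggregation. Logically, dropDuplicates("name") and count(distinct name) both reduce to a group-by-key aggregation over the name column, which is consistent with them failing together on the same agg_doAggregateWithKeys codegen path. A minimal sketch of that logical equivalence, in plain Python rather than Spark so it is self-contained (the sample rows and the drop_duplicates helper are made up for illustration):

```python
# Hypothetical sample rows standing in for the reporter's "mytest" table.
rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "a"},
    {"id": 4, "name": "c"},
]

def drop_duplicates(rows, key):
    """Keep the first row seen per value of `key` -- the logical semantics
    of Dataset.dropDuplicates(key): a group-by-key that keeps one row."""
    first_per_key = {}
    for row in rows:
        first_per_key.setdefault(row[key], row)
    return list(first_per_key.values())

deduped = drop_duplicates(rows, "name")

# count(distinct name) is the number of distinct keys, so it must equal
# the number of rows dropDuplicates("name") keeps.
count_distinct = len({row["name"] for row in rows})
assert len(deduped) == count_distinct == 3
```

This only models the semantics, not the failure: in Spark the NullPointerException comes from the generated keyed-aggregation code, not from the deduplication logic itself, which is why the same data succeeds when read directly via spark.table("mytest").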