[ https://issues.apache.org/jira/browse/SPARK-19037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787733#comment-15787733 ]

J.P Feng commented on SPARK-19037:
----------------------------------

Error log when running dropDuplicates on a sub-query in spark-shell:

scala> spark.sql("select * from mytest limit 10").dropDuplicates("name").show
120.073: [GC [PSYoungGen: 233234K->12801K(282112K)] 378713K->165495K(624128K), 1.8045200 secs] [Times: user=6.52 sys=7.43, real=1.80 secs]
[Stage 0:>                                                         (0 + 8) / 16]
124.182: [GC [PSYoungGen: 227841K->45026K(279552K)] 380535K->202214K(621568K), 0.9970190 secs] [Times: user=2.87 sys=4.96, real=1.00 secs]
[Stage 0:>                                                        (0 + 16) / 16]
16/12/30 21:58:21 ERROR (Executor): Exception in task 0.0 in stage 1.0 (TID 16)
java.lang.NullPointerException
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
16/12/30 21:58:21 WARN (TaskSetManager): Lost task 0.0 in stage 1.0 (TID 16, localhost, executor driver): java.lang.NullPointerException
        (same NullPointerException stack trace as above)
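
For anyone trying to reproduce this locally, a hypothetical spark-shell setup (the mytest table and its schema are assumptions; the issue report quoted below only says the table has a column "name"):

scala> // Assumed schema: a "name" column plus an arbitrary second column
scala> Seq(("a", 1), ("a", 2), ("b", 3)).toDF("name", "value").write.saveAsTable("mytest")

scala> // The call that produced the NullPointerException above
scala> spark.sql("select * from mytest limit 10").dropDuplicates("name").show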

> Running count(distinct x) on a sub-query produces errors
> ---------------------------------------------------------
>
>                 Key: SPARK-19037
>                 URL: https://issues.apache.org/jira/browse/SPARK-19037
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Shell, SQL
>    Affects Versions: 2.1.0
>         Environment: spark 2.1.0, scala 2.11 
>            Reporter: J.P Feng
>              Labels: distinct, sparkSQL, sub-query
>
> When I use spark-shell or spark-sql to execute count(distinct name) over a 
> sub-query, errors occur:
> select count(distinct name) from (select * from mytest limit 10) as a
> If I run the same query in hive-server2, I get the correct result.
> If I instead execute select count(name) from (select * from mytest limit 10) 
> as a, I also get the right result.
> Besides, I found the same errors when using distinct() or groupBy() with a 
> sub-query.
> I think there may be a bug when doing key-reduce jobs over a sub-query.
> I will add the errors in a new comment.
> I also tested dropDuplicates in spark-shell (see the sketch after this 
> quoted report):
> 1. spark.sql("select * from mytest limit 10").dropDuplicates("name").show
> throws the exceptions shown above
> 2. spark.table("mytest").dropDuplicates("name").show
> returns the right result
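
For reference, a side-by-side sketch of the reported behaviour in spark-shell, assuming the mytest table set up in the comment above (which calls fail versus succeed is taken from the report, not re-verified here):

scala> // Reported to fail with the NullPointerException above (aggregation over a limited sub-query):
scala> spark.sql("select count(distinct name) from (select * from mytest limit 10) as a").show
scala> spark.sql("select * from mytest limit 10").dropDuplicates("name").show

scala> // Reported to succeed:
scala> spark.sql("select count(name) from (select * from mytest limit 10) as a").show
scala> spark.table("mytest").dropDuplicates("name").show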


