[jira] [Commented] (SPARK-27039) toPandas with Arrow swallows maxResultSize errors
[ https://issues.apache.org/jira/browse/SPARK-27039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090352#comment-17090352 ]

peay commented on SPARK-27039:
------------------------------

[~hyukjin.kwon] do you know if this was eventually backported to 2.4.x? This was more than a year ago, and I suspect 3.0 is still some way off (taking the official release and our own deployment into account), while this is a rather sneaky correctness bug.

> toPandas with Arrow swallows maxResultSize errors
> -------------------------------------------------
>
>                 Key: SPARK-27039
>                 URL: https://issues.apache.org/jira/browse/SPARK-27039
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.0
>            Reporter: peay
>            Priority: Minor
>
> I am running the following simple {{toPandas}} with {{spark.driver.maxResultSize}} set to 1 MB:
> {code:python}
> import pyspark.sql.functions as F
>
> df = spark.range(1000 * 1000)
> df_pd = df.withColumn(
>     "test",
>     F.lit("this is a long string that should make the resulting dataframe too large for maxResult which is 1m"),
> ).toPandas()
> {code}
> With {{spark.sql.execution.arrow.enabled}} set to {{true}}, this returns an empty Pandas dataframe without any error:
> {code:python}
> df_pd.info()
> #
> # Index: 0 entries
> # Data columns (total 2 columns):
> # id      0 non-null object
> # test    0 non-null object
> # dtypes: object(2)
> # memory usage: 0.0+ bytes
> {code}
> The driver stderr does have an error, and so does the Spark UI:
> {code:java}
> ERROR TaskSetManager: Total size of serialized results of 1 tasks (52.8 MB) is bigger than spark.driver.maxResultSize (1024.0 KB)
> ERROR TaskSetManager: Total size of serialized results of 2 tasks (105.7 MB) is bigger than spark.driver.maxResultSize (1024.0 KB)
> Exception in thread "serve-Arrow" org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 1 tasks (52.8 MB) is bigger than spark.driver.maxResultSize (1024.0 KB)
> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2039)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2027)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2026)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2026)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
> at scala.Option.foreach(Option.scala:257)
> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:966)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2260)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2209)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2198)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:777)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
> at org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3313)
> at org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3282)
> at org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply$mcV$sp(PythonRDD.scala:435)
> at org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435)
> at org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> at org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:436)
> at org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:432)
> at org.apache.spark.api.python.PythonServer$$anon$1.run(PythonRDD.scala:862)
> {code}
> With {{spark.sql.execution.arrow.enabled}} set to {{false}}, the Python call to {{toPandas}} does fail as expected.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
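The description above notes that with {{spark.sql.execution.arrow.enabled}} set to {{false}}, {{toPandas}} fails loudly as it should. Until the fix lands, one defensive option on 2.4.x is to disable Arrow just around the collection. The sketch below is a hypothetical helper (the name {{to_pandas_strict}} is mine, not a Spark API), trading Arrow's speed for a proper error:

```python
def to_pandas_strict(spark, df):
    """Collect df via the non-Arrow path so maxResultSize errors surface.

    Workaround sketch for SPARK-27039 on 2.4.x (hypothetical helper):
    with spark.sql.execution.arrow.enabled=false, toPandas() raises on
    oversized results instead of silently returning an empty frame.
    Restores the previous Arrow setting afterwards.
    """
    key = "spark.sql.execution.arrow.enabled"
    previous = spark.conf.get(key)
    try:
        spark.conf.set(key, "false")
        return df.toPandas()
    finally:
        spark.conf.set(key, previous)
```

The try/finally keeps the session conf unchanged for other queries even if the collection itself throws.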
[jira] [Commented] (SPARK-27039) toPandas with Arrow swallows maxResultSize errors
[ https://issues.apache.org/jira/browse/SPARK-27039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16786658#comment-16786658 ]

peay commented on SPARK-27039:
------------------------------

For reference, I've realized you can also get an incomplete but non-empty pandas DataFrame in 2.4. This is more worrying because it is harder to detect (without manually looking at the driver logs, that is), and it can lead to serious correctness issues. You can do a {{count()}} and compare it to the size of the output of {{toPandas}}, but that is potentially quite inefficient. Since this is fixed in master, I haven't tried to make a minimal example to reproduce, but I believe that with unbalanced partitions, you can end up consuming the iterator for a partition that is small enough before the Spark job actually fails when the larger partitions are attempted.
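The {{count()}}-versus-{{toPandas}} comparison suggested in the comment above can be wrapped in a small guard. A sketch under my own naming ({{to_pandas_checked}} is not a Spark API), with the caveat from the comment that the extra {{count()}} triggers a second Spark job:

```python
def to_pandas_checked(df):
    """Collect via toPandas() and verify the row count against count().

    Guards against the silent truncation in SPARK-27039 on 2.4.x, where an
    Arrow-enabled toPandas() can return an empty or partial result after a
    spark.driver.maxResultSize failure. Hypothetical helper, not a Spark API.
    """
    expected = df.count()  # extra job over the data, but no driver-side collect
    pdf = df.toPandas()
    if len(pdf) != expected:
        raise RuntimeError(
            f"toPandas() returned {len(pdf)} rows, but count() reports "
            f"{expected}; the driver may have hit spark.driver.maxResultSize"
        )
    return pdf
```

This catches both the empty-result and the partial-result variants of the bug, at the cost of scanning the data twice.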
[jira] [Commented] (SPARK-24624) Can not mix vectorized and non-vectorized UDFs
[ https://issues.apache.org/jira/browse/SPARK-24624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16786456#comment-16786456 ]

peay commented on SPARK-24624:
------------------------------

I mean regular aggregation functions and Pandas UDF aggregation functions (i.e., expressions of the form {{.groupBy(key).agg(F.avg("col"), pd_agg_udf("col2"))}}). {{master}} still seems to require the aggregation expressions to either all be regular aggregation functions or all be Pandas UDFs: [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L467], right?

> Can not mix vectorized and non-vectorized UDFs
> ----------------------------------------------
>
>                 Key: SPARK-24624
>                 URL: https://issues.apache.org/jira/browse/SPARK-24624
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Xiao Li
>            Assignee: Li Jin
>            Priority: Major
>             Fix For: 2.4.0
>
> In the current implementation, we have the limitation that users are unable to mix vectorized and non-vectorized UDFs in the same Project. This becomes worse since our optimizer can combine consecutive Projects into a single one. For example,
> {code}
> applied_df = df.withColumn('regular', my_regular_udf('total', 'qty')).withColumn('pandas', my_pandas_udf('total', 'qty'))
> {code}
> returns the following error:
> {code}
> IllegalArgumentException: Can not mix vectorized and non-vectorized UDFs
> java.lang.IllegalArgumentException: Can not mix vectorized and non-vectorized UDFs
> at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$6.apply(ExtractPythonUDFs.scala:170)
> at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$6.apply(ExtractPythonUDFs.scala:146)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:285)
> at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:146)
> at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:118)
> at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$6.apply(TreeNode.scala:312)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$6.apply(TreeNode.scala:312)
> at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$8.apply(TreeNode.scala:331)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:329)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$8.apply(TreeNode.scala:331)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:329)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
> at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:114)
> at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:94)
> at org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:113)
> at org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:113)
> at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:113)
> at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:100)
> at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:99)
> at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3312)
> at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:2750)
> ...
> {code}
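While mixed aggregations are rejected by the planner, one possible workaround (a sketch under my own assumptions, not an official recipe) is to run the built-in aggregates and the Pandas UDF aggregates as two separate {{groupBy().agg()}} calls and join the results on the grouping key:

```python
def split_mixed_agg(df, key, builtin_exprs, pandas_udf_exprs):
    """Emulate .groupBy(key).agg(<built-ins> + <pandas UDFs>) where the
    planner rejects mixing the two kinds in a single Aggregate.

    Hypothetical helper: runs two separate aggregations and joins them on
    the key, at the cost of scanning the input twice plus a join.
    """
    builtin_part = df.groupBy(key).agg(*builtin_exprs)
    pandas_part = df.groupBy(key).agg(*pandas_udf_exprs)
    return builtin_part.join(pandas_part, on=key)
```

In real use, `builtin_exprs` would be e.g. `[F.avg("col")]` and `pandas_udf_exprs` e.g. `[pd_agg_udf("col2")]` from the comment above; whether the double scan is acceptable depends on the workload.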
[jira] [Commented] (SPARK-24624) Can not mix vectorized and non-vectorized UDFs
[ https://issues.apache.org/jira/browse/SPARK-24624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16785920#comment-16785920 ]

peay commented on SPARK-24624:
------------------------------

Are there plans to support something similar for aggregation functions?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...
[jira] [Commented] (SPARK-27019) Spark UI's SQL tab shows inconsistent values
[ https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16784260#comment-16784260 ]

peay commented on SPARK-27019:
------------------------------

Yes, I had edited my message above shortly after posting: I cannot reproduce the second scenario. Thanks!

> Spark UI's SQL tab shows inconsistent values
> --------------------------------------------
>
>                 Key: SPARK-27019
>                 URL: https://issues.apache.org/jira/browse/SPARK-27019
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL, Web UI
>    Affects Versions: 2.4.0
>            Reporter: peay
>            Priority: Major
>         Attachments: Screenshot from 2019-03-01 21-31-48.png, application_1550040445209_4748, query-1-details.png, query-1-list.png, query-job-1.png, screenshot-spark-ui-details.png, screenshot-spark-ui-list.png
>
> Since 2.4.0, I am frequently seeing broken outputs in the SQL tab of the Spark UI, where submitted/duration make no sense and the description shows the ID instead of the actual description.
> Clicking on the link to open a query, the SQL plan is missing as well.
> I have tried to increase `spark.scheduler.listenerbus.eventqueue.capacity` to very large values like 30k, out of paranoia that we may have too many events, but to no avail. I have not identified anything in particular that leads to this: it doesn't occur in all my jobs, but it does still occur in a lot of them.
[jira] [Comment Edited] (SPARK-27019) Spark UI's SQL tab shows inconsistent values
[ https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16784172#comment-16784172 ]

peay edited comment on SPARK-27019 at 3/5/19 7:28 AM:
------------------------------------------------------

Great!

-Is that compatible with my second observation above? (I tested without any executors, and even without any task starting, the SQL tab had the wrong output.) I can try to get an event log for that as well if that's helpful.-

edit: I tried to reproduce that to export the event log, and could not. It seems like your patch should address the issue.

was (Author: peay):
Great! Is that compatible with my second observation above? (I tested without any executors, and even without any task starting, the SQL tab had the wrong output.) I can try to get an event log for that as well if that's helpful.
[jira] [Commented] (SPARK-27019) Spark UI's SQL tab shows inconsistent values
[ https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16784172#comment-16784172 ]

peay commented on SPARK-27019:
------------------------------

Great! Is that compatible with my second observation above? (I tested without any executors, and even without any task starting, the SQL tab had the wrong output.) I can try to get an event log for that as well if that's helpful.
[jira] [Commented] (SPARK-27039) toPandas with Arrow swallows maxResultSize errors
[ https://issues.apache.org/jira/browse/SPARK-27039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16783711#comment-16783711 ]

peay commented on SPARK-27039:
------------------------------

Interesting, thanks for checking. Yes, I can definitely live without it until 3.0. Is there already a timeline for 3.0?
[jira] [Commented] (SPARK-27039) toPandas with Arrow swallows maxResultSize errors
[ https://issues.apache.org/jira/browse/SPARK-27039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16783352#comment-16783352 ] peay commented on SPARK-27039: -- Oops, sorry, I've edited the title. I meant _Arrow_, not Avro. Maybe [~bryanc] would know something about this? > toPandas with Arrow swallows maxResultSize errors > - > > Key: SPARK-27039 > URL: https://issues.apache.org/jira/browse/SPARK-27039 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: peay >Priority: Minor > > I am running the following simple `toPandas` with {{maxResultSize}} set to > 1mb: > {code:java} > import pyspark.sql.functions as F > df = spark.range(1000 * 1000) > df_pd = df.withColumn("test", F.lit("this is a long string that should make > the resulting dataframe too large for maxResult which is 1m")).toPandas() > {code} > > With {{spark.sql.execution.arrow.enabled}} set to {{true}}, this returns an > empty Pandas dataframe without any error: > {code:python} > df_pd.info() > # > # Index: 0 entries > # Data columns (total 2 columns): > # id 0 non-null object > # test0 non-null object > # dtypes: object(2) > # memory usage: 0.0+ bytes > {code} > The driver stderr does have an error, and so does the Spark UI: > {code:java} > ERROR TaskSetManager: Total size of serialized results of 1 tasks (52.8 MB) > is bigger than spark.driver.maxResultSize (1024.0 KB) > ERROR TaskSetManager: Total size of serialized results of 2 tasks (105.7 MB) > is bigger than spark.driver.maxResultSize (1024.0 KB) > Exception in thread "serve-Arrow" org.apache.spark.SparkException: Job > aborted due to stage failure: Total size of serialized results of 1 tasks > (52.8 MB) is bigger than spark.driver.maxResultSize (1024.0 KB) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2039) > at > 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2027) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2026) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2026) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:966) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2260) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2209) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2198) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:777) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061) > at > org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3313) > at > org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3282) > at > org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply$mcV$sp(PythonRDD.scala:435) > at > org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435) > at > org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at > org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:436) > at > 
org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:432) > at org.apache.spark.api.python.PythonServer$$anon$1.run(PythonRDD.scala:862) > {code} > With {{spark.sql.execution.arrow.enabled}} set to {{false}}, the Python call > to {{toPandas}} does fail as expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
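Until a fix or backport is deployed, one defensive pattern against the bug above is to compare the length of the frame returned by toPandas() with a separately computed df.count(), since the Arrow path can silently return an empty frame when the job is aborted by maxResultSize. This is an illustrative sketch, not Spark API code; check_collected_rows is a hypothetical helper, and the commented usage assumes a PySpark DataFrame named df:

```python
def check_collected_rows(n_collected, n_expected):
    """Fail loudly if a collect-style result silently lost rows.

    Guards against the Arrow toPandas() path returning an empty
    frame after a maxResultSize abort, as described in SPARK-27039.
    """
    if n_collected != n_expected:
        raise RuntimeError(
            "collected %d rows but expected %d; results may have been "
            "dropped (check spark.driver.maxResultSize)"
            % (n_collected, n_expected)
        )

# Hypothetical usage in a PySpark job (assumes a DataFrame `df`):
#   expected = df.count()
#   pdf = df.toPandas()
#   check_collected_rows(len(pdf), expected)
```

The extra count() costs a job, but it turns a silent correctness bug into an explicit failure.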
[jira] [Updated] (SPARK-27039) toPandas with Arrow swallows maxResultSize errors
[ https://issues.apache.org/jira/browse/SPARK-27039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] peay updated SPARK-27039: - Summary: toPandas with Arrow swallows maxResultSize errors (was: toPandas with Avro swallows maxResultSize errors)
[jira] [Created] (SPARK-27039) toPandas with Avro swallows maxResultSize errors
peay created SPARK-27039: Summary: toPandas with Avro swallows maxResultSize errors Key: SPARK-27039 URL: https://issues.apache.org/jira/browse/SPARK-27039 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.0 Reporter: peay
[jira] [Updated] (SPARK-27019) Spark UI's SQL tab shows inconsistent values
[ https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] peay updated SPARK-27019: - Attachment: application_1550040445209_4748 > Spark UI's SQL tab shows inconsistent values > > > Key: SPARK-27019 > URL: https://issues.apache.org/jira/browse/SPARK-27019 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.4.0 >Reporter: peay >Priority: Major > Attachments: application_1550040445209_4748, query-1-details.png, > query-1-list.png, query-job-1.png, screenshot-spark-ui-details.png, > screenshot-spark-ui-list.png > > > Since 2.4.0, I am frequently seeing broken outputs in the SQL tab of the > Spark UI, where submitted/duration make no sense, description has the ID > instead of the actual description. > Clicking on the link to open a query, the SQL plan is missing as well. > I have tried to increase `spark.scheduler.listenerbus.eventqueue.capacity` to > very large values like 30k out of paranoia that we may have too many events, > but to no avail. I have not identified anything particular that leads to > that: it doesn't occur in all my jobs, but it does occur in a lot of them > still.
[jira] [Commented] (SPARK-27019) Spark UI's SQL tab shows inconsistent values
[ https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781718#comment-16781718 ] peay commented on SPARK-27019: -- Attached: [^application_1550040445209_4748]
[jira] [Comment Edited] (SPARK-27019) Spark UI's SQL tab shows inconsistent values
[ https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781651#comment-16781651 ] peay edited comment on SPARK-27019 at 3/1/19 1:06 PM: -- Sure does. I've done one more test, using no executors and running the same command: {code:java} spark.range(1024 * 1024 * 1024 * 10).toPandas() {code} The SQL tab directly shows the wrong output, even though no task has started - i.e., my assumption above that this was because of failing tasks appears wrong. This actually seems consistent with the 'submitted date' somehow being wrong.
[jira] [Commented] (SPARK-27019) Spark UI's SQL tab shows inconsistent values
[ https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781651#comment-16781651 ] peay commented on SPARK-27019: -- Sure does. I've done one more test, using no executors and running the same command: spark.range(1024 * 1024 * 1024 * 10).toPandas() The SQL tab directly shows the wrong output, even though no task has started - i.e., my assumption above that this was because of failing tasks appears wrong. This actually seems consistent with the 'submitted date' somehow being wrong.
[jira] [Updated] (SPARK-27019) Spark UI's SQL tab shows inconsistent values
[ https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] peay updated SPARK-27019: - Attachment: query-job-1.png query-1-details.png query-1-list.png
[jira] [Commented] (SPARK-27019) Spark UI's SQL tab shows inconsistent values
[ https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781604#comment-16781604 ] peay commented on SPARK-27019: -- OK, I can actually reproduce it pretty easily with pyspark: {code:java} df_test = spark.range(1024 * 1024 * 1024 * 10).toPandas(){code} This makes the tasks fail because my executors don't have enough memory, which seems to be key to hitting the issue. Using only 1000 elements, the job succeeds and it does not trigger the issue. !query-1-list.png! !query-1-details.png! !query-job-1.png!
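As a rough check on why this repro starves the executors: spark.range(1024 * 1024 * 1024 * 10) asks for roughly 10.7 billion rows, and at 8 bytes per int64 id that is 80 GiB of raw values before any Arrow or pandas overhead. A back-of-the-envelope calculation:

```python
rows = 1024 * 1024 * 1024 * 10   # rows requested by spark.range(...)
bytes_per_row = 8                # a single int64 "id" column
raw_gib = rows * bytes_per_row / 1024**3
print(raw_gib)  # 80.0 -> ~80 GiB of raw ids alone
```

So the failure is expected at this scale; the bug is that it corrupts the UI output rather than just failing the job.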
[jira] [Commented] (SPARK-27019) Spark UI's SQL tab shows inconsistent values
[ https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781595#comment-16781595 ] peay commented on SPARK-27019: -- I don't really have a minimal example - this uses a bunch of python jobs and/or interactive works in notebook. Is there a way I could collect debug information to help troubleshooting this?
[jira] [Comment Edited] (SPARK-27019) Spark UI's SQL tab shows inconsistent values
[ https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781595#comment-16781595 ] peay edited comment on SPARK-27019 at 3/1/19 11:40 AM: --- I don't really have a minimal example - this uses a bunch of python jobs and/or interactive work in notebooks. Is there a way I could collect debug information to help troubleshooting this?
[jira] [Commented] (SPARK-27019) Spark UI's SQL tab shows inconsistent values
[ https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781583#comment-16781583 ] peay commented on SPARK-27019: -- Also seeing this both for the Spark UI for live jobs, and when accessing past jobs through the Spark History Server.
[jira] [Updated] (SPARK-27019) Spark UI's SQL tab shows inconsistent values
[ https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] peay updated SPARK-27019: - Description: Since 2.4.0, I am frequently seeing broken outputs in the SQL tab of the Spark UI, where submitted/duration make no sense, description has the ID instead of the actual description. Clicking on the link to open a query, the SQL plan is missing as well. I have tried to increase `spark.scheduler.listenerbus.eventqueue.capacity` to very large values like 30k out of paranoia that we may have too many events, but to no avail. I have not identified anything particular that leads to that: it doesn't occur in all my jobs, but it does occur in a lot of them still. was: Since 2.4.0, I am frequently seeing broken outputs in the SQL tab of the Spark UI, where submitted/duration make no sense, description has the ID instead of the actual description. Clicking on the link to open a query, the SQL plan is missing as well. I have tried to increase `spark.scheduler.listenerbus.eventqueue.capacity` to very large values like 30k out of paranoia that we may have too many events, but to no avail. I have not identified anything particular that leads to that: it doesn't occur in all my jobs, but it does occur in a lot of them still. !image-2019-03-01-12-05-40-119.png! !image-2019-03-01-12-06-48-707.png!
[jira] [Comment Edited] (SPARK-27019) Spark UI's SQL tab shows inconsistent values
[ https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781581#comment-16781581 ] peay edited comment on SPARK-27019 at 3/1/19 11:10 AM: --- Seems like the screenshots did not embed, attaching them instead. was (Author: peay): Seems like the screenshot did not embed, attaching them instead.
[jira] [Updated] (SPARK-27019) Spark UI's SQL tab shows inconsistent values
[ https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] peay updated SPARK-27019: - Attachment: screenshot-spark-ui-list.png screenshot-spark-ui-details.png
[jira] [Commented] (SPARK-27019) Spark UI's SQL tab shows inconsistent values
[ https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781581#comment-16781581 ] peay commented on SPARK-27019: -- Seems like the screenshot did not embed, attaching them instead.
[jira] [Created] (SPARK-27019) Spark UI's SQL tab shows inconsistent values
peay created SPARK-27019: Summary: Spark UI's SQL tab shows inconsistent values Key: SPARK-27019 URL: https://issues.apache.org/jira/browse/SPARK-27019 Project: Spark Issue Type: Bug Components: SQL, Web UI Affects Versions: 2.4.0 Reporter: peay Since 2.4.0, I am frequently seeing broken outputs in the SQL tab of the Spark UI, where submitted/duration make no sense, description has the ID instead of the actual description. Clicking on the link to open a query, the SQL plan is missing as well. I have tried to increase `spark.scheduler.listenerbus.eventqueue.capacity` to very large values like 30k out of paranoia that we may have too many events, but to no avail. I have not identified anything particular that leads to that: it doesn't occur in all my jobs, but it does occur in a lot of them still. !image-2019-03-01-12-05-40-119.png! !image-2019-03-01-12-06-48-707.png!
[jira] [Commented] (SPARK-24523) InterruptedException when closing SparkContext
[ https://issues.apache.org/jira/browse/SPARK-24523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16644602#comment-16644602 ] peay commented on SPARK-24523: -- I've started hitting the exact same error since I upgraded to EMR 5.17. Been running Spark 2.3.1 for a while before without issues, though, so I still suspect something EMR specific. In spite of this, AWS premium support has not been very helpful and mostly blames Spark itself. This occurs only for one particular job, which has a rather large computation graph with a union of a couple dozens of datasets. For this job, it occurs 100% of the time. I don't see any warning regarding events being dropped, but it seems that OP managed to get rid of those warnings although the original issue still persists, which seems to be our case. Manually closing the Spark context does seem to work for us. Adding more driver cores did not help. Attaching a [^thread-dump.log] from slightly before the crash in case that's useful. 
> InterruptedException when closing SparkContext > -- > > Key: SPARK-24523 > URL: https://issues.apache.org/jira/browse/SPARK-24523 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.0, 2.3.1 > Environment: EMR 5.14.0, S3/HDFS inputs and outputs; EMR 5.17 > > > >Reporter: Umayr Hassan >Priority: Major > Attachments: spark-stop-jstack.log.1, spark-stop-jstack.log.2, > spark-stop-jstack.log.3, thread-dump.log > > > I'm running a Scala application in EMR with the following properties: > {{--master yarn --deploy-mode cluster --driver-memory 13g --executor-memory > 30g --executor-cores 5 --conf spark.default.parallelism=400 --conf > spark.dynamicAllocation.enabled=true --conf > spark.dynamicAllocation.maxExecutors=20 --conf > spark.eventLog.dir=hdfs:///var/log/spark/apps --conf > spark.eventLog.enabled=true --conf > spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 --conf > spark.scheduler.listenerbus.eventqueue.capacity=2 --conf > spark.shuffle.service.enabled=true --conf spark.sql.shuffle.partitions=400 > --conf spark.yarn.maxAppAttempts=1}} > The application runs fine till SparkContext is (automatically) closed, at > which point the SparkContext object throws. 
[jira] [Updated] (SPARK-24523) InterruptedException when closing SparkContext
[ https://issues.apache.org/jira/browse/SPARK-24523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] peay updated SPARK-24523: - Attachment: thread-dump.log > InterruptedException when closing SparkContext > -- > > Key: SPARK-24523 > URL: https://issues.apache.org/jira/browse/SPARK-24523 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.0, 2.3.1 > Environment: EMR 5.14.0, S3/HDFS inputs and outputs; EMR 5.17 > > > >Reporter: Umayr Hassan >Priority: Major > Attachments: spark-stop-jstack.log.1, spark-stop-jstack.log.2, > spark-stop-jstack.log.3, thread-dump.log > > > I'm running a Scala application in EMR with the following properties: > {{--master yarn --deploy-mode cluster --driver-memory 13g --executor-memory > 30g --executor-cores 5 --conf spark.default.parallelism=400 --conf > spark.dynamicAllocation.enabled=true --conf > spark.dynamicAllocation.maxExecutors=20 --conf > spark.eventLog.dir=hdfs:///var/log/spark/apps --conf > spark.eventLog.enabled=true --conf > spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 --conf > spark.scheduler.listenerbus.eventqueue.capacity=2 --conf > spark.shuffle.service.enabled=true --conf spark.sql.shuffle.partitions=400 > --conf spark.yarn.maxAppAttempts=1}} > The application runs fine till SparkContext is (automatically) closed, at > which point the SparkContext object throws. 
> {{18/06/10 10:44:43 ERROR Utils: Uncaught exception in thread pool-4-thread-1 > java.lang.InterruptedException at java.lang.Object.wait(Native Method) at > java.lang.Thread.join(Thread.java:1252) at > java.lang.Thread.join(Thread.java:1326) at > org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:133) at > org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219) > at > org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) at > scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at > scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at > scala.collection.AbstractIterable.foreach(Iterable.scala:54) at > org.apache.spark.scheduler.LiveListenerBus.stop(LiveListenerBus.scala:219) at > org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1915) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357) at > org.apache.spark.SparkContext.stop(SparkContext.scala:1914) at > org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:572) > at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988) at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) > at > 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) > at scala.util.Try$.apply(Try.scala:192) at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748)}} > > I've not seen this behavior in Spark 2.0.2 and Spark 2.2.0 (for the same > application), so I'm not sure which change is causing Spark 2.3 to throw. Any > ideas? > best, > Umayr -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
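The comment above notes that manually closing the Spark context works around the InterruptedException thrown from the automatic shutdown hook. A minimal sketch of that pattern in Python (the context manager and the `FakeSession` stand-in are illustrative, not part of the report or of Spark's API):

```python
from contextlib import contextmanager

@contextmanager
def managed_spark(session):
    """Yield the session and guarantee an explicit stop() on exit, so the
    listener bus is drained before the JVM shutdown hook fires."""
    try:
        yield session
    finally:
        session.stop()

class FakeSession:
    """Illustrative stand-in for a real SparkSession; only used to show
    that stop() runs on both normal exit and failure."""
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True
```

With a real session, the job body goes inside `with managed_spark(spark): ...`, so `stop()` runs before the shutdown hook would otherwise race the listener bus.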
[jira] [Commented] (SPARK-10925) Exception when joining DataFrames
[ https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173139#comment-16173139 ] peay commented on SPARK-10925: -- Same issue on Spark 2.1.0. I have been working around this using `df.rdd.toDF()` before using `join`, but this is really far from ideal. > Exception when joining DataFrames > - > > Key: SPARK-10925 > URL: https://issues.apache.org/jira/browse/SPARK-10925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Tested with Spark 1.5.0 and Spark 1.5.1 >Reporter: Alexis Seigneurin > Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala, > TestCase.scala > > > I get an exception when joining a DataFrame with another DataFrame. The > second DataFrame was created by performing an aggregation on the first > DataFrame. > My complete workflow is: > # read the DataFrame > # apply an UDF on column "name" > # apply an UDF on column "surname" > # apply an UDF on column "birthDate" > # aggregate on "name" and re-join with the DF > # aggregate on "surname" and re-join with the DF > If I remove one step, the process completes normally. 
> Here is the exception: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in > operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS > birthDate_cleaned#8]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553) > at org.apache.spark.sql.DataFrame.join(DataFrame.scala:520) > at TestCase2$.main(TestCase2.scala:51) > at TestCase2.main(TestCase2.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:4
[jira] [Created] (SPARK-21659) FileStreamSink checks for _spark_metadata even if path has globs
peay created SPARK-21659: Summary: FileStreamSink checks for _spark_metadata even if path has globs Key: SPARK-21659 URL: https://issues.apache.org/jira/browse/SPARK-21659 Project: Spark Issue Type: Bug Components: Input/Output, SQL Affects Versions: 2.2.0 Reporter: peay Priority: Minor I am using the GCS connector for Hadoop, and reading a Dataframe using {{context.read.format("parquet").load("...")}}. When my URI has glob patterns of the form {code} gs://uri/{a,b,c} {code} or as below, Spark incorrectly assumes that it is a single file path, and produces this rather verbose exception: {code} java.net.URISyntaxException: Illegal character in path at index xx: gs://bucket-name/path/to/date=2017-0{1-29,1-30,1-31,2-01,2-02,2-03,2-04}*/_spark_metadata at java.net.URI$Parser.fail(URI.java:2848) at java.net.URI$Parser.checkChars(URI.java:3021) at java.net.URI$Parser.parseHierarchical(URI.java:3105) at java.net.URI$Parser.parse(URI.java:3053) at java.net.URI.(URI.java:588) at com.google.cloud.hadoop.gcsio.LegacyPathCodec.getPath(LegacyPathCodec.java:93) at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.getGcsPath(GoogleHadoopFileSystem.java:171) at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getFileStatus(GoogleHadoopFileSystemBase.java:1421) at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1436) at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:320) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at 
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:748) {code} I am not quite sure if the GCS connector deviates from the HCFS standard here in terms of behavior, but this makes logs really hard to read for jobs that load a bunch of files like this. https://github.com/apache/spark/blob/3ac60930865209bf804ec6506d9d8b0ddd613157/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala#L39 already has an explicit {{case Seq(singlePath) =>}}, except that it is misleading because {{singlePath}} can have wildcards. In addition, it could check for non-escaped glob characters, like {code} {, }, ?, * {code} and go to the multiple-paths case when those are present, where looking for metadata is skipped. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
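The extra check suggested above, skipping the `_spark_metadata` probe when the path contains non-escaped glob characters, can be sketched in a few lines. This is a hypothetical helper in Python for illustration; Spark's actual `FileStreamSink` code is Scala:

```python
import re

# Glob metacharacters named in the report: an unescaped {, }, ?, or *
# means the "single path" is really a pattern covering multiple paths.
_GLOB_CHARS = re.compile(r"(?<!\\)[{}?*]")

def has_unescaped_glob(path: str) -> bool:
    """Return True if `path` contains a glob character not preceded by a
    backslash, i.e. it should be treated as the multiple-paths case."""
    return _GLOB_CHARS.search(path) is not None
```

Under this check, `gs://uri/{a,b,c}` would take the multiple-paths branch and never be handed to `FileSystem.exists` as a literal URI.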
[jira] [Closed] (SPARK-21550) approxQuantiles throws "next on empty iterator" on empty data
[ https://issues.apache.org/jira/browse/SPARK-21550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] peay closed SPARK-21550. Resolution: Duplicate Fix Version/s: 2.2.0 > approxQuantiles throws "next on empty iterator" on empty data > - > > Key: SPARK-21550 > URL: https://issues.apache.org/jira/browse/SPARK-21550 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: peay > Fix For: 2.2.0 > > > The documentation says: > {code} > null and NaN values will be removed from the numerical column before > calculation. If > the dataframe is empty or the column only contains null or NaN, an empty > array is returned. > {code} > However, this small pyspark example > {code} > sql_context.range(10).filter(col("id") == 42).approxQuantile("id", [0.99], > 0.001) > {code} > throws > {code} > Py4JJavaError: An error occurred while calling o3493.approxQuantile. > : java.util.NoSuchElementException: next on empty iterator > at scala.collection.Iterator$$anon$2.next(Iterator.scala:39) > at scala.collection.Iterator$$anon$2.next(Iterator.scala:37) > at > scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63) > at scala.collection.IterableLike$class.head(IterableLike.scala:107) > at > scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:186) > at > scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126) > at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:186) > at > scala.collection.TraversableLike$class.last(TraversableLike.scala:431) > at > scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$last(ArrayOps.scala:186) > at > scala.collection.IndexedSeqOptimized$class.last(IndexedSeqOptimized.scala:132) > at scala.collection.mutable.ArrayOps$ofRef.last(ArrayOps.scala:186) > at > org.apache.spark.sql.catalyst.util.QuantileSummaries.query(QuantileSummaries.scala:207) > at > 
org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply$mcDD$sp(StatFunctions.scala:92) > at > org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(StatFunctions.scala:92) > at > org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(StatFunctions.scala:92) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21551) pyspark's collect fails when getaddrinfo is too slow
[ https://issues.apache.org/jira/browse/SPARK-21551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16103539#comment-16103539 ] peay commented on SPARK-21551: -- Sure, does 15 seconds sound good? > pyspark's collect fails when getaddrinfo is too slow > > > Key: SPARK-21551 > URL: https://issues.apache.org/jira/browse/SPARK-21551 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: peay >Priority: Critical > > Pyspark's {{RDD.collect}}, as well as {{DataFrame.toLocalIterator}} and > {{DataFrame.collect}} all work by starting an ephemeral server in the driver, > and having Python connect to it to download the data. > All three are implemented along the lines of: > {code} > port = self._jdf.collectToPython() > return list(_load_from_socket(port, BatchedSerializer(PickleSerializer( > {code} > The server has **a hardcoded timeout of 3 seconds** > (https://github.com/apache/spark/blob/e26dac5feb02033f980b1e69c9b0ff50869b6f9e/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L695) > -- i.e., the Python process has 3 seconds to connect to it from the very > moment the driver server starts. > In general, that seems fine, but I have been encountering frequent timeouts > leading to `Exception: could not open socket`. > After investigating a bit, it turns out that {{_load_from_socket}} makes a > call to {{getaddrinfo}}: > {code} > def _load_from_socket(port, serializer): > sock = None > # Support for both IPv4 and IPv6. > # On most of IPv6-ready systems, IPv6 will take precedence. > for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, > socket.SOCK_STREAM): >.. connect .. > {code} > I am not sure why, but while most such calls to {{getaddrinfo}} on my machine > only take a couple milliseconds, about 10% of them take between 2 and 10 > seconds, leading to about 10% of jobs failing. I don't think we can always > expect {{getaddrinfo}} to return instantly. 
More generally, Python may > sometimes pause for a couple seconds, which may not leave enough time for the > process to connect to the server. > Especially since the server timeout is hardcoded, I think it would be best to > set a rather generous value (15 seconds?) to avoid such situations. > A {{getaddrinfo}} specific fix could avoid doing it every single time, or do > it before starting up the driver server. > > cc SPARK-677 [~davies] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21551) pyspark's collect fails when getaddrinfo is too slow
peay created SPARK-21551: Summary: pyspark's collect fails when getaddrinfo is too slow Key: SPARK-21551 URL: https://issues.apache.org/jira/browse/SPARK-21551 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.1.0 Reporter: peay Priority: Critical Pyspark's {{RDD.collect}}, as well as {{DataFrame.toLocalIterator}} and {{DataFrame.collect}} all work by starting an ephemeral server in the driver, and having Python connect to it to download the data. All three are implemented along the lines of: {code} port = self._jdf.collectToPython() return list(_load_from_socket(port, BatchedSerializer(PickleSerializer( {code} The server has **a hardcoded timeout of 3 seconds** (https://github.com/apache/spark/blob/e26dac5feb02033f980b1e69c9b0ff50869b6f9e/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L695) -- i.e., the Python process has 3 seconds to connect to it from the very moment the driver server starts. In general, that seems fine, but I have been encountering frequent timeouts leading to `Exception: could not open socket`. After investigating a bit, it turns out that {{_load_from_socket}} makes a call to {{getaddrinfo}}: {code} def _load_from_socket(port, serializer): sock = None # Support for both IPv4 and IPv6. # On most of IPv6-ready systems, IPv6 will take precedence. for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM): .. connect .. {code} I am not sure why, but while most such calls to {{getaddrinfo}} on my machine only take a couple milliseconds, about 10% of them take between 2 and 10 seconds, leading to about 10% of jobs failing. I don't think we can always expect {{getaddrinfo}} to return instantly. More generally, Python may sometimes pause for a couple seconds, which may not leave enough time for the process to connect to the server. Especially since the server timeout is hardcoded, I think it would be best to set a rather generous value (15 seconds?) to avoid such situations. 
A {{getaddrinfo}} specific fix could avoid doing it every single time, or do it before starting up the driver server. cc SPARK-677 [~davies] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
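One way to take the slow lookup out of the timing-critical window, along the lines the report suggests, is to resolve the address once up front and only then connect, with a generous client-side timeout and a retry loop. This is a rough sketch under those assumptions, not the actual PySpark `_load_from_socket` code:

```python
import socket
import time

def connect_with_retry(host, port, timeout=15.0, attempts=3):
    """Resolve first, then connect with a timeout, retrying transient
    failures so a slow getaddrinfo call does not eat into the ephemeral
    server's accept window."""
    # Resolve before connecting: this is the call that can take seconds.
    infos = socket.getaddrinfo(host, port, socket.AF_UNSPEC, socket.SOCK_STREAM)
    last_err = None
    for _ in range(attempts):
        for family, socktype, proto, _canon, addr in infos:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            try:
                sock.connect(addr)
                return sock
            except OSError as err:
                last_err = err
                sock.close()
        time.sleep(0.1)
    raise last_err or OSError("could not open socket")
```

The 15-second default mirrors the value floated in the report; the key point is that the DNS cost is paid once, before any per-attempt timeout starts ticking.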
[jira] [Created] (SPARK-21550) approxQuantiles throws "next on empty iterator" on empty data
peay created SPARK-21550: Summary: approxQuantiles throws "next on empty iterator" on empty data Key: SPARK-21550 URL: https://issues.apache.org/jira/browse/SPARK-21550 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: peay The documentation says: {code} null and NaN values will be removed from the numerical column before calculation. If the dataframe is empty or the column only contains null or NaN, an empty array is returned. {code} However, this small pyspark example {code} sql_context.range(10).filter(col("id") == 42).approxQuantile("id", [0.99], 0.001) {code} throws {code} Py4JJavaError: An error occurred while calling o3493.approxQuantile. : java.util.NoSuchElementException: next on empty iterator at scala.collection.Iterator$$anon$2.next(Iterator.scala:39) at scala.collection.Iterator$$anon$2.next(Iterator.scala:37) at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63) at scala.collection.IterableLike$class.head(IterableLike.scala:107) at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:186) at scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126) at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:186) at scala.collection.TraversableLike$class.last(TraversableLike.scala:431) at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$last(ArrayOps.scala:186) at scala.collection.IndexedSeqOptimized$class.last(IndexedSeqOptimized.scala:132) at scala.collection.mutable.ArrayOps$ofRef.last(ArrayOps.scala:186) at org.apache.spark.sql.catalyst.util.QuantileSummaries.query(QuantileSummaries.scala:207) at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply$mcDD$sp(StatFunctions.scala:92) at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(StatFunctions.scala:92) at 
org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(StatFunctions.scala:92) {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
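The documented contract quoted in the report (null/NaN dropped, empty array returned when nothing remains) can be illustrated with a plain-Python nearest-rank sketch. This only mirrors the intended semantics; it is not Spark's `QuantileSummaries` implementation:

```python
import math

def approx_quantile(values, probabilities):
    """Return nearest-rank quantiles, or an empty list when no data
    remains after dropping None/NaN -- the behavior the docs promise,
    instead of 'next on empty iterator'."""
    clean = sorted(v for v in values if v is not None and not math.isnan(v))
    if not clean:
        return []
    n = len(clean)
    return [clean[min(n - 1, int(p * n))] for p in probabilities]
```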
[jira] [Comment Edited] (SPARK-18473) Correctness issue in INNER join result with window functions
[ https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15672245#comment-15672245 ] peay edited comment on SPARK-18473 at 11/17/16 12:55 AM: - Ok, I see, thanks. The fix is in 2.0.3 though, not 2.0.2, correct? (edit: nevermind, the fix appears to be in 2.0.2 indeed). was (Author: peay): Ok, I see, thanks. The fix is in 2.0.3 though, not 2.0.2, correct? > Correctness issue in INNER join result with window functions > > > Key: SPARK-18473 > URL: https://issues.apache.org/jira/browse/SPARK-18473 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 2.0.1 >Reporter: peay >Assignee: Xiao Li > > I have stumbled onto a corner case where an INNER join appears to return > incorrect results. I believe the join should behave as the identity, but > instead, some values are shuffled around, and some are just plain wrong. > This can be reproduced as follows: joining > {code} > +-+-+--+++--+--+ > |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId| > +-+-+--+++--+--+ > |1|1| 1| 0| 1| 0| 1| > |2|2| 0| 0| 1| 0| 1| > |1|3| 1| 0| 2| 0| 2| > +-+-+--+++--+--+ > {code} > with > {code} > +--+ > |sessId| > +--+ > | 1| > | 2| > +--+ > {code} > The result is > {code} > +--+-+-+--+++--+ > |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount| > +--+-+-+--+++--+ > | 1|2|2| 0| 0| 1| 0| > | 2|1|1| 1| 0| 1|-1| > | 2|1|3| 1| 0| 2| 0| > +--+-+-+--+++--+ > {code} > Note how two rows have a sessId of 2 (instead of one row as expected), and > how `fiftyCount` can now be negative while always zero in the original > dataframe. > The first dataframe uses two windows: > - `hasOne` uses a `window.rowsBetween(-10, 0)`. > - `hasFifty` uses a `window.rowsBetween(-10, -1)`. > The result is *correct* if: > - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of > `window.rowsBetween(-10, -1)`. 
> - I add {code}.fillna({ 'numOnesBefore': 0 }) {code} after the other call to > `fillna` -- although there are no visible effect on the dataframe as shown by > `show` as far as I can tell. > - I use a LEFT OUTER join instead of INNER JOIN. > - I write both dataframes to Parquet, read them back and join these. > This can be reproduced in pyspark using: > {code} > import pyspark.sql.functions as F > from pyspark.sql.functions import col > from pyspark.sql.window import Window > df1 = sql_context.createDataFrame( > pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]}) > ) > window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index") > df2 = ( > df1 > .withColumn("hasOne", (col("index") == 1).cast("int")) > .withColumn("hasFifty", (col("index") == 50).cast("int")) > .withColumn("numOnesBefore", > F.sum(col("hasOne")).over(window.rowsBetween(-10, 0))) > .withColumn("numFiftyStrictlyBefore", > F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1))) > .fillna({ 'numFiftyStrictlyBefore': 0 }) > .withColumn("sessId", col("numOnesBefore") - > col("numFiftyStrictlyBefore")) > ) > df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]})) > df_joined = df_selector.join(df2, "sessId", how="inner") > df2.show() > df_selector.show() > df_joined.show() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18473) Correctness issue in INNER join result with window functions
[ https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15672245#comment-15672245 ] peay commented on SPARK-18473:
--
Ok, I see, thanks. The fix is in 2.0.3 though, not 2.0.2, correct?

> Correctness issue in INNER join result with window functions
> ------------------------------------------------------------
>
> Key: SPARK-18473
> URL: https://issues.apache.org/jira/browse/SPARK-18473
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Spark Core, SQL
> Affects Versions: 2.0.1
> Reporter: peay
> Assignee: Xiao Li
>
> I have stumbled onto a corner case where an INNER join appears to return
> incorrect results. I believe the join should behave as the identity, but
> instead, some values are shuffled around, and some are just plain wrong.
> This can be reproduced as follows: joining
> {code}
> +-----+---------+------+--------+--------+----------+------+
> |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
> +-----+---------+------+--------+--------+----------+------+
> |    1|        1|     1|       0|       1|         0|     1|
> |    2|        2|     0|       0|       1|         0|     1|
> |    1|        3|     1|       0|       2|         0|     2|
> +-----+---------+------+--------+--------+----------+------+
> {code}
> with
> {code}
> +------+
> |sessId|
> +------+
> |     1|
> |     2|
> +------+
> {code}
> The result is
> {code}
> +------+-----+---------+------+--------+--------+----------+
> |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
> +------+-----+---------+------+--------+--------+----------+
> |     1|    2|        2|     0|       0|       1|         0|
> |     2|    1|        1|     1|       0|       1|        -1|
> |     2|    1|        3|     1|       0|       2|         0|
> +------+-----+---------+------+--------+--------+----------+
> {code}
> Note how two rows have a sessId of 2 (instead of one row, as expected), and
> how `fiftyCount` can now be negative, while it is always zero in the original
> dataframe.
> The first dataframe uses two windows:
> - `hasOne` uses a `window.rowsBetween(-10, 0)`.
> - `hasFifty` uses a `window.rowsBetween(-10, -1)`.
> The result is *correct* if:
> - `hasFifty` is changed to use `window.rowsBetween(-10, 0)` instead of
> `window.rowsBetween(-10, -1)`.
> - I add {code}.fillna({ 'numOnesBefore': 0 }){code} after the other call to
> `fillna` -- although this has no visible effect on the dataframe as shown by
> `show`, as far as I can tell.
> - I use a LEFT OUTER join instead of an INNER join.
> - I write both dataframes to Parquet, read them back, and join these.
> This can be reproduced in pyspark using:
> {code:python}
> import pandas as pd
> import pyspark.sql.functions as F
> from pyspark.sql.functions import col
> from pyspark.sql.window import Window
>
> df1 = sql_context.createDataFrame(
>     pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
> )
> window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")
> df2 = (
>     df1
>     .withColumn("hasOne", (col("index") == 1).cast("int"))
>     .withColumn("hasFifty", (col("index") == 50).cast("int"))
>     .withColumn("numOnesBefore",
>                 F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
>     .withColumn("numFiftyStrictlyBefore",
>                 F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
>     .fillna({'numFiftyStrictlyBefore': 0})
>     .withColumn("sessId", col("numOnesBefore") - col("numFiftyStrictlyBefore"))
> )
> df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
> df_joined = df_selector.join(df2, "sessId", how="inner")
>
> df2.show()
> df_selector.show()
> df_joined.show()
> {code}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
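The expected values of `df2` can be cross-checked outside Spark. The following is an editor's sketch in plain pandas (not part of the original report): `rolling(window=11, min_periods=1)` stands in for `rowsBetween(-10, 0)`, and subtracting the current row's value from the inclusive rolling sum gives the strictly-before frame `rowsBetween(-10, -1)`. It confirms that the three rows should carry `sessId` values 1, 1, 2, so an inner join against `{1, 2}` should indeed act as the identity.

```python
import pandas as pd

# Rebuild df2's derived columns in pandas, mirroring the Spark windows.
df = pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
df = df.sort_values(["timeStamp", "index"]).reset_index(drop=True)

df["hasOne"] = (df["index"] == 1).astype(int)
df["hasFifty"] = (df["index"] == 50).astype(int)

# rowsBetween(-10, 0): current row plus up to 10 preceding rows.
df["numOnesBefore"] = (
    df["hasOne"].rolling(window=11, min_periods=1).sum().astype(int)
)
# rowsBetween(-10, -1): inclusive rolling sum minus the current row's value.
df["numFiftyStrictlyBefore"] = (
    df["hasFifty"].rolling(window=11, min_periods=1).sum() - df["hasFifty"]
).astype(int)

df["sessId"] = df["numOnesBefore"] - df["numFiftyStrictlyBefore"]
print(df[["index", "timeStamp", "sessId"]])
```

This matches the first table in the report: no row ever has a negative strictly-before count, and exactly one row gets `sessId` 2.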
[jira] [Commented] (SPARK-18473) Correctness issue in INNER join result with window functions
[ https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15671624#comment-15671624 ] peay commented on SPARK-18473:
--
Ah, great, thanks. I had checked out the CHANGELOG but couldn't find anything relevant; any reference to the corresponding issue?
[jira] [Updated] (SPARK-18473) Correctness issue in INNER join result with window functions
[ https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] peay updated SPARK-18473:
- Description: edited
[jira] [Created] (SPARK-18473) Correctness issue in INNER join result with window functions
peay created SPARK-18473:
Summary: Correctness issue in INNER join result with window functions
Key: SPARK-18473
URL: https://issues.apache.org/jira/browse/SPARK-18473
Project: Spark
Issue Type: Bug
Components: PySpark, Spark Core, SQL
Affects Versions: 2.0.1
Reporter: peay