[jira] [Commented] (SPARK-27039) toPandas with Arrow swallows maxResultSize errors

2020-04-23 Thread peay (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090352#comment-17090352
 ] 

peay commented on SPARK-27039:
--

[~hyukjin.kwon] do you know if this was eventually backported to 2.4.x? This 
was more than a year ago, and I suspect 3.0 is still some way off (taking into 
account the official release and our own deployment), while this is a rather 
sneaky correctness bug.

> toPandas with Arrow swallows maxResultSize errors
> -
>
> Key: SPARK-27039
> URL: https://issues.apache.org/jira/browse/SPARK-27039
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: peay
>Priority: Minor
>
> I am running the following simple {{toPandas}} call with {{maxResultSize}} set 
> to 1 MB:
> {code:python}
> import pyspark.sql.functions as F
> df = spark.range(1000 * 1000)
> df_pd = df.withColumn("test", F.lit("this is a long string that should make 
> the resulting dataframe too large for maxResult which is 1m")).toPandas()
> {code}
>  
> With {{spark.sql.execution.arrow.enabled}} set to {{true}}, this returns an 
> empty Pandas dataframe without any error:
> {code:python}
> df_pd.info()
> # <class 'pandas.core.frame.DataFrame'>
> # Index: 0 entries
> # Data columns (total 2 columns):
> # id  0 non-null object
> # test0 non-null object
> # dtypes: object(2)
> # memory usage: 0.0+ bytes
> {code}
> The driver stderr does have an error, and so does the Spark UI:
> {code:java}
> ERROR TaskSetManager: Total size of serialized results of 1 tasks (52.8 MB) 
> is bigger than spark.driver.maxResultSize (1024.0 KB)
> ERROR TaskSetManager: Total size of serialized results of 2 tasks (105.7 MB) 
> is bigger than spark.driver.maxResultSize (1024.0 KB)
> Exception in thread "serve-Arrow" org.apache.spark.SparkException: Job 
> aborted due to stage failure: Total size of serialized results of 1 tasks 
> (52.8 MB) is bigger than spark.driver.maxResultSize (1024.0 KB)
>  at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2039)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2027)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2026)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>  at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2026)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
>  at scala.Option.foreach(Option.scala:257)
>  at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:966)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2260)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2209)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2198)
>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:777)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
>  at 
> org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3313)
>  at 
> org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3282)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply$mcV$sp(PythonRDD.scala:435)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:436)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:432)
>  at org.apache.spark.api.python.PythonServer$$anon$1.run(PythonRDD.scala:862)
> {code}
> With {{spark.sql.execution.arrow.enabled}} set to {{false}}, the Python call 
> to {{toPandas}} does fail as expected.
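For reference, a minimal sketch of the runtime toggle implied by the description above (disabling Arrow so that {{toPandas}} raises instead of silently returning an empty DataFrame; assumes the {{spark}} session and the {{df}} from the reproduction):
{code:python}
# Fall back to the non-Arrow toPandas path, which raises the maxResultSize
# error instead of returning an empty DataFrame.
spark.conf.set("spark.sql.execution.arrow.enabled", "false")

df_pd = df.toPandas()  # now fails loudly when the result exceeds spark.driver.maxResultSize
{code}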






[jira] [Commented] (SPARK-27039) toPandas with Arrow swallows maxResultSize errors

2019-03-07 Thread peay (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16786658#comment-16786658
 ] 

peay commented on SPARK-27039:
--

For reference, I've realized you can also get an incomplete but non-empty 
pandas DataFrame in 2.4.

This is more worrying because it is harder to detect (short of inspecting the 
driver logs manually), and it can lead to serious correctness issues. You can run 
a {{count()}} and compare it to the size of the {{toPandas}} output (as sketched 
below), but that is potentially quite inefficient.

Since this is fixed in master, I haven't tried to build a minimal reproduction, 
but I believe that with unbalanced partitions, the driver can finish consuming 
the iterator for a partition that is small enough before the Spark job actually 
fails on the larger partitions.
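A minimal sketch of that sanity check (assuming an existing {{spark}} session; the extra {{count()}} triggers a separate Spark job, which is why it is potentially expensive):
{code:python}
import pyspark.sql.functions as F

df = spark.range(1000 * 1000).withColumn(
    "test", F.lit("some long string that makes the result exceed maxResultSize"))

expected_rows = df.count()   # separate Spark job, hence the extra cost
df_pd = df.toPandas()        # on 2.4 with Arrow, this may silently return fewer rows

if len(df_pd) != expected_rows:
    raise RuntimeError(
        "toPandas() returned %d rows but count() reported %d; the result was likely "
        "truncated, check the driver logs for spark.driver.maxResultSize errors"
        % (len(df_pd), expected_rows))
{code}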


[jira] [Commented] (SPARK-24624) Can not mix vectorized and non-vectorized UDFs

2019-03-06 Thread peay (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16786456#comment-16786456
 ] 

peay commented on SPARK-24624:
--

I mean mixing regular aggregation functions and Pandas UDF aggregation functions, 
i.e., expressions of the form {{.groupBy(key).agg(F.avg("col"), 
pd_agg_udf("col2"))}} (see the sketch below).

{{master}} still seems to require aggregation expressions to be either all regular 
aggregate functions or all Pandas UDFs: 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L467],
 right?
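To make that concrete, a small sketch (the data and the grouped-aggregate Pandas UDF are made up for illustration): the commented-out mixed {{agg}} call is the unsupported case, and computing the two aggregations separately and joining them is one possible workaround:
{code:python}
import pyspark.sql.functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame(
    [("a", 1.0, 2.0), ("a", 3.0, 4.0), ("b", 5.0, 6.0)],
    ["key", "col", "col2"])

@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def pd_agg_udf(v):
    # toy grouped-aggregate Pandas UDF: mean of a pandas Series
    return v.mean()

# Unsupported: mixing a built-in aggregate and a Pandas UDF aggregate in one agg()
# df.groupBy("key").agg(F.avg("col"), pd_agg_udf("col2"))

# Possible workaround: aggregate separately, then join on the grouping key
regular = df.groupBy("key").agg(F.avg("col").alias("avg_col"))
vectorized = df.groupBy("key").agg(pd_agg_udf("col2").alias("agg_col2"))
regular.join(vectorized, "key").show()
{code}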


[jira] [Commented] (SPARK-24624) Can not mix vectorized and non-vectorized UDFs

2019-03-06 Thread peay (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16785920#comment-16785920
 ] 

peay commented on SPARK-24624:
--

Are there plans to support something similar for aggregation functions?

> Can not mix vectorized and non-vectorized UDFs
> --
>
> Key: SPARK-24624
> URL: https://issues.apache.org/jira/browse/SPARK-24624
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Assignee: Li Jin
>Priority: Major
> Fix For: 2.4.0
>
>
> In the current implementation, we have a limitation: users are unable to mix 
> vectorized and non-vectorized UDFs in the same Project. This becomes worse since 
> our optimizer can combine consecutive Projects into a single one. For 
> example, 
> {code}
> applied_df = df.withColumn('regular', my_regular_udf('total', 
> 'qty')).withColumn('pandas', my_pandas_udf('total', 'qty'))
> {code}
> Returns the following error. 
> {code}
> IllegalArgumentException: Can not mix vectorized and non-vectorized UDFs
> java.lang.IllegalArgumentException: Can not mix vectorized and non-vectorized 
> UDFs
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$6.apply(ExtractPythonUDFs.scala:170)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$6.apply(ExtractPythonUDFs.scala:146)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at scala.collection.immutable.List.foreach(List.scala:381)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>  at scala.collection.immutable.List.map(List.scala:285)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:146)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:118)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$6.apply(TreeNode.scala:312)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$6.apply(TreeNode.scala:312)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$8.apply(TreeNode.scala:331)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:329)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$8.apply(TreeNode.scala:331)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:329)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:114)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:94)
>  at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:113)
>  at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:113)
>  at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>  at scala.collection.immutable.List.foldLeft(List.scala:84)
>  at 
> org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:113)
>  at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:100)
>  at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:99)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3312)
>  at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:2750)
>  ...
> {code}




[jira] [Commented] (SPARK-27019) Spark UI's SQL tab shows inconsistent values

2019-03-05 Thread peay (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16784260#comment-16784260
 ] 

peay commented on SPARK-27019:
--

Yes, I had edited my message above shortly after posting: I cannot reproduce the 
second scenario. Thanks!

> Spark UI's SQL tab shows inconsistent values
> 
>
> Key: SPARK-27019
> URL: https://issues.apache.org/jira/browse/SPARK-27019
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.4.0
>Reporter: peay
>Priority: Major
> Attachments: Screenshot from 2019-03-01 21-31-48.png, 
> application_1550040445209_4748, query-1-details.png, query-1-list.png, 
> query-job-1.png, screenshot-spark-ui-details.png, screenshot-spark-ui-list.png
>
>
> Since 2.4.0, I am frequently seeing broken output in the SQL tab of the 
> Spark UI: the submitted time and duration make no sense, and the description 
> shows the query ID instead of the actual description.
> Clicking on the link to open a query, the SQL plan is missing as well.
> I have tried increasing {{spark.scheduler.listenerbus.eventqueue.capacity}} to 
> very large values like 30k out of paranoia that we may have too many events, 
> but to no avail. I have not identified anything in particular that leads to 
> this: it doesn't occur in all my jobs, but it still occurs in a lot of them.
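For reference, a minimal sketch of how that listener-bus capacity can be bumped when creating the session (the value is arbitrary; equivalent {{--conf}} flags on spark-submit work as well):
{code:python}
from pyspark.sql import SparkSession

# Hypothetical session setup; 30000 mirrors the value mentioned above.
spark = (SparkSession.builder
         .appName("sql-tab-repro")
         .config("spark.scheduler.listenerbus.eventqueue.capacity", "30000")
         .getOrCreate())
{code}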






[jira] [Comment Edited] (SPARK-27019) Spark UI's SQL tab shows inconsistent values

2019-03-04 Thread peay (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16784172#comment-16784172
 ] 

peay edited comment on SPARK-27019 at 3/5/19 7:28 AM:
--

Great! 

-Is that compatible with my second observation above? (I tested without any 
executors, and even without any task starting, the SQL tab had the wrong 
output). I can try to get an event log for that as well if that's helpful.- 
edit: I tried to reproduce that to export the event log, and could not. Seems 
like your patch should address the issue.


was (Author: peay):
Great! Is that compatible with my second observation above? (I tested without 
any executors, and even without any task starting, the SQL tab had the wrong 
output). I can try to get an event log for that as well if that's helpful.







[jira] [Commented] (SPARK-27019) Spark UI's SQL tab shows inconsistent values

2019-03-04 Thread peay (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16784172#comment-16784172
 ] 

peay commented on SPARK-27019:
--

Great! Is that compatible with my second observation above? (I tested without 
any executors, and even without any task starting, the SQL tab had the wrong 
output). I can try to get an event log for that as well if that's helpful.







[jira] [Commented] (SPARK-27039) toPandas with Arrow swallows maxResultSize errors

2019-03-04 Thread peay (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16783711#comment-16783711
 ] 

peay commented on SPARK-27039:
--

Interesting, thanks for checking. Yes, I can definitely live without that until 
3.0. Is there already a timeline for 3.0?







[jira] [Commented] (SPARK-27039) toPandas with Arrow swallows maxResultSize errors

2019-03-04 Thread peay (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16783352#comment-16783352
 ] 

peay commented on SPARK-27039:
--

Oops, sorry, I've edited the title. I meant _Arrow_, not Avro.

Maybe [~bryanc] would know something about this?







[jira] [Updated] (SPARK-27039) toPandas with Arrow swallows maxResultSize errors

2019-03-04 Thread peay (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

peay updated SPARK-27039:
-
Summary: toPandas with Arrow swallows maxResultSize errors  (was: toPandas 
with Avro swallows maxResultSize errors)







[jira] [Created] (SPARK-27039) toPandas with Avro swallows maxResultSize errors

2019-03-04 Thread peay (JIRA)
peay created SPARK-27039:


 Summary: toPandas with Avro swallows maxResultSize errors
 Key: SPARK-27039
 URL: https://issues.apache.org/jira/browse/SPARK-27039
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.0
Reporter: peay


I am running the following simple {{toPandas}} call with {{maxResultSize}} set to 1 MB:
{code:python}
import pyspark.sql.functions as F
df = spark.range(1000 * 1000)
df_pd = df.withColumn("test", F.lit("this is a long string that should make the 
resulting dataframe too large for maxResult which is 1m")).toPandas()
{code}
 
With {{spark.sql.execution.arrow.enabled}} set to {{true}}, this returns an 
empty Pandas dataframe without any error:

{code:python}
df_pd.info()

# <class 'pandas.core.frame.DataFrame'>
# Index: 0 entries
# Data columns (total 2 columns):
# id  0 non-null object
# test0 non-null object
# dtypes: object(2)
# memory usage: 0.0+ bytes
{code}

The driver stderr does have an error, and so does the Spark UI:
{code:java}
ERROR TaskSetManager: Total size of serialized results of 1 tasks (52.8 MB) is 
bigger than spark.driver.maxResultSize (1024.0 KB)
ERROR TaskSetManager: Total size of serialized results of 2 tasks (105.7 MB) is 
bigger than spark.driver.maxResultSize (1024.0 KB)

Exception in thread "serve-Arrow" org.apache.spark.SparkException: Job aborted 
due to stage failure: Total size of serialized results of 1 tasks (52.8 MB) is 
bigger than spark.driver.maxResultSize (1024.0 KB)
 at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2039)
 at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2027)
 at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2026)
 at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
 at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2026)
 at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
 at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
 at scala.Option.foreach(Option.scala:257)
 at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:966)
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2260)
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2209)
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2198)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
 at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:777)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
 at 
org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3313)
 at 
org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3282)
 at 
org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply$mcV$sp(PythonRDD.scala:435)
 at 
org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435)
 at 
org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
 at org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:436)
 at org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:432)
 at org.apache.spark.api.python.PythonServer$$anon$1.run(PythonRDD.scala:862)
{code}

With {{spark.sql.execution.arrow.enabled}} set to {{false}}, the Python call to 
{{toPandas}} does fail as expected.
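For completeness, a hedged sketch of how the two settings used in this reproduction can be applied when creating the session (the app name is made up; equivalent {{--conf}} flags on spark-submit work as well):
{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("topandas-arrow-repro")
         # tiny cap so the reproduction trips maxResultSize quickly
         .config("spark.driver.maxResultSize", "1m")
         # Arrow-based toPandas, the code path that swallows the error
         .config("spark.sql.execution.arrow.enabled", "true")
         .getOrCreate())
{code}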






[jira] [Updated] (SPARK-27019) Spark UI's SQL tab shows inconsistent values

2019-03-01 Thread peay (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

peay updated SPARK-27019:
-
Attachment: application_1550040445209_4748







[jira] [Commented] (SPARK-27019) Spark UI's SQL tab shows inconsistent values

2019-03-01 Thread peay (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781718#comment-16781718
 ] 

peay commented on SPARK-27019:
--

Attached: [^application_1550040445209_4748]







[jira] [Comment Edited] (SPARK-27019) Spark UI's SQL tab shows inconsistent values

2019-03-01 Thread peay (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781651#comment-16781651
 ] 

peay edited comment on SPARK-27019 at 3/1/19 1:06 PM:
--

Sure does.

I've done one more test, using no executors and running the same command:
{code:python}
spark.range(1024 * 1024 * 1024 * 10).toPandas()
{code}

 The SQL tab directly shows the wrong output, even though no task has started - 
i.e., my assumption above that this was because of failing tasks appears wrong. 
This actually seems consistent with the 'submitted date' somehow being wrong.


was (Author: peay):
Sure does.

I've done one more test, using no executors and running the same command:
spark.range(1024 * 1024 * 1024 * 10).toPandas()
The SQL tab directly shows the wrong output, even though no task has started - 
i.e., my assumption above that this was because of failing tasks appears wrong. 
This actually seems consistent with the 'submitted date' somehow being wrong.







[jira] [Commented] (SPARK-27019) Spark UI's SQL tab shows inconsistent values

2019-03-01 Thread peay (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781651#comment-16781651
 ] 

peay commented on SPARK-27019:
--

Sure does.

I've done one more test, using no executors and running the same command:
spark.range(1024 * 1024 * 1024 * 10).toPandas()
The SQL tab directly shows the wrong output, even though no task has started - 
i.e., my assumption above that this was because of failing tasks appears wrong. 
This actually seems consistent with the 'submitted date' somehow being wrong.







[jira] [Updated] (SPARK-27019) Spark UI's SQL tab shows inconsistent values

2019-03-01 Thread peay (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

peay updated SPARK-27019:
-
Attachment: query-job-1.png
query-1-details.png
query-1-list.png







[jira] [Commented] (SPARK-27019) Spark UI's SQL tab shows inconsistent values

2019-03-01 Thread peay (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781604#comment-16781604
 ] 

peay commented on SPARK-27019:
--

OK, I can actually reproduce it pretty easily with PySpark:
{code:python}
df_test = spark.range(1024 * 1024 * 1024 * 10).toPandas()
{code}
This makes the tasks fail because my executors don't have enough memory, which 
seems to be key to hitting the issue. With only 1000 elements, the job succeeds 
and the issue is not triggered.

!query-1-list.png!

 

!query-1-details.png!

!query-job-1.png!







[jira] [Commented] (SPARK-27019) Spark UI's SQL tab shows inconsistent values

2019-03-01 Thread peay (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781595#comment-16781595
 ] 

peay commented on SPARK-27019:
--

I don't really have a minimal example: this involves a bunch of Python jobs and/or 
interactive work in notebooks. Is there a way I could collect debug information 
to help troubleshoot this?







[jira] [Comment Edited] (SPARK-27019) Spark UI's SQL tab shows inconsistent values

2019-03-01 Thread peay (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781595#comment-16781595
 ] 

peay edited comment on SPARK-27019 at 3/1/19 11:40 AM:
---

I don't really have a minimal example: this involves a bunch of Python jobs and/or 
interactive work in notebooks. Is there a way I could collect debug information 
to help troubleshoot this?


was (Author: peay):
I don't really have a minimal example - this uses a bunch of python jobs and/or 
interactive works in notebook. Is there a way I could collect debug information 
to help troubleshooting this?







[jira] [Commented] (SPARK-27019) Spark UI's SQL tab shows inconsistent values

2019-03-01 Thread peay (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781583#comment-16781583
 ] 

peay commented on SPARK-27019:
--

I am also seeing this both in the Spark UI for live jobs and when accessing past 
jobs through the Spark History Server.
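
For reference, both the live UI and the History Server are fed by the same listener 
event stream; a minimal sketch of the event-log settings involved (values are 
illustrative only, not our actual configuration):

{code}
# Illustrative only: event logging feeds the History Server with the same
# listener events that drive the live Spark UI.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("event-log-example")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "hdfs:///var/log/spark/apps")
    .getOrCreate()
)
{code}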

> Spark UI's SQL tab shows inconsistent values
> 
>
> Key: SPARK-27019
> URL: https://issues.apache.org/jira/browse/SPARK-27019
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.4.0
>Reporter: peay
>Priority: Major
> Attachments: screenshot-spark-ui-details.png, 
> screenshot-spark-ui-list.png
>
>
> Since 2.4.0, I am frequently seeing broken outputs in the SQL tab of the 
> Spark UI, where submitted/duration make no sense, description has the ID 
> instead of the actual description.
> Clicking on the link to open a query, the SQL plan is missing as well.
> I have tried to increase `spark.scheduler.listenerbus.eventqueue.capacity` to 
> very large values like 30k out of paranoia that we may have too many events, 
> but to no avail. I have not identified anything particular that leads to 
> that: it doesn't occur in all my jobs, but it does occur in a lot of them 
> still.
>  
> !image-2019-03-01-12-05-40-119.png!
> !image-2019-03-01-12-06-48-707.png!






[jira] [Updated] (SPARK-27019) Spark UI's SQL tab shows inconsistent values

2019-03-01 Thread peay (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

peay updated SPARK-27019:
-
Description: 
Since 2.4.0, I am frequently seeing broken output in the SQL tab of the Spark 
UI: the submitted time and duration make no sense, and the description shows the 
query ID instead of the actual description.

When clicking the link to open a query, the SQL plan is missing as well.

I have tried increasing `spark.scheduler.listenerbus.eventqueue.capacity` to 
very large values like 30k, out of paranoia that we may have too many events, 
but to no avail. I have not identified anything in particular that triggers this: 
it doesn't occur in all my jobs, but it does still occur in a lot of them.

  was:
Since 2.4.0, I am frequently seeing broken outputs in the SQL tab of the Spark 
UI, where submitted/duration make no sense, description has the ID instead of 
the actual description.

Clicking on the link to open a query, the SQL plan is missing as well.

I have tried to increase `spark.scheduler.listenerbus.eventqueue.capacity` to 
very large values like 30k out of paranoia that we may have too many events, 
but to no avail. I have not identified anything particular that leads to that: 
it doesn't occur in all my jobs, but it does occur in a lot of them still.

 

!image-2019-03-01-12-05-40-119.png!

!image-2019-03-01-12-06-48-707.png!


> Spark UI's SQL tab shows inconsistent values
> 
>
> Key: SPARK-27019
> URL: https://issues.apache.org/jira/browse/SPARK-27019
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.4.0
>Reporter: peay
>Priority: Major
> Attachments: screenshot-spark-ui-details.png, 
> screenshot-spark-ui-list.png
>
>
> Since 2.4.0, I am frequently seeing broken outputs in the SQL tab of the 
> Spark UI, where submitted/duration make no sense, description has the ID 
> instead of the actual description.
> Clicking on the link to open a query, the SQL plan is missing as well.
> I have tried to increase `spark.scheduler.listenerbus.eventqueue.capacity` to 
> very large values like 30k out of paranoia that we may have too many events, 
> but to no avail. I have not identified anything particular that leads to 
> that: it doesn't occur in all my jobs, but it does occur in a lot of them 
> still.






[jira] [Comment Edited] (SPARK-27019) Spark UI's SQL tab shows inconsistent values

2019-03-01 Thread peay (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781581#comment-16781581
 ] 

peay edited comment on SPARK-27019 at 3/1/19 11:10 AM:
---

Seems like the screenshots did not embed, attaching them instead.


was (Author: peay):
Seems like the screenshot did not embed, attaching them instead.

> Spark UI's SQL tab shows inconsistent values
> 
>
> Key: SPARK-27019
> URL: https://issues.apache.org/jira/browse/SPARK-27019
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.4.0
>Reporter: peay
>Priority: Major
> Attachments: screenshot-spark-ui-details.png, 
> screenshot-spark-ui-list.png
>
>
> Since 2.4.0, I am frequently seeing broken outputs in the SQL tab of the 
> Spark UI, where submitted/duration make no sense, description has the ID 
> instead of the actual description.
> Clicking on the link to open a query, the SQL plan is missing as well.
> I have tried to increase `spark.scheduler.listenerbus.eventqueue.capacity` to 
> very large values like 30k out of paranoia that we may have too many events, 
> but to no avail. I have not identified anything particular that leads to 
> that: it doesn't occur in all my jobs, but it does occur in a lot of them 
> still.
>  
> !image-2019-03-01-12-05-40-119.png!
> !image-2019-03-01-12-06-48-707.png!






[jira] [Updated] (SPARK-27019) Spark UI's SQL tab shows inconsistent values

2019-03-01 Thread peay (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

peay updated SPARK-27019:
-
Attachment: screenshot-spark-ui-list.png
screenshot-spark-ui-details.png

> Spark UI's SQL tab shows inconsistent values
> 
>
> Key: SPARK-27019
> URL: https://issues.apache.org/jira/browse/SPARK-27019
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.4.0
>Reporter: peay
>Priority: Major
> Attachments: screenshot-spark-ui-details.png, 
> screenshot-spark-ui-list.png
>
>
> Since 2.4.0, I am frequently seeing broken outputs in the SQL tab of the 
> Spark UI, where submitted/duration make no sense, description has the ID 
> instead of the actual description.
> Clicking on the link to open a query, the SQL plan is missing as well.
> I have tried to increase `spark.scheduler.listenerbus.eventqueue.capacity` to 
> very large values like 30k out of paranoia that we may have too many events, 
> but to no avail. I have not identified anything particular that leads to 
> that: it doesn't occur in all my jobs, but it does occur in a lot of them 
> still.
>  
> !image-2019-03-01-12-05-40-119.png!
> !image-2019-03-01-12-06-48-707.png!






[jira] [Commented] (SPARK-27019) Spark UI's SQL tab shows inconsistent values

2019-03-01 Thread peay (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781581#comment-16781581
 ] 

peay commented on SPARK-27019:
--

Seems like the screenshot did not embed, attaching them instead.

> Spark UI's SQL tab shows inconsistent values
> 
>
> Key: SPARK-27019
> URL: https://issues.apache.org/jira/browse/SPARK-27019
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.4.0
>Reporter: peay
>Priority: Major
> Attachments: screenshot-spark-ui-details.png, 
> screenshot-spark-ui-list.png
>
>
> Since 2.4.0, I am frequently seeing broken outputs in the SQL tab of the 
> Spark UI, where submitted/duration make no sense, description has the ID 
> instead of the actual description.
> Clicking on the link to open a query, the SQL plan is missing as well.
> I have tried to increase `spark.scheduler.listenerbus.eventqueue.capacity` to 
> very large values like 30k out of paranoia that we may have too many events, 
> but to no avail. I have not identified anything particular that leads to 
> that: it doesn't occur in all my jobs, but it does occur in a lot of them 
> still.
>  
> !image-2019-03-01-12-05-40-119.png!
> !image-2019-03-01-12-06-48-707.png!






[jira] [Created] (SPARK-27019) Spark UI's SQL tab shows inconsistent values

2019-03-01 Thread peay (JIRA)
peay created SPARK-27019:


 Summary: Spark UI's SQL tab shows inconsistent values
 Key: SPARK-27019
 URL: https://issues.apache.org/jira/browse/SPARK-27019
 Project: Spark
  Issue Type: Bug
  Components: SQL, Web UI
Affects Versions: 2.4.0
Reporter: peay


Since 2.4.0, I am frequently seeing broken output in the SQL tab of the Spark 
UI: the submitted time and duration make no sense, and the description shows the 
query ID instead of the actual description.

When clicking the link to open a query, the SQL plan is missing as well.

I have tried increasing `spark.scheduler.listenerbus.eventqueue.capacity` to 
very large values like 30k, out of paranoia that we may have too many events, 
but to no avail. I have not identified anything in particular that triggers this: 
it doesn't occur in all my jobs, but it does still occur in a lot of them.
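
For illustration, a minimal sketch of how this capacity can be raised, either at 
submit time or when building the session (the value below is arbitrary):

{code}
# Illustrative only: raising the listener bus event queue capacity.
# Equivalent spark-submit form:
#   spark-submit --conf spark.scheduler.listenerbus.eventqueue.capacity=30000 app.py
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("listener-queue-capacity-example")
    .config("spark.scheduler.listenerbus.eventqueue.capacity", "30000")
    .getOrCreate()
)
{code}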

 

!image-2019-03-01-12-05-40-119.png!

!image-2019-03-01-12-06-48-707.png!






[jira] [Commented] (SPARK-24523) InterruptedException when closing SparkContext

2018-10-10 Thread peay (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16644602#comment-16644602
 ] 

peay commented on SPARK-24523:
--

I've started hitting the exact same error since I upgraded to EMR 5.17. I had been 
running Spark 2.3.1 for a while before that without issues, though, so I still 
suspect something EMR-specific. In spite of this, AWS premium support has not 
been very helpful and mostly blames Spark itself.

This occurs only for one particular job, which has a rather large computation 
graph with a union of a couple dozen datasets. For this job, it occurs 100% 
of the time. I don't see any warnings about events being dropped, but it 
seems the OP managed to get rid of those warnings while the original issue 
still persisted, which seems to be our case as well.

Manually closing the Spark context does seem to work for us. Adding more driver 
cores did not help. Attaching a [^thread-dump.log] from slightly before the 
crash in case that's useful.
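
For context, a minimal sketch of what "manually closing" amounts to on our side 
({{run_job}} is a placeholder, not code from the actual job):

{code}
# Sketch only: stop the context explicitly instead of relying on the shutdown hook.
from pyspark.sql import SparkSession

def run_job(session):
    # placeholder for the actual application logic
    session.range(10).count()

spark = SparkSession.builder.appName("manual-stop-example").getOrCreate()
try:
    run_job(spark)
finally:
    spark.stop()  # stops the SparkContext (and the listener bus) before JVM shutdown
{code}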

> InterruptedException when closing SparkContext
> --
>
> Key: SPARK-24523
> URL: https://issues.apache.org/jira/browse/SPARK-24523
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.0, 2.3.1
> Environment: EMR 5.14.0, S3/HDFS inputs and outputs; EMR 5.17
>  
>  
>  
>Reporter: Umayr Hassan
>Priority: Major
> Attachments: spark-stop-jstack.log.1, spark-stop-jstack.log.2, 
> spark-stop-jstack.log.3, thread-dump.log
>
>
> I'm running a Scala application in EMR with the following properties:
> {{--master yarn --deploy-mode cluster --driver-memory 13g --executor-memory 
> 30g --executor-cores 5 --conf spark.default.parallelism=400 --conf 
> spark.dynamicAllocation.enabled=true --conf 
> spark.dynamicAllocation.maxExecutors=20 --conf 
> spark.eventLog.dir=hdfs:///var/log/spark/apps --conf 
> spark.eventLog.enabled=true --conf 
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 --conf 
> spark.scheduler.listenerbus.eventqueue.capacity=2 --conf 
> spark.shuffle.service.enabled=true --conf spark.sql.shuffle.partitions=400 
> --conf spark.yarn.maxAppAttempts=1}}
> The application runs fine till SparkContext is (automatically) closed, at 
> which point the SparkContext object throws. 
> {{18/06/10 10:44:43 ERROR Utils: Uncaught exception in thread pool-4-thread-1 
> java.lang.InterruptedException at java.lang.Object.wait(Native Method) at 
> java.lang.Thread.join(Thread.java:1252) at 
> java.lang.Thread.join(Thread.java:1326) at 
> org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:133) at 
> org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>  at 
> org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893) at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at 
> scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at 
> scala.collection.AbstractIterable.foreach(Iterable.scala:54) at 
> org.apache.spark.scheduler.LiveListenerBus.stop(LiveListenerBus.scala:219) at 
> org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1915)
>  at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357) at 
> org.apache.spark.SparkContext.stop(SparkContext.scala:1914) at 
> org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:572) 
> at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) 
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988) at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>  at scala.util.Try$.apply(Try.scala:192) at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(Thr

[jira] [Updated] (SPARK-24523) InterruptedException when closing SparkContext

2018-10-10 Thread peay (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

peay updated SPARK-24523:
-
Attachment: thread-dump.log

> InterruptedException when closing SparkContext
> --
>
> Key: SPARK-24523
> URL: https://issues.apache.org/jira/browse/SPARK-24523
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.0, 2.3.1
> Environment: EMR 5.14.0, S3/HDFS inputs and outputs; EMR 5.17
>  
>  
>  
>Reporter: Umayr Hassan
>Priority: Major
> Attachments: spark-stop-jstack.log.1, spark-stop-jstack.log.2, 
> spark-stop-jstack.log.3, thread-dump.log
>
>
> I'm running a Scala application in EMR with the following properties:
> {{--master yarn --deploy-mode cluster --driver-memory 13g --executor-memory 
> 30g --executor-cores 5 --conf spark.default.parallelism=400 --conf 
> spark.dynamicAllocation.enabled=true --conf 
> spark.dynamicAllocation.maxExecutors=20 --conf 
> spark.eventLog.dir=hdfs:///var/log/spark/apps --conf 
> spark.eventLog.enabled=true --conf 
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 --conf 
> spark.scheduler.listenerbus.eventqueue.capacity=2 --conf 
> spark.shuffle.service.enabled=true --conf spark.sql.shuffle.partitions=400 
> --conf spark.yarn.maxAppAttempts=1}}
> The application runs fine till SparkContext is (automatically) closed, at 
> which point the SparkContext object throws. 
> {{18/06/10 10:44:43 ERROR Utils: Uncaught exception in thread pool-4-thread-1 
> java.lang.InterruptedException at java.lang.Object.wait(Native Method) at 
> java.lang.Thread.join(Thread.java:1252) at 
> java.lang.Thread.join(Thread.java:1326) at 
> org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:133) at 
> org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>  at 
> org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893) at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at 
> scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at 
> scala.collection.AbstractIterable.foreach(Iterable.scala:54) at 
> org.apache.spark.scheduler.LiveListenerBus.stop(LiveListenerBus.scala:219) at 
> org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1915)
>  at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357) at 
> org.apache.spark.SparkContext.stop(SparkContext.scala:1914) at 
> org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:572) 
> at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) 
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988) at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>  at scala.util.Try$.apply(Try.scala:192) at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)}}
>  
> I've not seen this behavior in Spark 2.0.2 and Spark 2.2.0 (for the same 
> application), so I'm not sure which change is causing Spark 2.3 to throw. Any 
> ideas?
> best,
> Umayr






[jira] [Commented] (SPARK-10925) Exception when joining DataFrames

2017-09-20 Thread peay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173139#comment-16173139
 ] 

peay commented on SPARK-10925:
--

Same issue on Spark 2.1.0. I have been working around this using 
`df.rdd.toDF()` before using `join`, but this is really far from ideal.
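
For reference, a sketch of that workaround, assuming two DataFrames `df` and 
`other_df` joined on a column `id` (names are illustrative):

{code}
# Workaround sketch: rebuilding the DataFrame from its RDD gives it a fresh
# logical plan (and fresh attribute ids), which avoids the analysis error.
df_rebuilt = df.rdd.toDF()
result = df_rebuilt.join(other_df, "id", how="inner")
{code}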

> Exception when joining DataFrames
> -
>
> Key: SPARK-10925
> URL: https://issues.apache.org/jira/browse/SPARK-10925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tested with Spark 1.5.0 and Spark 1.5.1
>Reporter: Alexis Seigneurin
> Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala, 
> TestCase.scala
>
>
> I get an exception when joining a DataFrame with another DataFrame. The 
> second DataFrame was created by performing an aggregation on the first 
> DataFrame.
> My complete workflow is:
> # read the DataFrame
> # apply an UDF on column "name"
> # apply an UDF on column "surname"
> # apply an UDF on column "birthDate"
> # aggregate on "name" and re-join with the DF
> # aggregate on "surname" and re-join with the DF
> If I remove one step, the process completes normally.
> Here is the exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in 
> operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS 
> birthDate_cleaned#8];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:520)
>   at TestCase2$.main(TestCase2.scala:51)
>   at TestCase2.main(TestCase2.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:4

[jira] [Created] (SPARK-21659) FileStreamSink checks for _spark_metadata even if path has globs

2017-08-07 Thread peay (JIRA)
peay created SPARK-21659:


 Summary: FileStreamSink checks for _spark_metadata even if path 
has globs
 Key: SPARK-21659
 URL: https://issues.apache.org/jira/browse/SPARK-21659
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, SQL
Affects Versions: 2.2.0
Reporter: peay
Priority: Minor


I am using the GCS connector for Hadoop, and reading a Dataframe using 
{{context.read.format("parquet").load("...")}}.

When my URI has glob patterns of the form
{code}
gs://uri/{a,b,c}
{code}
or as below, Spark incorrectly assumes that it is a single file path, and 
produces this rather verbose exception:

{code}
java.net.URISyntaxException: Illegal character in path at index xx: 
gs://bucket-name/path/to/date=2017-0{1-29,1-30,1-31,2-01,2-02,2-03,2-04}*/_spark_metadata
at java.net.URI$Parser.fail(URI.java:2848)
at java.net.URI$Parser.checkChars(URI.java:3021)
at java.net.URI$Parser.parseHierarchical(URI.java:3105)
at java.net.URI$Parser.parse(URI.java:3053)
at java.net.URI.(URI.java:588)
at 
com.google.cloud.hadoop.gcsio.LegacyPathCodec.getPath(LegacyPathCodec.java:93)
at 
com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.getGcsPath(GoogleHadoopFileSystem.java:171)
at 
com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getFileStatus(GoogleHadoopFileSystemBase.java:1421)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1436)
at 
org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:320)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
{code}

I am not quite sure if the GCS connector deviates from the HCFS standard here 
in terms of behavior, but this makes logs really hard to read for jobs that 
load a bunch of files like this.

https://github.com/apache/spark/blob/3ac60930865209bf804ec6506d9d8b0ddd613157/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala#L39
 already has an explicit {{case Seq(singlePath) =>}}, except that it is 
misleading because {{singlePath}} can have wildcards. In addition, it could 
check for non-escaped glob characters, like

{code}
{, }, ?, *
{code}

and fall back to the multiple-paths case when any of these are present, since the 
metadata lookup is skipped there.
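
For illustration, the kind of check suggested above could look like this (a Python 
sketch for readability only - the actual {{FileStreamSink}} code is Scala):

{code}
import re

# Treat a path as a glob when it contains an unescaped glob metacharacter;
# in that case, skip the single-path _spark_metadata lookup.
GLOB_CHARS = re.compile(r"(?<!\\)[{}?*]")

def looks_like_glob(path):
    return GLOB_CHARS.search(path) is not None

print(looks_like_glob("gs://uri/{a,b,c}"))           # True  -> multiple-paths case
print(looks_like_glob("gs://bucket/path/to/file"))   # False -> single-path case
{code}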






[jira] [Closed] (SPARK-21550) approxQuantiles throws "next on empty iterator" on empty data

2017-07-27 Thread peay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

peay closed SPARK-21550.

   Resolution: Duplicate
Fix Version/s: 2.2.0

> approxQuantiles throws "next on empty iterator" on empty data
> -
>
> Key: SPARK-21550
> URL: https://issues.apache.org/jira/browse/SPARK-21550
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: peay
> Fix For: 2.2.0
>
>
> The documentation says:
> {code}
> null and NaN values will be removed from the numerical column before 
> calculation. If
> the dataframe is empty or the column only contains null or NaN, an empty 
> array is returned.
> {code}
> However, this small pyspark example
> {code}
> sql_context.range(10).filter(col("id") == 42).approxQuantile("id", [0.99], 
> 0.001)
> {code}
> throws
> {code}
> Py4JJavaError: An error occurred while calling o3493.approxQuantile.
> : java.util.NoSuchElementException: next on empty iterator
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
>   at 
> scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
>   at scala.collection.IterableLike$class.head(IterableLike.scala:107)
>   at 
> scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:186)
>   at 
> scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
>   at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:186)
>   at 
> scala.collection.TraversableLike$class.last(TraversableLike.scala:431)
>   at 
> scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$last(ArrayOps.scala:186)
>   at 
> scala.collection.IndexedSeqOptimized$class.last(IndexedSeqOptimized.scala:132)
>   at scala.collection.mutable.ArrayOps$ofRef.last(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.catalyst.util.QuantileSummaries.query(QuantileSummaries.scala:207)
>   at 
> org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply$mcDD$sp(StatFunctions.scala:92)
>   at 
> org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(StatFunctions.scala:92)
>   at 
> org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(StatFunctions.scala:92)
> {code}






[jira] [Commented] (SPARK-21551) pyspark's collect fails when getaddrinfo is too slow

2017-07-27 Thread peay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16103539#comment-16103539
 ] 

peay commented on SPARK-21551:
--

Sure, does 15 seconds sound good?

> pyspark's collect fails when getaddrinfo is too slow
> 
>
> Key: SPARK-21551
> URL: https://issues.apache.org/jira/browse/SPARK-21551
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: peay
>Priority: Critical
>
> Pyspark's {{RDD.collect}}, as well as {{DataFrame.toLocalIterator}} and 
> {{DataFrame.collect}} all work by starting an ephemeral server in the driver, 
> and having Python connect to it to download the data.
> All three are implemented along the lines of:
> {code}
> port = self._jdf.collectToPython()
> return list(_load_from_socket(port, BatchedSerializer(PickleSerializer(
> {code}
> The server has **a hardcoded timeout of 3 seconds** 
> (https://github.com/apache/spark/blob/e26dac5feb02033f980b1e69c9b0ff50869b6f9e/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L695)
>  -- i.e., the Python process has 3 seconds to connect to it from the very 
> moment the driver server starts.
> In general, that seems fine, but I have been encountering frequent timeouts 
> leading to `Exception: could not open socket`.
> After investigating a bit, it turns out that {{_load_from_socket}} makes a 
> call to {{getaddrinfo}}:
> {code}
> def _load_from_socket(port, serializer):
> sock = None
> # Support for both IPv4 and IPv6.
> # On most of IPv6-ready systems, IPv6 will take precedence.
> for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, 
> socket.SOCK_STREAM):
>.. connect ..
> {code}
> I am not sure why, but while most such calls to {{getaddrinfo}} on my machine 
> only take a couple milliseconds, about 10% of them take between 2 and 10 
> seconds, leading to about 10% of jobs failing. I don't think we can always 
> expect {{getaddrinfo}} to return instantly. More generally, Python may 
> sometimes pause for a couple seconds, which may not leave enough time for the 
> process to connect to the server.
> Especially since the server timeout is hardcoded, I think it would be best to 
> set a rather generous value (15 seconds?) to avoid such situations.
> A {{getaddrinfo}}  specific fix could avoid doing it every single time, or do 
> it before starting up the driver server.
>  
> cc SPARK-677 [~davies]






[jira] [Created] (SPARK-21551) pyspark's collect fails when getaddrinfo is too slow

2017-07-27 Thread peay (JIRA)
peay created SPARK-21551:


 Summary: pyspark's collect fails when getaddrinfo is too slow
 Key: SPARK-21551
 URL: https://issues.apache.org/jira/browse/SPARK-21551
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.1.0
Reporter: peay
Priority: Critical


Pyspark's {{RDD.collect}}, as well as {{DataFrame.toLocalIterator}} and 
{{DataFrame.collect}} all work by starting an ephemeral server in the driver, 
and having Python connect to it to download the data.

All three are implemented along the lines of:

{code}
port = self._jdf.collectToPython()
return list(_load_from_socket(port, BatchedSerializer(PickleSerializer(
{code}

The server has **a hardcoded timeout of 3 seconds** 
(https://github.com/apache/spark/blob/e26dac5feb02033f980b1e69c9b0ff50869b6f9e/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L695)
 -- i.e., the Python process has 3 seconds to connect to it from the very 
moment the driver server starts.

In general, that seems fine, but I have been encountering frequent timeouts 
leading to `Exception: could not open socket`.

After investigating a bit, it turns out that {{_load_from_socket}} makes a call 
to {{getaddrinfo}}:

{code}
def _load_from_socket(port, serializer):
sock = None
# Support for both IPv4 and IPv6.
# On most of IPv6-ready systems, IPv6 will take precedence.
for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, 
socket.SOCK_STREAM):
   .. connect ..
{code}

I am not sure why, but while most such calls to {{getaddrinfo}} on my machine 
only take a couple milliseconds, about 10% of them take between 2 and 10 
seconds, leading to about 10% of jobs failing. I don't think we can always 
expect {{getaddrinfo}} to return instantly. More generally, Python may 
sometimes pause for a couple seconds, which may not leave enough time for the 
process to connect to the server.

Especially since the server timeout is hardcoded, I think it would be best to 
set a rather generous value (15 seconds?) to avoid such situations.

A {{getaddrinfo}}-specific fix could avoid doing the lookup every single time, or do 
it before starting up the driver server.
 
cc SPARK-677 [~davies]
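
To illustrate the failure mode, a small standalone sketch that times the same lookup 
{{_load_from_socket}} performs (no Spark needed; the port number is arbitrary since 
nothing actually connects):

{code}
import socket
import time

# Time the localhost resolution used by _load_from_socket; an occasional
# multi-second result here is enough to miss the hardcoded 3-second window.
start = time.time()
socket.getaddrinfo("localhost", 15002, socket.AF_UNSPEC, socket.SOCK_STREAM)
print("getaddrinfo(localhost) took %.3f s" % (time.time() - start))
{code}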






[jira] [Created] (SPARK-21550) approxQuantiles throws "next on empty iterator" on empty data

2017-07-27 Thread peay (JIRA)
peay created SPARK-21550:


 Summary: approxQuantiles throws "next on empty iterator" on empty 
data
 Key: SPARK-21550
 URL: https://issues.apache.org/jira/browse/SPARK-21550
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: peay


The documentation says:
{code}
null and NaN values will be removed from the numerical column before 
calculation. If
the dataframe is empty or the column only contains null or NaN, an empty array 
is returned.
{code}

However, this small pyspark example
{code}
sql_context.range(10).filter(col("id") == 42).approxQuantile("id", [0.99], 
0.001)
{code}

throws

{code}
Py4JJavaError: An error occurred while calling o3493.approxQuantile.
: java.util.NoSuchElementException: next on empty iterator
at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
at 
scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
at scala.collection.IterableLike$class.head(IterableLike.scala:107)
at 
scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:186)
at 
scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:186)
at 
scala.collection.TraversableLike$class.last(TraversableLike.scala:431)
at 
scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$last(ArrayOps.scala:186)
at 
scala.collection.IndexedSeqOptimized$class.last(IndexedSeqOptimized.scala:132)
at scala.collection.mutable.ArrayOps$ofRef.last(ArrayOps.scala:186)
at 
org.apache.spark.sql.catalyst.util.QuantileSummaries.query(QuantileSummaries.scala:207)
at 
org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply$mcDD$sp(StatFunctions.scala:92)
at 
org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(StatFunctions.scala:92)
at 
org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(StatFunctions.scala:92)
{code}
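
As a stopgap on affected versions, a sketch of guarding the call (assumes the same 
{{sql_context}} as above; note the extra {{count()}} pass):

{code}
from pyspark.sql.functions import col

# Workaround sketch: only call approxQuantile when the filtered column has rows,
# since empty input raises instead of returning [] on affected versions.
df = sql_context.range(10).filter(col("id") == 42)
quantiles = df.approxQuantile("id", [0.99], 0.001) if df.count() > 0 else []
{code}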






[jira] [Comment Edited] (SPARK-18473) Correctness issue in INNER join result with window functions

2016-11-16 Thread peay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15672245#comment-15672245
 ] 

peay edited comment on SPARK-18473 at 11/17/16 12:55 AM:
-

Ok, I see, thanks. The fix is in 2.0.3 though, not 2.0.2, correct? (edit: 
nevermind, the fix appears to be in 2.0.2 indeed).


was (Author: peay):
Ok, I see, thanks. The fix is in 2.0.3 though, not 2.0.2, correct?

> Correctness issue in INNER join result with window functions
> 
>
> Key: SPARK-18473
> URL: https://issues.apache.org/jira/browse/SPARK-18473
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.1
>Reporter: peay
>Assignee: Xiao Li
>
> I have stumbled onto a corner case where an INNER join appears to return 
> incorrect results. I believe the join should behave as the identity, but 
> instead, some values are shuffled around, and some are just plain wrong.
> This can be reproduced as follows: joining
> {code}
> +-+-+--+++--+--+
> |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
> +-+-+--+++--+--+
> |1|1| 1|   0|   1| 0| 1|
> |2|2| 0|   0|   1| 0| 1|
> |1|3| 1|   0|   2| 0| 2|
> +-+-+--+++--+--+
> {code}
> with
> {code}
> +--+
> |sessId|
> +--+
> | 1|
> | 2|
> +--+
> {code}
> The result is
> {code}
> +--+-+-+--+++--+
> |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
> +--+-+-+--+++--+
> | 1|2|2| 0|   0|   1| 0|
> | 2|1|1| 1|   0|   1|-1|
> | 2|1|3| 1|   0|   2| 0|
> +--+-+-+--+++--+
> {code}
> Note how two rows have a sessId of 2 (instead of one row as expected), and 
> how `fiftyCount` can now be negative while always zero in the original 
> dataframe.
> The first dataframe uses two windows:
> - `hasOne` uses a `window.rowsBetween(-10, 0)`.
> - `hasFifty` uses a `window.rowsBetween(-10, -1)`.
> The result is *correct* if:
> - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
> `window.rowsBetween(-10, -1)`.
> - I add {code}.fillna({ 'numOnesBefore': 0 }) {code} after the other call to 
> `fillna` -- although there are no visible effect on the dataframe as shown by 
> `show` as far as I can tell.
> - I use a LEFT OUTER join instead of INNER JOIN.
> - I write both dataframes to Parquet, read them back and join these.
> This can be reproduced in pyspark using:
> {code}
> import pyspark.sql.functions as F
> from pyspark.sql.functions import col
> from pyspark.sql.window import Window
> df1 = sql_context.createDataFrame(
> pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
> )
> window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")
> df2 = (
> df1
> .withColumn("hasOne", (col("index") == 1).cast("int"))
> .withColumn("hasFifty", (col("index") == 50).cast("int"))
> .withColumn("numOnesBefore", 
> F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
> .withColumn("numFiftyStrictlyBefore", 
> F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
> .fillna({ 'numFiftyStrictlyBefore': 0 })
> .withColumn("sessId", col("numOnesBefore") - 
> col("numFiftyStrictlyBefore"))
> )
> df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
> df_joined = df_selector.join(df2, "sessId", how="inner")
> df2.show()
> df_selector.show()
> df_joined.show()
> {code}
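
As a side note, a minimal sketch of the Parquet round-trip workaround mentioned in 
the description (paths are placeholders; assumes the same {{sql_context}}, {{df2}} 
and {{df_selector}} as in the repro above):

{code}
# Workaround sketch: writing both sides to Parquet and reading them back
# gives the join fresh plans, which produced correct results in our case.
df2.write.parquet("/tmp/spark18473_df2.parquet")
df_selector.write.parquet("/tmp/spark18473_selector.parquet")

df2_rt = sql_context.read.parquet("/tmp/spark18473_df2.parquet")
selector_rt = sql_context.read.parquet("/tmp/spark18473_selector.parquet")

selector_rt.join(df2_rt, "sessId", how="inner").show()
{code}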






[jira] [Commented] (SPARK-18473) Correctness issue in INNER join result with window functions

2016-11-16 Thread peay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15672245#comment-15672245
 ] 

peay commented on SPARK-18473:
--

Ok, I see, thanks. The fix is in 2.0.3 though, not 2.0.2, correct?

> Correctness issue in INNER join result with window functions
> 
>
> Key: SPARK-18473
> URL: https://issues.apache.org/jira/browse/SPARK-18473
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.1
>Reporter: peay
>Assignee: Xiao Li
>
> I have stumbled onto a corner case where an INNER join appears to return 
> incorrect results. I believe the join should behave as the identity, but 
> instead, some values are shuffled around, and some are just plain wrong.
> This can be reproduced as follows: joining
> {code}
> +-+-+--+++--+--+
> |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
> +-+-+--+++--+--+
> |1|1| 1|   0|   1| 0| 1|
> |2|2| 0|   0|   1| 0| 1|
> |1|3| 1|   0|   2| 0| 2|
> +-+-+--+++--+--+
> {code}
> with
> {code}
> +--+
> |sessId|
> +--+
> | 1|
> | 2|
> +--+
> {code}
> The result is
> {code}
> +--+-+-+--+++--+
> |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
> +--+-+-+--+++--+
> | 1|2|2| 0|   0|   1| 0|
> | 2|1|1| 1|   0|   1|-1|
> | 2|1|3| 1|   0|   2| 0|
> +--+-+-+--+++--+
> {code}
> Note how two rows have a sessId of 2 (instead of one row as expected), and 
> how `fiftyCount` can now be negative while always zero in the original 
> dataframe.
> The first dataframe uses two windows:
> - `hasOne` uses a `window.rowsBetween(-10, 0)`.
> - `hasFifty` uses a `window.rowsBetween(-10, -1)`.
> The result is *correct* if:
> - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
> `window.rowsBetween(-10, -1)`.
> - I add {code}.fillna({ 'numOnesBefore': 0 }) {code} after the other call to 
> `fillna` -- although there are no visible effect on the dataframe as shown by 
> `show` as far as I can tell.
> - I use a LEFT OUTER join instead of INNER JOIN.
> - I write both dataframes to Parquet, read them back and join these.
> This can be reproduced in pyspark using:
> {code}
> import pyspark.sql.functions as F
> from pyspark.sql.functions import col
> from pyspark.sql.window import Window
> df1 = sql_context.createDataFrame(
> pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
> )
> window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")
> df2 = (
> df1
> .withColumn("hasOne", (col("index") == 1).cast("int"))
> .withColumn("hasFifty", (col("index") == 50).cast("int"))
> .withColumn("numOnesBefore", 
> F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
> .withColumn("numFiftyStrictlyBefore", 
> F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
> .fillna({ 'numFiftyStrictlyBefore': 0 })
> .withColumn("sessId", col("numOnesBefore") - 
> col("numFiftyStrictlyBefore"))
> )
> df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
> df_joined = df_selector.join(df2, "sessId", how="inner")
> df2.show()
> df_selector.show()
> df_joined.show()
> {code}






[jira] [Commented] (SPARK-18473) Correctness issue in INNER join result with window functions

2016-11-16 Thread peay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15671624#comment-15671624
 ] 

peay commented on SPARK-18473:
--

Ah, great, thanks. I had checked the CHANGELOG but couldn't find anything 
relevant - is there a reference to the corresponding issue?

> Correctness issue in INNER join result with window functions
> 
>
> Key: SPARK-18473
> URL: https://issues.apache.org/jira/browse/SPARK-18473
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.1
>Reporter: peay
>
> I have stumbled onto a corner case where an INNER join appears to return 
> incorrect results. I believe the join should behave as the identity, but 
> instead, some values are shuffled around, and some are just plain wrong.
> This can be reproduced as follows: joining
> {code}
> +-+-+--+++--+--+
> |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
> +-+-+--+++--+--+
> |1|1| 1|   0|   1| 0| 1|
> |2|2| 0|   0|   1| 0| 1|
> |1|3| 1|   0|   2| 0| 2|
> +-+-+--+++--+--+
> {code}
> with
> {code}
> +--+
> |sessId|
> +--+
> | 1|
> | 2|
> +--+
> {code}
> The result is
> {code}
> +--+-+-+--+++--+
> |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
> +--+-+-+--+++--+
> | 1|2|2| 0|   0|   1| 0|
> | 2|1|1| 1|   0|   1|-1|
> | 2|1|3| 1|   0|   2| 0|
> +--+-+-+--+++--+
> {code}
> Note how two rows have a sessId of 2 (instead of one row as expected), and 
> how `fiftyCount` can now be negative while always zero in the original 
> dataframe.
> The first dataframe uses two windows:
> - `hasOne` uses a `window.rowsBetween(-10, 0)`.
> - `hasFifty` uses a `window.rowsBetween(-10, -1)`.
> The result is *correct* if:
> - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
> `window.rowsBetween(-10, -1)`.
> - I add {code}.fillna({ 'numOnesBefore': 0 }) {code} after the other call to 
> `fillna` -- although there are no visible effect on the dataframe as shown by 
> `show` as far as I can tell.
> - I use a LEFT OUTER join instead of INNER JOIN.
> - I write both dataframes to Parquet, read them back and join these.
> This can be reproduced in pyspark using:
> {code}
> import pyspark.sql.functions as F
> from pyspark.sql.functions import col
> from pyspark.sql.window import Window
> df1 = sql_context.createDataFrame(
> pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
> )
> window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")
> df2 = (
> df1
> .withColumn("hasOne", (col("index") == 1).cast("int"))
> .withColumn("hasFifty", (col("index") == 50).cast("int"))
> .withColumn("numOnesBefore", 
> F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
> .withColumn("numFiftyStrictlyBefore", 
> F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
> .fillna({ 'numFiftyStrictlyBefore': 0 })
> .withColumn("sessId", col("numOnesBefore") - 
> col("numFiftyStrictlyBefore"))
> )
> df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
> df_joined = df_selector.join(df2, "sessId", how="inner")
> df2.show()
> df_selector.show()
> df_joined.show()
> {code}






[jira] [Updated] (SPARK-18473) Correctness issue in INNER join result with window functions

2016-11-16 Thread peay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

peay updated SPARK-18473:
-
Description: 
I have stumbled onto a corner case where an INNER join appears to return 
incorrect results. I believe the join should behave as the identity, but 
instead, some values are shuffled around, and some are just plain wrong.

This can be reproduced as follows: joining

{code}
+-----+---------+------+--------+--------+----------+------+
|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
+-----+---------+------+--------+--------+----------+------+
|    1|        1|     1|       0|       1|         0|     1|
|    2|        2|     0|       0|       1|         0|     1|
|    1|        3|     1|       0|       2|         0|     2|
+-----+---------+------+--------+--------+----------+------+
{code}

with

{code}
+------+
|sessId|
+------+
|     1|
|     2|
+------+
{code}

The result is

{code}
+------+-----+---------+------+--------+--------+----------+
|sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
+------+-----+---------+------+--------+--------+----------+
|     1|    2|        2|     0|       0|       1|         0|
|     2|    1|        1|     1|       0|       1|        -1|
|     2|    1|        3|     1|       0|       2|         0|
+------+-----+---------+------+--------+--------+----------+
{code}

Note how two rows have a sessId of 2 (instead of one row as expected), and how 
`fiftyCount` can now be negative, whereas it is always zero in the original dataframe.

The first dataframe uses two windows:
- `hasOne` uses a `window.rowsBetween(-10, 0)`.
- `hasFifty` uses a `window.rowsBetween(-10, -1)`.

The result is *correct* if:
- `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
`window.rowsBetween(-10, -1)`.
- I add {code}.fillna({ 'numOnesBefore': 0 }) {code} after the other call to 
`fillna` -- although there is no visible effect on the dataframe as shown by 
`show`, as far as I can tell.
- I use a LEFT OUTER join instead of INNER JOIN.
- I write both dataframes to Parquet, read them back and join these.

This can be reproduced in pyspark using:

{code}
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window

df1 = sql_context.createDataFrame(
pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
)

window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")

df2 = (
df1
.withColumn("hasOne", (col("index") == 1).cast("int"))
.withColumn("hasFifty", (col("index") == 50).cast("int"))
.withColumn("numOnesBefore", 
F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
.withColumn("numFiftyStrictlyBefore", 
F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
.fillna({ 'numFiftyStrictlyBefore': 0 })
.withColumn("sessId", col("numOnesBefore") - col("numFiftyStrictlyBefore"))
)

df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
df_joined = df_selector.join(df2, "sessId", how="inner")

df2.show()
df_selector.show()
df_joined.show()
{code}

  was:
I have stumbled onto a corner case where an INNER join appears to return 
incorrect results. I believe the join should behave as the identity, but 
instead, some values are shuffled around, and some are just plain wrong.

This can be reproduced as follows: joining

{code}
+-+-+--+++--+--+
|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
+-+-+--+++--+--+
|1|1| 1|   0|   1| 0| 1|
|2|2| 0|   0|   1| 0| 1|
|1|3| 1|   0|   2| 0| 2|
+-+-+--+++--+--+
{code}

with

{code}
+--+
|sessId|
+--+
| 1|
| 2|
+--+
{code}

The result is

{code}
+--+-+-+--+++--+
|sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
+--+-+-+--+++--+
| 1|2|2| 0|   0|   1| 0|
| 2|1|1| 1|   0|   1|-1|
| 2|1|3| 1|   0|   2| 0|
+--+-+-+--+++--+
{code}

Note how rows have a sessId of 2 (instead of one row as expected), and how 
`fiftyCount` can now be negative while always zero in the original dataframe.

The first dataframe uses two windows:
- `hasOne` uses a `window.rowsBetween(-10, 0)`.
- `hasFifty` uses a `window.rowsBetween(-10, -1)`.

The result is *correct* if:
- `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
`window.rowsBetween(-10, -1)`.
- I add {code}.fillna({ 'numOnesBefore': 0 }) {code} after the other call to 
`fillna` -- although there are no visible effect on the dataframe as shown by 
`show` as far as I can tell.
- 

[jira] [Updated] (SPARK-18473) Correctness issue in INNER join result with window functions

2016-11-16 Thread peay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

peay updated SPARK-18473:
-
Description: 
I have stumbled onto a corner case where an INNER join appears to return 
incorrect results. I believe the join should behave as the identity, but 
instead, some values are shuffled around, and some are just plain wrong.

This can be reproduced as follows: joining

{code}
+-----+---------+------+--------+--------+----------+------+
|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
+-----+---------+------+--------+--------+----------+------+
|    1|        1|     1|       0|       1|         0|     1|
|    2|        2|     0|       0|       1|         0|     1|
|    1|        3|     1|       0|       2|         0|     2|
+-----+---------+------+--------+--------+----------+------+
{code}

with

{code}
+------+
|sessId|
+------+
|     1|
|     2|
+------+
{code}

The result is

{code}
+------+-----+---------+------+--------+--------+----------+
|sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
+------+-----+---------+------+--------+--------+----------+
|     1|    2|        2|     0|       0|       1|         0|
|     2|    1|        1|     1|       0|       1|        -1|
|     2|    1|        3|     1|       0|       2|         0|
+------+-----+---------+------+--------+--------+----------+
{code}

Note how rows have a sessId of 2 (instead of one row as expected), and how 
`fiftyCount` can now be negative while always zero in the original dataframe.

The first dataframe uses two windows:
- `hasOne` uses a `window.rowsBetween(-10, 0)`.
- `hasFifty` uses a `window.rowsBetween(-10, -1)`.

The result is *correct* if:
- `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
`window.rowsBetween(-10, -1)`.
- I add {code}.fillna({ 'numOnesBefore': 0 }) {code} after the other call to 
`fillna` -- although there are no visible effect on the dataframe as shown by 
`show` as far as I can tell.
- I use a LEFT OUTER join instead of INNER JOIN.
- I write both dataframes to Parquet, read them back and join these.

This can be reproduced in pyspark using:

{code}
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window

df1 = sql_context.createDataFrame(
pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
)

window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")

df2 = (
df1
.withColumn("hasOne", (col("index") == 1).cast("int"))
.withColumn("hasFifty", (col("index") == 50).cast("int"))
.withColumn("numOnesBefore", 
F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
.withColumn("numFiftyStrictlyBefore", 
F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
.fillna({ 'numFiftyStrictlyBefore': 0 })
.withColumn("sessId", col("numOnesBefore") - col("numFiftyStrictlyBefore"))
)

df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
df_joined = df_selector.join(df2, "sessId", how="inner")

df2.show()
df_selector.show()
df_joined.show()
{code}

[jira] [Created] (SPARK-18473) Correctness issue in INNER join result with window functions

2016-11-16 Thread peay (JIRA)
peay created SPARK-18473:


 Summary: Correctness issue in INNER join result with window 
functions
 Key: SPARK-18473
 URL: https://issues.apache.org/jira/browse/SPARK-18473
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core, SQL
Affects Versions: 2.0.1
Reporter: peay





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org