[jira] [Commented] (SPARK-31632) The ApplicationInfo in KVStore may be accessed before it's prepared

2020-06-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134655#comment-17134655
 ] 

Apache Spark commented on SPARK-31632:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/28820

> The ApplicationInfo in KVStore may be accessed before it's prepared
> ---
>
> Key: SPARK-31632
> URL: https://issues.apache.org/jira/browse/SPARK-31632
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 3.0.0
>Reporter: Xingcan Cui
>Assignee: Xingcan Cui
>Priority: Minor
> Fix For: 2.4.6, 3.0.0
>
>
> While starting some local tests, I occasionally encountered the following 
> exceptions from the Web UI.
> {noformat}
> 23:00:29.845 WARN org.eclipse.jetty.server.HttpChannel: /jobs/
>  java.util.NoSuchElementException
>  at java.util.Collections$EmptyIterator.next(Collections.java:4191)
>  at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryIterator.next(InMemoryStore.java:467)
>  at 
> org.apache.spark.status.AppStatusStore.applicationInfo(AppStatusStore.scala:39)
>  at org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:266)
>  at org.apache.spark.ui.WebUI.$anonfun$attachPage$1(WebUI.scala:89)
>  at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:80)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>  at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:873)
>  at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623)
>  at 
> org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95)
>  at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
>  at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
>  at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
>  at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
>  at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
>  at 
> org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:753)
>  at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
>  at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>  at org.eclipse.jetty.server.Server.handle(Server.java:505)
>  at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
>  at 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
>  at 
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
>  at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
>  at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
>  at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698)
>  at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804)
>  at java.lang.Thread.run(Thread.java:748){noformat}
> *Reason*
>  That is because {{AppStatusStore.applicationInfo()}} accesses an empty view 
> (iterator) returned by {{InMemoryStore}}.
> AppStatusStore
> {code:java}
> def applicationInfo(): v1.ApplicationInfo = {
> store.view(classOf[ApplicationInfoWrapper]).max(1).iterator().next().info
> }
> {code}
> InMemoryStore
> {code:java}
> public <T> KVStoreView<T> view(Class<T> type) {
>   InstanceList<T> list = inMemoryLists.get(type);
>   return list != null ? list.view() : emptyView();
> }
> {code}
> During the initialization of {{SparkContext}}, it first starts the Web UI 
> (SparkContext: L475 {{_ui.foreach(_.bind())}}) and then sets up the 
> {{LiveListenerBus}} thread (SparkContext: L608 
> {{setupAndStartListenerBus()}}) for dispatching the 
> {{SparkListenerApplicationStart}} event (which will trigger writing the 
> requested {{ApplicationInfo}} to {{InMemoryStore}}).
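For illustration, a minimal sketch of the defensive idea implied above (not necessarily the approach taken in the linked PR): take the first element of the possibly-empty KVStore iterator as an Option instead of calling next() unconditionally. The helper name and the commented call site are hypothetical.
{code:java}
// Hypothetical helper: first element of a possibly-empty java.util.Iterator.
def firstOption[T](it: java.util.Iterator[T]): Option[T] =
  if (it.hasNext) Some(it.next()) else None

// Hypothetical use at the call site quoted above:
//   firstOption(store.view(classOf[ApplicationInfoWrapper]).max(1).iterator())
//     .map(_.info)
//     .getOrElse(throw new NoSuchElementException("application info is not ready yet"))
{code}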









[jira] [Commented] (SPARK-31980) Spark sequence() fails if start and end of range are identical dates

2020-06-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134642#comment-17134642
 ] 

Apache Spark commented on SPARK-31980:
--

User 'TJX2014' has created a pull request for this issue:
https://github.com/apache/spark/pull/28819

> Spark sequence() fails if start and end of range are identical dates
> 
>
> Key: SPARK-31980
> URL: https://issues.apache.org/jira/browse/SPARK-31980
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: Spark 2.4.4 standalone and on AWS EMR
>Reporter: Dave DeCaprio
>Priority: Minor
>
>  
> The following Spark SQL query throws an exception
> {code:java}
> select sequence(cast("2011-03-01" as date), cast("2011-03-01" as date), 
> interval 1 month)
> {code}
> The error is:
>  
>  
> {noformat}
> java.lang.ArrayIndexOutOfBoundsException: 1 at 
> scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:92) at 
> org.apache.spark.sql.catalyst.expressions.Sequence$TemporalSequenceImpl.eval(collectionOperations.scala:2681)
>  at 
> org.apache.spark.sql.catalyst.expressions.Sequence.eval(collectionOperations.scala:2514)
>  at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:389){noformat}
>  






[jira] [Assigned] (SPARK-31980) Spark sequence() fails if start and end of range are identical dates

2020-06-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31980:


Assignee: Apache Spark

> Spark sequence() fails if start and end of range are identical dates
> 
>
> Key: SPARK-31980
> URL: https://issues.apache.org/jira/browse/SPARK-31980
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: Spark 2.4.4 standalone and on AWS EMR
>Reporter: Dave DeCaprio
>Assignee: Apache Spark
>Priority: Minor
>
>  
> The following Spark SQL query throws an exception
> {code:java}
> select sequence(cast("2011-03-01" as date), cast("2011-03-01" as date), 
> interval 1 month)
> {code}
> The error is:
>  
>  
> {noformat}
> java.lang.ArrayIndexOutOfBoundsException: 1 at 
> scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:92) at 
> org.apache.spark.sql.catalyst.expressions.Sequence$TemporalSequenceImpl.eval(collectionOperations.scala:2681)
>  at 
> org.apache.spark.sql.catalyst.expressions.Sequence.eval(collectionOperations.scala:2514)
>  at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:389){noformat}
>  









[jira] [Assigned] (SPARK-31980) Spark sequence() fails if start and end of range are identical dates

2020-06-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31980:


Assignee: (was: Apache Spark)

> Spark sequence() fails if start and end of range are identical dates
> 
>
> Key: SPARK-31980
> URL: https://issues.apache.org/jira/browse/SPARK-31980
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: Spark 2.4.4 standalone and on AWS EMR
>Reporter: Dave DeCaprio
>Priority: Minor
>
>  
> The following Spark SQL query throws an exception
> {code:java}
> select sequence(cast("2011-03-01" as date), cast("2011-03-01" as date), 
> interval 1 month)
> {code}
> The error is:
>  
>  
> {noformat}
> java.lang.ArrayIndexOutOfBoundsException: 1 at 
> scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:92) at 
> org.apache.spark.sql.catalyst.expressions.Sequence$TemporalSequenceImpl.eval(collectionOperations.scala:2681)
>  at 
> org.apache.spark.sql.catalyst.expressions.Sequence.eval(collectionOperations.scala:2514)
>  at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:389){noformat}
>  






[jira] [Commented] (SPARK-31198) Use graceful decommissioning as part of dynamic scaling

2020-06-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134628#comment-17134628
 ] 

Apache Spark commented on SPARK-31198:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28818

> Use graceful decommissioning as part of dynamic scaling
> ---
>
> Key: SPARK-31198
> URL: https://issues.apache.org/jira/browse/SPARK-31198
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
>







[jira] [Assigned] (SPARK-31198) Use graceful decommissioning as part of dynamic scaling

2020-06-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31198:


Assignee: Apache Spark  (was: Holden Karau)

> Use graceful decommissioning as part of dynamic scaling
> ---
>
> Key: SPARK-31198
> URL: https://issues.apache.org/jira/browse/SPARK-31198
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Assignee: Apache Spark
>Priority: Major
>










[jira] [Assigned] (SPARK-31198) Use graceful decommissioning as part of dynamic scaling

2020-06-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31198:


Assignee: Holden Karau  (was: Apache Spark)

> Use graceful decommissioning as part of dynamic scaling
> ---
>
> Key: SPARK-31198
> URL: https://issues.apache.org/jira/browse/SPARK-31198
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
>







[jira] [Assigned] (SPARK-31197) Exit the executor once all tasks & migrations are finished

2020-06-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31197:


Assignee: (was: Apache Spark)

> Exit the executor once all tasks & migrations are finished
> --
>
> Key: SPARK-31197
> URL: https://issues.apache.org/jira/browse/SPARK-31197
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Priority: Minor
>







[jira] [Assigned] (SPARK-31197) Exit the executor once all tasks & migrations are finished

2020-06-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31197:


Assignee: Apache Spark

> Exit the executor once all tasks & migrations are finished
> --
>
> Key: SPARK-31197
> URL: https://issues.apache.org/jira/browse/SPARK-31197
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Commented] (SPARK-31197) Exit the executor once all tasks & migrations are finished

2020-06-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134626#comment-17134626
 ] 

Apache Spark commented on SPARK-31197:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28817

> Exit the executor once all tasks & migrations are finished
> --
>
> Key: SPARK-31197
> URL: https://issues.apache.org/jira/browse/SPARK-31197
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Priority: Minor
>







[jira] [Commented] (SPARK-24528) Missing optimization for Aggregations/Windowing on a bucketed table

2020-06-12 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134610#comment-17134610
 ] 

L. C. Hsieh commented on SPARK-24528:
-

The bucket info, including bucket IDs and the files in each bucket, is a detail of a 
particular physical plan (FileSourceScanExec). A merge-sort operator needs to 
know these details in order to do the merge sort. I feel we should extract the 
iteration logic in FileScanRDD and do the merge sort there.

[~cloud_fan] WDYT?
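
For illustration, a self-contained sketch of the kind of merge described above: several per-file iterators, each already sorted on the bucket's sort key, merged with a small heap instead of a global Sort. The names and the heap-based approach are illustrative only, not FileScanRDD's actual API.
{code:java}
import scala.collection.mutable

// Merge k iterators that are each already sorted according to `ord`.
def mergeSorted[T](iters: Seq[Iterator[T]])(implicit ord: Ordering[T]): Iterator[T] =
  new Iterator[T] {
    private case class Head(value: T, src: BufferedIterator[T])
    // PriorityQueue is a max-heap, so reverse the ordering to pop the smallest head first.
    private val heap = mutable.PriorityQueue.empty[Head](Ordering.by[Head, T](_.value)(ord.reverse))
    iters.map(_.buffered).filter(_.hasNext).foreach(it => heap.enqueue(Head(it.head, it)))

    def hasNext: Boolean = heap.nonEmpty
    def next(): T = {
      val Head(v, src) = heap.dequeue()
      src.next()                                  // consume the element just returned
      if (src.hasNext) heap.enqueue(Head(src.head, src))
      v
    }
  }
{code}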

> Missing optimization for Aggregations/Windowing on a bucketed table
> ---
>
> Key: SPARK-24528
> URL: https://issues.apache.org/jira/browse/SPARK-24528
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Ohad Raviv
>Priority: Major
>
> Closely related to SPARK-24410, we're trying to optimize a very common use 
> case we have of getting the most updated row by id from a fact table.
> We're saving the table bucketed to skip the shuffle stage, but we still 
> "waste" time on the Sort operator even though the data is already sorted.
> Here's a good example:
> {code:java}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
> .repartition(col("key"))
> .write
>   .mode(SaveMode.Overwrite)
> .bucketBy(3, "key")
> .sortBy("key", "t1")
> .saveAsTable("a1"){code}
> {code:java}
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#24L], functions=[max(named_struct(t1, t1#25L, key, 
> key#24L, t1, t1#25L, t2, t2#26L))])
> +- SortAggregate(key=[key#24L], functions=[partial_max(named_struct(t1, 
> t1#25L, key, key#24L, t1, t1#25L, t2, t2#26L))])
> +- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: true, 
> Format: Parquet, Location: ...{code}
>  
> And here's a bad, but more realistic, example:
> {code:java}
> sparkSession.sql("set spark.sql.shuffle.partitions=2")
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#32L], functions=[max(named_struct(t1, t1#33L, key, 
> key#32L, t1, t1#33L, t2, t2#34L))])
> +- SortAggregate(key=[key#32L], functions=[partial_max(named_struct(t1, 
> t1#33L, key, key#32L, t1, t1#33L, t2, t2#34L))])
> +- *(1) Sort [key#32L ASC NULLS FIRST], false, 0
> +- *(1) FileScan parquet default.a1[key#32L,t1#33L,t2#34L] Batched: true, 
> Format: Parquet, Location: ...
> {code}
>  
> I've traced the problem to DataSourceScanExec#235:
> {code:java}
> val sortOrder = if (sortColumns.nonEmpty) {
>   // In case of bucketing, its possible to have multiple files belonging to 
> the
>   // same bucket in a given relation. Each of these files are locally sorted
>   // but those files combined together are not globally sorted. Given that,
>   // the RDD partition will not be sorted even if the relation has sort 
> columns set
>   // Current solution is to check if all the buckets have a single file in it
>   val files = selectedPartitions.flatMap(partition => partition.files)
>   val bucketToFilesGrouping =
> files.map(_.getPath.getName).groupBy(file => 
> BucketingUtils.getBucketId(file))
>   val singleFilePartitions = bucketToFilesGrouping.forall(p => p._2.length <= 
> 1){code}
> So obviously the code avoids dealing with this situation for now.
> Could you think of a way to solve this or bypass it?






[jira] [Assigned] (SPARK-31967) Loading jobs UI page takes 40 seconds

2020-06-12 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-31967:
--

Assignee: Gengliang Wang

> Loading jobs UI page takes 40 seconds
> -
>
> Key: SPARK-31967
> URL: https://issues.apache.org/jira/browse/SPARK-31967
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Blocker
> Attachments: load_time.jpeg, profile.png
>
>
> In the latest master branch, I find that the job list page becomes very slow.
> To reproduce in local setup:
> {code:java}
> spark.read.parquet("/tmp/p1").createOrReplaceTempView("t1")
> spark.read.parquet("/tmp/p2").createOrReplaceTempView("t2")
> (1 to 1000).map(_ =>  spark.sql("select * from t1, t2 where 
> t1.value=t2.value").show())
> {code}
> After that, open the live UI: http://localhost:4040/
> The loading time is about 40 seconds.
> If we comment out the function call for `drawApplicationTimeline`, then the 
> loading time is around 1 second.






[jira] [Commented] (SPARK-31967) Loading jobs UI page takes 40 seconds

2020-06-12 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134608#comment-17134608
 ] 

Gengliang Wang commented on SPARK-31967:


The issue is resolved in https://github.com/apache/spark/pull/28806

> Loading jobs UI page takes 40 seconds
> -
>
> Key: SPARK-31967
> URL: https://issues.apache.org/jira/browse/SPARK-31967
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: Gengliang Wang
>Priority: Blocker
> Attachments: load_time.jpeg, profile.png
>
>
> In the latest master branch, I find that the job list page becomes very slow.
> To reproduce in local setup:
> {code:java}
> spark.read.parquet("/tmp/p1").createOrReplaceTempView("t1")
> spark.read.parquet("/tmp/p2").createOrReplaceTempView("t2")
> (1 to 1000).map(_ =>  spark.sql("select * from t1, t2 where 
> t1.value=t2.value").show())
> {code}
> After that, open the live UI: http://localhost:4040/
> The loading time is about 40 seconds.
> If we comment out the function call for `drawApplicationTimeline`, then the 
> loading time is around 1 second.






[jira] [Resolved] (SPARK-31967) Loading jobs UI page takes 40 seconds

2020-06-12 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-31967.

Resolution: Fixed

> Loading jobs UI page takes 40 seconds
> -
>
> Key: SPARK-31967
> URL: https://issues.apache.org/jira/browse/SPARK-31967
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Blocker
> Attachments: load_time.jpeg, profile.png
>
>
> In the latest master branch, I find that the job list page becomes very slow.
> To reproduce in local setup:
> {code:java}
> spark.read.parquet("/tmp/p1").createOrReplaceTempView("t1")
> spark.read.parquet("/tmp/p2").createOrReplaceTempView("t2")
> (1 to 1000).map(_ =>  spark.sql("select * from t1, t2 where 
> t1.value=t2.value").show())
> {code}
> After that, open the live UI: http://localhost:4040/
> The loading time is about 40 seconds.
> If we comment out the function call for `drawApplicationTimeline`, then the 
> loading time is around 1 second.






[jira] [Updated] (SPARK-31976) use MemoryUsage to control the size of block

2020-06-12 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-31976:
---
Priority: Blocker  (was: Major)

> use MemoryUsage to control the size of block
> 
>
> Key: SPARK-31976
> URL: https://issues.apache.org/jira/browse/SPARK-31976
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Blocker
>
> According to the performance test in 
> https://issues.apache.org/jira/browse/SPARK-31783, the performance gain is 
> mainly related to the nnz (number of non-zeros) of a block.
> So it may be reasonable to control the size of a block by memory usage instead 
> of by number of rows.
>  
> Note 1: the param blockSize is already used in ALS and MLP to stack vectors 
> (expected to be dense);
> Note 2: we may refer to {{Strategy.maxMemoryInMB}} in tree models;
>  
> There may be two ways to implement this (a rough sketch of the second is given below):
> 1. Compute the sparsity of the input vectors ahead of training (this can be done 
> together with other statistics computation, maybe with no extra pass) and infer a 
> reasonable number of vectors to stack;
> 2. Stack the input vectors adaptively, by monitoring the memory usage within a 
> block.
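
A rough, hypothetical sketch of the second option, assuming a per-entry cost of roughly 8 bytes per active value plus a small constant per vector; the grouping helper below is illustrative, not the actual ML implementation.
{code:java}
import org.apache.spark.ml.linalg.Vector

// Stack incoming vectors into blocks until an estimated byte budget is reached,
// instead of stacking a fixed number of rows.
def stackByMemory(rows: Iterator[Vector], maxBlockSizeInBytes: Long): Iterator[Array[Vector]] =
  new Iterator[Array[Vector]] {
    def hasNext: Boolean = rows.hasNext
    def next(): Array[Vector] = {
      val buf = scala.collection.mutable.ArrayBuffer.empty[Vector]
      var estimatedBytes = 0L
      while (rows.hasNext && (buf.isEmpty || estimatedBytes < maxBlockSizeInBytes)) {
        val v = rows.next()
        estimatedBytes += 8L * v.numActives + 16L  // crude estimate: 8 bytes per active entry + overhead
        buf += v
      }
      buf.toArray
    }
  }
{code}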






[jira] [Updated] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path

2020-06-12 Thread Christopher Highman (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christopher Highman updated SPARK-31962:

Description: 
When using structured streaming with a FileDataSource, I've encountered a 
number of occasions where I want to be able to stream from a folder containing 
any number of historical files in CSV format.  When I start reading from a 
folder, however, I might only care about files that were created after a 
certain time.
{code:java}
spark.readStream
 .option("header", "true")
 .option("delimiter", "\t")
 .format("csv")
 .load("/mnt/Deltas")
{code}
In 
[https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala],
 there is a method, _listLeafFiles,_ which builds _FileStatus_ objects 
containing an implicit _modificationDate_ property.  We may already iterate the 
resulting files if a filter is applied to the path.  In this case, it's trivial 
to do a primitive comparison between _modificationDate_ and a date specified 
from an option.  Without the filter specified, we would be expending less 
effort than if the filter were applied by itself, since we are comparing 
primitives.

Having the ability to provide an option for specifying a timestamp when 
loading files from a path would minimize complexity for consumers who load 
files or do structured streaming from a folder path but have no interest in 
reading what could be thousands of files that are not relevant.

One example could be "_filesModifiedAfterDate_", accepting a UTC datetime 
like below.
{code:java}
spark.readStream
 .option("header", "true")
 .option("delimiter", "\t")
 .option("filesModifiedAfterDate", "2020-05-01T12:00:00")
 .format("csv")
 .load("/mnt/Deltas")
{code}
If this option is specified, the expected behavior would be that files within 
the _"/mnt/Deltas/"_ path must have been modified at or later than the 
specified time in order to be consumed for purposes of reading files from a 
folder path or via structured streaming.

 I have unit tests passing under _CSVSuite_ and _FileIndexSuite_ in the 
_spark.sql.execution.datasources_ package.
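
For illustration only, a minimal sketch of the primitive comparison described above, written directly against Hadoop's FileStatus; the option name and the helper are hypothetical, not existing Spark API.
{code:java}
import java.time.{LocalDateTime, ZoneOffset}
import org.apache.hadoop.fs.FileStatus

// Keep only files modified at or after the (proposed) "filesModifiedAfterDate" value,
// interpreted as an ISO-8601 timestamp in UTC.
def filterByModificationTime(files: Seq[FileStatus], afterOpt: Option[String]): Seq[FileStatus] =
  afterOpt match {
    case None => files
    case Some(ts) =>
      val cutoffMillis = LocalDateTime.parse(ts).toInstant(ZoneOffset.UTC).toEpochMilli
      files.filter(_.getModificationTime >= cutoffMillis)  // primitive long comparison
  }
{code}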

  was:
When using structured streaming with a FileDataSource, I've encountered a 
number of occasions where I want to be able to stream from a folder containing 
any number of historical files in CSV format.  When I start reading from a 
folder, however, I might only care about files that were created after a 
certain time.
{code:java}
spark.readStream
 .option("header", "true")
 .option("delimiter", "\t")
 .format("csv")
 .load("/mnt/Deltas")
{code}
In 
[https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala],
 there is a method, _listLeafFiles,_ which builds _FileStatus_ objects 
containing an implicit _modificationDate_ property.  We may already iterate the 
resulting files if a filter is applied to the path.  In this case, its trivial 
to do a primitive comparison against _modificationDate_ and a date specified 
from an option.  Without the filter specified, we would be expending less 
effort than if the filter were applied by itself since we are comparing 
primitives.  

Having the ability to provide an option where specifying a timestamp when 
loading files from a path would minimize complexity for consumers who leverage 
the ability to load files or do structured streaming from a folder path but do 
not have an interest in reading what could be thousands of files that are not 
relevant.

One example to could be "_filesModifiedAfterDate_" accepting a UTC datetime 
like below.
{code:java}
spark.readStream
 .option("header", "true")
 .option("delimiter", "\t")
 .option("filesModifiedAfterDate", "2020-05-01T12:00:00")
 .format("csv")
 .load("/mnt/Deltas")
{code}
If this option is specified, the expected behavior would be that files within 
the _"/mnt/Deltas/"_ path must have been created at or later than the specified 
time in order to be consumed for purposes of reading files from a folder path 
or via structured streaming.

 I have unit tests passing under _CSVSuite_ and _FileIndexSuite_ in the 
_spark.sql.execution.datasources_ package.


> Provide option to load files after a specified date when reading from a 
> folder path
> ---
>
> Key: SPARK-31962
> URL: https://issues.apache.org/jira/browse/SPARK-31962
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Christopher Highman
>Priority: Minor
>
> When using stru

[jira] [Resolved] (SPARK-31977) Returns the projected plan directly from NestedColumnAliasing

2020-06-12 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-31977.
--
Fix Version/s: 3.1.0
 Assignee: Hyukjin Kwon
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/28812

> Returns the projected plan directly from NestedColumnAliasing
> -
>
> Key: SPARK-31977
> URL: https://issues.apache.org/jira/browse/SPARK-31977
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.1.0
>
>
> {{NestedColumnAliasing}} and {{GeneratorNestedColumnAliasing}} have 
> different usage patterns:
> {code}
> case p @ NestedColumnAliasing(nestedFieldToAlias, attrToAliases) =>
>   NestedColumnAliasing.replaceToAliases(p, nestedFieldToAlias, 
> attrToAliases)
> {code}
> vs
> {code}
> case GeneratorNestedColumnAliasing(p) => p
> {code}
> It would be better to make them match.






[jira] [Resolved] (SPARK-31950) Extract SQL keywords from the generated parser class in TableIdentifierParserSuite

2020-06-12 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-31950.
--
Fix Version/s: 3.1.0
 Assignee: Takeshi Yamamuro  (was: Apache Spark)
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/28802

> Extract SQL keywords from the generated parser class in 
> TableIdentifierParserSuite
> --
>
> Key: SPARK-31950
> URL: https://issues.apache.org/jira/browse/SPARK-31950
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 3.1.0
>
>







[jira] [Created] (SPARK-31980) Spark sequence() fails if start and end of range are identical dates

2020-06-12 Thread Dave DeCaprio (Jira)
Dave DeCaprio created SPARK-31980:
-

 Summary: Spark sequence() fails if start and end of range are 
identical dates
 Key: SPARK-31980
 URL: https://issues.apache.org/jira/browse/SPARK-31980
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4
 Environment: Spark 2.4.4 standalone and on AWS EMR
Reporter: Dave DeCaprio


 

The following Spark SQL query throws an exception
{code:java}
select sequence(cast("2011-03-01" as date), cast("2011-03-01" as date), 
interval 1 month)
{code}
The error is:

 

 
{noformat}
java.lang.ArrayIndexOutOfBoundsException: 1 at 
scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:92) at 
org.apache.spark.sql.catalyst.expressions.Sequence$TemporalSequenceImpl.eval(collectionOperations.scala:2681)
 at 
org.apache.spark.sql.catalyst.expressions.Sequence.eval(collectionOperations.scala:2514)
 at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:389){noformat}
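
For context, a sketch of what the call is expected to return (assuming a local SparkSession named {{spark}}); on 2.4.4 it instead fails as shown above.
{code:java}
// sequence() with identical date endpoints should yield a one-element array.
spark.sql(
  """SELECT sequence(CAST('2011-03-01' AS date),
    |                CAST('2011-03-01' AS date),
    |                interval 1 month) AS s
    |""".stripMargin).show()
// Expected: [2011-03-01]
{code}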
 






[jira] [Resolved] (SPARK-31979) release script should not fail when removing non-existing files

2020-06-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31979.
---
Fix Version/s: 3.1.0
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 28815
[https://github.com/apache/spark/pull/28815]

> release script should not fail when removing non-existing files
> -
>
> Key: SPARK-31979
> URL: https://issues.apache.org/jira/browse/SPARK-31979
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>







[jira] [Resolved] (SPARK-30119) Support pagination for spark streaming tab

2020-06-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30119.
--
Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/28748

> Support pagination for spark streaming tab
> ---
>
> Key: SPARK-30119
> URL: https://issues.apache.org/jira/browse/SPARK-30119
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Assignee: Rakesh Raushan
>Priority: Minor
> Fix For: 3.1.0
>
>
> Support pagination for spark streaming tab






[jira] [Updated] (SPARK-31947) Solve string value error about Date/Timestamp in ScriptTransform

2020-06-12 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-31947:
--
Description: 
For the test case below:

 
{code:java}
 test("SPARK-25990: TRANSFORM should handle different data types correctly") {
assume(TestUtils.testCommandAvailable("python"))
val scriptFilePath = getTestResourcePath("test_script.py")
withTempView("v") {
  val df = Seq(
(1, "1", 1.0, BigDecimal(1.0), new Timestamp(1), 
Date.valueOf("2015-05-21")),
(2, "2", 2.0, BigDecimal(2.0), new Timestamp(2), 
Date.valueOf("2015-05-22")),
(3, "3", 3.0, BigDecimal(3.0), new Timestamp(3), 
Date.valueOf("2015-05-23"))
  ).toDF("a", "b", "c", "d", "e", "f") // Note column d's data type is 
Decimal(38, 18)
  df.createTempView("v")  val query = sql(
s"""
   |SELECT
   |TRANSFORM(a, b, c, d, e, f)
   |USING 'python $scriptFilePath' AS (a, b, c, d, e, f)
   |FROM v
""".stripMargin)  val decimalToString: Column => Column = c => 
c.cast("string")  checkAnswer(query, identity, df.select(
'a.cast("string"),
'b.cast("string"),
'c.cast("string"),
decimalToString('d),
'e.cast("string"),
'f.cast("string")).collect())
}
  }

{code}
 

 

We get a wrong result:
{code:java}
[info] - SPARK-25990: TRANSFORM should handle different data types correctly 
*** FAILED *** (4 seconds, 997 milliseconds)
[info]   Results do not match for Spark plan:
[info]ScriptTransformation [a#19, b#20, c#21, d#22, e#23, f#24], python 
/Users/angerszhu/Documents/project/AngersZhu/spark/sql/core/target/scala-2.12/test-classes/test_script.py,
 [a#31, b#32, c#33, d#34, e#35, f#36], 
org.apache.spark.sql.execution.script.ScriptTransformIOSchema@1ad5a29c
[info]   +- Project [_1#6 AS a#19, _2#7 AS b#20, _3#8 AS c#21, _4#9 AS d#22, 
_5#10 AS e#23, _6#11 AS f#24]
[info]  +- LocalTableScan [_1#6, _2#7, _3#8, _4#9, _5#10, _6#11]
[info]
[info]
[info]== Results ==
[info]!== Expected Answer - 3 ==   
== Actual Answer - 3 ==
[info]   ![1,1,1.0,1.00,1970-01-01 08:00:00.001,2015-05-21]   
[1,1,1.0,1.00,1000,16576]
[info]   ![2,2,2.0,2.00,1970-01-01 08:00:00.002,2015-05-22]   
[2,2,2.0,2.00,2000,16577]
[info]   ![3,3,3.0,3.00,1970-01-01 08:00:00.003,2015-05-23]   
[3,3,3.0,3.00,3000,16578] (SparkPlanTest.scala:95)
[
{code}
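
The actual-answer columns above look like Spark's internal representations leaking through: DateType is stored as days since the epoch and TimestampType as microseconds since the epoch. A small sanity check of that reading (an assumption on my part, based only on the numbers above):
{code:java}
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// 16576 days after 1970-01-01 is 2015-05-21, matching the first "wrong" date value.
val days = ChronoUnit.DAYS.between(LocalDate.of(1970, 1, 1), LocalDate.of(2015, 5, 21))  // 16576

// new Timestamp(1) is 1 millisecond after the epoch, i.e. 1000 microseconds,
// matching the first "wrong" timestamp value.
val micros = new java.sql.Timestamp(1).getTime * 1000L  // 1000
{code}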

> Solve string value error about Date/Timestamp in ScriptTransform
> 
>
> Key: SPARK-31947
> URL: https://issues.apache.org/jira/browse/SPARK-31947
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>
> For test case
>  
> {code:java}
>  test("SPARK-25990: TRANSFORM should handle different data types correctly") {
> assume(TestUtils.testCommandAvailable("python"))
> val scriptFilePath = getTestResourcePath("test_script.py")
> withTempView("v") {
>   val df = Seq(
> (1, "1", 1.0, BigDecimal(1.0), new Timestamp(1), 
> Date.valueOf("2015-05-21")),
> (2, "2", 2.0, BigDecimal(2.0), new Timestamp(2), 
> Date.valueOf("2015-05-22")),
> (3, "3", 3.0, BigDecimal(3.0), new Timestamp(3), 
> Date.valueOf("2015-05-23"))
>   ).toDF("a", "b", "c", "d", "e", "f") // Note column d's data type is 
> Decimal(38, 18)
>   df.createTempView("v")  val query = sql(
> s"""
>|SELECT
>|TRANSFORM(a, b, c, d, e, f)
>|USING 'python $scriptFilePath' AS (a, b, c, d, e, f)
>|FROM v
> """.stripMargin)  val decimalToString: Column => Column = c => 
> c.cast("string")  checkAnswer(query, identity, df.select(
> 'a.cast("string"),
> 'b.cast("string"),
> 'c.cast("string"),
> decimalToString('d),
> 'e.cast("string"),
> 'f.cast("string")).collect())
> }
>   }
> {code}
>  
>  
> Get wrong result
> {code:java}
> [info] - SPARK-25990: TRANSFORM should handle different data types correctly 
> *** FAILED *** (4 seconds, 997 milliseconds)
> [info]   Results do not match for Spark plan:
> [info]ScriptTransformation [a#19, b#20, c#21, d#22, e#23, f#24], python 
> /Users/angerszhu/Documents/project/AngersZhu/spark/sql/core/target/scala-2.12/test-classes/test_script.py,
>  [a#31, b#32, c#33, d#34, e#35, f#36], 
> org.apache.spark.sql.execution.script.ScriptTransformIOSchema@1ad5a29c
> [info]   +- Project [_1#6 AS a#19, _2#7 AS b#20, _3#8 AS c#21, _4#9 AS d#22, 
> _5#10 AS e#23, _6#11 AS f#24]
> [info]  +- LocalTableScan [_1#6, _2#7, _3#8, _4#9, _5#10, _6#11]
> [info]
> [info]
> [info]== Results ==
> [info]!== Expect

[jira] [Updated] (SPARK-31969) StreamingJobProgressListener threw an exception java.util.NoSuchElementException for Long Running Streaming Job

2020-06-12 Thread ThimmeGowda (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ThimmeGowda updated SPARK-31969:

Description: 
We are running a long-running streaming job, and the exception below is seen 
continuously after some time. After the job starts, all of a sudden our Spark 
streaming application's batch durations start to increase. At around the same 
time, an error log starts to appear that does not refer to the application 
code at all. We couldn't find any other significant errors in the driver logs.

Our application runs fine for several hours (4 or so), then delay starts to 
build up, and finally the application restarts every 9-10 hours.

Any suggestion on what could be the issue?

Referred ticket: https://issues.apache.org/jira/browse/SPARK-21065 for a 
similar issue; in our case we are not setting anything for 
spark.streaming.concurrentJobs, and the default value is taken.

 \{"type":"log", "level":"ERROR", "name":"STREAMING_OTHERS", 
"time":"2020-06-09T04:31:43.918Z", "timezone":"UTC", 
"class":"spark-listener-group-appStatus", 
"method":"streaming.scheduler.StreamingListenerBus.logError(91)", 
"log":"Listener StreamingJobProgressListener threw an 
exception\u000Ajava.util.NoSuchElementException: key not found: 159167710 
ms\u000A\u0009at 
scala.collection.MapLike$class.default(MapLike.scala:228)\u000A\u0009at 
scala.collection.AbstractMap.default(Map.scala:59)\u000A\u0009at 
scala.collection.mutable.HashMap.apply(HashMap.scala:65)\u000A\u0009at 
org.apache.spark.streaming.ui.StreamingJobProgressListener.onOutputOperationCompleted(StreamingJobProgressListener.scala:134)\u000A\u0009at
 
org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:67)\u000A\u0009at
 
org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:29)\u000A\u0009at
 
org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)\u000A\u0009at
 
org.apache.spark.streaming.scheduler.StreamingListenerBus.postToAll(StreamingListenerBus.scala:29)\u000A\u0009at
 
org.apache.spark.streaming.scheduler.StreamingListenerBus.onOtherEvent(StreamingListenerBus.scala:43)\u000A\u0009at
 
org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:80)\u000A\u0009at
 
org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)\u000A\u0009at
 
org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)\u000A\u0009at
 
org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)\u000A\u0009at
 
org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92)\u000A\u0009at
 
org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92)\u000A\u0009at
 
org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)\u000A\u0009at
 
org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)\u000A\u0009at
 scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)\u000A\u0009at 
org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:87)\u000A\u0009at
 
org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:83)\u000A\u0009at
 
org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1302)\u000A\u0009at
 
org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:82)\u000A"}
 java.util.NoSuchElementException: key not found: 159167710 ms
 at scala.collection.MapLike$class.default(MapLike.scala:228) 
~[scala-library-2.11.12.jar:?]
 at scala.collection.AbstractMap.default(Map.scala:59) 
~[scala-library-2.11.12.jar:?]
 at scala.collection.mutable.HashMap.apply(HashMap.scala:65) 
~[scala-library-2.11.12.jar:?]
 at 
org.apache.spark.streaming.ui.StreamingJobProgressListener.onOutputOperationCompleted(StreamingJobProgressListener.scala:134)
 ~[spark-streaming_2.11-2.4.0.jar:2.4.0]
 at 
org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:67)
 ~[spark-streaming_2.11-2.4.0.jar:2.4.0]
 at 
org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:29)
 [spark-streaming_2.11-2.4.0.jar:2.4.0]
 at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91) 
[spark-core_2.11-2.4.0.jar:2.4.0]
 at 
org.apache.spark.streaming.scheduler.StreamingListenerBus.postToAll(StreamingListenerBus.scala:29)
 [spark-streaming_2.11-2.4.0.jar:2.4.0]
 at 
org.apache.spark.streaming.scheduler.StreamingListenerBus.onOtherEvent(StreamingListenerBus.scala:43)
 [spark-streaming_2.11-2.4.0.jar:2.4.0]

[jira] [Updated] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path

2020-06-12 Thread Christopher Highman (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christopher Highman updated SPARK-31962:

Description: 
When using structured streaming with a FileDataSource, I've encountered a 
number of occasions where I want to be able to stream from a folder containing 
any number of historical files in CSV format.  When I start reading from a 
folder, however, I might only care about files that were created after a 
certain time.
{code:java}
spark.readStream
 .option("header", "true")
 .option("delimiter", "\t")
 .format("csv")
 .load("/mnt/Deltas")
{code}
In 
[https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala],
 there is a method, _listLeafFiles,_ which builds _FileStatus_ objects 
containing an implicit _modificationDate_ property.  We may already iterate the 
resulting files if a filter is applied to the path.  In this case, it's trivial 
to do a primitive comparison between _modificationDate_ and a date specified 
from an option.  Without the filter specified, we would be expending less 
effort than if the filter were applied by itself, since we are comparing 
primitives.

Having the ability to provide an option for specifying a timestamp when 
loading files from a path would minimize complexity for consumers who load 
files or do structured streaming from a folder path but have no interest in 
reading what could be thousands of files that are not relevant.

One example could be "_filesModifiedAfterDate_", accepting a UTC datetime 
like below.
{code:java}
spark.readStream
 .option("header", "true")
 .option("delimiter", "\t")
 .option("filesModifiedAfterDate", "2020-05-01T12:00:00")
 .format("csv")
 .load("/mnt/Deltas")
{code}
If this option is specified, the expected behavior would be that files within 
the _"/mnt/Deltas/"_ path must have been created at or later than the specified 
time in order to be consumed for purposes of reading files from a folder path 
or via structured streaming.

 I have unit tests passing under _CSVSuite_ and _FileIndexSuite_ in the 
_spark.sql.execution.datasources_ package.

  was:
When using structured streaming with a FileDataSource, I've encountered a 
number of occasions where I want to be able to stream from a folder containing 
any number of historical files in CSV format.  When I start reading from a 
folder, however, I might only care about files that were created after a 
certain time.
{code:java}
spark.readStream
 .option("header", "true")
 .option("delimiter", "\t")
 .format("csv")
 .load("/mnt/Deltas")
{code}
In 
[https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala],
 there is a method, _listLeafFiles,_ which builds _FileStatus_ objects 
containing an implicit _modificationDate_ property.  We may already iterate the 
resulting files if a filter is applied to the path.  In this case, its trivial 
to do a primitive comparison against modificationDate and a date specified from 
an option.  Without the filter specified, we would be expending less effort 
than if the filter were applied by itself since we are comparing primitives.  

Having the ability to provide an option where specifying a timestamp when 
loading files from a path would minimize complexity for consumers who leverage 
the ability to load files or do structured streaming from a folder path but do 
not have an interest in reading what could be thousands of files that are not 
relevant.

One example to could be "_filesModifiedAfterDate_" accepting a UTC datetime 
like below.
{code:java}
spark.readStream
 .option("header", "true")
 .option("delimiter", "\t")
 .option("filesModifiedAfterDate", "2020-05-01T12:00:00")
 .format("csv")
 .load("/mnt/Deltas")
{code}
If this option is specified, the expected behavior would be that files within 
the _"/mnt/Deltas/"_ path must have been created at or later than the specified 
time in order to be consumed for purposes of reading files from a folder path 
or via structured streaming.

 I have unit tests passing under _CSVSuite_ and _FileIndexSuite_ in the 
_spark.sql.execution.datasources_ package.


> Provide option to load files after a specified date when reading from a 
> folder path
> ---
>
> Key: SPARK-31962
> URL: https://issues.apache.org/jira/browse/SPARK-31962
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Christopher Highman
>Priority: Minor
>
> When using structur

[jira] [Updated] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path

2020-06-12 Thread Christopher Highman (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christopher Highman updated SPARK-31962:

Description: 
When using structured streaming with a FileDataSource, I've encountered a 
number of occasions where I want to be able to stream from a folder containing 
any number of historical files in CSV format.  When I start reading from a 
folder, however, I might only care about files that were created after a 
certain time.
{code:java}
spark.readStream
 .option("header", "true")
 .option("delimiter", "\t")
 .format("csv")
 .load("/mnt/Deltas")
{code}
In 
[https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala],
 there is a method, _listLeafFiles,_ which builds _FileStatus_ objects 
containing an implicit _modificationDate_ property.  We may already iterate the 
resulting files if a filter is applied to the path.  In this case, it's trivial 
to do a primitive comparison between _modificationDate_ and a date specified from 
an option.  Even without such a filter specified, we would be expending less effort 
than if the filter were applied by itself, since we are only comparing primitives.
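
As a rough sketch of that comparison (the helper below is hypothetical and only 
illustrates the idea; it is not the actual InMemoryFileIndex code):
{code:java}
import java.sql.Timestamp
import org.apache.hadoop.fs.FileStatus

// Hypothetical helper: keep only files modified at or after the supplied
// timestamp. FileStatus.getModificationTime returns epoch milliseconds, so
// the per-file check is a single primitive comparison.
def filterByModificationTime(
    files: Seq[FileStatus],
    modifiedAfter: Option[Timestamp]): Seq[FileStatus] = {
  modifiedAfter match {
    case Some(ts) => files.filter(_.getModificationTime >= ts.getTime)
    case None     => files
  }
}
{code}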

Providing an option to specify a timestamp when loading files from a path would 
minimize complexity for consumers who load files or run structured streaming from 
a folder path but have no interest in reading what could be thousands of 
irrelevant files.

One example could be "_filesModifiedAfterDate_", accepting a UTC datetime 
like below.
{code:java}
spark.readStream
 .option("header", "true")
 .option("delimiter", "\t")
 .option("filesModifiedAfterDate", "2020-05-01T12:00:00")
 .format("csv")
 .load("/mnt/Deltas")
{code}
If this option is specified, the expected behavior would be that files within 
the _"/mnt/Deltas/"_ path must have been created at or later than the specified 
time in order to be consumed for purposes of reading files from a folder path 
or via structured streaming.

 I have unit tests passing under _CSVSuite_ and _FileIndexSuite_ in the 
_spark.sql.execution.datasources_ package.

  was:
When using structured streaming with a FileDataSource, I've encountered a 
number of occasions where I want to be able to stream from a folder containing 
any number of historical files in CSV format.  When I start reading from a 
folder, however, I might only care about files that were created after a 
certain time.
{code:java}
spark.readStream
 .option("header", "true")
 .option("delimiter", "\t")
 .format("csv")
 .load("/mnt/Deltas")
{code}
In 
[https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala],
 there is a method, _listLeafFiles,_ which builds FileStatus objects containing 
an implicit _modificationDate_ property.  We may already iterate the resulting 
files if a filter is applied to the path.  In this case, it's trivial to do a 
primitive comparison against modificationDate.  Even without the filter specified, 
we would be expending less effort than if the filter were applied by itself.

Having the ability to provide an option specifying a timestamp from which to 
begin globbing files would remove quite a bit of complexity for a consumer who 
leverages the ability to stream from a folder path but does not have an interest 
in reading what could be thousands of files that are not relevant.

One example could be "filesModifiedAfterDate", accepting a UTC datetime like 
below.
{code:java}
spark.readStream
 .option("header", "true")
 .option("delimiter", "\t")
 .option("filesModifiedAfterDate", "2020-05-01 00:00:00")
 .format("csv")
 .load("/mnt/Deltas")
{code}
If this option is specified, the expected behavior would be that files within 
the _"/mnt/Deltas/"_ path must have been created at or later than the specified 
time in order to be consumed for purposes of reading files in general or for 
purposes of structured streaming.

 

I have unit tests passing under _CSVSuite_ and _FileIndexSuite_ in the 
_spark.sql.execution.datasources_ package.


> Provide option to load files after a specified date when reading from a 
> folder path
> ---
>
> Key: SPARK-31962
> URL: https://issues.apache.org/jira/browse/SPARK-31962
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Christopher Highman
>Priority: Minor
>
> When using structured streaming with a FileDataSource, I've encountered a 
> number of occasio

[jira] [Updated] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path

2020-06-12 Thread Christopher Highman (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christopher Highman updated SPARK-31962:

Description: 
When using structured streaming with a FileDataSource, I've encountered a 
number of occasions where I want to be able to stream from a folder containing 
any number of historical files in CSV format.  When I start reading from a 
folder, however, I might only care about files that were created after a 
certain time.
{code:java}
spark.readStream
 .option("header", "true")
 .option("delimiter", "\t")
 .format("csv")
 .load("/mnt/Deltas")
{code}
In 
[https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala],
 there is a method, _listLeafFiles,_ which builds FileStatus objects containing 
an implicit _modificationDate_ property.  We may already iterate the resulting 
files if a filter is applied to the path.  In this case, it's trivial to do a 
primitive comparison against modificationDate.  Even without the filter specified, 
we would be expending less effort than if the filter were applied by itself.

Having the ability to provide an option specifying a timestamp from which to 
begin globbing files would remove quite a bit of complexity for a consumer who 
leverages the ability to stream from a folder path but does not have an interest 
in reading what could be thousands of files that are not relevant.

One example could be "filesModifiedAfterDate", accepting a UTC datetime like 
below.
{code:java}
spark.readStream
 .option("header", "true")
 .option("delimiter", "\t")
 .option("filesModifiedAfterDate", "2020-05-01 00:00:00")
 .format("csv")
 .load("/mnt/Deltas")
{code}
If this option is specified, the expected behavior would be that files within 
the _"/mnt/Deltas/"_ path must have been created at or later than the specified 
time in order to be consumed for purposes of reading files in general or for 
purposes of structured streaming.

 

I have unit tests passing under _CSVSuite_ and _FileIndexSuite_ in the 
_spark.sql.execution.datasources_ package.

  was:
When using structured streaming with a FileDataSource, I've encountered a 
number of occasions where I want to be able to stream from a folder containing 
any number of historical delta files in CSV format.  When I start reading from 
a folder, however, I might only care about files that were created after a certain 
time.
{code:java}
spark.readStream
 .option("header", "true")
 .option("delimiter", "\t")
 .format("csv")
 .load("/mnt/Deltas")
{code}
 

In 
[https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala],
 there is a method, _checkAndGlobPathIfNecessary,_ which appears to create an 
in-memory index of files for a given path.  There may be a rather clean 
opportunity to consider options here.

Having the ability to provide an option specifying a timestamp from which to 
begin globbing files would remove quite a bit of complexity for a consumer who 
leverages the ability to stream from a folder path but does not have an interest 
in reading what could be thousands of files that are not relevant.

One example could be "createdFileTime", accepting a UTC datetime like below.
{code:java}
spark.readStream
 .option("header", "true")
 .option("delimiter", "\t")
 .option("createdFileTime", "2020-05-01 00:00:00")
 .format("csv")
 .load("/mnt/Deltas")
{code}
 

If this option is specified, the expected behavior would be that files within 
the _"/mnt/Deltas/"_ path must have been created at or later than the specified 
time in order to be consumed for purposes of reading the files in general or 
for purposes of structured streaming.

 


> Provide option to load files after a specified date when reading from a 
> folder path
> ---
>
> Key: SPARK-31962
> URL: https://issues.apache.org/jira/browse/SPARK-31962
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Christopher Highman
>Priority: Minor
>
> When using structured streaming with a FileDataSource, I've encountered a 
> number of occasions where I want to be able to stream from a folder 
> containing any number of historical files in CSV format.  When I start 
> reading from a folder, however, I might only care about files that were 
> created after a certain time.
> {code:java}
> spark.readStream
>  .option("header", "true")
>  .option("delimiter", "\t")
>  .format("csv")
>  .load("/mnt/Deltas")
> {code}
> In 
> [https://gith

[jira] [Commented] (SPARK-31979) release script should not fail when remove non-existing files

2020-06-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134098#comment-17134098
 ] 

Apache Spark commented on SPARK-31979:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/28815

> release script should not fail when remove non-existing files
> -
>
> Key: SPARK-31979
> URL: https://issues.apache.org/jira/browse/SPARK-31979
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31979) release script should not fail when remove non-existing files

2020-06-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31979:


Assignee: Apache Spark  (was: Wenchen Fan)

> release script should not fail when remove non-existing files
> -
>
> Key: SPARK-31979
> URL: https://issues.apache.org/jira/browse/SPARK-31979
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31979) release script should not fail when remove non-existing files

2020-06-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31979:


Assignee: Wenchen Fan  (was: Apache Spark)

> release script should not fail when remove non-existing files
> -
>
> Key: SPARK-31979
> URL: https://issues.apache.org/jira/browse/SPARK-31979
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31979) release script should not fail when remove non-existing files

2020-06-12 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-31979:
---

 Summary: release script should not fail when remove non-existing 
files
 Key: SPARK-31979
 URL: https://issues.apache.org/jira/browse/SPARK-31979
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31978) Why does skew handling in Spark AE (adaptive execution) have requirements on the join type?

2020-06-12 Thread zhaoming chen (Jira)
zhaoming chen created SPARK-31978:
-

 Summary: Why does skew handling in Spark AE (adaptive execution) have requirements on the join type?
 Key: SPARK-31978
 URL: https://issues.apache.org/jira/browse/SPARK-31978
 Project: Spark
  Issue Type: Question
  Components: SQL
Affects Versions: 3.0.0
Reporter: zhaoming chen


Why can't the right-side partition be split when the join type is a left join?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31760) Simplification Based on Containment

2020-06-12 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-31760:
---

Assignee: (was: Yuming Wang)

> Simplification Based on Containment
> ---
>
> Key: SPARK-31760
> URL: https://issues.apache.org/jira/browse/SPARK-31760
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: starter
>
> https://docs.teradata.com/reader/Ws7YT1jvRK2vEr1LpVURug/V~FCwD9BL7gY4ac3WwHInw



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31968) write.partitionBy() creates duplicate subdirectories when user provide duplicate columns

2020-06-12 Thread Xuzhou Qin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuzhou Qin updated SPARK-31968:
---
Summary: write.partitionBy() creates duplicate subdirectories when user 
provide duplicate columns  (was: write.partitionBy() creates duplicated 
subdirectories when user provide duplicated columns)

> write.partitionBy() creates duplicate subdirectories when user provide 
> duplicate columns
> 
>
> Key: SPARK-31968
> URL: https://issues.apache.org/jira/browse/SPARK-31968
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Xuzhou Qin
>Priority: Major
>
> I recently remarked that if there are duplicated elements in the argument of 
> write.partitionBy(), then the same partition subdirectory will be created 
> multiple times.
> For example: 
> {code:java}
> import spark.implicits._
> val df: DataFrame = Seq(
>   (1, "p1", "c1", 1L),
>   (2, "p2", "c2", 2L),
>   (2, "p1", "c2", 2L),
>   (3, "p3", "c3", 3L),
>   (3, "p2", "c3", 3L),
>   (3, "p3", "c3", 3L)
> ).toDF("col1", "col2", "col3", "col4")
> df.write
>   .partitionBy("col1", "col1")  // we have "col1" twice
>   .mode(SaveMode.Overwrite)
>   .csv("output_dir"){code}
> The above code will produce an output directory with this structure:
>  
> {code:java}
> output_dir
>   |
>   |--col1=1
>   ||--col1=1
>   |
>   |--col1=2
>   ||--col1=2
>   |
>   |--col1=3
>|--col1=3{code}
> And we won't be able to read the output
>  
> {code:java}
> spark.read.csv("output_dir").show()
> // Exception in thread "main" org.apache.spark.sql.AnalysisException: Found 
> duplicate column(s) in the partition schema: `col1`;{code}
>  
> I am not sure if partitioning a dataframe twice by the same column make sense 
> in some real-world applications, but it will cause schema inference problems 
> in tools like AWS Glue crawler.
> Should Spark handle the deduplication of the partition columns? Or maybe 
> throw an exception when duplicated columns are detected?
> If this behaviour is unexpected, I will work on a fix. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31968) write.partitionBy() creates duplicated subdirectories when user provide duplicated columns

2020-06-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31968:


Assignee: (was: Apache Spark)

> write.partitionBy() creates duplicated subdirectories when user provide 
> duplicated columns
> --
>
> Key: SPARK-31968
> URL: https://issues.apache.org/jira/browse/SPARK-31968
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Xuzhou Qin
>Priority: Major
>
> I recently remarked that if there are duplicated elements in the argument of 
> write.partitionBy(), then the same partition subdirectory will be created 
> multiple times.
> For example: 
> {code:java}
> import spark.implicits._
> val df: DataFrame = Seq(
>   (1, "p1", "c1", 1L),
>   (2, "p2", "c2", 2L),
>   (2, "p1", "c2", 2L),
>   (3, "p3", "c3", 3L),
>   (3, "p2", "c3", 3L),
>   (3, "p3", "c3", 3L)
> ).toDF("col1", "col2", "col3", "col4")
> df.write
>   .partitionBy("col1", "col1")  // we have "col1" twice
>   .mode(SaveMode.Overwrite)
>   .csv("output_dir"){code}
> The above code will produce an output directory with this structure:
>  
> {code:java}
> output_dir
>   |
>   |--col1=1
>   ||--col1=1
>   |
>   |--col1=2
>   ||--col1=2
>   |
>   |--col1=3
>|--col1=3{code}
> And we won't be able to read the output
>  
> {code:java}
> spark.read.csv("output_dir").show()
> // Exception in thread "main" org.apache.spark.sql.AnalysisException: Found 
> duplicate column(s) in the partition schema: `col1`;{code}
>  
> I am not sure if partitioning a dataframe twice by the same column make sense 
> in some real-world applications, but it will cause schema inference problems 
> in tools like AWS Glue crawler.
> Should Spark handle the deduplication of the partition columns? Or maybe 
> throw an exception when duplicated columns are detected?
> If this behaviour is unexpected, I will work on a fix. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31968) write.partitionBy() creates duplicated subdirectories when user provide duplicated columns

2020-06-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31968:


Assignee: (was: Apache Spark)

> write.partitionBy() creates duplicated subdirectories when user provide 
> duplicated columns
> --
>
> Key: SPARK-31968
> URL: https://issues.apache.org/jira/browse/SPARK-31968
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Xuzhou Qin
>Priority: Major
>
> I recently remarked that if there are duplicated elements in the argument of 
> write.partitionBy(), then the same partition subdirectory will be created 
> multiple times.
> For example: 
> {code:java}
> import spark.implicits._
> val df: DataFrame = Seq(
>   (1, "p1", "c1", 1L),
>   (2, "p2", "c2", 2L),
>   (2, "p1", "c2", 2L),
>   (3, "p3", "c3", 3L),
>   (3, "p2", "c3", 3L),
>   (3, "p3", "c3", 3L)
> ).toDF("col1", "col2", "col3", "col4")
> df.write
>   .partitionBy("col1", "col1")  // we have "col1" twice
>   .mode(SaveMode.Overwrite)
>   .csv("output_dir"){code}
> The above code will produce an output directory with this structure:
>  
> {code:java}
> output_dir
>   |
>   |--col1=1
>   ||--col1=1
>   |
>   |--col1=2
>   ||--col1=2
>   |
>   |--col1=3
>|--col1=3{code}
> And we won't be able to read the output
>  
> {code:java}
> spark.read.csv("output_dir").show()
> // Exception in thread "main" org.apache.spark.sql.AnalysisException: Found 
> duplicate column(s) in the partition schema: `col1`;{code}
>  
> I am not sure if partitioning a dataframe twice by the same column make sense 
> in some real-world applications, but it will cause schema inference problems 
> in tools like AWS Glue crawler.
> Should Spark handle the deduplication of the partition columns? Or maybe 
> throw an exception when duplicated columns are detected?
> If this behaviour is unexpected, I will work on a fix. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31968) write.partitionBy() creates duplicated subdirectories when user provide duplicated columns

2020-06-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31968:


Assignee: Apache Spark

> write.partitionBy() creates duplicated subdirectories when user provide 
> duplicated columns
> --
>
> Key: SPARK-31968
> URL: https://issues.apache.org/jira/browse/SPARK-31968
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Xuzhou Qin
>Assignee: Apache Spark
>Priority: Major
>
> I recently remarked that if there are duplicated elements in the argument of 
> write.partitionBy(), then the same partition subdirectory will be created 
> multiple times.
> For example: 
> {code:java}
> import spark.implicits._
> val df: DataFrame = Seq(
>   (1, "p1", "c1", 1L),
>   (2, "p2", "c2", 2L),
>   (2, "p1", "c2", 2L),
>   (3, "p3", "c3", 3L),
>   (3, "p2", "c3", 3L),
>   (3, "p3", "c3", 3L)
> ).toDF("col1", "col2", "col3", "col4")
> df.write
>   .partitionBy("col1", "col1")  // we have "col1" twice
>   .mode(SaveMode.Overwrite)
>   .csv("output_dir"){code}
> The above code will produce an output directory with this structure:
>  
> {code:java}
> output_dir
>   |
>   |--col1=1
>   ||--col1=1
>   |
>   |--col1=2
>   ||--col1=2
>   |
>   |--col1=3
>|--col1=3{code}
> And we won't be able to read the output
>  
> {code:java}
> spark.read.csv("output_dir").show()
> // Exception in thread "main" org.apache.spark.sql.AnalysisException: Found 
> duplicate column(s) in the partition schema: `col1`;{code}
>  
> I am not sure if partitioning a dataframe twice by the same column make sense 
> in some real-world applications, but it will cause schema inference problems 
> in tools like AWS Glue crawler.
> Should Spark handle the deduplication of the partition columns? Or maybe 
> throw an exception when duplicated columns are detected?
> If this behaviour is unexpected, I will work on a fix. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31968) write.partitionBy() creates duplicated subdirectories when user provide duplicated columns

2020-06-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134079#comment-17134079
 ] 

Apache Spark commented on SPARK-31968:
--

User 'TJX2014' has created a pull request for this issue:
https://github.com/apache/spark/pull/28814

> write.partitionBy() creates duplicated subdirectories when user provide 
> duplicated columns
> --
>
> Key: SPARK-31968
> URL: https://issues.apache.org/jira/browse/SPARK-31968
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Xuzhou Qin
>Priority: Major
>
> I recently remarked that if there are duplicated elements in the argument of 
> write.partitionBy(), then the same partition subdirectory will be created 
> multiple times.
> For example: 
> {code:java}
> import spark.implicits._
> val df: DataFrame = Seq(
>   (1, "p1", "c1", 1L),
>   (2, "p2", "c2", 2L),
>   (2, "p1", "c2", 2L),
>   (3, "p3", "c3", 3L),
>   (3, "p2", "c3", 3L),
>   (3, "p3", "c3", 3L)
> ).toDF("col1", "col2", "col3", "col4")
> df.write
>   .partitionBy("col1", "col1")  // we have "col1" twice
>   .mode(SaveMode.Overwrite)
>   .csv("output_dir"){code}
> The above code will produce an output directory with this structure:
>  
> {code:java}
> output_dir
>   |
>   |--col1=1
>   ||--col1=1
>   |
>   |--col1=2
>   ||--col1=2
>   |
>   |--col1=3
>|--col1=3{code}
> And we won't be able to read the output
>  
> {code:java}
> spark.read.csv("output_dir").show()
> // Exception in thread "main" org.apache.spark.sql.AnalysisException: Found 
> duplicate column(s) in the partition schema: `col1`;{code}
>  
> I am not sure if partitioning a dataframe twice by the same column make sense 
> in some real-world applications, but it will cause schema inference problems 
> in tools like AWS Glue crawler.
> Should Spark handle the deduplication of the partition columns? Or maybe 
> throw an exception when duplicated columns are detected?
> If this behaviour is unexpected, I will work on a fix. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31968) write.partitionBy() creates duplicated subdirectories when user provide duplicated columns

2020-06-12 Thread Xuzhou Qin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuzhou Qin updated SPARK-31968:
---
Summary: write.partitionBy() creates duplicated subdirectories when user 
provide duplicated columns  (was: write.partitionBy() creates duplicated 
subdirectory when user give duplicated columns)

> write.partitionBy() creates duplicated subdirectories when user provide 
> duplicated columns
> --
>
> Key: SPARK-31968
> URL: https://issues.apache.org/jira/browse/SPARK-31968
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Xuzhou Qin
>Priority: Major
>
> I recently remarked that if there are duplicated elements in the argument of 
> write.partitionBy(), then the same partition subdirectory will be created 
> multiple times.
> For example: 
> {code:java}
> import spark.implicits._
> val df: DataFrame = Seq(
>   (1, "p1", "c1", 1L),
>   (2, "p2", "c2", 2L),
>   (2, "p1", "c2", 2L),
>   (3, "p3", "c3", 3L),
>   (3, "p2", "c3", 3L),
>   (3, "p3", "c3", 3L)
> ).toDF("col1", "col2", "col3", "col4")
> df.write
>   .partitionBy("col1", "col1")  // we have "col1" twice
>   .mode(SaveMode.Overwrite)
>   .csv("output_dir"){code}
> The above code will produce an output directory with this structure:
>  
> {code:java}
> output_dir
>   |
>   |--col1=1
>   ||--col1=1
>   |
>   |--col1=2
>   ||--col1=2
>   |
>   |--col1=3
>|--col1=3{code}
> And we won't be able to read the output
>  
> {code:java}
> spark.read.csv("output_dir").show()
> // Exception in thread "main" org.apache.spark.sql.AnalysisException: Found 
> duplicate column(s) in the partition schema: `col1`;{code}
>  
> I am not sure if partitioning a dataframe twice by the same column make sense 
> in some real-world applications, but it will cause schema inference problems 
> in tools like AWS Glue crawler.
> Should Spark handle the deduplication of the partition columns? Or maybe 
> throw an exception when duplicated columns are detected?
> If this behaviour is unexpected, I will work on a fix. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31968) write.partitionBy() creates duplicated subdirectory when user give duplicated columns

2020-06-12 Thread Xuzhou Qin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuzhou Qin updated SPARK-31968:
---
Description: 
I recently noticed that if there are duplicated elements in the argument of 
write.partitionBy(), then the same partition subdirectory will be created 
multiple times.

For example: 
{code:java}
import spark.implicits._

val df: DataFrame = Seq(
  (1, "p1", "c1", 1L),
  (2, "p2", "c2", 2L),
  (2, "p1", "c2", 2L),
  (3, "p3", "c3", 3L),
  (3, "p2", "c3", 3L),
  (3, "p3", "c3", 3L)
).toDF("col1", "col2", "col3", "col4")

df.write
  .partitionBy("col1", "col1")  // we have "col1" twice
  .mode(SaveMode.Overwrite)
  .csv("output_dir"){code}
The above code will produce an output directory with this structure:

 
{code:java}
output_dir
  |
  |--col1=1
  ||--col1=1
  |
  |--col1=2
  ||--col1=2
  |
  |--col1=3
   |--col1=3{code}
And we won't be able to read the output

 
{code:java}
spark.read.csv("output_dir").show()
// Exception in thread "main" org.apache.spark.sql.AnalysisException: Found 
duplicate column(s) in the partition schema: `col1`;{code}
 

I am not sure if partitioning a dataframe twice by the same column makes sense 
in some real-world applications, but it will cause schema inference problems in 
tools like AWS Glue crawler.

Should Spark handle the deduplication of the partition columns? Or maybe throw 
an exception when duplicated columns are detected?

If this behaviour is unexpected, I will work on a fix. 
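
As a side note, a minimal caller-side workaround is to deduplicate the column 
list before it reaches partitionBy. A sketch, reusing the df defined above:
{code:java}
// Deduplicate the partition columns before handing them to the writer, so only
// a single col1=... directory level is produced. This is only a caller-side
// workaround sketch; it does not settle whether Spark itself should deduplicate.
val partitionCols = Seq("col1", "col1")

df.write
  .partitionBy(partitionCols.distinct: _*)  // "col1" kept only once
  .mode(SaveMode.Overwrite)
  .csv("output_dir")
{code}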

  was:
I recently noticed that if there are duplicated elements in the argument of 
write.partitionBy(), then the same partition subdirectory will be created 
multiple times.

For example: 
{code:java}
import spark.implicits._

val df: DataFrame = Seq(
  (1, "p1", "c1", 1L),
  (2, "p2", "c2", 2L),
  (2, "p1", "c2", 2L),
  (3, "p3", "c3", 3L),
  (3, "p2", "c3", 3L),
  (3, "p3", "c3", 3L)
).toDF("col1", "col2", "col3", "col4")

df.write
  .partitionBy("col1", "col1")  // we have "col1" twice
  .mode(SaveMode.Overwrite)
  .csv("output_dir"){code}
The above code will produce an output directory with this structure:

 
{code:java}
output_dir
  |
  |--col1=1
  ||--col1=1
  |
  |--col1=2
  ||--col1=2
  |
  |--col1=3
   |--col1=3{code}
And we won't be able to read the output

 
{code:java}
spark.read.csv("output_dir").show()
// Exception in thread "main" org.apache.spark.sql.AnalysisException: Found 
duplicate column(s) in the partition schema: `col1`;{code}
 

I am not sure if partitioning a dataframe twice by the same column makes sense 
in some real world applications, but it will cause schema inference problems in 
tools like AWS Glue crawler.

Should Spark handle the deduplication of the partition columns? Or maybe throw 
an exception when duplicated columns are detected?

If this is an unexpected behaviour, I will work on a fix. 


> write.partitionBy() creates duplicated subdirectory when user give duplicated 
> columns
> -
>
> Key: SPARK-31968
> URL: https://issues.apache.org/jira/browse/SPARK-31968
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Xuzhou Qin
>Priority: Major
>
> I recently remarked that if there are duplicated elements in the argument of 
> write.partitionBy(), then the same partition subdirectory will be created 
> multiple times.
> For example: 
> {code:java}
> import spark.implicits._
> val df: DataFrame = Seq(
>   (1, "p1", "c1", 1L),
>   (2, "p2", "c2", 2L),
>   (2, "p1", "c2", 2L),
>   (3, "p3", "c3", 3L),
>   (3, "p2", "c3", 3L),
>   (3, "p3", "c3", 3L)
> ).toDF("col1", "col2", "col3", "col4")
> df.write
>   .partitionBy("col1", "col1")  // we have "col1" twice
>   .mode(SaveMode.Overwrite)
>   .csv("output_dir"){code}
> The above code will produce an output directory with this structure:
>  
> {code:java}
> output_dir
>   |
>   |--col1=1
>   ||--col1=1
>   |
>   |--col1=2
>   ||--col1=2
>   |
>   |--col1=3
>|--col1=3{code}
> And we won't be able to read the output
>  
> {code:java}
> spark.read.csv("output_dir").show()
> // Exception in thread "main" org.apache.spark.sql.AnalysisException: Found 
> duplicate column(s) in the partition schema: `col1`;{code}
>  
> I am not sure if partitioning a dataframe twice by the same column make sense 
> in some real-world applications, but it will cause schema inference problems 
> in tools like AWS Glue crawler.
> Should Spark handle the deduplication of the partition columns? Or maybe 
> throw an exception when duplicated columns are detected?
> If this behaviour is unexpected, I will work on a fix. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsub

[jira] [Commented] (SPARK-31977) Returns the projected plan directly from NestedColumnAliasing

2020-06-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134051#comment-17134051
 ] 

Apache Spark commented on SPARK-31977:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/28812

> Returns the projected plan directly from NestedColumnAliasing
> -
>
> Key: SPARK-31977
> URL: https://issues.apache.org/jira/browse/SPARK-31977
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> {{NestedColumnAliasing}} and {{GeneratorNestedColumnAliasing}} have a 
> different usage pattern:
> {code}
> case p @ NestedColumnAliasing(nestedFieldToAlias, attrToAliases) =>
>   NestedColumnAliasing.replaceToAliases(p, nestedFieldToAlias, 
> attrToAliases)
> {code}
> vs
> {code}
> case GeneratorNestedColumnAliasing(p) => p
> {code}
> It should be better to match it



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31977) Returns the projected plan directly from NestedColumnAliasing

2020-06-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31977:


Assignee: (was: Apache Spark)

> Returns the projected plan directly from NestedColumnAliasing
> -
>
> Key: SPARK-31977
> URL: https://issues.apache.org/jira/browse/SPARK-31977
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> {{NestedColumnAliasing}} and {{GeneratorNestedColumnAliasing}} have a 
> different usage pattern:
> {code}
> case p @ NestedColumnAliasing(nestedFieldToAlias, attrToAliases) =>
>   NestedColumnAliasing.replaceToAliases(p, nestedFieldToAlias, 
> attrToAliases)
> {code}
> vs
> {code}
> case GeneratorNestedColumnAliasing(p) => p
> {code}
> It should be better to match it



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31977) Returns the projected plan directly from NestedColumnAliasing

2020-06-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31977:


Assignee: Apache Spark

> Returns the projected plan directly from NestedColumnAliasing
> -
>
> Key: SPARK-31977
> URL: https://issues.apache.org/jira/browse/SPARK-31977
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> {{NestedColumnAliasing}} and {{GeneratorNestedColumnAliasing}} have a 
> different usage pattern:
> {code}
> case p @ NestedColumnAliasing(nestedFieldToAlias, attrToAliases) =>
>   NestedColumnAliasing.replaceToAliases(p, nestedFieldToAlias, 
> attrToAliases)
> {code}
> vs
> {code}
> case GeneratorNestedColumnAliasing(p) => p
> {code}
> It should be better to match it



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31968) write.partitionBy() creates duplicated subdirectory when user give duplicated columns

2020-06-12 Thread Xuzhou Qin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuzhou Qin updated SPARK-31968:
---
Description: 
I recently noticed that if there are duplicated elements in the argument of 
write.partitionBy(), then the same partition subdirectory will be created 
multiple times.

For example: 
{code:java}
import spark.implicits._

val df: DataFrame = Seq(
  (1, "p1", "c1", 1L),
  (2, "p2", "c2", 2L),
  (2, "p1", "c2", 2L),
  (3, "p3", "c3", 3L),
  (3, "p2", "c3", 3L),
  (3, "p3", "c3", 3L)
).toDF("col1", "col2", "col3", "col4")

df.write
  .partitionBy("col1", "col1")  // we have "col1" twice
  .mode(SaveMode.Overwrite)
  .csv("output_dir"){code}
The above code will produce an output directory with this structure:

 
{code:java}
output_dir
  |
  |--col1=1
  ||--col1=1
  |
  |--col1=2
  ||--col1=2
  |
  |--col1=3
   |--col1=3{code}
And we won't be able to read the output

 
{code:java}
spark.read.csv("output_dir").show()
// Exception in thread "main" org.apache.spark.sql.AnalysisException: Found 
duplicate column(s) in the partition schema: `col1`;{code}
 

I am not sure if partitioning a dataframe twice by the same column makes sense 
in some real world applications, but it will cause schema inference problems in 
tools like AWS Glue crawler.

Should Spark handle the deduplication of the partition columns? Or maybe throw 
an exception when duplicated columns are detected?

If this is an unexpected behaviour, I will work on a fix. 

  was:
I recently noticed that if there are duplicated elements in the argument of 
write.partitionBy(), then the same partition subdirectory will be created 
multiple times.

For example: 
{code:java}
import spark.implicits._

val df: DataFrame = Seq(
  (1, "p1", "c1", 1L),
  (2, "p2", "c2", 2L),
  (2, "p1", "c2", 2L),
  (3, "p3", "c3", 3L),
  (3, "p2", "c3", 3L),
  (3, "p3", "c3", 3L)
).toDF("col1", "col2", "col3", "col4")

df.write
  .partitionBy("col1", "col1")  // we have "col1" twice
  .mode(SaveMode.Overwrite)
  .csv("output_dir"){code}
The above code will produce an output directory with this structure:

 
{code:java}
output_dir
  |
  |--col1=1
  ||--col1=1
  |
  |--col1=2
  ||--col1=2
  |
  |--col1=3
   |--col1=3{code}
And we won't be able to read the output

 
{code:java}
spark.read.csv("output_dir").show()
// Exception in thread "main" org.apache.spark.sql.AnalysisException: Found 
duplicate column(s) in the partition schema: `col1`;{code}
 

I am not sure if partitioning a dataframe twice by the same column makes sense 
in some real world applications, but it will cause schema inference problems in 
tools like AWS Glue crawler.

Should Spark handle the deduplication of the partition columns? Or maybe throw 
an exception when duplicated columns are detected?

 

 


> write.partitionBy() creates duplicated subdirectory when user give duplicated 
> columns
> -
>
> Key: SPARK-31968
> URL: https://issues.apache.org/jira/browse/SPARK-31968
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Xuzhou Qin
>Priority: Major
>
> I recently remarked that if there are duplicated elements in the argument of 
> write.partitionBy(), then the same partition subdirectory will be created 
> multiple times.
> For example: 
> {code:java}
> import spark.implicits._
> val df: DataFrame = Seq(
>   (1, "p1", "c1", 1L),
>   (2, "p2", "c2", 2L),
>   (2, "p1", "c2", 2L),
>   (3, "p3", "c3", 3L),
>   (3, "p2", "c3", 3L),
>   (3, "p3", "c3", 3L)
> ).toDF("col1", "col2", "col3", "col4")
> df.write
>   .partitionBy("col1", "col1")  // we have "col1" twice
>   .mode(SaveMode.Overwrite)
>   .csv("output_dir"){code}
> The above code will produce an output directory with this structure:
>  
> {code:java}
> output_dir
>   |
>   |--col1=1
>   ||--col1=1
>   |
>   |--col1=2
>   ||--col1=2
>   |
>   |--col1=3
>|--col1=3{code}
> And we won't be able to read the output
>  
> {code:java}
> spark.read.csv("output_dir").show()
> // Exception in thread "main" org.apache.spark.sql.AnalysisException: Found 
> duplicate column(s) in the partition schema: `col1`;{code}
>  
> I am not sure if partitioning a dataframe twice by the same column make sense 
> in some real world applications, but it will cause schema inference problems 
> in tools like AWS Glue crawler.
> Should Spark handle the deduplication of the partition columns? Or maybe 
> throw an exception when duplicated columns are detected?
> If this is an unexpected behaviour, I will working on a fix. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional comman

[jira] [Commented] (SPARK-31967) Loading jobs UI page takes 40 seconds

2020-06-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134050#comment-17134050
 ] 

Apache Spark commented on SPARK-31967:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/28813

> Loading jobs UI page takes 40 seconds
> -
>
> Key: SPARK-31967
> URL: https://issues.apache.org/jira/browse/SPARK-31967
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: Gengliang Wang
>Priority: Blocker
> Attachments: load_time.jpeg, profile.png
>
>
> In the latest master branch, I find that the job list page becomes very slow.
> To reproduce in local setup:
> {code:java}
> spark.read.parquet("/tmp/p1").createOrReplaceTempView("t1")
> spark.read.parquet("/tmp/p2").createOrReplaceTempView("t2")
> (1 to 1000).map(_ =>  spark.sql("select * from t1, t2 where 
> t1.value=t2.value").show())
> {code}
> And that, open live UI: http://localhost:4040/
> The loading time is about 40 seconds.
> If we comment out the function call for `drawApplicationTimeline`, then the 
> loading time is around 1 second.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31967) Loading jobs UI page takes 40 seconds

2020-06-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134046#comment-17134046
 ] 

Apache Spark commented on SPARK-31967:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/28811

> Loading jobs UI page takes 40 seconds
> -
>
> Key: SPARK-31967
> URL: https://issues.apache.org/jira/browse/SPARK-31967
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: Gengliang Wang
>Priority: Blocker
> Attachments: load_time.jpeg, profile.png
>
>
> In the latest master branch, I find that the job list page becomes very slow.
> To reproduce in local setup:
> {code:java}
> spark.read.parquet("/tmp/p1").createOrReplaceTempView("t1")
> spark.read.parquet("/tmp/p2").createOrReplaceTempView("t2")
> (1 to 1000).map(_ =>  spark.sql("select * from t1, t2 where 
> t1.value=t2.value").show())
> {code}
> And that, open live UI: http://localhost:4040/
> The loading time is about 40 seconds.
> If we comment out the function call for `drawApplicationTimeline`, then the 
> loading time is around 1 second.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31967) Loading jobs UI page takes 40 seconds

2020-06-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134043#comment-17134043
 ] 

Apache Spark commented on SPARK-31967:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/28811

> Loading jobs UI page takes 40 seconds
> -
>
> Key: SPARK-31967
> URL: https://issues.apache.org/jira/browse/SPARK-31967
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: Gengliang Wang
>Priority: Blocker
> Attachments: load_time.jpeg, profile.png
>
>
> In the latest master branch, I find that the job list page becomes very slow.
> To reproduce in local setup:
> {code:java}
> spark.read.parquet("/tmp/p1").createOrReplaceTempView("t1")
> spark.read.parquet("/tmp/p2").createOrReplaceTempView("t2")
> (1 to 1000).map(_ =>  spark.sql("select * from t1, t2 where 
> t1.value=t2.value").show())
> {code}
> And that, open live UI: http://localhost:4040/
> The loading time is about 40 seconds.
> If we comment out the function call for `drawApplicationTimeline`, then the 
> loading time is around 1 second.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31705) Rewrite join condition to conjunctive normal form

2020-06-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134039#comment-17134039
 ] 

Apache Spark commented on SPARK-31705:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/28810

> Rewrite join condition to conjunctive normal form
> -
>
> Key: SPARK-31705
> URL: https://issues.apache.org/jira/browse/SPARK-31705
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.1.0
>
>
> Rewrite join condition to [conjunctive normal 
> form|https://en.wikipedia.org/wiki/Conjunctive_normal_form] to push more 
> conditions to filter.
> PostgreSQL:
> {code:sql}
> CREATE TABLE lineitem (l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT, 
>   
> l_linenumber INT, l_quantity DECIMAL(10,0), l_extendedprice DECIMAL(10,0),
> 
> l_discount DECIMAL(10,0), l_tax DECIMAL(10,0), l_returnflag varchar(255), 
>   
> l_linestatus varchar(255), l_shipdate DATE, l_commitdate DATE, l_receiptdate 
> DATE,
> l_shipinstruct varchar(255), l_shipmode varchar(255), l_comment varchar(255));
>   
> CREATE TABLE orders (
> o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus varchar(255),   
> o_totalprice DECIMAL(10,0), o_orderdate DATE, o_orderpriority varchar(255),
> o_clerk varchar(255), o_shippriority INT, o_comment varchar(255));  
> EXPLAIN
> SELECT Count(*)
> FROM   lineitem,
>orders
> WHERE  l_orderkey = o_orderkey
>AND ( ( l_suppkey > 3
>AND o_custkey > 13 )
>   OR ( l_suppkey > 1
>AND o_custkey > 11 ) )
>AND l_partkey > 19;
> EXPLAIN
> SELECT Count(*)
> FROM   lineitem
>JOIN orders
>  ON l_orderkey = o_orderkey
> AND ( ( l_suppkey > 3
> AND o_custkey > 13 )
>OR ( l_suppkey > 1
> AND o_custkey > 11 ) )
> AND l_partkey > 19;
> EXPLAIN
> SELECT Count(*) 
> FROM   lineitem, 
>orders 
> WHERE  l_orderkey = o_orderkey 
>AND NOT ( ( l_suppkey > 3 
>AND ( l_suppkey > 2 
>   OR o_custkey > 13 ) ) 
>   OR ( l_suppkey > 1 
>AND o_custkey > 11 ) ) 
>AND l_partkey > 19;
> {code}
> {noformat}
> postgres=# EXPLAIN
> postgres-# SELECT Count(*)
> postgres-# FROM   lineitem,
> postgres-#orders
> postgres-# WHERE  l_orderkey = o_orderkey
> postgres-#AND ( ( l_suppkey > 3
> postgres(#AND o_custkey > 13 )
> postgres(#   OR ( l_suppkey > 1
> postgres(#AND o_custkey > 11 ) )
> postgres-#AND l_partkey > 19;
>QUERY PLAN
> -
>  Aggregate  (cost=21.18..21.19 rows=1 width=8)
>->  Hash Join  (cost=10.60..21.17 rows=2 width=0)
>  Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
>  Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) 
> OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11)))
>  ->  Seq Scan on orders  (cost=0.00..10.45 rows=17 width=16)
>Filter: ((o_custkey > 13) OR (o_custkey > 11))
>  ->  Hash  (cost=10.53..10.53 rows=6 width=16)
>->  Seq Scan on lineitem  (cost=0.00..10.53 rows=6 width=16)
>  Filter: ((l_partkey > 19) AND ((l_suppkey > 3) OR 
> (l_suppkey > 1)))
> (9 rows)
> postgres=# EXPLAIN
> postgres-# SELECT Count(*)
> postgres-# FROM   lineitem
> postgres-#JOIN orders
> postgres-#  ON l_orderkey = o_orderkey
> postgres-# AND ( ( l_suppkey > 3
> postgres(# AND o_custkey > 13 )
> postgres(#OR ( l_suppkey > 1
> postgres(# AND o_custkey > 11 ) )
> postgres-# AND l_partkey > 19;
>QUERY PLAN
> -
>  Aggregate  (cost=21.18..21.19 rows=1 width=8)
>->  Hash Join  (cost=10.60..21.17 rows=2 width=0)
>  Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
>  Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) 
> OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11)))
>  ->  Seq Scan on orders  (cost=0.00..10.45 rows=17 width=16)
>
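
For readers unfamiliar with the rewrite, below is a tiny self-contained sketch of 
the CNF expansion on a toy expression type (this is not Catalyst's Expression API, 
and negation handling is omitted):
{code:java}
sealed trait Expr
case class Leaf(sql: String) extends Expr
case class And(left: Expr, right: Expr) extends Expr
case class Or(left: Expr, right: Expr) extends Expr

// Returns the conjuncts of the CNF form by distributing OR over AND.
def toCnf(e: Expr): Seq[Expr] = e match {
  case And(l, r) => toCnf(l) ++ toCnf(r)
  case Or(l, r)  => for (a <- toCnf(l); b <- toCnf(r)) yield Or(a, b)
  case leaf      => Seq(leaf)
}

// (l_suppkey > 3 AND o_custkey > 13) OR (l_suppkey > 1 AND o_custkey > 11)
// expands to four OR-conjuncts; the two that reference a single table,
// (l_suppkey > 3 OR l_suppkey > 1) and (o_custkey > 13 OR o_custkey > 11),
// can be pushed down as scan filters, matching the PostgreSQL plans quoted above.
val conjuncts = toCnf(Or(
  And(Leaf("l_suppkey > 3"), Leaf("o_custkey > 13")),
  And(Leaf("l_suppkey > 1"), Leaf("o_custkey > 11"))))
{code}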

[jira] [Created] (SPARK-31977) Returns the projected plan directly from NestedColumnAliasing

2020-06-12 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-31977:


 Summary: Returns the projected plan directly from 
NestedColumnAliasing
 Key: SPARK-31977
 URL: https://issues.apache.org/jira/browse/SPARK-31977
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


{{NestedColumnAliasing}} and {{GeneratorNestedColumnAliasing}} have a different 
usage pattern:

{code}
case p @ NestedColumnAliasing(nestedFieldToAlias, attrToAliases) =>
  NestedColumnAliasing.replaceToAliases(p, nestedFieldToAlias, 
attrToAliases)
{code}

vs

{code}
case GeneratorNestedColumnAliasing(p) => p
{code}

It would be better to make them match.
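
As a toy illustration of the suggested shape (an extractor whose unapply already 
returns the rewritten result, so the caller's match arm is simply 
{{case Rule(p) => p}}; this is not Catalyst code):
{code:java}
// A plain Scala extractor that produces the rewritten value itself, mirroring
// how GeneratorNestedColumnAliasing is used today.
object DoubleIfEven {
  def unapply(n: Int): Option[Int] = if (n % 2 == 0) Some(n * 2) else None
}

def rewrite(n: Int): Int = n match {
  case DoubleIfEven(result) => result  // no extra replace step in the caller
  case other                => other
}
{code}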



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31959) Test failure "RebaseDateTimeSuite.optimization of micros rebasing - Gregorian to Julian"

2020-06-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134032#comment-17134032
 ] 

Apache Spark commented on SPARK-31959:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28809

> Test failure "RebaseDateTimeSuite.optimization of micros rebasing - Gregorian 
> to Julian"
> 
>
> Key: SPARK-31959
> URL: https://issues.apache.org/jira/browse/SPARK-31959
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> See 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123688/testReport/org.apache.spark.sql.catalyst.util/RebaseDateTimeSuite/optimization_of_micros_rebasing___Gregorian_to_Julian/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31736) Nested column pruning for other operators

2020-06-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31736.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28556
[https://github.com/apache/spark/pull/28556]

> Nested column pruning for other operators
> -
>
> Key: SPARK-31736
> URL: https://issues.apache.org/jira/browse/SPARK-31736
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently we only push nested column pruning through a few operators such as 
> LIMIT, SAMPLE, etc. This is the ticket for supporting other operators for 
> nested column pruning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31976) use MemoryUsage to control the size of block

2020-06-12 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-31976:


 Summary: use MemoryUsage to control the size of block
 Key: SPARK-31976
 URL: https://issues.apache.org/jira/browse/SPARK-31976
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Affects Versions: 3.1.0
Reporter: zhengruifeng


According to the performance test in 
https://issues.apache.org/jira/browse/SPARK-31783, the performance gain is 
mainly related to the nnz of a block.

So it may be reasonable to control the size of a block by memory usage instead of 
by the number of rows.

 

note1: param blockSize has already been used in ALS and MLP to stack vectors 
(expected to be dense);

note2: we may refer to the {{Strategy.maxMemoryInMB}} in tree models;

 

There may be two ways to implement this:

1. compute the sparsity of the input vectors ahead of training (this can be done 
along with other statistics computation, maybe with no extra pass), and infer a 
reasonable number of vectors to stack;

2. stack the input vectors adaptively, by counting the memory usage in a block;
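
A very rough sketch of way 2, assuming a simple placeholder size estimate per 
vector (both the estimate and the grouping strategy below are assumptions, not 
the actual ML implementation):
{code:java}
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.ml.linalg.Vector

// Group vectors into blocks whose estimated footprint stays under the budget.
// The cost model (12 bytes per active entry plus fixed overhead) is only a
// placeholder; a real implementation would use a proper size estimator.
def stackByMemory(vectors: Iterator[Vector], maxMemoryInMB: Int): Iterator[Array[Vector]] = {
  val maxBytes = maxMemoryInMB.toLong * 1024 * 1024

  def estimateBytes(v: Vector): Long = 12L * v.numActives + 16L

  new Iterator[Array[Vector]] {
    override def hasNext: Boolean = vectors.hasNext

    override def next(): Array[Vector] = {
      val buf = ArrayBuffer.empty[Vector]
      var bytes = 0L
      // Always take at least one vector, then keep stacking until the
      // estimated block size reaches the budget.
      do {
        val v = vectors.next()
        buf += v
        bytes += estimateBytes(v)
      } while (vectors.hasNext && bytes < maxBytes)
      buf.toArray
    }
  }
}
{code}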



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31976) use MemoryUsage to control the size of block

2020-06-12 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-31976:
-
Description: 
According to the performance test in 
https://issues.apache.org/jira/browse/SPARK-31783, the performance gain is 
mainly related to the nnz of a block.

So it may be reasonable to control the size of a block by memory usage instead of 
by the number of rows.

 

note1: param blockSize has already been used in ALS and MLP to stack vectors 
(expected to be dense);

note2: we may refer to the {{Strategy.maxMemoryInMB}} in tree models;

 

There may be two ways to implement this:

1. compute the sparsity of the input vectors ahead of training (this can be done 
along with other statistics computation, maybe with no extra pass), and infer a 
reasonable number of vectors to stack;

2. stack the input vectors adaptively, by monitoring the memory usage in a 
block;

  was:
According to the performance test in 
https://issues.apache.org/jira/browse/SPARK-31783, the performance gain is 
mainly related to the nnz of a block.

So it may be reasonable to control the size of a block by memory usage instead of 
by the number of rows.

 

note1: param blockSize has already been used in ALS and MLP to stack vectors 
(expected to be dense);

note2: we may refer to the {{Strategy.maxMemoryInMB}} in tree models;

 

There may be two ways to implement this:

1. compute the sparsity of the input vectors ahead of training (this can be done 
along with other statistics computation, maybe with no extra pass), and infer a 
reasonable number of vectors to stack;

2. stack the input vectors adaptively, by counting the memory usage in a block;


> use MemoryUsage to control the size of block
> 
>
> Key: SPARK-31976
> URL: https://issues.apache.org/jira/browse/SPARK-31976
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Major
>
> According to the performance test in 
> https://issues.apache.org/jira/browse/SPARK-31783, the performance gain is 
> mainly related to the nnz of block.
> So it maybe reasonable to control the size of block by memory usage, instead 
> of number of rows.
>  
> note1: param blockSize had already used in ALS and MLP to stack vectors 
> (expected to be dense);
> note2: we may refer to the {{Strategy.maxMemoryInMB}} in tree models;
>  
> There may be two ways to impl:
> 1, compute the sparsity of input vectors ahead of train (this can be computed 
> with other statistics computation, maybe no extra pass), and infer a 
> reasonable number of vectors to stack;
> 2, stack the input vectors adaptively, by monitoring the memory usage in a 
> block;



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31975) Throw user facing error when use WindowFunction directly

2020-06-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134000#comment-17134000
 ] 

Apache Spark commented on SPARK-31975:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/28808

> Throw user facing error when use WindowFunction directly
> 
>
> Key: SPARK-31975
> URL: https://issues.apache.org/jira/browse/SPARK-31975
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31975) Throw user facing error when use WindowFunction directly

2020-06-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31975:


Assignee: (was: Apache Spark)

> Throw user facing error when use WindowFunction directly
> 
>
> Key: SPARK-31975
> URL: https://issues.apache.org/jira/browse/SPARK-31975
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31975) Throw user facing error when use WindowFunction directly

2020-06-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133999#comment-17133999
 ] 

Apache Spark commented on SPARK-31975:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/28808

> Throw user facing error when use WindowFunction directly
> 
>
> Key: SPARK-31975
> URL: https://issues.apache.org/jira/browse/SPARK-31975
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31975) Throw user facing error when use WindowFunction directly

2020-06-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31975:


Assignee: Apache Spark

> Throw user facing error when use WindowFunction directly
> 
>
> Key: SPARK-31975
> URL: https://issues.apache.org/jira/browse/SPARK-31975
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org