[jira] [Commented] (SPARK-31632) The ApplicationInfo in KVStore may be accessed before it's prepared
[ https://issues.apache.org/jira/browse/SPARK-31632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134655#comment-17134655 ] Apache Spark commented on SPARK-31632: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/28820 > The ApplicationInfo in KVStore may be accessed before it's prepared > --- > > Key: SPARK-31632 > URL: https://issues.apache.org/jira/browse/SPARK-31632 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 3.0.0 >Reporter: Xingcan Cui >Assignee: Xingcan Cui >Priority: Minor > Fix For: 2.4.6, 3.0.0 > > > While starting some local tests, I occasionally encountered the following > exceptions for Web UI. > {noformat} > 23:00:29.845 WARN org.eclipse.jetty.server.HttpChannel: /jobs/ > java.util.NoSuchElementException > at java.util.Collections$EmptyIterator.next(Collections.java:4191) > at > org.apache.spark.util.kvstore.InMemoryStore$InMemoryIterator.next(InMemoryStore.java:467) > at > org.apache.spark.status.AppStatusStore.applicationInfo(AppStatusStore.scala:39) > at org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:266) > at org.apache.spark.ui.WebUI.$anonfun$attachPage$1(WebUI.scala:89) > at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:80) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) > at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:873) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623) > at > org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610) > at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540) > at > org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255) > at > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345) > at > org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203) > at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480) > at > org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201) > at > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144) > at > org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:753) > at > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132) > at org.eclipse.jetty.server.Server.handle(Server.java:505) > at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370) > at > org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267) > at > org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305) > at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103) > at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117) > at > org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698) > at > org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804) > at java.lang.Thread.run(Thread.java:748){noformat} > *Reason* > That is because {{AppStatusStore.applicationInfo()}} accesses an empty 
view > (iterator) returned by {{InMemoryStore}}. > AppStatusStore > {code:java} > def applicationInfo(): v1.ApplicationInfo = { > store.view(classOf[ApplicationInfoWrapper]).max(1).iterator().next().info > } > {code} > InMemoryStore > {code:java} > public KVStoreView view(Class type){ > InstanceList list = inMemoryLists.get(type); > return list != null ? list.view() : emptyView(); > } > {code} > During the initialization of {{SparkContext}}, it first starts the Web UI > (SparkContext: L475 _ui.foreach(_.bind())) and then sets up the > {{LiveListenerBus}} thread (SparkContext: L608 > {{setupAndStartListenerBus()}}) for dispatching the > {{SparkListenerApplicationStart}} event (which will trigger writing the > requested {{ApplicationInfo}} to {{InMemoryStore}}). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
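For illustration only (a sketch, not the change made in the linked pull request): a defensive variant of {{applicationInfo()}} that tolerates the window before the {{SparkListenerApplicationStart}} event has been written to the store, instead of letting the empty iterator throw the raw {{NoSuchElementException}} from inside Jetty.

{code:java}
// Sketch only (not the actual fix from the linked PR): a defensive variant of
// AppStatusStore.applicationInfo() that handles the case where the
// SparkListenerApplicationStart event has not been processed yet, so the
// KVStore view over ApplicationInfoWrapper is still empty.
def applicationInfo(): v1.ApplicationInfo = {
  val it = store.view(classOf[ApplicationInfoWrapper]).max(1).iterator()
  if (it.hasNext) {
    it.next().info
  } else {
    // The listener bus has not written the ApplicationInfo yet; report a clearer
    // error instead of surfacing the empty-iterator failure through the Web UI.
    throw new NoSuchElementException(
      "ApplicationInfo is not available yet; the application is still starting up")
  }
}
{code}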
[jira] [Commented] (SPARK-31632) The ApplicationInfo in KVStore may be accessed before it's prepared
[ https://issues.apache.org/jira/browse/SPARK-31632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134654#comment-17134654 ] Apache Spark commented on SPARK-31632: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/28820 > The ApplicationInfo in KVStore may be accessed before it's prepared > --- > > Key: SPARK-31632 > URL: https://issues.apache.org/jira/browse/SPARK-31632 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 3.0.0 >Reporter: Xingcan Cui >Assignee: Xingcan Cui >Priority: Minor > Fix For: 2.4.6, 3.0.0 > > > While starting some local tests, I occasionally encountered the following > exceptions for Web UI. > {noformat} > 23:00:29.845 WARN org.eclipse.jetty.server.HttpChannel: /jobs/ > java.util.NoSuchElementException > at java.util.Collections$EmptyIterator.next(Collections.java:4191) > at > org.apache.spark.util.kvstore.InMemoryStore$InMemoryIterator.next(InMemoryStore.java:467) > at > org.apache.spark.status.AppStatusStore.applicationInfo(AppStatusStore.scala:39) > at org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:266) > at org.apache.spark.ui.WebUI.$anonfun$attachPage$1(WebUI.scala:89) > at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:80) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) > at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:873) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623) > at > org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610) > at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540) > at > org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255) > at > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345) > at > org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203) > at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480) > at > org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201) > at > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144) > at > org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:753) > at > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132) > at org.eclipse.jetty.server.Server.handle(Server.java:505) > at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370) > at > org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267) > at > org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305) > at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103) > at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117) > at > org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698) > at > org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804) > at java.lang.Thread.run(Thread.java:748){noformat} > *Reason* > That is because {{AppStatusStore.applicationInfo()}} accesses an empty 
view > (iterator) returned by {{InMemoryStore}}. > AppStatusStore > {code:java} > def applicationInfo(): v1.ApplicationInfo = { > store.view(classOf[ApplicationInfoWrapper]).max(1).iterator().next().info > } > {code} > InMemoryStore > {code:java} > public KVStoreView view(Class type){ > InstanceList list = inMemoryLists.get(type); > return list != null ? list.view() : emptyView(); > } > {code} > During the initialization of {{SparkContext}}, it first starts the Web UI > (SparkContext: L475 _ui.foreach(_.bind())) and then sets up the > {{LiveListenerBus}} thread (SparkContext: L608 > {{setupAndStartListenerBus()}}) for dispatching the > {{SparkListenerApplicationStart}} event (which will trigger writing the > requested {{ApplicationInfo}} to {{InMemoryStore}}). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31980) Spark sequence() fails if start and end of range are identical dates
[ https://issues.apache.org/jira/browse/SPARK-31980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134642#comment-17134642 ] Apache Spark commented on SPARK-31980: -- User 'TJX2014' has created a pull request for this issue: https://github.com/apache/spark/pull/28819 > Spark sequence() fails if start and end of range are identical dates > > > Key: SPARK-31980 > URL: https://issues.apache.org/jira/browse/SPARK-31980 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 > Environment: Spark 2.4.4 standalone and on AWS EMR >Reporter: Dave DeCaprio >Priority: Minor > > > The following Spark SQL query throws an exception > {code:java} > select sequence(cast("2011-03-01" as date), cast("2011-03-01" as date), > interval 1 month) > {code} > The error is: > > > {noformat} > java.lang.ArrayIndexOutOfBoundsException: > 1java.lang.ArrayIndexOutOfBoundsException: 1 at > scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:92) at > org.apache.spark.sql.catalyst.expressions.Sequence$TemporalSequenceImpl.eval(collectionOperations.scala:2681) > at > org.apache.spark.sql.catalyst.expressions.Sequence.eval(collectionOperations.scala:2514) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:389){noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
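For reference, a hedged reproduction sketch (assuming a spark-shell session with the usual {{spark}} SparkSession): on the affected 2.4.4 build the query fails with the ArrayIndexOutOfBoundsException shown above, while the expected result is a single-element array containing the start date, since the start and end of the range are identical.

{code:java}
// Hypothetical spark-shell session; `spark` is the usual SparkSession.
// On Spark 2.4.4 this throws java.lang.ArrayIndexOutOfBoundsException.
val df = spark.sql(
  """select sequence(cast("2011-03-01" as date),
    |                cast("2011-03-01" as date),
    |                interval 1 month) as seq""".stripMargin)

// Expected behaviour once fixed: a single-element array holding the start
// date, i.e. [2011-03-01], because the range begins and ends on the same day.
df.show(false)
{code}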
[jira] [Assigned] (SPARK-31980) Spark sequence() fails if start and end of range are identical dates
[ https://issues.apache.org/jira/browse/SPARK-31980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31980: Assignee: Apache Spark > Spark sequence() fails if start and end of range are identical dates > > > Key: SPARK-31980 > URL: https://issues.apache.org/jira/browse/SPARK-31980 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 > Environment: Spark 2.4.4 standalone and on AWS EMR >Reporter: Dave DeCaprio >Assignee: Apache Spark >Priority: Minor > > > The following Spark SQL query throws an exception > {code:java} > select sequence(cast("2011-03-01" as date), cast("2011-03-01" as date), > interval 1 month) > {code} > The error is: > > > {noformat} > java.lang.ArrayIndexOutOfBoundsException: > 1java.lang.ArrayIndexOutOfBoundsException: 1 at > scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:92) at > org.apache.spark.sql.catalyst.expressions.Sequence$TemporalSequenceImpl.eval(collectionOperations.scala:2681) > at > org.apache.spark.sql.catalyst.expressions.Sequence.eval(collectionOperations.scala:2514) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:389){noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31980) Spark sequence() fails if start and end of range are identical dates
[ https://issues.apache.org/jira/browse/SPARK-31980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134641#comment-17134641 ] Apache Spark commented on SPARK-31980: -- User 'TJX2014' has created a pull request for this issue: https://github.com/apache/spark/pull/28819 > Spark sequence() fails if start and end of range are identical dates > > > Key: SPARK-31980 > URL: https://issues.apache.org/jira/browse/SPARK-31980 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 > Environment: Spark 2.4.4 standalone and on AWS EMR >Reporter: Dave DeCaprio >Priority: Minor > > > The following Spark SQL query throws an exception > {code:java} > select sequence(cast("2011-03-01" as date), cast("2011-03-01" as date), > interval 1 month) > {code} > The error is: > > > {noformat} > java.lang.ArrayIndexOutOfBoundsException: > 1java.lang.ArrayIndexOutOfBoundsException: 1 at > scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:92) at > org.apache.spark.sql.catalyst.expressions.Sequence$TemporalSequenceImpl.eval(collectionOperations.scala:2681) > at > org.apache.spark.sql.catalyst.expressions.Sequence.eval(collectionOperations.scala:2514) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:389){noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31980) Spark sequence() fails if start and end of range are identical dates
[ https://issues.apache.org/jira/browse/SPARK-31980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31980: Assignee: (was: Apache Spark) > Spark sequence() fails if start and end of range are identical dates > > > Key: SPARK-31980 > URL: https://issues.apache.org/jira/browse/SPARK-31980 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 > Environment: Spark 2.4.4 standalone and on AWS EMR >Reporter: Dave DeCaprio >Priority: Minor > > > The following Spark SQL query throws an exception > {code:java} > select sequence(cast("2011-03-01" as date), cast("2011-03-01" as date), > interval 1 month) > {code} > The error is: > > > {noformat} > java.lang.ArrayIndexOutOfBoundsException: > 1java.lang.ArrayIndexOutOfBoundsException: 1 at > scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:92) at > org.apache.spark.sql.catalyst.expressions.Sequence$TemporalSequenceImpl.eval(collectionOperations.scala:2681) > at > org.apache.spark.sql.catalyst.expressions.Sequence.eval(collectionOperations.scala:2514) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:389){noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31198) Use graceful decommissioning as part of dynamic scaling
[ https://issues.apache.org/jira/browse/SPARK-31198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134628#comment-17134628 ] Apache Spark commented on SPARK-31198: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/28818 > Use graceful decommissioning as part of dynamic scaling > --- > > Key: SPARK-31198 > URL: https://issues.apache.org/jira/browse/SPARK-31198 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31198) Use graceful decommissioning as part of dynamic scaling
[ https://issues.apache.org/jira/browse/SPARK-31198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31198: Assignee: Apache Spark (was: Holden Karau) > Use graceful decommissioning as part of dynamic scaling > --- > > Key: SPARK-31198 > URL: https://issues.apache.org/jira/browse/SPARK-31198 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Holden Karau >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31198) Use graceful decommissioning as part of dynamic scaling
[ https://issues.apache.org/jira/browse/SPARK-31198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134627#comment-17134627 ] Apache Spark commented on SPARK-31198: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/28818 > Use graceful decommissioning as part of dynamic scaling > --- > > Key: SPARK-31198 > URL: https://issues.apache.org/jira/browse/SPARK-31198 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31198) Use graceful decommissioning as part of dynamic scaling
[ https://issues.apache.org/jira/browse/SPARK-31198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31198: Assignee: Holden Karau (was: Apache Spark) > Use graceful decommissioning as part of dynamic scaling > --- > > Key: SPARK-31198 > URL: https://issues.apache.org/jira/browse/SPARK-31198 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31197) Exit the executor once all tasks & migrations are finished
[ https://issues.apache.org/jira/browse/SPARK-31197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31197: Assignee: (was: Apache Spark) > Exit the executor once all tasks & migrations are finished > -- > > Key: SPARK-31197 > URL: https://issues.apache.org/jira/browse/SPARK-31197 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Holden Karau >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31197) Exit the executor once all tasks & migrations are finished
[ https://issues.apache.org/jira/browse/SPARK-31197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31197: Assignee: Apache Spark > Exit the executor once all tasks & migrations are finished > -- > > Key: SPARK-31197 > URL: https://issues.apache.org/jira/browse/SPARK-31197 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Holden Karau >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31197) Exit the executor once all tasks & migrations are finished
[ https://issues.apache.org/jira/browse/SPARK-31197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134626#comment-17134626 ] Apache Spark commented on SPARK-31197: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/28817 > Exit the executor once all tasks & migrations are finished > -- > > Key: SPARK-31197 > URL: https://issues.apache.org/jira/browse/SPARK-31197 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Holden Karau >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24528) Missing optimization for Aggregations/Windowing on a bucketed table
[ https://issues.apache.org/jira/browse/SPARK-24528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134610#comment-17134610 ] L. C. Hsieh commented on SPARK-24528: - The bucket info, including bucket ids and the files in each bucket, are details of a particular physical plan (FileSourceScanExec). A merge-sort operator needs to know these details in order to do the merge-sort. I feel we should extract the iterating logic in FileScanRDD and do the merge-sort in there. [~cloud_fan] WDYT? > Missing optimization for Aggregations/Windowing on a bucketed table > --- > > Key: SPARK-24528 > URL: https://issues.apache.org/jira/browse/SPARK-24528 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Ohad Raviv >Priority: Major > > Closely related to > SPARK-24410, we're trying to optimize a very common use case we have of > getting the most updated row by id from a fact table. > We're saving the table bucketed to skip the shuffle stage, but we still > "waste" time on the Sort operator even though the data is already sorted. > here's a good example: > {code:java} > sparkSession.range(N).selectExpr( > "id as key", > "id % 2 as t1", > "id % 3 as t2") > .repartition(col("key")) > .write > .mode(SaveMode.Overwrite) > .bucketBy(3, "key") > .sortBy("key", "t1") > .saveAsTable("a1"){code} > {code:java} > sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain > == Physical Plan == > SortAggregate(key=[key#24L], functions=[max(named_struct(t1, t1#25L, key, > key#24L, t1, t1#25L, t2, t2#26L))]) > +- SortAggregate(key=[key#24L], functions=[partial_max(named_struct(t1, > t1#25L, key, key#24L, t1, t1#25L, t2, t2#26L))]) > +- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: true, > Format: Parquet, Location: ...{code} > > and here's a bad example, but more realistic: > {code:java} > sparkSession.sql("set spark.sql.shuffle.partitions=2") > sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain > == Physical Plan == > SortAggregate(key=[key#32L], functions=[max(named_struct(t1, t1#33L, key, > key#32L, t1, t1#33L, t2, t2#34L))]) > +- SortAggregate(key=[key#32L], functions=[partial_max(named_struct(t1, > t1#33L, key, key#32L, t1, t1#33L, t2, t2#34L))]) > +- *(1) Sort [key#32L ASC NULLS FIRST], false, 0 > +- *(1) FileScan parquet default.a1[key#32L,t1#33L,t2#34L] Batched: true, > Format: Parquet, Location: ... > {code} > > I've traced the problem to DataSourceScanExec#235: > {code:java} > val sortOrder = if (sortColumns.nonEmpty) { > // In case of bucketing, its possible to have multiple files belonging to > the > // same bucket in a given relation. Each of these files are locally sorted > // but those files combined together are not globally sorted. Given that, > // the RDD partition will not be sorted even if the relation has sort > columns set > // Current solution is to check if all the buckets have a single file in it > val files = selectedPartitions.flatMap(partition => partition.files) > val bucketToFilesGrouping = > files.map(_.getPath.getName).groupBy(file => > BucketingUtils.getBucketId(file)) > val singleFilePartitions = bucketToFilesGrouping.forall(p => p._2.length <= > 1){code} > so obviously the code avoids dealing with this situation now. > Could you think of a way to solve this or bypass it? 
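To make the merge-sort suggestion concrete, here is a minimal, self-contained sketch in plain Scala (not Spark internals; {{Row}} and the key extractor are stand-ins for the real row type and ordering) of merging several locally sorted per-file iterators into one globally sorted iterator, which is what a bucket containing multiple files would need in order to avoid the extra Sort node.

{code:java}
import scala.collection.mutable

// Illustrative sketch: merge k iterators that are each already sorted by `key`
// into one globally sorted iterator, the way rows coming from multiple
// locally-sorted files of the same bucket would have to be merged to preserve
// the sortBy("key", ...) order.
def mergeSorted[Row](iters: Seq[Iterator[Row]])(key: Row => Long): Iterator[Row] = {
  // PriorityQueue is a max-heap, so reverse the ordering to dequeue the smallest key first.
  implicit val ord: Ordering[(Long, Int)] = Ordering.by[(Long, Int), Long](_._1).reverse
  val buffered = iters.map(_.buffered).toIndexedSeq
  val heap = mutable.PriorityQueue.empty[(Long, Int)]
  buffered.zipWithIndex.foreach { case (it, i) =>
    if (it.hasNext) heap.enqueue((key(it.head), i))
  }
  new Iterator[Row] {
    def hasNext: Boolean = heap.nonEmpty
    def next(): Row = {
      val (_, i) = heap.dequeue()
      val row = buffered(i).next()
      if (buffered(i).hasNext) heap.enqueue((key(buffered(i).head), i))
      row
    }
  }
}

// mergeSorted(Seq(Iterator(1L, 3L, 5L), Iterator(2L, 4L)))(identity).toList
// == List(1L, 2L, 3L, 4L, 5L)
{code}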
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31967) Loading jobs UI page takes 40 seconds
[ https://issues.apache.org/jira/browse/SPARK-31967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-31967: -- Assignee: Gengliang Wang > Loading jobs UI page takes 40 seconds > - > > Key: SPARK-31967 > URL: https://issues.apache.org/jira/browse/SPARK-31967 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.1 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Blocker > Attachments: load_time.jpeg, profile.png > > > In the latest master branch, I find that the job list page becomes very slow. > To reproduce in local setup: > {code:java} > spark.read.parquet("/tmp/p1").createOrReplaceTempView("t1") > spark.read.parquet("/tmp/p2").createOrReplaceTempView("t2") > (1 to 1000).map(_ => spark.sql("select * from t1, t2 where > t1.value=t2.value").show()) > {code} > And that, open live UI: http://localhost:4040/ > The loading time is about 40 seconds. > If we comment out the function call for `drawApplicationTimeline`, then the > loading time is around 1 second. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31967) Loading jobs UI page takes 40 seconds
[ https://issues.apache.org/jira/browse/SPARK-31967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134608#comment-17134608 ] Gengliang Wang commented on SPARK-31967: The issue is resolved in https://github.com/apache/spark/pull/28806 > Loading jobs UI page takes 40 seconds > - > > Key: SPARK-31967 > URL: https://issues.apache.org/jira/browse/SPARK-31967 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.1 >Reporter: Gengliang Wang >Priority: Blocker > Attachments: load_time.jpeg, profile.png > > > In the latest master branch, I find that the job list page becomes very slow. > To reproduce in local setup: > {code:java} > spark.read.parquet("/tmp/p1").createOrReplaceTempView("t1") > spark.read.parquet("/tmp/p2").createOrReplaceTempView("t2") > (1 to 1000).map(_ => spark.sql("select * from t1, t2 where > t1.value=t2.value").show()) > {code} > And that, open live UI: http://localhost:4040/ > The loading time is about 40 seconds. > If we comment out the function call for `drawApplicationTimeline`, then the > loading time is around 1 second. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31967) Loading jobs UI page takes 40 seconds
[ https://issues.apache.org/jira/browse/SPARK-31967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-31967. Resolution: Fixed > Loading jobs UI page takes 40 seconds > - > > Key: SPARK-31967 > URL: https://issues.apache.org/jira/browse/SPARK-31967 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.1 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Blocker > Attachments: load_time.jpeg, profile.png > > > In the latest master branch, I find that the job list page becomes very slow. > To reproduce in local setup: > {code:java} > spark.read.parquet("/tmp/p1").createOrReplaceTempView("t1") > spark.read.parquet("/tmp/p2").createOrReplaceTempView("t2") > (1 to 1000).map(_ => spark.sql("select * from t1, t2 where > t1.value=t2.value").show()) > {code} > And that, open live UI: http://localhost:4040/ > The loading time is about 40 seconds. > If we comment out the function call for `drawApplicationTimeline`, then the > loading time is around 1 second. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31976) use MemoryUsage to control the size of block
[ https://issues.apache.org/jira/browse/SPARK-31976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-31976: --- Priority: Blocker (was: Major) > use MemoryUsage to control the size of block > > > Key: SPARK-31976 > URL: https://issues.apache.org/jira/browse/SPARK-31976 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Blocker > > According to the performance test in > https://issues.apache.org/jira/browse/SPARK-31783, the performance gain is > mainly related to the nnz of the block. > So it may be reasonable to control the size of a block by memory usage, instead > of by number of rows. > > note1: param blockSize is already used in ALS and MLP to stack vectors > (expected to be dense); > note2: we may refer to the {{Strategy.maxMemoryInMB}} in tree models; > > There may be two ways to implement this: > 1, compute the sparsity of the input vectors ahead of training (this can be computed > with other statistics computation, maybe no extra pass), and infer a > reasonable number of vectors to stack; > 2, stack the input vectors adaptively, by monitoring the memory usage in a > block; -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
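A rough sketch of option 1 above (the helper name, the 12-bytes-per-nonzero cost model, and the dedicated pre-pass are all assumptions, not the eventual API): estimate the average nnz per vector and derive how many rows fit in a block under a memory budget in MB, similar in spirit to {{Strategy.maxMemoryInMB}}.

{code:java}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD

// Sketch of option 1 only; uses a dedicated pass for clarity, although the nnz
// statistics could be folded into an existing summary pass as the description suggests.
def inferRowsPerBlock(vectors: RDD[Vector], maxMemoryInMB: Int): Int = {
  val (totalNnz, count) = vectors
    .map(v => (v.numNonzeros.toLong, 1L))
    .reduce { case ((nnz1, c1), (nnz2, c2)) => (nnz1 + nnz2, c1 + c2) }
  val avgNnz = math.max(1.0, totalNnz.toDouble / count)
  // Assumed cost model: ~12 bytes per stored non-zero (8-byte value + 4-byte index).
  val bytesPerRow = avgNnz * 12.0
  val budgetBytes = maxMemoryInMB.toLong * 1024L * 1024L
  math.max(1, (budgetBytes / bytesPerRow).toInt)
}
{code}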
[jira] [Updated] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path
[ https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christopher Highman updated SPARK-31962: Description: When using structured streaming with a FileDataSource, I've encountered a number of occasions where I want to be able to stream from a folder containing any number of historical files in CSV format. When I start reading from a folder, however, I might only care about files that were created after a certain time. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .format("csv") .load("/mnt/Deltas") {code} In [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala], there is a method, _listLeafFiles,_ which builds _FileStatus_ objects containing an implicit _modificationDate_ property. We may already iterate the resulting files if a filter is applied to the path. In this case, it's trivial to do a primitive comparison against _modificationDate_ and a date specified from an option. Without the filter specified, we would be expending less effort than if the filter were applied by itself since we are comparing primitives. Having the ability to provide an option for specifying a timestamp when loading files from a path would minimize complexity for consumers who leverage the ability to load files or do structured streaming from a folder path but do not have an interest in reading what could be thousands of files that are not relevant. One example could be "_filesModifiedAfterDate_" accepting a UTC datetime like below. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .option("filesModifiedAfterDate", "2020-05-01T12:00:00") .format("csv") .load("/mnt/Deltas") {code} If this option is specified, the expected behavior would be that files within the _"/mnt/Deltas/"_ path must have been modified at or later than the specified time in order to be consumed for purposes of reading files from a folder path or via structured streaming. I have unit tests passing under _CSVSuite_ and _FileIndexSuite_ in the _spark.sql.execution.datasources_ package. was: When using structured streaming with a FileDataSource, I've encountered a number of occasions where I want to be able to stream from a folder containing any number of historical files in CSV format. When I start reading from a folder, however, I might only care about files that were created after a certain time. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .format("csv") .load("/mnt/Deltas") {code} In [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala], there is a method, _listLeafFiles,_ which builds _FileStatus_ objects containing an implicit _modificationDate_ property. We may already iterate the resulting files if a filter is applied to the path. In this case, its trivial to do a primitive comparison against _modificationDate_ and a date specified from an option. Without the filter specified, we would be expending less effort than if the filter were applied by itself since we are comparing primitives. 
Having the ability to provide an option where specifying a timestamp when loading files from a path would minimize complexity for consumers who leverage the ability to load files or do structured streaming from a folder path but do not have an interest in reading what could be thousands of files that are not relevant. One example to could be "_filesModifiedAfterDate_" accepting a UTC datetime like below. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .option("filesModifiedAfterDate", "2020-05-01T12:00:00") .format("csv") .load("/mnt/Deltas") {code} If this option is specified, the expected behavior would be that files within the _"/mnt/Deltas/"_ path must have been created at or later than the specified time in order to be consumed for purposes of reading files from a folder path or via structured streaming. I have unit tests passing under _CSVSuite_ and _FileIndexSuite_ in the _spark.sql.execution.datasources_ package. > Provide option to load files after a specified date when reading from a > folder path > --- > > Key: SPARK-31962 > URL: https://issues.apache.org/jira/browse/SPARK-31962 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Christopher Highman >Priority: Minor > > When using stru
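To make the SPARK-31962 proposal above concrete, here is a minimal, self-contained sketch in plain Scala against the Hadoop FileStatus API. The option name is taken from the proposal; the helper name and the place it would hook into the file listing are hypothetical, so this is not the InMemoryFileIndex change itself.

{code:java}
import java.time.{LocalDateTime, ZoneOffset}
import org.apache.hadoop.fs.FileStatus

// Sketch only: keep files whose modification time is at or after the timestamp
// given via the proposed "filesModifiedAfterDate" option. The option value is
// assumed to be a UTC datetime such as "2020-05-01T12:00:00".
def filterByModificationDate(
    files: Seq[FileStatus],
    filesModifiedAfterDate: Option[String]): Seq[FileStatus] = {
  filesModifiedAfterDate match {
    case Some(value) =>
      val thresholdMillis =
        LocalDateTime.parse(value).toInstant(ZoneOffset.UTC).toEpochMilli
      // FileStatus.getModificationTime is milliseconds since the Unix epoch.
      files.filter(_.getModificationTime >= thresholdMillis)
    case None =>
      files
  }
}
{code}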
[jira] [Resolved] (SPARK-31977) Returns the projected plan directly from NestedColumnAliasing
[ https://issues.apache.org/jira/browse/SPARK-31977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-31977. -- Fix Version/s: 3.1.0 Assignee: Hyukjin Kwon Resolution: Fixed Resolved by https://github.com/apache/spark/pull/28812 > Returns the projected plan directly from NestedColumnAliasing > - > > Key: SPARK-31977 > URL: https://issues.apache.org/jira/browse/SPARK-31977 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.1.0 > > > {{NestedColumnAliasing}} and {{GeneratorNestedColumnAliasing}} have > different usage patterns: > {code} > case p @ NestedColumnAliasing(nestedFieldToAlias, attrToAliases) => > NestedColumnAliasing.replaceToAliases(p, nestedFieldToAlias, > attrToAliases) > {code} > vs > {code} > case GeneratorNestedColumnAliasing(p) => p > {code} > It would be better to make them match -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
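A toy, self-contained illustration (not Spark code; the Expr/Lit/Add classes below are made up for the example) of the extractor shape the ticket moves toward, where unapply already returns the rewritten value so both rule branches can be matched the same way.

{code:java}
// When the extractor's unapply returns the rewritten value, the rule body
// collapses to `case Simplify(rewritten) => rewritten`, the same shape as the
// existing `case GeneratorNestedColumnAliasing(p) => p` branch.
sealed trait Expr
case class Lit(v: Int) extends Expr
case class Add(l: Expr, r: Expr) extends Expr

object Simplify {
  // Return Some(simplified) only when something was actually folded.
  def unapply(e: Expr): Option[Expr] = e match {
    case Add(Lit(a), Lit(b)) => Some(Lit(a + b))
    case _ => None
  }
}

def rewrite(e: Expr): Expr = e match {
  case Simplify(simplified) => simplified
  case other => other
}

// rewrite(Add(Lit(1), Lit(2))) == Lit(3)
{code}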
[jira] [Resolved] (SPARK-31950) Extract SQL keywords from the generated parser class in TableIdentifierParserSuite
[ https://issues.apache.org/jira/browse/SPARK-31950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-31950. -- Fix Version/s: 3.1.0 Assignee: Takeshi Yamamuro (was: Apache Spark) Resolution: Fixed Resolved by https://github.com/apache/spark/pull/28802 > Extract SQL keywords from the generated parser class in > TableIdentifierParserSuite > -- > > Key: SPARK-31950 > URL: https://issues.apache.org/jira/browse/SPARK-31950 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Minor > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31980) Spark sequence() fails if start and end of range are identical dates
Dave DeCaprio created SPARK-31980: - Summary: Spark sequence() fails if start and end of range are identical dates Key: SPARK-31980 URL: https://issues.apache.org/jira/browse/SPARK-31980 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.4 Environment: Spark 2.4.4 standalone and on AWS EMR Reporter: Dave DeCaprio The following Spark SQL query throws an exception {code:java} select sequence(cast("2011-03-01" as date), cast("2011-03-01" as date), interval 1 month) {code} The error is: {noformat} java.lang.ArrayIndexOutOfBoundsException: 1java.lang.ArrayIndexOutOfBoundsException: 1 at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:92) at org.apache.spark.sql.catalyst.expressions.Sequence$TemporalSequenceImpl.eval(collectionOperations.scala:2681) at org.apache.spark.sql.catalyst.expressions.Sequence.eval(collectionOperations.scala:2514) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:389){noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31979) release script should not fail when remove non-existing files
[ https://issues.apache.org/jira/browse/SPARK-31979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31979. --- Fix Version/s: 3.1.0 3.0.1 Resolution: Fixed Issue resolved by pull request 28815 [https://github.com/apache/spark/pull/28815] > release script should not fail when remove non-existing files > - > > Key: SPARK-31979 > URL: https://issues.apache.org/jira/browse/SPARK-31979 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.1, 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30119) Support pagination for spark streaming tab
[ https://issues.apache.org/jira/browse/SPARK-30119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30119. -- Resolution: Fixed Resolved by https://github.com/apache/spark/pull/28748 > Support pagination for spark streaming tab > --- > > Key: SPARK-30119 > URL: https://issues.apache.org/jira/browse/SPARK-30119 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.1.0 >Reporter: jobit mathew >Assignee: Rakesh Raushan >Priority: Minor > Fix For: 3.1.0 > > > Support pagination for spark streaming tab -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31947) Solve string value error about Date/Timestamp in ScriptTransform
[ https://issues.apache.org/jira/browse/SPARK-31947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-31947: -- Description: For test case {code:java} test("SPARK-25990: TRANSFORM should handle different data types correctly") { assume(TestUtils.testCommandAvailable("python")) val scriptFilePath = getTestResourcePath("test_script.py") withTempView("v") { val df = Seq( (1, "1", 1.0, BigDecimal(1.0), new Timestamp(1), Date.valueOf("2015-05-21")), (2, "2", 2.0, BigDecimal(2.0), new Timestamp(2), Date.valueOf("2015-05-22")), (3, "3", 3.0, BigDecimal(3.0), new Timestamp(3), Date.valueOf("2015-05-23")) ).toDF("a", "b", "c", "d", "e", "f") // Note column d's data type is Decimal(38, 18) df.createTempView("v") val query = sql( s""" |SELECT |TRANSFORM(a, b, c, d, e, f) |USING 'python $scriptFilePath' AS (a, b, c, d, e, f) |FROM v """.stripMargin) val decimalToString: Column => Column = c => c.cast("string") checkAnswer(query, identity, df.select( 'a.cast("string"), 'b.cast("string"), 'c.cast("string"), decimalToString('d), 'e.cast("string"), 'f.cast("string")).collect()) } } {code} Get wrong result {code:java} [info] - SPARK-25990: TRANSFORM should handle different data types correctly *** FAILED *** (4 seconds, 997 milliseconds) [info] Results do not match for Spark plan: [info]ScriptTransformation [a#19, b#20, c#21, d#22, e#23, f#24], python /Users/angerszhu/Documents/project/AngersZhu/spark/sql/core/target/scala-2.12/test-classes/test_script.py, [a#31, b#32, c#33, d#34, e#35, f#36], org.apache.spark.sql.execution.script.ScriptTransformIOSchema@1ad5a29c [info] +- Project [_1#6 AS a#19, _2#7 AS b#20, _3#8 AS c#21, _4#9 AS d#22, _5#10 AS e#23, _6#11 AS f#24] [info] +- LocalTableScan [_1#6, _2#7, _3#8, _4#9, _5#10, _6#11] [info] [info] [info]== Results == [info]!== Expected Answer - 3 == == Actual Answer - 3 == [info] ![1,1,1.0,1.00,1970-01-01 08:00:00.001,2015-05-21] [1,1,1.0,1.00,1000,16576] [info] ![2,2,2.0,2.00,1970-01-01 08:00:00.002,2015-05-22] [2,2,2.0,2.00,2000,16577] [info] ![3,3,3.0,3.00,1970-01-01 08:00:00.003,2015-05-23] [3,3,3.0,3.00,3000,16578] (SparkPlanTest.scala:95) [ {code} > Solve string value error about Date/Timestamp in ScriptTransform > > > Key: SPARK-31947 > URL: https://issues.apache.org/jira/browse/SPARK-31947 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: angerszhu >Priority: Major > > For test case > > {code:java} > test("SPARK-25990: TRANSFORM should handle different data types correctly") { > assume(TestUtils.testCommandAvailable("python")) > val scriptFilePath = getTestResourcePath("test_script.py") > withTempView("v") { > val df = Seq( > (1, "1", 1.0, BigDecimal(1.0), new Timestamp(1), > Date.valueOf("2015-05-21")), > (2, "2", 2.0, BigDecimal(2.0), new Timestamp(2), > Date.valueOf("2015-05-22")), > (3, "3", 3.0, BigDecimal(3.0), new Timestamp(3), > Date.valueOf("2015-05-23")) > ).toDF("a", "b", "c", "d", "e", "f") // Note column d's data type is > Decimal(38, 18) > df.createTempView("v") val query = sql( > s""" >|SELECT >|TRANSFORM(a, b, c, d, e, f) >|USING 'python $scriptFilePath' AS (a, b, c, d, e, f) >|FROM v > """.stripMargin) val decimalToString: Column => Column = c => > c.cast("string") checkAnswer(query, identity, df.select( > 'a.cast("string"), > 'b.cast("string"), > 'c.cast("string"), > decimalToString('d), > 'e.cast("string"), > 'f.cast("string")).collect()) > } > } > {code} > > > Get wrong result > {code:java} > [info] - SPARK-25990: TRANSFORM should handle different 
data types correctly > *** FAILED *** (4 seconds, 997 milliseconds) > [info] Results do not match for Spark plan: > [info]ScriptTransformation [a#19, b#20, c#21, d#22, e#23, f#24], python > /Users/angerszhu/Documents/project/AngersZhu/spark/sql/core/target/scala-2.12/test-classes/test_script.py, > [a#31, b#32, c#33, d#34, e#35, f#36], > org.apache.spark.sql.execution.script.ScriptTransformIOSchema@1ad5a29c > [info] +- Project [_1#6 AS a#19, _2#7 AS b#20, _3#8 AS c#21, _4#9 AS d#22, > _5#10 AS e#23, _6#11 AS f#24] > [info] +- LocalTableScan [_1#6, _2#7, _3#8, _4#9, _5#10, _6#11] > [info] > [info] > [info]== Results == > [info]!== Expect
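For context on the mismatch above (a reading of the numbers in the failing output, not a description of the eventual fix): the "wrong" columns contain Catalyst's internal encodings rather than formatted strings — DateType is stored as days since 1970-01-01 (16576 above) and TimestampType as microseconds since the epoch (1000 above, i.e. {{new Timestamp(1)}}). A small java.time sketch decoding them back:

{code:java}
import java.time.{Instant, LocalDate, ZoneOffset}

// Decode the raw internal values seen in the actual answer.
val date = LocalDate.ofEpochDay(16576)                 // 2015-05-21
val instant = Instant.EPOCH.plusNanos(1000L * 1000L)   // 1970-01-01T00:00:00.001Z
// Rendered in a UTC+8 session time zone (as in the reported run) this becomes
// "1970-01-01 08:00:00.001", matching the expected answer in the test.
val local = instant.atZone(ZoneOffset.ofHours(8)).toLocalDateTime
{code}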
[jira] [Updated] (SPARK-31969) StreamingJobProgressListener threw an exception java.util.NoSuchElementException for Long Running Streaming Job
[ https://issues.apache.org/jira/browse/SPARK-31969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ThimmeGowda updated SPARK-31969: Description: We are running a long running streaming job and Below exception is seen continuosly after sometime. After the jobs starts all of a sudden our Spark streaming application's batch durations start to increase. At around the same time there starts to appear an error log that does not refer to the application code at all. We couldn't find any other significant errors in the driver logs. Our Application runs fine for several hours(4 or so) , then delay starts to build up and finally every 9-10 hours the application restarts. Any suggestion on what could be the issue ? Refrerred ticket : https://issues.apache.org/jira/browse/SPARK-21065 for similar issue, in our case we are not setting anything for spark.streaming.concurrentJobs and default value is taken. \{"type":"log", "level":"ERROR", "name":"STREAMING_OTHERS", "time":"2020-06-09T04:31:43.918Z", "timezone":"UTC", "class":"spark-listener-group-appStatus", "method":"streaming.scheduler.StreamingListenerBus.logError(91)", "log":"Listener StreamingJobProgressListener threw an exception\u000Ajava.util.NoSuchElementException: key not found: 159167710 ms\u000A\u0009at scala.collection.MapLike$class.default(MapLike.scala:228)\u000A\u0009at scala.collection.AbstractMap.default(Map.scala:59)\u000A\u0009at scala.collection.mutable.HashMap.apply(HashMap.scala:65)\u000A\u0009at org.apache.spark.streaming.ui.StreamingJobProgressListener.onOutputOperationCompleted(StreamingJobProgressListener.scala:134)\u000A\u0009at org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:67)\u000A\u0009at org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:29)\u000A\u0009at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)\u000A\u0009at org.apache.spark.streaming.scheduler.StreamingListenerBus.postToAll(StreamingListenerBus.scala:29)\u000A\u0009at org.apache.spark.streaming.scheduler.StreamingListenerBus.onOtherEvent(StreamingListenerBus.scala:43)\u000A\u0009at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:80)\u000A\u0009at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)\u000A\u0009at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)\u000A\u0009at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)\u000A\u0009at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92)\u000A\u0009at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92)\u000A\u0009at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)\u000A\u0009at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)\u000A\u0009at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)\u000A\u0009at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:87)\u000A\u0009at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:83)\u000A\u0009at 
org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1302)\u000A\u0009at org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:82)\u000A"} java.util.NoSuchElementException: key not found: 159167710 ms at scala.collection.MapLike$class.default(MapLike.scala:228) ~[scala-library-2.11.12.jar:?] at scala.collection.AbstractMap.default(Map.scala:59) ~[scala-library-2.11.12.jar:?] at scala.collection.mutable.HashMap.apply(HashMap.scala:65) ~[scala-library-2.11.12.jar:?] at org.apache.spark.streaming.ui.StreamingJobProgressListener.onOutputOperationCompleted(StreamingJobProgressListener.scala:134) ~[spark-streaming_2.11-2.4.0.jar:2.4.0] at org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:67) ~[spark-streaming_2.11-2.4.0.jar:2.4.0] at org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:29) [spark-streaming_2.11-2.4.0.jar:2.4.0] at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91) [spark-core_2.11-2.4.0.jar:2.4.0] at org.apache.spark.streaming.scheduler.StreamingListenerBus.postToAll(StreamingListenerBus.scala:29) [spark-streaming_2.11-2.4.0.jar:2.4.0] at org.apache.spark.streaming.scheduler.StreamingListenerBus.onOtherEvent(StreamingListenerBus.scala:43) [spark-streaming_2.11-2.4.0.jar:2.4.0]
[jira] [Updated] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path
[ https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christopher Highman updated SPARK-31962: Description: When using structured streaming with a FileDataSource, I've encountered a number of occasions where I want to be able to stream from a folder containing any number of historical files in CSV format. When I start reading from a folder, however, I might only care about files that were created after a certain time. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .format("csv") .load("/mnt/Deltas") {code} In [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala], there is a method, _listLeafFiles,_ which builds _FileStatus_ objects containing an implicit _modificationDate_ property. We may already iterate the resulting files if a filter is applied to the path. In this case, its trivial to do a primitive comparison against _modificationDate_ and a date specified from an option. Without the filter specified, we would be expending less effort than if the filter were applied by itself since we are comparing primitives. Having the ability to provide an option where specifying a timestamp when loading files from a path would minimize complexity for consumers who leverage the ability to load files or do structured streaming from a folder path but do not have an interest in reading what could be thousands of files that are not relevant. One example to could be "_filesModifiedAfterDate_" accepting a UTC datetime like below. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .option("filesModifiedAfterDate", "2020-05-01T12:00:00") .format("csv") .load("/mnt/Deltas") {code} If this option is specified, the expected behavior would be that files within the _"/mnt/Deltas/"_ path must have been created at or later than the specified time in order to be consumed for purposes of reading files from a folder path or via structured streaming. I have unit tests passing under _CSVSuite_ and _FileIndexSuite_ in the _spark.sql.execution.datasources_ package. was: When using structured streaming with a FileDataSource, I've encountered a number of occasions where I want to be able to stream from a folder containing any number of historical files in CSV format. When I start reading from a folder, however, I might only care about files that were created after a certain time. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .format("csv") .load("/mnt/Deltas") {code} In [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala], there is a method, _listLeafFiles,_ which builds _FileStatus_ objects containing an implicit _modificationDate_ property. We may already iterate the resulting files if a filter is applied to the path. In this case, its trivial to do a primitive comparison against modificationDate and a date specified from an option. Without the filter specified, we would be expending less effort than if the filter were applied by itself since we are comparing primitives. 
Having the ability to provide an option where specifying a timestamp when loading files from a path would minimize complexity for consumers who leverage the ability to load files or do structured streaming from a folder path but do not have an interest in reading what could be thousands of files that are not relevant. One example to could be "_filesModifiedAfterDate_" accepting a UTC datetime like below. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .option("filesModifiedAfterDate", "2020-05-01T12:00:00") .format("csv") .load("/mnt/Deltas") {code} If this option is specified, the expected behavior would be that files within the _"/mnt/Deltas/"_ path must have been created at or later than the specified time in order to be consumed for purposes of reading files from a folder path or via structured streaming. I have unit tests passing under _CSVSuite_ and _FileIndexSuite_ in the _spark.sql.execution.datasources_ package. > Provide option to load files after a specified date when reading from a > folder path > --- > > Key: SPARK-31962 > URL: https://issues.apache.org/jira/browse/SPARK-31962 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Christopher Highman >Priority: Minor > > When using structur
[jira] [Updated] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path
[ https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christopher Highman updated SPARK-31962: Description: When using structured streaming with a FileDataSource, I've encountered a number of occasions where I want to be able to stream from a folder containing any number of historical files in CSV format. When I start reading from a folder, however, I might only care about files that were created after a certain time. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .format("csv") .load("/mnt/Deltas") {code} In [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala], there is a method, _listLeafFiles,_ which builds _FileStatus_ objects containing an implicit _modificationDate_ property. We may already iterate the resulting files if a filter is applied to the path. In this case, its trivial to do a primitive comparison against modificationDate and a date specified from an option. Without the filter specified, we would be expending less effort than if the filter were applied by itself since we are comparing primitives. Having the ability to provide an option where specifying a timestamp when loading files from a path would minimize complexity for consumers who leverage the ability to load files or do structured streaming from a folder path but do not have an interest in reading what could be thousands of files that are not relevant. One example to could be "_filesModifiedAfterDate_" accepting a UTC datetime like below. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .option("filesModifiedAfterDate", "2020-05-01T12:00:00") .format("csv") .load("/mnt/Deltas") {code} If this option is specified, the expected behavior would be that files within the _"/mnt/Deltas/"_ path must have been created at or later than the specified time in order to be consumed for purposes of reading files from a folder path or via structured streaming. I have unit tests passing under _CSVSuite_ and _FileIndexSuite_ in the _spark.sql.execution.datasources_ package. was: When using structured streaming with a FileDataSource, I've encountered a number of occasions where I want to be able to stream from a folder containing any number of historical files in CSV format. When I start reading from a folder, however, I might only care about files that were created after a certain time. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .format("csv") .load("/mnt/Deltas") {code} In [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala], there is a method, _listLeafFiles,_ which builds FileStatus objects containing an implicit _modificationDate_ property. We may already iterate the resulting files if a filter is applied to the path. In this case, its trivial to do an primitive comparison against modificationDate. Without the filter specified, we would be expending less effort than if the filter were applied by itself. Having the ability to provide an option specifying a timestamp by which to begin globbing files would result in quite a bit of less complexity needed on a consumer who leverages the ability to stream from a folder path but does not have an interest in reading what could be thousands of files that are not relevant. 
One example could be "filesModifiedAfterDate" accepting a UTC datetime like below. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .option("filesModifiedAfterDate", "2020-05-01 00:00:00") .format("csv") .load("/mnt/Deltas") {code} If this option is specified, the expected behavior would be that files within the _"/mnt/Deltas/"_ path must have been created at or later than the specified time in order to be consumed for purposes of reading files in general or for purposes of structured streaming. I have unit tests passing under _CSVSuite_ and _FileIndexSuite_ in the _spark.sql.execution.datasources_ package. > Provide option to load files after a specified date when reading from a > folder path > --- > > Key: SPARK-31962 > URL: https://issues.apache.org/jira/browse/SPARK-31962 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Christopher Highman >Priority: Minor > > When using structured streaming with a FileDataSource, I've encountered a > number of occasio
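For illustration only, here is a minimal sketch of the comparison the reporter describes, assuming the option value is parsed as a UTC timestamp. The option name "filesModifiedAfterDate" is the reporter's proposal; the parsing and the helper below are hypothetical and do not reflect the actual patch.
{code:java}
import java.time.{LocalDateTime, ZoneOffset}
import org.apache.hadoop.fs.FileStatus

// Hypothetical: the value of the proposed "filesModifiedAfterDate" option, interpreted as UTC.
val threshold: Long = LocalDateTime
  .parse("2020-05-01T12:00:00")
  .toInstant(ZoneOffset.UTC)
  .toEpochMilli

// Keep only files modified at or after the threshold. This is a primitive long
// comparison, since each FileStatus already carries its modification time.
def filterByModificationTime(files: Seq[FileStatus]): Seq[FileStatus] =
  files.filter(_.getModificationTime >= threshold)
{code}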
[jira] [Updated] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path
[ https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christopher Highman updated SPARK-31962: Description: When using structured streaming with a FileDataSource, I've encountered a number of occasions where I want to be able to stream from a folder containing any number of historical files in CSV format. When I start reading from a folder, however, I might only care about files that were created after a certain time. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .format("csv") .load("/mnt/Deltas") {code} In [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala], there is a method, _listLeafFiles,_ which builds FileStatus objects containing an implicit _modificationDate_ property. We may already iterate the resulting files if a filter is applied to the path. In this case, it's trivial to do a primitive comparison against modificationDate. Without the filter specified, we would be expending less effort than if the filter were applied by itself. Having the ability to provide an option specifying a timestamp by which to begin globbing files would result in quite a bit less complexity for a consumer who leverages the ability to stream from a folder path but does not have an interest in reading what could be thousands of files that are not relevant. One example could be "filesModifiedAfterDate" accepting a UTC datetime like below. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .option("filesModifiedAfterDate", "2020-05-01 00:00:00") .format("csv") .load("/mnt/Deltas") {code} If this option is specified, the expected behavior would be that files within the _"/mnt/Deltas/"_ path must have been created at or later than the specified time in order to be consumed for purposes of reading files in general or for purposes of structured streaming. I have unit tests passing under _CSVSuite_ and _FileIndexSuite_ in the _spark.sql.execution.datasources_ package. was: When using structured streaming with a FileDataSource, I've encountered a number of occasions where I want to be able to stream from a folder containing any number of historical delta files in CSV format. When I start reading from a folder, however, I might only care about files that were created after a certain time. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .format("csv") .load("/mnt/Deltas") {code} In [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala], there is a method, _checkAndGlobPathIfNecessary,_ which appears to create an in-memory index of files for a given path. There may be a rather clean opportunity to consider options here. Having the ability to provide an option specifying a timestamp by which to begin globbing files would result in quite a bit less complexity for a consumer who leverages the ability to stream from a folder path but does not have an interest in reading what could be thousands of files that are not relevant. One example could be "createdFileTime" accepting a UTC datetime like below. 
{code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .option("createdFileTime", "2020-05-01 00:00:00") .format("csv") .load("/mnt/Deltas") {code} If this option is specified, the expected behavior would be that files within the _"/mnt/Deltas/"_ path must have been created at or later than the specified time in order to be consumed for purposes of reading the files in general or for purposes of structured streaming. > Provide option to load files after a specified date when reading from a > folder path > --- > > Key: SPARK-31962 > URL: https://issues.apache.org/jira/browse/SPARK-31962 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Christopher Highman >Priority: Minor > > When using structured streaming with a FileDataSource, I've encountered a > number of occasions where I want to be able to stream from a folder > containing any number of historical files in CSV format. When I start > reading from a folder, however, I might only care about files that were > created after a certain time. > {code:java} > spark.readStream > .option("header", "true") > .option("delimiter", "\t") > .format("csv") > .load("/mnt/Deltas") > {code} > In > [https://gith
[jira] [Commented] (SPARK-31979) release script should not fail when remove non-existing files
[ https://issues.apache.org/jira/browse/SPARK-31979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134098#comment-17134098 ] Apache Spark commented on SPARK-31979: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/28815 > release script should not fail when remove non-existing files > - > > Key: SPARK-31979 > URL: https://issues.apache.org/jira/browse/SPARK-31979 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31979) release script should not fail when remove non-existing files
[ https://issues.apache.org/jira/browse/SPARK-31979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31979: Assignee: Apache Spark (was: Wenchen Fan) > release script should not fail when remove non-existing files > - > > Key: SPARK-31979 > URL: https://issues.apache.org/jira/browse/SPARK-31979 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31979) release script should not fail when remove non-existing files
[ https://issues.apache.org/jira/browse/SPARK-31979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31979: Assignee: Wenchen Fan (was: Apache Spark) > release script should not fail when remove non-existing files > - > > Key: SPARK-31979 > URL: https://issues.apache.org/jira/browse/SPARK-31979 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31979) release script should not fail when remove non-existing files
Wenchen Fan created SPARK-31979: --- Summary: release script should not fail when remove non-existing files Key: SPARK-31979 URL: https://issues.apache.org/jira/browse/SPARK-31979 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31978) Why does skew handling in Spark AE have requirements on the join type?
zhaoming chen created SPARK-31978: - Summary: Why does skew handling in Spark AE have requirements on the join type? Key: SPARK-31978 URL: https://issues.apache.org/jira/browse/SPARK-31978 Project: Spark Issue Type: Question Components: SQL Affects Versions: 3.0.0 Reporter: zhaoming chen Why can't the right-side partition be split when the join type is a left join? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31760) Simplification Based on Containment
[ https://issues.apache.org/jira/browse/SPARK-31760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-31760: --- Assignee: (was: Yuming Wang) > Simplification Based on Containment > --- > > Key: SPARK-31760 > URL: https://issues.apache.org/jira/browse/SPARK-31760 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > Labels: starter > > https://docs.teradata.com/reader/Ws7YT1jvRK2vEr1LpVURug/V~FCwD9BL7gY4ac3WwHInw -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31968) write.partitionBy() creates duplicate subdirectories when user provide duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-31968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuzhou Qin updated SPARK-31968: --- Summary: write.partitionBy() creates duplicate subdirectories when user provide duplicate columns (was: write.partitionBy() creates duplicated subdirectories when user provide duplicated columns) > write.partitionBy() creates duplicate subdirectories when user provide > duplicate columns > > > Key: SPARK-31968 > URL: https://issues.apache.org/jira/browse/SPARK-31968 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Xuzhou Qin >Priority: Major > > I recently remarked that if there are duplicated elements in the argument of > write.partitionBy(), then the same partition subdirectory will be created > multiple times. > For example: > {code:java} > import spark.implicits._ > val df: DataFrame = Seq( > (1, "p1", "c1", 1L), > (2, "p2", "c2", 2L), > (2, "p1", "c2", 2L), > (3, "p3", "c3", 3L), > (3, "p2", "c3", 3L), > (3, "p3", "c3", 3L) > ).toDF("col1", "col2", "col3", "col4") > df.write > .partitionBy("col1", "col1") // we have "col1" twice > .mode(SaveMode.Overwrite) > .csv("output_dir"){code} > The above code will produce an output directory with this structure: > > {code:java} > output_dir > | > |--col1=1 > ||--col1=1 > | > |--col1=2 > ||--col1=2 > | > |--col1=3 >|--col1=3{code} > And we won't be able to read the output > > {code:java} > spark.read.csv("output_dir").show() > // Exception in thread "main" org.apache.spark.sql.AnalysisException: Found > duplicate column(s) in the partition schema: `col1`;{code} > > I am not sure if partitioning a dataframe twice by the same column make sense > in some real-world applications, but it will cause schema inference problems > in tools like AWS Glue crawler. > Should Spark handle the deduplication of the partition columns? Or maybe > throw an exception when duplicated columns are detected? > If this behaviour is unexpected, I will work on a fix. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
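Until the behaviour is settled in Spark, one hedged caller-side workaround is simply to deduplicate the partition columns before writing; the helper below is an illustrative sketch, not a Spark-side fix.
{code:java}
import org.apache.spark.sql.{DataFrame, SaveMode}

// Illustrative workaround: drop duplicate partition column names before writing,
// so partitionBy("col1", "col1") effectively becomes partitionBy("col1").
def writeWithDedupedPartitions(df: DataFrame, partitionCols: Seq[String], path: String): Unit = {
  df.write
    .partitionBy(partitionCols.distinct: _*)
    .mode(SaveMode.Overwrite)
    .csv(path)
}
{code}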
[jira] [Assigned] (SPARK-31968) write.partitionBy() creates duplicated subdirectories when user provide duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-31968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31968: Assignee: (was: Apache Spark) > write.partitionBy() creates duplicated subdirectories when user provide > duplicated columns > -- > > Key: SPARK-31968 > URL: https://issues.apache.org/jira/browse/SPARK-31968 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Xuzhou Qin >Priority: Major > > I recently remarked that if there are duplicated elements in the argument of > write.partitionBy(), then the same partition subdirectory will be created > multiple times. > For example: > {code:java} > import spark.implicits._ > val df: DataFrame = Seq( > (1, "p1", "c1", 1L), > (2, "p2", "c2", 2L), > (2, "p1", "c2", 2L), > (3, "p3", "c3", 3L), > (3, "p2", "c3", 3L), > (3, "p3", "c3", 3L) > ).toDF("col1", "col2", "col3", "col4") > df.write > .partitionBy("col1", "col1") // we have "col1" twice > .mode(SaveMode.Overwrite) > .csv("output_dir"){code} > The above code will produce an output directory with this structure: > > {code:java} > output_dir > | > |--col1=1 > ||--col1=1 > | > |--col1=2 > ||--col1=2 > | > |--col1=3 >|--col1=3{code} > And we won't be able to read the output > > {code:java} > spark.read.csv("output_dir").show() > // Exception in thread "main" org.apache.spark.sql.AnalysisException: Found > duplicate column(s) in the partition schema: `col1`;{code} > > I am not sure if partitioning a dataframe twice by the same column make sense > in some real-world applications, but it will cause schema inference problems > in tools like AWS Glue crawler. > Should Spark handle the deduplication of the partition columns? Or maybe > throw an exception when duplicated columns are detected? > If this behaviour is unexpected, I will work on a fix. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31968) write.partitionBy() creates duplicated subdirectories when user provide duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-31968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31968: Assignee: Apache Spark > write.partitionBy() creates duplicated subdirectories when user provide > duplicated columns > -- > > Key: SPARK-31968 > URL: https://issues.apache.org/jira/browse/SPARK-31968 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Xuzhou Qin >Assignee: Apache Spark >Priority: Major > > I recently remarked that if there are duplicated elements in the argument of > write.partitionBy(), then the same partition subdirectory will be created > multiple times. > For example: > {code:java} > import spark.implicits._ > val df: DataFrame = Seq( > (1, "p1", "c1", 1L), > (2, "p2", "c2", 2L), > (2, "p1", "c2", 2L), > (3, "p3", "c3", 3L), > (3, "p2", "c3", 3L), > (3, "p3", "c3", 3L) > ).toDF("col1", "col2", "col3", "col4") > df.write > .partitionBy("col1", "col1") // we have "col1" twice > .mode(SaveMode.Overwrite) > .csv("output_dir"){code} > The above code will produce an output directory with this structure: > > {code:java} > output_dir > | > |--col1=1 > ||--col1=1 > | > |--col1=2 > ||--col1=2 > | > |--col1=3 >|--col1=3{code} > And we won't be able to read the output > > {code:java} > spark.read.csv("output_dir").show() > // Exception in thread "main" org.apache.spark.sql.AnalysisException: Found > duplicate column(s) in the partition schema: `col1`;{code} > > I am not sure if partitioning a dataframe twice by the same column make sense > in some real-world applications, but it will cause schema inference problems > in tools like AWS Glue crawler. > Should Spark handle the deduplication of the partition columns? Or maybe > throw an exception when duplicated columns are detected? > If this behaviour is unexpected, I will work on a fix. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31968) write.partitionBy() creates duplicated subdirectories when user provide duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-31968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134079#comment-17134079 ] Apache Spark commented on SPARK-31968: -- User 'TJX2014' has created a pull request for this issue: https://github.com/apache/spark/pull/28814 > write.partitionBy() creates duplicated subdirectories when user provide > duplicated columns > -- > > Key: SPARK-31968 > URL: https://issues.apache.org/jira/browse/SPARK-31968 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Xuzhou Qin >Priority: Major > > I recently remarked that if there are duplicated elements in the argument of > write.partitionBy(), then the same partition subdirectory will be created > multiple times. > For example: > {code:java} > import spark.implicits._ > val df: DataFrame = Seq( > (1, "p1", "c1", 1L), > (2, "p2", "c2", 2L), > (2, "p1", "c2", 2L), > (3, "p3", "c3", 3L), > (3, "p2", "c3", 3L), > (3, "p3", "c3", 3L) > ).toDF("col1", "col2", "col3", "col4") > df.write > .partitionBy("col1", "col1") // we have "col1" twice > .mode(SaveMode.Overwrite) > .csv("output_dir"){code} > The above code will produce an output directory with this structure: > > {code:java} > output_dir > | > |--col1=1 > ||--col1=1 > | > |--col1=2 > ||--col1=2 > | > |--col1=3 >|--col1=3{code} > And we won't be able to read the output > > {code:java} > spark.read.csv("output_dir").show() > // Exception in thread "main" org.apache.spark.sql.AnalysisException: Found > duplicate column(s) in the partition schema: `col1`;{code} > > I am not sure if partitioning a dataframe twice by the same column make sense > in some real-world applications, but it will cause schema inference problems > in tools like AWS Glue crawler. > Should Spark handle the deduplication of the partition columns? Or maybe > throw an exception when duplicated columns are detected? > If this behaviour is unexpected, I will work on a fix. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31968) write.partitionBy() creates duplicated subdirectories when user provide duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-31968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuzhou Qin updated SPARK-31968: --- Summary: write.partitionBy() creates duplicated subdirectories when user provide duplicated columns (was: write.partitionBy() creates duplicated subdirectory when user give duplicated columns) > write.partitionBy() creates duplicated subdirectories when user provide > duplicated columns > -- > > Key: SPARK-31968 > URL: https://issues.apache.org/jira/browse/SPARK-31968 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Xuzhou Qin >Priority: Major > > I recently remarked that if there are duplicated elements in the argument of > write.partitionBy(), then the same partition subdirectory will be created > multiple times. > For example: > {code:java} > import spark.implicits._ > val df: DataFrame = Seq( > (1, "p1", "c1", 1L), > (2, "p2", "c2", 2L), > (2, "p1", "c2", 2L), > (3, "p3", "c3", 3L), > (3, "p2", "c3", 3L), > (3, "p3", "c3", 3L) > ).toDF("col1", "col2", "col3", "col4") > df.write > .partitionBy("col1", "col1") // we have "col1" twice > .mode(SaveMode.Overwrite) > .csv("output_dir"){code} > The above code will produce an output directory with this structure: > > {code:java} > output_dir > | > |--col1=1 > ||--col1=1 > | > |--col1=2 > ||--col1=2 > | > |--col1=3 >|--col1=3{code} > And we won't be able to read the output > > {code:java} > spark.read.csv("output_dir").show() > // Exception in thread "main" org.apache.spark.sql.AnalysisException: Found > duplicate column(s) in the partition schema: `col1`;{code} > > I am not sure if partitioning a dataframe twice by the same column make sense > in some real-world applications, but it will cause schema inference problems > in tools like AWS Glue crawler. > Should Spark handle the deduplication of the partition columns? Or maybe > throw an exception when duplicated columns are detected? > If this behaviour is unexpected, I will work on a fix. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31968) write.partitionBy() creates duplicated subdirectory when user give duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-31968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuzhou Qin updated SPARK-31968: --- Description: I recently remarked that if there are duplicated elements in the argument of write.partitionBy(), then the same partition subdirectory will be created multiple times. For example: {code:java} import spark.implicits._ val df: DataFrame = Seq( (1, "p1", "c1", 1L), (2, "p2", "c2", 2L), (2, "p1", "c2", 2L), (3, "p3", "c3", 3L), (3, "p2", "c3", 3L), (3, "p3", "c3", 3L) ).toDF("col1", "col2", "col3", "col4") df.write .partitionBy("col1", "col1") // we have "col1" twice .mode(SaveMode.Overwrite) .csv("output_dir"){code} The above code will produce an output directory with this structure: {code:java} output_dir | |--col1=1 ||--col1=1 | |--col1=2 ||--col1=2 | |--col1=3 |--col1=3{code} And we won't be able to read the output {code:java} spark.read.csv("output_dir").show() // Exception in thread "main" org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the partition schema: `col1`;{code} I am not sure if partitioning a dataframe twice by the same column make sense in some real-world applications, but it will cause schema inference problems in tools like AWS Glue crawler. Should Spark handle the deduplication of the partition columns? Or maybe throw an exception when duplicated columns are detected? If this behaviour is unexpected, I will work on a fix. was: I recently remarked that if there are duplicated elements in the argument of write.partitionBy(), then the same partition subdirectory will be created multiple times. For example: {code:java} import spark.implicits._ val df: DataFrame = Seq( (1, "p1", "c1", 1L), (2, "p2", "c2", 2L), (2, "p1", "c2", 2L), (3, "p3", "c3", 3L), (3, "p2", "c3", 3L), (3, "p3", "c3", 3L) ).toDF("col1", "col2", "col3", "col4") df.write .partitionBy("col1", "col1") // we have "col1" twice .mode(SaveMode.Overwrite) .csv("output_dir"){code} The above code will produce an output directory with this structure: {code:java} output_dir | |--col1=1 ||--col1=1 | |--col1=2 ||--col1=2 | |--col1=3 |--col1=3{code} And we won't be able to read the output {code:java} spark.read.csv("output_dir").show() // Exception in thread "main" org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the partition schema: `col1`;{code} I am not sure if partitioning a dataframe twice by the same column make sense in some real world applications, but it will cause schema inference problems in tools like AWS Glue crawler. Should Spark handle the deduplication of the partition columns? Or maybe throw an exception when duplicated columns are detected? If this is an unexpected behaviour, I will working on a fix. > write.partitionBy() creates duplicated subdirectory when user give duplicated > columns > - > > Key: SPARK-31968 > URL: https://issues.apache.org/jira/browse/SPARK-31968 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Xuzhou Qin >Priority: Major > > I recently remarked that if there are duplicated elements in the argument of > write.partitionBy(), then the same partition subdirectory will be created > multiple times. 
> For example: > {code:java} > import spark.implicits._ > val df: DataFrame = Seq( > (1, "p1", "c1", 1L), > (2, "p2", "c2", 2L), > (2, "p1", "c2", 2L), > (3, "p3", "c3", 3L), > (3, "p2", "c3", 3L), > (3, "p3", "c3", 3L) > ).toDF("col1", "col2", "col3", "col4") > df.write > .partitionBy("col1", "col1") // we have "col1" twice > .mode(SaveMode.Overwrite) > .csv("output_dir"){code} > The above code will produce an output directory with this structure: > > {code:java} > output_dir > | > |--col1=1 > ||--col1=1 > | > |--col1=2 > ||--col1=2 > | > |--col1=3 >|--col1=3{code} > And we won't be able to read the output > > {code:java} > spark.read.csv("output_dir").show() > // Exception in thread "main" org.apache.spark.sql.AnalysisException: Found > duplicate column(s) in the partition schema: `col1`;{code} > > I am not sure if partitioning a dataframe twice by the same column make sense > in some real-world applications, but it will cause schema inference problems > in tools like AWS Glue crawler. > Should Spark handle the deduplication of the partition columns? Or maybe > throw an exception when duplicated columns are detected? > If this behaviour is unexpected, I will work on a fix. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsub
[jira] [Commented] (SPARK-31977) Returns the projected plan directly from NestedColumnAliasing
[ https://issues.apache.org/jira/browse/SPARK-31977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134051#comment-17134051 ] Apache Spark commented on SPARK-31977: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/28812 > Returns the projected plan directly from NestedColumnAliasing > - > > Key: SPARK-31977 > URL: https://issues.apache.org/jira/browse/SPARK-31977 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > {{NestedColumnAliasing}} and {{GeneratorNestedColumnAliasing}} have a > different usage pattern: > {code} > case p @ NestedColumnAliasing(nestedFieldToAlias, attrToAliases) => > NestedColumnAliasing.replaceToAliases(p, nestedFieldToAlias, > attrToAliases) > {code} > vs > {code} > case GeneratorNestedColumnAliasing(p) => p > {code} > It should be better to match it -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31977) Returns the projected plan directly from NestedColumnAliasing
[ https://issues.apache.org/jira/browse/SPARK-31977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31977: Assignee: (was: Apache Spark) > Returns the projected plan directly from NestedColumnAliasing > - > > Key: SPARK-31977 > URL: https://issues.apache.org/jira/browse/SPARK-31977 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > {{NestedColumnAliasing}} and {{GeneratorNestedColumnAliasing}} have a > different usage pattern: > {code} > case p @ NestedColumnAliasing(nestedFieldToAlias, attrToAliases) => > NestedColumnAliasing.replaceToAliases(p, nestedFieldToAlias, > attrToAliases) > {code} > vs > {code} > case GeneratorNestedColumnAliasing(p) => p > {code} > It should be better to match it -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31977) Returns the projected plan directly from NestedColumnAliasing
[ https://issues.apache.org/jira/browse/SPARK-31977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31977: Assignee: Apache Spark > Returns the projected plan directly from NestedColumnAliasing > - > > Key: SPARK-31977 > URL: https://issues.apache.org/jira/browse/SPARK-31977 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Minor > > {{NestedColumnAliasing}} and {{GeneratorNestedColumnAliasing}} have a > different usage pattern: > {code} > case p @ NestedColumnAliasing(nestedFieldToAlias, attrToAliases) => > NestedColumnAliasing.replaceToAliases(p, nestedFieldToAlias, > attrToAliases) > {code} > vs > {code} > case GeneratorNestedColumnAliasing(p) => p > {code} > It should be better to match it -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31968) write.partitionBy() creates duplicated subdirectory when user give duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-31968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuzhou Qin updated SPARK-31968: --- Description: I recently remarked that if there are duplicated elements in the argument of write.partitionBy(), then the same partition subdirectory will be created multiple times. For example: {code:java} import spark.implicits._ val df: DataFrame = Seq( (1, "p1", "c1", 1L), (2, "p2", "c2", 2L), (2, "p1", "c2", 2L), (3, "p3", "c3", 3L), (3, "p2", "c3", 3L), (3, "p3", "c3", 3L) ).toDF("col1", "col2", "col3", "col4") df.write .partitionBy("col1", "col1") // we have "col1" twice .mode(SaveMode.Overwrite) .csv("output_dir"){code} The above code will produce an output directory with this structure: {code:java} output_dir | |--col1=1 ||--col1=1 | |--col1=2 ||--col1=2 | |--col1=3 |--col1=3{code} And we won't be able to read the output {code:java} spark.read.csv("output_dir").show() // Exception in thread "main" org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the partition schema: `col1`;{code} I am not sure if partitioning a dataframe twice by the same column make sense in some real world applications, but it will cause schema inference problems in tools like AWS Glue crawler. Should Spark handle the deduplication of the partition columns? Or maybe throw an exception when duplicated columns are detected? If this is an unexpected behaviour, I will working on a fix. was: I recently remarked that if there are duplicated elements in the argument of write.partitionBy(), then the same partition subdirectory will be created multiple times. For example: {code:java} import spark.implicits._ val df: DataFrame = Seq( (1, "p1", "c1", 1L), (2, "p2", "c2", 2L), (2, "p1", "c2", 2L), (3, "p3", "c3", 3L), (3, "p2", "c3", 3L), (3, "p3", "c3", 3L) ).toDF("col1", "col2", "col3", "col4") df.write .partitionBy("col1", "col1") // we have "col1" twice .mode(SaveMode.Overwrite) .csv("output_dir"){code} The above code will produce an output directory with this structure: {code:java} output_dir | |--col1=1 ||--col1=1 | |--col1=2 ||--col1=2 | |--col1=3 |--col1=3{code} And we won't be able to read the output {code:java} spark.read.csv("output_dir").show() // Exception in thread "main" org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the partition schema: `col1`;{code} I am not sure if partitioning a dataframe twice by the same column make sense in some real world applications, but it will cause schema inference problems in tools like AWS Glue crawler. Should Spark handle the deduplication of the partition columns? Or maybe throw an exception when duplicated columns are detected? > write.partitionBy() creates duplicated subdirectory when user give duplicated > columns > - > > Key: SPARK-31968 > URL: https://issues.apache.org/jira/browse/SPARK-31968 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Xuzhou Qin >Priority: Major > > I recently remarked that if there are duplicated elements in the argument of > write.partitionBy(), then the same partition subdirectory will be created > multiple times. 
> For example: > {code:java} > import spark.implicits._ > val df: DataFrame = Seq( > (1, "p1", "c1", 1L), > (2, "p2", "c2", 2L), > (2, "p1", "c2", 2L), > (3, "p3", "c3", 3L), > (3, "p2", "c3", 3L), > (3, "p3", "c3", 3L) > ).toDF("col1", "col2", "col3", "col4") > df.write > .partitionBy("col1", "col1") // we have "col1" twice > .mode(SaveMode.Overwrite) > .csv("output_dir"){code} > The above code will produce an output directory with this structure: > > {code:java} > output_dir > | > |--col1=1 > ||--col1=1 > | > |--col1=2 > ||--col1=2 > | > |--col1=3 >|--col1=3{code} > And we won't be able to read the output > > {code:java} > spark.read.csv("output_dir").show() > // Exception in thread "main" org.apache.spark.sql.AnalysisException: Found > duplicate column(s) in the partition schema: `col1`;{code} > > I am not sure if partitioning a dataframe twice by the same column make sense > in some real world applications, but it will cause schema inference problems > in tools like AWS Glue crawler. > Should Spark handle the deduplication of the partition columns? Or maybe > throw an exception when duplicated columns are detected? > If this is an unexpected behaviour, I will working on a fix. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional comman
[jira] [Commented] (SPARK-31967) Loading jobs UI page takes 40 seconds
[ https://issues.apache.org/jira/browse/SPARK-31967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134050#comment-17134050 ] Apache Spark commented on SPARK-31967: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/28813 > Loading jobs UI page takes 40 seconds > - > > Key: SPARK-31967 > URL: https://issues.apache.org/jira/browse/SPARK-31967 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.1 >Reporter: Gengliang Wang >Priority: Blocker > Attachments: load_time.jpeg, profile.png > > > In the latest master branch, I find that the job list page becomes very slow. > To reproduce in local setup: > {code:java} > spark.read.parquet("/tmp/p1").createOrReplaceTempView("t1") > spark.read.parquet("/tmp/p2").createOrReplaceTempView("t2") > (1 to 1000).map(_ => spark.sql("select * from t1, t2 where > t1.value=t2.value").show()) > {code} > And that, open live UI: http://localhost:4040/ > The loading time is about 40 seconds. > If we comment out the function call for `drawApplicationTimeline`, then the > loading time is around 1 second. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31967) Loading jobs UI page takes 40 seconds
[ https://issues.apache.org/jira/browse/SPARK-31967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134046#comment-17134046 ] Apache Spark commented on SPARK-31967: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/28811 > Loading jobs UI page takes 40 seconds > - > > Key: SPARK-31967 > URL: https://issues.apache.org/jira/browse/SPARK-31967 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.1 >Reporter: Gengliang Wang >Priority: Blocker > Attachments: load_time.jpeg, profile.png > > > In the latest master branch, I find that the job list page becomes very slow. > To reproduce in local setup: > {code:java} > spark.read.parquet("/tmp/p1").createOrReplaceTempView("t1") > spark.read.parquet("/tmp/p2").createOrReplaceTempView("t2") > (1 to 1000).map(_ => spark.sql("select * from t1, t2 where > t1.value=t2.value").show()) > {code} > And that, open live UI: http://localhost:4040/ > The loading time is about 40 seconds. > If we comment out the function call for `drawApplicationTimeline`, then the > loading time is around 1 second. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31967) Loading jobs UI page takes 40 seconds
[ https://issues.apache.org/jira/browse/SPARK-31967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134043#comment-17134043 ] Apache Spark commented on SPARK-31967: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/28811 > Loading jobs UI page takes 40 seconds > - > > Key: SPARK-31967 > URL: https://issues.apache.org/jira/browse/SPARK-31967 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.1 >Reporter: Gengliang Wang >Priority: Blocker > Attachments: load_time.jpeg, profile.png > > > In the latest master branch, I find that the job list page becomes very slow. > To reproduce in local setup: > {code:java} > spark.read.parquet("/tmp/p1").createOrReplaceTempView("t1") > spark.read.parquet("/tmp/p2").createOrReplaceTempView("t2") > (1 to 1000).map(_ => spark.sql("select * from t1, t2 where > t1.value=t2.value").show()) > {code} > And that, open live UI: http://localhost:4040/ > The loading time is about 40 seconds. > If we comment out the function call for `drawApplicationTimeline`, then the > loading time is around 1 second. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31705) Rewrite join condition to conjunctive normal form
[ https://issues.apache.org/jira/browse/SPARK-31705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134039#comment-17134039 ] Apache Spark commented on SPARK-31705: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/28810 > Rewrite join condition to conjunctive normal form > - > > Key: SPARK-31705 > URL: https://issues.apache.org/jira/browse/SPARK-31705 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.1.0 > > > Rewrite join condition to [conjunctive normal > form|https://en.wikipedia.org/wiki/Conjunctive_normal_form] to push more > conditions to filter. > PostgreSQL: > {code:sql} > CREATE TABLE lineitem (l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT, > > l_linenumber INT, l_quantity DECIMAL(10,0), l_extendedprice DECIMAL(10,0), > > l_discount DECIMAL(10,0), l_tax DECIMAL(10,0), l_returnflag varchar(255), > > l_linestatus varchar(255), l_shipdate DATE, l_commitdate DATE, l_receiptdate > DATE, > l_shipinstruct varchar(255), l_shipmode varchar(255), l_comment varchar(255)); > > CREATE TABLE orders ( > o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus varchar(255), > o_totalprice DECIMAL(10,0), o_orderdate DATE, o_orderpriority varchar(255), > o_clerk varchar(255), o_shippriority INT, o_comment varchar(255)); > EXPLAIN > SELECT Count(*) > FROM lineitem, >orders > WHERE l_orderkey = o_orderkey >AND ( ( l_suppkey > 3 >AND o_custkey > 13 ) > OR ( l_suppkey > 1 >AND o_custkey > 11 ) ) >AND l_partkey > 19; > EXPLAIN > SELECT Count(*) > FROM lineitem >JOIN orders > ON l_orderkey = o_orderkey > AND ( ( l_suppkey > 3 > AND o_custkey > 13 ) >OR ( l_suppkey > 1 > AND o_custkey > 11 ) ) > AND l_partkey > 19; > EXPLAIN > SELECT Count(*) > FROM lineitem, >orders > WHERE l_orderkey = o_orderkey >AND NOT ( ( l_suppkey > 3 >AND ( l_suppkey > 2 > OR o_custkey > 13 ) ) > OR ( l_suppkey > 1 >AND o_custkey > 11 ) ) >AND l_partkey > 19; > {code} > {noformat} > postgres=# EXPLAIN > postgres-# SELECT Count(*) > postgres-# FROM lineitem, > postgres-#orders > postgres-# WHERE l_orderkey = o_orderkey > postgres-#AND ( ( l_suppkey > 3 > postgres(#AND o_custkey > 13 ) > postgres(# OR ( l_suppkey > 1 > postgres(#AND o_custkey > 11 ) ) > postgres-#AND l_partkey > 19; >QUERY PLAN > - > Aggregate (cost=21.18..21.19 rows=1 width=8) >-> Hash Join (cost=10.60..21.17 rows=2 width=0) > Hash Cond: (orders.o_orderkey = lineitem.l_orderkey) > Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) > OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11))) > -> Seq Scan on orders (cost=0.00..10.45 rows=17 width=16) >Filter: ((o_custkey > 13) OR (o_custkey > 11)) > -> Hash (cost=10.53..10.53 rows=6 width=16) >-> Seq Scan on lineitem (cost=0.00..10.53 rows=6 width=16) > Filter: ((l_partkey > 19) AND ((l_suppkey > 3) OR > (l_suppkey > 1))) > (9 rows) > postgres=# EXPLAIN > postgres-# SELECT Count(*) > postgres-# FROM lineitem > postgres-#JOIN orders > postgres-# ON l_orderkey = o_orderkey > postgres-# AND ( ( l_suppkey > 3 > postgres(# AND o_custkey > 13 ) > postgres(#OR ( l_suppkey > 1 > postgres(# AND o_custkey > 11 ) ) > postgres-# AND l_partkey > 19; >QUERY PLAN > - > Aggregate (cost=21.18..21.19 rows=1 width=8) >-> Hash Join (cost=10.60..21.17 rows=2 width=0) > Hash Cond: (orders.o_orderkey = lineitem.l_orderkey) > Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) > OR ((lineitem.l_suppkey 
> 1) AND (orders.o_custkey > 11))) > -> Seq Scan on orders (cost=0.00..10.45 rows=17 width=16) >
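To make the rewrite concrete, here is the worked expansion for the example above (an illustration of the CNF transformation, not a claim about Spark's exact implementation): the condition ((l_suppkey > 3 AND o_custkey > 13) OR (l_suppkey > 1 AND o_custkey > 11)) distributes into the conjunctive normal form (l_suppkey > 3 OR l_suppkey > 1) AND (l_suppkey > 3 OR o_custkey > 11) AND (o_custkey > 13 OR l_suppkey > 1) AND (o_custkey > 13 OR o_custkey > 11). The conjuncts that reference only one table, (l_suppkey > 3 OR l_suppkey > 1) and (o_custkey > 13 OR o_custkey > 11), can be pushed below the join as filters on lineitem and orders respectively, which matches the filters in the PostgreSQL plans shown above.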
[jira] [Created] (SPARK-31977) Returns the projected plan directly from NestedColumnAliasing
Hyukjin Kwon created SPARK-31977: Summary: Returns the projected plan directly from NestedColumnAliasing Key: SPARK-31977 URL: https://issues.apache.org/jira/browse/SPARK-31977 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon {{NestedColumnAliasing}} and {{GeneratorNestedColumnAliasing}} have a different usage pattern: {code} case p @ NestedColumnAliasing(nestedFieldToAlias, attrToAliases) => NestedColumnAliasing.replaceToAliases(p, nestedFieldToAlias, attrToAliases) {code} vs {code} case GeneratorNestedColumnAliasing(p) => p {code} It should be better to match it -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31959) Test failure "RebaseDateTimeSuite.optimization of micros rebasing - Gregorian to Julian"
[ https://issues.apache.org/jira/browse/SPARK-31959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134032#comment-17134032 ] Apache Spark commented on SPARK-31959: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/28809 > Test failure "RebaseDateTimeSuite.optimization of micros rebasing - Gregorian > to Julian" > > > Key: SPARK-31959 > URL: https://issues.apache.org/jira/browse/SPARK-31959 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > See > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123688/testReport/org.apache.spark.sql.catalyst.util/RebaseDateTimeSuite/optimization_of_micros_rebasing___Gregorian_to_Julian/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31736) Nested column pruning for other operators
[ https://issues.apache.org/jira/browse/SPARK-31736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31736. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28556 [https://github.com/apache/spark/pull/28556] > Nested column pruning for other operators > - > > Key: SPARK-31736 > URL: https://issues.apache.org/jira/browse/SPARK-31736 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.1.0 > > > Currently we only push nested column pruning through a few operators such as > LIMIT, SAMPLE, etc. This is the ticket for supporting other operators for > nested column pruning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31976) use MemoryUsage to control the size of block
zhengruifeng created SPARK-31976: Summary: use MemoryUsage to control the size of block Key: SPARK-31976 URL: https://issues.apache.org/jira/browse/SPARK-31976 Project: Spark Issue Type: Sub-task Components: ML, PySpark Affects Versions: 3.1.0 Reporter: zhengruifeng According to the performance test in https://issues.apache.org/jira/browse/SPARK-31783, the performance gain is mainly related to the nnz of the block. So it may be reasonable to control the size of a block by memory usage instead of by the number of rows. note1: param blockSize is already used in ALS and MLP to stack vectors (expected to be dense); note2: we may refer to the {{Strategy.maxMemoryInMB}} in tree models; There may be two ways to implement this: 1, compute the sparsity of the input vectors ahead of training (this can be computed along with other statistics, maybe with no extra pass), and infer a reasonable number of vectors to stack; 2, stack the input vectors adaptively, by counting the memory usage in a block; -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31976) use MemoryUsage to control the size of block
[ https://issues.apache.org/jira/browse/SPARK-31976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-31976: - Description: According to the performance test in https://issues.apache.org/jira/browse/SPARK-31783, the performance gain is mainly related to the nnz of block. So it maybe reasonable to control the size of block by memory usage, instead of number of rows. note1: param blockSize had already used in ALS and MLP to stack vectors (expected to be dense); note2: we may refer to the {{Strategy.maxMemoryInMB}} in tree models; There may be two ways to impl: 1, compute the sparsity of input vectors ahead of train (this can be computed with other statistics computation, maybe no extra pass), and infer a reasonable number of vectors to stack; 2, stack the input vectors adaptively, by monitoring the memory usage in a block; was: According to the performance test in https://issues.apache.org/jira/browse/SPARK-31783, the performance gain is mainly related to the nnz of block. So it maybe reasonable to control the size of block by memory usage, instead of number of rows. note1: param blockSize had already used in ALS and MLP to stack vectors (expected to be dense); note2: we may refer to the {{Strategy.maxMemoryInMB}} in tree models; There may be two ways to impl: 1, compute the sparsity of input vectors ahead of train (this can be computed with other statistics computation, maybe no extra pass), and infer a reasonable number of vectors to stack; 2, stack the input vectors adaptively, by counting the memory usage in a block; > use MemoryUsage to control the size of block > > > Key: SPARK-31976 > URL: https://issues.apache.org/jira/browse/SPARK-31976 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Major > > According to the performance test in > https://issues.apache.org/jira/browse/SPARK-31783, the performance gain is > mainly related to the nnz of block. > So it maybe reasonable to control the size of block by memory usage, instead > of number of rows. > > note1: param blockSize had already used in ALS and MLP to stack vectors > (expected to be dense); > note2: we may refer to the {{Strategy.maxMemoryInMB}} in tree models; > > There may be two ways to impl: > 1, compute the sparsity of input vectors ahead of train (this can be computed > with other statistics computation, maybe no extra pass), and infer a > reasonable number of vectors to stack; > 2, stack the input vectors adaptively, by monitoring the memory usage in a > block; -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
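As a rough illustration of the second approach (adaptive stacking), one might buffer vectors until an estimated byte budget is reached; the per-nonzero size estimate and the API shape below are assumptions made for the sketch, not a proposed implementation.
{code:java}
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.ml.linalg.Vector

// Sketch: group vectors into blocks bounded by an estimated memory budget
// rather than by a fixed number of rows. The 12-bytes-per-nonzero estimate
// is a placeholder, not a measured constant.
def adaptiveBlocks(rows: Iterator[Vector], maxMemoryInMB: Int): Iterator[Array[Vector]] =
  new Iterator[Array[Vector]] {
    private val budgetBytes = maxMemoryInMB.toLong * 1024 * 1024
    override def hasNext: Boolean = rows.hasNext
    override def next(): Array[Vector] = {
      val buf = ArrayBuffer.empty[Vector]
      var bytes = 0L
      // Always take at least one vector so the iterator makes progress.
      while (rows.hasNext && (buf.isEmpty || bytes < budgetBytes)) {
        val v = rows.next()
        bytes += 12L * v.numNonzeros + 16L
        buf += v
      }
      buf.toArray
    }
  }
{code}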
[jira] [Commented] (SPARK-31975) Throw user facing error when use WindowFunction directly
[ https://issues.apache.org/jira/browse/SPARK-31975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134000#comment-17134000 ] Apache Spark commented on SPARK-31975: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/28808 > Throw user facing error when use WindowFunction directly > > > Key: SPARK-31975 > URL: https://issues.apache.org/jira/browse/SPARK-31975 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31975) Throw user facing error when use WindowFunction directly
[ https://issues.apache.org/jira/browse/SPARK-31975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31975: Assignee: (was: Apache Spark) > Throw user facing error when use WindowFunction directly > > > Key: SPARK-31975 > URL: https://issues.apache.org/jira/browse/SPARK-31975 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31975) Throw user facing error when use WindowFunction directly
[ https://issues.apache.org/jira/browse/SPARK-31975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133999#comment-17133999 ] Apache Spark commented on SPARK-31975: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/28808 > Throw user facing error when use WindowFunction directly > > > Key: SPARK-31975 > URL: https://issues.apache.org/jira/browse/SPARK-31975 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31975) Throw user facing error when use WindowFunction directly
[ https://issues.apache.org/jira/browse/SPARK-31975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31975: Assignee: Apache Spark > Throw user facing error when use WindowFunction directly > > > Key: SPARK-31975 > URL: https://issues.apache.org/jira/browse/SPARK-31975 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org