[jira] [Issue Comment Deleted] (SPARK-33090) Upgrade Google Guava
[ https://issues.apache.org/jira/browse/SPARK-33090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen Coy updated SPARK-33090: Comment: was deleted (was: Created PR) > Upgrade Google Guava > > > Key: SPARK-33090 > URL: https://issues.apache.org/jira/browse/SPARK-33090 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.1 >Reporter: Stephen Coy >Priority: Major > > Hadoop versions newer than 3.2.0 (such as 3.2.1 and 3.3.0) have started using > features from newer versions of Google Guava. > This leads to MethodNotFound exceptions, etc in Spark builds that specify > newer versions of Hadoop. I believe this is due to the use of new methods in > com.google.common.base.Preconditions. > The above versions of Hadoop use guava-27.0-jre, whereas Spark is currently > glued to guava-14.0.1. > I have been running a Spark cluster with the version bumped to guava-29.0-jre > without issue. > Partly due to the way Spark is built, this change is a little more > complicated that just changing the version, because newer versions of guava > have a new dependency on com.google.guava:failureaccess:1.0. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33071) Join with ambiguous column succeeding but giving wrong output
[ https://issues.apache.org/jira/browse/SPARK-33071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212862#comment-17212862 ] angerszhu commented on SPARK-33071: --- it is wrong, I am working on this will rase a pr > Join with ambiguous column succeeding but giving wrong output > - > > Key: SPARK-33071 > URL: https://issues.apache.org/jira/browse/SPARK-33071 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.0.1 >Reporter: George >Priority: Major > Labels: correctness > > When joining two datasets where one column in each dataset is sourced from > the same input dataset, the join successfully runs, but does not select the > correct columns, leading to incorrect output. > Repro using pyspark: > {code:java} > sc.version > import pyspark.sql.functions as F > d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, 'units' > : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', 'sales': 1, > 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}] > input_df = spark.createDataFrame(d) > df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales')) > df2 = input_df.groupBy("key").agg(F.sum('units').alias('units')) > df1 = df1.filter(F.col("key") != F.lit("c")) > df2 = df2.filter(F.col("key") != F.lit("d")) > ret = df1.join(df2, df1.key == df2.key, "full").select( > df1["key"].alias("df1_key"), > df2["key"].alias("df2_key"), > df1["sales"], > df2["units"], > F.coalesce(df1["key"], df2["key"]).alias("key")) > ret.show() > ret.explain(){code} > output for 2.4.4: > {code:java} > >>> sc.version > u'2.4.4' > >>> import pyspark.sql.functions as F > >>> d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, > >>> 'units' : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', > >>> 'sales': 1, 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}] > >>> input_df = spark.createDataFrame(d) > >>> df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales')) > >>> df2 = input_df.groupBy("key").agg(F.sum('units').alias('units')) > >>> df1 = df1.filter(F.col("key") != F.lit("c")) > >>> df2 = df2.filter(F.col("key") != F.lit("d")) > >>> ret = df1.join(df2, df1.key == df2.key, "full").select( > ... df1["key"].alias("df1_key"), > ... df2["key"].alias("df2_key"), > ... df1["sales"], > ... df2["units"], > ... F.coalesce(df1["key"], df2["key"]).alias("key")) > 20/10/05 15:46:14 WARN Column: Constructing trivially true equals predicate, > 'key#213 = key#213'. Perhaps you need to use aliases. > >>> ret.show() > +---+---+-+-++ > |df1_key|df2_key|sales|units| key| > +---+---+-+-++ > | d| d|3| null| d| > | null| null| null|2|null| > | b| b|5| 10| b| > | a| a|3|6| a| > +---+---+-+-++>>> ret.explain() > == Physical Plan == > *(5) Project [key#213 AS df1_key#258, key#213 AS df2_key#259, sales#223L, > units#230L, coalesce(key#213, key#213) AS key#260] > +- SortMergeJoin [key#213], [key#237], FullOuter >:- *(2) Sort [key#213 ASC NULLS FIRST], false, 0 >: +- *(2) HashAggregate(keys=[key#213], functions=[sum(sales#214L)]) >: +- Exchange hashpartitioning(key#213, 200) >:+- *(1) HashAggregate(keys=[key#213], > functions=[partial_sum(sales#214L)]) >: +- *(1) Project [key#213, sales#214L] >: +- *(1) Filter (isnotnull(key#213) && NOT (key#213 = c)) >: +- Scan ExistingRDD[key#213,sales#214L,units#215L] >+- *(4) Sort [key#237 ASC NULLS FIRST], false, 0 > +- *(4) HashAggregate(keys=[key#237], functions=[sum(units#239L)]) > +- Exchange hashpartitioning(key#237, 200) > +- *(3) HashAggregate(keys=[key#237], > functions=[partial_sum(units#239L)]) >+- *(3) Project [key#237, units#239L] > +- *(3) Filter (isnotnull(key#237) && NOT (key#237 = d)) > +- Scan ExistingRDD[key#237,sales#238L,units#239L] > {code} > output for 3.0.1: > {code:java} > // code placeholder > >>> sc.version > u'3.0.1' > >>> import pyspark.sql.functions as F > >>> d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, > >>> 'units' : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', > >>> 'sales': 1, 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}] > >>> input_df = spark.createDataFrame(d) > /usr/local/lib/python2.7/site-packages/pyspark/sql/session.py:381: > UserWarning: inferring schema from dict is deprecated,please use > pyspark.sql.Row instead > warnings.warn("inferring schema from dict is deprecated," > >>> df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales')) > >>> df2 = input_df.groupBy("key").agg(F.sum('units').alias('units')) > >>> df1 =
[jira] [Updated] (SPARK-33121) Spark does not shutdown gracefully
[ https://issues.apache.org/jira/browse/SPARK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitry Tverdokhleb updated SPARK-33121: --- Description: Hi. I am trying to migrate from spark 2.4.5 to 3.0.1 and there is a problem in graceful shutdown. Config parameter "spark.streaming.stopGracefullyOnShutdown" is set as "true". Here is the code: {code:java} inputStream.foreachRDD { rdd => rdd .mapPartitions { // Some operations mapPartitions } .filter { // Some operations filter } .groupBy { // Some operatons groupBy } } {code} I send a SIGTERM signal to stop the spark streaming, but exception arrises: {noformat} 2020-10-12 14:12:29 ERROR Inbox - Ignoring error java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 6] at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063) at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379) at org.apache.spark.executor.Executor.launchTask(Executor.scala:230) at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93) at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91) at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68) at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748){noformat} A piece of logs: {noformat} ... Calling rdd.mapPartitions ... Sending SIGTERM signal ... 2020-10-12 14:12:22 INFO MyProject - Shutdown hook called 2020-10-12 14:12:22 DEBUG JobScheduler - Stopping JobScheduler 2020-10-12 14:12:22 INFO ReceiverTracker - ReceiverTracker stopped 2020-10-12 14:12:22 INFO JobGenerator - Stopping JobGenerator gracefully 2020-10-12 14:12:22 INFO JobGenerator - Waiting for all received blocks to be consumed for job generation 2020-10-12 14:12:22 INFO JobGenerator - Waited for all received blocks to be consumed for job generation ... Calling rdd.filter 2020-10-12 14:12:29 ERROR Inbox - Ignoring error java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 6]{noformat} After this exception, streaming freezes and halts by timeout (Config parameter "hadoop.service.shutdown.timeout"). Pay attention, this exception arises only for RDD operations (Like map, filter, etc.), business logic is processing normally without any errors. Besides, there is no problem with graceful shutdown in spark 2.4.5. was: Hi. I am trying to migrate from spark 2.4.5 to 3.0.1 and there is a problem in graceful shutdown. Config parameter "spark.streaming.stopGracefullyOnShutdown" is set as "true". Here is the code: {code:java} inputStream.foreachRDD { rdd => rdd .mapPartitions { // Some operations mapPartitions } .filter { // Some operations filter } .groupBy { // Some operatons groupBy } } {code} I send a SIGTERM signal to stop the spark streaming, but exception arrises: {noformat} 2020-10-12 14:12:29 ERROR Inbox - Ignoring error java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from java.util.concurrent.Thre
[jira] [Updated] (SPARK-33121) Spark does not shutdown gracefully
[ https://issues.apache.org/jira/browse/SPARK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitry Tverdokhleb updated SPARK-33121: --- Description: Hi. I am trying to migrate from spark 2.4.5 to 3.0.1 and there is a problem in graceful shutdown. Config parameter "spark.streaming.stopGracefullyOnShutdown" is set as "true". Here is the code: {code:java} inputStream.foreachRDD { rdd => rdd .mapPartitions { // Some operations mapPartitions } .filter { // Some operations filter } .groupBy { // Some operatons groupBy } } {code} I send a SIGTERM signal to stop the spark streaming, but exception arrises: {noformat} 2020-10-12 14:12:29 ERROR Inbox - Ignoring error java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 6] at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063) at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379) at org.apache.spark.executor.Executor.launchTask(Executor.scala:230) at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93) at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91) at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68) at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748){noformat} A piece of logs: {noformat} ... Calling rdd.mapPartitions ... Sending SIGTERM signal ... 2020-10-12 14:12:22 INFO MyProject - Shutdown hook called 2020-10-12 14:12:22 DEBUG JobScheduler - Stopping JobScheduler 2020-10-12 14:12:22 INFO ReceiverTracker - ReceiverTracker stopped 2020-10-12 14:12:22 INFO JobGenerator - Stopping JobGenerator gracefully 2020-10-12 14:12:22 INFO JobGenerator - Waiting for all received blocks to be consumed for job generation 2020-10-12 14:12:22 INFO JobGenerator - Waited for all received blocks to be consumed for job generation ... Calling rdd.filter 2020-10-12 14:12:29 ERROR Inbox - Ignoring error java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 6]{noformat} This exception arises only for RDD operations (Like map, filter, etc.), not business logic. Besides, there is no problem with graceful shutdown in spark 2.4.5. was: Hi. I am trying to migrate from spark 2.4.5 to 3.0.1 and there is a problem in graceful shutdown. Config parameter "spark.streaming.stopGracefullyOnShutdown" is set as "true". Here is the code: {code:java} inputStream.foreachRDD { rdd => rdd .mapPartitions { // Some operations mapPartitions } .filter { // Some operations filter } .groupBy { // Some operatons groupBy } } {code} I send a SIGTERM signal to stop the spark streaming, but exception arrises: {noformat} 2020-10-12 14:12:29 ERROR Inbox - Ignoring error java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 6] at java.util.concurrent.ThreadPoolExecutor$AbortPo
[jira] [Updated] (SPARK-33121) Spark does not shutdown gracefully
[ https://issues.apache.org/jira/browse/SPARK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitry Tverdokhleb updated SPARK-33121: --- Description: Hi. I am trying to migrate from spark 2.4.5 to 3.0.1 and there is a problem in graceful shutdown. Config parameter "spark.streaming.stopGracefullyOnShutdown" is set as "true". Here is the code: {code:java} inputStream.foreachRDD { rdd => rdd .mapPartitions { // Some operations mapPartitions } .filter { // Some operations filter } .groupBy { // Some operatons groupBy } } {code} I send a SIGTERM signal to stop the spark streaming, but exception arrises: {noformat} 2020-10-12 14:12:29 ERROR Inbox - Ignoring error java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 6] at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063) at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379) at org.apache.spark.executor.Executor.launchTask(Executor.scala:230) at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93) at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91) at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68) at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748){noformat} A piece of logs: {noformat} ... Calling rdd.mapPartitions ... Sending SIGTERM signal ... 2020-10-12 14:12:22 INFO MyProject - Shutdown hook called 2020-10-12 14:12:22 DEBUG JobScheduler - Stopping JobScheduler 2020-10-12 14:12:22 INFO ReceiverTracker - ReceiverTracker stopped 2020-10-12 14:12:22 INFO JobGenerator - Stopping JobGenerator gracefully 2020-10-12 14:12:22 INFO JobGenerator - Waiting for all received blocks to be consumed for job generation 2020-10-12 14:12:22 INFO JobGenerator - Waited for all received blocks to be consumed for job generation ... Calling rdd.filter 2020-10-12 14:12:29 ERROR Inbox - Ignoring error java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 6]{noformat} This exception arises only for RDD operations (Like map, filter, etc.), not business logic. Besides, there is no problem with graceful shutdown in spark 2.4.5. was: Hi. I am trying to migrate from spark 2.4.5 to 3.0.1 and there is a problem in graceful shutdown. Config parameter "spark.streaming.stopGracefullyOnShutdown" is set as "true". Here is the code: {code:java} inputStream.foreachRDD { rdd => rdd .mapPartitions { // Some operations mapPartitions } .filter { // Some operations filter } .groupBy { // Some operatons groupBy } } {code} I send a SIGTERM signal to stop the spark streaming, but exception arrises: {noformat} 2020-10-12 14:12:29 ERROR Inbox - Ignoring error java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 6] at java.util.concurrent.ThreadPoolExecutor$AbortP
[jira] [Updated] (SPARK-33121) Spark does not shutdown gracefully
[ https://issues.apache.org/jira/browse/SPARK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitry Tverdokhleb updated SPARK-33121: --- Description: Hi. I am trying to migrate from spark 2.4.5 to 3.0.1 and there is a problem in graceful shutdown. Config parameter "spark.streaming.stopGracefullyOnShutdown" is set as "true". Here is the code: {code:java} inputStream.foreachRDD { rdd => rdd .mapPartitions { // Some operations mapPartitions } .filter { // Some operations filter } .groupBy { // Some operatons groupBy } } {code} I send a SIGTERM signal to stop the spark streaming, but exception arrises: {noformat} 2020-10-12 14:12:29 ERROR Inbox - Ignoring error java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 6] at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063) at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379) at org.apache.spark.executor.Executor.launchTask(Executor.scala:230) at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93) at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91) at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68) at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748){noformat} A piece of logs: {noformat} ... Calling rdd.mapPartitions ... 2020-10-12 14:12:22 INFO MyProject - Shutdown hook called 2020-10-12 14:12:22 DEBUG JobScheduler - Stopping JobScheduler 2020-10-12 14:12:22 INFO ReceiverTracker - ReceiverTracker stopped 2020-10-12 14:12:22 INFO JobGenerator - Stopping JobGenerator gracefully 2020-10-12 14:12:22 INFO JobGenerator - Waiting for all received blocks to be consumed for job generation 2020-10-12 14:12:22 INFO JobGenerator - Waited for all received blocks to be consumed for job generation ... Calling rdd.filter 2020-10-12 14:12:29 ERROR Inbox - Ignoring error java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 6]{noformat} This exception arises only for RDD operations (Like map, filter, etc.), not business logic. Besides, there is no problem with graceful shutdown in spark 2.4.5. was: Hi. I have a spark streaming code, like: {code:java} inputStream.foreachRDD { rdd => rdd .mapPartitions { // Some operations mapPartitions } .filter { // Some operations filter } .groupBy { // Some operatons groupBy } } {code} I set the config parameter "spark.streaming.stopGracefullyOnShutdown" as "*true*" and send a SIGTERM signal to stop the spark streaming, but an exception arrises: {noformat} 2020-10-12 14:12:29 ERROR Inbox - Ignoring error java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 6] at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063) at java.util.concurrent.ThreadPo
[jira] [Updated] (SPARK-33121) Spark does not shutdown gracefully
[ https://issues.apache.org/jira/browse/SPARK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitry Tverdokhleb updated SPARK-33121: --- Summary: Spark does not shutdown gracefully (was: Spark does not stop gracefully) > Spark does not shutdown gracefully > -- > > Key: SPARK-33121 > URL: https://issues.apache.org/jira/browse/SPARK-33121 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 3.0.1 >Reporter: Dmitry Tverdokhleb >Priority: Major > > Hi. I have a spark streaming code, like: > {code:java} > inputStream.foreachRDD { > rdd => > rdd > .mapPartitions { > // Some operations mapPartitions > } > .filter { > // Some operations filter > } > .groupBy { > // Some operatons groupBy > } > } > {code} > I set the config parameter "spark.streaming.stopGracefullyOnShutdown" as > "*true*" and send a SIGTERM signal to stop the spark streaming, but an > exception arrises: > {noformat} > 2020-10-12 14:12:29 ERROR Inbox - Ignoring error > java.util.concurrent.RejectedExecutionException: Task > org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from > java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, > active threads = 0, queued tasks = 0, completed tasks = 6] > at > java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063) > at > java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379) > at org.apache.spark.executor.Executor.launchTask(Executor.scala:230) > at > org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93) > at > org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.Iterator.foreach$(Iterator.scala:941) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at > org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91) > at > org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68) > at > org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > at > org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) > at > org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748){noformat} > Logs: > {noformat} > ... > Calling rdd.mapPartitions > ... > 2020-10-12 14:12:22 INFO MyProject - Shutdown hook called > 2020-10-12 14:12:22 DEBUG JobScheduler - Stopping JobScheduler > 2020-10-12 14:12:22 INFO ReceiverTracker - ReceiverTracker stopped > 2020-10-12 14:12:22 INFO JobGenerator - Stopping JobGenerator gracefully > 2020-10-12 14:12:22 INFO JobGenerator - Waiting for all received blocks to > be consumed for job generation > 2020-10-12 14:12:22 INFO JobGenerator - Waited for all received blocks to be > consumed for job generation > ... > Calling rdd.filter > 2020-10-12 14:12:29 ERROR Inbox - Ignoring error > java.util.concurrent.RejectedExecutionException: Task > org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from > java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, > active threads = 0, queued tasks = 0, completed tasks = 6]{noformat} > This exception arises only for RDD operations (Like map, filter, etc.), not > business logic. > Besides, there is no problem with graceful shutdown in spark 2.4.5. > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33090) Upgrade Google Guava
[ https://issues.apache.org/jira/browse/SPARK-33090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212826#comment-17212826 ] Stephen Coy commented on SPARK-33090: - Created PR > Upgrade Google Guava > > > Key: SPARK-33090 > URL: https://issues.apache.org/jira/browse/SPARK-33090 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.1 >Reporter: Stephen Coy >Priority: Major > > Hadoop versions newer than 3.2.0 (such as 3.2.1 and 3.3.0) have started using > features from newer versions of Google Guava. > This leads to MethodNotFound exceptions, etc in Spark builds that specify > newer versions of Hadoop. I believe this is due to the use of new methods in > com.google.common.base.Preconditions. > The above versions of Hadoop use guava-27.0-jre, whereas Spark is currently > glued to guava-14.0.1. > I have been running a Spark cluster with the version bumped to guava-29.0-jre > without issue. > Partly due to the way Spark is built, this change is a little more > complicated that just changing the version, because newer versions of guava > have a new dependency on com.google.guava:failureaccess:1.0. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33090) Upgrade Google Guava
[ https://issues.apache.org/jira/browse/SPARK-33090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212825#comment-17212825 ] Apache Spark commented on SPARK-33090: -- User 'sfcoy' has created a pull request for this issue: https://github.com/apache/spark/pull/30022 > Upgrade Google Guava > > > Key: SPARK-33090 > URL: https://issues.apache.org/jira/browse/SPARK-33090 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.1 >Reporter: Stephen Coy >Priority: Major > > Hadoop versions newer than 3.2.0 (such as 3.2.1 and 3.3.0) have started using > features from newer versions of Google Guava. > This leads to MethodNotFound exceptions, etc in Spark builds that specify > newer versions of Hadoop. I believe this is due to the use of new methods in > com.google.common.base.Preconditions. > The above versions of Hadoop use guava-27.0-jre, whereas Spark is currently > glued to guava-14.0.1. > I have been running a Spark cluster with the version bumped to guava-29.0-jre > without issue. > Partly due to the way Spark is built, this change is a little more > complicated that just changing the version, because newer versions of guava > have a new dependency on com.google.guava:failureaccess:1.0. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33090) Upgrade Google Guava
[ https://issues.apache.org/jira/browse/SPARK-33090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33090: Assignee: (was: Apache Spark) > Upgrade Google Guava > > > Key: SPARK-33090 > URL: https://issues.apache.org/jira/browse/SPARK-33090 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.1 >Reporter: Stephen Coy >Priority: Major > > Hadoop versions newer than 3.2.0 (such as 3.2.1 and 3.3.0) have started using > features from newer versions of Google Guava. > This leads to MethodNotFound exceptions, etc in Spark builds that specify > newer versions of Hadoop. I believe this is due to the use of new methods in > com.google.common.base.Preconditions. > The above versions of Hadoop use guava-27.0-jre, whereas Spark is currently > glued to guava-14.0.1. > I have been running a Spark cluster with the version bumped to guava-29.0-jre > without issue. > Partly due to the way Spark is built, this change is a little more > complicated that just changing the version, because newer versions of guava > have a new dependency on com.google.guava:failureaccess:1.0. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33090) Upgrade Google Guava
[ https://issues.apache.org/jira/browse/SPARK-33090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33090: Assignee: Apache Spark > Upgrade Google Guava > > > Key: SPARK-33090 > URL: https://issues.apache.org/jira/browse/SPARK-33090 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.1 >Reporter: Stephen Coy >Assignee: Apache Spark >Priority: Major > > Hadoop versions newer than 3.2.0 (such as 3.2.1 and 3.3.0) have started using > features from newer versions of Google Guava. > This leads to MethodNotFound exceptions, etc in Spark builds that specify > newer versions of Hadoop. I believe this is due to the use of new methods in > com.google.common.base.Preconditions. > The above versions of Hadoop use guava-27.0-jre, whereas Spark is currently > glued to guava-14.0.1. > I have been running a Spark cluster with the version bumped to guava-29.0-jre > without issue. > Partly due to the way Spark is built, this change is a little more > complicated that just changing the version, because newer versions of guava > have a new dependency on com.google.guava:failureaccess:1.0. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33125) Improve the error when Lead and Lag are not allowed to specify window frame
[ https://issues.apache.org/jira/browse/SPARK-33125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212805#comment-17212805 ] Apache Spark commented on SPARK-33125: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/30021 > Improve the error when Lead and Lag are not allowed to specify window frame > --- > > Key: SPARK-33125 > URL: https://issues.apache.org/jira/browse/SPARK-33125 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > Except for Postgresql, other data sources (for example: vertica, oracle, > redshift, mysql, presto) are not allowed to specify window frame for the Lead > and Lag functions. > But the current error message is not clear enough. > {code:java} > Window Frame $f must match the required frame > {code} > The following error message is better. > {code:java} > Cannot specify window frame for lead function > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33125) Improve the error when Lead and Lag are not allowed to specify window frame
[ https://issues.apache.org/jira/browse/SPARK-33125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33125: Assignee: Apache Spark > Improve the error when Lead and Lag are not allowed to specify window frame > --- > > Key: SPARK-33125 > URL: https://issues.apache.org/jira/browse/SPARK-33125 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > > Except for Postgresql, other data sources (for example: vertica, oracle, > redshift, mysql, presto) are not allowed to specify window frame for the Lead > and Lag functions. > But the current error message is not clear enough. > {code:java} > Window Frame $f must match the required frame > {code} > The following error message is better. > {code:java} > Cannot specify window frame for lead function > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27733) Upgrade to Avro 1.10.0
[ https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212804#comment-17212804 ] Chao Sun commented on SPARK-27733: -- [~iemejia] I can help with Hive releases if you can come up with fixes that proved workable for Spark side. Just got the CI working for branch-2.3 so merging PR should be relatively easier now. > Upgrade to Avro 1.10.0 > -- > > Key: SPARK-27733 > URL: https://issues.apache.org/jira/browse/SPARK-27733 > Project: Spark > Issue Type: Improvement > Components: Build, SQL >Affects Versions: 3.1.0 >Reporter: Ismaël Mejía >Priority: Minor > > Avro 1.9.2 was released with many nice features including reduced size (1MB > less), and removed dependencies, no paranamer, no shaded guava, security > updates, so probably a worth upgrade. > Avro 1.10.0 was released and this is still not done. > There is at the moment (2020/08) still a blocker because of Hive related > transitive dependencies bringing older versions of Avro, so we could say that > this is somehow still blocked until HIVE-21737 is solved. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33125) Improve the error when Lead and Lag are not allowed to specify window frame
[ https://issues.apache.org/jira/browse/SPARK-33125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212803#comment-17212803 ] Apache Spark commented on SPARK-33125: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/30021 > Improve the error when Lead and Lag are not allowed to specify window frame > --- > > Key: SPARK-33125 > URL: https://issues.apache.org/jira/browse/SPARK-33125 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > Except for Postgresql, other data sources (for example: vertica, oracle, > redshift, mysql, presto) are not allowed to specify window frame for the Lead > and Lag functions. > But the current error message is not clear enough. > {code:java} > Window Frame $f must match the required frame > {code} > The following error message is better. > {code:java} > Cannot specify window frame for lead function > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33125) Improve the error when Lead and Lag are not allowed to specify window frame
[ https://issues.apache.org/jira/browse/SPARK-33125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33125: Assignee: (was: Apache Spark) > Improve the error when Lead and Lag are not allowed to specify window frame > --- > > Key: SPARK-33125 > URL: https://issues.apache.org/jira/browse/SPARK-33125 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > Except for Postgresql, other data sources (for example: vertica, oracle, > redshift, mysql, presto) are not allowed to specify window frame for the Lead > and Lag functions. > But the current error message is not clear enough. > {code:java} > Window Frame $f must match the required frame > {code} > The following error message is better. > {code:java} > Cannot specify window frame for lead function > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33125) Improve the error when Lead and Lag are not allowed to specify window frame
jiaan.geng created SPARK-33125: -- Summary: Improve the error when Lead and Lag are not allowed to specify window frame Key: SPARK-33125 URL: https://issues.apache.org/jira/browse/SPARK-33125 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: jiaan.geng Except for Postgresql, other data sources (for example: vertica, oracle, redshift, mysql, presto) are not allowed to specify window frame for the Lead and Lag functions. But the current error message is not clear enough. {code:java} Window Frame $f must match the required frame {code} The following error message is better. {code:java} Cannot specify window frame for lead function {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30893) Expressions should not change its data type/nullability after it's created
[ https://issues.apache.org/jira/browse/SPARK-30893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212749#comment-17212749 ] Anurag Mantripragada commented on SPARK-30893: -- [~maropu], [~dongjoon] - I went through the PRs for individual issues in this Umbrella and looked at the code changes. Only [SPARK-30894|https://issues.apache.org/jira/browse/SPARK-30894] seems to affect branch-2.4. I've commented on that Jira separately asking the original author if we can backport this to branch-2.4. > Expressions should not change its data type/nullability after it's created > -- > > Key: SPARK-30893 > URL: https://issues.apache.org/jira/browse/SPARK-30893 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Critical > Fix For: 3.0.0 > > > This is a problem because the configuration can change between different > phases of planning, and this can silently break a query plan which can lead > to crashes or data corruption, if data type/nullability gets changed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30894) The nullability of Size function should not depend on SQLConf.get
[ https://issues.apache.org/jira/browse/SPARK-30894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212748#comment-17212748 ] Anurag Mantripragada commented on SPARK-30894: -- [~maxgekk] - Looking at the code in branch-2.4, looks like this could be an issue there - [https://github.com/apache/spark/blob/652e5746019b95b78af4d36c23ec5155bb22325b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L94] Should we backport this to branch-2.4 since it is LTS? > The nullability of Size function should not depend on SQLConf.get > - > > Key: SPARK-30894 > URL: https://issues.apache.org/jira/browse/SPARK-30894 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Maxim Gekk >Priority: Blocker > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27292) Spark Job Fails with Unknown Error writing to S3 from AWS EMR
[ https://issues.apache.org/jira/browse/SPARK-27292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212746#comment-17212746 ] Venkata commented on SPARK-27292: - [~hyukjin.kwon] Could you let me know how you resolved this issue? > Spark Job Fails with Unknown Error writing to S3 from AWS EMR > - > > Key: SPARK-27292 > URL: https://issues.apache.org/jira/browse/SPARK-27292 > Project: Spark > Issue Type: Question > Components: Input/Output >Affects Versions: 2.3.2 >Reporter: Olalekan Elesin >Priority: Major > > I am currently experiencing issues writing data to S3 from my Spark Job > running on AWS EMR. > The job writings to some staging path in S3 e.g > \{{.spark-random-alphanumeric}}. After which it fails with this error: > {code:java} > 9/03/26 10:54:07 WARN AsyncEventQueue: Dropped 196300 events from appStatus > since Tue Mar 26 10:52:05 UTC 2019. > 19/03/26 10:55:07 WARN AsyncEventQueue: Dropped 211186 events from appStatus > since Tue Mar 26 10:54:07 UTC 2019. > 19/03/26 11:37:09 WARN DataStreamer: Exception for > BP-312054361-10.41.97.71-1553586781241:blk_1073742995_2172 > java.io.EOFException: Unexpected EOF while trying to read response from server > at > org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:402) > at > org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213) > at > org.apache.hadoop.hdfs.DataStreamer$ResponseProcessor.run(DataStreamer.java:1073) > 19/03/26 11:37:09 WARN DataStreamer: Error Recovery for > BP-312054361-10.41.97.71-1553586781241:blk_1073742995_2172 in pipeline > [DatanodeInfoWithStorage[10.41.121.135:50010,DS-cba2a850-fa30-4933-af2a-05b40b58fdb5,DISK], > > DatanodeInfoWithStorage[10.41.71.181:50010,DS-c90a1d87-b40a-4928-a709-1aef027db65a,DISK]]: > datanode > 0(DatanodeInfoWithStorage[10.41.121.135:50010,DS-cba2a850-fa30-4933-af2a-05b40b58fdb5,DISK]) > is bad. > 19/03/26 11:50:34 WARN AsyncEventQueue: Dropped 157572 events from appStatus > since Tue Mar 26 10:55:07 UTC 2019. > 19/03/26 11:51:34 WARN AsyncEventQueue: Dropped 785 events from appStatus > since Tue Mar 26 11:50:34 UTC 2019. > 19/03/26 11:52:34 WARN AsyncEventQueue: Dropped 656 events from appStatus > since Tue Mar 26 11:51:34 UTC 2019. > 19/03/26 11:53:35 WARN AsyncEventQueue: Dropped 1335 events from appStatus > since Tue Mar 26 11:52:34 UTC 2019. > 19/03/26 11:54:35 WARN AsyncEventQueue: Dropped 1087 events from appStatus > since Tue Mar 26 11:53:35 UTC 2019. > ... > 19/03/26 13:39:39 WARN TaskSetManager: Lost task 33302.0 in stage 1444.0 (TID > 1324427, ip-10-41-122-224.eu-west-1.compute.internal, executor 18): > org.apache.spark.SparkException: Task failed while writing rows. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:254) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:168) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: > Your socket connection to the server was not read from or written to within > the timeout period. Idle connections will be closed. (Service: Amazon S3; > Status Code: 400; Error Code: RequestTimeout; Request ID: 4E2E351899CDFB89; > S3 Extended Request ID: > iQhU4xTloYk9aTvO2FmDXk03M1pYCRQl539bG6PqEOeZrtw4KeAGRZDek9RugJywREfPmAC99FE=), > S3 Extended Request ID: > iQhU4xTloYk9aTvO2FmDXk03M1pYCRQl539bG6PqEOeZrtw4KeAGRZDek9RugJywREfPmAC99FE= > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1658) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1322) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpCl
[jira] [Commented] (SPARK-30893) Expressions should not change its data type/nullability after it's created
[ https://issues.apache.org/jira/browse/SPARK-30893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212731#comment-17212731 ] Takeshi Yamamuro commented on SPARK-30893: -- You find any data corruption in branch-2.4? If so, I think we should backport the PRs that are necessary to fix it. > Expressions should not change its data type/nullability after it's created > -- > > Key: SPARK-30893 > URL: https://issues.apache.org/jira/browse/SPARK-30893 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Critical > Fix For: 3.0.0 > > > This is a problem because the configuration can change between different > phases of planning, and this can silently break a query plan which can lead > to crashes or data corruption, if data type/nullability gets changed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33124) Adds a group tag in all the expressions for built-in functions
Takeshi Yamamuro created SPARK-33124: Summary: Adds a group tag in all the expressions for built-in functions Key: SPARK-33124 URL: https://issues.apache.org/jira/browse/SPARK-33124 Project: Spark Issue Type: Documentation Components: SQL Affects Versions: 3.1.0 Reporter: Takeshi Yamamuro We've added the script to automatically generate documents for built-in functions in SPARK-31429. To generate docs, we need to add a `group` tag in `ExpressionDescription`. Currently, a part of built-in functions has the tag, so we need to finish adding it to all the builtin-functions for better documentations. This ticket comes from the talk in https://github.com/apache/spark/pull/28224#issuecomment-707195753 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30893) Expressions should not change its data type/nullability after it's created
[ https://issues.apache.org/jira/browse/SPARK-30893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212717#comment-17212717 ] Anurag Mantripragada commented on SPARK-30893: -- Sorry for not being clear. I was referring to this comment. ??For data type and nullability, I think we should fix before 3.0, as they can lead to data corruption.?? ??For other behaviors, we can have more discussion and wait for 3.1?? > Expressions should not change its data type/nullability after it's created > -- > > Key: SPARK-30893 > URL: https://issues.apache.org/jira/browse/SPARK-30893 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Critical > Fix For: 3.0.0 > > > This is a problem because the configuration can change between different > phases of planning, and this can silently break a query plan which can lead > to crashes or data corruption, if data type/nullability gets changed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30893) Expressions should not change its data type/nullability after it's created
[ https://issues.apache.org/jira/browse/SPARK-30893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212715#comment-17212715 ] Dongjoon Hyun commented on SPARK-30893: --- [~anuragmantri]. Which comment are you referring specifically? (cc [~dbtsai]) > Expressions should not change its data type/nullability after it's created > -- > > Key: SPARK-30893 > URL: https://issues.apache.org/jira/browse/SPARK-30893 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Critical > Fix For: 3.0.0 > > > This is a problem because the configuration can change between different > phases of planning, and this can silently break a query plan which can lead > to crashes or data corruption, if data type/nullability gets changed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates
[ https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212711#comment-17212711 ] Prakash Rajendran commented on SPARK-24434: --- Hello experts, I have requirement to spin a sidecar container through spark-submit in k8s. Is there a way to create a sidecar using *spark.kubernetes.driver.podTemplateFile ?* or anyother way to spin sidecar container through spark-submit? Appreciate your inputs on this. > Support user-specified driver and executor pod templates > > > Key: SPARK-24434 > URL: https://issues.apache.org/jira/browse/SPARK-24434 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, Spark Core >Affects Versions: 2.4.0 >Reporter: Yinan Li >Assignee: Onur Satici >Priority: Major > Fix For: 3.0.0 > > > With more requests for customizing the driver and executor pods coming, the > current approach of adding new Spark configuration options has some serious > drawbacks: 1) it means more Kubernetes specific configuration options to > maintain, and 2) it widens the gap between the declarative model used by > Kubernetes and the configuration model used by Spark. We should start > designing a solution that allows users to specify pod templates as central > places for all customization needs for the driver and executor pods. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33123) Ignore `GitHub Action file` change in Amplab Jenkins
[ https://issues.apache.org/jira/browse/SPARK-33123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212708#comment-17212708 ] Apache Spark commented on SPARK-33123: -- User 'williamhyun' has created a pull request for this issue: https://github.com/apache/spark/pull/30020 > Ignore `GitHub Action file` change in Amplab Jenkins > > > Key: SPARK-33123 > URL: https://issues.apache.org/jira/browse/SPARK-33123 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.1.0 >Reporter: William Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33123) Ignore `GitHub Action file` change in Amplab Jenkins
[ https://issues.apache.org/jira/browse/SPARK-33123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33123: Assignee: Apache Spark > Ignore `GitHub Action file` change in Amplab Jenkins > > > Key: SPARK-33123 > URL: https://issues.apache.org/jira/browse/SPARK-33123 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.1.0 >Reporter: William Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33123) Ignore `GitHub Action file` change in Amplab Jenkins
[ https://issues.apache.org/jira/browse/SPARK-33123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33123: Assignee: (was: Apache Spark) > Ignore `GitHub Action file` change in Amplab Jenkins > > > Key: SPARK-33123 > URL: https://issues.apache.org/jira/browse/SPARK-33123 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.1.0 >Reporter: William Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33123) Ignore `GitHub Action file` change in Amplab Jenkins
[ https://issues.apache.org/jira/browse/SPARK-33123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212707#comment-17212707 ] Apache Spark commented on SPARK-33123: -- User 'williamhyun' has created a pull request for this issue: https://github.com/apache/spark/pull/30020 > Ignore `GitHub Action file` change in Amplab Jenkins > > > Key: SPARK-33123 > URL: https://issues.apache.org/jira/browse/SPARK-33123 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.1.0 >Reporter: William Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33123) Ignore `GitHub Action file` change in Amplab Jenkins
William Hyun created SPARK-33123: Summary: Ignore `GitHub Action file` change in Amplab Jenkins Key: SPARK-33123 URL: https://issues.apache.org/jira/browse/SPARK-33123 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 3.1.0 Reporter: William Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32381) Expose the ability for users to use parallel file & avoid location information discovery in RDDs
[ https://issues.apache.org/jira/browse/SPARK-32381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212684#comment-17212684 ] Apache Spark commented on SPARK-32381: -- User 'sunchao' has created a pull request for this issue: https://github.com/apache/spark/pull/30019 > Expose the ability for users to use parallel file & avoid location > information discovery in RDDs > > > Key: SPARK-32381 > URL: https://issues.apache.org/jira/browse/SPARK-32381 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Holden Karau >Assignee: Chao Sun >Priority: Major > Fix For: 3.1.0 > > > We already have this in SQL so it's mostly a matter of re-organizing the code > a bit and agreeing on how to best expose this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33120) Lazy Load of SparkContext.addFiles
[ https://issues.apache.org/jira/browse/SPARK-33120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212679#comment-17212679 ] Taylor Smock commented on SPARK-33120: -- I'd like to avoid using excess network resources and disk resources. For example, if I only have 5 GiB of space left on a node, and I've got 10 GiB of data, I don't want to send anything that that node doesn't need. For example, I'm doing something geographically, and I've got a set of binary data files for the whole world (from the NASA SRTM elevation data, if you are interested). The (current) binary files have a naming scheme like `(N/S)(E/W).ext` . I can work around that, but I've been trying to make the methodology generic enough for future binary data files. I think the best solution would be a lazy load for the `addFiles` function (each file is used by relatively few jobs). I could be going at the problem in a sub-optimal fashion though. This isn't (currently) high priority for me (hence `minor`), since the total size of the data files is currently < 10 GiB (there are 9576 elevation files). Hopefully this answered your question. > Lazy Load of SparkContext.addFiles > -- > > Key: SPARK-33120 > URL: https://issues.apache.org/jira/browse/SPARK-33120 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 > Environment: Mac OS X (2 systems), workload to eventually be run on > Amazon EMR. > Java 11 application. >Reporter: Taylor Smock >Priority: Minor > > In my spark job, I may have various random files that may or may not be used > by each task. > I would like to avoid copying all of the files to every executor until it is > actually needed. > > What I've tried: > * SparkContext.addFiles w/ SparkFiles.get . In testing, all files were > distributed to all clients. > * Broadcast variables. Since I _don't_ know what files I'm going to need > until I have started the task, I have to broadcast all the data at once, > which leads to nodes getting data, and then caching it to disk. In short, the > same issues as SparkContext.addFiles, but with the added benefit of having > the ability to create a mapping of paths to files. > What I would like to see: > * SparkContext.addFiles(file, Enum.LazyLoad) w/ SparkFiles.get(file, > Enum.WaitForAvailability) or Future future = SparkFiles.get(file) > > > Notes: > https://issues.apache.org/jira/browse/SPARK-4290?focusedCommentId=14205346&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14205346 > indicated that `SparkFiles.get` would be required to get the data on the > local driver, but in my testing that did not appear to be the case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33118) CREATE TEMPORARY TABLE fails with location
[ https://issues.apache.org/jira/browse/SPARK-33118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33118: - Assignee: Pablo Langa Blanco (was: Apache Spark) > CREATE TEMPORARY TABLE fails with location > -- > > Key: SPARK-33118 > URL: https://issues.apache.org/jira/browse/SPARK-33118 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.1.0 >Reporter: Pablo Langa Blanco >Assignee: Pablo Langa Blanco >Priority: Major > Fix For: 3.0.2, 3.1.0 > > > The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION > > {code:java} > spark.range(3).write.parquet("/data/tmp/testspark1") > spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path > '/data/tmp/testspark1')") > spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION > '/data/tmp/testspark1'") > {code} > The error message in both cases is > {code:java} > org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. > It must be specified manually.; > at > org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408) > at > org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616) > at org.apache.spark.sql.Dataset.(Dataset.scala:229) > at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) > at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33118) CREATE TEMPORARY TABLE fails with location
[ https://issues.apache.org/jira/browse/SPARK-33118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33118. --- Fix Version/s: 3.0.2 3.1.0 Resolution: Fixed Issue resolved by pull request 30014 [https://github.com/apache/spark/pull/30014] > CREATE TEMPORARY TABLE fails with location > -- > > Key: SPARK-33118 > URL: https://issues.apache.org/jira/browse/SPARK-33118 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.1.0 >Reporter: Pablo Langa Blanco >Assignee: Apache Spark >Priority: Major > Fix For: 3.1.0, 3.0.2 > > > The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION > > {code:java} > spark.range(3).write.parquet("/data/tmp/testspark1") > spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path > '/data/tmp/testspark1')") > spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION > '/data/tmp/testspark1'") > {code} > The error message in both cases is > {code:java} > org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. > It must be specified manually.; > at > org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408) > at > org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616) > at org.apache.spark.sql.Dataset.(Dataset.scala:229) > at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) > at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25390) Data source V2 API refactoring
[ https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212675#comment-17212675 ] Jan Berkel commented on SPARK-25390: I'm in a similar situation. [~Kyrdan] asked on the mailing list as directed, but nobody replied. It's strange that such a central API is completely undocumented. The new iteration of the datasource API doesn't look remotely like v2, it might as well have been called v3. If it's not possible to provide the documentation, put at least some notes/warnings in the migration guide or changelog indicating that Spark3's datasource API has changed completely. And, as far as I can tell at the moment, it doesn't seem to be possible to implement the new Datasource V2 using plain Java classes. > Data source V2 API refactoring > -- > > Key: SPARK-25390 > URL: https://issues.apache.org/jira/browse/SPARK-25390 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > > Currently it's not very clear how we should abstract data source v2 API. The > abstraction should be unified between batch and streaming, or similar but > have a well-defined difference between batch and streaming. And the > abstraction should also include catalog/table. > An example of the abstraction: > {code} > batch: catalog -> table -> scan > streaming: catalog -> table -> stream -> scan > {code} > We should refactor the data source v2 API according to the abstraction -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33118) CREATE TEMPORARY TABLE fails with location
[ https://issues.apache.org/jira/browse/SPARK-33118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33118: -- Affects Version/s: 3.1.0 > CREATE TEMPORARY TABLE fails with location > -- > > Key: SPARK-33118 > URL: https://issues.apache.org/jira/browse/SPARK-33118 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.1.0 >Reporter: Pablo Langa Blanco >Assignee: Apache Spark >Priority: Major > > The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION > > {code:java} > spark.range(3).write.parquet("/data/tmp/testspark1") > spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path > '/data/tmp/testspark1')") > spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION > '/data/tmp/testspark1'") > {code} > The error message in both cases is > {code:java} > org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. > It must be specified manually.; > at > org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408) > at > org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616) > at org.apache.spark.sql.Dataset.(Dataset.scala:229) > at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) > at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33118) CREATE TEMPORARY TABLE fails with location
[ https://issues.apache.org/jira/browse/SPARK-33118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212672#comment-17212672 ] Dongjoon Hyun commented on SPARK-33118: --- Thank you for your contribution, [~planga82]. > CREATE TEMPORARY TABLE fails with location > -- > > Key: SPARK-33118 > URL: https://issues.apache.org/jira/browse/SPARK-33118 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Pablo Langa Blanco >Assignee: Apache Spark >Priority: Major > > The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION > > {code:java} > spark.range(3).write.parquet("/data/tmp/testspark1") > spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path > '/data/tmp/testspark1')") > spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION > '/data/tmp/testspark1'") > {code} > The error message in both cases is > {code:java} > org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. > It must be specified manually.; > at > org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408) > at > org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616) > at org.apache.spark.sql.Dataset.(Dataset.scala:229) > at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) > at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33120) Lazy Load of SparkContext.addFiles
[ https://issues.apache.org/jira/browse/SPARK-33120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212670#comment-17212670 ] Dongjoon Hyun edited comment on SPARK-33120 at 10/12/20, 9:00 PM: -- Hi, [~tsmock]. What is the benefit you need here? bq. I would like to avoid copying all of the files to every executor until it is actually needed. was (Author: dongjoon): Hi, [~tsmock]. What is the benefit you need here? > I would like to avoid copying all of the files to every executor until it is > actually needed. > Lazy Load of SparkContext.addFiles > -- > > Key: SPARK-33120 > URL: https://issues.apache.org/jira/browse/SPARK-33120 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 > Environment: Mac OS X (2 systems), workload to eventually be run on > Amazon EMR. > Java 11 application. >Reporter: Taylor Smock >Priority: Minor > > In my spark job, I may have various random files that may or may not be used > by each task. > I would like to avoid copying all of the files to every executor until it is > actually needed. > > What I've tried: > * SparkContext.addFiles w/ SparkFiles.get . In testing, all files were > distributed to all clients. > * Broadcast variables. Since I _don't_ know what files I'm going to need > until I have started the task, I have to broadcast all the data at once, > which leads to nodes getting data, and then caching it to disk. In short, the > same issues as SparkContext.addFiles, but with the added benefit of having > the ability to create a mapping of paths to files. > What I would like to see: > * SparkContext.addFiles(file, Enum.LazyLoad) w/ SparkFiles.get(file, > Enum.WaitForAvailability) or Future future = SparkFiles.get(file) > > > Notes: > https://issues.apache.org/jira/browse/SPARK-4290?focusedCommentId=14205346&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14205346 > indicated that `SparkFiles.get` would be required to get the data on the > local driver, but in my testing that did not appear to be the case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33120) Lazy Load of SparkContext.addFiles
[ https://issues.apache.org/jira/browse/SPARK-33120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212670#comment-17212670 ] Dongjoon Hyun commented on SPARK-33120: --- Hi, [~tsmock]. What is the benefit you need here? > I would like to avoid copying all of the files to every executor until it is > actually needed. > Lazy Load of SparkContext.addFiles > -- > > Key: SPARK-33120 > URL: https://issues.apache.org/jira/browse/SPARK-33120 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 > Environment: Mac OS X (2 systems), workload to eventually be run on > Amazon EMR. > Java 11 application. >Reporter: Taylor Smock >Priority: Minor > > In my spark job, I may have various random files that may or may not be used > by each task. > I would like to avoid copying all of the files to every executor until it is > actually needed. > > What I've tried: > * SparkContext.addFiles w/ SparkFiles.get . In testing, all files were > distributed to all clients. > * Broadcast variables. Since I _don't_ know what files I'm going to need > until I have started the task, I have to broadcast all the data at once, > which leads to nodes getting data, and then caching it to disk. In short, the > same issues as SparkContext.addFiles, but with the added benefit of having > the ability to create a mapping of paths to files. > What I would like to see: > * SparkContext.addFiles(file, Enum.LazyLoad) w/ SparkFiles.get(file, > Enum.WaitForAvailability) or Future future = SparkFiles.get(file) > > > Notes: > https://issues.apache.org/jira/browse/SPARK-4290?focusedCommentId=14205346&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14205346 > indicated that `SparkFiles.get` would be required to get the data on the > local driver, but in my testing that did not appear to be the case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33122) Remove redundant aggregates in the Optimzier
[ https://issues.apache.org/jira/browse/SPARK-33122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212669#comment-17212669 ] Dongjoon Hyun commented on SPARK-33122: --- Hi, [~tanelk]. Since this is an `Improvement` JIRA, I set the affected version to `3.1.0` because Apache Spark doesn't allow backporting improvement or new feature patches. Please use `3.1.0` next time when you propose an improvement. > Remove redundant aggregates in the Optimzier > > > Key: SPARK-33122 > URL: https://issues.apache.org/jira/browse/SPARK-33122 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Tanel Kiis >Priority: Major > > It is possible to have two or more consecutive aggregates whose sole purpose > is to keep only distinct values (for example TPCDS q87). We can remove all > but the last one do improve performance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33122) Remove redundant aggregates in the Optimzier
[ https://issues.apache.org/jira/browse/SPARK-33122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33122: -- Affects Version/s: (was: 3.0.1) 3.1.0 > Remove redundant aggregates in the Optimzier > > > Key: SPARK-33122 > URL: https://issues.apache.org/jira/browse/SPARK-33122 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Tanel Kiis >Priority: Major > > It is possible to have two or more consecutive aggregates whose sole purpose > is to keep only distinct values (for example TPCDS q87). We can remove all > but the last one do improve performance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25271) Creating parquet table with all the column null throws exception
[ https://issues.apache.org/jira/browse/SPARK-25271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25271: -- Fix Version/s: 2.4.8 > Creating parquet table with all the column null throws exception > > > Key: SPARK-25271 > URL: https://issues.apache.org/jira/browse/SPARK-25271 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Shivu Sondur >Assignee: L. C. Hsieh >Priority: Critical > Fix For: 2.4.8, 3.0.0 > > Attachments: image-2018-09-07-09-12-34-944.png, > image-2018-09-07-09-29-33-370.png, image-2018-09-07-09-29-52-899.png, > image-2018-09-07-09-32-43-892.png, image-2018-09-07-09-33-03-095.png > > > {code:java} > 1)cat /data/parquet.dat > 1$abc2$pqr:3$xyz > null{code} > > {code:java} > 2)spark.sql("create table vp_reader_temp (projects map) ROW > FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':' > MAP KEYS TERMINATED BY '$'") > {code} > {code:java} > 3)spark.sql(" > LOAD DATA LOCAL INPATH '/data/parquet.dat' INTO TABLE vp_reader_temp") > {code} > {code:java} > 4)spark.sql("create table vp_reader STORED AS PARQUET as select * from > vp_reader_temp") > {code} > *Result :* Throwing exception (Working fine with spark 2.2.1) > {code:java} > java.lang.RuntimeException: Parquet record is malformed: empty fields are > illegal, the field should be ommited completely instead > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64) > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59) > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31) > at > org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:123) > at > org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:180) > at > org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:46) > at > org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:112) > at > org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:125) > at > org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:149) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:406) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:283) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:281) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1438) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:286) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:211) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:210) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:349) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.parquet.io.ParquetEncodingException: empty fields are > illegal, the field should be ommited completely instead > at > org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endField(MessageColumnIO.java:320) > at > org.apache.parquet.io.RecordConsumerLoggingWrapper.endField(RecordConsumerLoggingWrapper.java:165) > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeMap(DataWritableWriter.java:241) > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeValue(DataWritableWriter.java:116) > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeGroupFields(DataWritableWriter.java:89) > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:60) > ... 2
[jira] [Commented] (SPARK-33122) Remove redundant aggregates in the Optimzier
[ https://issues.apache.org/jira/browse/SPARK-33122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212660#comment-17212660 ] Apache Spark commented on SPARK-33122: -- User 'tanelk' has created a pull request for this issue: https://github.com/apache/spark/pull/30018 > Remove redundant aggregates in the Optimzier > > > Key: SPARK-33122 > URL: https://issues.apache.org/jira/browse/SPARK-33122 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1 >Reporter: Tanel Kiis >Priority: Major > > It is possible to have two or more consecutive aggregates whose sole purpose > is to keep only distinct values (for example TPCDS q87). We can remove all > but the last one do improve performance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33122) Remove redundant aggregates in the Optimzier
[ https://issues.apache.org/jira/browse/SPARK-33122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212661#comment-17212661 ] Apache Spark commented on SPARK-33122: -- User 'tanelk' has created a pull request for this issue: https://github.com/apache/spark/pull/30018 > Remove redundant aggregates in the Optimzier > > > Key: SPARK-33122 > URL: https://issues.apache.org/jira/browse/SPARK-33122 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1 >Reporter: Tanel Kiis >Priority: Major > > It is possible to have two or more consecutive aggregates whose sole purpose > is to keep only distinct values (for example TPCDS q87). We can remove all > but the last one do improve performance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33122) Remove redundant aggregates in the Optimzier
[ https://issues.apache.org/jira/browse/SPARK-33122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33122: Assignee: (was: Apache Spark) > Remove redundant aggregates in the Optimzier > > > Key: SPARK-33122 > URL: https://issues.apache.org/jira/browse/SPARK-33122 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1 >Reporter: Tanel Kiis >Priority: Major > > It is possible to have two or more consecutive aggregates whose sole purpose > is to keep only distinct values (for example TPCDS q87). We can remove all > but the last one do improve performance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33122) Remove redundant aggregates in the Optimzier
[ https://issues.apache.org/jira/browse/SPARK-33122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33122: Assignee: Apache Spark > Remove redundant aggregates in the Optimzier > > > Key: SPARK-33122 > URL: https://issues.apache.org/jira/browse/SPARK-33122 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1 >Reporter: Tanel Kiis >Assignee: Apache Spark >Priority: Major > > It is possible to have two or more consecutive aggregates whose sole purpose > is to keep only distinct values (for example TPCDS q87). We can remove all > but the last one do improve performance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33122) Remove redundant aggregates in the Optimzier
Tanel Kiis created SPARK-33122: -- Summary: Remove redundant aggregates in the Optimzier Key: SPARK-33122 URL: https://issues.apache.org/jira/browse/SPARK-33122 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.1 Reporter: Tanel Kiis It is possible to have two or more consecutive aggregates whose sole purpose is to keep only distinct values (for example TPCDS q87). We can remove all but the last one do improve performance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-30893) Expressions should not change its data type/nullability after it's created
[ https://issues.apache.org/jira/browse/SPARK-30893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212635#comment-17212635 ] Anurag Mantripragada edited comment on SPARK-30893 at 10/12/20, 7:56 PM: - As mentioned above in [~cloud_fan]'s comment, should we backport the nullability and datatype issues from this umbrella to branch-2.4 as they may cause corruption? CC: [~viirya], [~dongjoon] was (Author: anuragmantri): As mentioned here [#https://issues.apache.org/jira/browse/SPARK-30893?focusedCommentId=17041618&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17041618], should we backport the nullability and datatype issues from this umbrella to branch-2.4 as they may cause corruption? CC: [~viirya], [~dongjoon] > Expressions should not change its data type/nullability after it's created > -- > > Key: SPARK-30893 > URL: https://issues.apache.org/jira/browse/SPARK-30893 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Critical > Fix For: 3.0.0 > > > This is a problem because the configuration can change between different > phases of planning, and this can silently break a query plan which can lead > to crashes or data corruption, if data type/nullability gets changed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30893) Expressions should not change its data type/nullability after it's created
[ https://issues.apache.org/jira/browse/SPARK-30893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212635#comment-17212635 ] Anurag Mantripragada commented on SPARK-30893: -- As mentioned here [#https://issues.apache.org/jira/browse/SPARK-30893?focusedCommentId=17041618&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17041618], should we backport the nullability and datatype issues from this umbrella to branch-2.4 as they may cause corruption? CC: [~viirya], [~dongjoon] > Expressions should not change its data type/nullability after it's created > -- > > Key: SPARK-30893 > URL: https://issues.apache.org/jira/browse/SPARK-30893 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Critical > Fix For: 3.0.0 > > > This is a problem because the configuration can change between different > phases of planning, and this can silently break a query plan which can lead > to crashes or data corruption, if data type/nullability gets changed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-33089) avro format does not propagate Hadoop config from DS options to underlying HDFS file system
[ https://issues.apache.org/jira/browse/SPARK-33089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuning Zhang closed SPARK-33089. > avro format does not propagate Hadoop config from DS options to underlying > HDFS file system > --- > > Key: SPARK-33089 > URL: https://issues.apache.org/jira/browse/SPARK-33089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuning Zhang >Assignee: Yuning Zhang >Priority: Major > Fix For: 3.0.2, 3.1.0 > > > When running: > {code:java} > spark.read.format("avro").options(conf).load(path) > {code} > The underlying file system will not receive the `conf` options. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25271) Creating parquet table with all the column null throws exception
[ https://issues.apache.org/jira/browse/SPARK-25271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212487#comment-17212487 ] Apache Spark commented on SPARK-25271: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/30017 > Creating parquet table with all the column null throws exception > > > Key: SPARK-25271 > URL: https://issues.apache.org/jira/browse/SPARK-25271 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Shivu Sondur >Assignee: L. C. Hsieh >Priority: Critical > Fix For: 3.0.0 > > Attachments: image-2018-09-07-09-12-34-944.png, > image-2018-09-07-09-29-33-370.png, image-2018-09-07-09-29-52-899.png, > image-2018-09-07-09-32-43-892.png, image-2018-09-07-09-33-03-095.png > > > {code:java} > 1)cat /data/parquet.dat > 1$abc2$pqr:3$xyz > null{code} > > {code:java} > 2)spark.sql("create table vp_reader_temp (projects map) ROW > FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':' > MAP KEYS TERMINATED BY '$'") > {code} > {code:java} > 3)spark.sql(" > LOAD DATA LOCAL INPATH '/data/parquet.dat' INTO TABLE vp_reader_temp") > {code} > {code:java} > 4)spark.sql("create table vp_reader STORED AS PARQUET as select * from > vp_reader_temp") > {code} > *Result :* Throwing exception (Working fine with spark 2.2.1) > {code:java} > java.lang.RuntimeException: Parquet record is malformed: empty fields are > illegal, the field should be ommited completely instead > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64) > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59) > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31) > at > org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:123) > at > org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:180) > at > org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:46) > at > org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:112) > at > org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:125) > at > org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:149) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:406) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:283) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:281) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1438) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:286) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:211) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:210) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:349) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.parquet.io.ParquetEncodingException: empty fields are > illegal, the field should be ommited completely instead > at > org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endField(MessageColumnIO.java:320) > at > org.apache.parquet.io.RecordConsumerLoggingWrapper.endField(RecordConsumerLoggingWrapper.java:165) > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeMap(DataWritableWriter.java:241) > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeValue(DataWritableWriter.java:116) > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeGroupFields(DataWritableWriter.java:89)
[jira] [Commented] (SPARK-27733) Upgrade to Avro 1.10.0
[ https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212451#comment-17212451 ] Ismaël Mejía commented on SPARK-27733: -- [~sha...@uber.com] sorry I missed somehow the previous notification. If you have a future parquet sync i would love to join to explain the full details, otherwise, the tldr version is fix on Spark side is 'easy' the real deal is that Spark gets Avro's 1.8 dependency via Hive and getting the fix on Hive has proven difficult (already more than 1y in the making) and even with the fix merged we still need it to be backported to the 2.3.x branch and have a release that includes it, we need LOTS of good will and help from the Hive people so if you guys know anyone there who can help that would be appreciated. The issue is related to a more strict validation on unions with ill defined defaults starting on Avro 1.9.x. There are multiple options to deal with this (discussed in the ticket) but we shall probably do a fix on Avro side. I will bring the update here once we agree on the fix, worse scenario it will require a release of Avro too but this is a good time since we were already discussing about a release soon. > Upgrade to Avro 1.10.0 > -- > > Key: SPARK-27733 > URL: https://issues.apache.org/jira/browse/SPARK-27733 > Project: Spark > Issue Type: Improvement > Components: Build, SQL >Affects Versions: 3.1.0 >Reporter: Ismaël Mejía >Priority: Minor > > Avro 1.9.2 was released with many nice features including reduced size (1MB > less), and removed dependencies, no paranamer, no shaded guava, security > updates, so probably a worth upgrade. > Avro 1.10.0 was released and this is still not done. > There is at the moment (2020/08) still a blocker because of Hive related > transitive dependencies bringing older versions of Avro, so we could say that > this is somehow still blocked until HIVE-21737 is solved. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31294) Benchmark the performance regression
[ https://issues.apache.org/jira/browse/SPARK-31294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31294. - Fix Version/s: 3.0.0 Resolution: Fixed > Benchmark the performance regression > > > Key: SPARK-31294 > URL: https://issues.apache.org/jira/browse/SPARK-31294 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33016) Potential SQLMetrics missed which might cause WEB UI display issue while AQE is on.
[ https://issues.apache.org/jira/browse/SPARK-33016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-33016: --- Assignee: Leanken.Lin > Potential SQLMetrics missed which might cause WEB UI display issue while AQE > is on. > --- > > Key: SPARK-33016 > URL: https://issues.apache.org/jira/browse/SPARK-33016 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Leanken.Lin >Assignee: Leanken.Lin >Priority: Minor > > In current AQE execution, there might be a following scenario which might > cause SQLMetrics being incorrectly override. > # Stage A and B are created, and UI updated thru event > onAdaptiveExecutionUpdate. > # Stage A and B are running. Subquery in stage A keep updating metrics thru > event onAdaptiveSQLMetricUpdate. > # Stage B completes, while stage A's subquery is still running, updating > metrics. > # Completion of stage B triggers new stage creation and UI update thru event > onAdaptiveExecutionUpdate again (just like step 1). > > But it's very hard to re-produce this issue, since it was only happened with > high concurrency. For the fix, I suggested that we might be able to keep all > duplicated metrics instead of updating it every time. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33016) Potential SQLMetrics missed which might cause WEB UI display issue while AQE is on.
[ https://issues.apache.org/jira/browse/SPARK-33016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33016. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29965 [https://github.com/apache/spark/pull/29965] > Potential SQLMetrics missed which might cause WEB UI display issue while AQE > is on. > --- > > Key: SPARK-33016 > URL: https://issues.apache.org/jira/browse/SPARK-33016 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Leanken.Lin >Assignee: Leanken.Lin >Priority: Minor > Fix For: 3.1.0 > > > In current AQE execution, there might be a following scenario which might > cause SQLMetrics being incorrectly override. > # Stage A and B are created, and UI updated thru event > onAdaptiveExecutionUpdate. > # Stage A and B are running. Subquery in stage A keep updating metrics thru > event onAdaptiveSQLMetricUpdate. > # Stage B completes, while stage A's subquery is still running, updating > metrics. > # Completion of stage B triggers new stage creation and UI update thru event > onAdaptiveExecutionUpdate again (just like step 1). > > But it's very hard to re-produce this issue, since it was only happened with > high concurrency. For the fix, I suggested that we might be able to keep all > duplicated metrics instead of updating it every time. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33121) Spark does not stop gracefully
[ https://issues.apache.org/jira/browse/SPARK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitry Tverdokhleb updated SPARK-33121: --- Description: Hi. I have a spark streaming code, like: {code:java} inputStream.foreachRDD { rdd => rdd .mapPartitions { // Some operations mapPartitions } .filter { // Some operations filter } .groupBy { // Some operatons groupBy } } {code} I set the config parameter "spark.streaming.stopGracefullyOnShutdown" as "*true*" and send a SIGTERM signal to stop the spark streaming, but an exception arrises java.util.concurrent.RejectedExecutionException: {noformat} 2020-10-12 14:12:29 ERROR Inbox - Ignoring error java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 6] at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063) at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379) at org.apache.spark.executor.Executor.launchTask(Executor.scala:230) at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93) at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91) at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68) at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748){noformat} Logs: {noformat} ... Calling rdd.mapPartitions ... 2020-10-12 14:12:22 INFO MyProject - Shutdown hook called 2020-10-12 14:12:22 DEBUG JobScheduler - Stopping JobScheduler 2020-10-12 14:12:22 INFO ReceiverTracker - ReceiverTracker stopped 2020-10-12 14:12:22 INFO JobGenerator - Stopping JobGenerator gracefully 2020-10-12 14:12:22 INFO JobGenerator - Waiting for all received blocks to be consumed for job generation 2020-10-12 14:12:22 INFO JobGenerator - Waited for all received blocks to be consumed for job generation ... Calling rdd.filter 2020-10-12 14:12:29 ERROR Inbox - Ignoring error java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 6]{noformat} This exception arises only for RDD operations (Like map, filter, etc.), not business logic. Besides, there is no problem with graceful shutdown in spark 2.4.5. was: Hi. I have a spark streaming code, like: {code:java} inputStream.foreachRDD { rdd => rdd .mapPartitions { // Some operations mapPartitions } .filter { // Some operations filter } .groupBy { // Some operatons groupBy } } {code} I set the config parameter "spark.streaming.stopGracefullyOnShutdown" as "*true*" * *and* * send a SIGTERM signal to stop the spark streaming, but an exception arrises java.util.concurrent.RejectedExecutionException: {noformat} 2020-10-12 14:12:29 ERROR Inbox - Ignoring error java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 6] at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063) at java.util
[jira] [Resolved] (SPARK-33103) Custom Schema with Custom RDD reorders columns when more than 4 added
[ https://issues.apache.org/jira/browse/SPARK-33103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justin Mays resolved SPARK-33103. - Resolution: Not A Problem > Custom Schema with Custom RDD reorders columns when more than 4 added > - > > Key: SPARK-33103 > URL: https://issues.apache.org/jira/browse/SPARK-33103 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 > Environment: Java Application >Reporter: Justin Mays >Priority: Major > > I have a custom RDD written in Java that uses a custom schema. Everything > appears to work fine with using 4 columns, but when i add a 5th column, > calling show() fails with > java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: > java.lang.Long is not a valid external type for schema of > here is the schema definition in java: > StructType schema = new StructType() StructType schema = new StructType() > .add("recordId", DataTypes.LongType, false) .add("col1", > DataTypes.DoubleType, false) .add("col2", DataTypes.DoubleType, false) > .add("col3", DataTypes.IntegerType, false) .add("col4", > DataTypes.IntegerType, false); > > Here is the printout of schema.printTreeString(); > == Physical Plan == > *(1) Scan dw [recordId#0L,col1#1,col2#2,col3#3,col4#4] PushedFilters: [], > ReadSchema: struct > > I hardcoded a return in my Row object with values matching the schema: > @Override @Override public Object get(int i) \{ switch(i) { case 0: return > 0L; case 1: return 1.1911950001644689D; case 2: return 9.10949955666E9D; > case 3: return 476; case 4: return 500; } return 0L; } > > Here is the output of the show command: > 15:30:26.875 ERROR org.apache.spark.executor.Executor - Exception in task 0.0 > in stage 0.0 (TID 0)15:30:26.875 ERROR org.apache.spark.executor.Executor - > Exception in task 0.0 in stage 0.0 (TID 0)java.lang.RuntimeException: Error > while encoding: java.lang.RuntimeException: java.lang.Long is not a valid > external type for schema of > doublevalidateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 0, col1), DoubleType) AS > col1#30validateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 1, recordId), LongType) AS > recordId#31Lvalidateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 2, col2), DoubleType) AS > col2#32validateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 3, col3), IntegerType) AS > col3#33validateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 4, col4), IntegerType) AS col4#34 at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:215) > ~[spark-catalyst_2.12-3.0.1.jar:3.0.1] at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:197) > ~[spark-catalyst_2.12-3.0.1.jar:3.0.1] at > scala.collection.Iterator$$anon$10.next(Iterator.scala:459) > ~[scala-library-2.12.10.jar:?] at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) ~[?:?] at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > ~[spark-sql_2.12-3.0.1.jar:3.0.1] at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729) > ~[spark-sql_2.12-3.0.1.jar:3.0.1] at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340) > ~[spark-sql_2.12-3.0.1.jar:3.0.1] at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872) > ~[spark-core_2.12-3.0.1.jar:3.0.1] at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872) > ~[spark-core_2.12-3.0.1.jar:3.0.1] at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > ~[spark-core_2.12-3.0.1.jar:3.0.1] at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) > ~[spark-core_2.12-3.0.1.jar:3.0.1] at > org.apache.spark.rdd.RDD.iterator(RDD.scala:313) > ~[spark-core_2.12-3.0.1.jar:3.0.1] at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > ~[spark-core_2.12-3.0.1.jar:3.0.1] at > org.apache.spark.scheduler.Task.run(Task.scala:127) > ~[spark-core_2.12-3.0.1.jar:3.0.1] at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) > ~[spark-core_2.12-3.0.1.jar:3.0.1] at > org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) > ~[spark-core_2.12-3.0.1.jar:3.0.1] at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) > [spark-core_2.12-3.0.1.jar:3.0.1] at > java.util
[jira] [Commented] (SPARK-33103) Custom Schema with Custom RDD reorders columns when more than 4 added
[ https://issues.apache.org/jira/browse/SPARK-33103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212428#comment-17212428 ] Justin Mays commented on SPARK-33103: - Ok follow up... I think i figured this out but I guess the API was not doing what I expected to do. The issue is that my custom BaseRelation class was implementing PrunedFilteredScan and Spark was sending it the required columns out of order from the schema. Obviously it should have supported the columns coming in out of order, but I didn't expect this to come from Spark, I thought it would come based off my query to spark so it was never obvious to me and i hadn't progressed to implementing that feature yet. I ended up replacing that interface with the TableScan interface for the time being and everything worked with more than 4 columns > Custom Schema with Custom RDD reorders columns when more than 4 added > - > > Key: SPARK-33103 > URL: https://issues.apache.org/jira/browse/SPARK-33103 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 > Environment: Java Application >Reporter: Justin Mays >Priority: Major > > I have a custom RDD written in Java that uses a custom schema. Everything > appears to work fine with using 4 columns, but when i add a 5th column, > calling show() fails with > java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: > java.lang.Long is not a valid external type for schema of > here is the schema definition in java: > StructType schema = new StructType() StructType schema = new StructType() > .add("recordId", DataTypes.LongType, false) .add("col1", > DataTypes.DoubleType, false) .add("col2", DataTypes.DoubleType, false) > .add("col3", DataTypes.IntegerType, false) .add("col4", > DataTypes.IntegerType, false); > > Here is the printout of schema.printTreeString(); > == Physical Plan == > *(1) Scan dw [recordId#0L,col1#1,col2#2,col3#3,col4#4] PushedFilters: [], > ReadSchema: struct > > I hardcoded a return in my Row object with values matching the schema: > @Override @Override public Object get(int i) \{ switch(i) { case 0: return > 0L; case 1: return 1.1911950001644689D; case 2: return 9.10949955666E9D; > case 3: return 476; case 4: return 500; } return 0L; } > > Here is the output of the show command: > 15:30:26.875 ERROR org.apache.spark.executor.Executor - Exception in task 0.0 > in stage 0.0 (TID 0)15:30:26.875 ERROR org.apache.spark.executor.Executor - > Exception in task 0.0 in stage 0.0 (TID 0)java.lang.RuntimeException: Error > while encoding: java.lang.RuntimeException: java.lang.Long is not a valid > external type for schema of > doublevalidateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 0, col1), DoubleType) AS > col1#30validateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 1, recordId), LongType) AS > recordId#31Lvalidateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 2, col2), DoubleType) AS > col2#32validateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 3, col3), IntegerType) AS > col3#33validateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 4, col4), IntegerType) AS col4#34 at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:215) > ~[spark-catalyst_2.12-3.0.1.jar:3.0.1] at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:197) > ~[spark-catalyst_2.12-3.0.1.jar:3.0.1] at > scala.collection.Iterator$$anon$10.next(Iterator.scala:459) > ~[scala-library-2.12.10.jar:?] at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) ~[?:?] at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > ~[spark-sql_2.12-3.0.1.jar:3.0.1] at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729) > ~[spark-sql_2.12-3.0.1.jar:3.0.1] at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340) > ~[spark-sql_2.12-3.0.1.jar:3.0.1] at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872) > ~[spark-core_2.12-3.0.1.jar:3.0.1] at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872) > ~[spark-core_2.12-3.0.1.jar:3.0.1] at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > ~[spark-core_2.12-3.0.1.jar:3.0.1] at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) > ~[spark-core_2.
[jira] [Updated] (SPARK-33121) Spark does not stop gracefully
[ https://issues.apache.org/jira/browse/SPARK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitry Tverdokhleb updated SPARK-33121: --- Description: Hi. I have a spark streaming code, like: {code:java} inputStream.foreachRDD { rdd => rdd .mapPartitions { // Some operations mapPartitions } .filter { // Some operations filter } .groupBy { // Some operatons groupBy } } {code} I set the config parameter "spark.streaming.stopGracefullyOnShutdown" as "*true*" and send a SIGTERM signal to stop the spark streaming, but an exception arrises: {noformat} 2020-10-12 14:12:29 ERROR Inbox - Ignoring error java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 6] at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063) at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379) at org.apache.spark.executor.Executor.launchTask(Executor.scala:230) at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93) at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91) at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68) at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748){noformat} Logs: {noformat} ... Calling rdd.mapPartitions ... 2020-10-12 14:12:22 INFO MyProject - Shutdown hook called 2020-10-12 14:12:22 DEBUG JobScheduler - Stopping JobScheduler 2020-10-12 14:12:22 INFO ReceiverTracker - ReceiverTracker stopped 2020-10-12 14:12:22 INFO JobGenerator - Stopping JobGenerator gracefully 2020-10-12 14:12:22 INFO JobGenerator - Waiting for all received blocks to be consumed for job generation 2020-10-12 14:12:22 INFO JobGenerator - Waited for all received blocks to be consumed for job generation ... Calling rdd.filter 2020-10-12 14:12:29 ERROR Inbox - Ignoring error java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 6]{noformat} This exception arises only for RDD operations (Like map, filter, etc.), not business logic. Besides, there is no problem with graceful shutdown in spark 2.4.5. was: Hi. I have a spark streaming code, like: {code:java} inputStream.foreachRDD { rdd => rdd .mapPartitions { // Some operations mapPartitions } .filter { // Some operations filter } .groupBy { // Some operatons groupBy } } {code} I set the config parameter "spark.streaming.stopGracefullyOnShutdown" as "*true*" and send a SIGTERM signal to stop the spark streaming, but an exception arrises java.util.concurrent.RejectedExecutionException: {noformat} 2020-10-12 14:12:29 ERROR Inbox - Ignoring error java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 6] at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063) at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExe
[jira] [Created] (SPARK-33121) Spark does not stop gracefully
Dmitry Tverdokhleb created SPARK-33121: -- Summary: Spark does not stop gracefully Key: SPARK-33121 URL: https://issues.apache.org/jira/browse/SPARK-33121 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 3.0.1 Reporter: Dmitry Tverdokhleb Hi. I have a spark streaming code, like: {code:java} inputStream.foreachRDD { rdd => rdd .mapPartitions { // Some operations mapPartitions null } .filter { // Some operations filter null } .groupBy { // Some operatons groupBy null } } {code} I set the config parameter "spark.streaming.stopGracefullyOnShutdown" as "*true*" ** and ** send a SIGTERM signal to stop the spark streaming, but an exception arrises java.util.concurrent.RejectedExecutionException: {noformat} 2020-10-12 14:12:29 ERROR Inbox - Ignoring error java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 6] at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063) at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379) at org.apache.spark.executor.Executor.launchTask(Executor.scala:230) at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93) at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91) at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68) at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748){noformat} Logs: {noformat} ... Calling rdd.mapPartitions ... 2020-10-12 14:12:22 INFO MyProject - Shutdown hook called 2020-10-12 14:12:22 DEBUG JobScheduler - Stopping JobScheduler 2020-10-12 14:12:22 INFO ReceiverTracker - ReceiverTracker stopped 2020-10-12 14:12:22 INFO JobGenerator - Stopping JobGenerator gracefully 2020-10-12 14:12:22 INFO JobGenerator - Waiting for all received blocks to be consumed for job generation 2020-10-12 14:12:22 INFO JobGenerator - Waited for all received blocks to be consumed for job generation ... Calling rdd.filter 2020-10-12 14:12:29 ERROR Inbox - Ignoring error java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 6]{noformat} This exception arises only for RDD operations (Like map, filter, etc.), not business logic. Besides, there is no problem with graceful shutdown in spark 2.4.5. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33121) Spark does not stop gracefully
[ https://issues.apache.org/jira/browse/SPARK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitry Tverdokhleb updated SPARK-33121: --- Description: Hi. I have a spark streaming code, like: {code:java} inputStream.foreachRDD { rdd => rdd .mapPartitions { // Some operations mapPartitions } .filter { // Some operations filter } .groupBy { // Some operatons groupBy } } {code} I set the config parameter "spark.streaming.stopGracefullyOnShutdown" as "*true*" * *and* * send a SIGTERM signal to stop the spark streaming, but an exception arrises java.util.concurrent.RejectedExecutionException: {noformat} 2020-10-12 14:12:29 ERROR Inbox - Ignoring error java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 6] at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063) at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379) at org.apache.spark.executor.Executor.launchTask(Executor.scala:230) at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93) at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91) at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68) at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748){noformat} Logs: {noformat} ... Calling rdd.mapPartitions ... 2020-10-12 14:12:22 INFO MyProject - Shutdown hook called 2020-10-12 14:12:22 DEBUG JobScheduler - Stopping JobScheduler 2020-10-12 14:12:22 INFO ReceiverTracker - ReceiverTracker stopped 2020-10-12 14:12:22 INFO JobGenerator - Stopping JobGenerator gracefully 2020-10-12 14:12:22 INFO JobGenerator - Waiting for all received blocks to be consumed for job generation 2020-10-12 14:12:22 INFO JobGenerator - Waited for all received blocks to be consumed for job generation ... Calling rdd.filter 2020-10-12 14:12:29 ERROR Inbox - Ignoring error java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 6]{noformat} This exception arises only for RDD operations (Like map, filter, etc.), not business logic. Besides, there is no problem with graceful shutdown in spark 2.4.5. was: Hi. I have a spark streaming code, like: {code:java} inputStream.foreachRDD { rdd => rdd .mapPartitions { // Some operations mapPartitions null } .filter { // Some operations filter null } .groupBy { // Some operatons groupBy null } } {code} I set the config parameter "spark.streaming.stopGracefullyOnShutdown" as "*true*" ** and ** send a SIGTERM signal to stop the spark streaming, but an exception arrises java.util.concurrent.RejectedExecutionException: {noformat} 2020-10-12 14:12:29 ERROR Inbox - Ignoring error java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 6] at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoo
[jira] [Created] (SPARK-33120) Lazy Load of SparkContext.addFiles
Taylor Smock created SPARK-33120: Summary: Lazy Load of SparkContext.addFiles Key: SPARK-33120 URL: https://issues.apache.org/jira/browse/SPARK-33120 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.1 Environment: Mac OS X (2 systems), workload to eventually be run on Amazon EMR. Java 11 application. Reporter: Taylor Smock In my spark job, I may have various random files that may or may not be used by each task. I would like to avoid copying all of the files to every executor until it is actually needed. What I've tried: * SparkContext.addFiles w/ SparkFiles.get . In testing, all files were distributed to all clients. * Broadcast variables. Since I _don't_ know what files I'm going to need until I have started the task, I have to broadcast all the data at once, which leads to nodes getting data, and then caching it to disk. In short, the same issues as SparkContext.addFiles, but with the added benefit of having the ability to create a mapping of paths to files. What I would like to see: * SparkContext.addFiles(file, Enum.LazyLoad) w/ SparkFiles.get(file, Enum.WaitForAvailability) or Future future = SparkFiles.get(file) Notes: https://issues.apache.org/jira/browse/SPARK-4290?focusedCommentId=14205346&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14205346 indicated that `SparkFiles.get` would be required to get the data on the local driver, but in my testing that did not appear to be the case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33111) aft transform optimization
[ https://issues.apache.org/jira/browse/SPARK-33111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-33111: Assignee: zhengruifeng > aft transform optimization > -- > > Key: SPARK-33111 > URL: https://issues.apache.org/jira/browse/SPARK-33111 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > > when {{predictionCol}} and {{quantilesCol}} are both set, we only need one > computation for each row -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33111) aft transform optimization
[ https://issues.apache.org/jira/browse/SPARK-33111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-33111. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 3 [https://github.com/apache/spark/pull/3] > aft transform optimization > -- > > Key: SPARK-33111 > URL: https://issues.apache.org/jira/browse/SPARK-33111 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Fix For: 3.1.0 > > > when {{predictionCol}} and {{quantilesCol}} are both set, we only need one > computation for each row -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-33116) Spark SQL window function with order by cause result incorrect
[ https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Du closed SPARK-33116. --- > Spark SQL window function with order by cause result incorrect > -- > > Key: SPARK-33116 > URL: https://issues.apache.org/jira/browse/SPARK-33116 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Will Du >Priority: Major > > Prepare the data > CREATE TABLE IF NOT EXISTS product_catalog ( > name STRING,category STRING,location STRING,price DECIMAL(10,2)); > INSERT OVERWRITE product_catalog VALUES > ('Nest Coffee', 'drink', 'Toronto', 15.5), > ('Pepesi', 'drink', 'Toronto', 9.99), > ('Hasimal', 'toy', 'Toronto', 5.9), > ('Fire War', 'game', 'Toronto', 70.0), > ('Final Fantasy', 'game', 'Montreal', 79.99), > ('Lego Friends 15005', 'toy', 'Montreal', 12.99), > ('Nesion Milk', 'drink', 'Montreal', 8.9); > 1. Query without ORDER BY after PARTITION BY col, the result is correct. > SELECT > category, price, > max(price) over(PARTITION BY category) as max_p, > min(price) over(PARTITION BY category) as min_p, > sum(price) over(PARTITION BY category) as sum_p, > avg(price) over(PARTITION BY category) as avg_p, > count(*) over(PARTITION BY category) as count_w > FROM > product_catalog; > || category || price || max_p || min_p || sum_p || avg_p > || count_w || > | drink | 8.90 | 15.50 | 8.90 | 34.39 | 11.46 | 3 | > | drink | 9.99 | 15.50 | 8.90 | 34.39 | 11.46 | 3 | > | drink | 15.50 | 15.50 | 8.90 | 34.39 | 11.46 | 3 | > | game | 79.99 | 79.99 | 70.00 | 149.99 | 74.995000 | 2 | > | game | 70.00 | 79.99 | 70.00 | 149.99 | 74.995000 | 2 | > | toy | 12.99 | 12.99 | 5.90 | 18.89 | 9.445000 | 2 | > | toy | 5.90 | 12.99 | 5.90 | 18.89 | 9.445000 | 2 | > 7 rows selected (0.442 seconds) > 2 Query with ORDER BY after PARTITION BY col, the result is NOT correct. Min > result is ok. Why other results are like that? > SELECT > category, price, > max(price) over(PARTITION BY category ORDER BY price) as max_p, > min(price) over(PARTITION BY category ORDER BY price) as min_p, > sum(price) over(PARTITION BY category ORDER BY price) as sum_p, > avg(price) over(PARTITION BY category ORDER BY price) as avg_p, > count(*) over(PARTITION BY category ORDER BY price) as count_w > FROM > product_catalog; > || category || price || max_p || min_p || sum_p || avg_p > || count_w || > | drink | 8.90 | 8.90 | 8.90 | 8.90| 8.90 | 1| > | drink | 9.99 | 9.99 | 8.90 | 18.89 | 9.445000 | 2| > | drink | 15.50 | 15.50 | 8.90 | 34.39 | 11.46 | 3| > | game | 70.00 | 70.00 | 70.00 | 70.00 | 70.00 | 1| > | game | 79.99 | 79.99 | 70.00 | 149.99 | 74.995000 | 2| > | toy | 5.90 | 5.90 | 5.90 | 5.90| 5.90 | 1| > | toy | 12.99 | 12.99 | 5.90 | 18.89 | 9.445000 | 2| > 7 rows selected (0.436 seconds) > Does it seem that we can only order by the columns after partition by clause? > I do not think there are such limitation in standard SQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33116) Spark SQL window function with order by cause result incorrect
[ https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212359#comment-17212359 ] Will Du edited comment on SPARK-33116 at 10/12/20, 1:41 PM: [~maropu], the statement you mentioned does not have PARTITION BY in the example. But I am able to reproduce the same behavior in SQL server. I think this can be closed. was (Author: willddy): [~maropu], the statement you mentioned is comparing queries with PARTITION BY and without PARTITION BY. But if you look at the query I provided, both of them have PARTITION BY clause. The only difference is the ORDER BY clause added or not. The expected result I think should be the same on both queries except the orders of rows (by price). > Spark SQL window function with order by cause result incorrect > -- > > Key: SPARK-33116 > URL: https://issues.apache.org/jira/browse/SPARK-33116 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Will Du >Priority: Major > > Prepare the data > CREATE TABLE IF NOT EXISTS product_catalog ( > name STRING,category STRING,location STRING,price DECIMAL(10,2)); > INSERT OVERWRITE product_catalog VALUES > ('Nest Coffee', 'drink', 'Toronto', 15.5), > ('Pepesi', 'drink', 'Toronto', 9.99), > ('Hasimal', 'toy', 'Toronto', 5.9), > ('Fire War', 'game', 'Toronto', 70.0), > ('Final Fantasy', 'game', 'Montreal', 79.99), > ('Lego Friends 15005', 'toy', 'Montreal', 12.99), > ('Nesion Milk', 'drink', 'Montreal', 8.9); > 1. Query without ORDER BY after PARTITION BY col, the result is correct. > SELECT > category, price, > max(price) over(PARTITION BY category) as max_p, > min(price) over(PARTITION BY category) as min_p, > sum(price) over(PARTITION BY category) as sum_p, > avg(price) over(PARTITION BY category) as avg_p, > count(*) over(PARTITION BY category) as count_w > FROM > product_catalog; > || category || price || max_p || min_p || sum_p || avg_p > || count_w || > | drink | 8.90 | 15.50 | 8.90 | 34.39 | 11.46 | 3 | > | drink | 9.99 | 15.50 | 8.90 | 34.39 | 11.46 | 3 | > | drink | 15.50 | 15.50 | 8.90 | 34.39 | 11.46 | 3 | > | game | 79.99 | 79.99 | 70.00 | 149.99 | 74.995000 | 2 | > | game | 70.00 | 79.99 | 70.00 | 149.99 | 74.995000 | 2 | > | toy | 12.99 | 12.99 | 5.90 | 18.89 | 9.445000 | 2 | > | toy | 5.90 | 12.99 | 5.90 | 18.89 | 9.445000 | 2 | > 7 rows selected (0.442 seconds) > 2 Query with ORDER BY after PARTITION BY col, the result is NOT correct. Min > result is ok. Why other results are like that? > SELECT > category, price, > max(price) over(PARTITION BY category ORDER BY price) as max_p, > min(price) over(PARTITION BY category ORDER BY price) as min_p, > sum(price) over(PARTITION BY category ORDER BY price) as sum_p, > avg(price) over(PARTITION BY category ORDER BY price) as avg_p, > count(*) over(PARTITION BY category ORDER BY price) as count_w > FROM > product_catalog; > || category || price || max_p || min_p || sum_p || avg_p > || count_w || > | drink | 8.90 | 8.90 | 8.90 | 8.90| 8.90 | 1| > | drink | 9.99 | 9.99 | 8.90 | 18.89 | 9.445000 | 2| > | drink | 15.50 | 15.50 | 8.90 | 34.39 | 11.46 | 3| > | game | 70.00 | 70.00 | 70.00 | 70.00 | 70.00 | 1| > | game | 79.99 | 79.99 | 70.00 | 149.99 | 74.995000 | 2| > | toy | 5.90 | 5.90 | 5.90 | 5.90| 5.90 | 1| > | toy | 12.99 | 12.99 | 5.90 | 18.89 | 9.445000 | 2| > 7 rows selected (0.436 seconds) > Does it seem that we can only order by the columns after partition by clause? > I do not think there are such limitation in standard SQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31338) Spark SQL JDBC Data Source partitioned read : Spark SQL does not honor for NOT NULL table definition of partition key.
[ https://issues.apache.org/jira/browse/SPARK-31338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212373#comment-17212373 ] Ankush Chatterjee commented on SPARK-31338: --- When read like this is used : - {code:java} jdbcRead = spark.read .option("fetchsize", fetchSize) .jdbc( url = s"${connectionURL}", table = s"${query}", columnName = s"${partKey}", lowerBound = lBound, upperBound = hBound, numPartitions = numParts, connectionProperties = connProps); {code} Spark generates multiple queries to read each partition, in the first partition, spark adds "or $column is null" in the where clause, this makes few databases do a full table scan having a heavy impact on performance (on columns with not null enabled). In JDBCRelation.scala :- {code:java} while (i < numPartitions) { val lBoundValue = boundValueToString(currentValue) val lBound = if (i != 0) s"$column >= $lBoundValue" else null currentValue += stride val uBoundValue = boundValueToString(currentValue) val uBound = if (i != numPartitions - 1) s"$column < $uBoundValue" else null val whereClause = if (uBound == null) { lBound } else if (lBound == null) { s"$uBound or $column is null" } else { s"$lBound AND $uBound" } ans += JDBCPartition(whereClause, i) i = i + 1 }{code} Is it feasible to add an option in JDBCOptions to enable/disable adding "or $column is null", as using a not null column is a commonplace usage when paritioning [~olkuznsmith] > Spark SQL JDBC Data Source partitioned read : Spark SQL does not honor for > NOT NULL table definition of partition key. > -- > > Key: SPARK-31338 > URL: https://issues.apache.org/jira/browse/SPARK-31338 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.5 >Reporter: Mohit Dave >Priority: Major > > h2. *Our Use-case Details:* > While reading from a jdbc source using spark sql, we are using below read > format : > jdbc(url: String, table: String, columnName: String, lowerBound: Long, > upperBound: Long, numPartitions: Int, connectionProperties: Properties). > *Table defination :* > postgres=> \d lineitem_sf1000 > Table "public.lineitem_sf1000" > Column | Type | Modifiers > -++-- > *l_orderkey | bigint | not null* > l_partkey | bigint | not null > l_suppkey | bigint | not null > l_linenumber | bigint | not null > l_quantity | numeric(10,2) | not null > l_extendedprice | numeric(10,2) | not null > l_discount | numeric(10,2) | not null > l_tax | numeric(10,2) | not null > l_returnflag | character varying(1) | not null > l_linestatus | character varying(1) | not null > l_shipdate | character varying(29) | not null > l_commitdate | character varying(29) | not null > l_receiptdate | character varying(29) | not null > l_shipinstruct | character varying(25) | not null > l_shipmode | character varying(10) | not null > l_comment | character varying(44) | not null > Indexes: > "l_order_sf1000_idx" btree (l_orderkey) > > *Partition column* : l_orderkey > *numpartion* : 16 > h2. *Problem details :* > > {code:java} > SELECT > "l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag" > FROM (SELECT > l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment > FROM public.lineitem_sf1000) query_alias WHERE l_orderkey >= 150001 AND > l_orderkey < 187501 {code} > 15 queries are generated with the above BETWEEN clauses. The last query looks > like this below: > {code:java} > SELECT > "l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag" > FROM (SELECT > l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment > FROM public.lineitem_sf1000) query_alias WHERE l_orderkey < 37501 or > l_orderkey is null {code} > I*n the last query, we are trying to get the remaining records, along with > any data in the table for the partition key having NULL values.* > This hurts performance badly. While the first 15 SQLs took approximately 10 > minutes to execute, the last SQL with the NULL check takes 45 minut
[jira] [Comment Edited] (SPARK-33116) Spark SQL window function with order by cause result incorrect
[ https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212359#comment-17212359 ] Will Du edited comment on SPARK-33116 at 10/12/20, 1:19 PM: [~maropu], the statement you mentioned is comparing queries with PARTITION BY and without PARTITION BY. But if you look at the query I provided, both of them have PARTITION BY clause. The only difference is the ORDER BY clause added or not. The expected result I think should be the same on both queries except the orders of rows (by price). was (Author: willddy): [~maropu], the statement you mentioned is comparing queries with PARTITION BY and without PARTITION BY. But if you look at the query I provided, both of them have PARTITION BY clause. The only difference is the ORDER BY clause added or not. The expected result I think should be the same on both queries except the orders of rows (by price). > Spark SQL window function with order by cause result incorrect > -- > > Key: SPARK-33116 > URL: https://issues.apache.org/jira/browse/SPARK-33116 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Will Du >Priority: Major > > Prepare the data > CREATE TABLE IF NOT EXISTS product_catalog ( > name STRING,category STRING,location STRING,price DECIMAL(10,2)); > INSERT OVERWRITE product_catalog VALUES > ('Nest Coffee', 'drink', 'Toronto', 15.5), > ('Pepesi', 'drink', 'Toronto', 9.99), > ('Hasimal', 'toy', 'Toronto', 5.9), > ('Fire War', 'game', 'Toronto', 70.0), > ('Final Fantasy', 'game', 'Montreal', 79.99), > ('Lego Friends 15005', 'toy', 'Montreal', 12.99), > ('Nesion Milk', 'drink', 'Montreal', 8.9); > 1. Query without ORDER BY after PARTITION BY col, the result is correct. > SELECT > category, price, > max(price) over(PARTITION BY category) as max_p, > min(price) over(PARTITION BY category) as min_p, > sum(price) over(PARTITION BY category) as sum_p, > avg(price) over(PARTITION BY category) as avg_p, > count(*) over(PARTITION BY category) as count_w > FROM > product_catalog; > || category || price || max_p || min_p || sum_p || avg_p > || count_w || > | drink | 8.90 | 15.50 | 8.90 | 34.39 | 11.46 | 3 | > | drink | 9.99 | 15.50 | 8.90 | 34.39 | 11.46 | 3 | > | drink | 15.50 | 15.50 | 8.90 | 34.39 | 11.46 | 3 | > | game | 79.99 | 79.99 | 70.00 | 149.99 | 74.995000 | 2 | > | game | 70.00 | 79.99 | 70.00 | 149.99 | 74.995000 | 2 | > | toy | 12.99 | 12.99 | 5.90 | 18.89 | 9.445000 | 2 | > | toy | 5.90 | 12.99 | 5.90 | 18.89 | 9.445000 | 2 | > 7 rows selected (0.442 seconds) > 2 Query with ORDER BY after PARTITION BY col, the result is NOT correct. Min > result is ok. Why other results are like that? > SELECT > category, price, > max(price) over(PARTITION BY category ORDER BY price) as max_p, > min(price) over(PARTITION BY category ORDER BY price) as min_p, > sum(price) over(PARTITION BY category ORDER BY price) as sum_p, > avg(price) over(PARTITION BY category ORDER BY price) as avg_p, > count(*) over(PARTITION BY category ORDER BY price) as count_w > FROM > product_catalog; > || category || price || max_p || min_p || sum_p || avg_p > || count_w || > | drink | 8.90 | 8.90 | 8.90 | 8.90| 8.90 | 1| > | drink | 9.99 | 9.99 | 8.90 | 18.89 | 9.445000 | 2| > | drink | 15.50 | 15.50 | 8.90 | 34.39 | 11.46 | 3| > | game | 70.00 | 70.00 | 70.00 | 70.00 | 70.00 | 1| > | game | 79.99 | 79.99 | 70.00 | 149.99 | 74.995000 | 2| > | toy | 5.90 | 5.90 | 5.90 | 5.90| 5.90 | 1| > | toy | 12.99 | 12.99 | 5.90 | 18.89 | 9.445000 | 2| > 7 rows selected (0.436 seconds) > Does it seem that we can only order by the columns after partition by clause? > I do not think there are such limitation in standard SQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33116) Spark SQL window function with order by cause result incorrect
[ https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212359#comment-17212359 ] Will Du edited comment on SPARK-33116 at 10/12/20, 1:18 PM: [~maropu], the statement you mentioned is comparing queries with PARTITION BY and without PARTITION BY. But if you look at the query I provided, both of them have PARTITION BY clause. The only difference is the ORDER BY clause added or not. The expected result I think should be the same on both queries except the orders of rows (by price). was (Author: willddy): [~maropu], the statement is comparing query with PARTITION BY and without PARTITION BY. But if you look at the query I provided, both of them have PARTITION BY clause. The only difference is the ORDER BY clause added or not. The expected result I think should be the same on both queries except the orders of rows (by price). > Spark SQL window function with order by cause result incorrect > -- > > Key: SPARK-33116 > URL: https://issues.apache.org/jira/browse/SPARK-33116 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Will Du >Priority: Major > > Prepare the data > CREATE TABLE IF NOT EXISTS product_catalog ( > name STRING,category STRING,location STRING,price DECIMAL(10,2)); > INSERT OVERWRITE product_catalog VALUES > ('Nest Coffee', 'drink', 'Toronto', 15.5), > ('Pepesi', 'drink', 'Toronto', 9.99), > ('Hasimal', 'toy', 'Toronto', 5.9), > ('Fire War', 'game', 'Toronto', 70.0), > ('Final Fantasy', 'game', 'Montreal', 79.99), > ('Lego Friends 15005', 'toy', 'Montreal', 12.99), > ('Nesion Milk', 'drink', 'Montreal', 8.9); > 1. Query without ORDER BY after PARTITION BY col, the result is correct. > SELECT > category, price, > max(price) over(PARTITION BY category) as max_p, > min(price) over(PARTITION BY category) as min_p, > sum(price) over(PARTITION BY category) as sum_p, > avg(price) over(PARTITION BY category) as avg_p, > count(*) over(PARTITION BY category) as count_w > FROM > product_catalog; > || category || price || max_p || min_p || sum_p || avg_p > || count_w || > | drink | 8.90 | 15.50 | 8.90 | 34.39 | 11.46 | 3 | > | drink | 9.99 | 15.50 | 8.90 | 34.39 | 11.46 | 3 | > | drink | 15.50 | 15.50 | 8.90 | 34.39 | 11.46 | 3 | > | game | 79.99 | 79.99 | 70.00 | 149.99 | 74.995000 | 2 | > | game | 70.00 | 79.99 | 70.00 | 149.99 | 74.995000 | 2 | > | toy | 12.99 | 12.99 | 5.90 | 18.89 | 9.445000 | 2 | > | toy | 5.90 | 12.99 | 5.90 | 18.89 | 9.445000 | 2 | > 7 rows selected (0.442 seconds) > 2 Query with ORDER BY after PARTITION BY col, the result is NOT correct. Min > result is ok. Why other results are like that? > SELECT > category, price, > max(price) over(PARTITION BY category ORDER BY price) as max_p, > min(price) over(PARTITION BY category ORDER BY price) as min_p, > sum(price) over(PARTITION BY category ORDER BY price) as sum_p, > avg(price) over(PARTITION BY category ORDER BY price) as avg_p, > count(*) over(PARTITION BY category ORDER BY price) as count_w > FROM > product_catalog; > || category || price || max_p || min_p || sum_p || avg_p > || count_w || > | drink | 8.90 | 8.90 | 8.90 | 8.90| 8.90 | 1| > | drink | 9.99 | 9.99 | 8.90 | 18.89 | 9.445000 | 2| > | drink | 15.50 | 15.50 | 8.90 | 34.39 | 11.46 | 3| > | game | 70.00 | 70.00 | 70.00 | 70.00 | 70.00 | 1| > | game | 79.99 | 79.99 | 70.00 | 149.99 | 74.995000 | 2| > | toy | 5.90 | 5.90 | 5.90 | 5.90| 5.90 | 1| > | toy | 12.99 | 12.99 | 5.90 | 18.89 | 9.445000 | 2| > 7 rows selected (0.436 seconds) > Does it seem that we can only order by the columns after partition by clause? > I do not think there are such limitation in standard SQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33116) Spark SQL window function with order by cause result incorrect
[ https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212359#comment-17212359 ] Will Du commented on SPARK-33116: - [~maropu], the statement is comparing query with PARTITION BY and without PARTITION BY. But if you look at the query I provided, both of them have PARTITION BY clause. The only difference is the ORDER BY clause added or not. The expected result I think should be the same on both queries except the orders of rows (by price). > Spark SQL window function with order by cause result incorrect > -- > > Key: SPARK-33116 > URL: https://issues.apache.org/jira/browse/SPARK-33116 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Will Du >Priority: Major > > Prepare the data > CREATE TABLE IF NOT EXISTS product_catalog ( > name STRING,category STRING,location STRING,price DECIMAL(10,2)); > INSERT OVERWRITE product_catalog VALUES > ('Nest Coffee', 'drink', 'Toronto', 15.5), > ('Pepesi', 'drink', 'Toronto', 9.99), > ('Hasimal', 'toy', 'Toronto', 5.9), > ('Fire War', 'game', 'Toronto', 70.0), > ('Final Fantasy', 'game', 'Montreal', 79.99), > ('Lego Friends 15005', 'toy', 'Montreal', 12.99), > ('Nesion Milk', 'drink', 'Montreal', 8.9); > 1. Query without ORDER BY after PARTITION BY col, the result is correct. > SELECT > category, price, > max(price) over(PARTITION BY category) as max_p, > min(price) over(PARTITION BY category) as min_p, > sum(price) over(PARTITION BY category) as sum_p, > avg(price) over(PARTITION BY category) as avg_p, > count(*) over(PARTITION BY category) as count_w > FROM > product_catalog; > || category || price || max_p || min_p || sum_p || avg_p > || count_w || > | drink | 8.90 | 15.50 | 8.90 | 34.39 | 11.46 | 3 | > | drink | 9.99 | 15.50 | 8.90 | 34.39 | 11.46 | 3 | > | drink | 15.50 | 15.50 | 8.90 | 34.39 | 11.46 | 3 | > | game | 79.99 | 79.99 | 70.00 | 149.99 | 74.995000 | 2 | > | game | 70.00 | 79.99 | 70.00 | 149.99 | 74.995000 | 2 | > | toy | 12.99 | 12.99 | 5.90 | 18.89 | 9.445000 | 2 | > | toy | 5.90 | 12.99 | 5.90 | 18.89 | 9.445000 | 2 | > 7 rows selected (0.442 seconds) > 2 Query with ORDER BY after PARTITION BY col, the result is NOT correct. Min > result is ok. Why other results are like that? > SELECT > category, price, > max(price) over(PARTITION BY category ORDER BY price) as max_p, > min(price) over(PARTITION BY category ORDER BY price) as min_p, > sum(price) over(PARTITION BY category ORDER BY price) as sum_p, > avg(price) over(PARTITION BY category ORDER BY price) as avg_p, > count(*) over(PARTITION BY category ORDER BY price) as count_w > FROM > product_catalog; > || category || price || max_p || min_p || sum_p || avg_p > || count_w || > | drink | 8.90 | 8.90 | 8.90 | 8.90| 8.90 | 1| > | drink | 9.99 | 9.99 | 8.90 | 18.89 | 9.445000 | 2| > | drink | 15.50 | 15.50 | 8.90 | 34.39 | 11.46 | 3| > | game | 70.00 | 70.00 | 70.00 | 70.00 | 70.00 | 1| > | game | 79.99 | 79.99 | 70.00 | 149.99 | 74.995000 | 2| > | toy | 5.90 | 5.90 | 5.90 | 5.90| 5.90 | 1| > | toy | 12.99 | 12.99 | 5.90 | 18.89 | 9.445000 | 2| > 7 rows selected (0.436 seconds) > Does it seem that we can only order by the columns after partition by clause? > I do not think there are such limitation in standard SQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33103) Custom Schema with Custom RDD reorders columns when more than 4 added
[ https://issues.apache.org/jira/browse/SPARK-33103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212356#comment-17212356 ] Justin Mays commented on SPARK-33103: - I tried running against Spark 2.4.7 and got the same behavior. Will work with 4 columns, but when adding a Fifth i get the same error: 09:13:15.661 INFO org.apache.spark.scheduler.DAGScheduler - ResultStage 0 (show at SparkTest.java:92) failed in 0.696 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.Long09:13:15.661 INFO org.apache.spark.scheduler.DAGScheduler - ResultStage 0 (show at SparkTest.java:92) failed in 0.696 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.Long at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107) at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getLong(rows.scala:42) at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getLong$(rows.scala:42) at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getLong(rows.scala:195) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:636) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:255) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:858) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:858) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:123) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Driver stacktrace:09:13:15.670 INFO org.apache.spark.scheduler.DAGScheduler - Job 0 failed: show at SparkTest.java:92, took 0.772664 s > Custom Schema with Custom RDD reorders columns when more than 4 added > - > > Key: SPARK-33103 > URL: https://issues.apache.org/jira/browse/SPARK-33103 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 > Environment: Java Application >Reporter: Justin Mays >Priority: Major > > I have a custom RDD written in Java that uses a custom schema. Everything > appears to work fine with using 4 columns, but when i add a 5th column, > calling show() fails with > java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: > java.lang.Long is not a valid external type for schema of > here is the schema definition in java: > StructType schema = new StructType() StructType schema = new StructType() > .add("recordId", DataTypes.LongType, false) .add("col1", > DataTypes.DoubleType, false) .add("col2", DataTypes.DoubleType, false) > .add("col3", DataTypes.IntegerType, false) .add("col4", > DataTypes.IntegerType, false); > > Here is the printout of schema.printTreeString(); > == Physical Plan == > *(1) Scan dw [recordId#0L,col1#1,col2#2,col3#3,col4#4] PushedFilters: [], > ReadSchema: struct > > I hardcoded a return in my Row object with values matching the schema: > @Override @Override public Object get(int i) \{ switch(i) { case 0: return > 0L; case 1: return 1.1911950001644689D; case 2: return 9.10949955666E9D; > case 3: return 476; case 4: return 500; } return 0L; } > > Here is the output of the show command: > 15:30:26.875 ERROR org.apache.spark.executor.Executor - Exception in task 0.0 > in stage 0.0 (TID 0)15:30:26.875 ERROR org.apache.spark.executor.Executor - > Exception in task 0.0 in stage 0
[jira] [Commented] (SPARK-32229) Application entry parsing fails because DriverWrapper registered instead of the normal driver
[ https://issues.apache.org/jira/browse/SPARK-32229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212327#comment-17212327 ] Gabor Somogyi commented on SPARK-32229: --- Started to work on this. > Application entry parsing fails because DriverWrapper registered instead of > the normal driver > - > > Key: SPARK-32229 > URL: https://issues.apache.org/jira/browse/SPARK-32229 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > > In some cases DriverWrapper registered by DriverRegistry which causes > exception in PostgresConnectionProvider: > https://github.com/apache/spark/blob/371b35d2e0ab08ebd853147c6673de3adfad0553/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/DriverRegistry.scala#L53 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33119) ScalarSubquery should returns the first two rows to avoid Driver OOM
[ https://issues.apache.org/jira/browse/SPARK-33119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-33119: Summary: ScalarSubquery should returns the first two rows to avoid Driver OOM (was: Only return the first two rows to avoid Driver OOM) > ScalarSubquery should returns the first two rows to avoid Driver OOM > - > > Key: SPARK-33119 > URL: https://issues.apache.org/jira/browse/SPARK-33119 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > {noformat} > Exception in thread "subquery-2871" java.lang.OutOfMemoryError: Requested > array size exceeds VM limit > at > scala.collection.mutable.ResizableArray$class.ensureSize(ResizableArray.scala:103) > at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:48) > at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:84) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1$$anonfun$apply$2.apply(SparkPlan.scala:352) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1$$anonfun$apply$2.apply(SparkPlan.scala:352) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at > org.apache.spark.sql.execution.SparkPlan$$anon$1.foreach(SparkPlan.scala:330) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:352) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:351) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:351) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:274) > at > org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1$$anonfun$apply$3.apply(basicPhysicalOperators.scala:830) > at > org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1$$anonfun$apply$3.apply(basicPhysicalOperators.scala:827) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withExecutionId$1.apply(SQLExecution.scala:132) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:156) > at > org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:129) > at > org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1.apply(basicPhysicalOperators.scala:827) > at > org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1.apply(basicPhysicalOperators.scala:827) > at scala. > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33119) Only return the first two rows to avoid Driver OOM
[ https://issues.apache.org/jira/browse/SPARK-33119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212300#comment-17212300 ] Apache Spark commented on SPARK-33119: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/30016 > Only return the first two rows to avoid Driver OOM > -- > > Key: SPARK-33119 > URL: https://issues.apache.org/jira/browse/SPARK-33119 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > {noformat} > Exception in thread "subquery-2871" java.lang.OutOfMemoryError: Requested > array size exceeds VM limit > at > scala.collection.mutable.ResizableArray$class.ensureSize(ResizableArray.scala:103) > at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:48) > at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:84) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1$$anonfun$apply$2.apply(SparkPlan.scala:352) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1$$anonfun$apply$2.apply(SparkPlan.scala:352) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at > org.apache.spark.sql.execution.SparkPlan$$anon$1.foreach(SparkPlan.scala:330) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:352) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:351) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:351) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:274) > at > org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1$$anonfun$apply$3.apply(basicPhysicalOperators.scala:830) > at > org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1$$anonfun$apply$3.apply(basicPhysicalOperators.scala:827) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withExecutionId$1.apply(SQLExecution.scala:132) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:156) > at > org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:129) > at > org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1.apply(basicPhysicalOperators.scala:827) > at > org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1.apply(basicPhysicalOperators.scala:827) > at scala. > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33119) Only return the first two rows to avoid Driver OOM
[ https://issues.apache.org/jira/browse/SPARK-33119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33119: Assignee: Apache Spark > Only return the first two rows to avoid Driver OOM > -- > > Key: SPARK-33119 > URL: https://issues.apache.org/jira/browse/SPARK-33119 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > {noformat} > Exception in thread "subquery-2871" java.lang.OutOfMemoryError: Requested > array size exceeds VM limit > at > scala.collection.mutable.ResizableArray$class.ensureSize(ResizableArray.scala:103) > at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:48) > at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:84) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1$$anonfun$apply$2.apply(SparkPlan.scala:352) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1$$anonfun$apply$2.apply(SparkPlan.scala:352) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at > org.apache.spark.sql.execution.SparkPlan$$anon$1.foreach(SparkPlan.scala:330) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:352) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:351) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:351) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:274) > at > org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1$$anonfun$apply$3.apply(basicPhysicalOperators.scala:830) > at > org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1$$anonfun$apply$3.apply(basicPhysicalOperators.scala:827) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withExecutionId$1.apply(SQLExecution.scala:132) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:156) > at > org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:129) > at > org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1.apply(basicPhysicalOperators.scala:827) > at > org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1.apply(basicPhysicalOperators.scala:827) > at scala. > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33119) Only return the first two rows to avoid Driver OOM
[ https://issues.apache.org/jira/browse/SPARK-33119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33119: Assignee: (was: Apache Spark) > Only return the first two rows to avoid Driver OOM > -- > > Key: SPARK-33119 > URL: https://issues.apache.org/jira/browse/SPARK-33119 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > {noformat} > Exception in thread "subquery-2871" java.lang.OutOfMemoryError: Requested > array size exceeds VM limit > at > scala.collection.mutable.ResizableArray$class.ensureSize(ResizableArray.scala:103) > at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:48) > at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:84) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1$$anonfun$apply$2.apply(SparkPlan.scala:352) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1$$anonfun$apply$2.apply(SparkPlan.scala:352) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at > org.apache.spark.sql.execution.SparkPlan$$anon$1.foreach(SparkPlan.scala:330) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:352) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:351) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:351) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:274) > at > org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1$$anonfun$apply$3.apply(basicPhysicalOperators.scala:830) > at > org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1$$anonfun$apply$3.apply(basicPhysicalOperators.scala:827) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withExecutionId$1.apply(SQLExecution.scala:132) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:156) > at > org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:129) > at > org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1.apply(basicPhysicalOperators.scala:827) > at > org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1.apply(basicPhysicalOperators.scala:827) > at scala. > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33119) Only return the first two rows to avoid Driver OOM
[ https://issues.apache.org/jira/browse/SPARK-33119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212299#comment-17212299 ] Apache Spark commented on SPARK-33119: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/30016 > Only return the first two rows to avoid Driver OOM > -- > > Key: SPARK-33119 > URL: https://issues.apache.org/jira/browse/SPARK-33119 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > {noformat} > Exception in thread "subquery-2871" java.lang.OutOfMemoryError: Requested > array size exceeds VM limit > at > scala.collection.mutable.ResizableArray$class.ensureSize(ResizableArray.scala:103) > at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:48) > at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:84) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1$$anonfun$apply$2.apply(SparkPlan.scala:352) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1$$anonfun$apply$2.apply(SparkPlan.scala:352) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at > org.apache.spark.sql.execution.SparkPlan$$anon$1.foreach(SparkPlan.scala:330) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:352) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:351) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:351) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:274) > at > org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1$$anonfun$apply$3.apply(basicPhysicalOperators.scala:830) > at > org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1$$anonfun$apply$3.apply(basicPhysicalOperators.scala:827) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withExecutionId$1.apply(SQLExecution.scala:132) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:156) > at > org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:129) > at > org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1.apply(basicPhysicalOperators.scala:827) > at > org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1.apply(basicPhysicalOperators.scala:827) > at scala. > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33119) Only return the first two rows to avoid Driver OOM
Yuming Wang created SPARK-33119: --- Summary: Only return the first two rows to avoid Driver OOM Key: SPARK-33119 URL: https://issues.apache.org/jira/browse/SPARK-33119 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Yuming Wang {noformat} Exception in thread "subquery-2871" java.lang.OutOfMemoryError: Requested array size exceeds VM limit at scala.collection.mutable.ResizableArray$class.ensureSize(ResizableArray.scala:103) at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:84) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1$$anonfun$apply$2.apply(SparkPlan.scala:352) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1$$anonfun$apply$2.apply(SparkPlan.scala:352) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at org.apache.spark.sql.execution.SparkPlan$$anon$1.foreach(SparkPlan.scala:330) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:352) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:351) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:351) at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:274) at org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1$$anonfun$apply$3.apply(basicPhysicalOperators.scala:830) at org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1$$anonfun$apply$3.apply(basicPhysicalOperators.scala:827) at org.apache.spark.sql.execution.SQLExecution$$anonfun$withExecutionId$1.apply(SQLExecution.scala:132) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:156) at org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:129) at org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1.apply(basicPhysicalOperators.scala:827) at org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1.apply(basicPhysicalOperators.scala:827) at scala. {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28099) Assertion when querying unpartitioned Hive table with partition-like naming
[ https://issues.apache.org/jira/browse/SPARK-28099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212281#comment-17212281 ] Zsombor Fedor commented on SPARK-28099: --- It is not ORC specific: val testData = List(1,2,3,4,5) val dataFrame = testData.toDF() dataFrame .coalesce(1) .write .format("parquet") .save("user/hive/warehouse/test/dir1=1/") spark.sql("CREATE EXTERNAL TABLE test (val INT) STORED AS PARQUET LOCATION '/user/hive/warehouse/test/'") val queryResponse = spark.sql("SELECT * FROM test") //java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:156) // at org.apache.spark.sql.hive.HiveMetastoreCatalog.convertToLogicalRelation(HiveMetastoreCatalog.scala:214) > Assertion when querying unpartitioned Hive table with partition-like naming > --- > > Key: SPARK-28099 > URL: https://issues.apache.org/jira/browse/SPARK-28099 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Douglas Drinka >Priority: Major > > {code:java} > val testData = List(1,2,3,4,5) > val dataFrame = testData.toDF() > dataFrame > .coalesce(1) > .write > .mode(SaveMode.Overwrite) > .format("orc") > .option("compression", "zlib") > .save("s3://ddrinka.sparkbug/testFail/dir1=1/dir2=2/") > spark.sql("DROP TABLE IF EXISTS ddrinka_sparkbug.testFail") > spark.sql("CREATE EXTERNAL TABLE ddrinka_sparkbug.testFail (val INT) STORED > AS ORC LOCATION 's3://ddrinka.sparkbug/testFail/'") > val queryResponse = spark.sql("SELECT * FROM ddrinka_sparkbug.testFail") > //Throws AssertionError > //at > org.apache.spark.sql.hive.HiveMetastoreCatalog.convertToLogicalRelation(HiveMetastoreCatalog.scala:214){code} > It looks like the native ORC reader is creating virtual columns named dir1 > and dir2, which don't exist in the Hive table. [The > assertion|[https://github.com/apache/spark/blob/c0297dedd829a92cca920ab8983dab399f8f32d5/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L257]] > is checking that the number of columns match, which fails due to the virtual > partition columns. > Actually getting data back from this query will be dependent on SPARK-28098, > supporting subdirectories for Hive queries at all. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32704) Logging plan changes for execution
[ https://issues.apache.org/jira/browse/SPARK-32704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212273#comment-17212273 ] Apache Spark commented on SPARK-32704: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/30015 > Logging plan changes for execution > -- > > Key: SPARK-32704 > URL: https://issues.apache.org/jira/browse/SPARK-32704 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Minor > Fix For: 3.1.0 > > > Since we only log plan changes for analyzer/optimizer now, this ticket > targets adding code to log plan changes in the preparation phase in > QueryExecution for execution. > {code} > scala> spark.sql("SET spark.sql.optimizer.planChangeLog.level=WARN") > scala> spark.range(10).groupBy("id").count().queryExecution.executedPlan > ... > 20/08/26 09:32:36 WARN PlanChangeLogger: > === Applying Rule org.apache.spark.sql.execution.CollapseCodegenStages === > !HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, > count#23L]) *(1) HashAggregate(keys=[id#19L], > functions=[count(1)], output=[id#19L, count#23L]) > !+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], > output=[id#19L, count#27L]) +- *(1) HashAggregate(keys=[id#19L], > functions=[partial_count(1)], output=[id#19L, count#27L]) > ! +- Range (0, 10, step=1, splits=4) > +- *(1) Range (0, 10, step=1, splits=4) > > 20/08/26 09:32:36 WARN PlanChangeLogger: > === Result of Batch Preparations === > !HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, > count#23L]) *(1) HashAggregate(keys=[id#19L], > functions=[count(1)], output=[id#19L, count#23L]) > !+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], > output=[id#19L, count#27L]) +- *(1) HashAggregate(keys=[id#19L], > functions=[partial_count(1)], output=[id#19L, count#27L]) > ! +- Range (0, 10, step=1, splits=4) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32989) Performance regression when selecting from str_to_map
[ https://issues.apache.org/jira/browse/SPARK-32989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-32989. -- Fix Version/s: 3.1.0 Assignee: L. C. Hsieh Resolution: Fixed > Performance regression when selecting from str_to_map > - > > Key: SPARK-32989 > URL: https://issues.apache.org/jira/browse/SPARK-32989 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Ondrej Kokes >Assignee: L. C. Hsieh >Priority: Minor > Fix For: 3.1.0 > > > When I create a map using str_to_map and select more than a single value, I > notice a notable performance regression in 3.0.1 compared to 2.4.7. When > selecting a single value, the performance is the same. Plans are identical > between versions. > It seems like in 2.x the map from str_to_map is preserved for a given row, > but in 3.x it's recalculated for each column. One hint that it might be the > case is that when I tried forcing materialisation of said map in 3.x (by a > coalesce, don't know if there's a better way), I got the performance roughly > to 2.x levels. > Here's a reproducer (the csv in question gets autogenerated by the python > code): > {code:java} > $ head regression.csv > foo > foo=bar&baz=bak&bar=foo > foo=bar&baz=bak&bar=foo > foo=bar&baz=bak&bar=foo > foo=bar&baz=bak&bar=foo > foo=bar&baz=bak&bar=foo > ... (10M more rows) > {code} > {code:python} > import time > import os > import pyspark > from pyspark.sql import SparkSession > import pyspark.sql.functions as f > if __name__ == '__main__': > print(pyspark.__version__) > spark = SparkSession.builder.getOrCreate() > filename = 'regression.csv' > if not os.path.isfile(filename): > with open(filename, 'wt') as fw: > fw.write('foo\n') > for _ in range(10_000_000): > fw.write('foo=bar&baz=bak&bar=foo\n') > df = spark.read.option('header', True).csv(filename) > t = time.time() > dd = (df > .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")')) > .select( > f.col('my_map')['foo'], > ) > ) > dd.write.mode('overwrite').csv('tmp') > t2 = time.time() > print('selected one', t2 - t) > dd = (df > .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")')) > # .coalesce(100) # forcing evaluation before selection speeds it > up in 3.0.1 > .select( > f.col('my_map')['foo'], > f.col('my_map')['bar'], > f.col('my_map')['baz'], > ) > ) > dd.explain(True) > dd.write.mode('overwrite').csv('tmp') > t3 = time.time() > print('selected three', t3 - t2) > {code} > Results for 2.4.7 and 3.0.1, both installed from PyPI, Python 3.7, macOS > (times are in seconds) > {code:java} > # 3.0.1 > # selected one 6.375471830368042 > > # selected three 14.847578048706055 > # 2.4.7 > # selected one 6.679579019546509 > > # selected three 6.5622029304504395 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33118) CREATE TEMPORARY TABLE fails with location
[ https://issues.apache.org/jira/browse/SPARK-33118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212253#comment-17212253 ] Apache Spark commented on SPARK-33118: -- User 'planga82' has created a pull request for this issue: https://github.com/apache/spark/pull/30014 > CREATE TEMPORARY TABLE fails with location > -- > > Key: SPARK-33118 > URL: https://issues.apache.org/jira/browse/SPARK-33118 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Pablo Langa Blanco >Priority: Major > > The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION > > {code:java} > spark.range(3).write.parquet("/data/tmp/testspark1") > spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path > '/data/tmp/testspark1')") > spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION > '/data/tmp/testspark1'") > {code} > The error message in both cases is > {code:java} > org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. > It must be specified manually.; > at > org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408) > at > org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616) > at org.apache.spark.sql.Dataset.(Dataset.scala:229) > at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) > at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33118) CREATE TEMPORARY TABLE fails with location
[ https://issues.apache.org/jira/browse/SPARK-33118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33118: Assignee: Apache Spark > CREATE TEMPORARY TABLE fails with location > -- > > Key: SPARK-33118 > URL: https://issues.apache.org/jira/browse/SPARK-33118 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Pablo Langa Blanco >Assignee: Apache Spark >Priority: Major > > The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION > > {code:java} > spark.range(3).write.parquet("/data/tmp/testspark1") > spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path > '/data/tmp/testspark1')") > spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION > '/data/tmp/testspark1'") > {code} > The error message in both cases is > {code:java} > org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. > It must be specified manually.; > at > org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408) > at > org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616) > at org.apache.spark.sql.Dataset.(Dataset.scala:229) > at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) > at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33118) CREATE TEMPORARY TABLE fails with location
[ https://issues.apache.org/jira/browse/SPARK-33118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33118: Assignee: (was: Apache Spark) > CREATE TEMPORARY TABLE fails with location > -- > > Key: SPARK-33118 > URL: https://issues.apache.org/jira/browse/SPARK-33118 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Pablo Langa Blanco >Priority: Major > > The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION > > {code:java} > spark.range(3).write.parquet("/data/tmp/testspark1") > spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path > '/data/tmp/testspark1')") > spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION > '/data/tmp/testspark1'") > {code} > The error message in both cases is > {code:java} > org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. > It must be specified manually.; > at > org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408) > at > org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616) > at org.apache.spark.sql.Dataset.(Dataset.scala:229) > at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) > at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33118) CREATE TEMPORARY TABLE fails with location
[ https://issues.apache.org/jira/browse/SPARK-33118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33118: Assignee: Apache Spark > CREATE TEMPORARY TABLE fails with location > -- > > Key: SPARK-33118 > URL: https://issues.apache.org/jira/browse/SPARK-33118 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Pablo Langa Blanco >Assignee: Apache Spark >Priority: Major > > The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION > > {code:java} > spark.range(3).write.parquet("/data/tmp/testspark1") > spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path > '/data/tmp/testspark1')") > spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION > '/data/tmp/testspark1'") > {code} > The error message in both cases is > {code:java} > org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. > It must be specified manually.; > at > org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408) > at > org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616) > at org.apache.spark.sql.Dataset.(Dataset.scala:229) > at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) > at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33118) CREATE TEMPORARY TABLE fails with location
[ https://issues.apache.org/jira/browse/SPARK-33118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212242#comment-17212242 ] Pablo Langa Blanco commented on SPARK-33118: I'm working on it > CREATE TEMPORARY TABLE fails with location > -- > > Key: SPARK-33118 > URL: https://issues.apache.org/jira/browse/SPARK-33118 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Pablo Langa Blanco >Priority: Major > > The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION > > {code:java} > spark.range(3).write.parquet("/data/tmp/testspark1") > spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path > '/data/tmp/testspark1')") > spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION > '/data/tmp/testspark1'") > {code} > The error message in both cases is > {code:java} > org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. > It must be specified manually.; > at > org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408) > at > org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616) > at org.apache.spark.sql.Dataset.(Dataset.scala:229) > at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) > at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33118) CREATE TEMPORARY TABLE fails with location
[ https://issues.apache.org/jira/browse/SPARK-33118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pablo Langa Blanco updated SPARK-33118: --- Description: The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION {code:java} spark.range(3).write.parquet("/data/tmp/testspark1") spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path '/data/tmp/testspark1')") spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION '/data/tmp/testspark1'") {code} The error message in both cases is {code:java} org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.; at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408) at org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616) at org.apache.spark.sql.Dataset.(Dataset.scala:229) at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602) {code} was: The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION {code:java} spark.range(3).write.parquet("/data/tmp/testspark1") spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path '/data/tmp/testspark1')") spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION '/data/tmp/testspark1'") {code} The error message in both cases is {code:java} org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.; at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408) at org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616) at org.apache.spark.sql.Dataset.(Dataset.scala:229) at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607) at org.apache.spark.sql.SparkSession.w
[jira] [Created] (SPARK-33118) CREATE TEMPORARY TABLE fails with location
Pablo Langa Blanco created SPARK-33118: -- Summary: CREATE TEMPORARY TABLE fails with location Key: SPARK-33118 URL: https://issues.apache.org/jira/browse/SPARK-33118 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.1, 3.0.0 Reporter: Pablo Langa Blanco The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION {code:java} spark.range(3).write.parquet("/data/tmp/testspark1") spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path '/data/tmp/testspark1')") spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION '/data/tmp/testspark1'") {code} The error message in both cases is {code:java} org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.; at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408) at org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616) at org.apache.spark.sql.Dataset.(Dataset.scala:229) at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33090) Upgrade Google Guava
[ https://issues.apache.org/jira/browse/SPARK-33090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212240#comment-17212240 ] angerszhu commented on SPARK-33090: --- Meet this problem these days too and spark build distribution will bring guava jar, make. a lot problem > Upgrade Google Guava > > > Key: SPARK-33090 > URL: https://issues.apache.org/jira/browse/SPARK-33090 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.1 >Reporter: Stephen Coy >Priority: Major > > Hadoop versions newer than 3.2.0 (such as 3.2.1 and 3.3.0) have started using > features from newer versions of Google Guava. > This leads to MethodNotFound exceptions, etc in Spark builds that specify > newer versions of Hadoop. I believe this is due to the use of new methods in > com.google.common.base.Preconditions. > The above versions of Hadoop use guava-27.0-jre, whereas Spark is currently > glued to guava-14.0.1. > I have been running a Spark cluster with the version bumped to guava-29.0-jre > without issue. > Partly due to the way Spark is built, this change is a little more > complicated that just changing the version, because newer versions of guava > have a new dependency on com.google.guava:failureaccess:1.0. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32455) LogisticRegressionModel prediction optimization
[ https://issues.apache.org/jira/browse/SPARK-32455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212227#comment-17212227 ] Apache Spark commented on SPARK-32455: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/30013 > LogisticRegressionModel prediction optimization > --- > > Key: SPARK-32455 > URL: https://issues.apache.org/jira/browse/SPARK-32455 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Fix For: 3.1.0 > > > if needed, method getThreshold and/or following logic to compute rawThreshold > is called on each instance. > > {code:java} > override def getThreshold: Double = { > checkThresholdConsistency() > if (isSet(thresholds)) { > val ts = $(thresholds) > require(ts.length == 2, "Logistic Regression getThreshold only applies > to" + > " binary classification, but thresholds has length != 2. thresholds: " > + ts.mkString(",")) > 1.0 / (1.0 + ts(0) / ts(1)) > } else { > $(threshold) > } > } {code} > > {code:java} > val rawThreshold = if (t == 0.0) { > Double.NegativeInfinity > } else if (t == 1.0) { > Double.PositiveInfinity > } else { > math.log(t / (1.0 - t)) > } {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32455) LogisticRegressionModel prediction optimization
[ https://issues.apache.org/jira/browse/SPARK-32455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212226#comment-17212226 ] Apache Spark commented on SPARK-32455: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/30013 > LogisticRegressionModel prediction optimization > --- > > Key: SPARK-32455 > URL: https://issues.apache.org/jira/browse/SPARK-32455 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Fix For: 3.1.0 > > > if needed, method getThreshold and/or following logic to compute rawThreshold > is called on each instance. > > {code:java} > override def getThreshold: Double = { > checkThresholdConsistency() > if (isSet(thresholds)) { > val ts = $(thresholds) > require(ts.length == 2, "Logistic Regression getThreshold only applies > to" + > " binary classification, but thresholds has length != 2. thresholds: " > + ts.mkString(",")) > 1.0 / (1.0 + ts(0) / ts(1)) > } else { > $(threshold) > } > } {code} > > {code:java} > val rawThreshold = if (t == 0.0) { > Double.NegativeInfinity > } else if (t == 1.0) { > Double.PositiveInfinity > } else { > math.log(t / (1.0 - t)) > } {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33092) Support subexpression elimination in ProjectExec
[ https://issues.apache.org/jira/browse/SPARK-33092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-33092. -- Fix Version/s: 3.1.0 Resolution: Fixed Resolved by https://github.com/apache/spark/pull/29975 > Support subexpression elimination in ProjectExec > > > Key: SPARK-33092 > URL: https://issues.apache.org/jira/browse/SPARK-33092 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.1.0 > > > Users frequently write repeatedly expression in projection. Currently in > ProjectExec, we don't support subexpression elimination in Whole-stage > codegen. We can support it to reduce redundant evaluation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24930) Exception information is not accurate when using `LOAD DATA LOCAL INPATH`
[ https://issues.apache.org/jira/browse/SPARK-24930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212205#comment-17212205 ] Takeshi Yamamuro commented on SPARK-24930: -- hm, I see. Could you check the resolved versions, e.g., 2.4.7, 3.0.1, ... > Exception information is not accurate when using `LOAD DATA LOCAL INPATH` > -- > > Key: SPARK-24930 > URL: https://issues.apache.org/jira/browse/SPARK-24930 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: Xiaochen Ouyang >Priority: Minor > > # root user create a test.txt file contains a record '123' in /root/ > directory > # switch mr user to execute spark-shell --master local > {code:java} > scala> spark.version > res2: String = 2.2.1 > scala> spark.sql("create table t1(id int) partitioned by(area string)"); > 2018-07-26 17:20:37,523 WARN org.apache.hadoop.hive.metastore.HiveMetaStore: > Location: hdfs://nameservice/spark/t1 specified for non-external table:t1 > res4: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("load data local inpath '/root/test.txt' into table t1 > partition(area ='025')") > org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: > /root/test.txt; > at > org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:339) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:639) > ... 48 elided > scala> > {code} > In fact, the input path exists, but the mr user does not have permission to > access the directory `/root/` ,so the message throwed by `AnalysisException` > can confuse user. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24930) Exception information is not accurate when using `LOAD DATA LOCAL INPATH`
[ https://issues.apache.org/jira/browse/SPARK-24930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212201#comment-17212201 ] angerszhu commented on SPARK-24930: --- No, form message I put in comment, seems Hadoop method handle this. > Exception information is not accurate when using `LOAD DATA LOCAL INPATH` > -- > > Key: SPARK-24930 > URL: https://issues.apache.org/jira/browse/SPARK-24930 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: Xiaochen Ouyang >Priority: Minor > > # root user create a test.txt file contains a record '123' in /root/ > directory > # switch mr user to execute spark-shell --master local > {code:java} > scala> spark.version > res2: String = 2.2.1 > scala> spark.sql("create table t1(id int) partitioned by(area string)"); > 2018-07-26 17:20:37,523 WARN org.apache.hadoop.hive.metastore.HiveMetaStore: > Location: hdfs://nameservice/spark/t1 specified for non-external table:t1 > res4: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("load data local inpath '/root/test.txt' into table t1 > partition(area ='025')") > org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: > /root/test.txt; > at > org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:339) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:639) > ... 48 elided > scala> > {code} > In fact, the input path exists, but the mr user does not have permission to > access the directory `/root/` ,so the message throwed by `AnalysisException` > can confuse user. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31294) Benchmark the performance regression
[ https://issues.apache.org/jira/browse/SPARK-31294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212199#comment-17212199 ] Maxim Gekk commented on SPARK-31294: [~cloud_fan] Could you close this ticket since the PR was merged. > Benchmark the performance regression > > > Key: SPARK-31294 > URL: https://issues.apache.org/jira/browse/SPARK-31294 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24930) Exception information is not accurate when using `LOAD DATA LOCAL INPATH`
[ https://issues.apache.org/jira/browse/SPARK-24930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212196#comment-17212196 ] Takeshi Yamamuro commented on SPARK-24930: -- Do you know which pr resolves this? (it'd be better to link this Jira to it if possible) > Exception information is not accurate when using `LOAD DATA LOCAL INPATH` > -- > > Key: SPARK-24930 > URL: https://issues.apache.org/jira/browse/SPARK-24930 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: Xiaochen Ouyang >Priority: Minor > > # root user create a test.txt file contains a record '123' in /root/ > directory > # switch mr user to execute spark-shell --master local > {code:java} > scala> spark.version > res2: String = 2.2.1 > scala> spark.sql("create table t1(id int) partitioned by(area string)"); > 2018-07-26 17:20:37,523 WARN org.apache.hadoop.hive.metastore.HiveMetaStore: > Location: hdfs://nameservice/spark/t1 specified for non-external table:t1 > res4: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("load data local inpath '/root/test.txt' into table t1 > partition(area ='025')") > org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: > /root/test.txt; > at > org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:339) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:639) > ... 48 elided > scala> > {code} > In fact, the input path exists, but the mr user does not have permission to > access the directory `/root/` ,so the message throwed by `AnalysisException` > can confuse user. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33117) Update zstd-jni to 1.4.5-6
[ https://issues.apache.org/jira/browse/SPARK-33117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33117. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30010 [https://github.com/apache/spark/pull/30010] > Update zstd-jni to 1.4.5-6 > -- > > Key: SPARK-33117 > URL: https://issues.apache.org/jira/browse/SPARK-33117 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33117) Update zstd-jni to 1.4.5-6
[ https://issues.apache.org/jira/browse/SPARK-33117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33117: - Assignee: Dongjoon Hyun > Update zstd-jni to 1.4.5-6 > -- > > Key: SPARK-33117 > URL: https://issues.apache.org/jira/browse/SPARK-33117 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24930) Exception information is not accurate when using `LOAD DATA LOCAL INPATH`
[ https://issues.apache.org/jira/browse/SPARK-24930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212189#comment-17212189 ] angerszhu commented on SPARK-24930: --- cc [~maropu] Hey, for this jira, I check again and seems solved by other pr, can I mark it as resolved or ping your committer ? > Exception information is not accurate when using `LOAD DATA LOCAL INPATH` > -- > > Key: SPARK-24930 > URL: https://issues.apache.org/jira/browse/SPARK-24930 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: Xiaochen Ouyang >Priority: Minor > > # root user create a test.txt file contains a record '123' in /root/ > directory > # switch mr user to execute spark-shell --master local > {code:java} > scala> spark.version > res2: String = 2.2.1 > scala> spark.sql("create table t1(id int) partitioned by(area string)"); > 2018-07-26 17:20:37,523 WARN org.apache.hadoop.hive.metastore.HiveMetaStore: > Location: hdfs://nameservice/spark/t1 specified for non-external table:t1 > res4: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("load data local inpath '/root/test.txt' into table t1 > partition(area ='025')") > org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: > /root/test.txt; > at > org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:339) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:639) > ... 48 elided > scala> > {code} > In fact, the input path exists, but the mr user does not have permission to > access the directory `/root/` ,so the message throwed by `AnalysisException` > can confuse user. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24930) Exception information is not accurate when using `LOAD DATA LOCAL INPATH`
[ https://issues.apache.org/jira/browse/SPARK-24930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212188#comment-17212188 ] angerszhu edited comment on SPARK-24930 at 10/12/20, 7:25 AM: -- In current version, error message spark-sql> load data local inpath '/home/hadoop/spark-3.1.0/test.txt' into table t1spark-sql> load data local inpath '/home/hadoop/spark-3.1.0/test.txt' into table t1 > ;20/10/12 15:20:26 ERROR Hive: Failed to move: java.io.FileNotFoundException: /home/hadoop/spark-3.1.0/test.txt (Permission denied)Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.FileNotFoundException: /home/hadoop/spark-3.1.0/test.txt (Permission denied);org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.FileNotFoundException: /home/hadoop/spark-3.1.0/test.txt (Permission denied); at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:113) at org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:878) at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.loadTable(ExternalCatalogWithListener.scala:167) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadTable(SessionCatalog.scala:520) at org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:390) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3675) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:769) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3673) at org.apache.spark.sql.Dataset.(Dataset.scala:229) at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:769) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:612) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:769) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:607) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:650) at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:378) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:497) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:491) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:283) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:934) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1013) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1022) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.FileNotFoundException: /home/hadoop/spark-3.1.0/test.txt