[jira] [Issue Comment Deleted] (SPARK-33090) Upgrade Google Guava

2020-10-12 Thread Stephen Coy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen Coy updated SPARK-33090:

Comment: was deleted

(was: Created PR)

> Upgrade Google Guava
> 
>
> Key: SPARK-33090
> URL: https://issues.apache.org/jira/browse/SPARK-33090
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.1
>Reporter: Stephen Coy
>Priority: Major
>
> Hadoop versions newer than 3.2.0 (such as 3.2.1 and 3.3.0) have started using 
> features from newer versions of Google Guava.
> This leads to MethodNotFound exceptions, etc., in Spark builds that specify 
> newer versions of Hadoop. I believe this is due to the use of new methods in 
> com.google.common.base.Preconditions.
> The above versions of Hadoop use guava-27.0-jre, whereas Spark is currently 
> pinned to guava-14.0.1.
> I have been running a Spark cluster with the version bumped to guava-29.0-jre 
> without issue.
> Partly due to the way Spark is built, this change is a little more 
> complicated than just changing the version, because newer versions of Guava 
> have a new dependency on com.google.guava:failureaccess:1.0.
>  
>  
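As a hedged aside (not from the ticket): when experimenting with such a version bump, it can help to confirm which Guava jar actually wins on the classpath. The sketch below uses only standard JDK reflection; the object name and the manifest-based version lookup are illustrative.
{code:java}
object GuavaClasspathCheck {
  def main(args: Array[String]): Unit = {
    // Which concrete class file backs Preconditions tells us which Guava jar won.
    val clazz = classOf[com.google.common.base.Preconditions]
    val location = Option(clazz.getProtectionDomain.getCodeSource)
      .flatMap(cs => Option(cs.getLocation))
      .map(_.toString)
      .getOrElse("unknown (possibly shaded or relocated)")
    // The jar manifest usually carries the implementation version (e.g. 14.0.1 or 29.0-jre).
    val version = Option(clazz.getPackage)
      .flatMap(p => Option(p.getImplementationVersion))
      .getOrElse("unknown")
    println(s"Guava Preconditions loaded from: $location (version $version)")
  }
}
{code}
Running this from the driver (and, if needed, inside a task) is a quick way to see whether guava-14.0.1 or a newer jar is being picked up.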



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33071) Join with ambiguous column succeeding but giving wrong output

2020-10-12 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212862#comment-17212862
 ] 

angerszhu commented on SPARK-33071:
---

It is wrong. I am working on this and will raise a PR.

> Join with ambiguous column succeeding but giving wrong output
> -
>
> Key: SPARK-33071
> URL: https://issues.apache.org/jira/browse/SPARK-33071
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.1
>Reporter: George
>Priority: Major
>  Labels: correctness
>
> When joining two datasets where one column in each dataset is sourced from 
> the same input dataset, the join runs successfully but does not select the 
> correct columns, leading to incorrect output.
> Repro using pyspark:
> {code:java}
> sc.version
> import pyspark.sql.functions as F
> d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, 'units' 
> : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', 'sales': 1, 
> 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}]
> input_df = spark.createDataFrame(d)
> df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales'))
> df2 = input_df.groupBy("key").agg(F.sum('units').alias('units'))
> df1 = df1.filter(F.col("key") != F.lit("c"))
> df2 = df2.filter(F.col("key") != F.lit("d"))
> ret = df1.join(df2, df1.key == df2.key, "full").select(
> df1["key"].alias("df1_key"),
> df2["key"].alias("df2_key"),
> df1["sales"],
> df2["units"],
> F.coalesce(df1["key"], df2["key"]).alias("key"))
> ret.show()
> ret.explain(){code}
> output for 2.4.4:
> {code:java}
> >>> sc.version
> u'2.4.4'
> >>> import pyspark.sql.functions as F
> >>> d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, 
> >>> 'units' : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', 
> >>> 'sales': 1, 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}]
> >>> input_df = spark.createDataFrame(d)
> >>> df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales'))
> >>> df2 = input_df.groupBy("key").agg(F.sum('units').alias('units'))
> >>> df1 = df1.filter(F.col("key") != F.lit("c"))
> >>> df2 = df2.filter(F.col("key") != F.lit("d"))
> >>> ret = df1.join(df2, df1.key == df2.key, "full").select(
> ... df1["key"].alias("df1_key"),
> ... df2["key"].alias("df2_key"),
> ... df1["sales"],
> ... df2["units"],
> ... F.coalesce(df1["key"], df2["key"]).alias("key"))
> 20/10/05 15:46:14 WARN Column: Constructing trivially true equals predicate, 
> 'key#213 = key#213'. Perhaps you need to use aliases.
> >>> ret.show()
> +-------+-------+-----+-----+----+
> |df1_key|df2_key|sales|units| key|
> +-------+-------+-----+-----+----+
> |      d|      d|    3| null|   d|
> |   null|   null| null|    2|null|
> |      b|      b|    5|   10|   b|
> |      a|      a|    3|    6|   a|
> +-------+-------+-----+-----+----+
> >>> ret.explain()
> == Physical Plan ==
> *(5) Project [key#213 AS df1_key#258, key#213 AS df2_key#259, sales#223L, 
> units#230L, coalesce(key#213, key#213) AS key#260]
> +- SortMergeJoin [key#213], [key#237], FullOuter
>:- *(2) Sort [key#213 ASC NULLS FIRST], false, 0
>:  +- *(2) HashAggregate(keys=[key#213], functions=[sum(sales#214L)])
>: +- Exchange hashpartitioning(key#213, 200)
>:+- *(1) HashAggregate(keys=[key#213], 
> functions=[partial_sum(sales#214L)])
>:   +- *(1) Project [key#213, sales#214L]
>:  +- *(1) Filter (isnotnull(key#213) && NOT (key#213 = c))
>: +- Scan ExistingRDD[key#213,sales#214L,units#215L]
>+- *(4) Sort [key#237 ASC NULLS FIRST], false, 0
>   +- *(4) HashAggregate(keys=[key#237], functions=[sum(units#239L)])
>  +- Exchange hashpartitioning(key#237, 200)
> +- *(3) HashAggregate(keys=[key#237], 
> functions=[partial_sum(units#239L)])
>+- *(3) Project [key#237, units#239L]
>   +- *(3) Filter (isnotnull(key#237) && NOT (key#237 = d))
>  +- Scan ExistingRDD[key#237,sales#238L,units#239L]
> {code}
> output for 3.0.1:
> {code:java}
> // code placeholder
> >>> sc.version
> u'3.0.1'
> >>> import pyspark.sql.functions as F
> >>> d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, 
> >>> 'units' : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', 
> >>> 'sales': 1, 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}]
> >>> input_df = spark.createDataFrame(d)
> /usr/local/lib/python2.7/site-packages/pyspark/sql/session.py:381: 
> UserWarning: inferring schema from dict is deprecated,please use 
> pyspark.sql.Row instead
>   warnings.warn("inferring schema from dict is deprecated,"
> >>> df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales'))
> >>> df2 = input_df.groupBy("key").agg(F.sum('units').alias('units'))
> >>> df1 =
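A hedged workaround sketch in Scala (not part of this ticket or the upcoming PR): aliasing both aggregated DataFrames before the join gives the two `key` columns distinct qualifiers, so the select no longer collapses them onto the same attribute.
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, lit, sum}

object AmbiguousJoinWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("ambiguous-join").getOrCreate()
    import spark.implicits._

    val input = Seq(("a", 1, 2), ("a", 2, 4), ("b", 5, 10), ("c", 1, 2), ("d", 3, 6))
      .toDF("key", "sales", "units")

    val df1 = input.groupBy("key").agg(sum("sales").alias("sales")).filter(col("key") =!= lit("c"))
    val df2 = input.groupBy("key").agg(sum("units").alias("units")).filter(col("key") =!= lit("d"))

    // Explicit aliases make d1.key and d2.key unambiguous in both the join
    // condition and the projection.
    val ret = df1.alias("d1").join(df2.alias("d2"), col("d1.key") === col("d2.key"), "full")
      .select(
        col("d1.key").alias("df1_key"),
        col("d2.key").alias("df2_key"),
        col("d1.sales"),
        col("d2.units"),
        coalesce(col("d1.key"), col("d2.key")).alias("key"))

    ret.show()
    spark.stop()
  }
}
{code}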

[jira] [Updated] (SPARK-33121) Spark does not shutdown gracefully

2020-10-12 Thread Dmitry Tverdokhleb (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Tverdokhleb updated SPARK-33121:
---
Description: 
Hi. I am trying to migrate from Spark 2.4.5 to 3.0.1 and there is a problem with 
graceful shutdown.

Config parameter "spark.streaming.stopGracefullyOnShutdown" is set as "true".

Here is the code:
{code:java}
inputStream.foreachRDD {
  rdd =>
rdd
  .mapPartitions {
// Some operations mapPartitions
  }
  .filter {
// Some operations filter
  }
  .groupBy {
// Some operations groupBy
  }
}
{code}
I send a SIGTERM signal to stop the Spark streaming job, but an exception arises:
{noformat}
2020-10-12 14:12:29 ERROR Inbox - Ignoring error
java.util.concurrent.RejectedExecutionException: Task 
org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from 
java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 6]
at 
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
at 
java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:230)
at 
org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at 
org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at 
org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at 
org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748){noformat}
An excerpt from the logs:
{noformat}
...
Calling rdd.mapPartitions
...
Sending SIGTERM signal
...
2020-10-12 14:12:22 INFO  MyProject - Shutdown hook called
2020-10-12 14:12:22 DEBUG JobScheduler - Stopping JobScheduler
2020-10-12 14:12:22 INFO  ReceiverTracker - ReceiverTracker stopped
2020-10-12 14:12:22 INFO  JobGenerator - Stopping JobGenerator gracefully
2020-10-12 14:12:22 INFO  JobGenerator - Waiting for all received blocks to be 
consumed for job generation
2020-10-12 14:12:22 INFO  JobGenerator - Waited for all received blocks to be 
consumed for job generation
...
Calling rdd.filter
2020-10-12 14:12:29 ERROR Inbox - Ignoring error
java.util.concurrent.RejectedExecutionException: Task 
org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from 
java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 6]{noformat}
After this exception, the streaming job freezes and is eventually halted by the 
timeout (config parameter "hadoop.service.shutdown.timeout").

Note that this exception arises only for RDD operations (like map, filter, 
etc.); the business logic itself runs without any errors.

Besides, there is no problem with graceful shutdown in Spark 2.4.5.
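For reference, a minimal self-contained sketch of the setup above (the socket source and the placeholder lambdas inside mapPartitions/filter/groupBy are assumptions, not code from this ticket):
{code:java}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GracefulShutdownSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("graceful-shutdown-sketch")
      .setMaster("local[2]")
      // The setting from this report: let the shutdown hook stop streaming gracefully.
      .set("spark.streaming.stopGracefullyOnShutdown", "true")

    val ssc = new StreamingContext(conf, Seconds(5))
    val inputStream = ssc.socketTextStream("localhost", 9999)

    inputStream.foreachRDD { rdd =>
      rdd
        .mapPartitions(iter => iter.map(_.trim)) // placeholder for the real mapPartitions logic
        .filter(_.nonEmpty)                      // placeholder for the real filter logic
        .groupBy(identity)                       // placeholder for the real groupBy logic
        .count()                                 // an action to force evaluation
    }

    ssc.start()
    // With stopGracefullyOnShutdown=true, a SIGTERM should trigger a graceful
    // StreamingContext.stop() from the shutdown hook; the log above shows the
    // executor thread pool being terminated before outstanding batches finish.
    ssc.awaitTermination()
  }
}
{code}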
  
  
  
  

  was:
Hi. I am trying to migrate from Spark 2.4.5 to 3.0.1 and there is a problem with 
graceful shutdown.

Config parameter "spark.streaming.stopGracefullyOnShutdown" is set as "true".

Here is the code:
{code:java}
inputStream.foreachRDD {
  rdd =>
rdd
  .mapPartitions {
// Some operations mapPartitions
  }
  .filter {
// Some operations filter
  }
  .groupBy {
// Some operations groupBy
  }
}
{code}
I send a SIGTERM signal to stop the Spark streaming job, but an exception arises:
{noformat}
2020-10-12 14:12:29 ERROR Inbox - Ignoring error
java.util.concurrent.RejectedExecutionException: Task 
org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from 
java.util.concurrent.Thre

[jira] [Updated] (SPARK-33121) Spark does not shutdown gracefully

2020-10-12 Thread Dmitry Tverdokhleb (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Tverdokhleb updated SPARK-33121:
---
Description: 
Hi. I am trying to migrate from Spark 2.4.5 to 3.0.1 and there is a problem with 
graceful shutdown.

Config parameter "spark.streaming.stopGracefullyOnShutdown" is set as "true".

Here is the code:
{code:java}
inputStream.foreachRDD {
  rdd =>
rdd
  .mapPartitions {
// Some operations mapPartitions
  }
  .filter {
// Some operations filter
  }
  .groupBy {
// Some operations groupBy
  }
}
{code}
I send a SIGTERM signal to stop the Spark streaming job, but an exception arises:
{noformat}
2020-10-12 14:12:29 ERROR Inbox - Ignoring error
java.util.concurrent.RejectedExecutionException: Task 
org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from 
java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 6]
at 
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
at 
java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:230)
at 
org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at 
org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at 
org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at 
org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748){noformat}
An excerpt from the logs:
{noformat}
...
Calling rdd.mapPartitions
...
Sending SIGTERM signal
...
2020-10-12 14:12:22 INFO  MyProject - Shutdown hook called
2020-10-12 14:12:22 DEBUG JobScheduler - Stopping JobScheduler
2020-10-12 14:12:22 INFO  ReceiverTracker - ReceiverTracker stopped
2020-10-12 14:12:22 INFO  JobGenerator - Stopping JobGenerator gracefully
2020-10-12 14:12:22 INFO  JobGenerator - Waiting for all received blocks to be 
consumed for job generation
2020-10-12 14:12:22 INFO  JobGenerator - Waited for all received blocks to be 
consumed for job generation
...
Calling rdd.filter
2020-10-12 14:12:29 ERROR Inbox - Ignoring error
java.util.concurrent.RejectedExecutionException: Task 
org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from 
java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 6]{noformat}
This exception arises only for RDD operations (like map, filter, etc.), not for 
the business logic.

Besides, there is no problem with graceful shutdown in Spark 2.4.5.
  
  
  
  

  was:
Hi. I am trying to migrate from Spark 2.4.5 to 3.0.1 and there is a problem with 
graceful shutdown.

Config parameter "spark.streaming.stopGracefullyOnShutdown" is set as "true".

Here is the code:
{code:java}
inputStream.foreachRDD {
  rdd =>
rdd
  .mapPartitions {
// Some operations mapPartitions
  }
  .filter {
// Some operations filter
  }
  .groupBy {
// Some operations groupBy
  }
}
{code}
I send a SIGTERM signal to stop the Spark streaming job, but an exception arises:
{noformat}
2020-10-12 14:12:29 ERROR Inbox - Ignoring error
java.util.concurrent.RejectedExecutionException: Task 
org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from 
java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 6]
at 
java.util.concurrent.ThreadPoolExecutor$AbortPo

[jira] [Updated] (SPARK-33121) Spark does not shutdown gracefully

2020-10-12 Thread Dmitry Tverdokhleb (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Tverdokhleb updated SPARK-33121:
---
Description: 
Hi. I am trying to migrate from Spark 2.4.5 to 3.0.1 and there is a problem with 
graceful shutdown.

Config parameter "spark.streaming.stopGracefullyOnShutdown" is set as "true".

Here is the code:
{code:java}
inputStream.foreachRDD {
  rdd =>
rdd
  .mapPartitions {
// Some operations mapPartitions
  }
  .filter {
// Some operations filter
  }
  .groupBy {
// Some operations groupBy
  }
}
{code}
I send a SIGTERM signal to stop the Spark streaming job, but an exception arises:
{noformat}
2020-10-12 14:12:29 ERROR Inbox - Ignoring error
java.util.concurrent.RejectedExecutionException: Task 
org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from 
java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 6]
at 
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
at 
java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:230)
at 
org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at 
org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at 
org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at 
org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748){noformat}
An excerpt from the logs:
{noformat}
...
Calling rdd.mapPartitions
...
Sending SIGTERM signal
...
2020-10-12 14:12:22 INFO  MyProject - Shutdown hook called
2020-10-12 14:12:22 DEBUG JobScheduler - Stopping JobScheduler
2020-10-12 14:12:22 INFO  ReceiverTracker - ReceiverTracker stopped
2020-10-12 14:12:22 INFO  JobGenerator - Stopping JobGenerator gracefully
2020-10-12 14:12:22 INFO  JobGenerator - Waiting for all received blocks to be consumed for job generation
2020-10-12 14:12:22 INFO  JobGenerator - Waited for all received blocks to be consumed for job generation
...
Calling rdd.filter
2020-10-12 14:12:29 ERROR Inbox - Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 6]{noformat}
This exception arises only for RDD operations (like map, filter, etc.), not for 
the business logic.

Besides, there is no problem with graceful shutdown in Spark 2.4.5.
  
  
  
  

  was:
Hi. I am trying to migrate from Spark 2.4.5 to 3.0.1 and there is a problem with 
graceful shutdown.

Config parameter "spark.streaming.stopGracefullyOnShutdown" is set as "true".

Here is the code:
{code:java}
inputStream.foreachRDD {
  rdd =>
rdd
  .mapPartitions {
// Some operations mapPartitions
  }
  .filter {
// Some operations filter
  }
  .groupBy {
// Some operations groupBy
  }
}
{code}
I send a SIGTERM signal to stop the Spark streaming job, but an exception arises:
{noformat}
2020-10-12 14:12:29 ERROR Inbox - Ignoring error
java.util.concurrent.RejectedExecutionException: Task 
org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from 
java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 6]
at 
java.util.concurrent.ThreadPoolExecutor$AbortP

[jira] [Updated] (SPARK-33121) Spark does not shutdown gracefully

2020-10-12 Thread Dmitry Tverdokhleb (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Tverdokhleb updated SPARK-33121:
---
Description: 
Hi. I am trying to migrate from Spark 2.4.5 to 3.0.1 and there is a problem with 
graceful shutdown.

Config parameter "spark.streaming.stopGracefullyOnShutdown" is set as "true".

Here is the code:
{code:java}
inputStream.foreachRDD {
  rdd =>
rdd
  .mapPartitions {
// Some operations mapPartitions
  }
  .filter {
// Some operations filter
  }
  .groupBy {
// Some operations groupBy
  }
}
{code}
I send a SIGTERM signal to stop the Spark streaming job, but an exception arises:
{noformat}
2020-10-12 14:12:29 ERROR Inbox - Ignoring error
java.util.concurrent.RejectedExecutionException: Task 
org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from 
java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 6]
at 
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
at 
java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:230)
at 
org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at 
org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at 
org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at 
org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748){noformat}
An excerpt from the logs:
{noformat}
...
Calling rdd.mapPartitions
...
2020-10-12 14:12:22 INFO  MyProject - Shutdown hook called
2020-10-12 14:12:22 DEBUG JobScheduler - Stopping JobScheduler
2020-10-12 14:12:22 INFO  ReceiverTracker - ReceiverTracker stopped
2020-10-12 14:12:22 INFO  JobGenerator - Stopping JobGenerator gracefully
2020-10-12 14:12:22 INFO  JobGenerator - Waiting for all received blocks to be 
consumed for job generation
2020-10-12 14:12:22 INFO  JobGenerator - Waited for all received blocks to be 
consumed for job generation
...
Calling rdd.filter
2020-10-12 14:12:29 ERROR Inbox - Ignoring error
java.util.concurrent.RejectedExecutionException: Task 
org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from 
java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 6]{noformat}
This exception arises only for RDD operations (like map, filter, etc.), not for 
the business logic.

Besides, there is no problem with graceful shutdown in Spark 2.4.5.
  
  
  
  

  was:
Hi. I have Spark streaming code like this:
{code:java}
inputStream.foreachRDD {
  rdd =>
rdd
  .mapPartitions {
// Some operations mapPartitions
  }
  .filter {
// Some operations filter
  }
  .groupBy {
// Some operations groupBy
  }
}
{code}
I set the config parameter "spark.streaming.stopGracefullyOnShutdown" to 
"*true*" and send a SIGTERM signal to stop the Spark streaming job, but an 
exception arises:
{noformat}
2020-10-12 14:12:29 ERROR Inbox - Ignoring error
java.util.concurrent.RejectedExecutionException: Task 
org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from 
java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 6]
at 
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
at 
java.util.concurrent.ThreadPo

[jira] [Updated] (SPARK-33121) Spark does not shutdown gracefully

2020-10-12 Thread Dmitry Tverdokhleb (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Tverdokhleb updated SPARK-33121:
---
Summary: Spark does not shutdown gracefully  (was: Spark does not stop 
gracefully)

> Spark does not shutdown gracefully
> --
>
> Key: SPARK-33121
> URL: https://issues.apache.org/jira/browse/SPARK-33121
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 3.0.1
>Reporter: Dmitry Tverdokhleb
>Priority: Major
>
> Hi. I have Spark streaming code like this:
> {code:java}
> inputStream.foreachRDD {
>   rdd =>
> rdd
>   .mapPartitions {
> // Some operations mapPartitions
>   }
>   .filter {
> // Some operations filter
>   }
>   .groupBy {
> // Some operations groupBy
>   }
> }
> {code}
> I set the config parameter "spark.streaming.stopGracefullyOnShutdown" to 
> "*true*" and send a SIGTERM signal to stop the Spark streaming job, but an 
> exception arises:
> {noformat}
> 2020-10-12 14:12:29 ERROR Inbox - Ignoring error
> java.util.concurrent.RejectedExecutionException: Task 
> org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from 
> java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, 
> active threads = 0, queued tasks = 0, completed tasks = 6]
> at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
> at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
> at org.apache.spark.executor.Executor.launchTask(Executor.scala:230)
> at 
> org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
> at 
> org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
> at scala.collection.Iterator.foreach(Iterator.scala:941)
> at scala.collection.Iterator.foreach$(Iterator.scala:941)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
> at scala.collection.IterableLike.foreach(IterableLike.scala:74)
> at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> at 
> org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
> at 
> org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68)
> at 
> org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
> at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
> at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){noformat}
> Logs:
> {noformat}
> ...
> Calling rdd.mapPartitions
> ...
> 2020-10-12 14:12:22 INFO  MyProject - Shutdown hook called
> 2020-10-12 14:12:22 DEBUG JobScheduler - Stopping JobScheduler
> 2020-10-12 14:12:22 INFO  ReceiverTracker - ReceiverTracker stopped
> 2020-10-12 14:12:22 INFO  JobGenerator - Stopping JobGenerator gracefully
> 2020-10-12 14:12:22 INFO  JobGenerator - Waiting for all received blocks to 
> be consumed for job generation
> 2020-10-12 14:12:22 INFO  JobGenerator - Waited for all received blocks to be 
> consumed for job generation
> ...
> Calling rdd.filter
> 2020-10-12 14:12:29 ERROR Inbox - Ignoring error
> java.util.concurrent.RejectedExecutionException: Task 
> org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from 
> java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, 
> active threads = 0, queued tasks = 0, completed tasks = 6]{noformat}
> This exception arises only for RDD operations (like map, filter, etc.), not 
> for the business logic.
> Besides, there is no problem with graceful shutdown in Spark 2.4.5.
>   
>   
>   
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33090) Upgrade Google Guava

2020-10-12 Thread Stephen Coy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212826#comment-17212826
 ] 

Stephen Coy commented on SPARK-33090:
-

Created PR

> Upgrade Google Guava
> 
>
> Key: SPARK-33090
> URL: https://issues.apache.org/jira/browse/SPARK-33090
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.1
>Reporter: Stephen Coy
>Priority: Major
>
> Hadoop versions newer than 3.2.0 (such as 3.2.1 and 3.3.0) have started using 
> features from newer versions of Google Guava.
> This leads to MethodNotFound exceptions, etc., in Spark builds that specify 
> newer versions of Hadoop. I believe this is due to the use of new methods in 
> com.google.common.base.Preconditions.
> The above versions of Hadoop use guava-27.0-jre, whereas Spark is currently 
> pinned to guava-14.0.1.
> I have been running a Spark cluster with the version bumped to guava-29.0-jre 
> without issue.
> Partly due to the way Spark is built, this change is a little more 
> complicated than just changing the version, because newer versions of Guava 
> have a new dependency on com.google.guava:failureaccess:1.0.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33090) Upgrade Google Guava

2020-10-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212825#comment-17212825
 ] 

Apache Spark commented on SPARK-33090:
--

User 'sfcoy' has created a pull request for this issue:
https://github.com/apache/spark/pull/30022

> Upgrade Google Guava
> 
>
> Key: SPARK-33090
> URL: https://issues.apache.org/jira/browse/SPARK-33090
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.1
>Reporter: Stephen Coy
>Priority: Major
>
> Hadoop versions newer than 3.2.0 (such as 3.2.1 and 3.3.0) have started using 
> features from newer versions of Google Guava.
> This leads to MethodNotFound exceptions, etc., in Spark builds that specify 
> newer versions of Hadoop. I believe this is due to the use of new methods in 
> com.google.common.base.Preconditions.
> The above versions of Hadoop use guava-27.0-jre, whereas Spark is currently 
> pinned to guava-14.0.1.
> I have been running a Spark cluster with the version bumped to guava-29.0-jre 
> without issue.
> Partly due to the way Spark is built, this change is a little more 
> complicated than just changing the version, because newer versions of Guava 
> have a new dependency on com.google.guava:failureaccess:1.0.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33090) Upgrade Google Guava

2020-10-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33090:


Assignee: (was: Apache Spark)

> Upgrade Google Guava
> 
>
> Key: SPARK-33090
> URL: https://issues.apache.org/jira/browse/SPARK-33090
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.1
>Reporter: Stephen Coy
>Priority: Major
>
> Hadoop versions newer than 3.2.0 (such as 3.2.1 and 3.3.0) have started using 
> features from newer versions of Google Guava.
> This leads to MethodNotFound exceptions, etc., in Spark builds that specify 
> newer versions of Hadoop. I believe this is due to the use of new methods in 
> com.google.common.base.Preconditions.
> The above versions of Hadoop use guava-27.0-jre, whereas Spark is currently 
> pinned to guava-14.0.1.
> I have been running a Spark cluster with the version bumped to guava-29.0-jre 
> without issue.
> Partly due to the way Spark is built, this change is a little more 
> complicated than just changing the version, because newer versions of Guava 
> have a new dependency on com.google.guava:failureaccess:1.0.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33090) Upgrade Google Guava

2020-10-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33090:


Assignee: Apache Spark

> Upgrade Google Guava
> 
>
> Key: SPARK-33090
> URL: https://issues.apache.org/jira/browse/SPARK-33090
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.1
>Reporter: Stephen Coy
>Assignee: Apache Spark
>Priority: Major
>
> Hadoop versions newer than 3.2.0 (such as 3.2.1 and 3.3.0) have started using 
> features from newer versions of Google Guava.
> This leads to MethodNotFound exceptions, etc., in Spark builds that specify 
> newer versions of Hadoop. I believe this is due to the use of new methods in 
> com.google.common.base.Preconditions.
> The above versions of Hadoop use guava-27.0-jre, whereas Spark is currently 
> pinned to guava-14.0.1.
> I have been running a Spark cluster with the version bumped to guava-29.0-jre 
> without issue.
> Partly due to the way Spark is built, this change is a little more 
> complicated than just changing the version, because newer versions of Guava 
> have a new dependency on com.google.guava:failureaccess:1.0.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33125) Improve the error when Lead and Lag are not allowed to specify window frame

2020-10-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212805#comment-17212805
 ] 

Apache Spark commented on SPARK-33125:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/30021

> Improve the error when Lead and Lag are not allowed to specify window frame
> ---
>
> Key: SPARK-33125
> URL: https://issues.apache.org/jira/browse/SPARK-33125
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> Except for PostgreSQL, other data sources (for example: Vertica, Oracle, 
> Redshift, MySQL, Presto) do not allow specifying a window frame for the Lead 
> and Lag functions.
> But the current error message is not clear enough.
> {code:java}
> Window Frame $f must match the required frame
> {code}
> The following error message is better.
> {code:java}
> Cannot specify window frame for lead function
> {code}
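For illustration, a hedged Scala sketch of the kind of query that currently hits this message (the local-mode setup and toy data are assumptions, not from the ticket):
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lead}

object LeadWithExplicitFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("lead-frame").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("grp", "v")

    // lead/lag carry their own fixed offset frame; specifying an explicit frame
    // conflicts with it and fails analysis with the opaque message quoted above.
    val w = Window.partitionBy("grp").orderBy("v")
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)

    df.select(lead(col("v"), 1).over(w)).show() // throws AnalysisException
  }
}
{code}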



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33125) Improve the error when Lead and Lag are not allowed to specify window frame

2020-10-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33125:


Assignee: Apache Spark

> Improve the error when Lead and Lag are not allowed to specify window frame
> ---
>
> Key: SPARK-33125
> URL: https://issues.apache.org/jira/browse/SPARK-33125
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> Except for PostgreSQL, other data sources (for example: Vertica, Oracle, 
> Redshift, MySQL, Presto) do not allow specifying a window frame for the Lead 
> and Lag functions.
> But the current error message is not clear enough.
> {code:java}
> Window Frame $f must match the required frame
> {code}
> The following error message is better.
> {code:java}
> Cannot specify window frame for lead function
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27733) Upgrade to Avro 1.10.0

2020-10-12 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212804#comment-17212804
 ] 

Chao Sun commented on SPARK-27733:
--

[~iemejia] I can help with the Hive releases if you can come up with fixes that 
prove workable on the Spark side. I just got the CI working for branch-2.3, so 
merging PRs should be relatively easy now.

> Upgrade to Avro 1.10.0
> --
>
> Key: SPARK-27733
> URL: https://issues.apache.org/jira/browse/SPARK-27733
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.1.0
>Reporter: Ismaël Mejía
>Priority: Minor
>
> Avro 1.9.2 was released with many nice improvements, including a reduced size 
> (1 MB less), removed dependencies (no paranamer, no shaded Guava), and 
> security updates, so it is probably a worthwhile upgrade.
> Avro 1.10.0 has since been released and this upgrade is still not done.
> At the moment (2020/08) there is still a blocker: Hive-related transitive 
> dependencies bring in older versions of Avro, so this remains blocked until 
> HIVE-21737 is resolved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33125) Improve the error when Lead and Lag are not allowed to specify window frame

2020-10-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212803#comment-17212803
 ] 

Apache Spark commented on SPARK-33125:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/30021

> Improve the error when Lead and Lag are not allowed to specify window frame
> ---
>
> Key: SPARK-33125
> URL: https://issues.apache.org/jira/browse/SPARK-33125
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> Except for PostgreSQL, other data sources (for example: Vertica, Oracle, 
> Redshift, MySQL, Presto) do not allow specifying a window frame for the Lead 
> and Lag functions.
> But the current error message is not clear enough.
> {code:java}
> Window Frame $f must match the required frame
> {code}
> The following error message is better.
> {code:java}
> Cannot specify window frame for lead function
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33125) Improve the error when Lead and Lag are not allowed to specify window frame

2020-10-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33125:


Assignee: (was: Apache Spark)

> Improve the error when Lead and Lag are not allowed to specify window frame
> ---
>
> Key: SPARK-33125
> URL: https://issues.apache.org/jira/browse/SPARK-33125
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> Except for PostgreSQL, other data sources (for example: Vertica, Oracle, 
> Redshift, MySQL, Presto) do not allow specifying a window frame for the Lead 
> and Lag functions.
> But the current error message is not clear enough.
> {code:java}
> Window Frame $f must match the required frame
> {code}
> The following error message is better.
> {code:java}
> Cannot specify window frame for lead function
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33125) Improve the error when Lead and Lag are not allowed to specify window frame

2020-10-12 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-33125:
--

 Summary: Improve the error when Lead and Lag are not allowed to 
specify window frame
 Key: SPARK-33125
 URL: https://issues.apache.org/jira/browse/SPARK-33125
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: jiaan.geng


Except for PostgreSQL, other data sources (for example: Vertica, Oracle, 
Redshift, MySQL, Presto) do not allow specifying a window frame for the Lead 
and Lag functions.
But the current error message is not clear enough.

{code:java}
Window Frame $f must match the required frame
{code}

The following error message is better.
{code:java}
Cannot specify window frame for lead function
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30893) Expressions should not change its data type/nullability after it's created

2020-10-12 Thread Anurag Mantripragada (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212749#comment-17212749
 ] 

Anurag Mantripragada commented on SPARK-30893:
--

[~maropu], [~dongjoon] - I went through the PRs for individual issues in this 
Umbrella and looked at the code changes. Only 
[SPARK-30894|https://issues.apache.org/jira/browse/SPARK-30894] seems to affect 
branch-2.4. I've commented on that Jira separately asking the original author 
if we can backport this to branch-2.4. 

> Expressions should not change its data type/nullability after it's created
> --
>
> Key: SPARK-30893
> URL: https://issues.apache.org/jira/browse/SPARK-30893
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Critical
> Fix For: 3.0.0
>
>
> This is a problem because the configuration can change between different 
> phases of planning, and this can silently break a query plan, leading to 
> crashes or data corruption if the data type/nullability changes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30894) The nullability of Size function should not depend on SQLConf.get

2020-10-12 Thread Anurag Mantripragada (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212748#comment-17212748
 ] 

Anurag Mantripragada commented on SPARK-30894:
--

[~maxgekk] - Looking at the code in branch-2.4, it looks like this could be an 
issue there as well - 
[https://github.com/apache/spark/blob/652e5746019b95b78af4d36c23ec5155bb22325b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L94]

Should we backport this to branch-2.4 since it is LTS?
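For context, a hedged behavioral sketch of what the Size nullability tracks (the config key spark.sql.legacy.sizeOfNull is the real one; this only illustrates the two behaviors, not the internal fix):
{code:java}
import org.apache.spark.sql.SparkSession

object SizeOfNullIllustration {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("size-of-null").getOrCreate()

    // Legacy behavior: size(NULL) returns -1, so the expression can be treated as non-nullable.
    spark.conf.set("spark.sql.legacy.sizeOfNull", "true")
    spark.sql("SELECT size(CAST(NULL AS ARRAY<INT>)) AS s").show()

    // Current behavior: size(NULL) returns NULL, so the expression must be nullable.
    spark.conf.set("spark.sql.legacy.sizeOfNull", "false")
    spark.sql("SELECT size(CAST(NULL AS ARRAY<INT>)) AS s").show()

    spark.stop()
  }
}
{code}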

> The nullability of Size function should not depend on SQLConf.get
> -
>
> Key: SPARK-30894
> URL: https://issues.apache.org/jira/browse/SPARK-30894
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Maxim Gekk
>Priority: Blocker
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27292) Spark Job Fails with Unknown Error writing to S3 from AWS EMR

2020-10-12 Thread Venkata (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212746#comment-17212746
 ] 

Venkata commented on SPARK-27292:
-

[~hyukjin.kwon] Could you let me know how you resolved this issue?

> Spark Job Fails with Unknown Error writing to S3 from AWS EMR
> -
>
> Key: SPARK-27292
> URL: https://issues.apache.org/jira/browse/SPARK-27292
> Project: Spark
>  Issue Type: Question
>  Components: Input/Output
>Affects Versions: 2.3.2
>Reporter: Olalekan Elesin
>Priority: Major
>
> I am currently experiencing issues writing data to S3 from my Spark job 
> running on AWS EMR.
> The job writes to a staging path in S3, e.g. 
> {{.spark-random-alphanumeric}}, after which it fails with this error:
> {code:java}
> 9/03/26 10:54:07 WARN AsyncEventQueue: Dropped 196300 events from appStatus 
> since Tue Mar 26 10:52:05 UTC 2019.
> 19/03/26 10:55:07 WARN AsyncEventQueue: Dropped 211186 events from appStatus 
> since Tue Mar 26 10:54:07 UTC 2019.
> 19/03/26 11:37:09 WARN DataStreamer: Exception for 
> BP-312054361-10.41.97.71-1553586781241:blk_1073742995_2172
> java.io.EOFException: Unexpected EOF while trying to read response from server
>   at 
> org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:402)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213)
>   at 
> org.apache.hadoop.hdfs.DataStreamer$ResponseProcessor.run(DataStreamer.java:1073)
> 19/03/26 11:37:09 WARN DataStreamer: Error Recovery for 
> BP-312054361-10.41.97.71-1553586781241:blk_1073742995_2172 in pipeline 
> [DatanodeInfoWithStorage[10.41.121.135:50010,DS-cba2a850-fa30-4933-af2a-05b40b58fdb5,DISK],
>  
> DatanodeInfoWithStorage[10.41.71.181:50010,DS-c90a1d87-b40a-4928-a709-1aef027db65a,DISK]]:
>  datanode 
> 0(DatanodeInfoWithStorage[10.41.121.135:50010,DS-cba2a850-fa30-4933-af2a-05b40b58fdb5,DISK])
>  is bad.
> 19/03/26 11:50:34 WARN AsyncEventQueue: Dropped 157572 events from appStatus 
> since Tue Mar 26 10:55:07 UTC 2019.
> 19/03/26 11:51:34 WARN AsyncEventQueue: Dropped 785 events from appStatus 
> since Tue Mar 26 11:50:34 UTC 2019.
> 19/03/26 11:52:34 WARN AsyncEventQueue: Dropped 656 events from appStatus 
> since Tue Mar 26 11:51:34 UTC 2019.
> 19/03/26 11:53:35 WARN AsyncEventQueue: Dropped 1335 events from appStatus 
> since Tue Mar 26 11:52:34 UTC 2019.
> 19/03/26 11:54:35 WARN AsyncEventQueue: Dropped 1087 events from appStatus 
> since Tue Mar 26 11:53:35 UTC 2019.
> ...
> 19/03/26 13:39:39 WARN TaskSetManager: Lost task 33302.0 in stage 1444.0 (TID 
> 1324427, ip-10-41-122-224.eu-west-1.compute.internal, executor 18): 
> org.apache.spark.SparkException: Task failed while writing rows.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:254)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:168)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:121)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>  Your socket connection to the server was not read from or written to within 
> the timeout period. Idle connections will be closed. (Service: Amazon S3; 
> Status Code: 400; Error Code: RequestTimeout; Request ID: 4E2E351899CDFB89; 
> S3 Extended Request ID: 
> iQhU4xTloYk9aTvO2FmDXk03M1pYCRQl539bG6PqEOeZrtw4KeAGRZDek9RugJywREfPmAC99FE=),
>  S3 Extended Request ID: 
> iQhU4xTloYk9aTvO2FmDXk03M1pYCRQl539bG6PqEOeZrtw4KeAGRZDek9RugJywREfPmAC99FE=
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1658)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1322)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpCl

[jira] [Commented] (SPARK-30893) Expressions should not change its data type/nullability after it's created

2020-10-12 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212731#comment-17212731
 ] 

Takeshi Yamamuro commented on SPARK-30893:
--

Did you find any data corruption in branch-2.4? If so, I think we should 
backport the PRs necessary to fix it.

> Expressions should not change its data type/nullability after it's created
> --
>
> Key: SPARK-30893
> URL: https://issues.apache.org/jira/browse/SPARK-30893
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Critical
> Fix For: 3.0.0
>
>
> This is a problem because the configuration can change between different 
> phases of planning, and this can silently break a query plan, leading to 
> crashes or data corruption if the data type/nullability changes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33124) Adds a group tag in all the expressions for built-in functions

2020-10-12 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-33124:


 Summary: Adds a group tag in all the expressions for built-in 
functions
 Key: SPARK-33124
 URL: https://issues.apache.org/jira/browse/SPARK-33124
 Project: Spark
  Issue Type: Documentation
  Components: SQL
Affects Versions: 3.1.0
Reporter: Takeshi Yamamuro


In SPARK-31429 we added a script to automatically generate documentation for 
built-in functions. To generate the docs, we need to add a `group` tag in 
`ExpressionDescription`. Currently only some of the built-in functions have the 
tag, so we need to finish adding it to all built-in functions for better 
documentation.

This ticket comes from the discussion in 
https://github.com/apache/spark/pull/28224#issuecomment-707195753



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30893) Expressions should not change its data type/nullability after it's created

2020-10-12 Thread Anurag Mantripragada (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212717#comment-17212717
 ] 

Anurag Mantripragada commented on SPARK-30893:
--

Sorry for not being clear. I was referring to this comment.

 

??For data type and nullability, I think we should fix before 3.0, as they can 
lead to data corruption.??

??For other behaviors, we can have more discussion and wait for 3.1??

> Expressions should not change its data type/nullability after it's created
> --
>
> Key: SPARK-30893
> URL: https://issues.apache.org/jira/browse/SPARK-30893
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Critical
> Fix For: 3.0.0
>
>
> This is a problem because the configuration can change between different 
> phases of planning, and this can silently break a query plan, leading to 
> crashes or data corruption if the data type/nullability changes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30893) Expressions should not change its data type/nullability after it's created

2020-10-12 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212715#comment-17212715
 ] 

Dongjoon Hyun commented on SPARK-30893:
---

[~anuragmantri], which comment are you referring to, specifically? (cc [~dbtsai])

> Expressions should not change its data type/nullability after it's created
> --
>
> Key: SPARK-30893
> URL: https://issues.apache.org/jira/browse/SPARK-30893
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Critical
> Fix For: 3.0.0
>
>
> This is a problem because the configuration can change between different 
> phases of planning, and this can silently break a query plan, leading to 
> crashes or data corruption if the data type/nullability changes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2020-10-12 Thread Prakash Rajendran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212711#comment-17212711
 ] 

Prakash Rajendran commented on SPARK-24434:
---

 

Hello experts, I have a requirement to spin up a sidecar container through 
spark-submit in Kubernetes. Is there a way to create a sidecar using 
*spark.kubernetes.driver.podTemplateFile*?

Or is there any other way to spin up a sidecar container through spark-submit? 
I would appreciate your input on this.

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Assignee: Onur Satici
>Priority: Major
> Fix For: 3.0.0
>
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes-specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33123) Ignore `GitHub Action file` change in Amplab Jenkins

2020-10-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212708#comment-17212708
 ] 

Apache Spark commented on SPARK-33123:
--

User 'williamhyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/30020

> Ignore `GitHub Action file` change in Amplab Jenkins
> 
>
> Key: SPARK-33123
> URL: https://issues.apache.org/jira/browse/SPARK-33123
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: William Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33123) Ignore `GitHub Action file` change in Amplab Jenkins

2020-10-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33123:


Assignee: Apache Spark

> Ignore `GitHub Action file` change in Amplab Jenkins
> 
>
> Key: SPARK-33123
> URL: https://issues.apache.org/jira/browse/SPARK-33123
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: William Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33123) Ignore `GitHub Action file` change in Amplab Jenkins

2020-10-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33123:


Assignee: (was: Apache Spark)

> Ignore `GitHub Action file` change in Amplab Jenkins
> 
>
> Key: SPARK-33123
> URL: https://issues.apache.org/jira/browse/SPARK-33123
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: William Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33123) Ignore `GitHub Action file` change in Amplab Jenkins

2020-10-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212707#comment-17212707
 ] 

Apache Spark commented on SPARK-33123:
--

User 'williamhyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/30020

> Ignore `GitHub Action file` change in Amplab Jenkins
> 
>
> Key: SPARK-33123
> URL: https://issues.apache.org/jira/browse/SPARK-33123
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: William Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33123) Ignore `GitHub Action file` change in Amplab Jenkins

2020-10-12 Thread William Hyun (Jira)
William Hyun created SPARK-33123:


 Summary: Ignore `GitHub Action file` change in Amplab Jenkins
 Key: SPARK-33123
 URL: https://issues.apache.org/jira/browse/SPARK-33123
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 3.1.0
Reporter: William Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32381) Expose the ability for users to use parallel file & avoid location information discovery in RDDs

2020-10-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212684#comment-17212684
 ] 

Apache Spark commented on SPARK-32381:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/30019

> Expose the ability for users to use parallel file & avoid location 
> information discovery in RDDs
> 
>
> Key: SPARK-32381
> URL: https://issues.apache.org/jira/browse/SPARK-32381
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.1.0
>
>
> We already have this in SQL so it's mostly a matter of re-organizing the code 
> a bit and agreeing on how to best expose this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33120) Lazy Load of SparkContext.addFiles

2020-10-12 Thread Taylor Smock (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212679#comment-17212679
 ] 

Taylor Smock commented on SPARK-33120:
--

I'd like to avoid using excess network resources and disk resources. For 
example, if I only have 5 GiB of space left on a node, and I've got 10 GiB of 
data, I don't want to send anything that that node doesn't need.

 

For example, I'm working with geographic data, and I've got a set of binary data 
files for the whole world (from the NASA SRTM elevation data, if you are 
interested). The (current) binary files have a naming scheme like 
`(N/S)(E/W).ext`. I can work around that, but I've been trying to keep the 
methodology generic enough for future binary data files. I think the best solution 
would be a lazy load for the `addFiles` function (each file is used by relatively 
few jobs). I could be approaching the problem in a sub-optimal fashion, though.

 

This isn't (currently) high priority for me (hence `minor`), since the total 
size of the data files is currently < 10 GiB (there are 9576 elevation files).

Hopefully this answered your question.
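
In case it is useful while no such API exists, a minimal workaround sketch, assuming the files stay on shared storage (HDFS/S3) reachable from the executors; `baseUri`, the file-name function and the per-record processing are placeholders I made up:

{code:java}
import java.io.File
import java.net.URI
import java.nio.file.{Files, StandardCopyOption}
import scala.reflect.ClassTag
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.rdd.RDD

// Copy a file to the executor's local disk only when a partition first asks for it,
// instead of shipping every file to every executor up front.
def withLazyFiles[T, R: ClassTag](rdd: RDD[T], baseUri: String)
                                 (fileName: T => String)(process: (T, File) => R): RDD[R] =
  rdd.mapPartitions { rows =>
    val fs = FileSystem.get(new URI(baseUri), new Configuration())
    val cache = scala.collection.mutable.Map.empty[String, File]  // per-partition local cache
    def fetch(name: String): File = cache.getOrElseUpdate(name, {
      val local = Files.createTempFile(name, ".bin")
      val in = fs.open(new Path(s"$baseUri/$name"))
      try Files.copy(in, local, StandardCopyOption.REPLACE_EXISTING) finally in.close()
      local.toFile
    })
    rows.map(row => process(row, fetch(fileName(row))))
  }
{code}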

> Lazy Load of SparkContext.addFiles
> --
>
> Key: SPARK-33120
> URL: https://issues.apache.org/jira/browse/SPARK-33120
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
> Environment: Mac OS X (2 systems), workload to eventually be run on 
> Amazon EMR.
> Java 11 application.
>Reporter: Taylor Smock
>Priority: Minor
>
> In my spark job, I may have various random files that may or may not be used 
> by each task.
> I would like to avoid copying all of the files to every executor until it is 
> actually needed.
>  
> What I've tried:
>  * SparkContext.addFiles w/ SparkFiles.get . In testing, all files were 
> distributed to all clients.
>  * Broadcast variables. Since I _don't_ know what files I'm going to need 
> until I have started the task, I have to broadcast all the data at once, 
> which leads to nodes getting data, and then caching it to disk. In short, the 
> same issues as SparkContext.addFiles, but with the added benefit of having 
> the ability to create a mapping of paths to files.
> What I would like to see:
>  * SparkContext.addFiles(file, Enum.LazyLoad) w/ SparkFiles.get(file, 
> Enum.WaitForAvailability) or Future future = SparkFiles.get(file)
>  
>  
> Notes: 
> https://issues.apache.org/jira/browse/SPARK-4290?focusedCommentId=14205346&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14205346
>  indicated that `SparkFiles.get` would be required to get the data on the 
> local driver, but in my testing that did not appear to be the case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33118) CREATE TEMPORARY TABLE fails with location

2020-10-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33118:
-

Assignee: Pablo Langa Blanco  (was: Apache Spark)

> CREATE TEMPORARY TABLE fails with location
> --
>
> Key: SPARK-33118
> URL: https://issues.apache.org/jira/browse/SPARK-33118
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Pablo Langa Blanco
>Assignee: Pablo Langa Blanco
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
>
> The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION
>  
> {code:java}
> spark.range(3).write.parquet("/data/tmp/testspark1")
> spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path 
> '/data/tmp/testspark1')")
> spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION 
> '/data/tmp/testspark1'")
> {code}
> The error message in both cases is 
> {code:java}
> org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. 
> It must be specified manually.;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
>   at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33118) CREATE TEMPORARY TABLE fails with location

2020-10-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33118.
---
Fix Version/s: 3.0.2
   3.1.0
   Resolution: Fixed

Issue resolved by pull request 30014
[https://github.com/apache/spark/pull/30014]

> CREATE TEMPORARY TABLE fails with location
> --
>
> Key: SPARK-33118
> URL: https://issues.apache.org/jira/browse/SPARK-33118
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Pablo Langa Blanco
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.1.0, 3.0.2
>
>
> The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION
>  
> {code:java}
> spark.range(3).write.parquet("/data/tmp/testspark1")
> spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path 
> '/data/tmp/testspark1')")
> spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION 
> '/data/tmp/testspark1'")
> {code}
> The error message in both cases is 
> {code:java}
> org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. 
> It must be specified manually.;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
>   at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25390) Data source V2 API refactoring

2020-10-12 Thread Jan Berkel (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212675#comment-17212675
 ] 

Jan Berkel commented on SPARK-25390:


I'm in a similar situation. [~Kyrdan] asked on the mailing list as directed, 
but nobody replied. It's strange that such a central API is completely 
undocumented. The new iteration of the datasource API doesn't look remotely 
like v2; it might as well have been called v3.

If it's not possible to provide the documentation, at least put some 
notes/warnings in the migration guide or changelog indicating that Spark 3's 
datasource API has changed completely.

And, as far as I can tell at the moment, it doesn't seem to be possible to 
implement the new Datasource V2 using plain Java classes.

> Data source V2 API refactoring
> --
>
> Key: SPARK-25390
> URL: https://issues.apache.org/jira/browse/SPARK-25390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently it's not very clear how we should abstract data source v2 API. The 
> abstraction should be unified between batch and streaming, or similar but 
> have a well-defined difference between batch and streaming. And the 
> abstraction should also include catalog/table.
> An example of the abstraction:
> {code}
> batch: catalog -> table -> scan
> streaming: catalog -> table -> stream -> scan
> {code}
> We should refactor the data source v2 API according to the abstraction



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33118) CREATE TEMPORARY TABLE fails with location

2020-10-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33118:
--
Affects Version/s: 3.1.0

> CREATE TEMPORARY TABLE fails with location
> --
>
> Key: SPARK-33118
> URL: https://issues.apache.org/jira/browse/SPARK-33118
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Pablo Langa Blanco
>Assignee: Apache Spark
>Priority: Major
>
> The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION
>  
> {code:java}
> spark.range(3).write.parquet("/data/tmp/testspark1")
> spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path 
> '/data/tmp/testspark1')")
> spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION 
> '/data/tmp/testspark1'")
> {code}
> The error message in both cases is 
> {code:java}
> org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. 
> It must be specified manually.;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
>   at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33118) CREATE TEMPORARY TABLE fails with location

2020-10-12 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212672#comment-17212672
 ] 

Dongjoon Hyun commented on SPARK-33118:
---

Thank you for your contribution, [~planga82].

> CREATE TEMPORARY TABLE fails with location
> --
>
> Key: SPARK-33118
> URL: https://issues.apache.org/jira/browse/SPARK-33118
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Pablo Langa Blanco
>Assignee: Apache Spark
>Priority: Major
>
> The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION
>  
> {code:java}
> spark.range(3).write.parquet("/data/tmp/testspark1")
> spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path 
> '/data/tmp/testspark1')")
> spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION 
> '/data/tmp/testspark1'")
> {code}
> The error message in both cases is 
> {code:java}
> org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. 
> It must be specified manually.;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
>   at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33120) Lazy Load of SparkContext.addFiles

2020-10-12 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212670#comment-17212670
 ] 

Dongjoon Hyun edited comment on SPARK-33120 at 10/12/20, 9:00 PM:
--

Hi, [~tsmock]. What is the benefit you need here?
bq. I would like to avoid copying all of the files to every executor until it 
is actually needed.


was (Author: dongjoon):
Hi, [~tsmock]. What is the benefit you need here?
> I would like to avoid copying all of the files to every executor until it is 
> actually needed.

> Lazy Load of SparkContext.addFiles
> --
>
> Key: SPARK-33120
> URL: https://issues.apache.org/jira/browse/SPARK-33120
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
> Environment: Mac OS X (2 systems), workload to eventually be run on 
> Amazon EMR.
> Java 11 application.
>Reporter: Taylor Smock
>Priority: Minor
>
> In my spark job, I may have various random files that may or may not be used 
> by each task.
> I would like to avoid copying all of the files to every executor until it is 
> actually needed.
>  
> What I've tried:
>  * SparkContext.addFiles w/ SparkFiles.get . In testing, all files were 
> distributed to all clients.
>  * Broadcast variables. Since I _don't_ know what files I'm going to need 
> until I have started the task, I have to broadcast all the data at once, 
> which leads to nodes getting data, and then caching it to disk. In short, the 
> same issues as SparkContext.addFiles, but with the added benefit of having 
> the ability to create a mapping of paths to files.
> What I would like to see:
>  * SparkContext.addFiles(file, Enum.LazyLoad) w/ SparkFiles.get(file, 
> Enum.WaitForAvailability) or Future future = SparkFiles.get(file)
>  
>  
> Notes: 
> https://issues.apache.org/jira/browse/SPARK-4290?focusedCommentId=14205346&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14205346
>  indicated that `SparkFiles.get` would be required to get the data on the 
> local driver, but in my testing that did not appear to be the case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33120) Lazy Load of SparkContext.addFiles

2020-10-12 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212670#comment-17212670
 ] 

Dongjoon Hyun commented on SPARK-33120:
---

Hi, [~tsmock]. What is the benefit you need here?
> I would like to avoid copying all of the files to every executor until it is 
> actually needed.

> Lazy Load of SparkContext.addFiles
> --
>
> Key: SPARK-33120
> URL: https://issues.apache.org/jira/browse/SPARK-33120
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
> Environment: Mac OS X (2 systems), workload to eventually be run on 
> Amazon EMR.
> Java 11 application.
>Reporter: Taylor Smock
>Priority: Minor
>
> In my spark job, I may have various random files that may or may not be used 
> by each task.
> I would like to avoid copying all of the files to every executor until it is 
> actually needed.
>  
> What I've tried:
>  * SparkContext.addFiles w/ SparkFiles.get . In testing, all files were 
> distributed to all clients.
>  * Broadcast variables. Since I _don't_ know what files I'm going to need 
> until I have started the task, I have to broadcast all the data at once, 
> which leads to nodes getting data, and then caching it to disk. In short, the 
> same issues as SparkContext.addFiles, but with the added benefit of having 
> the ability to create a mapping of paths to files.
> What I would like to see:
>  * SparkContext.addFiles(file, Enum.LazyLoad) w/ SparkFiles.get(file, 
> Enum.WaitForAvailability) or Future future = SparkFiles.get(file)
>  
>  
> Notes: 
> https://issues.apache.org/jira/browse/SPARK-4290?focusedCommentId=14205346&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14205346
>  indicated that `SparkFiles.get` would be required to get the data on the 
> local driver, but in my testing that did not appear to be the case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33122) Remove redundant aggregates in the Optimizer

2020-10-12 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212669#comment-17212669
 ] 

Dongjoon Hyun commented on SPARK-33122:
---

Hi, [~tanelk]. Since this is an `Improvement` JIRA, I set the affected version 
to `3.1.0`, because Apache Spark doesn't allow backporting improvements or new 
features. Please use `3.1.0` next time you propose an improvement.

> Remove redundant aggregates in the Optimizer
> 
>
> Key: SPARK-33122
> URL: https://issues.apache.org/jira/browse/SPARK-33122
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Tanel Kiis
>Priority: Major
>
> It is possible to have two or more consecutive aggregates whose sole purpose 
> is to keep only distinct values (for example TPCDS q87). We can remove all 
> but the last one to improve performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33122) Remove redundant aggregates in the Optimizer

2020-10-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33122:
--
Affects Version/s: (was: 3.0.1)
   3.1.0

> Remove redundant aggregates in the Optimizer
> 
>
> Key: SPARK-33122
> URL: https://issues.apache.org/jira/browse/SPARK-33122
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Tanel Kiis
>Priority: Major
>
> It is possible to have two or more consecutive aggregates whose sole purpose 
> is to keep only distinct values (for example TPCDS q87). We can remove all 
> but the last one to improve performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25271) Creating parquet table with all the column null throws exception

2020-10-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25271:
--
Fix Version/s: 2.4.8

> Creating parquet table with all the column null throws exception
> 
>
> Key: SPARK-25271
> URL: https://issues.apache.org/jira/browse/SPARK-25271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Shivu Sondur
>Assignee: L. C. Hsieh
>Priority: Critical
> Fix For: 2.4.8, 3.0.0
>
> Attachments: image-2018-09-07-09-12-34-944.png, 
> image-2018-09-07-09-29-33-370.png, image-2018-09-07-09-29-52-899.png, 
> image-2018-09-07-09-32-43-892.png, image-2018-09-07-09-33-03-095.png
>
>
> {code:java}
>  1)cat /data/parquet.dat
> 1$abc2$pqr:3$xyz
> null{code}
>  
> {code:java}
> 2)spark.sql("create table vp_reader_temp (projects map) ROW 
> FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':' 
> MAP KEYS TERMINATED BY '$'")
> {code}
> {code:java}
> 3)spark.sql("
> LOAD DATA LOCAL INPATH '/data/parquet.dat' INTO TABLE vp_reader_temp")
> {code}
> {code:java}
> 4)spark.sql("create table vp_reader STORED AS PARQUET as select * from 
> vp_reader_temp")
> {code}
> *Result :* Throwing exception (Working fine with spark 2.2.1)
> {code:java}
> java.lang.RuntimeException: Parquet record is malformed: empty fields are 
> illegal, the field should be ommited completely instead
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:123)
>   at 
> org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:180)
>   at 
> org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:46)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:112)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:125)
>   at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:149)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:406)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:283)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:281)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1438)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:286)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:211)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:210)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:349)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.parquet.io.ParquetEncodingException: empty fields are 
> illegal, the field should be ommited completely instead
>   at 
> org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endField(MessageColumnIO.java:320)
>   at 
> org.apache.parquet.io.RecordConsumerLoggingWrapper.endField(RecordConsumerLoggingWrapper.java:165)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeMap(DataWritableWriter.java:241)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeValue(DataWritableWriter.java:116)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeGroupFields(DataWritableWriter.java:89)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:60)
>   ... 2

[jira] [Commented] (SPARK-33122) Remove redundant aggregates in the Optimizer

2020-10-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212660#comment-17212660
 ] 

Apache Spark commented on SPARK-33122:
--

User 'tanelk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30018

> Remove redundant aggregates in the Optimizer
> 
>
> Key: SPARK-33122
> URL: https://issues.apache.org/jira/browse/SPARK-33122
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Tanel Kiis
>Priority: Major
>
> It is possible to have two or more consecutive aggregates whose sole purpose 
> is to keep only distinct values (for example TPCDS q87). We can remove all 
> but the last one to improve performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33122) Remove redundant aggregates in the Optimizer

2020-10-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212661#comment-17212661
 ] 

Apache Spark commented on SPARK-33122:
--

User 'tanelk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30018

> Remove redundant aggregates in the Optimizer
> 
>
> Key: SPARK-33122
> URL: https://issues.apache.org/jira/browse/SPARK-33122
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Tanel Kiis
>Priority: Major
>
> It is possible to have two or more consecutive aggregates whose sole purpose 
> is to keep only distinct values (for example TPCDS q87). We can remove all 
> but the last one to improve performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33122) Remove redundant aggregates in the Optimizer

2020-10-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33122:


Assignee: (was: Apache Spark)

> Remove redundant aggregates in the Optimizer
> 
>
> Key: SPARK-33122
> URL: https://issues.apache.org/jira/browse/SPARK-33122
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Tanel Kiis
>Priority: Major
>
> It is possible to have two or more consecutive aggregates whose sole purpose 
> is to keep only distinct values (for example TPCDS q87). We can remove all 
> but the last one to improve performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33122) Remove redundant aggregates in the Optimizer

2020-10-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33122:


Assignee: Apache Spark

> Remove redundant aggregates in the Optimizer
> 
>
> Key: SPARK-33122
> URL: https://issues.apache.org/jira/browse/SPARK-33122
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Tanel Kiis
>Assignee: Apache Spark
>Priority: Major
>
> It is possible to have two or more consecutive aggregates whose sole purpose 
> is to keep only distinct values (for example TPCDS q87). We can remove all 
> but the last one to improve performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33122) Remove redundant aggregates in the Optimizer

2020-10-12 Thread Tanel Kiis (Jira)
Tanel Kiis created SPARK-33122:
--

 Summary: Remove redundant aggregates in the Optimizer
 Key: SPARK-33122
 URL: https://issues.apache.org/jira/browse/SPARK-33122
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.1
Reporter: Tanel Kiis


It is possible to have two or more consecutive aggregates whose sole purpose is 
to keep only distinct values (for example TPCDS q87). We can remove all but the 
last one to improve performance.
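
A toy illustration of the pattern, assuming a running `spark` session (the data is made up):

{code:java}
// The outer distinct cannot change the result once the inner one has run, so the
// optimizer could drop one of the two aggregates.
val df = spark.range(100).selectExpr("id % 3 AS k")
val deduped      = df.distinct()
val dedupedTwice = df.distinct().distinct()
dedupedTwice.explain()  // without the proposed rule this plan may carry two aggregate steps
{code}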



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30893) Expressions should not change its data type/nullability after it's created

2020-10-12 Thread Anurag Mantripragada (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212635#comment-17212635
 ] 

Anurag Mantripragada edited comment on SPARK-30893 at 10/12/20, 7:56 PM:
-

As mentioned above in [~cloud_fan]'s comment, should we backport the 
nullability and datatype issues from this umbrella to branch-2.4 as they may 
cause corruption? 

CC: [~viirya], [~dongjoon]


was (Author: anuragmantri):
As mentioned here 
[#https://issues.apache.org/jira/browse/SPARK-30893?focusedCommentId=17041618&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17041618],
 should we backport the nullability and datatype issues from this umbrella to 
branch-2.4 as they may cause corruption? 

CC: [~viirya], [~dongjoon]

> Expressions should not change its data type/nullability after it's created
> --
>
> Key: SPARK-30893
> URL: https://issues.apache.org/jira/browse/SPARK-30893
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Critical
> Fix For: 3.0.0
>
>
> This is a problem because the configuration can change between different 
> phases of planning, and this can silently break a query plan which can lead 
> to crashes or data corruption, if data type/nullability gets changed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30893) Expressions should not change its data type/nullability after it's created

2020-10-12 Thread Anurag Mantripragada (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212635#comment-17212635
 ] 

Anurag Mantripragada commented on SPARK-30893:
--

As mentioned here 
[#https://issues.apache.org/jira/browse/SPARK-30893?focusedCommentId=17041618&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17041618],
 should we backport the nullability and datatype issues from this umbrella to 
branch-2.4 as they may cause corruption? 

CC: [~viirya], [~dongjoon]

> Expressions should not change its data type/nullability after it's created
> --
>
> Key: SPARK-30893
> URL: https://issues.apache.org/jira/browse/SPARK-30893
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Critical
> Fix For: 3.0.0
>
>
> This is a problem because the configuration can change between different 
> phases of planning, and this can silently break a query plan which can lead 
> to crashes or data corruption, if data type/nullability gets changed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-33089) avro format does not propagate Hadoop config from DS options to underlying HDFS file system

2020-10-12 Thread Yuning Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuning Zhang closed SPARK-33089.


> avro format does not propagate Hadoop config from DS options to underlying 
> HDFS file system
> ---
>
> Key: SPARK-33089
> URL: https://issues.apache.org/jira/browse/SPARK-33089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuning Zhang
>Assignee: Yuning Zhang
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
>
> When running:
> {code:java}
> spark.read.format("avro").options(conf).load(path)
> {code}
> The underlying file system will not receive the `conf` options.
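
For context, a hedged sketch of the kind of call the description refers to (the Hadoop key and path below are placeholders, not the ones from the report):

{code:java}
// Hadoop settings passed as data source options are expected to reach the file system
// that backs the load path; per this report they were not propagated for avro.
val hadoopOptions = Map("fs.s3a.connection.maximum" -> "200")  // placeholder Hadoop key/value
val df = spark.read.format("avro").options(hadoopOptions).load("s3a://some-bucket/events")
{code}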



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25271) Creating parquet table with all the column null throws exception

2020-10-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212487#comment-17212487
 ] 

Apache Spark commented on SPARK-25271:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/30017

> Creating parquet table with all the column null throws exception
> 
>
> Key: SPARK-25271
> URL: https://issues.apache.org/jira/browse/SPARK-25271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Shivu Sondur
>Assignee: L. C. Hsieh
>Priority: Critical
> Fix For: 3.0.0
>
> Attachments: image-2018-09-07-09-12-34-944.png, 
> image-2018-09-07-09-29-33-370.png, image-2018-09-07-09-29-52-899.png, 
> image-2018-09-07-09-32-43-892.png, image-2018-09-07-09-33-03-095.png
>
>
> {code:java}
>  1)cat /data/parquet.dat
> 1$abc2$pqr:3$xyz
> null{code}
>  
> {code:java}
> 2)spark.sql("create table vp_reader_temp (projects map) ROW 
> FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':' 
> MAP KEYS TERMINATED BY '$'")
> {code}
> {code:java}
> 3)spark.sql("
> LOAD DATA LOCAL INPATH '/data/parquet.dat' INTO TABLE vp_reader_temp")
> {code}
> {code:java}
> 4)spark.sql("create table vp_reader STORED AS PARQUET as select * from 
> vp_reader_temp")
> {code}
> *Result :* Throwing exception (Working fine with spark 2.2.1)
> {code:java}
> java.lang.RuntimeException: Parquet record is malformed: empty fields are 
> illegal, the field should be ommited completely instead
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:123)
>   at 
> org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:180)
>   at 
> org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:46)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:112)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:125)
>   at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:149)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:406)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:283)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:281)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1438)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:286)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:211)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:210)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:349)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.parquet.io.ParquetEncodingException: empty fields are 
> illegal, the field should be ommited completely instead
>   at 
> org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endField(MessageColumnIO.java:320)
>   at 
> org.apache.parquet.io.RecordConsumerLoggingWrapper.endField(RecordConsumerLoggingWrapper.java:165)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeMap(DataWritableWriter.java:241)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeValue(DataWritableWriter.java:116)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeGroupFields(DataWritableWriter.java:89)

[jira] [Commented] (SPARK-27733) Upgrade to Avro 1.10.0

2020-10-12 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212451#comment-17212451
 ] 

Ismaël Mejía commented on SPARK-27733:
--

[~sha...@uber.com] sorry, I somehow missed the previous notification. If you have a 
future Parquet sync I would love to join to explain the full details. Otherwise, the 
tl;dr version is: the fix on the Spark side is 'easy'; the real problem is that 
Spark gets its Avro 1.8 dependency via Hive, and getting the fix into Hive has 
proven difficult (already more than a year in the making). Even with that fix 
merged, we still need it backported to the 2.3.x branch and a release that includes 
it, so we need LOTS of good will and help from the Hive people. If you know anyone 
there who can help, that would be appreciated.

The issue is related to stricter validation of unions with ill-defined defaults, 
starting in Avro 1.9.x. There are multiple options to deal with this (discussed in 
the ticket), but we will probably fix it on the Avro side. I will post an update 
here once we agree on the fix. Worst case it will also require an Avro release, but 
this is a good time since we were already discussing a release soon.
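
For readers hitting this, a hedged illustration of the kind of schema that trips the stricter check; the schema below is made up, and my understanding is that the default value has to match the first branch of the union:

{code:java}
// Parsed fine by Avro 1.8, but rejected by Avro 1.9+ default validation: the default ""
// does not match the first union branch ("null"). Using default null, or reordering the
// union to ["string","null"], passes.
val badSchema =
  """{"type":"record","name":"Example","fields":[
    |  {"name":"tag","type":["null","string"],"default":""}
    |]}""".stripMargin
{code}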

> Upgrade to Avro 1.10.0
> --
>
> Key: SPARK-27733
> URL: https://issues.apache.org/jira/browse/SPARK-27733
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.1.0
>Reporter: Ismaël Mejía
>Priority: Minor
>
> Avro 1.9.2 was released with many nice features, including reduced size (1 MB 
> less), removed dependencies (no paranamer, no shaded Guava), and security 
> updates, so it is probably a worthwhile upgrade.
> Avro 1.10.0 has since been released and this upgrade is still not done.
> At the moment (2020/08) there is still a blocker: Hive-related transitive 
> dependencies bring in older versions of Avro, so this is effectively blocked 
> until HIVE-21737 is solved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31294) Benchmark the performance regression

2020-10-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31294.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

> Benchmark the performance regression
> 
>
> Key: SPARK-31294
> URL: https://issues.apache.org/jira/browse/SPARK-31294
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33016) Potential SQLMetrics missed which might cause WEB UI display issue while AQE is on.

2020-10-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33016:
---

Assignee: Leanken.Lin

> Potential SQLMetrics missed which might cause WEB UI display issue while AQE 
> is on.
> ---
>
> Key: SPARK-33016
> URL: https://issues.apache.org/jira/browse/SPARK-33016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Assignee: Leanken.Lin
>Priority: Minor
>
> In the current AQE execution, the following scenario might cause SQLMetrics to be 
> incorrectly overridden.
>  # Stages A and B are created, and the UI is updated via the 
> onAdaptiveExecutionUpdate event.
>  # Stages A and B are running. A subquery in stage A keeps updating metrics via the 
> onAdaptiveSQLMetricUpdate event.
>  # Stage B completes, while stage A's subquery is still running and updating 
> metrics.
>  # Completion of stage B triggers new stage creation and a UI update via the 
> onAdaptiveExecutionUpdate event again (just like step 1).
>  
> It is very hard to reproduce this issue, since it only happened under high 
> concurrency. For the fix, I suggested keeping all duplicated metrics instead of 
> updating them every time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33016) Potential SQLMetrics missed which might cause WEB UI display issue while AQE is on.

2020-10-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33016.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29965
[https://github.com/apache/spark/pull/29965]

> Potential SQLMetrics missed which might cause WEB UI display issue while AQE 
> is on.
> ---
>
> Key: SPARK-33016
> URL: https://issues.apache.org/jira/browse/SPARK-33016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Assignee: Leanken.Lin
>Priority: Minor
> Fix For: 3.1.0
>
>
> In the current AQE execution, the following scenario might cause SQLMetrics to be 
> incorrectly overridden.
>  # Stages A and B are created, and the UI is updated via the 
> onAdaptiveExecutionUpdate event.
>  # Stages A and B are running. A subquery in stage A keeps updating metrics via the 
> onAdaptiveSQLMetricUpdate event.
>  # Stage B completes, while stage A's subquery is still running and updating 
> metrics.
>  # Completion of stage B triggers new stage creation and a UI update via the 
> onAdaptiveExecutionUpdate event again (just like step 1).
>  
> It is very hard to reproduce this issue, since it only happened under high 
> concurrency. For the fix, I suggested keeping all duplicated metrics instead of 
> updating them every time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33121) Spark does not stop gracefully

2020-10-12 Thread Dmitry Tverdokhleb (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Tverdokhleb updated SPARK-33121:
---
Description: 
Hi. I have Spark Streaming code like this:
{code:java}
inputStream.foreachRDD {
  rdd =>
rdd
  .mapPartitions {
// Some operations mapPartitions
  }
  .filter {
// Some operations filter
  }
  .groupBy {
// Some operations groupBy
  }
}
{code}
I set the config parameter "spark.streaming.stopGracefullyOnShutdown" to 
"*true*" and send a SIGTERM signal to stop the streaming job, but a 
java.util.concurrent.RejectedExecutionException arises:
{noformat}
2020-10-12 14:12:29 ERROR Inbox - Ignoring error
java.util.concurrent.RejectedExecutionException: Task 
org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from 
java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 6]
at 
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
at 
java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:230)
at 
org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at 
org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at 
org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at 
org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748){noformat}
Logs:
{noformat}
...
Calling rdd.mapPartitions
...
2020-10-12 14:12:22 INFO  MyProject - Shutdown hook called
2020-10-12 14:12:22 DEBUG JobScheduler - Stopping JobScheduler
2020-10-12 14:12:22 INFO  ReceiverTracker - ReceiverTracker stopped
2020-10-12 14:12:22 INFO  JobGenerator - Stopping JobGenerator gracefully
2020-10-12 14:12:22 INFO  JobGenerator - Waiting for all received blocks to be 
consumed for job generation
2020-10-12 14:12:22 INFO  JobGenerator - Waited for all received blocks to be 
consumed for job generation
...
Calling rdd.filter
2020-10-12 14:12:29 ERROR Inbox - Ignoring error
java.util.concurrent.RejectedExecutionException: Task 
org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from 
java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 6]{noformat}
This exception arises only for RDD operations (like map, filter, etc.), not in 
the business logic.

Besides, there is no problem with graceful shutdown in Spark 2.4.5.
  
  
 

  was:
Hi. I have a spark streaming code, like:
{code:java}
inputStream.foreachRDD {
  rdd =>
rdd
  .mapPartitions {
// Some operations mapPartitions
  }
  .filter {
// Some operations filter
  }
  .groupBy {
// Some operations groupBy
  }
}
{code}
I set the config parameter "spark.streaming.stopGracefullyOnShutdown" to 
"*true*" and send a SIGTERM signal to stop the Spark streaming application, but 
a java.util.concurrent.RejectedExecutionException arises:
{noformat}
2020-10-12 14:12:29 ERROR Inbox - Ignoring error
java.util.concurrent.RejectedExecutionException: Task 
org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from 
java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 6]
at 
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
at 
java.util

[jira] [Resolved] (SPARK-33103) Custom Schema with Custom RDD reorders columns when more than 4 added

2020-10-12 Thread Justin Mays (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Mays resolved SPARK-33103.
-
Resolution: Not A Problem

> Custom Schema with Custom RDD reorders columns when more than 4 added
> -
>
> Key: SPARK-33103
> URL: https://issues.apache.org/jira/browse/SPARK-33103
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
> Environment: Java Application
>Reporter: Justin Mays
>Priority: Major
>
> I have a custom RDD written in Java that uses a custom schema.  Everything 
> appears to work fine with using 4 columns, but when i add a 5th column, 
> calling show() fails with 
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> java.lang.Long is not a valid external type for schema of
> here is the schema definition in java:
> StructType schema = new StructType()
>     .add("recordId", DataTypes.LongType, false)
>     .add("col1", DataTypes.DoubleType, false)
>     .add("col2", DataTypes.DoubleType, false)
>     .add("col3", DataTypes.IntegerType, false)
>     .add("col4", DataTypes.IntegerType, false);
>  
> Here is the printout of schema.printTreeString();
> == Physical Plan ==
> *(1) Scan dw [recordId#0L,col1#1,col2#2,col3#3,col4#4] PushedFilters: [], 
> ReadSchema: struct
>  
> I hardcoded a return in my Row object with values matching the schema:
> @Override public Object get(int i) {
>     switch(i) {
>         case 0: return 0L;
>         case 1: return 1.1911950001644689D;
>         case 2: return 9.10949955666E9D;
>         case 3: return 476;
>         case 4: return 500;
>     }
>     return 0L;
> }
>  
> Here is the output of the show command:
> 15:30:26.875 ERROR org.apache.spark.executor.Executor - Exception in task 0.0 
> in stage 0.0 (TID 0)
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> java.lang.Long is not a valid external type for schema of 
> doublevalidateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 0, col1), DoubleType) AS 
> col1#30validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 1, recordId), LongType) AS 
> recordId#31Lvalidateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 2, col2), DoubleType) AS 
> col2#32validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 3, col3), IntegerType) AS 
> col3#33validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 4, col4), IntegerType) AS col4#34 at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:215)
>  ~[spark-catalyst_2.12-3.0.1.jar:3.0.1] at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:197)
>  ~[spark-catalyst_2.12-3.0.1.jar:3.0.1] at 
> scala.collection.Iterator$$anon$10.next(Iterator.scala:459) 
> ~[scala-library-2.12.10.jar:?] at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source) ~[?:?] at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  ~[spark-sql_2.12-3.0.1.jar:3.0.1] at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
>  ~[spark-sql_2.12-3.0.1.jar:3.0.1] at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
>  ~[spark-sql_2.12-3.0.1.jar:3.0.1] at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872) 
> ~[spark-core_2.12-3.0.1.jar:3.0.1] at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
>  ~[spark-core_2.12-3.0.1.jar:3.0.1] at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) 
> ~[spark-core_2.12-3.0.1.jar:3.0.1] at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) 
> ~[spark-core_2.12-3.0.1.jar:3.0.1] at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:313) 
> ~[spark-core_2.12-3.0.1.jar:3.0.1] at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) 
> ~[spark-core_2.12-3.0.1.jar:3.0.1] at 
> org.apache.spark.scheduler.Task.run(Task.scala:127) 
> ~[spark-core_2.12-3.0.1.jar:3.0.1] at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
>  ~[spark-core_2.12-3.0.1.jar:3.0.1] at 
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) 
> ~[spark-core_2.12-3.0.1.jar:3.0.1] at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) 
> [spark-core_2.12-3.0.1.jar:3.0.1] at 
> java.util

[jira] [Commented] (SPARK-33103) Custom Schema with Custom RDD reorders columns when more than 4 added

2020-10-12 Thread Justin Mays (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212428#comment-17212428
 ] 

Justin Mays commented on SPARK-33103:
-

Ok, follow up... I think I figured this out, but the API was not doing what I 
expected. The issue is that my custom BaseRelation class was implementing 
PrunedFilteredScan, and Spark was sending it the required columns in a 
different order than the schema. My relation should have supported the columns 
coming in out of order, but I didn't expect the reordering to come from Spark 
itself; I thought it would only be driven by my query, so it was never obvious 
to me, and I hadn't progressed to implementing that feature yet. I ended up 
replacing that interface with the TableScan interface for the time being, and 
everything worked with more than 4 columns.
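For anyone hitting the same thing: a PrunedFilteredScan implementation has to emit its row values in the order of the requiredColumns argument, not in the order the schema was declared. Below is a minimal sketch of that reordering; the relation name and the backing data are placeholders, not the reporter's code:
{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types._

// Hypothetical relation backed by a single in-memory record; the only point of
// interest is that buildScan emits values in requiredColumns order.
case class DemoRelation(sqlContext: SQLContext) extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = new StructType()
    .add("recordId", LongType, false)
    .add("col1", DoubleType, false)
    .add("col2", DoubleType, false)
    .add("col3", IntegerType, false)
    .add("col4", IntegerType, false)

  // Placeholder data keyed by column name.
  private val record: Map[String, Any] = Map(
    "recordId" -> 0L, "col1" -> 1.19d, "col2" -> 9.1e9d, "col3" -> 476, "col4" -> 500)

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // Spark may ask for the columns in any order; emit the values in exactly
    // the order it asked for, not in schema-declaration order.
    val row = Row.fromSeq(requiredColumns.map(record).toSeq)
    sqlContext.sparkContext.parallelize(Seq(row))
  }
}
{code}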

> Custom Schema with Custom RDD reorders columns when more than 4 added
> -
>
> Key: SPARK-33103
> URL: https://issues.apache.org/jira/browse/SPARK-33103
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
> Environment: Java Application
>Reporter: Justin Mays
>Priority: Major
>
> I have a custom RDD written in Java that uses a custom schema.  Everything 
> appears to work fine with using 4 columns, but when i add a 5th column, 
> calling show() fails with 
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> java.lang.Long is not a valid external type for schema of
> here is the schema definition in java:
> StructType schema = new StructType()
>     .add("recordId", DataTypes.LongType, false)
>     .add("col1", DataTypes.DoubleType, false)
>     .add("col2", DataTypes.DoubleType, false)
>     .add("col3", DataTypes.IntegerType, false)
>     .add("col4", DataTypes.IntegerType, false);
>  
> Here is the printout of schema.printTreeString();
> == Physical Plan ==
> *(1) Scan dw [recordId#0L,col1#1,col2#2,col3#3,col4#4] PushedFilters: [], 
> ReadSchema: struct
>  
> I hardcoded a return in my Row object with values matching the schema:
> @Override public Object get(int i) {
>     switch(i) {
>         case 0: return 0L;
>         case 1: return 1.1911950001644689D;
>         case 2: return 9.10949955666E9D;
>         case 3: return 476;
>         case 4: return 500;
>     }
>     return 0L;
> }
>  
> Here is the output of the show command:
> 15:30:26.875 ERROR org.apache.spark.executor.Executor - Exception in task 0.0 
> in stage 0.0 (TID 0)
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> java.lang.Long is not a valid external type for schema of 
> doublevalidateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 0, col1), DoubleType) AS 
> col1#30validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 1, recordId), LongType) AS 
> recordId#31Lvalidateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 2, col2), DoubleType) AS 
> col2#32validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 3, col3), IntegerType) AS 
> col3#33validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 4, col4), IntegerType) AS col4#34 at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:215)
>  ~[spark-catalyst_2.12-3.0.1.jar:3.0.1] at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:197)
>  ~[spark-catalyst_2.12-3.0.1.jar:3.0.1] at 
> scala.collection.Iterator$$anon$10.next(Iterator.scala:459) 
> ~[scala-library-2.12.10.jar:?] at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source) ~[?:?] at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  ~[spark-sql_2.12-3.0.1.jar:3.0.1] at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
>  ~[spark-sql_2.12-3.0.1.jar:3.0.1] at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
>  ~[spark-sql_2.12-3.0.1.jar:3.0.1] at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872) 
> ~[spark-core_2.12-3.0.1.jar:3.0.1] at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
>  ~[spark-core_2.12-3.0.1.jar:3.0.1] at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) 
> ~[spark-core_2.12-3.0.1.jar:3.0.1] at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) 
> ~[spark-core_2.

[jira] [Updated] (SPARK-33121) Spark does not stop gracefully

2020-10-12 Thread Dmitry Tverdokhleb (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Tverdokhleb updated SPARK-33121:
---
Description: 
Hi. I have a spark streaming code, like:
{code:java}
inputStream.foreachRDD {
  rdd =>
rdd
  .mapPartitions {
// Some operations mapPartitions
  }
  .filter {
// Some operations filter
  }
  .groupBy {
// Some operations groupBy
  }
}
{code}
I set the config parameter "spark.streaming.stopGracefullyOnShutdown" to 
"*true*" and send a SIGTERM signal to stop the Spark streaming application, but 
an exception arises:
{noformat}
2020-10-12 14:12:29 ERROR Inbox - Ignoring error
java.util.concurrent.RejectedExecutionException: Task 
org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from 
java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 6]
at 
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
at 
java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:230)
at 
org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at 
org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at 
org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at 
org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748){noformat}
Logs:
{noformat}
...
Calling rdd.mapPartitions
...
2020-10-12 14:12:22 INFO  MyProject - Shutdown hook called
2020-10-12 14:12:22 DEBUG JobScheduler - Stopping JobScheduler
2020-10-12 14:12:22 INFO  ReceiverTracker - ReceiverTracker stopped
2020-10-12 14:12:22 INFO  JobGenerator - Stopping JobGenerator gracefully
2020-10-12 14:12:22 INFO  JobGenerator - Waiting for all received blocks to be 
consumed for job generation
2020-10-12 14:12:22 INFO  JobGenerator - Waited for all received blocks to be 
consumed for job generation
...
Calling rdd.filter
2020-10-12 14:12:29 ERROR Inbox - Ignoring error
java.util.concurrent.RejectedExecutionException: Task 
org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from 
java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 6]{noformat}
This exception arises only for RDD operations (like map, filter, etc.), not in 
the business logic.

Besides, there is no problem with graceful shutdown in Spark 2.4.5.
  
  
  
 

  was:
Hi. I have a spark streaming code, like:
{code:java}
inputStream.foreachRDD {
  rdd =>
rdd
  .mapPartitions {
// Some operations mapPartitions
  }
  .filter {
// Some operations filter
  }
  .groupBy {
// Some operations groupBy
  }
}
{code}
I set the config parameter "spark.streaming.stopGracefullyOnShutdown" to 
"*true*" and send a SIGTERM signal to stop the Spark streaming application, but 
a java.util.concurrent.RejectedExecutionException arises:
{noformat}
2020-10-12 14:12:29 ERROR Inbox - Ignoring error
java.util.concurrent.RejectedExecutionException: Task 
org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from 
java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 6]
at 
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
at 
java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExe

[jira] [Created] (SPARK-33121) Spark does not stop gracefully

2020-10-12 Thread Dmitry Tverdokhleb (Jira)
Dmitry Tverdokhleb created SPARK-33121:
--

 Summary: Spark does not stop gracefully
 Key: SPARK-33121
 URL: https://issues.apache.org/jira/browse/SPARK-33121
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 3.0.1
Reporter: Dmitry Tverdokhleb


Hi. I have a spark streaming code, like:
{code:java}
inputStream.foreachRDD {
  rdd =>
rdd
  .mapPartitions {
// Some operations mapPartitions
null
  }
  .filter {
// Some operations filter
null
  }
  .groupBy {
// Some operations groupBy
null
  }
}
{code}
I set the config parameter "spark.streaming.stopGracefullyOnShutdown" to 
"*true*" and send a SIGTERM signal to stop the Spark streaming application, but 
a java.util.concurrent.RejectedExecutionException arises:
{noformat}
2020-10-12 14:12:29 ERROR Inbox - Ignoring error
java.util.concurrent.RejectedExecutionException: Task 
org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from 
java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 6]
at 
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
at 
java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:230)
at 
org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at 
org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at 
org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at 
org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748){noformat}
Logs:
{noformat}
...
Calling rdd.mapPartitions
...
2020-10-12 14:12:22 INFO  MyProject - Shutdown hook called
2020-10-12 14:12:22 DEBUG JobScheduler - Stopping JobScheduler
2020-10-12 14:12:22 INFO  ReceiverTracker - ReceiverTracker stopped
2020-10-12 14:12:22 INFO  JobGenerator - Stopping JobGenerator gracefully
2020-10-12 14:12:22 INFO  JobGenerator - Waiting for all received blocks to be 
consumed for job generation
2020-10-12 14:12:22 INFO  JobGenerator - Waited for all received blocks to be 
consumed for job generation
...
Calling rdd.filter
2020-10-12 14:12:29 ERROR Inbox - Ignoring error
java.util.concurrent.RejectedExecutionException: Task 
org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from 
java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 6]{noformat}
This exception arises only for RDD operations (Like map, filter, etc.), not 
business logic.

Besides, there is no problem with graceful shutdown in spark 2.4.5.
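For reference, a minimal sketch of the kind of setup described above; the master, app name, batch interval and input source are placeholders, not taken from the reporter's job:
{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GracefulShutdownExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")                                    // placeholder master
      .setAppName("graceful-shutdown-example")                  // placeholder app name
      .set("spark.streaming.stopGracefullyOnShutdown", "true")  // the setting from the report
    val ssc = new StreamingContext(conf, Seconds(5))            // placeholder batch interval

    // Placeholder source; the reporter's inputStream comes from elsewhere.
    val inputStream = ssc.socketTextStream("localhost", 9999)
    inputStream.foreachRDD { rdd =>
      rdd
        .mapPartitions(identity)   // stand-in for the reporter's mapPartitions logic
        .filter(_ => true)         // stand-in for the reporter's filter logic
        .foreach(_ => ())          // force evaluation of the batch
    }

    ssc.start()
    // With stopGracefullyOnShutdown=true, a SIGTERM runs Spark's shutdown hook,
    // which stops the StreamingContext gracefully (finishing in-flight work)
    // instead of killing it abruptly.
    ssc.awaitTermination()
  }
}
{code}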
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33121) Spark does not stop gracefully

2020-10-12 Thread Dmitry Tverdokhleb (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Tverdokhleb updated SPARK-33121:
---
Description: 
Hi. I have a spark streaming code, like:
{code:java}
inputStream.foreachRDD {
  rdd =>
rdd
  .mapPartitions {
// Some operations mapPartitions
  }
  .filter {
// Some operations filter
  }
  .groupBy {
// Some operations groupBy
  }
}
{code}
I set the config parameter "spark.streaming.stopGracefullyOnShutdown" to 
"*true*" and send a SIGTERM signal to stop the Spark streaming application, but 
a java.util.concurrent.RejectedExecutionException arises:
{noformat}
2020-10-12 14:12:29 ERROR Inbox - Ignoring error
java.util.concurrent.RejectedExecutionException: Task 
org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from 
java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 6]
at 
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
at 
java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:230)
at 
org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at 
org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at 
org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at 
org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748){noformat}
Logs:
{noformat}
...
Calling rdd.mapPartitions
...
2020-10-12 14:12:22 INFO  MyProject - Shutdown hook called
2020-10-12 14:12:22 DEBUG JobScheduler - Stopping JobScheduler
2020-10-12 14:12:22 INFO  ReceiverTracker - ReceiverTracker stopped
2020-10-12 14:12:22 INFO  JobGenerator - Stopping JobGenerator gracefully
2020-10-12 14:12:22 INFO  JobGenerator - Waiting for all received blocks to be 
consumed for job generation
2020-10-12 14:12:22 INFO  JobGenerator - Waited for all received blocks to be 
consumed for job generation
...
Calling rdd.filter
2020-10-12 14:12:29 ERROR Inbox - Ignoring error
java.util.concurrent.RejectedExecutionException: Task 
org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from 
java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 6]{noformat}
This exception arises only for RDD operations (like map, filter, etc.), not in 
the business logic.

Besides, there is no problem with graceful shutdown in Spark 2.4.5.
  
 

  was:
Hi. I have a spark streaming code, like:
{code:java}
inputStream.foreachRDD {
  rdd =>
rdd
  .mapPartitions {
// Some operations mapPartitions
null
  }
  .filter {
// Some operations filter
null
  }
  .groupBy {
// Some operations groupBy
null
  }
}
{code}
I set the config parameter "spark.streaming.stopGracefullyOnShutdown" to 
"*true*" and send a SIGTERM signal to stop the Spark streaming application, but 
a java.util.concurrent.RejectedExecutionException arises:
{noformat}
2020-10-12 14:12:29 ERROR Inbox - Ignoring error
java.util.concurrent.RejectedExecutionException: Task 
org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from 
java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 6]
at 
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoo

[jira] [Created] (SPARK-33120) Lazy Load of SparkContext.addFiles

2020-10-12 Thread Taylor Smock (Jira)
Taylor Smock created SPARK-33120:


 Summary: Lazy Load of SparkContext.addFiles
 Key: SPARK-33120
 URL: https://issues.apache.org/jira/browse/SPARK-33120
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.1
 Environment: Mac OS X (2 systems), workload to eventually be run on 
Amazon EMR.

Java 11 application.
Reporter: Taylor Smock


In my spark job, I may have various random files that may or may not be used by 
each task.

I would like to avoid copying all of the files to every executor until it is 
actually needed.

 

What I've tried:
 * SparkContext.addFiles w/ SparkFiles.get. In testing, all files were 
distributed to all clients.
 * Broadcast variables. Since I _don't_ know what files I'm going to need until 
I have started the task, I have to broadcast all the data at once, which leads 
to nodes getting data, and then caching it to disk. In short, the same issues 
as SparkContext.addFiles, but with the added benefit of having the ability to 
create a mapping of paths to files.

What I would like to see:
 * SparkContext.addFiles(file, Enum.LazyLoad) w/ SparkFiles.get(file, 
Enum.WaitForAvailability) or Future future = SparkFiles.get(file)

 

 

Notes: 
https://issues.apache.org/jira/browse/SPARK-4290?focusedCommentId=14205346&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14205346
 indicated that `SparkFiles.get` would be required to get the data on the local 
driver, but in my testing that did not appear to be the case.
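For reference, the "SparkContext.addFiles w/ SparkFiles.get" pattern mentioned above looks roughly like the sketch below (the file path and job body are placeholders). Today, as the reporter observed, every executor ends up fetching every registered file regardless of whether a given task reads it; the request is to defer that until SparkFiles.get is actually called:
{code:scala}
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

object AddFileToday {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("addfile-example").getOrCreate()
    val sc = spark.sparkContext

    // Registers the file up front; "/tmp/aux-data.bin" is a placeholder path.
    sc.addFile("/tmp/aux-data.bin")

    val summaries = sc.parallelize(1 to 4).map { i =>
      // Resolves the executor-local copy by file name. With the proposed lazy
      // mode, this is the point where fetching could happen on demand instead.
      val localPath = SparkFiles.get("aux-data.bin")
      s"task $i would read $localPath"
    }.collect()

    summaries.foreach(println)
    spark.stop()
  }
}
{code}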



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33111) aft transform optimization

2020-10-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-33111:


Assignee: zhengruifeng

> aft transform optimization
> --
>
> Key: SPARK-33111
> URL: https://issues.apache.org/jira/browse/SPARK-33111
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>
> when {{predictionCol}} and {{quantilesCol}} are both set, we only need one 
> computation for each row



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33111) aft transform optimization

2020-10-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-33111.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 3
[https://github.com/apache/spark/pull/3]

> aft transform optimization
> --
>
> Key: SPARK-33111
> URL: https://issues.apache.org/jira/browse/SPARK-33111
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.1.0
>
>
> when {{predictionCol}} and {{quantilesCol}} are both set, we only need one 
> computation for each row



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-33116) Spark SQL window function with order by cause result incorrect

2020-10-12 Thread Will Du (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Du closed SPARK-33116.
---

> Spark SQL window function with order by cause result incorrect
> --
>
> Key: SPARK-33116
> URL: https://issues.apache.org/jira/browse/SPARK-33116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Will Du
>Priority: Major
>
> Prepare the data
> CREATE TABLE IF NOT EXISTS product_catalog (
> name STRING,category STRING,location STRING,price DECIMAL(10,2));
> INSERT OVERWRITE product_catalog VALUES 
> ('Nest Coffee', 'drink', 'Toronto', 15.5),
> ('Pepesi', 'drink', 'Toronto', 9.99),
> ('Hasimal', 'toy', 'Toronto', 5.9),
> ('Fire War', 'game', 'Toronto', 70.0),
> ('Final Fantasy', 'game', 'Montreal', 79.99),
> ('Lego Friends 15005', 'toy', 'Montreal', 12.99),
> ('Nesion Milk', 'drink', 'Montreal', 8.9);
> 1. Query without ORDER BY after PARTITION BY col,  the result is correct.
> SELECT
> category, price,
> max(price) over(PARTITION BY category) as max_p,
> min(price) over(PARTITION BY category) as min_p,
> sum(price) over(PARTITION BY category) as sum_p,
> avg(price) over(PARTITION BY category) as avg_p,
> count(*) over(PARTITION BY category) as count_w
> FROM
> product_catalog;
> || category    || price      || max_p  || min_p    || sum_p    || avg_p       
>     || count_w   ||
> | drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
> | game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
> | toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> | toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> 7 rows selected (0.442 seconds)
> 2. Query with ORDER BY after PARTITION BY col, the result is NOT correct. The 
> min result is ok. Why are the other results like that?
> SELECT
> category, price,
> max(price) over(PARTITION BY category ORDER BY price) as max_p,
> min(price) over(PARTITION BY category ORDER BY price) as min_p,
> sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
> avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
> count(*)   over(PARTITION BY category ORDER BY price) as count_w
> FROM
> product_catalog;
> || category    || price      || max_p  || min_p    || sum_p    || avg_p       
>     || count_w   ||
> | drink | 8.90   | 8.90   | 8.90   | 8.90| 8.90   | 1|
> | drink | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2|
> | drink | 15.50  | 15.50  | 8.90   | 34.39   | 11.46  | 3|
> | game  | 70.00  | 70.00  | 70.00  | 70.00   | 70.00  | 1|
> | game  | 79.99  | 79.99  | 70.00  | 149.99  | 74.995000  | 2|
> | toy   | 5.90   | 5.90   | 5.90   | 5.90| 5.90   | 1|
> | toy   | 12.99  | 12.99  | 5.90   | 18.89   | 9.445000   | 2|
> 7 rows selected (0.436 seconds)
> Does this mean that we can only order by the columns used in the partition by 
> clause? I do not think there is such a limitation in standard SQL.
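The running-style numbers in query 2 match the standard default window frame rather than a Spark-specific defect: once an ORDER BY is present and no frame is specified, the frame defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so max/sum/avg/count become running aggregates up to the current price. A small sketch (assuming a spark-shell style {{spark}} session and the product_catalog table above) showing how an explicit frame restores whole-partition results while keeping the ordering:
{code:scala}
// Sketch only; assumes the product_catalog table from the report exists.
val fixed = spark.sql("""
  SELECT
    category, price,
    max(price) OVER (PARTITION BY category ORDER BY price
                     ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS max_p,
    sum(price) OVER (PARTITION BY category ORDER BY price
                     ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum_p,
    count(*)   OVER (PARTITION BY category ORDER BY price
                     ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS count_w
  FROM product_catalog
""")
fixed.show()
{code}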



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33116) Spark SQL window function with order by cause result incorrect

2020-10-12 Thread Will Du (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212359#comment-17212359
 ] 

Will Du edited comment on SPARK-33116 at 10/12/20, 1:41 PM:


[~maropu], the statement you mentioned does not have PARTITION BY in the 
example, but I am able to reproduce the same behavior in SQL Server, so I think 
this can be closed. 


was (Author: willddy):
[~maropu], the statement you mentioned is comparing queries with PARTITION BY 
and without PARTITION BY. But if you look at the query I provided, both of them 
have PARTITION BY clause. The only difference is the ORDER BY clause added or 
not. The expected result I think should be the same on both queries except the 
orders of rows (by price).

> Spark SQL window function with order by cause result incorrect
> --
>
> Key: SPARK-33116
> URL: https://issues.apache.org/jira/browse/SPARK-33116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Will Du
>Priority: Major
>
> Prepare the data
> CREATE TABLE IF NOT EXISTS product_catalog (
> name STRING,category STRING,location STRING,price DECIMAL(10,2));
> INSERT OVERWRITE product_catalog VALUES 
> ('Nest Coffee', 'drink', 'Toronto', 15.5),
> ('Pepesi', 'drink', 'Toronto', 9.99),
> ('Hasimal', 'toy', 'Toronto', 5.9),
> ('Fire War', 'game', 'Toronto', 70.0),
> ('Final Fantasy', 'game', 'Montreal', 79.99),
> ('Lego Friends 15005', 'toy', 'Montreal', 12.99),
> ('Nesion Milk', 'drink', 'Montreal', 8.9);
> 1. Query without ORDER BY after PARTITION BY col,  the result is correct.
> SELECT
> category, price,
> max(price) over(PARTITION BY category) as max_p,
> min(price) over(PARTITION BY category) as min_p,
> sum(price) over(PARTITION BY category) as sum_p,
> avg(price) over(PARTITION BY category) as avg_p,
> count(*) over(PARTITION BY category) as count_w
> FROM
> product_catalog;
> || category    || price      || max_p  || min_p    || sum_p    || avg_p       
>     || count_w   ||
> | drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
> | game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
> | toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> | toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> 7 rows selected (0.442 seconds)
> 2. Query with ORDER BY after PARTITION BY col, the result is NOT correct. The 
> min result is ok. Why are the other results like that?
> SELECT
> category, price,
> max(price) over(PARTITION BY category ORDER BY price) as max_p,
> min(price) over(PARTITION BY category ORDER BY price) as min_p,
> sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
> avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
> count(*)   over(PARTITION BY category ORDER BY price) as count_w
> FROM
> product_catalog;
> || category    || price      || max_p  || min_p    || sum_p    || avg_p       
>     || count_w   ||
> | drink | 8.90   | 8.90   | 8.90   | 8.90| 8.90   | 1|
> | drink | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2|
> | drink | 15.50  | 15.50  | 8.90   | 34.39   | 11.46  | 3|
> | game  | 70.00  | 70.00  | 70.00  | 70.00   | 70.00  | 1|
> | game  | 79.99  | 79.99  | 70.00  | 149.99  | 74.995000  | 2|
> | toy   | 5.90   | 5.90   | 5.90   | 5.90| 5.90   | 1|
> | toy   | 12.99  | 12.99  | 5.90   | 18.89   | 9.445000   | 2|
> 7 rows selected (0.436 seconds)
> Does this mean that we can only order by the columns used in the partition by 
> clause? I do not think there is such a limitation in standard SQL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31338) Spark SQL JDBC Data Source partitioned read : Spark SQL does not honor for NOT NULL table definition of partition key.

2020-10-12 Thread Ankush Chatterjee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212373#comment-17212373
 ] 

Ankush Chatterjee commented on SPARK-31338:
---

When read like this is used : -
{code:java}
jdbcRead = spark.read
.option("fetchsize", fetchSize)
.jdbc(
url = s"${connectionURL}",
table = s"${query}",
columnName = s"${partKey}",
lowerBound = lBound,
upperBound = hBound,
numPartitions = numParts,
connectionProperties = connProps);
{code}
 

Spark generates one query per partition. For the first partition, Spark adds 
"or $column is null" to the where clause, which makes some databases do a full 
table scan and has a heavy impact on performance (even on columns declared NOT 
NULL).

In JDBCRelation.scala :- 
 
{code:java}
while (i < numPartitions) {
  val lBoundValue = boundValueToString(currentValue)
  val lBound = if (i != 0) s"$column >= $lBoundValue" else null
  currentValue += stride
  val uBoundValue = boundValueToString(currentValue)
  val uBound = if (i != numPartitions - 1) s"$column < $uBoundValue" else 
null
  val whereClause =
if (uBound == null) {
  lBound
} else if (lBound == null) {
  s"$uBound or $column is null"
} else {
  s"$lBound AND $uBound"
}
  ans += JDBCPartition(whereClause, i)
  i = i + 1
}{code}
 

Is it feasible to add an option in JDBCOptions to enable/disable adding "or 
$column is null", since partitioning on a NOT NULL column is commonplace?
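In the meantime, a possible workaround (not the requested option) is the predicate-based jdbc overload, which lets the caller supply the per-partition WHERE clauses directly so the IS NULL branch is never generated; the connection URL, bounds and partition count below are placeholders:
{code:scala}
import java.util.Properties
import org.apache.spark.sql.SparkSession

// Sketch only: hand-built per-partition predicates, so no
// "... OR l_orderkey IS NULL" clause is ever generated.
val spark = SparkSession.builder().appName("jdbc-predicates-example").getOrCreate()

val lower = 1L
val upper = 6000000000L
val numParts = 16
val stride = (upper - lower) / numParts

val predicates: Array[String] = (0 until numParts).map { i =>
  val lo = lower + i * stride
  val hi = lower + (i + 1) * stride
  if (i == numParts - 1) s"l_orderkey >= $lo"            // last partition is open-ended
  else s"l_orderkey >= $lo AND l_orderkey < $hi"
}.toArray

val connProps = new Properties()
connProps.setProperty("fetchsize", "10000")

val jdbcRead = spark.read.jdbc(
  url = "jdbc:postgresql://dbhost:5432/postgres",        // placeholder URL
  table = "public.lineitem_sf1000",
  predicates = predicates,
  connectionProperties = connProps)
{code}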

 

[~olkuznsmith]

 

> Spark SQL JDBC Data Source partitioned read : Spark SQL does not honor for 
> NOT NULL table definition of partition key.
> --
>
> Key: SPARK-31338
> URL: https://issues.apache.org/jira/browse/SPARK-31338
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.5
>Reporter: Mohit Dave
>Priority: Major
>
> h2. *Our Use-case Details:*
> While reading from a jdbc source using spark sql, we are using below read 
> format :
> jdbc(url: String, table: String, columnName: String, lowerBound: Long, 
> upperBound: Long, numPartitions: Int, connectionProperties: Properties).
> *Table defination :* 
>  postgres=> \d lineitem_sf1000
>  Table "public.lineitem_sf1000"
>  Column | Type | Modifiers
>  -++--
>  *l_orderkey | bigint | not null*
>  l_partkey | bigint | not null
>  l_suppkey | bigint | not null
>  l_linenumber | bigint | not null
>  l_quantity | numeric(10,2) | not null
>  l_extendedprice | numeric(10,2) | not null
>  l_discount | numeric(10,2) | not null
>  l_tax | numeric(10,2) | not null
>  l_returnflag | character varying(1) | not null
>  l_linestatus | character varying(1) | not null
>  l_shipdate | character varying(29) | not null
>  l_commitdate | character varying(29) | not null
>  l_receiptdate | character varying(29) | not null
>  l_shipinstruct | character varying(25) | not null
>  l_shipmode | character varying(10) | not null
>  l_comment | character varying(44) | not null
>  Indexes:
>  "l_order_sf1000_idx" btree (l_orderkey)
>  
> *Partition column* : l_orderkey 
> *numpartion* : 16 
> h2. *Problem details :* 
>  
> {code:java}
> SELECT 
> "l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag"
>  FROM (SELECT 
> l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment
>  FROM public.lineitem_sf1000) query_alias WHERE l_orderkey >= 150001 AND 
> l_orderkey < 187501 {code}
> 15 queries are generated with the above BETWEEN clauses. The last query looks 
> like this below:
> {code:java}
> SELECT 
> "l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag"
>  FROM (SELECT 
> l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment
>  FROM public.lineitem_sf1000) query_alias WHERE l_orderkey < 37501 or 
> l_orderkey is null {code}
> *In the last query, we are trying to get the remaining records, along with 
> any data in the table for the partition key having NULL values.*
> This hurts performance badly. While the first 15 SQLs took approximately 10 
> minutes to execute, the last SQL with the NULL check takes 45 minut

[jira] [Comment Edited] (SPARK-33116) Spark SQL window function with order by cause result incorrect

2020-10-12 Thread Will Du (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212359#comment-17212359
 ] 

Will Du edited comment on SPARK-33116 at 10/12/20, 1:19 PM:


[~maropu], the statement you mentioned is comparing queries with PARTITION BY 
and without PARTITION BY. But if you look at the query I provided, both of them 
have PARTITION BY clause. The only difference is the ORDER BY clause added or 
not. The expected result I think should be the same on both queries except the 
orders of rows (by price).


was (Author: willddy):
[~maropu], the statement you mentioned is comparing queries with PARTITION BY 
and without PARTITION BY. But if you look at the query I provided, both of them 
have PARTITION BY clause. The only difference is the ORDER BY clause added or 
not. The expected result I think should be the same on both queries except the 
orders of rows (by price).

> Spark SQL window function with order by cause result incorrect
> --
>
> Key: SPARK-33116
> URL: https://issues.apache.org/jira/browse/SPARK-33116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Will Du
>Priority: Major
>
> Prepare the data
> CREATE TABLE IF NOT EXISTS product_catalog (
> name STRING,category STRING,location STRING,price DECIMAL(10,2));
> INSERT OVERWRITE product_catalog VALUES 
> ('Nest Coffee', 'drink', 'Toronto', 15.5),
> ('Pepesi', 'drink', 'Toronto', 9.99),
> ('Hasimal', 'toy', 'Toronto', 5.9),
> ('Fire War', 'game', 'Toronto', 70.0),
> ('Final Fantasy', 'game', 'Montreal', 79.99),
> ('Lego Friends 15005', 'toy', 'Montreal', 12.99),
> ('Nesion Milk', 'drink', 'Montreal', 8.9);
> 1. Query without ORDER BY after PARTITION BY col,  the result is correct.
> SELECT
> category, price,
> max(price) over(PARTITION BY category) as max_p,
> min(price) over(PARTITION BY category) as min_p,
> sum(price) over(PARTITION BY category) as sum_p,
> avg(price) over(PARTITION BY category) as avg_p,
> count(*) over(PARTITION BY category) as count_w
> FROM
> product_catalog;
> || category    || price      || max_p  || min_p    || sum_p    || avg_p       
>     || count_w   ||
> | drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
> | game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
> | toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> | toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> 7 rows selected (0.442 seconds)
> 2. Query with ORDER BY after PARTITION BY col, the result is NOT correct. The 
> min result is ok. Why are the other results like that?
> SELECT
> category, price,
> max(price) over(PARTITION BY category ORDER BY price) as max_p,
> min(price) over(PARTITION BY category ORDER BY price) as min_p,
> sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
> avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
> count(*)   over(PARTITION BY category ORDER BY price) as count_w
> FROM
> product_catalog;
> || category    || price      || max_p  || min_p    || sum_p    || avg_p       
>     || count_w   ||
> | drink | 8.90   | 8.90   | 8.90   | 8.90| 8.90   | 1|
> | drink | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2|
> | drink | 15.50  | 15.50  | 8.90   | 34.39   | 11.46  | 3|
> | game  | 70.00  | 70.00  | 70.00  | 70.00   | 70.00  | 1|
> | game  | 79.99  | 79.99  | 70.00  | 149.99  | 74.995000  | 2|
> | toy   | 5.90   | 5.90   | 5.90   | 5.90| 5.90   | 1|
> | toy   | 12.99  | 12.99  | 5.90   | 18.89   | 9.445000   | 2|
> 7 rows selected (0.436 seconds)
> Does this mean that we can only order by the columns used in the partition by 
> clause? I do not think there is such a limitation in standard SQL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33116) Spark SQL window function with order by cause result incorrect

2020-10-12 Thread Will Du (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212359#comment-17212359
 ] 

Will Du edited comment on SPARK-33116 at 10/12/20, 1:18 PM:


[~maropu], the statement you mentioned is comparing queries with PARTITION BY 
and without PARTITION BY. But if you look at the query I provided, both of them 
have PARTITION BY clause. The only difference is the ORDER BY clause added or 
not. The expected result I think should be the same on both queries except the 
orders of rows (by price).


was (Author: willddy):
[~maropu], the statement is comparing query with PARTITION BY and without 
PARTITION BY. But if you look at the query I provided, both of them have 
PARTITION BY clause. The only difference is the ORDER BY clause added or not. 
The expected result I think should be the same on both queries except the 
orders of rows (by price).

> Spark SQL window function with order by cause result incorrect
> --
>
> Key: SPARK-33116
> URL: https://issues.apache.org/jira/browse/SPARK-33116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Will Du
>Priority: Major
>
> Prepare the data
> CREATE TABLE IF NOT EXISTS product_catalog (
> name STRING,category STRING,location STRING,price DECIMAL(10,2));
> INSERT OVERWRITE product_catalog VALUES 
> ('Nest Coffee', 'drink', 'Toronto', 15.5),
> ('Pepesi', 'drink', 'Toronto', 9.99),
> ('Hasimal', 'toy', 'Toronto', 5.9),
> ('Fire War', 'game', 'Toronto', 70.0),
> ('Final Fantasy', 'game', 'Montreal', 79.99),
> ('Lego Friends 15005', 'toy', 'Montreal', 12.99),
> ('Nesion Milk', 'drink', 'Montreal', 8.9);
> 1. Query without ORDER BY after PARTITION BY col,  the result is correct.
> SELECT
> category, price,
> max(price) over(PARTITION BY category) as max_p,
> min(price) over(PARTITION BY category) as min_p,
> sum(price) over(PARTITION BY category) as sum_p,
> avg(price) over(PARTITION BY category) as avg_p,
> count(*) over(PARTITION BY category) as count_w
> FROM
> product_catalog;
> || category    || price      || max_p  || min_p    || sum_p    || avg_p       
>     || count_w   ||
> | drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
> | game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
> | toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> | toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> 7 rows selected (0.442 seconds)
> 2. Query with ORDER BY after PARTITION BY col, the result is NOT correct. The 
> min result is ok. Why are the other results like that?
> SELECT
> category, price,
> max(price) over(PARTITION BY category ORDER BY price) as max_p,
> min(price) over(PARTITION BY category ORDER BY price) as min_p,
> sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
> avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
> count(*)   over(PARTITION BY category ORDER BY price) as count_w
> FROM
> product_catalog;
> || category    || price      || max_p  || min_p    || sum_p    || avg_p       
>     || count_w   ||
> | drink | 8.90   | 8.90   | 8.90   | 8.90| 8.90   | 1|
> | drink | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2|
> | drink | 15.50  | 15.50  | 8.90   | 34.39   | 11.46  | 3|
> | game  | 70.00  | 70.00  | 70.00  | 70.00   | 70.00  | 1|
> | game  | 79.99  | 79.99  | 70.00  | 149.99  | 74.995000  | 2|
> | toy   | 5.90   | 5.90   | 5.90   | 5.90| 5.90   | 1|
> | toy   | 12.99  | 12.99  | 5.90   | 18.89   | 9.445000   | 2|
> 7 rows selected (0.436 seconds)
> Does this mean that we can only order by the columns used in the partition by 
> clause? I do not think there is such a limitation in standard SQL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33116) Spark SQL window function with order by cause result incorrect

2020-10-12 Thread Will Du (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212359#comment-17212359
 ] 

Will Du commented on SPARK-33116:
-

[~maropu], the statement is comparing query with PARTITION BY and without 
PARTITION BY. But if you look at the query I provided, both of them have 
PARTITION BY clause. The only difference is the ORDER BY clause added or not. 
The expected result I think should be the same on both queries except the 
orders of rows (by price).

> Spark SQL window function with order by cause result incorrect
> --
>
> Key: SPARK-33116
> URL: https://issues.apache.org/jira/browse/SPARK-33116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Will Du
>Priority: Major
>
> Prepare the data
> CREATE TABLE IF NOT EXISTS product_catalog (
> name STRING,category STRING,location STRING,price DECIMAL(10,2));
> INSERT OVERWRITE product_catalog VALUES 
> ('Nest Coffee', 'drink', 'Toronto', 15.5),
> ('Pepesi', 'drink', 'Toronto', 9.99),
> ('Hasimal', 'toy', 'Toronto', 5.9),
> ('Fire War', 'game', 'Toronto', 70.0),
> ('Final Fantasy', 'game', 'Montreal', 79.99),
> ('Lego Friends 15005', 'toy', 'Montreal', 12.99),
> ('Nesion Milk', 'drink', 'Montreal', 8.9);
> 1. Query without ORDER BY after PARTITION BY col,  the result is correct.
> SELECT
> category, price,
> max(price) over(PARTITION BY category) as max_p,
> min(price) over(PARTITION BY category) as min_p,
> sum(price) over(PARTITION BY category) as sum_p,
> avg(price) over(PARTITION BY category) as avg_p,
> count(*) over(PARTITION BY category) as count_w
> FROM
> product_catalog;
> || category    || price      || max_p  || min_p    || sum_p    || avg_p       
>     || count_w   ||
> | drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
> | game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
> | toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> | toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> 7 rows selected (0.442 seconds)
> 2. Query with ORDER BY after PARTITION BY col, the result is NOT correct. The 
> min result is ok. Why are the other results like that?
> SELECT
> category, price,
> max(price) over(PARTITION BY category ORDER BY price) as max_p,
> min(price) over(PARTITION BY category ORDER BY price) as min_p,
> sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
> avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
> count(*)   over(PARTITION BY category ORDER BY price) as count_w
> FROM
> product_catalog;
> || category    || price      || max_p  || min_p    || sum_p    || avg_p       
>     || count_w   ||
> | drink | 8.90   | 8.90   | 8.90   | 8.90| 8.90   | 1|
> | drink | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2|
> | drink | 15.50  | 15.50  | 8.90   | 34.39   | 11.46  | 3|
> | game  | 70.00  | 70.00  | 70.00  | 70.00   | 70.00  | 1|
> | game  | 79.99  | 79.99  | 70.00  | 149.99  | 74.995000  | 2|
> | toy   | 5.90   | 5.90   | 5.90   | 5.90| 5.90   | 1|
> | toy   | 12.99  | 12.99  | 5.90   | 18.89   | 9.445000   | 2|
> 7 rows selected (0.436 seconds)
> Does this mean that we can only order by the columns used in the partition by 
> clause? I do not think there is such a limitation in standard SQL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33103) Custom Schema with Custom RDD reorders columns when more than 4 added

2020-10-12 Thread Justin Mays (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212356#comment-17212356
 ] 

Justin Mays commented on SPARK-33103:
-

I tried running against Spark 2.4.7 and got the same behavior. It will work 
with 4 columns, but when adding a fifth I get the same error:

 

09:13:15.661 INFO  org.apache.spark.scheduler.DAGScheduler - ResultStage 0 
(show at SparkTest.java:92) failed in 0.696 s due to Job aborted due to stage 
failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 
in stage 0.0 (TID 0, localhost, executor driver): java.lang.ClassCastException: 
java.lang.Double cannot be cast to java.lang.Long at 
scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107) at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getLong(rows.scala:42)
 at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getLong$(rows.scala:42)
 at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getLong(rows.scala:195)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source) at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:636)
 at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:255)
 at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:858) at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:858)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:310) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:310) at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
org.apache.spark.scheduler.Task.run(Task.scala:123) at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:09:13:15.670 INFO  org.apache.spark.scheduler.DAGScheduler - 
Job 0 failed: show at SparkTest.java:92, took 0.772664 s

> Custom Schema with Custom RDD reorders columns when more than 4 added
> -
>
> Key: SPARK-33103
> URL: https://issues.apache.org/jira/browse/SPARK-33103
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
> Environment: Java Application
>Reporter: Justin Mays
>Priority: Major
>
> I have a custom RDD written in Java that uses a custom schema. Everything
> appears to work fine with 4 columns, but when I add a 5th column,
> calling show() fails with
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> java.lang.Long is not a valid external type for schema of
> here is the schema definition in java:
> StructType schema = new StructType()
>     .add("recordId", DataTypes.LongType, false)
>     .add("col1", DataTypes.DoubleType, false)
>     .add("col2", DataTypes.DoubleType, false)
>     .add("col3", DataTypes.IntegerType, false)
>     .add("col4", DataTypes.IntegerType, false);
>  
> Here is the printout of schema.printTreeString();
> == Physical Plan ==
> *(1) Scan dw [recordId#0L,col1#1,col2#2,col3#3,col4#4] PushedFilters: [], 
> ReadSchema: struct
>  
> I hardcoded a return in my Row object with values matching the schema:
> @Override
> public Object get(int i) {
>     switch (i) {
>         case 0: return 0L;
>         case 1: return 1.1911950001644689D;
>         case 2: return 9.10949955666E9D;
>         case 3: return 476;
>         case 4: return 500;
>     }
>     return 0L;
> }
>  
> Here is the output of the show command:
> 15:30:26.875 ERROR org.apache.spark.executor.Executor - Exception in task 0.0 in stage 0.0 (TID 0)
> 15:30:26.875 ERROR org.apache.spark.executor.Executor - Exception in task 0.0 in stage 0
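For comparison, a minimal Scala sketch (not the reporter's Java application) of what Spark expects here: the values produced for each row must line up positionally, and by JVM type, with the fields of the StructType passed to createDataFrame. Any mismatch in order or type surfaces as exactly the ClassCastException / "not a valid external type" errors quoted above.

{code:scala}
// Sketch only, assuming a local SparkSession; field order and value types match one-to-one.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("schema-order-sketch").master("local[*]").getOrCreate()

val schema = new StructType()
  .add("recordId", LongType, nullable = false)
  .add("col1", DoubleType, nullable = false)
  .add("col2", DoubleType, nullable = false)
  .add("col3", IntegerType, nullable = false)
  .add("col4", IntegerType, nullable = false)

// Values in the same order and of the same JVM types as the schema fields above.
val rows = spark.sparkContext.parallelize(Seq(
  Row(0L, 1.1911950001644689D, 9.10949955666E9D, 476, 500)
))

spark.createDataFrame(rows, schema).show()
{code}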

[jira] [Commented] (SPARK-32229) Application entry parsing fails because DriverWrapper registered instead of the normal driver

2020-10-12 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212327#comment-17212327
 ] 

Gabor Somogyi commented on SPARK-32229:
---

Started to work on this.

> Application entry parsing fails because DriverWrapper registered instead of 
> the normal driver
> -
>
> Key: SPARK-32229
> URL: https://issues.apache.org/jira/browse/SPARK-32229
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> In some cases DriverWrapper registered by DriverRegistry which causes 
> exception in PostgresConnectionProvider:
> https://github.com/apache/spark/blob/371b35d2e0ab08ebd853147c6673de3adfad0553/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/DriverRegistry.scala#L53
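A rough sketch of the direction a fix could take (an assumption, not the actual patch): a connection provider that matches on the driver's class name would need to look through the wrapper when the driver was registered via DriverRegistry. This assumes DriverWrapper exposes the underlying driver as `wrapped`; since the class lives in a Spark-internal package, the snippet is meant as code inside Spark itself.

{code:scala}
// Sketch only: resolve the class name of the real JDBC driver, unwrapping if needed.
import java.sql.Driver
import org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper

def underlyingDriverClass(d: Driver): String = d match {
  case w: DriverWrapper => w.wrapped.getClass.getCanonicalName  // assumed `wrapped` field
  case other            => other.getClass.getCanonicalName
}
{code}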



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33119) ScalarSubquery should return the first two rows to avoid Driver OOM

2020-10-12 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33119:

Summary: ScalarSubquery should return the first two rows to avoid Driver 
OOM   (was: Only return the first two rows to avoid Driver OOM)

> ScalarSubquery should return the first two rows to avoid Driver OOM 
> -
>
> Key: SPARK-33119
> URL: https://issues.apache.org/jira/browse/SPARK-33119
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> Exception in thread "subquery-2871" java.lang.OutOfMemoryError: Requested 
> array size exceeds VM limit
>  at 
> scala.collection.mutable.ResizableArray$class.ensureSize(ResizableArray.scala:103)
>  at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:48)
>  at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:84)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1$$anonfun$apply$2.apply(SparkPlan.scala:352)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1$$anonfun$apply$2.apply(SparkPlan.scala:352)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anon$1.foreach(SparkPlan.scala:330)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:352)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:351)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:351)
>  at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:274)
>  at 
> org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1$$anonfun$apply$3.apply(basicPhysicalOperators.scala:830)
>  at 
> org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1$$anonfun$apply$3.apply(basicPhysicalOperators.scala:827)
>  at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withExecutionId$1.apply(SQLExecution.scala:132)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:156)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:129)
>  at 
> org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1.apply(basicPhysicalOperators.scala:827)
>  at 
> org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1.apply(basicPhysicalOperators.scala:827)
>  at scala.
> {noformat}
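A minimal sketch of the idea behind the new summary (not the actual pull request): a scalar subquery only needs to know whether it produced zero, one, or more than one row, so collecting at most two rows on the driver keeps the error check while avoiding the huge allocation seen in the stack trace.

{code:scala}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.SparkPlan

// Sketch only: fetch at most two rows instead of executeCollect()-ing everything.
def scalarSubqueryResult(plan: SparkPlan): Option[InternalRow] = {
  val rows = plan.executeTake(2)
  if (rows.length > 1) {
    throw new IllegalStateException(
      "more than one row returned by a subquery used as an expression")
  }
  rows.headOption   // None => NULL, Some(row) => the single scalar value
}
{code}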



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33119) Only return the first two rows to avoid Driver OOM

2020-10-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212300#comment-17212300
 ] 

Apache Spark commented on SPARK-33119:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/30016

> Only return the first two rows to avoid Driver OOM
> --
>
> Key: SPARK-33119
> URL: https://issues.apache.org/jira/browse/SPARK-33119
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> Exception in thread "subquery-2871" java.lang.OutOfMemoryError: Requested 
> array size exceeds VM limit
>  at 
> scala.collection.mutable.ResizableArray$class.ensureSize(ResizableArray.scala:103)
>  at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:48)
>  at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:84)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1$$anonfun$apply$2.apply(SparkPlan.scala:352)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1$$anonfun$apply$2.apply(SparkPlan.scala:352)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anon$1.foreach(SparkPlan.scala:330)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:352)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:351)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:351)
>  at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:274)
>  at 
> org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1$$anonfun$apply$3.apply(basicPhysicalOperators.scala:830)
>  at 
> org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1$$anonfun$apply$3.apply(basicPhysicalOperators.scala:827)
>  at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withExecutionId$1.apply(SQLExecution.scala:132)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:156)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:129)
>  at 
> org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1.apply(basicPhysicalOperators.scala:827)
>  at 
> org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1.apply(basicPhysicalOperators.scala:827)
>  at scala.
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33119) Only return the first two rows to avoid Driver OOM

2020-10-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33119:


Assignee: Apache Spark

> Only return the first two rows to avoid Driver OOM
> --
>
> Key: SPARK-33119
> URL: https://issues.apache.org/jira/browse/SPARK-33119
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> {noformat}
> Exception in thread "subquery-2871" java.lang.OutOfMemoryError: Requested 
> array size exceeds VM limit
>  at 
> scala.collection.mutable.ResizableArray$class.ensureSize(ResizableArray.scala:103)
>  at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:48)
>  at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:84)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1$$anonfun$apply$2.apply(SparkPlan.scala:352)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1$$anonfun$apply$2.apply(SparkPlan.scala:352)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anon$1.foreach(SparkPlan.scala:330)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:352)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:351)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:351)
>  at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:274)
>  at 
> org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1$$anonfun$apply$3.apply(basicPhysicalOperators.scala:830)
>  at 
> org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1$$anonfun$apply$3.apply(basicPhysicalOperators.scala:827)
>  at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withExecutionId$1.apply(SQLExecution.scala:132)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:156)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:129)
>  at 
> org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1.apply(basicPhysicalOperators.scala:827)
>  at 
> org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1.apply(basicPhysicalOperators.scala:827)
>  at scala.
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33119) Only return the first two rows to avoid Driver OOM

2020-10-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33119:


Assignee: (was: Apache Spark)

> Only return the first two rows to avoid Driver OOM
> --
>
> Key: SPARK-33119
> URL: https://issues.apache.org/jira/browse/SPARK-33119
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> Exception in thread "subquery-2871" java.lang.OutOfMemoryError: Requested 
> array size exceeds VM limit
>  at 
> scala.collection.mutable.ResizableArray$class.ensureSize(ResizableArray.scala:103)
>  at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:48)
>  at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:84)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1$$anonfun$apply$2.apply(SparkPlan.scala:352)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1$$anonfun$apply$2.apply(SparkPlan.scala:352)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anon$1.foreach(SparkPlan.scala:330)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:352)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:351)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:351)
>  at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:274)
>  at 
> org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1$$anonfun$apply$3.apply(basicPhysicalOperators.scala:830)
>  at 
> org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1$$anonfun$apply$3.apply(basicPhysicalOperators.scala:827)
>  at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withExecutionId$1.apply(SQLExecution.scala:132)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:156)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:129)
>  at 
> org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1.apply(basicPhysicalOperators.scala:827)
>  at 
> org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1.apply(basicPhysicalOperators.scala:827)
>  at scala.
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33119) Only return the first two rows to avoid Driver OOM

2020-10-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212299#comment-17212299
 ] 

Apache Spark commented on SPARK-33119:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/30016

> Only return the first two rows to avoid Driver OOM
> --
>
> Key: SPARK-33119
> URL: https://issues.apache.org/jira/browse/SPARK-33119
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> Exception in thread "subquery-2871" java.lang.OutOfMemoryError: Requested 
> array size exceeds VM limit
>  at 
> scala.collection.mutable.ResizableArray$class.ensureSize(ResizableArray.scala:103)
>  at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:48)
>  at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:84)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1$$anonfun$apply$2.apply(SparkPlan.scala:352)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1$$anonfun$apply$2.apply(SparkPlan.scala:352)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anon$1.foreach(SparkPlan.scala:330)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:352)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:351)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:351)
>  at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:274)
>  at 
> org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1$$anonfun$apply$3.apply(basicPhysicalOperators.scala:830)
>  at 
> org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1$$anonfun$apply$3.apply(basicPhysicalOperators.scala:827)
>  at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withExecutionId$1.apply(SQLExecution.scala:132)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:156)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:129)
>  at 
> org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1.apply(basicPhysicalOperators.scala:827)
>  at 
> org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1.apply(basicPhysicalOperators.scala:827)
>  at scala.
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33119) Only return the first two rows to avoid Driver OOM

2020-10-12 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-33119:
---

 Summary: Only return the first two rows to avoid Driver OOM
 Key: SPARK-33119
 URL: https://issues.apache.org/jira/browse/SPARK-33119
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Yuming Wang


{noformat}
Exception in thread "subquery-2871" java.lang.OutOfMemoryError: Requested array 
size exceeds VM limit
 at 
scala.collection.mutable.ResizableArray$class.ensureSize(ResizableArray.scala:103)
 at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:48)
 at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:84)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1$$anonfun$apply$2.apply(SparkPlan.scala:352)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1$$anonfun$apply$2.apply(SparkPlan.scala:352)
 at scala.collection.Iterator$class.foreach(Iterator.scala:893)
 at 
org.apache.spark.sql.execution.SparkPlan$$anon$1.foreach(SparkPlan.scala:330)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:352)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:351)
 at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
 at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:351)
 at 
org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:274)
 at 
org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1$$anonfun$apply$3.apply(basicPhysicalOperators.scala:830)
 at 
org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1$$anonfun$apply$3.apply(basicPhysicalOperators.scala:827)
 at 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withExecutionId$1.apply(SQLExecution.scala:132)
 at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:156)
 at 
org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:129)
 at 
org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1.apply(basicPhysicalOperators.scala:827)
 at 
org.apache.spark.sql.execution.SubqueryExec$$anonfun$relationFuture$1.apply(basicPhysicalOperators.scala:827)
 at scala.
{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28099) Assertion when querying unpartitioned Hive table with partition-like naming

2020-10-12 Thread Zsombor Fedor (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212281#comment-17212281
 ] 

Zsombor Fedor commented on SPARK-28099:
---

It is not ORC specific:


val testData = List(1,2,3,4,5)
val dataFrame = testData.toDF()
dataFrame
.coalesce(1)
.write
.format("parquet")
.save("user/hive/warehouse/test/dir1=1/")
spark.sql("CREATE EXTERNAL TABLE test (val INT) STORED AS PARQUET LOCATION 
'/user/hive/warehouse/test/'")

val queryResponse = spark.sql("SELECT * FROM test")
// java.lang.AssertionError: assertion failed
//   at scala.Predef$.assert(Predef.scala:156)
//   at org.apache.spark.sql.hive.HiveMetastoreCatalog.convertToLogicalRelation(HiveMetastoreCatalog.scala:214)


 

> Assertion when querying unpartitioned Hive table with partition-like naming
> ---
>
> Key: SPARK-28099
> URL: https://issues.apache.org/jira/browse/SPARK-28099
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Douglas Drinka
>Priority: Major
>
> {code:java}
> val testData = List(1,2,3,4,5)
> val dataFrame = testData.toDF()
> dataFrame
> .coalesce(1)
> .write
> .mode(SaveMode.Overwrite)
> .format("orc")
> .option("compression", "zlib")
> .save("s3://ddrinka.sparkbug/testFail/dir1=1/dir2=2/")
> spark.sql("DROP TABLE IF EXISTS ddrinka_sparkbug.testFail")
> spark.sql("CREATE EXTERNAL TABLE ddrinka_sparkbug.testFail (val INT) STORED 
> AS ORC LOCATION 's3://ddrinka.sparkbug/testFail/'")
> val queryResponse = spark.sql("SELECT * FROM ddrinka_sparkbug.testFail")
> //Throws AssertionError
> //at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.convertToLogicalRelation(HiveMetastoreCatalog.scala:214){code}
> It looks like the native ORC reader is creating virtual columns named dir1
> and dir2, which don't exist in the Hive table. [The
> assertion|https://github.com/apache/spark/blob/c0297dedd829a92cca920ab8983dab399f8f32d5/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L257]
> is checking that the number of columns matches, which fails because of the virtual
> partition columns.
> Actually getting data back from this query will be dependent on SPARK-28098, 
> supporting subdirectories for Hive queries at all.
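A possible workaround sketch (an assumption based on where the assertion lives, not a verified fix): the failing assertion is only reached on the converted, native read path, so disabling the metastore conversion should make Spark fall back to the Hive SerDe reader for these tables. Assumes an active SparkSession `spark` and the table from the repro above.

{code:scala}
// Sketch only: convertMetastoreOrc covers the ORC repro above,
// convertMetastoreParquet the Parquet variant from the comment.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
spark.sql("SELECT * FROM ddrinka_sparkbug.testFail").show()
{code}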



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32704) Logging plan changes for execution

2020-10-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212273#comment-17212273
 ] 

Apache Spark commented on SPARK-32704:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/30015

> Logging plan changes for execution
> --
>
> Key: SPARK-32704
> URL: https://issues.apache.org/jira/browse/SPARK-32704
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 3.1.0
>
>
> Since we only log plan changes for analyzer/optimizer now, this ticket 
> targets adding code to log plan changes in the preparation phase in 
> QueryExecution for execution.
> {code}
> scala> spark.sql("SET spark.sql.optimizer.planChangeLog.level=WARN")
> scala> spark.range(10).groupBy("id").count().queryExecution.executedPlan
> ...
> 20/08/26 09:32:36 WARN PlanChangeLogger: 
> === Applying Rule org.apache.spark.sql.execution.CollapseCodegenStages ===
> !HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, 
> count#23L])  *(1) HashAggregate(keys=[id#19L], 
> functions=[count(1)], output=[id#19L, count#23L])
> !+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], 
> output=[id#19L, count#27L])   +- *(1) HashAggregate(keys=[id#19L], 
> functions=[partial_count(1)], output=[id#19L, count#27L])
> !   +- Range (0, 10, step=1, splits=4)
>   +- *(1) Range (0, 10, step=1, splits=4)
>  
> 20/08/26 09:32:36 WARN PlanChangeLogger: 
> === Result of Batch Preparations ===
> !HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, 
> count#23L])  *(1) HashAggregate(keys=[id#19L], 
> functions=[count(1)], output=[id#19L, count#23L])
> !+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], 
> output=[id#19L, count#27L])   +- *(1) HashAggregate(keys=[id#19L], 
> functions=[partial_count(1)], output=[id#19L, count#27L])
> !   +- Range (0, 10, step=1, splits=4)  
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32989) Performance regression when selecting from str_to_map

2020-10-12 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-32989.
--
Fix Version/s: 3.1.0
 Assignee: L. C. Hsieh
   Resolution: Fixed

> Performance regression when selecting from str_to_map
> -
>
> Key: SPARK-32989
> URL: https://issues.apache.org/jira/browse/SPARK-32989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Ondrej Kokes
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 3.1.0
>
>
> When I create a map using str_to_map and select more than a single value, I 
> notice a notable performance regression in 3.0.1 compared to 2.4.7. When 
> selecting a single value, the performance is the same. Plans are identical 
> between versions.
> It seems like in 2.x the map from str_to_map is preserved for a given row, 
> but in 3.x it's recalculated for each column. One hint that it might be the 
> case is that when I tried forcing materialisation of said map in 3.x (by a 
> coalesce, don't know if there's a better way), I got the performance roughly 
> to 2.x levels.
> Here's a reproducer (the csv in question gets autogenerated by the python 
> code):
> {code:java}
> $ head regression.csv 
> foo
> foo=bar&baz=bak&bar=foo
> foo=bar&baz=bak&bar=foo
> foo=bar&baz=bak&bar=foo
> foo=bar&baz=bak&bar=foo
> foo=bar&baz=bak&bar=foo
> ... (10M more rows)
> {code}
> {code:python}
> import time
> import os
> import pyspark  
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
> if __name__ == '__main__':
> print(pyspark.__version__)
> spark = SparkSession.builder.getOrCreate()
> filename = 'regression.csv'
> if not os.path.isfile(filename):
> with open(filename, 'wt') as fw:
> fw.write('foo\n')
> for _ in range(10_000_000):
> fw.write('foo=bar&baz=bak&bar=foo\n')
> df = spark.read.option('header', True).csv(filename)
> t = time.time()
> dd = (df
> .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")'))
> .select(
> f.col('my_map')['foo'],
> )
> )
> dd.write.mode('overwrite').csv('tmp')
> t2 = time.time()
> print('selected one', t2 - t)
> dd = (df
> .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")'))
> # .coalesce(100) # forcing evaluation before selection speeds it 
> up in 3.0.1
> .select(
> f.col('my_map')['foo'],
> f.col('my_map')['bar'],
> f.col('my_map')['baz'],
> )
> )
> dd.explain(True)
> dd.write.mode('overwrite').csv('tmp')
> t3 = time.time()
> print('selected three', t3 - t2)
> {code}
> Results for 2.4.7 and 3.0.1, both installed from PyPI, Python 3.7, macOS 
> (times are in seconds)
> {code:java}
> # 3.0.1
> # selected one 6.375471830368042  
> 
> # selected three 14.847578048706055
> # 2.4.7
> # selected one 6.679579019546509  
> 
> # selected three 6.5622029304504395  
> {code}
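For reference, the reporter's coalesce workaround translated into a short Scala sketch: the idea is to force the str_to_map result to be materialised once before several keys are selected, which the reporter observed brings 3.0.1 back to 2.x performance.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Sketch only: mirrors the Python repro above on the same regression.csv file.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
val df = spark.read.option("header", "true").csv("regression.csv")

df.withColumn("my_map", expr("""str_to_map(foo, "&", "=")"""))
  .coalesce(100)  // the materialisation barrier the reporter used before selecting keys
  .select(col("my_map")("foo"), col("my_map")("bar"), col("my_map")("baz"))
  .write.mode("overwrite").csv("tmp")
{code}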



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33118) CREATE TEMPORARY TABLE fails with location

2020-10-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212253#comment-17212253
 ] 

Apache Spark commented on SPARK-33118:
--

User 'planga82' has created a pull request for this issue:
https://github.com/apache/spark/pull/30014

> CREATE TEMPORARY TABLE fails with location
> --
>
> Key: SPARK-33118
> URL: https://issues.apache.org/jira/browse/SPARK-33118
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Pablo Langa Blanco
>Priority: Major
>
> The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION
>  
> {code:java}
> spark.range(3).write.parquet("/data/tmp/testspark1")
> spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path 
> '/data/tmp/testspark1')")
> spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION 
> '/data/tmp/testspark1'")
> {code}
> The error message in both cases is 
> {code:java}
> org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. 
> It must be specified manually.;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
>   at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
> {code}
>  
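While the DDL path is broken, one way to get an equivalent temporary relation over the same Parquet path is to register a temporary view from the DataFrame reader. A sketch, assuming an active SparkSession `spark` and the path from the example:

{code:scala}
// Sketch only: reads the existing Parquet data and exposes it under the name `t`.
spark.read.parquet("/data/tmp/testspark1").createOrReplaceTempView("t")
spark.sql("SELECT * FROM t").show()
{code}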



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33118) CREATE TEMPORARY TABLE fails with location

2020-10-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33118:


Assignee: Apache Spark

> CREATE TEMPORARY TABLE fails with location
> --
>
> Key: SPARK-33118
> URL: https://issues.apache.org/jira/browse/SPARK-33118
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Pablo Langa Blanco
>Assignee: Apache Spark
>Priority: Major
>
> The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION
>  
> {code:java}
> spark.range(3).write.parquet("/data/tmp/testspark1")
> spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path 
> '/data/tmp/testspark1')")
> spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION 
> '/data/tmp/testspark1'")
> {code}
> The error message in both cases is 
> {code:java}
> org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. 
> It must be specified manually.;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
>   at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33118) CREATE TEMPORARY TABLE fails with location

2020-10-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33118:


Assignee: (was: Apache Spark)

> CREATE TEMPORARY TABLE fails with location
> --
>
> Key: SPARK-33118
> URL: https://issues.apache.org/jira/browse/SPARK-33118
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Pablo Langa Blanco
>Priority: Major
>
> The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION
>  
> {code:java}
> spark.range(3).write.parquet("/data/tmp/testspark1")
> spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path 
> '/data/tmp/testspark1')")
> spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION 
> '/data/tmp/testspark1'")
> {code}
> The error message in both cases is 
> {code:java}
> org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. 
> It must be specified manually.;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
>   at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33118) CREATE TEMPORARY TABLE fails with location

2020-10-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33118:


Assignee: Apache Spark

> CREATE TEMPORARY TABLE fails with location
> --
>
> Key: SPARK-33118
> URL: https://issues.apache.org/jira/browse/SPARK-33118
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Pablo Langa Blanco
>Assignee: Apache Spark
>Priority: Major
>
> The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION
>  
> {code:java}
> spark.range(3).write.parquet("/data/tmp/testspark1")
> spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path 
> '/data/tmp/testspark1')")
> spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION 
> '/data/tmp/testspark1'")
> {code}
> The error message in both cases is 
> {code:java}
> org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. 
> It must be specified manually.;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
>   at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33118) CREATE TEMPORARY TABLE fails with location

2020-10-12 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212242#comment-17212242
 ] 

Pablo Langa Blanco commented on SPARK-33118:


I'm working on it

> CREATE TEMPORARY TABLE fails with location
> --
>
> Key: SPARK-33118
> URL: https://issues.apache.org/jira/browse/SPARK-33118
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Pablo Langa Blanco
>Priority: Major
>
> The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION
>  
> {code:java}
> spark.range(3).write.parquet("/data/tmp/testspark1")
> spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path 
> '/data/tmp/testspark1')")
> spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION 
> '/data/tmp/testspark1'")
> {code}
> The error message in both cases is 
> {code:java}
> org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. 
> It must be specified manually.;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
>   at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33118) CREATE TEMPORARY TABLE fails with location

2020-10-12 Thread Pablo Langa Blanco (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pablo Langa Blanco updated SPARK-33118:
---
Description: 
The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION

 
{code:java}
spark.range(3).write.parquet("/data/tmp/testspark1")

spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path 
'/data/tmp/testspark1')")
spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION 
'/data/tmp/testspark1'")
{code}
The error message in both cases is 
{code:java}
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It 
must be specified manually.;
  at 
org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200)
  at scala.Option.getOrElse(Option.scala:189)
  at 
org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
  at 
org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
  at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
  at org.apache.spark.sql.Dataset.(Dataset.scala:229)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
{code}
 

  was:
The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION

 
{code:java}
spark.range(3).write.parquet("/data/tmp/testspark1")

spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path 
'/data/tmp/testspark1')")
spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION 
'/data/tmp/testspark1'")
{code}
The error message in both cases is

 

 
{code:java}
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It 
must be specified manually.;
  at 
org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200)
  at scala.Option.getOrElse(Option.scala:189)
  at 
org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
  at 
org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
  at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
  at org.apache.spark.sql.Dataset.(Dataset.scala:229)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
  at org.apache.spark.sql.SparkSession.w

[jira] [Created] (SPARK-33118) CREATE TEMPORARY TABLE fails with location

2020-10-12 Thread Pablo Langa Blanco (Jira)
Pablo Langa Blanco created SPARK-33118:
--

 Summary: CREATE TEMPORARY TABLE fails with location
 Key: SPARK-33118
 URL: https://issues.apache.org/jira/browse/SPARK-33118
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1, 3.0.0
Reporter: Pablo Langa Blanco


The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION

 
{code:java}
spark.range(3).write.parquet("/data/tmp/testspark1")

spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path 
'/data/tmp/testspark1')")
spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION 
'/data/tmp/testspark1'")
{code}
The error message in both cases is

 

 
{code:java}
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It 
must be specified manually.;
  at 
org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200)
  at scala.Option.getOrElse(Option.scala:189)
  at 
org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
  at 
org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
  at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
  at org.apache.spark.sql.Dataset.(Dataset.scala:229)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
{code}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33090) Upgrade Google Guava

2020-10-12 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212240#comment-17212240
 ] 

angerszhu commented on SPARK-33090:
---

I have hit this problem recently too. The Spark build distribution bundles the Guava
jar, which causes a lot of problems.

> Upgrade Google Guava
> 
>
> Key: SPARK-33090
> URL: https://issues.apache.org/jira/browse/SPARK-33090
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.1
>Reporter: Stephen Coy
>Priority: Major
>
> Hadoop versions newer than 3.2.0 (such as 3.2.1 and 3.3.0) have started using 
> features from newer versions of Google Guava.
> This leads to MethodNotFound exceptions, etc in Spark builds that specify 
> newer versions of Hadoop. I believe this is due to the use of new methods in 
> com.google.common.base.Preconditions.
> The above versions of Hadoop use guava-27.0-jre, whereas Spark is currently 
> glued to guava-14.0.1.
> I have been running a Spark cluster with the version bumped to guava-29.0-jre 
> without issue.
> Partly due to the way Spark is built, this change is a little more 
> complicated than just changing the version, because newer versions of guava 
> have a new dependency on com.google.guava:failureaccess:1.0.
>  
>  
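
A quick, hedged way to check which Guava jar a particular Spark deployment actually 
loaded (a spark-shell diagnostic sketch only, not part of the proposed change):

{code:java}
// Print the jar that provided Guava's Preconditions class.
// getCodeSource can be null for bootstrap classes, but for a jar on the
// application classpath it normally points at the jar file.
val guavaSource = Option(
    classOf[com.google.common.base.Preconditions]
      .getProtectionDomain.getCodeSource)
  .map(_.getLocation.toString)
  .getOrElse("unknown (no code source)")
println(s"Guava loaded from: $guavaSource")
{code}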



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32455) LogisticRegressionModel prediction optimization

2020-10-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212227#comment-17212227
 ] 

Apache Spark commented on SPARK-32455:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/30013

> LogisticRegressionModel prediction optimization
> ---
>
> Key: SPARK-32455
> URL: https://issues.apache.org/jira/browse/SPARK-32455
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.1.0
>
>
> If needed, the getThreshold method and/or the following logic to compute 
> rawThreshold is called on each instance.
>  
> {code:java}
> override def getThreshold: Double = {
>   checkThresholdConsistency()
>   if (isSet(thresholds)) {
>     val ts = $(thresholds)
>     require(ts.length == 2, "Logistic Regression getThreshold only applies to" +
>       " binary classification, but thresholds has length != 2.  thresholds: " + ts.mkString(","))
>     1.0 / (1.0 + ts(0) / ts(1))
>   } else {
>     $(threshold)
>   }
> } {code}
>  
> {code:java}
>   val rawThreshold = if (t == 0.0) {
>     Double.NegativeInfinity
>   } else if (t == 1.0) {
>     Double.PositiveInfinity
>   } else {
>     math.log(t / (1.0 - t))
>   } {code}
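
A minimal sketch of the optimization being discussed, using a made-up standalone 
class rather than Spark's actual LogisticRegressionModel: the raw threshold is 
derived once when the model is constructed instead of being recomputed for every 
predicted row.

{code:java}
// Hypothetical sketch, not Spark's implementation: hoist the rawThreshold
// computation out of the per-row prediction path.
class BinaryLRModelSketch(coefficients: Array[Double], intercept: Double, threshold: Double) {

  // Computed once per model, mirroring the logic quoted above.
  private val rawThreshold: Double =
    if (threshold == 0.0) Double.NegativeInfinity
    else if (threshold == 1.0) Double.PositiveInfinity
    else math.log(threshold / (1.0 - threshold))

  private def margin(features: Array[Double]): Double =
    features.zip(coefficients).map { case (x, w) => x * w }.sum + intercept

  // The hot path now only compares the margin against the precomputed value.
  def predict(features: Array[Double]): Double =
    if (margin(features) > rawThreshold) 1.0 else 0.0
}
{code}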



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32455) LogisticRegressionModel prediction optimization

2020-10-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212226#comment-17212226
 ] 

Apache Spark commented on SPARK-32455:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/30013

> LogisticRegressionModel prediction optimization
> ---
>
> Key: SPARK-32455
> URL: https://issues.apache.org/jira/browse/SPARK-32455
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.1.0
>
>
> If needed, the getThreshold method and/or the following logic to compute 
> rawThreshold is called on each instance.
>  
> {code:java}
> override def getThreshold: Double = {
>   checkThresholdConsistency()
>   if (isSet(thresholds)) {
>     val ts = $(thresholds)
>     require(ts.length == 2, "Logistic Regression getThreshold only applies to" +
>       " binary classification, but thresholds has length != 2.  thresholds: " + ts.mkString(","))
>     1.0 / (1.0 + ts(0) / ts(1))
>   } else {
>     $(threshold)
>   }
> } {code}
>  
> {code:java}
>   val rawThreshold = if (t == 0.0) {
>     Double.NegativeInfinity
>   } else if (t == 1.0) {
>     Double.PositiveInfinity
>   } else {
>     math.log(t / (1.0 - t))
>   } {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33092) Support subexpression elimination in ProjectExec

2020-10-12 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-33092.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/29975

> Support subexpression elimination in ProjectExec
> 
>
> Key: SPARK-33092
> URL: https://issues.apache.org/jira/browse/SPARK-33092
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.1.0
>
>
> Users frequently write repeated expressions in projections. Currently, 
> ProjectExec does not support subexpression elimination in whole-stage 
> codegen. We can support it to reduce redundant evaluation.
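
A hypothetical illustration of the kind of projection this targets (the data, 
column names, and local session here are made up): the same to_timestamp($"ts") 
subexpression feeds several output columns, and without subexpression elimination 
each projected column re-evaluates it.

{code:java}
// Assumes a local SparkSession purely for illustration.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("subexpr-demo").getOrCreate()
import spark.implicits._

val df = Seq("2020-10-12 07:25:00", "2020-10-13 08:00:00").toDF("ts")
df.select(
  to_timestamp($"ts").alias("parsed"),        // subexpression ...
  hour(to_timestamp($"ts")).alias("hour"),    // ... repeated ...
  minute(to_timestamp($"ts")).alias("minute") // ... and repeated again
).explain(true)
{code}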



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24930) Exception information is not accurate when using `LOAD DATA LOCAL INPATH`

2020-10-12 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212205#comment-17212205
 ] 

Takeshi Yamamuro commented on SPARK-24930:
--

Hm, I see. Could you check the resolved versions, e.g., 2.4.7, 3.0.1, ...?

>  Exception information is not accurate when using `LOAD DATA LOCAL INPATH`
> --
>
> Key: SPARK-24930
> URL: https://issues.apache.org/jira/browse/SPARK-24930
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: Xiaochen Ouyang
>Priority: Minor
>
> # The root user creates a test.txt file containing a record '123' in the 
> /root/ directory
>  # Switch to the mr user and execute spark-shell --master local
> {code:java}
> scala> spark.version
> res2: String = 2.2.1
> scala> spark.sql("create table t1(id int) partitioned by(area string)");
> 2018-07-26 17:20:37,523 WARN org.apache.hadoop.hive.metastore.HiveMetaStore: 
> Location: hdfs://nameservice/spark/t1 specified for non-external table:t1
> res4: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("load data local inpath '/root/test.txt' into table t1 
> partition(area ='025')")
> org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: 
> /root/test.txt;
>  at 
> org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:339)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:183)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:639)
>  ... 48 elided
> scala>
> {code}
> In fact, the input path exists, but the mr user does not have permission to 
> access the directory `/root/`, so the message thrown by `AnalysisException` 
> can confuse the user.
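
A hedged sketch of the distinction the report asks for, using plain JVM file checks 
with made-up names (not Spark's actual LoadDataCommand logic): when File.exists() 
returns false, the real cause may be an ancestor directory such as /root that the 
current user cannot traverse.

{code:java}
import java.io.File

// Hypothetical helper: classify a local LOAD DATA input path so the eventual
// error message does not claim that a permission-blocked path "does not exist".
def describeLocalInPath(path: String): String = {
  val f = new File(path)
  if (f.exists() && f.canRead) s"readable: $path"
  else if (f.exists()) s"exists but is not readable by the current user: $path"
  else {
    // Walk up to the nearest ancestor that exists and check whether it is traversable.
    var p = f.getParentFile
    while (p != null && !p.exists()) p = p.getParentFile
    if (p != null && !p.canExecute)
      s"cannot be reached: ancestor directory ${p.getPath} is not traversable by the current user"
    else
      s"does not exist: $path"
  }
}
{code}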



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24930) Exception information is not accurate when using `LOAD DATA LOCAL INPATH`

2020-10-12 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212201#comment-17212201
 ] 

angerszhu commented on SPARK-24930:
---

No, from the message I put in the comment, it seems the Hadoop method handles this.

>  Exception information is not accurate when using `LOAD DATA LOCAL INPATH`
> --
>
> Key: SPARK-24930
> URL: https://issues.apache.org/jira/browse/SPARK-24930
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: Xiaochen Ouyang
>Priority: Minor
>
> # The root user creates a test.txt file containing a record '123' in the 
> /root/ directory
>  # Switch to the mr user and execute spark-shell --master local
> {code:java}
> scala> spark.version
> res2: String = 2.2.1
> scala> spark.sql("create table t1(id int) partitioned by(area string)");
> 2018-07-26 17:20:37,523 WARN org.apache.hadoop.hive.metastore.HiveMetaStore: 
> Location: hdfs://nameservice/spark/t1 specified for non-external table:t1
> res4: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("load data local inpath '/root/test.txt' into table t1 
> partition(area ='025')")
> org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: 
> /root/test.txt;
>  at 
> org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:339)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:183)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:639)
>  ... 48 elided
> scala>
> {code}
> In fact, the input path exists, but the mr user does not have permission to 
> access the directory `/root/`, so the message thrown by `AnalysisException` 
> can confuse the user.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31294) Benchmark the performance regression

2020-10-12 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212199#comment-17212199
 ] 

Maxim Gekk commented on SPARK-31294:


[~cloud_fan] Could you close this ticket, since the PR was merged?

> Benchmark the performance regression
> 
>
> Key: SPARK-31294
> URL: https://issues.apache.org/jira/browse/SPARK-31294
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24930) Exception information is not accurate when using `LOAD DATA LOCAL INPATH`

2020-10-12 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212196#comment-17212196
 ] 

Takeshi Yamamuro commented on SPARK-24930:
--

Do you know which PR resolves this? (It'd be better to link this Jira to it if 
possible.)

>  Exception information is not accurate when using `LOAD DATA LOCAL INPATH`
> --
>
> Key: SPARK-24930
> URL: https://issues.apache.org/jira/browse/SPARK-24930
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: Xiaochen Ouyang
>Priority: Minor
>
> # The root user creates a test.txt file containing a record '123' in the 
> /root/ directory
>  # Switch to the mr user and execute spark-shell --master local
> {code:java}
> scala> spark.version
> res2: String = 2.2.1
> scala> spark.sql("create table t1(id int) partitioned by(area string)");
> 2018-07-26 17:20:37,523 WARN org.apache.hadoop.hive.metastore.HiveMetaStore: 
> Location: hdfs://nameservice/spark/t1 specified for non-external table:t1
> res4: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("load data local inpath '/root/test.txt' into table t1 
> partition(area ='025')")
> org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: 
> /root/test.txt;
>  at 
> org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:339)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:183)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:639)
>  ... 48 elided
> scala>
> {code}
> In fact, the input path exists, but the mr user does not have permission to 
> access the directory `/root/`, so the message thrown by `AnalysisException` 
> can confuse the user.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33117) Update zstd-jni to 1.4.5-6

2020-10-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33117.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30010
[https://github.com/apache/spark/pull/30010]

> Update zstd-jni to 1.4.5-6
> --
>
> Key: SPARK-33117
> URL: https://issues.apache.org/jira/browse/SPARK-33117
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33117) Update zstd-jni to 1.4.5-6

2020-10-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33117:
-

Assignee: Dongjoon Hyun

> Update zstd-jni to 1.4.5-6
> --
>
> Key: SPARK-33117
> URL: https://issues.apache.org/jira/browse/SPARK-33117
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24930) Exception information is not accurate when using `LOAD DATA LOCAL INPATH`

2020-10-12 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212189#comment-17212189
 ] 

angerszhu commented on SPARK-24930:
---

cc [~maropu]

Hey, for this JIRA, I checked again and it seems it was resolved by another PR. 
Can I mark it as resolved, or should I ping a committer?

>  Exception information is not accurate when using `LOAD DATA LOCAL INPATH`
> --
>
> Key: SPARK-24930
> URL: https://issues.apache.org/jira/browse/SPARK-24930
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: Xiaochen Ouyang
>Priority: Minor
>
> # The root user creates a test.txt file containing a record '123' in the 
> /root/ directory
>  # Switch to the mr user and execute spark-shell --master local
> {code:java}
> scala> spark.version
> res2: String = 2.2.1
> scala> spark.sql("create table t1(id int) partitioned by(area string)");
> 2018-07-26 17:20:37,523 WARN org.apache.hadoop.hive.metastore.HiveMetaStore: 
> Location: hdfs://nameservice/spark/t1 specified for non-external table:t1
> res4: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("load data local inpath '/root/test.txt' into table t1 
> partition(area ='025')")
> org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: 
> /root/test.txt;
>  at 
> org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:339)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:183)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:639)
>  ... 48 elided
> scala>
> {code}
> In fact, the input path exists, but the mr user does not have permission to 
> access the directory `/root/`, so the message thrown by `AnalysisException` 
> can confuse the user.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24930) Exception information is not accurate when using `LOAD DATA LOCAL INPATH`

2020-10-12 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212188#comment-17212188
 ] 

angerszhu edited comment on SPARK-24930 at 10/12/20, 7:25 AM:
--

In the current version, the error message is:

 

spark-sql> load data local inpath '/home/hadoop/spark-3.1.0/test.txt' into table t1
         > ;
20/10/12 15:20:26 ERROR Hive: Failed to move: java.io.FileNotFoundException: /home/hadoop/spark-3.1.0/test.txt (Permission denied)
Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.FileNotFoundException: /home/hadoop/spark-3.1.0/test.txt (Permission denied);
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.FileNotFoundException: /home/hadoop/spark-3.1.0/test.txt (Permission denied);
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:113)
  at org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:878)
  at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.loadTable(ExternalCatalogWithListener.scala:167)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadTable(SessionCatalog.scala:520)
  at org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:390)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3675)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:769)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3673)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:769)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:612)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:769)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:607)
  at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:650)
  at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65)
  at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:378)
  at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:497)
  at scala.collection.Iterator.foreach(Iterator.scala:941)
  at scala.collection.Iterator.foreach$(Iterator.scala:941)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
  at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:491)
  at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:283)
  at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
  at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:934)
  at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
  at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
  at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
  at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1013)
  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1022)
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.FileNotFoundException: /home/hadoop/spark-3.1.0/test.txt 
