[jira] [Commented] (SPARK-33085) "Master removed our application" error leads to FAILED driver status instead of KILLED driver status

2020-10-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213647#comment-17213647
 ] 

Hyukjin Kwon commented on SPARK-33085:
--

Can you show the reproducible steps?

> "Master removed our application" error leads to FAILED driver status instead 
> of KILLED driver status
> 
>
> Key: SPARK-33085
> URL: https://issues.apache.org/jira/browse/SPARK-33085
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.4.6
>Reporter: t oo
>Priority: Major
>
>  
> driver-20200930160855-0316 exited with status FAILED
>  
> I am using Spark Standalone scheduler with spot ec2 workers. I confirmed that 
> myip.87 EC2 instance was terminated at 2020-09-30 16:16
>  
> *I would expect the overall driver status to be KILLED, but instead it was 
> FAILED.* My goal is to interpret a FAILED status as 'don't rerun, a 
> non-transient error was faced' and a KILLED/ERROR status as 'yes, rerun, a 
> transient error was faced'. But it looks like the FAILED status is being set 
> in the below case of a transient error:
>   
> Below are driver logs
> {code:java}
> 2020-09-30 16:12:41,183 [main] INFO  com.yotpo.metorikku.output.writers.file.FileOutputWriter - Writing file to s3a://redacted
> 2020-09-30 16:12:41,183 [main] INFO  com.yotpo.metorikku.output.writers.file.FileOutputWriter - Writing file to s3a://redacted
> 2020-09-30 16:16:40,366 [dispatcher-event-loop-15] ERROR org.apache.spark.scheduler.TaskSchedulerImpl - Lost executor 0 on myip.87: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
> 2020-09-30 16:16:40,372 [dispatcher-event-loop-15] WARN  org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 6.0 (TID 6, myip.87, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
> 2020-09-30 16:16:40,376 [dispatcher-event-loop-13] WARN  org.apache.spark.storage.BlockManagerMasterEndpoint - No more replicas available for rdd_3_0 !
> 2020-09-30 16:16:40,398 [dispatcher-event-loop-2] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/0 removed: Worker shutting down
> 2020-09-30 16:16:40,399 [dispatcher-event-loop-2] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/1 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
> 2020-09-30 16:16:40,401 [dispatcher-event-loop-5] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/1 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
> 2020-09-30 16:16:40,402 [dispatcher-event-loop-5] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/2 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
> 2020-09-30 16:16:40,403 [dispatcher-event-loop-11] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/2 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
> 2020-09-30 16:16:40,404 [dispatcher-event-loop-11] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/3 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
> 2020-09-30 16:16:40,405 [dispatcher-event-loop-1] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/3 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
> 2020-09-30 16:16:40,406 [dispatcher-event-loop-1] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/4 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
> 2020-09-30 16:16:40,407 [dispatcher-event-loop-12] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/4 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
> 2020-09-30 16:16:40,408 [dispatcher-event-loop-12] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/5 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
> 2020-09-30 16:16:40,409 [dispatcher-event-loop-4] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/5 re
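
A minimal sketch (not from the reporter) of the rerun policy described above, keyed off the state that {{spark-submit --status}} reports for a standalone cluster-mode driver; the master URL and driver ID below are placeholders:
{code:scala}
import scala.sys.process._

// Placeholders: substitute the real standalone master URL and driver ID.
val master   = "spark://master:7077"
val driverId = "driver-20200930160855-0316"

// `spark-submit --status <driverId>` asks the standalone master for the driver state.
val output = Seq("spark-submit", "--master", master, "--status", driverId).!!

// Loosely scan the output for a known DriverState value rather than assuming an exact format.
val states = Seq("FAILED", "ERROR", "KILLED", "FINISHED", "RELAUNCHING", "RUNNING", "SUBMITTED")
val state  = states.find(s => output.contains(s)).getOrElse("UNKNOWN")

// The reporter's policy: rerun only on transient outcomes (KILLED/ERROR), not on FAILED.
val shouldRerun = state == "KILLED" || state == "ERROR"
println(s"$driverId is $state; rerun = $shouldRerun")
{code}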

[jira] [Commented] (SPARK-33113) [SparkR] gapply works with arrow disabled, fails with arrow enabled stringsAsFactors=TRUE

2020-10-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213643#comment-17213643
 ] 

Hyukjin Kwon commented on SPARK-33113:
--

It works with Spark 3.0.0 too. Can you show your versions of R and Arrow?

> [SparkR] gapply works with arrow disabled, fails with arrow enabled 
> stringsAsFactors=TRUE
> -
>
> Key: SPARK-33113
> URL: https://issues.apache.org/jira/browse/SPARK-33113
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Jacek Pliszka
>Priority: Major
>
> Running in databricks on Azure
> {code}
> library("arrow")
> library("SparkR")
> df <- as.DataFrame(list("A", "B", "C"), schema="ColumnA")
> udf <- function(key, x) data.frame(out=c("dfs"))
> {code}
>  
> This works:
> {code}
> sparkR.session(master = "local[*]", 
> sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "false"))
> df1 <- gapply(df, c("ColumnA"), udf, "out String")
> collect(df1)
> {code}
> This fails:
> {code}
> sparkR.session(master = "local[*]", 
> sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "true"))
> df2 <- gapply(df, c("ColumnA"), udf, "out String")
> collect(df2)
> {code}
>  
> with error
> {code} 
> Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : invalid 'n' argument
> Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : invalid 'n' argument
> In addition: Warning messages:
> 1: Use 'read_ipc_stream' or 'read_feather' instead.
> 2: Use 'read_ipc_stream' or 'read_feather' instead.
> {code}
>   
> Clicking through Failed Stages to Failure Reason:
>   
> {code}
>  Job aborted due to stage failure: Task 49 in stage 1843.0 failed 4 times, 
> most recent failure: Lost task 49.3 in stage 1843.0 (TID 89810, 10.99.0.5, 
> executor 0): java.lang.UnsupportedOperationException
>  at 
> org.apache.spark.sql.vectorized.ArrowColumnVector$ArrowVectorAccessor.getUTF8String(ArrowColumnVector.java:233)
>  at 
> org.apache.spark.sql.vectorized.ArrowColumnVector.getUTF8String(ArrowColumnVector.java:109)
>  at 
> org.apache.spark.sql.vectorized.ColumnarBatchRow.getUTF8String(ColumnarBatch.java:220)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>  at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
>  at 
> org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.$anonfun$next$1(ArrowConverters.scala:131)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559)
>  at 
> org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.next(ArrowConverters.scala:140)
>  at 
> org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.next(ArrowConverters.scala:115)
>  at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
>  at scala.collection.Iterator.foreach(Iterator.scala:941)
>  at scala.collection.Iterator.foreach$(Iterator.scala:941)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>  at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>  at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>  at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
>  at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
>  at scala.collection.AbstractIterator.to(Iterator.scala:1429)
>  at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
>  at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
>  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
>  at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
>  at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
>  at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToR$3(Dataset.scala:3589)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>  at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
>  at org.apache.spark.scheduler.Task.run(Task.scala:117)
>  at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:639)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:642)
>  at 
> java.util.concurrent.ThreadP

[jira] [Commented] (SPARK-33113) [SparkR] gapply works with arrow disabled, fails with arrow enabled stringsAsFactors=TRUE

2020-10-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213640#comment-17213640
 ] 

Hyukjin Kwon commented on SPARK-33113:
--

It works locally for me on the Spark dev branch:
{code:java}
> df <- as.DataFrame(list("A", "B", "C"), schema="ColumnA")
> udf <- function(key, x) data.frame(out=c("dfs"))
> sparkR.session(master = "local[*]", 
> sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "false"))
Java ref type org.apache.spark.sql.SparkSession id 1
> df1 <- gapply(df, c("ColumnA"), udf, "out String")
> collect(df1)
  out
1 dfs
2 dfs
3 dfs
> sparkR.session(master = "local[*]", 
> sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "true"))
Java ref type org.apache.spark.sql.SparkSession id 1
> df2 <- gapply(df, c("ColumnA"), udf, "out String")
> collect(df2)
  out
1 dfs
2 dfs
3 dfs
{code}

> [SparkR] gapply works with arrow disabled, fails with arrow enabled 
> stringsAsFactors=TRUE
> -
>
> Key: SPARK-33113
> URL: https://issues.apache.org/jira/browse/SPARK-33113
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Jacek Pliszka
>Priority: Major
>
> Running in databricks on Azure
> {code}
> library("arrow")
> library("SparkR")
> df <- as.DataFrame(list("A", "B", "C"), schema="ColumnA")
> udf <- function(key, x) data.frame(out=c("dfs"))
> {code}
>  
> This works:
> {code}
> sparkR.session(master = "local[*]", 
> sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "false"))
> df1 <- gapply(df, c("ColumnA"), udf, "out String")
> collect(df1)
> {code}
> This fails:
> {code}
> sparkR.session(master = "local[*]", 
> sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "true"))
> df2 <- gapply(df, c("ColumnA"), udf, "out String")
> collect(df2)
> {code}
>  
> with error
> {code} 
> Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : invalid 'n' argument
> Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : invalid 'n' argument
> In addition: Warning messages:
> 1: Use 'read_ipc_stream' or 'read_feather' instead.
> 2: Use 'read_ipc_stream' or 'read_feather' instead.
> {code}
>   
> Clicking through Failed Stages to Failure Reason:
>   
> {code}
>  Job aborted due to stage failure: Task 49 in stage 1843.0 failed 4 times, 
> most recent failure: Lost task 49.3 in stage 1843.0 (TID 89810, 10.99.0.5, 
> executor 0): java.lang.UnsupportedOperationException
>  at 
> org.apache.spark.sql.vectorized.ArrowColumnVector$ArrowVectorAccessor.getUTF8String(ArrowColumnVector.java:233)
>  at 
> org.apache.spark.sql.vectorized.ArrowColumnVector.getUTF8String(ArrowColumnVector.java:109)
>  at 
> org.apache.spark.sql.vectorized.ColumnarBatchRow.getUTF8String(ColumnarBatch.java:220)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>  at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
>  at 
> org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.$anonfun$next$1(ArrowConverters.scala:131)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559)
>  at 
> org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.next(ArrowConverters.scala:140)
>  at 
> org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.next(ArrowConverters.scala:115)
>  at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
>  at scala.collection.Iterator.foreach(Iterator.scala:941)
>  at scala.collection.Iterator.foreach$(Iterator.scala:941)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>  at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>  at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>  at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
>  at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
>  at scala.collection.AbstractIterator.to(Iterator.scala:1429)
>  at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
>  at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
>  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
>  at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
>  at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
>  at scala.colle

[jira] [Updated] (SPARK-33113) [SparkR] gapply works with arrow disabled, fails with arrow enabled stringsAsFactors=TRUE

2020-10-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33113:
-
Description: 
Running in databricks on Azure

{code}
library("arrow")
library("SparkR")

df <- as.DataFrame(list("A", "B", "C"), schema="ColumnA")
udf <- function(key, x) data.frame(out=c("dfs"))
{code}
 

This works:

{code}
sparkR.session(master = "local[*]", 
sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "false"))
df1 <- gapply(df, c("ColumnA"), udf, "out String")
collect(df1)
{code}

This fails:

{code}
sparkR.session(master = "local[*]", 
sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "true"))
df2 <- gapply(df, c("ColumnA"), udf, "out String")
collect(df2)
{code}

 

with error

{code} 
Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : invalid 'n' argument
Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : invalid 'n' argument
In addition: Warning messages:
1: Use 'read_ipc_stream' or 'read_feather' instead.
2: Use 'read_ipc_stream' or 'read_feather' instead.
{code}
  
 Clicking through Failed Stages to Failure Reason:
  
{code}
 Job aborted due to stage failure: Task 49 in stage 1843.0 failed 4 times, most 
recent failure: Lost task 49.3 in stage 1843.0 (TID 89810, 10.99.0.5, executor 
0): java.lang.UnsupportedOperationException
 at 
org.apache.spark.sql.vectorized.ArrowColumnVector$ArrowVectorAccessor.getUTF8String(ArrowColumnVector.java:233)
 at 
org.apache.spark.sql.vectorized.ArrowColumnVector.getUTF8String(ArrowColumnVector.java:109)
 at 
org.apache.spark.sql.vectorized.ColumnarBatchRow.getUTF8String(ColumnarBatch.java:220)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
 at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
 at 
org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.$anonfun$next$1(ArrowConverters.scala:131)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559)
 at 
org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.next(ArrowConverters.scala:140)
 at 
org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.next(ArrowConverters.scala:115)
 at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
 at scala.collection.Iterator.foreach(Iterator.scala:941)
 at scala.collection.Iterator.foreach$(Iterator.scala:941)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
 at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
 at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
 at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
 at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
 at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
 at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
 at scala.collection.AbstractIterator.to(Iterator.scala:1429)
 at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
 at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
 at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
 at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
 at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
 at 
org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToR$3(Dataset.scala:3589)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
 at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
 at org.apache.spark.scheduler.Task.run(Task.scala:117)
 at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:639)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:642)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
{code}

  
  

 

 

 

 

  was:
Running in databricks on Azure

{code}
library("arrow")
library("SparkR")

df <- as.DataFrame(list("A", "B", "C"), schema="ColumnA")
udf <- function(key, x) data.frame(out=c("dfs"))
{code}
 

This works:

{code}
sparkR.session(master = "local[*]", 
sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "false"))
df1 <- gapply(df, c("ColumnA"), udf, "out String")
collect(df1)
{code}

This fails:

{code}
sparkR.session(master = "local[*]", 
sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "true"))
df2 <- gapply(

[jira] [Updated] (SPARK-33113) [SparkR] gapply works with arrow disabled, fails with arrow enabled stringsAsFactors=TRUE

2020-10-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33113:
-
Description: 
Running in databricks on Azure

{code}
library("arrow")
library("SparkR")

df <- as.DataFrame(list("A", "B", "C"), schema="ColumnA")
udf <- function(key, x) data.frame(out=c("dfs"))
{code}
 

This works:

{code}
sparkR.session(master = "local[*]", 
sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "false"))
df1 <- gapply(df, c("ColumnA"), udf, "out String")
collect(df1)
{code}

This fails:

{code}
sparkR.session(master = "local[*]", 
sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "true"))
df2 <- gapply(df, c("ColumnA"), udf, "out String")
collect(df2)
{code}

 

with error

{code} 
Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : invalid 'n' argument
Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : invalid 'n' argument
In addition: Warning messages:
1: Use 'read_ipc_stream' or 'read_feather' instead.
2: Use 'read_ipc_stream' or 'read_feather' instead.
{code}
  
Clicking through Failed Stages to Failure Reason:
  
{code}
 Job aborted due to stage failure: Task 49 in stage 1843.0 failed 4 times, most 
recent failure: Lost task 49.3 in stage 1843.0 (TID 89810, 10.99.0.5, executor 
0): java.lang.UnsupportedOperationException
 at 
org.apache.spark.sql.vectorized.ArrowColumnVector$ArrowVectorAccessor.getUTF8String(ArrowColumnVector.java:233)
 at 
org.apache.spark.sql.vectorized.ArrowColumnVector.getUTF8String(ArrowColumnVector.java:109)
 at 
org.apache.spark.sql.vectorized.ColumnarBatchRow.getUTF8String(ColumnarBatch.java:220)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
 at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
 at 
org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.$anonfun$next$1(ArrowConverters.scala:131)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559)
 at 
org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.next(ArrowConverters.scala:140)
 at 
org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.next(ArrowConverters.scala:115)
 at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
 at scala.collection.Iterator.foreach(Iterator.scala:941)
 at scala.collection.Iterator.foreach$(Iterator.scala:941)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
 at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
 at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
 at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
 at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
 at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
 at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
 at scala.collection.AbstractIterator.to(Iterator.scala:1429)
 at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
 at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
 at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
 at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
 at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
 at 
org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToR$3(Dataset.scala:3589)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
 at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
 at org.apache.spark.scheduler.Task.run(Task.scala:117)
 at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:639)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:642)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
{code}

  
  

 

 

 

 

  was:
Running in databricks on Azure

{code}
library("arrow")
library("SparkR")

df <- as.DataFrame(list("A", "B", "C"), schema="ColumnA")
udf <- function(key, x) data.frame(out=c("dfs"))
{code}
 

This works:

{code}
sparkR.session(master = "local[*]", 
sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "false"))
df1 <- gapply(df, c("ColumnA"), udf, "out String")
collect(df1)
{code}

This fails:

{code}
sparkR.session(master = "local[*]", 
sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "true"))
df2 <- gapply(d

[jira] [Updated] (SPARK-33113) [SparkR] gapply works with arrow disabled, fails with arrow enabled stringsAsFactors=TRUE

2020-10-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33113:
-
Description: 
Running in databricks on Azure

{code}
library("arrow")
library("SparkR")

df <- as.DataFrame(list("A", "B", "C"), schema="ColumnA")
udf <- function(key, x) data.frame(out=c("dfs"))
{code}
 

This works:

{code}
sparkR.session(master = "local[*]", 
sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "false"))
df1 <- gapply(df, c("ColumnA"), udf, "out String")
collect(df1)
{code}

This fails:

{code}
sparkR.session(master = "local[*]", 
sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "true"))
df2 <- gapply(df, c("ColumnA"), udf, "out String")
collect(df2)
{code}

 

with error
Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : invalid 'n' argument
Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : invalid 'n' argument
In addition: Warning messages:
1: Use 'read_ipc_stream' or 'read_feather' instead.
2: Use 'read_ipc_stream' or 'read_feather' instead.
  
 Clicking through Failed Stages to Failure Reason:
  
{code}
 Job aborted due to stage failure: Task 49 in stage 1843.0 failed 4 times, most 
recent failure: Lost task 49.3 in stage 1843.0 (TID 89810, 10.99.0.5, executor 
0): java.lang.UnsupportedOperationException
 at 
org.apache.spark.sql.vectorized.ArrowColumnVector$ArrowVectorAccessor.getUTF8String(ArrowColumnVector.java:233)
 at 
org.apache.spark.sql.vectorized.ArrowColumnVector.getUTF8String(ArrowColumnVector.java:109)
 at 
org.apache.spark.sql.vectorized.ColumnarBatchRow.getUTF8String(ColumnarBatch.java:220)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
 at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
 at 
org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.$anonfun$next$1(ArrowConverters.scala:131)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559)
 at 
org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.next(ArrowConverters.scala:140)
 at 
org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.next(ArrowConverters.scala:115)
 at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
 at scala.collection.Iterator.foreach(Iterator.scala:941)
 at scala.collection.Iterator.foreach$(Iterator.scala:941)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
 at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
 at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
 at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
 at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
 at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
 at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
 at scala.collection.AbstractIterator.to(Iterator.scala:1429)
 at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
 at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
 at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
 at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
 at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
 at 
org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToR$3(Dataset.scala:3589)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
 at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
 at org.apache.spark.scheduler.Task.run(Task.scala:117)
 at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:639)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:642)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
{code}

  
  

 

 

 

 

  was:
Running in databricks on Azure

library("arrow")
 library("SparkR")

df <- as.DataFrame(list("A", "B", "C"), schema="ColumnA")
 udf <- function(key, x) data.frame(out=c("dfs"))

 

This works:

sparkR.session(master = "local[*]", 
sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "false"))
 df1 <- gapply(df, c("ColumnA"), udf, "out String")
 collect(df1)

This fails:

sparkR.session(master = "local[*]", 
sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "true"))
 df2 <- gapply(df, c("ColumnA"), udf, "out String")
 

[jira] [Commented] (SPARK-33120) Lazy Load of SparkContext.addFiles

2020-10-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213635#comment-17213635
 ] 

Hyukjin Kwon commented on SPARK-33120:
--

Why don't you just upload your files into HDFS (or another shared store) and 
read them from there? You could also leverage the binaryFile source, etc., or 
consider FUSE if you need to access the data like a local file system.

In my case, when I did some geographical work before, I used FUSE instead of 
passing files over addFiles, so each task could randomly access the original 
large image and do topographic and angle corrections.

{{SparkContext.addFiles}} is usually for passing metadata-like artifacts, such 
as jars, that are needed to handle the data.
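
For example, a rough sketch (the paths below are made up) of reading shared files on demand with the built-in {{binaryFile}} source instead of shipping everything through {{addFiles}}:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("binaryFile-demo").getOrCreate()

// Spark 3.0+ ships a binaryFile source exposing path, modificationTime, length
// and content columns, so tasks only pull the files they actually select.
val aux = spark.read.format("binaryFile")
  .option("pathGlobFilter", "*.dat")   // restrict to the files that may be needed
  .load("hdfs:///shared/aux-files")    // hypothetical shared location

// Inspect what is available without materializing the file contents.
aux.select("path", "length").show(truncate = false)
{code}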

> Lazy Load of SparkContext.addFiles
> --
>
> Key: SPARK-33120
> URL: https://issues.apache.org/jira/browse/SPARK-33120
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
> Environment: Mac OS X (2 systems), workload to eventually be run on 
> Amazon EMR.
> Java 11 application.
>Reporter: Taylor Smock
>Priority: Minor
>
> In my spark job, I may have various random files that may or may not be used 
> by each task.
> I would like to avoid copying all of the files to every executor until it is 
> actually needed.
>  
> What I've tried:
>  * SparkContext.addFiles w/ SparkFiles.get . In testing, all files were 
> distributed to all clients.
>  * Broadcast variables. Since I _don't_ know what files I'm going to need 
> until I have started the task, I have to broadcast all the data at once, 
> which leads to nodes getting data, and then caching it to disk. In short, the 
> same issues as SparkContext.addFiles, but with the added benefit of having 
> the ability to create a mapping of paths to files.
> What I would like to see:
>  * SparkContext.addFiles(file, Enum.LazyLoad) w/ SparkFiles.get(file, 
> Enum.WaitForAvailability) or Future future = SparkFiles.get(file)
>  
>  
> Notes: 
> https://issues.apache.org/jira/browse/SPARK-4290?focusedCommentId=14205346&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14205346
>  indicated that `SparkFiles.get` would be required to get the data on the 
> local driver, but in my testing that did not appear to be the case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33133) History server fails when loading invalid rolling event logs

2020-10-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213629#comment-17213629
 ] 

Hyukjin Kwon commented on SPARK-33133:
--

cc [~kabhwan] FYI

> History server fails when loading invalid rolling event logs
> 
>
> Key: SPARK-33133
> URL: https://issues.apache.org/jira/browse/SPARK-33133
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Adam Binford
>Priority: Major
>
> We have run into an issue where our history server fails to load new 
> applications and, when restarted, fails to load any applications at all. This 
> happens when it encounters invalid rolling event log files. We encounter this 
> with long-running streaming applications. There seem to be two issues here 
> that lead to problems:
>  * It looks like our long-running streaming applications' event log directory 
> is being cleaned up. The next time the application logs event data, it 
> recreates the event log directory without recreating the "appstatus" file. I 
> don't know the full extent of this behavior or whether something "wrong" is 
> happening here.
>  * The history server then reads this new folder, and throws an exception 
> because the "appstatus" file doesn't exist in the rolling event log folder. 
> This exception breaks the entire listing process, so no new applications will 
> be read, and if restarted no applications at all will be read.
> There seem to be a couple of ways to go about fixing this, and I'm curious to 
> hear thoughts from anyone who knows more about how the history server works, 
> specifically with rolling event logs:
>  * Don't completely fail checking for new applications if one bad rolling 
> event log folder is encountered. This seems like the simplest fix and makes 
> sense to me; it already checks for a few other errors and ignores them. It 
> doesn't necessarily fix the underlying issue that leads to this happening, 
> though.
>  * Figure out why the in-progress event log folder is being deleted and make 
> sure that doesn't happen. Maybe this is supposed to happen? Or maybe we 
> shouldn't delete the top-level folder and should only delete event log files 
> within the folders? Again, I don't know the exact current behavior here.
>  * When writing new event log data, make sure the folder and appstatus file 
> exist every time, creating them again if not.
> Here's the stack trace we encounter when this happens, from 3.0.1 with a 
> couple extra MRs backported that I hoped would fix the issue:
> 2020-10-13 12:10:31,751 ERROR history.FsHistoryProvider: Exception in checking for event log updates
> 2020-10-13 12:10:31,751 ERROR history.FsHistoryProvider: Exception in checking for event log updates
> java.lang.IllegalArgumentException: requirement failed: Log directory must contain an appstatus file!
>  at scala.Predef$.require(Predef.scala:281)
>  at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.files$lzycompute(EventLogFileReaders.scala:214)
>  at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.files(EventLogFileReaders.scala:211)
>  at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.eventLogFiles$lzycompute(EventLogFileReaders.scala:221)
>  at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.eventLogFiles(EventLogFileReaders.scala:220)
>  at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.lastEventLogFile(EventLogFileReaders.scala:272)
>  at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.fileSizeForLastIndex(EventLogFileReaders.scala:240)
>  at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7(FsHistoryProvider.scala:524)
>  at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7$adapted(FsHistoryProvider.scala:466)
>  at scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:256)
>  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>  at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255)
>  at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249)
>  at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
>  at scala.collection.TraversableLike.filter(TraversableLike.scala:347)
>  at scala.collection.TraversableLike.filter$(TraversableLike.scala:347)
>  at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
>  at org.apache.spark.deploy.history.FsHistoryProvider.checkForLogs(FsHistoryProvider.scala:466)
>  at org.apache.spark.deploy.history.FsHistoryProvider.$
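
A rough, generic sketch of the first option listed in the description above (skip an invalid rolling event log directory instead of aborting the whole scan); this is illustrative only, not the actual FsHistoryProvider code:
{code:scala}
import scala.util.{Failure, Success, Try}

// Apply `open` to each candidate directory, keep the ones that succeed, and
// log-and-skip the ones that throw (e.g. a missing "appstatus" file).
def tolerantScan[A, B](candidates: Seq[A])(open: A => B): Seq[B] =
  candidates.flatMap { c =>
    Try(open(c)) match {
      case Success(reader) => Some(reader)
      case Failure(e) =>
        Console.err.println(s"Skipping invalid event log entry $c: ${e.getMessage}")
        None
    }
  }

// Toy usage with strings standing in for rolling event log directories:
val dirs = Seq("eventlog_v2_app-1", "eventlog_v2_app-2-no-appstatus")
val readable = tolerantScan(dirs) { d =>
  require(!d.endsWith("no-appstatus"), "Log directory must contain an appstatus file!")
  d
}
// readable == Seq("eventlog_v2_app-1")
{code}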

[jira] [Commented] (SPARK-33136) Handling nullability for complex types is broken during resolution of V2 write command

2020-10-13 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213612#comment-17213612
 ] 

Jungtaek Lim commented on SPARK-33136:
--

Note that AppendData in branch-2.4 is broken in the same way, but the usage of 
AppendData was reverted in 
[{{b6e4aca}}|https://github.com/apache/spark/commit/b6e4aca0be7f3b863c326063a3c02aa8a1c266a3]
 for branch-2.4, and that revert shipped in Spark 2.4.0. (That said, no Spark 2.x 
version is affected.)

So while the AppendData code in branch-2.4 is broken as well, it is dead code.

> Handling nullability for complex types is broken during resolution of V2 
> write command
> --
>
> Key: SPARK-33136
> URL: https://issues.apache.org/jira/browse/SPARK-33136
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> I found that Spark 3.x cannot write to a nullable complex type if the 
> matching column type in the DataFrame is non-nullable.
> For example, 
> {code:java}
> case class StructData(a: String, b: Int)
> case class Data(col_b: Boolean, col_i: Int, col_l: Long, col_f: Float, col_d: 
> Double, col_s: String, col_fi: Array[Byte], col_bi: Array[Byte], col_de: 
> Double, col_st: StructData, col_li: Seq[String], col_ma: Map[Int, 
> String]){code}
> `col_st.b` would be non-nullable in the DataFrame, which should not matter 
> when we insert from the DataFrame into a table that has `col_st.b` as 
> nullable (writing non-nullable into nullable should be possible).
> This looks to be broken in the V2 write command.
>  
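
A common workaround sketch (not from the report) is to rebuild the DataFrame with an explicitly nullable schema before writing; array and map element nullability is omitted for brevity:
{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StructField, StructType}

// Mark every field (including nested struct fields) as nullable so that a
// non-nullable DataFrame column can be written into a nullable table column.
def asNullable(schema: StructType): StructType =
  StructType(schema.fields.map {
    case StructField(name, st: StructType, _, meta) =>
      StructField(name, asNullable(st), nullable = true, meta)
    case StructField(name, dt, _, meta) =>
      StructField(name, dt, nullable = true, meta)
  })

def withNullableSchema(spark: SparkSession, df: DataFrame): DataFrame =
  spark.createDataFrame(df.rdd, asNullable(df.schema))
{code}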



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29536) PySpark does not work with Python 3.8.0

2020-10-13 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213558#comment-17213558
 ] 

Dongjoon Hyun edited comment on SPARK-29536 at 10/14/20, 3:18 AM:
--

Hi, [~hyukjin.kwon]. Apache Spark 2.4.7 also fails. I will update the affected 
version.
{code}
$ bin/pyspark
Python 3.8.5 (default, Sep 10 2020, 11:46:28)
[Clang 11.0.0 (clang-1100.0.33.16)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File 
"/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/shell.py",
 line 31, in 
from pyspark import SparkConf
  File 
"/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/__init__.py",
 line 51, in 
from pyspark.context import SparkContext
  File 
"/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/context.py",
 line 31, in 
from pyspark import accumulators
  File 
"/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/accumulators.py",
 line 97, in 
from pyspark.serializers import read_int, PickleSerializer
  File 
"/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/serializers.py",
 line 72, in 
from pyspark import cloudpickle
  File 
"/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/cloudpickle.py",
 line 145, in 
_cell_set_template_code = _make_cell_set_template_code()
  File 
"/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/cloudpickle.py",
 line 126, in _make_cell_set_template_code
return types.CodeType(
TypeError: an integer is required (got type bytes)
>>>
{code}


was (Author: dongjoon):
Hi, [~hyukjin.kwon]. Apache Spark 2.4.7 also fails. I will update the affected 
version.
{code}
$ current_pyspark
Python 3.8.5 (default, Sep 10 2020, 11:46:28)
[Clang 11.0.0 (clang-1100.0.33.16)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File 
"/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/shell.py",
 line 31, in 
from pyspark import SparkConf
  File 
"/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/__init__.py",
 line 51, in 
from pyspark.context import SparkContext
  File 
"/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/context.py",
 line 31, in 
from pyspark import accumulators
  File 
"/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/accumulators.py",
 line 97, in 
from pyspark.serializers import read_int, PickleSerializer
  File 
"/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/serializers.py",
 line 72, in 
from pyspark import cloudpickle
  File 
"/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/cloudpickle.py",
 line 145, in 
_cell_set_template_code = _make_cell_set_template_code()
  File 
"/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/cloudpickle.py",
 line 126, in _make_cell_set_template_code
return types.CodeType(
TypeError: an integer is required (got type bytes)
>>>
{code}

> PySpark does not work with Python 3.8.0
> ---
>
> Key: SPARK-29536
> URL: https://issues.apache.org/jira/browse/SPARK-29536
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 3.0.0
>
>
> You open a shell and run arbitrary codes:
> {code}
>   File "/.../3.8/lib/python3.8/runpy.py", line 183, in _run_module_as_main
> mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
>   File "/.../3.8/lib/python3.8/runpy.py", line 109, in _get_module_details
> __import__(pkg_name)
>   File /.../workspace/forked/spark/python/pyspark/__init__.py", line 51, in 
> 
> from pyspark.context import SparkContext
>   File "/.../spark/python/pyspark/context.py", line 31, in 
> from pyspark import accumulators
>   File "/.../python/pyspark/accumulators.py", line 97, in 
> from pyspark.serializers import read_int, PickleSerializer
>   File "/.../python/pyspark/serializers.py", line 71, in 
> from pyspark import cloudpickle
>   File "/.../python/pyspark/cloudpickle.py", line 152, in 
> _cell_set_template_code = _make_cell_set_template_code()
>   File "/.../spark/python/pyspark/cloudpickle.py", line 133, in 
> _make_cell_set_template_code
> return types.CodeType(
> TypeError: an integer is required (got type bytes)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (SPARK-29536) PySpark does not work with Python 3.8.0

2020-10-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29536:
--
Affects Version/s: 2.4.7

> PySpark does not work with Python 3.8.0
> ---
>
> Key: SPARK-29536
> URL: https://issues.apache.org/jira/browse/SPARK-29536
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 2.4.7, 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 3.0.0
>
>
> You open a shell and run arbitrary codes:
> {code}
>   File "/.../3.8/lib/python3.8/runpy.py", line 183, in _run_module_as_main
> mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
>   File "/.../3.8/lib/python3.8/runpy.py", line 109, in _get_module_details
> __import__(pkg_name)
>   File /.../workspace/forked/spark/python/pyspark/__init__.py", line 51, in 
> 
> from pyspark.context import SparkContext
>   File "/.../spark/python/pyspark/context.py", line 31, in 
> from pyspark import accumulators
>   File "/.../python/pyspark/accumulators.py", line 97, in 
> from pyspark.serializers import read_int, PickleSerializer
>   File "/.../python/pyspark/serializers.py", line 71, in 
> from pyspark import cloudpickle
>   File "/.../python/pyspark/cloudpickle.py", line 152, in 
> _cell_set_template_code = _make_cell_set_template_code()
>   File "/.../spark/python/pyspark/cloudpickle.py", line 133, in 
> _make_cell_set_template_code
> return types.CodeType(
> TypeError: an integer is required (got type bytes)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29536) PySpark does not work with Python 3.8.0

2020-10-13 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213558#comment-17213558
 ] 

Dongjoon Hyun commented on SPARK-29536:
---

Hi, [~hyukjin.kwon]. Apache Spark 2.4.7 also fails. I will update the affected 
version.
{code}
$ current_pyspark
Python 3.8.5 (default, Sep 10 2020, 11:46:28)
[Clang 11.0.0 (clang-1100.0.33.16)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File 
"/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/shell.py",
 line 31, in 
from pyspark import SparkConf
  File 
"/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/__init__.py",
 line 51, in 
from pyspark.context import SparkContext
  File 
"/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/context.py",
 line 31, in 
from pyspark import accumulators
  File 
"/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/accumulators.py",
 line 97, in 
from pyspark.serializers import read_int, PickleSerializer
  File 
"/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/serializers.py",
 line 72, in 
from pyspark import cloudpickle
  File 
"/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/cloudpickle.py",
 line 145, in 
_cell_set_template_code = _make_cell_set_template_code()
  File 
"/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/cloudpickle.py",
 line 126, in _make_cell_set_template_code
return types.CodeType(
TypeError: an integer is required (got type bytes)
>>>
{code}

> PySpark does not work with Python 3.8.0
> ---
>
> Key: SPARK-29536
> URL: https://issues.apache.org/jira/browse/SPARK-29536
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 3.0.0
>
>
> You open a shell and run arbitrary codes:
> {code}
>   File "/.../3.8/lib/python3.8/runpy.py", line 183, in _run_module_as_main
> mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
>   File "/.../3.8/lib/python3.8/runpy.py", line 109, in _get_module_details
> __import__(pkg_name)
>   File /.../workspace/forked/spark/python/pyspark/__init__.py", line 51, in 
> 
> from pyspark.context import SparkContext
>   File "/.../spark/python/pyspark/context.py", line 31, in 
> from pyspark import accumulators
>   File "/.../python/pyspark/accumulators.py", line 97, in 
> from pyspark.serializers import read_int, PickleSerializer
>   File "/.../python/pyspark/serializers.py", line 71, in 
> from pyspark import cloudpickle
>   File "/.../python/pyspark/cloudpickle.py", line 152, in 
> _cell_set_template_code = _make_cell_set_template_code()
>   File "/.../spark/python/pyspark/cloudpickle.py", line 133, in 
> _make_cell_set_template_code
> return types.CodeType(
> TypeError: an integer is required (got type bytes)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33134) Incorrect nested complex JSON fields raise an exception

2020-10-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33134.
--
Fix Version/s: 3.0.2
   3.1.0
   Resolution: Fixed

Issue resolved by pull request 30032
[https://github.com/apache/spark/pull/30032]

> Incorrect nested complex JSON fields raise an exception
> ---
>
> Key: SPARK-33134
> URL: https://issues.apache.org/jira/browse/SPARK-33134
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0, 3.0.2
>
>
> The code below:
> {code:scala}
> val pokerhand_raw = Seq("""[{"cards": [19], "playerId": 
> 123456}]""").toDF("events")
> val event = new StructType()
>   .add("playerId", LongType)
>   .add("cards", ArrayType(
> new StructType()
>   .add("id", LongType)
>   .add("rank", StringType)))
> val pokerhand_events = pokerhand_raw
>   .select(explode(from_json($"events", ArrayType(event))).as("event"))
> pokerhand_events.show
> {code}
> throw the exception in the PERMISSIVE mode (default):
> {code:java}
> Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to 
> org.apache.spark.sql.catalyst.util.ArrayData
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461)
>   at 
> org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313)
>   at 
> org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
> {code}
> The same works in Spark 2.4:
> {code:scala}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.4.6
>       /_/
> Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265)
> ...
> scala> pokerhand_events.show()
> +-+
> |event|
> +-+
> +-+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33134) Incorrect nested complex JSON fields raise an exception

2020-10-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33134:


Assignee: Maxim Gekk

> Incorrect nested complex JSON fields raise an exception
> ---
>
> Key: SPARK-33134
> URL: https://issues.apache.org/jira/browse/SPARK-33134
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> The code below:
> {code:scala}
> val pokerhand_raw = Seq("""[{"cards": [19], "playerId": 
> 123456}]""").toDF("events")
> val event = new StructType()
>   .add("playerId", LongType)
>   .add("cards", ArrayType(
> new StructType()
>   .add("id", LongType)
>   .add("rank", StringType)))
> val pokerhand_events = pokerhand_raw
>   .select(explode(from_json($"events", ArrayType(event))).as("event"))
> pokerhand_events.show
> {code}
> throw the exception in the PERMISSIVE mode (default):
> {code:java}
> Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to 
> org.apache.spark.sql.catalyst.util.ArrayData
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461)
>   at 
> org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313)
>   at 
> org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
> {code}
> The same works in Spark 2.4:
> {code:scala}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.4.6
>       /_/
> Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265)
> ...
> scala> pokerhand_events.show()
> +-+
> |event|
> +-+
> +-+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33138) unify temp view and permanent view behaviors

2020-10-13 Thread Thejdeep (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213550#comment-17213550
 ] 

Thejdeep commented on SPARK-33138:
--

Can I look into this, please?

> unify temp view and permanent view behaviors
> 
>
> Key: SPARK-33138
> URL: https://issues.apache.org/jira/browse/SPARK-33138
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Priority: Major
>
> Currently, the temp view store maps a temp view name to its logicalPlan, 
> while a permanent view stored in HMS keeps its original SQL text.
> So for a permanent view, each time the view is referenced its SQL text is 
> parsed, analyzed, optimized and planned again with the current SQLConf and 
> SparkSession context, so its result may keep changing when the SQLConf and 
> context differ between references.
> In order to unify the behaviors of temp views and permanent views, it is 
> proposed that we keep the original SQL text for both temp and permanent 
> views, and also record the SQLConf at the time the view was created. Each 
> time the view is referenced, we use the snapshotted SQLConf to 
> parse-analyze-optimize-plan the SQL text; this way, we can make sure the 
> output of the created view is stable.
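
As an illustration of the behavior described above (hypothetical view name, Spark 3.x semantics assumed): because a permanent view is re-analyzed with whatever SQLConf is active at read time, flipping a session config can change what the same view returns:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("view-conf-demo").getOrCreate()

// The permanent view stores only its SQL text, so the CAST is re-analyzed on every read.
spark.sql("CREATE OR REPLACE VIEW v_demo AS SELECT CAST('abc' AS INT) AS c")

spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT * FROM v_demo").show()   // c is NULL under the lenient cast

spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT * FROM v_demo").show()   // the same view now fails the cast under ANSI mode
{code}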



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33142) SQL temp view should store SQL text as well

2020-10-13 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-33142:
---

 Summary: SQL temp view should store SQL text as well
 Key: SPARK-33142
 URL: https://issues.apache.org/jira/browse/SPARK-33142
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Leanken.Lin


TODO



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33141) capture SQL configs when creating permanent views

2020-10-13 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-33141:
---

 Summary: capture SQL configs when creating permanent views
 Key: SPARK-33141
 URL: https://issues.apache.org/jira/browse/SPARK-33141
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Leanken.Lin


TODO



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33140) make Analyzer and Optimizer rules using SQLConf.get

2020-10-13 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-33140:
---

 Summary: make Analyzer and Optimizer rules using SQLConf.get
 Key: SPARK-33140
 URL: https://issues.apache.org/jira/browse/SPARK-33140
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Leanken.Lin


TODO



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33139) protect setActiveSession and clearActiveSession

2020-10-13 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-33139:
---

 Summary: protect setActiveSession and clearActiveSession
 Key: SPARK-33139
 URL: https://issues.apache.org/jira/browse/SPARK-33139
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Leanken.Lin


TODO



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33138) unify temp view and permanent view behaviors

2020-10-13 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-33138:

Description: 
Currently, a temp view stores a mapping from the view name to its logicalPlan, 
while a permanent view stored in HMS keeps its original SQL text.

So when a permanent view is referenced, its SQL text is parsed, analyzed, 
optimized, and planned again with the current SQLConf and SparkSession context, so 
its result may change whenever the SQLConf or context differs.

To unify the behaviors of temp views and permanent views, we propose keeping the 
original SQL text for both, and also recording a snapshot of the SQLConf at view 
creation time. Each time the view is referenced, we use the snapshotted SQLConf to 
parse-analyze-optimize-plan the SQL text, so that the output of the created view 
stays stable.

  was:Currently, a temp view stores a mapping from the view name to its 
logicalPlan, while a permanent view stored in HMS keeps its original SQL text. So 
when a permanent view is referenced, its SQL text is parsed, analyzed, optimized, 
and planned again with the current SQLConf and SparkSession context, so its result 
may change whenever the SQLConf or context differs. To unify the behaviors of temp 
views and permanent views, we propose keeping the original SQL text for both, and 
also recording a snapshot of the SQLConf at view creation time. Each time the view 
is referenced, we use the snapshotted SQLConf to parse-analyze-optimize-plan the 
SQL text, so that the output of the created view stays stable.


> unify temp view and permanent view behaviors
> 
>
> Key: SPARK-33138
> URL: https://issues.apache.org/jira/browse/SPARK-33138
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Priority: Major
>
> Currently, a temp view stores a mapping from the view name to its logicalPlan, 
> while a permanent view stored in HMS keeps its original SQL text.
> So when a permanent view is referenced, its SQL text is parsed, analyzed, 
> optimized, and planned again with the current SQLConf and SparkSession context, 
> so its result may change whenever the SQLConf or context differs.
> To unify the behaviors of temp views and permanent views, we propose keeping the 
> original SQL text for both, and also recording a snapshot of the SQLConf at view 
> creation time. Each time the view is referenced, we use the snapshotted SQLConf 
> to parse-analyze-optimize-plan the SQL text, so that the output of the created 
> view stays stable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33138) unify temp view and permanent view behaviors

2020-10-13 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-33138:
---

 Summary: unify temp view and permanent view behaviors
 Key: SPARK-33138
 URL: https://issues.apache.org/jira/browse/SPARK-33138
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
 Environment: Currently, a temp view stores a mapping from the view name to its 
logicalPlan, while a permanent view stored in HMS keeps its original SQL text. So 
when a permanent view is referenced, its SQL text is parsed, analyzed, optimized, 
and planned again with the current SQLConf and SparkSession context, so its result 
may change whenever the SQLConf or context differs. To unify the behaviors of temp 
views and permanent views, we propose keeping the original SQL text for both, and 
also recording a snapshot of the SQLConf at view creation time. Each time the view 
is referenced, we use the snapshotted SQLConf to parse-analyze-optimize-plan the 
SQL text, so that the output of the created view stays stable.
Reporter: Leanken.Lin






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33138) unify temp view and permanent view behaviors

2020-10-13 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-33138:

Description: Currently, a temp view stores a mapping from the view name to its 
logicalPlan, while a permanent view stored in HMS keeps its original SQL text. So 
when a permanent view is referenced, its SQL text is parsed, analyzed, optimized, 
and planned again with the current SQLConf and SparkSession context, so its result 
may change whenever the SQLConf or context differs. To unify the behaviors of temp 
views and permanent views, we propose keeping the original SQL text for both, and 
also recording a snapshot of the SQLConf at view creation time. Each time the view 
is referenced, we use the snapshotted SQLConf to parse-analyze-optimize-plan the 
SQL text, so that the output of the created view stays stable.
Environment: (was: Currently, a temp view stores a mapping from the view name to 
its logicalPlan, while a permanent view stored in HMS keeps its original SQL text. 
So when a permanent view is referenced, its SQL text is parsed, analyzed, 
optimized, and planned again with the current SQLConf and SparkSession context, so 
its result may change whenever the SQLConf or context differs. To unify the 
behaviors of temp views and permanent views, we propose keeping the original SQL 
text for both, and also recording a snapshot of the SQLConf at view creation time. 
Each time the view is referenced, we use the snapshotted SQLConf to 
parse-analyze-optimize-plan the SQL text, so that the output of the created view 
stays stable.)

> unify temp view and permanent view behaviors
> 
>
> Key: SPARK-33138
> URL: https://issues.apache.org/jira/browse/SPARK-33138
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Priority: Major
>
> Currently, a temp view stores a mapping from the view name to its logicalPlan, 
> while a permanent view stored in HMS keeps its original SQL text. So when a 
> permanent view is referenced, its SQL text is parsed, analyzed, optimized, and 
> planned again with the current SQLConf and SparkSession context, so its result 
> may change whenever the SQLConf or context differs. To unify the behaviors of 
> temp views and permanent views, we propose keeping the original SQL text for 
> both, and also recording a snapshot of the SQLConf at view creation time. Each 
> time the view is referenced, we use the snapshotted SQLConf to 
> parse-analyze-optimize-plan the SQL text, so that the output of the created view 
> stays stable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33137) Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (PostgreSQL dialect)

2020-10-13 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-33137:
--

 Summary: Support ALTER TABLE in JDBC v2 Table Catalog: update type 
and nullability of columns (PostgreSQL dialect)
 Key: SPARK-33137
 URL: https://issues.apache.org/jira/browse/SPARK-33137
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Huaxin Gao


Override the default SQL strings in the PostgreSQL JDBC dialect, following the 
official PostgreSQL documentation, for:

ALTER TABLE UPDATE COLUMN TYPE
ALTER TABLE UPDATE COLUMN NULLABILITY

Write PostgreSQL integration tests for JDBC.
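
For reference, a rough sketch of the PostgreSQL statements such overrides would 
need to emit; the helper names below are hypothetical (they are not the Spark 
dialect API) and only mirror the documented PostgreSQL ALTER TABLE syntax:

{code:scala}
// Hypothetical helpers, not actual Spark JdbcDialect methods: they only show the
// PostgreSQL ALTER TABLE statements the dialect overrides would need to generate.
def updateColumnTypeSql(table: String, column: String, newType: String): String =
  s"""ALTER TABLE $table ALTER COLUMN "$column" TYPE $newType"""

def updateColumnNullabilitySql(table: String, column: String, nullable: Boolean): String = {
  val action = if (nullable) "DROP NOT NULL" else "SET NOT NULL"
  s"""ALTER TABLE $table ALTER COLUMN "$column" $action"""
}
{code}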



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33136) Handling nullability for complex types is broken during resolution of V2 write command

2020-10-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213461#comment-17213461
 ] 

Apache Spark commented on SPARK-33136:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/30033

> Handling nullability for complex types is broken during resolution of V2 
> write command
> --
>
> Key: SPARK-33136
> URL: https://issues.apache.org/jira/browse/SPARK-33136
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> I found that Spark 3.x cannot write to a nullable complex type if the matching 
> column in the DataFrame is non-nullable.
> For example, 
> {code:java}
> case class StructData(a: String, b: Int)
> case class Data(col_b: Boolean, col_i: Int, col_l: Long, col_f: Float, col_d: 
> Double, col_s: String, col_fi: Array[Byte], col_bi: Array[Byte], col_de: 
> Double, col_st: StructData, col_li: Seq[String], col_ma: Map[Int, 
> String]){code}
> `col_st.b` is non-nullable in the DataFrame, which should not matter when we 
> insert from the DataFrame into a table that has `col_st.b` as nullable 
> (writing non-nullable data into a nullable column should be allowed).
> This appears to be broken in the V2 write command.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33136) Handling nullability for complex types is broken during resolution of V2 write command

2020-10-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33136:


Assignee: Apache Spark

> Handling nullability for complex types is broken during resolution of V2 
> write command
> --
>
> Key: SPARK-33136
> URL: https://issues.apache.org/jira/browse/SPARK-33136
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> I found that Spark 3.x cannot write to a nullable complex type if the matching 
> column in the DataFrame is non-nullable.
> For example, 
> {code:java}
> case class StructData(a: String, b: Int)
> case class Data(col_b: Boolean, col_i: Int, col_l: Long, col_f: Float, col_d: 
> Double, col_s: String, col_fi: Array[Byte], col_bi: Array[Byte], col_de: 
> Double, col_st: StructData, col_li: Seq[String], col_ma: Map[Int, 
> String]){code}
> `col_st.b` is non-nullable in the DataFrame, which should not matter when we 
> insert from the DataFrame into a table that has `col_st.b` as nullable 
> (writing non-nullable data into a nullable column should be allowed).
> This appears to be broken in the V2 write command.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33136) Handling nullability for complex types is broken during resolution of V2 write command

2020-10-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33136:


Assignee: (was: Apache Spark)

> Handling nullability for complex types is broken during resolution of V2 
> write command
> --
>
> Key: SPARK-33136
> URL: https://issues.apache.org/jira/browse/SPARK-33136
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> I found that Spark 3.x cannot write to a nullable complex type if the matching 
> column in the DataFrame is non-nullable.
> For example, 
> {code:java}
> case class StructData(a: String, b: Int)
> case class Data(col_b: Boolean, col_i: Int, col_l: Long, col_f: Float, col_d: 
> Double, col_s: String, col_fi: Array[Byte], col_bi: Array[Byte], col_de: 
> Double, col_st: StructData, col_li: Seq[String], col_ma: Map[Int, 
> String]){code}
> `col_st.b` is non-nullable in the DataFrame, which should not matter when we 
> insert from the DataFrame into a table that has `col_st.b` as nullable 
> (writing non-nullable data into a nullable column should be allowed).
> This appears to be broken in the V2 write command.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33136) Handling nullability for complex types is broken during resolution of V2 write command

2020-10-13 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213448#comment-17213448
 ] 

Jungtaek Lim commented on SPARK-33136:
--

will submit a PR soon.

> Handling nullability for complex types is broken during resolution of V2 
> write command
> --
>
> Key: SPARK-33136
> URL: https://issues.apache.org/jira/browse/SPARK-33136
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> I found that Spark 3.x cannot write to a nullable complex type if the matching 
> column in the DataFrame is non-nullable.
> For example, 
> {code:java}
> case class StructData(a: String, b: Int)
> case class Data(col_b: Boolean, col_i: Int, col_l: Long, col_f: Float, col_d: 
> Double, col_s: String, col_fi: Array[Byte], col_bi: Array[Byte], col_de: 
> Double, col_st: StructData, col_li: Seq[String], col_ma: Map[Int, 
> String]){code}
> `col_st.b` is non-nullable in the DataFrame, which should not matter when we 
> insert from the DataFrame into a table that has `col_st.b` as nullable 
> (writing non-nullable data into a nullable column should be allowed).
> This appears to be broken in the V2 write command.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33136) Handling nullability for complex types is broken during resolution of V2 write command

2020-10-13 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-33136:


 Summary: Handling nullability for complex types is broken during 
resolution of V2 write command
 Key: SPARK-33136
 URL: https://issues.apache.org/jira/browse/SPARK-33136
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1, 3.0.0, 3.1.0
Reporter: Jungtaek Lim


I found that Spark 3.x cannot write to a nullable complex type if the matching 
column in the DataFrame is non-nullable.

For example, 
{code:java}
case class StructData(a: String, b: Int)

case class Data(col_b: Boolean, col_i: Int, col_l: Long, col_f: Float, col_d: 
Double, col_s: String, col_fi: Array[Byte], col_bi: Array[Byte], col_de: 
Double, col_st: StructData, col_li: Seq[String], col_ma: Map[Int, String]){code}
`col_st.b` is non-nullable in the DataFrame, which should not matter when we 
insert from the DataFrame into a table that has `col_st.b` as nullable 
(writing non-nullable data into a nullable column should be allowed).

This appears to be broken in the V2 write command.
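
To make the expected rule concrete, here is an illustrative check (not Spark's 
actual V2 write resolution code) that only rejects writing nullable data into a 
non-nullable target and recurses into struct fields and array elements:

{code:scala}
import org.apache.spark.sql.types._

// Illustrative only: non-nullable into nullable is accepted; nullable into
// non-nullable is rejected. Recurses into structs and arrays.
def nullabilityCompatible(from: DataType, to: DataType,
                          fromNullable: Boolean, toNullable: Boolean): Boolean = {
  val nullOk = toNullable || !fromNullable
  nullOk && ((from, to) match {
    case (f: StructType, t: StructType) =>
      f.fields.length == t.fields.length &&
        f.fields.zip(t.fields).forall { case (ff, tf) =>
          nullabilityCompatible(ff.dataType, tf.dataType, ff.nullable, tf.nullable)
        }
    case (f: ArrayType, t: ArrayType) =>
      nullabilityCompatible(f.elementType, t.elementType, f.containsNull, t.containsNull)
    case _ => true // leaf types: only the nullability flag matters here
  })
}
{code}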

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29625) Spark Structure Streaming Kafka Wrong Reset Offset twice

2020-10-13 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213336#comment-17213336
 ] 

L. C. Hsieh commented on SPARK-29625:
-

How did you specify the starting offset?
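
For reference, this is where the starting offsets (and the data-loss behavior) are 
usually pinned on the Kafka source; the broker and topic values below are 
placeholders, and the option names are the documented Kafka source options:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-offsets-example").getOrCreate()

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "topic")
  .option("startingOffsets", "earliest")  // or "latest", or a per-partition JSON map
  .option("failOnDataLoss", "false")      // don't fail the query when offsets were aged out
  .load()
{code}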

> Spark Structure Streaming Kafka Wrong Reset Offset twice
> 
>
> Key: SPARK-29625
> URL: https://issues.apache.org/jira/browse/SPARK-29625
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Sandish Kumar HN
>Priority: Major
>
> Spark Structured Streaming's Kafka source resets offsets twice, once with the 
> right offsets and a second time with very old offsets.
> {code}
> [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 
> INFO Fetcher: [Consumer clientId=consumer-1, 
> groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
>  Resetting offset for partition topic-151 to offset 0.
> [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 
> INFO Fetcher: [Consumer clientId=consumer-1, 
> groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
>  Resetting offset for partition topic-118 to offset 0.
> [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 
> INFO Fetcher: [Consumer clientId=consumer-1, 
> groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
>  Resetting offset for partition topic-85 to offset 0.
> [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 
> INFO Fetcher: [Consumer clientId=consumer-1, 
> groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
>  Resetting offset for partition topic-52 to offset 122677634.
> [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 
> INFO Fetcher: [Consumer clientId=consumer-1, 
> groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
>  Resetting offset for partition topic-19 to offset 0.
> [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 
> INFO Fetcher: [Consumer clientId=consumer-1, 
> groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
>  Resetting offset for partition topic-52 to offset 120504922.*
> [2019-10-28 19:27:40,153] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 
> INFO ContextCleaner: Cleaned accumulator 810
> {code}
> which causes a data loss issue.
> {code}
> [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 
> ERROR StreamExecution: Query [id = d62ca9e4-6650-454f-8691-a3d576d1e4ba, 
> runId = 3946389f-222b-495c-9ab2-832c0422cbbb] terminated with error
> [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - 
> java.lang.IllegalStateException: Partition topic-52's offset was changed from 
> 122677598 to 120504922, some data may have been missed.
> [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - Some data may have 
> been lost because they are not available in Kafka any more; either the
> [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -  data was aged out 
> by Kafka or the topic may have been deleted before all the data in the
> [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -  topic was 
> processed. If you don't want your streaming query to fail on such cases, set 
> the
> [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -  source option 
> "failOnDataLoss" to "false".
> [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - 
> [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -  at 
> org.apache.spark.sql.kafka010.KafkaSource.org$apache$spark$sql$kafka010$KafkaSource$$reportDataLoss(KafkaSource.scala:329)
> [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -  at 
> org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:283)
> [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -  at 
> org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:281)
> [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -  at 
> scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248)
> [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -  at 
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -  at 
> scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
> [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -  at 
> scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
> [2019-10-28 19:27:40,351] \{bash_operator.py:128} I

[jira] [Updated] (SPARK-33134) Incorrect nested complex JSON fields raise an exception

2020-10-13 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-33134:
---
Description: 
The code below:
{code:scala}
val pokerhand_raw = Seq("""[{"cards": [19], "playerId": 
123456}]""").toDF("events")
val event = new StructType()
  .add("playerId", LongType)
  .add("cards", ArrayType(
new StructType()
  .add("id", LongType)
  .add("rank", StringType)))
val pokerhand_events = pokerhand_raw
  .select(explode(from_json($"events", ArrayType(event))).as("event"))
pokerhand_events.show
{code}
throws an exception in PERMISSIVE mode (the default):
{code:java}
Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to 
org.apache.spark.sql.catalyst.util.ArrayData
  at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
  at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48)
  at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195)
  at 
org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560)
  at 
org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597)
  at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461)
  at 
org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313)
  at 
org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
{code}
The same works in Spark 2.4:
{code:scala}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.6
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265)
...
scala> pokerhand_events.show()
+-----+
|event|
+-----+
+-----+
{code}

  was:
The code below:
{code:scala}
val pokerhand_raw = Seq("""[{"cards": [11], "playerId": 
583651}]""").toDF("events")
val event = new StructType()
  .add("playerId", LongType)
  .add("cards", ArrayType(
new StructType()
  .add("id", LongType)
  .add("rank", StringType)))
val pokerhand_events = pokerhand_raw
  .select(explode(from_json($"events", ArrayType(event))).as("event"))
pokerhand_events.show
{code}
throws an exception in PERMISSIVE mode (the default):
{code:java}
Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to 
org.apache.spark.sql.catalyst.util.ArrayData
  at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
  at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48)
  at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195)
  at 
org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560)
  at 
org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597)
  at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461)
  at 
org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313)
  at 
org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
{code}
The same works in Spark 2.4:
{code:scala}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.6
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265)
...
scala> pokerhand_events.show()
+-----+
|event|
+-----+
+-----+
{code}


> Incorrect nested complex JSON fields raise an exception
> ---
>
> Key: SPARK-33134
> URL: https://issues.apache.org/jira/browse/SPARK-33134
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The code below:
> {code:scala}
> val pokerhand_raw = Seq("""[{"cards": [19], "playerId": 
> 123456}]""").toDF("events")
> val event = new StructType()
>   .add("playerId", LongType)
>   .add("cards", ArrayType(
> new StructType()
>   .add("id", LongType)
>   .add("rank", StringType)))
> val pokerhand_events = pokerhand_raw
>   .select(explode(from_json($"events", ArrayType(event))).as("event"))
> pokerhand_events.show
> {code}
> throws an exception in PERMISSIVE mode (the default):
> {code:java}
> Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to 
> org.apache.spark.sql.catalyst.util.ArrayData
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.ge

[jira] [Commented] (SPARK-33071) Join with ambiguous column succeeding but giving wrong output

2020-10-13 Thread Aoyuan Liao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213312#comment-17213312
 ] 

Aoyuan Liao commented on SPARK-33071:
-

The master branch has the same bug.

> Join with ambiguous column succeeding but giving wrong output
> -
>
> Key: SPARK-33071
> URL: https://issues.apache.org/jira/browse/SPARK-33071
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.1
>Reporter: George
>Priority: Major
>  Labels: correctness
>
> When joining two datasets where one column in each dataset is sourced from 
> the same input dataset, the join successfully runs, but does not select the 
> correct columns, leading to incorrect output.
> Repro using pyspark:
> {code:java}
> sc.version
> import pyspark.sql.functions as F
> d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, 'units' 
> : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', 'sales': 1, 
> 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}]
> input_df = spark.createDataFrame(d)
> df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales'))
> df2 = input_df.groupBy("key").agg(F.sum('units').alias('units'))
> df1 = df1.filter(F.col("key") != F.lit("c"))
> df2 = df2.filter(F.col("key") != F.lit("d"))
> ret = df1.join(df2, df1.key == df2.key, "full").select(
> df1["key"].alias("df1_key"),
> df2["key"].alias("df2_key"),
> df1["sales"],
> df2["units"],
> F.coalesce(df1["key"], df2["key"]).alias("key"))
> ret.show()
> ret.explain(){code}
> output for 2.4.4:
> {code:java}
> >>> sc.version
> u'2.4.4'
> >>> import pyspark.sql.functions as F
> >>> d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, 
> >>> 'units' : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', 
> >>> 'sales': 1, 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}]
> >>> input_df = spark.createDataFrame(d)
> >>> df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales'))
> >>> df2 = input_df.groupBy("key").agg(F.sum('units').alias('units'))
> >>> df1 = df1.filter(F.col("key") != F.lit("c"))
> >>> df2 = df2.filter(F.col("key") != F.lit("d"))
> >>> ret = df1.join(df2, df1.key == df2.key, "full").select(
> ... df1["key"].alias("df1_key"),
> ... df2["key"].alias("df2_key"),
> ... df1["sales"],
> ... df2["units"],
> ... F.coalesce(df1["key"], df2["key"]).alias("key"))
> 20/10/05 15:46:14 WARN Column: Constructing trivially true equals predicate, 
> 'key#213 = key#213'. Perhaps you need to use aliases.
> >>> ret.show()
> +-------+-------+-----+-----+----+
> |df1_key|df2_key|sales|units| key|
> +-------+-------+-----+-----+----+
> |      d|      d|    3| null|   d|
> |   null|   null| null|    2|null|
> |      b|      b|    5|   10|   b|
> |      a|      a|    3|    6|   a|
> +-------+-------+-----+-----+----+
> >>> ret.explain()
> == Physical Plan ==
> *(5) Project [key#213 AS df1_key#258, key#213 AS df2_key#259, sales#223L, 
> units#230L, coalesce(key#213, key#213) AS key#260]
> +- SortMergeJoin [key#213], [key#237], FullOuter
>:- *(2) Sort [key#213 ASC NULLS FIRST], false, 0
>:  +- *(2) HashAggregate(keys=[key#213], functions=[sum(sales#214L)])
>: +- Exchange hashpartitioning(key#213, 200)
>:+- *(1) HashAggregate(keys=[key#213], 
> functions=[partial_sum(sales#214L)])
>:   +- *(1) Project [key#213, sales#214L]
>:  +- *(1) Filter (isnotnull(key#213) && NOT (key#213 = c))
>: +- Scan ExistingRDD[key#213,sales#214L,units#215L]
>+- *(4) Sort [key#237 ASC NULLS FIRST], false, 0
>   +- *(4) HashAggregate(keys=[key#237], functions=[sum(units#239L)])
>  +- Exchange hashpartitioning(key#237, 200)
> +- *(3) HashAggregate(keys=[key#237], 
> functions=[partial_sum(units#239L)])
>+- *(3) Project [key#237, units#239L]
>   +- *(3) Filter (isnotnull(key#237) && NOT (key#237 = d))
>  +- Scan ExistingRDD[key#237,sales#238L,units#239L]
> {code}
> output for 3.0.1:
> {code:java}
> // code placeholder
> >>> sc.version
> u'3.0.1'
> >>> import pyspark.sql.functions as F
> >>> d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, 
> >>> 'units' : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', 
> >>> 'sales': 1, 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}]
> >>> input_df = spark.createDataFrame(d)
> /usr/local/lib/python2.7/site-packages/pyspark/sql/session.py:381: 
> UserWarning: inferring schema from dict is deprecated,please use 
> pyspark.sql.Row instead
>   warnings.warn("inferring schema from dict is deprecated,"
> >>> df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales'))
> >>> df2 = input_df.groupBy("key").agg(F.sum('units').alias('units'))
> >>> df1 = df1.filt

[jira] [Commented] (SPARK-33134) Incorrect nested complex JSON fields raise an exception

2020-10-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213224#comment-17213224
 ] 

Apache Spark commented on SPARK-33134:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30032

> Incorrect nested complex JSON fields raise an exception
> ---
>
> Key: SPARK-33134
> URL: https://issues.apache.org/jira/browse/SPARK-33134
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The code below:
> {code:scala}
> val pokerhand_raw = Seq("""[{"cards": [11], "playerId": 
> 583651}]""").toDF("events")
> val event = new StructType()
>   .add("playerId", LongType)
>   .add("cards", ArrayType(
> new StructType()
>   .add("id", LongType)
>   .add("rank", StringType)))
> val pokerhand_events = pokerhand_raw
>   .select(explode(from_json($"events", ArrayType(event))).as("event"))
> pokerhand_events.show
> {code}
> throws an exception in PERMISSIVE mode (the default):
> {code:java}
> Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to 
> org.apache.spark.sql.catalyst.util.ArrayData
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461)
>   at 
> org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313)
>   at 
> org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
> {code}
> The same works in Spark 2.4:
> {code:scala}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.4.6
>       /_/
> Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265)
> ...
> scala> pokerhand_events.show()
> +-----+
> |event|
> +-----+
> +-----+
> {code}
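
As an aside, the cast error comes from the mismatch between the declared element 
type of {{cards}} (an array of structs) and the sample data (bare numbers). A 
schema that matches the sample record above parses cleanly, although PERMISSIVE 
mode should still have handled the mismatch without throwing; this is only a 
sketch meant to run in the same shell session as the repro, not a fix:

{code:scala}
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types._

// Schema in which "cards" holds plain numbers, matching the sample record above.
val matchingEvent = new StructType()
  .add("playerId", LongType)
  .add("cards", ArrayType(LongType))

val parsed = pokerhand_raw
  .select(explode(from_json(col("events"), ArrayType(matchingEvent))).as("event"))
{code}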



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33132) The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 'NaN Undefined'

2020-10-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33132:
-

Assignee: akiyamaneko

> The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 
> 'NaN Undefined'
> -
>
> Key: SPARK-33132
> URL: https://issues.apache.org/jira/browse/SPARK-33132
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: akiyamaneko
>Assignee: akiyamaneko
>Priority: Minor
> Attachments: Stage Summary shows NaN undefined.png
>
>
> Spark Version: 3.0.1
> Description: The 'Shuffle Read Size / Records' field in Stage Summary metrics 
> was shown as '*NaN Undefined*' when the *readBytes* value is negative, as 
> the attachment shows.
> *curl 
> http:/hadoop001:18081/api/v1/applications/application_1601774913550_0225/stages/2/0/taskSummary*
> {
>  "quantiles" : [ 0.05, 0.25, 0.5, 0.75, 0.95 ],
>  "executorDeserializeTime" : [ 7.0, 357.0, 390.0, 484.0, 492.0 ],
>   ...
>  "shuffleReadMetrics" :
> { {color:#de350b}"*readBytes*"{color} : [ {color:#de350b}*-2.0*{color}, 
> 1775984.0, 1779759.0, 1781727.0, 1788426.0 ], "readRecords" : [ 2001.0, 
> 2002.0, 2002.0, 2002.0, 2002.0 ],  ... }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33132) The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 'NaN Undefined'

2020-10-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33132.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30030
[https://github.com/apache/spark/pull/30030]

> The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 
> 'NaN Undefined'
> -
>
> Key: SPARK-33132
> URL: https://issues.apache.org/jira/browse/SPARK-33132
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: akiyamaneko
>Assignee: akiyamaneko
>Priority: Minor
> Fix For: 3.1.0
>
> Attachments: Stage Summary shows NaN undefined.png
>
>
> Spark Version: 3.0.1
> Description: The 'Shuffle Read Size / Records' field in Stage Summary metrics 
> was shown as '*NaN Undefined*' when the *readBytes* value is negative, as 
> the attachment shows.
> *curl 
> http:/hadoop001:18081/api/v1/applications/application_1601774913550_0225/stages/2/0/taskSummary*
> {
>  "quantiles" : [ 0.05, 0.25, 0.5, 0.75, 0.95 ],
>  "executorDeserializeTime" : [ 7.0, 357.0, 390.0, 484.0, 492.0 ],
>   ...
>  "shuffleReadMetrics" :
> { {color:#de350b}"*readBytes*"{color} : [ {color:#de350b}*-2.0*{color}, 
> 1775984.0, 1779759.0, 1781727.0, 1788426.0 ], "readRecords" : [ 2001.0, 
> 2002.0, 2002.0, 2002.0, 2002.0 ],  ... }
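
A minimal sketch of the kind of guard involved (illustrative only, not the actual 
change in PR 30030): clamp negative byte counts before unit conversion so a bad 
metric renders as "0.0 B" rather than "NaN Undefined".

{code:scala}
// Illustrative only: negative byte counts are clamped to zero before formatting.
def formatBytes(bytes: Double): String = {
  val units = Seq("B", "KiB", "MiB", "GiB", "TiB")
  var value = math.max(bytes, 0.0)
  var i = 0
  while (value >= 1024.0 && i < units.size - 1) {
    value /= 1024.0
    i += 1
  }
  f"$value%.1f ${units(i)}"
}
{code}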



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33071) Join with ambiguous column succeeding but giving wrong output

2020-10-13 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213205#comment-17213205
 ] 

Wenchen Fan commented on SPARK-33071:
-

To confirm: is this a long-standing bug that 2.4, 3.0, and the master branch 
all give wrong (but same) result?

> Join with ambiguous column succeeding but giving wrong output
> -
>
> Key: SPARK-33071
> URL: https://issues.apache.org/jira/browse/SPARK-33071
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.1
>Reporter: George
>Priority: Major
>  Labels: correctness
>
> When joining two datasets where one column in each dataset is sourced from 
> the same input dataset, the join successfully runs, but does not select the 
> correct columns, leading to incorrect output.
> Repro using pyspark:
> {code:java}
> sc.version
> import pyspark.sql.functions as F
> d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, 'units' 
> : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', 'sales': 1, 
> 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}]
> input_df = spark.createDataFrame(d)
> df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales'))
> df2 = input_df.groupBy("key").agg(F.sum('units').alias('units'))
> df1 = df1.filter(F.col("key") != F.lit("c"))
> df2 = df2.filter(F.col("key") != F.lit("d"))
> ret = df1.join(df2, df1.key == df2.key, "full").select(
> df1["key"].alias("df1_key"),
> df2["key"].alias("df2_key"),
> df1["sales"],
> df2["units"],
> F.coalesce(df1["key"], df2["key"]).alias("key"))
> ret.show()
> ret.explain(){code}
> output for 2.4.4:
> {code:java}
> >>> sc.version
> u'2.4.4'
> >>> import pyspark.sql.functions as F
> >>> d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, 
> >>> 'units' : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', 
> >>> 'sales': 1, 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}]
> >>> input_df = spark.createDataFrame(d)
> >>> df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales'))
> >>> df2 = input_df.groupBy("key").agg(F.sum('units').alias('units'))
> >>> df1 = df1.filter(F.col("key") != F.lit("c"))
> >>> df2 = df2.filter(F.col("key") != F.lit("d"))
> >>> ret = df1.join(df2, df1.key == df2.key, "full").select(
> ... df1["key"].alias("df1_key"),
> ... df2["key"].alias("df2_key"),
> ... df1["sales"],
> ... df2["units"],
> ... F.coalesce(df1["key"], df2["key"]).alias("key"))
> 20/10/05 15:46:14 WARN Column: Constructing trivially true equals predicate, 
> 'key#213 = key#213'. Perhaps you need to use aliases.
> >>> ret.show()
> +-------+-------+-----+-----+----+
> |df1_key|df2_key|sales|units| key|
> +-------+-------+-----+-----+----+
> |      d|      d|    3| null|   d|
> |   null|   null| null|    2|null|
> |      b|      b|    5|   10|   b|
> |      a|      a|    3|    6|   a|
> +-------+-------+-----+-----+----+
> >>> ret.explain()
> == Physical Plan ==
> *(5) Project [key#213 AS df1_key#258, key#213 AS df2_key#259, sales#223L, 
> units#230L, coalesce(key#213, key#213) AS key#260]
> +- SortMergeJoin [key#213], [key#237], FullOuter
>:- *(2) Sort [key#213 ASC NULLS FIRST], false, 0
>:  +- *(2) HashAggregate(keys=[key#213], functions=[sum(sales#214L)])
>: +- Exchange hashpartitioning(key#213, 200)
>:+- *(1) HashAggregate(keys=[key#213], 
> functions=[partial_sum(sales#214L)])
>:   +- *(1) Project [key#213, sales#214L]
>:  +- *(1) Filter (isnotnull(key#213) && NOT (key#213 = c))
>: +- Scan ExistingRDD[key#213,sales#214L,units#215L]
>+- *(4) Sort [key#237 ASC NULLS FIRST], false, 0
>   +- *(4) HashAggregate(keys=[key#237], functions=[sum(units#239L)])
>  +- Exchange hashpartitioning(key#237, 200)
> +- *(3) HashAggregate(keys=[key#237], 
> functions=[partial_sum(units#239L)])
>+- *(3) Project [key#237, units#239L]
>   +- *(3) Filter (isnotnull(key#237) && NOT (key#237 = d))
>  +- Scan ExistingRDD[key#237,sales#238L,units#239L]
> {code}
> output for 3.0.1:
> {code:java}
> // code placeholder
> >>> sc.version
> u'3.0.1'
> >>> import pyspark.sql.functions as F
> >>> d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, 
> >>> 'units' : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', 
> >>> 'sales': 1, 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}]
> >>> input_df = spark.createDataFrame(d)
> /usr/local/lib/python2.7/site-packages/pyspark/sql/session.py:381: 
> UserWarning: inferring schema from dict is deprecated,please use 
> pyspark.sql.Row instead
>   warnings.warn("inferring schema from dict is deprecated,"
> >>> df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales'))
> >>> df2 = inp

[jira] [Assigned] (SPARK-33135) Use listLocatedStatus from FileSystem implementations

2020-10-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33135:


Assignee: Apache Spark

> Use listLocatedStatus from FileSystem implementations
> -
>
> Key: SPARK-33135
> URL: https://issues.apache.org/jira/browse/SPARK-33135
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Apache Spark
>Priority: Major
>
> {{HadoopFsUtils.parallelListLeafFiles}} currently only calls 
> {{listLocatedStatus}} when the {{FileSystem}} impl is 
> {{DistributedFileSystem}} or {{ViewFileSystem}}. For other types of 
> {{FileSystem}}, it calls {{listStatus}} and then subsequently calls 
> {{getFileBlockLocations}} on all the result {{FileStatus}}es.
> In the Hadoop client, {{listLocatedStatus}} is a well-defined API that is often 
> overridden by specific file system implementations such as S3A, and its default 
> implementation behaves much like what Spark does today.
> Therefore, instead of re-implementing the logic in Spark itself, it's better to 
> rely on the {{FileSystem}}-specific implementation of {{listLocatedStatus}}, 
> which can include its own optimizations in that code path.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33129) Since the sbt version is now upgraded, old `test-only` needs to be replaced with `testOnly`

2020-10-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33129.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30028
[https://github.com/apache/spark/pull/30028]

> Since the sbt version is now upgraded, old `test-only` needs to be replaced 
> with `testOnly`
> ---
>
> Key: SPARK-33129
> URL: https://issues.apache.org/jira/browse/SPARK-33129
> Project: Spark
>  Issue Type: Bug
>  Components: Build, docs
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>Priority: Major
> Fix For: 3.1.0
>
>
> Follow-up to SPARK-21708: update the references to test-only to testOnly, 
> since the older syntax no longer works.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33129) Since the sbt version is now upgraded, old `test-only` needs to be replaced with `testOnly`

2020-10-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33129:
-

Assignee: Prashant Sharma

> Since the sbt version is now upgraded, old `test-only` needs to be replaced 
> with `testOnly`
> ---
>
> Key: SPARK-33129
> URL: https://issues.apache.org/jira/browse/SPARK-33129
> Project: Spark
>  Issue Type: Bug
>  Components: Build, docs
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>Priority: Major
>
> Follow-up to SPARK-21708: update the references to test-only to testOnly, 
> since the older syntax no longer works.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33135) Use listLocatedStatus from FileSystem implementations

2020-10-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33135:


Assignee: (was: Apache Spark)

> Use listLocatedStatus from FileSystem implementations
> -
>
> Key: SPARK-33135
> URL: https://issues.apache.org/jira/browse/SPARK-33135
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Priority: Major
>
> {{HadoopFsUtils.parallelListLeafFiles}} currently only calls 
> {{listLocatedStatus}} when the {{FileSystem}} impl is 
> {{DistributedFileSystem}} or {{ViewFileSystem}}. For other types of 
> {{FileSystem}}, it calls {{listStatus}} and then subsequently calls 
> {{getFileBlockLocations}} on all the result {{FileStatus}}es.
> In the Hadoop client, {{listLocatedStatus}} is a well-defined API that is often 
> overridden by specific file system implementations such as S3A, and its default 
> implementation behaves much like what Spark does today.
> Therefore, instead of re-implementing the logic in Spark itself, it's better to 
> rely on the {{FileSystem}}-specific implementation of {{listLocatedStatus}}, 
> which can include its own optimizations in that code path.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33135) Use listLocatedStatus from FileSystem implementations

2020-10-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213203#comment-17213203
 ] 

Apache Spark commented on SPARK-33135:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/30019

> Use listLocatedStatus from FileSystem implementations
> -
>
> Key: SPARK-33135
> URL: https://issues.apache.org/jira/browse/SPARK-33135
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Priority: Major
>
> {{HadoopFsUtils.parallelListLeafFiles}} currently only calls 
> {{listLocatedStatus}} when the {{FileSystem}} impl is 
> {{DistributedFileSystem}} or {{ViewFileSystem}}. For other types of 
> {{FileSystem}}, it calls {{listStatus}} and then subsequently calls 
> {{getFileBlockLocations}} on all the result {{FileStatus}}es.
> In the Hadoop client, {{listLocatedStatus}} is a well-defined API that is often 
> overridden by specific file system implementations such as S3A, and its default 
> implementation behaves much like what Spark does today.
> Therefore, instead of re-implementing the logic in Spark itself, it's better to 
> rely on the {{FileSystem}}-specific implementation of {{listLocatedStatus}}, 
> which can include its own optimizations in that code path.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33135) Use listLocatedStatus from FileSystem implementations

2020-10-13 Thread Chao Sun (Jira)
Chao Sun created SPARK-33135:


 Summary: Use listLocatedStatus from FileSystem implementations
 Key: SPARK-33135
 URL: https://issues.apache.org/jira/browse/SPARK-33135
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.1
Reporter: Chao Sun


{{HadoopFsUtils.parallelListLeafFiles}} currently only calls 
{{listLocatedStatus}} when the {{FileSystem}} impl is {{DistributedFileSystem}} 
or {{ViewFileSystem}}. For other types of {{FileSystem}}, it calls 
{{listStatus}} and then subsequently calls {{getFileBlockLocations}} on all the 
result {{FileStatus}}es.

In the Hadoop client, {{listLocatedStatus}} is a well-defined API that is often 
overridden by specific file system implementations such as S3A, and its default 
implementation behaves much like what Spark does today.

Therefore, instead of re-implementing the logic in Spark itself, it's better to 
rely on the {{FileSystem}}-specific implementation of {{listLocatedStatus}}, 
which can include its own optimizations in that code path.
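
For illustration, a small sketch of the suggested direction (the helper name is 
made up and this is not the actual patch), calling the Hadoop 
{{FileSystem#listLocatedStatus}} API directly:

{code:scala}
import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path}
import scala.collection.mutable.ArrayBuffer

// Ask the FileSystem for located statuses in one call instead of listing plain
// statuses and then fetching block locations file by file; implementations such
// as S3A can optimize listLocatedStatus internally.
def listWithLocations(fs: FileSystem, dir: Path): Seq[LocatedFileStatus] = {
  val result = ArrayBuffer.empty[LocatedFileStatus]
  val iter = fs.listLocatedStatus(dir)
  while (iter.hasNext) {
    result += iter.next()
  }
  result.toSeq
}
{code}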



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33134) Incorrect nested complex JSON fields raise an exception

2020-10-13 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-33134:
---
Description: 
The code below:
{code:scala}
val pokerhand_raw = Seq("""[{"cards": [11], "playerId": 
583651}]""").toDF("events")
val event = new StructType()
  .add("playerId", LongType)
  .add("cards", ArrayType(
new StructType()
  .add("id", LongType)
  .add("rank", StringType)))
val pokerhand_events = pokerhand_raw
  .select(explode(from_json($"events", ArrayType(event))).as("event"))
pokerhand_events.show
{code}
throws an exception in PERMISSIVE mode (the default):
{code:java}
Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to 
org.apache.spark.sql.catalyst.util.ArrayData
  at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
  at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48)
  at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195)
  at 
org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560)
  at 
org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597)
  at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461)
  at 
org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313)
  at 
org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
{code}
The same works in Spark 2.4:
{code:scala}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.6
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265)
...
scala> pokerhand_events.show()
+-----+
|event|
+-----+
+-----+
{code}

  was:
The code below:
{code:scala}
val pokerhand_raw = Seq("""[{"cards": [11], "playerId": 
583651}]""").toDF("events")
val event = new StructType()
  .add("playerId", LongType)
  .add("cards", ArrayType(
new StructType()
  .add("id", LongType)
  .add("rank", StringType)))
val pokerhand_events = pokerhand_raw
  .select(explode(from_json($"events", ArrayType(event))).as("event"))
pokerhand_events.show
{code}
throws an exception in PERMISSIVE mode (the default):
{code:java}
Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to 
org.apache.spark.sql.catalyst.util.ArrayData
  at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
  at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48)
  at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195)
  at 
org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560)
  at 
org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597)
  at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461)
  at 
org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313)
  at 
org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
{code}
The same works in Spark 2.4:
{code:scala}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.6
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265)
...
scala> pokerhand_events.show()
+-----+
|event|
+-----+
+-----+
{code}


> Incorrect nested complex JSON fields raise an exception
> ---
>
> Key: SPARK-33134
> URL: https://issues.apache.org/jira/browse/SPARK-33134
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The code below:
> {code:scala}
> val pokerhand_raw = Seq("""[{"cards": [11], "playerId": 
> 583651}]""").toDF("events")
> val event = new StructType()
>   .add("playerId", LongType)
>   .add("cards", ArrayType(
> new StructType()
>   .add("id", LongType)
>   .add("rank", StringType)))
> val pokerhand_events = pokerhand_raw
>   .select(explode(from_json($"events", ArrayType(event))).as("event"))
> pokerhand_events.show
> {code}
> throws an exception in PERMISSIVE mode (the default):
> {code:java}
> Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to 
> org.apache.spark.sql.catalyst.util.ArrayData
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRo

[jira] [Commented] (SPARK-25547) Pluggable jdbc connection factory

2020-10-13 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213152#comment-17213152
 ] 

Gabor Somogyi commented on SPARK-25547:
---

[~fsauer65] JDBC connection provider API is added here: 
https://github.com/apache/spark/blob/dc697a8b598aea922ee6620d87f3ace2f7947231/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcConnectionProvider.scala#L36
Do you think we can close this jira?


> Pluggable jdbc connection factory
> -
>
> Key: SPARK-25547
> URL: https://issues.apache.org/jira/browse/SPARK-25547
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Frank Sauer
>Priority: Major
>
> The ability to provide a custom connectionFactoryProvider via JDBCOptions, so 
> that JdbcUtils.createConnectionFactory can produce a custom connection 
> factory, would be very useful. In our case we needed to load-balance 
> connections to an AWS Aurora Postgres cluster by round-robining through the 
> endpoints of the read replicas, since their own load balancing was 
> insufficient. We worked around it by copying most of the spark jdbc package, 
> adding this feature there, and changing the format from jdbc to our new 
> package. However, it would be nice if this were supported out of the box via 
> a new option in JDBCOptions providing the classname for a 
> ConnectionFactoryProvider. I'm creating this Jira in order to submit a PR 
> which I have ready to go.
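
Purely as an illustration of the use case described above (and not of the new 
JdbcConnectionProvider API linked in the comment), a client-side sketch that 
round-robins reads across reader endpoints; the URLs, table name, and 
credentials are placeholders:
{code:scala}
import java.util.Properties
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical reader endpoints of an Aurora Postgres cluster.
val readerUrls = Seq(
  "jdbc:postgresql://reader-1.example.com:5432/mydb",
  "jdbc:postgresql://reader-2.example.com:5432/mydb"
)
val counter = new AtomicInteger(0)

// Pick the next endpoint in round-robin order for each read.
def nextUrl(): String = readerUrls(counter.getAndIncrement() % readerUrls.size)

val props = new Properties()
props.setProperty("user", "spark")       // placeholder
props.setProperty("password", "secret")  // placeholder

// Each spark.read.jdbc call targets a different replica. Note this balances
// per DataFrame, not per partition connection, which is what a pluggable
// connection factory would make possible.
val df = spark.read.jdbc(nextUrl(), "public.my_table", props)
{code}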



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33128) mismatched input since Spark 3.0

2020-10-13 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213137#comment-17213137
 ] 

Yang Jie commented on SPARK-33128:
--

Setting `spark.sql.ansi.enabled` to true makes this work with Spark 3.0. I'm not sure if 
this is a bug. cc [~maropu]
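
For anyone hitting this, a minimal sketch of the workaround mentioned above, 
written for spark-shell; only the config name comes from the comment, the rest 
is illustrative:
{code:scala}
// Enable ANSI mode (as suggested above) before running the chained set operation.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT 1 UNION SELECT 1 UNION ALL SELECT 1").show()
{code}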

> mismatched input since Spark 3.0
> 
>
> Key: SPARK-33128
> URL: https://issues.apache.org/jira/browse/SPARK-33128
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> Spark 2.4:
> {noformat}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.4.4
>   /_/
> Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_221)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> spark.sql("SELECT 1 UNION SELECT 1 UNION ALL SELECT 1").show
> +---+
> |  1|
> +---+
> |  1|
> |  1|
> +---+
> {noformat}
> Spark 3.x:
> {noformat}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.1.0-SNAPSHOT
>   /_/
> Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 14.0.1)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> spark.sql("SELECT 1 UNION SELECT 1 UNION ALL SELECT 1").show
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input 'SELECT' expecting {<EOF>, ';'}(line 1, pos 15)
> == SQL ==
> SELECT 1 UNION SELECT 1 UNION ALL SELECT 1
> ---^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:51)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:81)
>   at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:610)
>   at 
> org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
>   at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:610)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:769)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:607)
>   ... 47 elided
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13860) TPCDS query 39 returns wrong results compared to TPC official result set

2020-10-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-13860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-13860:
---

Assignee: Leanken.Lin

> TPCDS query 39 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13860
> URL: https://issues.apache.org/jira/browse/SPARK-13860
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.1, 2.2.0
>Reporter: JESSE CHEN
>Assignee: Leanken.Lin
>Priority: Major
>  Labels: bulk-closed, tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 39 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> q39a - 3 extra rows in SparkSQL output (eg. 
> [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733])   ;  q39b 
> - 3 extra rows in SparkSQL output (eg. 
> [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733])
> Actual results 39a:
> {noformat}
> [1,265,1,324.75,1.2438391781531353,1,265,2,329.0,1.0151581328149208]
> [1,363,1,499.5,1.031941572270649,1,363,2,321.0,1.1411766752007977]
> [1,679,1,373.75,1.0955498064867504,1,679,2,417.5,1.042970994259454]
> [1,695,1,450.75,1.0835888283564505,1,695,2,368.75,1.1356494125569416]
> [1,789,1,357.25,1.03450938027956,1,789,2,410.0,1.0284221852702604]
> [1,815,1,216.5,1.1702270938111008,1,815,2,150.5,1.3057281471249382]
> [1,827,1,271.75,1.1046890134130438,1,827,2,424.75,1.1653198631238286]
> [1,1041,1,382.5,1.284808399803008,1,1041,2,424.75,1.000577271456812]
> [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733]
> [1,1569,1,212.0,1.630213519639535,1,1569,2,239.25,1.2641513267800557]
> [1,1623,1,338.25,1.1285483279713715,1,1623,2,261.3,1.2717809002195564]
> [1,2581,1,448.5,1.060429041250449,1,2581,2,476.25,1.0362984739390064]
> [1,2705,1,246.25,1.0120308357959693,1,2705,2,294.7,1.0742134101583702]
> [1,3131,1,393.75,1.0037613982687346,1,3131,2,480.5,1.0669144981482768]
> [1,3291,1,374.5,1.195189833087008,1,3291,2,265.25,1.572972106948466]
> [1,3687,1,279.75,1.4260909081999698,1,3687,2,157.25,1.4534340882531784]
> [1,4955,1,495.25,1.0318296151625301,1,4955,2,322.5,1.1693842343776149]
> [1,5627,1,282.75,1.5657032366359889,1,5627,2,297.5,1.2084286841430678]
> [1,7017,1,175.5,1.0427454215644427,1,7017,2,321.3,1.0183356932936254]
> [1,7317,1,366.3,1.025466403613547,1,7317,2,378.0,1.2172513189920555]
> [1,7569,1,430.5,1.0874396852180854,1,7569,2,360.25,1.047005559314515]
> [1,7999,1,166.25,1.7924231710846223,1,7999,2,375.3,1.008092263550718]
> [1,8319,1,306.75,1.1615378040478215,1,8319,2,276.0,1.1420996385609428]
> [1,8443,1,327.75,1.256718374192724,1,8443,2,332.5,1.0044167259988928]
> [1,8583,1,319.5,1.024108893111539,1,8583,2,310.25,1.2358813775861328]
> [1,8591,1,398.0,1.1478168692042447,1,8591,2,355.75,1.0024472149348966]
> [1,8611,1,300.5,1.5191545184147954,1,8611,2,243.75,1.2342122780960432]
> [1,9081,1,367.0,1.0878932141280895,1,9081,2,435.0,1.0330530776324107]
> [1,9357,1,351.7,1.1902922622025887,1,9357,2,427.0,1.0438583026358363]
> [1,9449,1,406.25,1.0183183104803557,1,9449,2,175.0,1.0544779796296408]
> [1,9713,1,242.5,1.1035044355064203,1,9713,2,393.0,1.208474608738988]
> [1,9809,1,479.0,1.0189602512117633,1,9809,2,317.5,1.0614142074924882]
> [1,9993,1,417.75,1.0099832672435247,1,9993,2,204.5,1.552870745350107]
> [1,10127,1,239.75,1.0561770587198123,1,10127,2,359.25,1.1857980403742183]
> [1,11159,1,407.25,1.0785507154337637,1,11159,2,250.0,1.334757905639321]
> [1,11277,1,211.25,1.2615858275316627,1,11277,2,330.75,1.0808767951625093]
> [1,11937,1,344.5,1.085804026843784,1,11937,2,200.34,1.0638527063883725]
> [1,12373,1,387.75,1.1014904822941258,1,12373,2,306.0,1.0761744390394028]
> [1,12471,1,365.25,1.0607570183728479,1,12471,2,327.25,1.0547560580567852]
> [1,12625,1,279.0,1.3016560542373208,1,12625,2,443.25,1.0604958838068959]
> [1,12751,1,280.75,1.10833057888089,1,12751,2,369.3,1.3416504398884601]
> [1,12779,1,331.0,1.041690207320035,1,12779,2,359.0,1.028978056175258]
> [1,13077,1,367.7,1.345523904195734,1,13077,2,358.7,1.5132429058096555]
> [1,13191,1,260.25,1.063569632291568,1,13191,2,405.0,1.0197999172180061]
> [1,13561,1,335.25,1.2609616961776389,1,13561,2,240.0,1.0513604502245155]
> [1,13935,1,311.75,1.0399289695412326,1,13935,2,275.0,1.0367527180321774]
> [1,14687,1,358.0,1.4369356919381713,1,14687,2,187.0,1.5493631531474956]
> [1,14719,1,209.0,1.0411509639707628,1,14719,2,489.0,1.376616882800804]
> [1,15345,1,148.5,1.5295784035794024,1,15345,2,246.5,1.5087987747231526]
> [1,15427,1,482.75,1.0124238928335043,1,15427,2,333.25,1.2724770126308678]
> [1,15647,1,201.66,1.2857931876095743,1

[jira] [Reopened] (SPARK-13860) TPCDS query 39 returns wrong results compared to TPC official result set

2020-10-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-13860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reopened SPARK-13860:
-

> TPCDS query 39 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13860
> URL: https://issues.apache.org/jira/browse/SPARK-13860
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.1, 2.2.0
>Reporter: JESSE CHEN
>Assignee: Leanken.Lin
>Priority: Major
>  Labels: bulk-closed, tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 39 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> q39a - 3 extra rows in SparkSQL output (eg. 
> [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733])   ;  q39b 
> - 3 extra rows in SparkSQL output (eg. 
> [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733])
> Actual results 39a:
> {noformat}
> [1,265,1,324.75,1.2438391781531353,1,265,2,329.0,1.0151581328149208]
> [1,363,1,499.5,1.031941572270649,1,363,2,321.0,1.1411766752007977]
> [1,679,1,373.75,1.0955498064867504,1,679,2,417.5,1.042970994259454]
> [1,695,1,450.75,1.0835888283564505,1,695,2,368.75,1.1356494125569416]
> [1,789,1,357.25,1.03450938027956,1,789,2,410.0,1.0284221852702604]
> [1,815,1,216.5,1.1702270938111008,1,815,2,150.5,1.3057281471249382]
> [1,827,1,271.75,1.1046890134130438,1,827,2,424.75,1.1653198631238286]
> [1,1041,1,382.5,1.284808399803008,1,1041,2,424.75,1.000577271456812]
> [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733]
> [1,1569,1,212.0,1.630213519639535,1,1569,2,239.25,1.2641513267800557]
> [1,1623,1,338.25,1.1285483279713715,1,1623,2,261.3,1.2717809002195564]
> [1,2581,1,448.5,1.060429041250449,1,2581,2,476.25,1.0362984739390064]
> [1,2705,1,246.25,1.0120308357959693,1,2705,2,294.7,1.0742134101583702]
> [1,3131,1,393.75,1.0037613982687346,1,3131,2,480.5,1.0669144981482768]
> [1,3291,1,374.5,1.195189833087008,1,3291,2,265.25,1.572972106948466]
> [1,3687,1,279.75,1.4260909081999698,1,3687,2,157.25,1.4534340882531784]
> [1,4955,1,495.25,1.0318296151625301,1,4955,2,322.5,1.1693842343776149]
> [1,5627,1,282.75,1.5657032366359889,1,5627,2,297.5,1.2084286841430678]
> [1,7017,1,175.5,1.0427454215644427,1,7017,2,321.3,1.0183356932936254]
> [1,7317,1,366.3,1.025466403613547,1,7317,2,378.0,1.2172513189920555]
> [1,7569,1,430.5,1.0874396852180854,1,7569,2,360.25,1.047005559314515]
> [1,7999,1,166.25,1.7924231710846223,1,7999,2,375.3,1.008092263550718]
> [1,8319,1,306.75,1.1615378040478215,1,8319,2,276.0,1.1420996385609428]
> [1,8443,1,327.75,1.256718374192724,1,8443,2,332.5,1.0044167259988928]
> [1,8583,1,319.5,1.024108893111539,1,8583,2,310.25,1.2358813775861328]
> [1,8591,1,398.0,1.1478168692042447,1,8591,2,355.75,1.0024472149348966]
> [1,8611,1,300.5,1.5191545184147954,1,8611,2,243.75,1.2342122780960432]
> [1,9081,1,367.0,1.0878932141280895,1,9081,2,435.0,1.0330530776324107]
> [1,9357,1,351.7,1.1902922622025887,1,9357,2,427.0,1.0438583026358363]
> [1,9449,1,406.25,1.0183183104803557,1,9449,2,175.0,1.0544779796296408]
> [1,9713,1,242.5,1.1035044355064203,1,9713,2,393.0,1.208474608738988]
> [1,9809,1,479.0,1.0189602512117633,1,9809,2,317.5,1.0614142074924882]
> [1,9993,1,417.75,1.0099832672435247,1,9993,2,204.5,1.552870745350107]
> [1,10127,1,239.75,1.0561770587198123,1,10127,2,359.25,1.1857980403742183]
> [1,11159,1,407.25,1.0785507154337637,1,11159,2,250.0,1.334757905639321]
> [1,11277,1,211.25,1.2615858275316627,1,11277,2,330.75,1.0808767951625093]
> [1,11937,1,344.5,1.085804026843784,1,11937,2,200.34,1.0638527063883725]
> [1,12373,1,387.75,1.1014904822941258,1,12373,2,306.0,1.0761744390394028]
> [1,12471,1,365.25,1.0607570183728479,1,12471,2,327.25,1.0547560580567852]
> [1,12625,1,279.0,1.3016560542373208,1,12625,2,443.25,1.0604958838068959]
> [1,12751,1,280.75,1.10833057888089,1,12751,2,369.3,1.3416504398884601]
> [1,12779,1,331.0,1.041690207320035,1,12779,2,359.0,1.028978056175258]
> [1,13077,1,367.7,1.345523904195734,1,13077,2,358.7,1.5132429058096555]
> [1,13191,1,260.25,1.063569632291568,1,13191,2,405.0,1.0197999172180061]
> [1,13561,1,335.25,1.2609616961776389,1,13561,2,240.0,1.0513604502245155]
> [1,13935,1,311.75,1.0399289695412326,1,13935,2,275.0,1.0367527180321774]
> [1,14687,1,358.0,1.4369356919381713,1,14687,2,187.0,1.5493631531474956]
> [1,14719,1,209.0,1.0411509639707628,1,14719,2,489.0,1.376616882800804]
> [1,15345,1,148.5,1.5295784035794024,1,15345,2,246.5,1.5087987747231526]
> [1,15427,1,482.75,1.0124238928335043,1,15427,2,333.25,1.2724770126308678]
> [1,15647,1,201.66,1.2857931876095743,1,15647,2,249.25,1.3648172990142

[jira] [Resolved] (SPARK-13860) TPCDS query 39 returns wrong results compared to TPC official result set

2020-10-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-13860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-13860.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

> TPCDS query 39 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13860
> URL: https://issues.apache.org/jira/browse/SPARK-13860
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.1, 2.2.0
>Reporter: JESSE CHEN
>Assignee: Leanken.Lin
>Priority: Major
>  Labels: bulk-closed, tpcds-result-mismatch
> Fix For: 3.1.0
>
>
> Testing Spark SQL using TPC queries. Query 39 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> q39a - 3 extra rows in SparkSQL output (eg. 
> [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733])   ;  q39b 
> - 3 extra rows in SparkSQL output (eg. 
> [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733])
> Actual results 39a:
> {noformat}
> [1,265,1,324.75,1.2438391781531353,1,265,2,329.0,1.0151581328149208]
> [1,363,1,499.5,1.031941572270649,1,363,2,321.0,1.1411766752007977]
> [1,679,1,373.75,1.0955498064867504,1,679,2,417.5,1.042970994259454]
> [1,695,1,450.75,1.0835888283564505,1,695,2,368.75,1.1356494125569416]
> [1,789,1,357.25,1.03450938027956,1,789,2,410.0,1.0284221852702604]
> [1,815,1,216.5,1.1702270938111008,1,815,2,150.5,1.3057281471249382]
> [1,827,1,271.75,1.1046890134130438,1,827,2,424.75,1.1653198631238286]
> [1,1041,1,382.5,1.284808399803008,1,1041,2,424.75,1.000577271456812]
> [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733]
> [1,1569,1,212.0,1.630213519639535,1,1569,2,239.25,1.2641513267800557]
> [1,1623,1,338.25,1.1285483279713715,1,1623,2,261.3,1.2717809002195564]
> [1,2581,1,448.5,1.060429041250449,1,2581,2,476.25,1.0362984739390064]
> [1,2705,1,246.25,1.0120308357959693,1,2705,2,294.7,1.0742134101583702]
> [1,3131,1,393.75,1.0037613982687346,1,3131,2,480.5,1.0669144981482768]
> [1,3291,1,374.5,1.195189833087008,1,3291,2,265.25,1.572972106948466]
> [1,3687,1,279.75,1.4260909081999698,1,3687,2,157.25,1.4534340882531784]
> [1,4955,1,495.25,1.0318296151625301,1,4955,2,322.5,1.1693842343776149]
> [1,5627,1,282.75,1.5657032366359889,1,5627,2,297.5,1.2084286841430678]
> [1,7017,1,175.5,1.0427454215644427,1,7017,2,321.3,1.0183356932936254]
> [1,7317,1,366.3,1.025466403613547,1,7317,2,378.0,1.2172513189920555]
> [1,7569,1,430.5,1.0874396852180854,1,7569,2,360.25,1.047005559314515]
> [1,7999,1,166.25,1.7924231710846223,1,7999,2,375.3,1.008092263550718]
> [1,8319,1,306.75,1.1615378040478215,1,8319,2,276.0,1.1420996385609428]
> [1,8443,1,327.75,1.256718374192724,1,8443,2,332.5,1.0044167259988928]
> [1,8583,1,319.5,1.024108893111539,1,8583,2,310.25,1.2358813775861328]
> [1,8591,1,398.0,1.1478168692042447,1,8591,2,355.75,1.0024472149348966]
> [1,8611,1,300.5,1.5191545184147954,1,8611,2,243.75,1.2342122780960432]
> [1,9081,1,367.0,1.0878932141280895,1,9081,2,435.0,1.0330530776324107]
> [1,9357,1,351.7,1.1902922622025887,1,9357,2,427.0,1.0438583026358363]
> [1,9449,1,406.25,1.0183183104803557,1,9449,2,175.0,1.0544779796296408]
> [1,9713,1,242.5,1.1035044355064203,1,9713,2,393.0,1.208474608738988]
> [1,9809,1,479.0,1.0189602512117633,1,9809,2,317.5,1.0614142074924882]
> [1,9993,1,417.75,1.0099832672435247,1,9993,2,204.5,1.552870745350107]
> [1,10127,1,239.75,1.0561770587198123,1,10127,2,359.25,1.1857980403742183]
> [1,11159,1,407.25,1.0785507154337637,1,11159,2,250.0,1.334757905639321]
> [1,11277,1,211.25,1.2615858275316627,1,11277,2,330.75,1.0808767951625093]
> [1,11937,1,344.5,1.085804026843784,1,11937,2,200.34,1.0638527063883725]
> [1,12373,1,387.75,1.1014904822941258,1,12373,2,306.0,1.0761744390394028]
> [1,12471,1,365.25,1.0607570183728479,1,12471,2,327.25,1.0547560580567852]
> [1,12625,1,279.0,1.3016560542373208,1,12625,2,443.25,1.0604958838068959]
> [1,12751,1,280.75,1.10833057888089,1,12751,2,369.3,1.3416504398884601]
> [1,12779,1,331.0,1.041690207320035,1,12779,2,359.0,1.028978056175258]
> [1,13077,1,367.7,1.345523904195734,1,13077,2,358.7,1.5132429058096555]
> [1,13191,1,260.25,1.063569632291568,1,13191,2,405.0,1.0197999172180061]
> [1,13561,1,335.25,1.2609616961776389,1,13561,2,240.0,1.0513604502245155]
> [1,13935,1,311.75,1.0399289695412326,1,13935,2,275.0,1.0367527180321774]
> [1,14687,1,358.0,1.4369356919381713,1,14687,2,187.0,1.5493631531474956]
> [1,14719,1,209.0,1.0411509639707628,1,14719,2,489.0,1.376616882800804]
> [1,15345,1,148.5,1.5295784035794024,1,15345,2,246.5,1.5087987747231526]
> [1,15427,1,482.75,1.0124238928335043,1,15427,2,333.25,1.2724770126308678]
> 

[jira] [Assigned] (SPARK-33134) Incorrect nested complex JSON fields raise an exception

2020-10-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33134:


Assignee: (was: Apache Spark)

> Incorrect nested complex JSON fields raise an exception
> ---
>
> Key: SPARK-33134
> URL: https://issues.apache.org/jira/browse/SPARK-33134
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The code below:
> {code:scala}
> val pokerhand_raw = Seq("""[{"cards": [11], "playerId": 
> 583651}]""").toDF("events")
> val event = new StructType()
>   .add("playerId", LongType)
>   .add("cards", ArrayType(
> new StructType()
>   .add("id", LongType)
>   .add("rank", StringType)))
> val pokerhand_events = pokerhand_raw
>   .select(explode(from_json($"events", ArrayType(event))).as("event"))
> pokerhand_events.show
> {code}
> throws an exception in PERMISSIVE mode (the default):
> {code:java}
> Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to 
> org.apache.spark.sql.catalyst.util.ArrayData
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461)
>   at 
> org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313)
>   at 
> org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
> {code}
> The same works in Spark 2.4:
> {code:scala}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.4.6
>   /_/
> Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265)
> ...
> scala> pokerhand_events.show()
> +-+
> |event|
> +-+
> +-+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33134) Incorrect nested complex JSON fields raise an exception

2020-10-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33134:


Assignee: Apache Spark

> Incorrect nested complex JSON fields raise an exception
> ---
>
> Key: SPARK-33134
> URL: https://issues.apache.org/jira/browse/SPARK-33134
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> The code below:
> {code:scala}
> val pokerhand_raw = Seq("""[{"cards": [11], "playerId": 
> 583651}]""").toDF("events")
> val event = new StructType()
>   .add("playerId", LongType)
>   .add("cards", ArrayType(
> new StructType()
>   .add("id", LongType)
>   .add("rank", StringType)))
> val pokerhand_events = pokerhand_raw
>   .select(explode(from_json($"events", ArrayType(event))).as("event"))
> pokerhand_events.show
> {code}
> throws an exception in PERMISSIVE mode (the default):
> {code:java}
> Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to 
> org.apache.spark.sql.catalyst.util.ArrayData
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461)
>   at 
> org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313)
>   at 
> org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
> {code}
> The same works in Spark 2.4:
> {code:scala}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.4.6
>   /_/
> Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265)
> ...
> scala> pokerhand_events.show()
> +-+
> |event|
> +-+
> +-+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33134) Incorrect nested complex JSON fields raise an exception

2020-10-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213109#comment-17213109
 ] 

Apache Spark commented on SPARK-33134:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30031

> Incorrect nested complex JSON fields raise an exception
> ---
>
> Key: SPARK-33134
> URL: https://issues.apache.org/jira/browse/SPARK-33134
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The code below:
> {code:scala}
> val pokerhand_raw = Seq("""[{"cards": [11], "playerId": 
> 583651}]""").toDF("events")
> val event = new StructType()
>   .add("playerId", LongType)
>   .add("cards", ArrayType(
> new StructType()
>   .add("id", LongType)
>   .add("rank", StringType)))
> val pokerhand_events = pokerhand_raw
>   .select(explode(from_json($"events", ArrayType(event))).as("event"))
> pokerhand_events.show
> {code}
> throws an exception in PERMISSIVE mode (the default):
> {code:java}
> Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to 
> org.apache.spark.sql.catalyst.util.ArrayData
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461)
>   at 
> org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313)
>   at 
> org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
> {code}
> The same works in Spark 2.4:
> {code:scala}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.4.6
>   /_/
> Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265)
> ...
> scala> pokerhand_events.show()
> +-+
> |event|
> +-+
> +-+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33134) Incorrect nested complex JSON fields raise an exception

2020-10-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213108#comment-17213108
 ] 

Apache Spark commented on SPARK-33134:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30031

> Incorrect nested complex JSON fields raise an exception
> ---
>
> Key: SPARK-33134
> URL: https://issues.apache.org/jira/browse/SPARK-33134
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The code below:
> {code:scala}
> val pokerhand_raw = Seq("""[{"cards": [11], "playerId": 
> 583651}]""").toDF("events")
> val event = new StructType()
>   .add("playerId", LongType)
>   .add("cards", ArrayType(
> new StructType()
>   .add("id", LongType)
>   .add("rank", StringType)))
> val pokerhand_events = pokerhand_raw
>   .select(explode(from_json($"events", ArrayType(event))).as("event"))
> pokerhand_events.show
> {code}
> throws an exception in PERMISSIVE mode (the default):
> {code:java}
> Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to 
> org.apache.spark.sql.catalyst.util.ArrayData
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461)
>   at 
> org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313)
>   at 
> org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
> {code}
> The same works in Spark 2.4:
> {code:scala}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.4.6
>   /_/
> Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265)
> ...
> scala> pokerhand_events.show()
> +-+
> |event|
> +-+
> +-+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33132) The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 'NaN Undefined'

2020-10-13 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33132:

Fix Version/s: (was: 3.0.2)

> The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 
> 'NaN Undefined'
> -
>
> Key: SPARK-33132
> URL: https://issues.apache.org/jira/browse/SPARK-33132
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: akiyamaneko
>Priority: Minor
> Attachments: Stage Summary shows NaN undefined.png
>
>
> Spark Version: 3.0.1
> Description: The 'Shuffle Read Size / Records' field in Stage Summary metrics 
> was shown as '*NaN Undefined*' when the *readBytes* value is negative, as 
> the attachment shows.
> *curl 
> http:/hadoop001:18081/api/v1/applications/application_1601774913550_0225/stages/2/0/taskSummary*
> {
>  "quantiles" : [ 0.05, 0.25, 0.5, 0.75, 0.95 ],
>  "executorDeserializeTime" : [ 7.0, 357.0, 390.0, 484.0, 492.0 ],
>   ...
>  "shuffleReadMetrics" :
> { {color:#de350b}"*readBytes*"{color} : [ {color:#de350b}*-2.0*{color}, 
> 1775984.0, 1779759.0, 1781727.0, 1788426.0 ], "readRecords" : [ 2001.0, 
> 2002.0, 2002.0, 2002.0, 2002.0 ],  ... }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33125) Improve the error when Lead and Lag are not allowed to specify window frame

2020-10-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33125:
---

Assignee: jiaan.geng

> Improve the error when Lead and Lag are not allowed to specify window frame
> ---
>
> Key: SPARK-33125
> URL: https://issues.apache.org/jira/browse/SPARK-33125
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>
> Unlike PostgreSQL, other data sources (for example Vertica, Oracle, 
> Redshift, MySQL, Presto) do not allow specifying a window frame for the Lead 
> and Lag functions.
> But the current error message is not clear enough.
> {code:java}
> Window Frame $f must match the required frame
> {code}
> The following error message is better.
> {code:java}
> Cannot specify window frame for lead function
> {code}
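
For context, a hedged example of the kind of query that trips this check 
(spark-shell; the exact message it fails with is whatever wording the fix 
settles on):
{code:scala}
// lead()/lag() define their own frame, so an explicit ROWS frame is expected
// to be rejected during analysis.
spark.sql("""
  SELECT x,
         lead(x, 1) OVER (ORDER BY x ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS nxt
  FROM VALUES (1), (2), (3) AS t(x)
""").show()
{code}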



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33125) Improve the error when Lead and Lag are not allowed to specify window frame

2020-10-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33125.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30021
[https://github.com/apache/spark/pull/30021]

> Improve the error when Lead and Lag are not allowed to specify window frame
> ---
>
> Key: SPARK-33125
> URL: https://issues.apache.org/jira/browse/SPARK-33125
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.1.0
>
>
> Unlike PostgreSQL, other data sources (for example Vertica, Oracle, 
> Redshift, MySQL, Presto) do not allow specifying a window frame for the Lead 
> and Lag functions.
> But the current error message is not clear enough.
> {code:java}
> Window Frame $f must match the required frame
> {code}
> The following error message is better.
> {code:java}
> Cannot specify window frame for lead function
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33134) Incorrect nested complex JSON fields raise an exception

2020-10-13 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33134:
--

 Summary: Incorrect nested complex JSON fields raise an exception
 Key: SPARK-33134
 URL: https://issues.apache.org/jira/browse/SPARK-33134
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.2, 3.1.0
Reporter: Maxim Gekk


The code below:
{code:scala}
val pokerhand_raw = Seq("""[{"cards": [11], "playerId": 
583651}]""").toDF("events")
val event = new StructType()
  .add("playerId", LongType)
  .add("cards", ArrayType(
new StructType()
  .add("id", LongType)
  .add("rank", StringType)))
val pokerhand_events = pokerhand_raw
  .select(explode(from_json($"events", ArrayType(event))).as("event"))
pokerhand_events.show
{code}
throws an exception in PERMISSIVE mode (the default):
{code:java}
Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to 
org.apache.spark.sql.catalyst.util.ArrayData
  at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
  at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48)
  at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195)
  at 
org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560)
  at 
org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597)
  at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461)
  at 
org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313)
  at 
org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
{code}
The same works in Spark 2.4:
{code:scala}
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.6
  /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265)
...
scala> pokerhand_events.show()
+-+
|event|
+-+
+-+
{code}
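
A possible interim workaround, reusing pokerhand_raw from the snippet above (a 
sketch for spark-shell, assuming the goal is just to parse this particular 
payload; in an application you would also need import spark.implicits._ for 
the $ syntax):
{code:scala}
import org.apache.spark.sql.functions.{explode, from_json}
import org.apache.spark.sql.types._

// Declare "cards" as an array of longs so the schema matches the sample data.
val eventMatching = new StructType()
  .add("playerId", LongType)
  .add("cards", ArrayType(LongType))

val parsed = pokerhand_raw
  .select(explode(from_json($"events", ArrayType(eventMatching))).as("event"))
parsed.show(false)
{code}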



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33081) Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (DB2 dialect)

2020-10-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33081.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29972
[https://github.com/apache/spark/pull/29972]

> Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of 
> columns (DB2 dialect)
> --
>
> Key: SPARK-33081
> URL: https://issues.apache.org/jira/browse/SPARK-33081
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.1.0
>
>
> Override the default SQL strings for:
> * ALTER TABLE UPDATE COLUMN TYPE
> * ALTER TABLE UPDATE COLUMN NULLABILITY
> in the DB2 JDBC dialect, according to the official documentation.
> Write DB2 integration tests for JDBC.
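
For reference, a sketch of the statements such overrides would presumably 
emit, based on the public Db2 ALTER TABLE documentation; the table, column, 
and type names are placeholders, and the exact strings generated by the 
dialect may differ:
{code:scala}
// Placeholder identifiers; only the ALTER COLUMN clauses matter here.
val updateColumnType =
  "ALTER TABLE my_schema.my_table ALTER COLUMN my_col SET DATA TYPE DOUBLE"
val setNotNull =
  "ALTER TABLE my_schema.my_table ALTER COLUMN my_col SET NOT NULL"
val dropNotNull =
  "ALTER TABLE my_schema.my_table ALTER COLUMN my_col DROP NOT NULL"
{code}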



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33081) Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (DB2 dialect)

2020-10-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33081:
---

Assignee: Huaxin Gao

> Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of 
> columns (DB2 dialect)
> --
>
> Key: SPARK-33081
> URL: https://issues.apache.org/jira/browse/SPARK-33081
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>
> Override the default SQL strings for:
> * ALTER TABLE UPDATE COLUMN TYPE
> * ALTER TABLE UPDATE COLUMN NULLABILITY
> in the DB2 JDBC dialect, according to the official documentation.
> Write DB2 integration tests for JDBC.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33133) History server fails when loading invalid rolling event logs

2020-10-13 Thread Adam Binford (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Binford updated SPARK-33133:
-
Description: 
We have run into an issue where our history server fails to load new 
applications and, when restarted, fails to load any applications at all. This 
happens when it encounters invalid rolling event log files. We encounter this 
with long-running streaming applications. There seem to be two issues here 
that lead to problems:
 * It looks like our long-running streaming applications' event log directories are 
being cleaned up. The next time the application logs event data, it recreates 
the event log directory but without recreating the "appstatus" file. I don't 
know the full extent of this behavior or if something "wrong" is happening here.
 * The history server then reads this new folder and throws an exception 
because the "appstatus" file doesn't exist in the rolling event log folder. 
This exception breaks the entire listing process, so no new applications will 
be read, and if the server is restarted, no applications at all will be read.

There seem to be a couple of ways to go about fixing this, and I'm curious to 
hear from anyone who knows more about how the history server works, 
specifically with rolling event logs:
 * Don't completely fail the check for new applications if one bad rolling event 
log folder is encountered. This seems like the simplest fix and makes sense to 
me; the check already ignores a few other kinds of errors. It doesn't 
necessarily fix the underlying issue that leads to this happening, though.
 * Figure out why the in-progress event log folder is being deleted and make 
sure that doesn't happen. Maybe this is supposed to happen? Or maybe we should 
not delete the top-level folder and only delete event log files within the 
folders? Again, I don't know the exact current behavior here.
 * When writing new event log data, make sure the folder and appstatus file 
exist every time, creating them again if not.

Here's the stack trace we encounter when this happens, from 3.0.1 with a couple 
extra MRs backported that I hoped would fix the issue:

{{2020-10-13 12:10:31,751 ERROR history.FsHistoryProvider: Exception in 
checking for event log updates2020-10-13 12:10:31,751 ERROR 
history.FsHistoryProvider: Exception in checking for event log 
updatesjava.lang.IllegalArgumentException: requirement failed: Log directory 
must contain an appstatus file! at scala.Predef$.require(Predef.scala:281) at 
org.apache.spark.deploy.history.RollingEventLogFilesFileReader.files$lzycompute(EventLogFileReaders.scala:214)
 at 
org.apache.spark.deploy.history.RollingEventLogFilesFileReader.files(EventLogFileReaders.scala:211)
 at 
org.apache.spark.deploy.history.RollingEventLogFilesFileReader.eventLogFiles$lzycompute(EventLogFileReaders.scala:221)
 at 
org.apache.spark.deploy.history.RollingEventLogFilesFileReader.eventLogFiles(EventLogFileReaders.scala:220)
 at 
org.apache.spark.deploy.history.RollingEventLogFilesFileReader.lastEventLogFile(EventLogFileReaders.scala:272)
 at 
org.apache.spark.deploy.history.RollingEventLogFilesFileReader.fileSizeForLastIndex(EventLogFileReaders.scala:240)
 at 
org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7(FsHistoryProvider.scala:524)
 at 
org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7$adapted(FsHistoryProvider.scala:466)
 at 
scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:256)
 at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at 
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at 
scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255) at 
scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249) at 
scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108) at 
scala.collection.TraversableLike.filter(TraversableLike.scala:347) at 
scala.collection.TraversableLike.filter$(TraversableLike.scala:347) at 
scala.collection.AbstractTraversable.filter(Traversable.scala:108) at 
org.apache.spark.deploy.history.FsHistoryProvider.checkForLogs(FsHistoryProvider.scala:466)
 at 
org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$startPolling$3(FsHistoryProvider.scala:287)
 at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1302) at 
org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$getRunner$1(FsHistoryProvider.scala:210)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
 at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
 at 
jav
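
For operators hitting this, a hedged diagnostic sketch that lists rolling 
event log directories with no appstatus marker; the eventlog_v2_ and 
appstatus_ name prefixes are assumptions about the rolling log layout, and the 
log directory path is a placeholder:
{code:scala}
import org.apache.hadoop.fs.Path

// Placeholder: point this at the cluster's spark.eventLog.dir.
val eventLogDir = new Path("hdfs:///spark-events")
val fs = eventLogDir.getFileSystem(spark.sparkContext.hadoopConfiguration)

// Rolling event log directories that contain no appstatus file are the ones
// the history server currently chokes on while listing applications.
fs.listStatus(eventLogDir)
  .filter(s => s.isDirectory && s.getPath.getName.startsWith("eventlog_v2_"))
  .filter(d => !fs.listStatus(d.getPath).exists(_.getPath.getName.startsWith("appstatus_")))
  .foreach(d => println(s"missing appstatus: ${d.getPath}"))
{code}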

[jira] [Resolved] (SPARK-32858) UnwrapCastInBinaryComparison: support other numeric types

2020-10-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32858.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29792
[https://github.com/apache/spark/pull/29792]

> UnwrapCastInBinaryComparison: support other numeric types
> -
>
> Key: SPARK-32858
> URL: https://issues.apache.org/jira/browse/SPARK-32858
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.1.0
>
>
> After SPARK-24994 is done, we need to follow-up and support more types other 
> than integral, e.g., float, double, decimal.
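
As a rough illustration of the kind of predicate this rule targets (a sketch; 
the table is made up, and the precise rewrites for float/double/decimal 
literals are exactly what this follow-up covers):
{code:scala}
// With a narrow column compared through a cast, e.g. t(s SMALLINT) and
//   WHERE CAST(s AS DOUBLE) > 3.0
// unwrapping the cast into an equivalent predicate on s itself (roughly s > 3)
// lets the comparison be pushed down to the source instead of casting per row.
spark.sql("CREATE TABLE t (s SMALLINT) USING parquet")
spark.sql("SELECT * FROM t WHERE CAST(s AS DOUBLE) > 3.0").explain()
{code}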



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32858) UnwrapCastInBinaryComparison: support other numeric types

2020-10-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32858:
---

Assignee: Chao Sun

> UnwrapCastInBinaryComparison: support other numeric types
> -
>
> Key: SPARK-32858
> URL: https://issues.apache.org/jira/browse/SPARK-32858
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>
> After SPARK-24994 is done, we need to follow-up and support more types other 
> than integral, e.g., float, double, decimal.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33133) History server fails when loading invalid rolling event logs

2020-10-13 Thread Adam Binford (Jira)
Adam Binford created SPARK-33133:


 Summary: History server fails when loading invalid rolling event 
logs
 Key: SPARK-33133
 URL: https://issues.apache.org/jira/browse/SPARK-33133
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.1
Reporter: Adam Binford


We have run into an issue where our history server fails to load new 
applications and, when restarted, fails to load any applications at all. This 
happens when it encounters invalid rolling event log files. We encounter this 
with long-running streaming applications. There seem to be two issues here 
that lead to problems:
 * It looks like our long-running streaming applications' event log directories are 
being cleaned up. The next time the application logs event data, it recreates 
the event log directory but without recreating the "appstatus" file. I don't 
know the full extent of this behavior or if something "wrong" is happening here.
 * The history server then reads this new folder and throws an exception 
because the "appstatus" file doesn't exist in the rolling event log folder. 
This exception breaks the entire listing process, so no new applications will 
be read, and if the server is restarted, no applications at all will be read.

There seem to be a couple of ways to go about fixing this, and I'm curious to 
hear from anyone who knows more about how the history server works, 
specifically with rolling event logs:
 * Don't completely fail the check for new applications if one bad rolling event 
log folder is encountered. This seems like the simplest fix and makes sense to 
me; the check already ignores a few other kinds of errors.
 * Figure out why the in-progress event log folder is being deleted and make 
sure that doesn't happen. Maybe this is supposed to happen? Or maybe we should 
not delete the top-level folder and only delete event log files within the 
folders? Again, I don't know the exact current behavior here.
 * When writing new event log data, make sure the folder and appstatus file 
exist every time, creating them again if not.

Here's the stack trace we encounter when this happens, from 3.0.1 with a couple 
extra MRs backported that I hoped would fix the issue:

{{2020-10-13 12:10:31,751 ERROR history.FsHistoryProvider: Exception in 
checking for event log updates2020-10-13 12:10:31,751 ERROR 
history.FsHistoryProvider: Exception in checking for event log 
updatesjava.lang.IllegalArgumentException: requirement failed: Log directory 
must contain an appstatus file! at scala.Predef$.require(Predef.scala:281) at 
org.apache.spark.deploy.history.RollingEventLogFilesFileReader.files$lzycompute(EventLogFileReaders.scala:214)
 at 
org.apache.spark.deploy.history.RollingEventLogFilesFileReader.files(EventLogFileReaders.scala:211)
 at 
org.apache.spark.deploy.history.RollingEventLogFilesFileReader.eventLogFiles$lzycompute(EventLogFileReaders.scala:221)
 at 
org.apache.spark.deploy.history.RollingEventLogFilesFileReader.eventLogFiles(EventLogFileReaders.scala:220)
 at 
org.apache.spark.deploy.history.RollingEventLogFilesFileReader.lastEventLogFile(EventLogFileReaders.scala:272)
 at 
org.apache.spark.deploy.history.RollingEventLogFilesFileReader.fileSizeForLastIndex(EventLogFileReaders.scala:240)
 at 
org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7(FsHistoryProvider.scala:524)
 at 
org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7$adapted(FsHistoryProvider.scala:466)
 at 
scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:256)
 at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at 
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at 
scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255) at 
scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249) at 
scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108) at 
scala.collection.TraversableLike.filter(TraversableLike.scala:347) at 
scala.collection.TraversableLike.filter$(TraversableLike.scala:347) at 
scala.collection.AbstractTraversable.filter(Traversable.scala:108) at 
org.apache.spark.deploy.history.FsHistoryProvider.checkForLogs(FsHistoryProvider.scala:466)
 at 
org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$startPolling$3(FsHistoryProvider.scala:287)
 at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1302) at 
org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$getRunner$1(FsHistoryProvider.scala:210)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
 at 
java.ut

[jira] [Resolved] (SPARK-33115) `kvstore` and `unsafe` doc tasks fail

2020-10-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33115.
--
Fix Version/s: 3.1.0
   3.0.2
 Assignee: Denis Pyshev
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/30007

> `kvstore` and `unsafe` doc tasks fail
> -
>
> Key: SPARK-33115
> URL: https://issues.apache.org/jira/browse/SPARK-33115
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Documentation
>Affects Versions: 3.1.0
>Reporter: Denis Pyshev
>Assignee: Denis Pyshev
>Priority: Minor
> Fix For: 3.0.2, 3.1.0
>
>
> `build/sbt publishLocal` task fails in two modules:
> {code:java}
> [error] stack trace is suppressed; run last kvstore / Compile / doc for the 
> full output
> [error] stack trace is suppressed; run last unsafe / Compile / doc for the 
> full output
> {code}
> {code:java}
>  sbt:spark-parent> kvstore/Compile/doc 
> [info] Main Java API documentation to 
> /home/gemelen/work/src/spark/common/kvstore/target/scala-2.12/api... 
> [error] 
> /home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
>   error: malformed HTML 
> [error]    * An alias class for the type 
> "ConcurrentHashMap, Boolean>", which is used 
> [error]    ^ 
> [error] 
> /home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
>   error: unknown tag: Object 
> [error]    * An alias class for the type 
> "ConcurrentHashMap, Boolean>", which is used 
> [error]   ^ 
> [error] 
> /home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
>   error: bad use of '>' 
> [error]    * An alias class for the type 
> "ConcurrentHashMap, Boolean>", which is used
> [error]   
>  ^
> {code}
> {code:java}
>  sbt:spark-parent> unsafe/Compile/doc 
> [info] Main Java API documentation to 
> /home/gemelen/work/src/spark/common/unsafe/target/scala-2.12/api... 
> [error] 
> /home/gemelen/work/src/spark/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:566:1:
>   error: malformed HTML 
> [error]    * Trims whitespaces (<= ASCII 32) from both ends of this string. 
> [error]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33132) The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 'NaN Undefined'

2020-10-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33132:


Assignee: Apache Spark

> The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 
> 'NaN Undefined'
> -
>
> Key: SPARK-33132
> URL: https://issues.apache.org/jira/browse/SPARK-33132
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: echohlne
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.0.2
>
> Attachments: Stage Summary shows NaN undefined.png
>
>
> Spark Version: 3.0.1
> Description: The 'Shuffle Read Size / Records' field in Stage Summary metrics 
> was shown as '*NaN Undefined*' when the *readBytes* value is negative, as 
> the attachment shows.
> *curl 
> http:/hadoop001:18081/api/v1/applications/application_1601774913550_0225/stages/2/0/taskSummary*
> {
>  "quantiles" : [ 0.05, 0.25, 0.5, 0.75, 0.95 ],
>  "executorDeserializeTime" : [ 7.0, 357.0, 390.0, 484.0, 492.0 ],
>   ...
>  "shuffleReadMetrics" :
> { {color:#de350b}"*readBytes*"{color} : [ {color:#de350b}*-2.0*{color}, 
> 1775984.0, 1779759.0, 1781727.0, 1788426.0 ], "readRecords" : [ 2001.0, 
> 2002.0, 2002.0, 2002.0, 2002.0 ],  ... }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33132) The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 'NaN Undefined'

2020-10-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213073#comment-17213073
 ] 

Apache Spark commented on SPARK-33132:
--

User 'akiyamaneko' has created a pull request for this issue:
https://github.com/apache/spark/pull/30030

> The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 
> 'NaN Undefined'
> -
>
> Key: SPARK-33132
> URL: https://issues.apache.org/jira/browse/SPARK-33132
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: echohlne
>Priority: Minor
> Fix For: 3.0.2
>
> Attachments: Stage Summary shows NaN undefined.png
>
>
> Spark Version: 3.0.1
> Description: The 'Shuffle Read Size / Records' field in Stage Summary metrics 
> was shown as '*NaN Undefined*' when the *readBytes* value is negative, as 
> the attachment shows.
> *curl 
> http:/hadoop001:18081/api/v1/applications/application_1601774913550_0225/stages/2/0/taskSummary*
> {
>  "quantiles" : [ 0.05, 0.25, 0.5, 0.75, 0.95 ],
>  "executorDeserializeTime" : [ 7.0, 357.0, 390.0, 484.0, 492.0 ],
>   ...
>  "shuffleReadMetrics" :
> { {color:#de350b}"*readBytes*"{color} : [ {color:#de350b}*-2.0*{color}, 
> 1775984.0, 1779759.0, 1781727.0, 1788426.0 ], "readRecords" : [ 2001.0, 
> 2002.0, 2002.0, 2002.0, 2002.0 ],  ... }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33132) The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 'NaN Undefined'

2020-10-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33132:


Assignee: (was: Apache Spark)

> The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 
> 'NaN Undefined'
> -
>
> Key: SPARK-33132
> URL: https://issues.apache.org/jira/browse/SPARK-33132
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: echohlne
>Priority: Minor
> Fix For: 3.0.2
>
> Attachments: Stage Summary shows NaN undefined.png
>
>
> Spark Version: 3.0.1
> Description: The 'Shuffle Read Size / Records' field in Stage Summary metrics 
> was shown as '*NaN Undefined*' when the *readBytes* value is negative, as 
> the attachment shows.
> *curl 
> http:/hadoop001:18081/api/v1/applications/application_1601774913550_0225/stages/2/0/taskSummary*
> {
>  "quantiles" : [ 0.05, 0.25, 0.5, 0.75, 0.95 ],
>  "executorDeserializeTime" : [ 7.0, 357.0, 390.0, 484.0, 492.0 ],
>   ...
>  "shuffleReadMetrics" :
> { {color:#de350b}"*readBytes*"{color} : [ {color:#de350b}*-2.0*{color}, 
> 1775984.0, 1779759.0, 1781727.0, 1788426.0 ], "readRecords" : [ 2001.0, 
> 2002.0, 2002.0, 2002.0, 2002.0 ],  ... }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33132) The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 'NaN Undefined'

2020-10-13 Thread echohlne (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

echohlne updated SPARK-33132:
-
Summary: The 'Shuffle Read Size / Records' field in Stage Summary metrics 
was shown as 'NaN Undefined'  (was: The 'Shuffle Read Size / Records' field in 
Stage Summary metrics was shown as 'NaN Undefind')

> The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 
> 'NaN Undefined'
> -
>
> Key: SPARK-33132
> URL: https://issues.apache.org/jira/browse/SPARK-33132
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: echohlne
>Priority: Minor
> Fix For: 3.0.2
>
> Attachments: Stage Summary shows NaN undefined.png
>
>
> Spark Version: 3.0.1
> Description: The 'Shuffle Read Size / Records' field in Stage Summary metrics 
> was shown as '*NaN Undefined*' when the *readBytes* value is negative, as 
> the attachment shows.
> *curl 
> http://hadoop001:18081/api/v1/applications/application_1601774913550_0225/stages/2/0/taskSummary*
> {
>  "quantiles" : [ 0.05, 0.25, 0.5, 0.75, 0.95 ],
>  "executorDeserializeTime" : [ 7.0, 357.0, 390.0, 484.0, 492.0 ],
>   ...
>  "shuffleReadMetrics" :
> { "*readBytes*" : [ *-2.0*, 1775984.0, 1779759.0, 1781727.0, 1788426.0 ], 
> "readRecords" : [ 2001.0, 2002.0, 2002.0, 2002.0, 2002.0 ], ... }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33132) The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 'NaN Undefind'

2020-10-13 Thread echohlne (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

echohlne updated SPARK-33132:
-
Description: 
Spark Version: 3.0.1

Description: The 'Shuffle Read Size / Records' field in Stage Summary metrics 
was shown as '*NaN Undefined*' when the *readBytes* value is negative, as the 
attachment shows.

*curl 
http://hadoop001:18081/api/v1/applications/application_1601774913550_0225/stages/2/0/taskSummary*

{
 "quantiles" : [ 0.05, 0.25, 0.5, 0.75, 0.95 ],
 "executorDeserializeTime" : [ 7.0, 357.0, 390.0, 484.0, 492.0 ],
  ...
 "shuffleReadMetrics" :

{ {color:#de350b}"*readBytes*"{color} : [ {color:#de350b}*-2.0*{color}, 
1775984.0, 1779759.0, 1781727.0, 1788426.0 ], "readRecords" : [ 2001.0, 2002.0, 
2002.0, 2002.0, 2002.0 ],  ... }

  was:
Spark Version: 3.0.1

Description: The 'Shuffle Read Size / Records' field in Stage Summary metrics 
was shown as '*NaN Undefined*' when the *readBytes* value is negative, as the 
attachment shows.

*curl 
http://hadoop001:18081/api/v1/applications/application_1601774913550_0225/stages/2/0/taskSummary*

{
 "quantiles" : [ 0.05, 0.25, 0.5, 0.75, 0.95 ],
 "executorDeserializeTime" : [ 7.0, 357.0, 390.0, 484.0, 492.0 ],
  ...
 "shuffleReadMetrics" :

{ "*readBytes*" : [ *-2.0*, 1775984.0, 1779759.0, 1781727.0, 1788426.0 ], 
"readRecords" : [ 2001.0, 2002.0, 2002.0, 2002.0, 2002.0 ],  ... }


> The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 
> 'NaN Undefind'
> 
>
> Key: SPARK-33132
> URL: https://issues.apache.org/jira/browse/SPARK-33132
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: echohlne
>Priority: Minor
> Fix For: 3.0.2
>
> Attachments: Stage Summary shows NaN undefined.png
>
>
> Spark Version: 3.0.1
> Description: The 'Shuffle Read Size / Records' field in Stage Summary metrics 
> was shown as '*NaN Undefined*' when the *readBytes* value is negative, as 
> the attachment shows.
> *curl 
> http://hadoop001:18081/api/v1/applications/application_1601774913550_0225/stages/2/0/taskSummary*
> {
>  "quantiles" : [ 0.05, 0.25, 0.5, 0.75, 0.95 ],
>  "executorDeserializeTime" : [ 7.0, 357.0, 390.0, 484.0, 492.0 ],
>   ...
>  "shuffleReadMetrics" :
> { "*readBytes*" : [ *-2.0*, 1775984.0, 1779759.0, 1781727.0, 1788426.0 ], 
> "readRecords" : [ 2001.0, 2002.0, 2002.0, 2002.0, 2002.0 ], ... }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33132) The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 'NaN Undefind'

2020-10-13 Thread echohlne (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

echohlne updated SPARK-33132:
-
Attachment: Stage Summary shows NaN undefined.png

> The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 
> 'NaN Undefind'
> 
>
> Key: SPARK-33132
> URL: https://issues.apache.org/jira/browse/SPARK-33132
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: echohlne
>Priority: Minor
> Fix For: 3.0.2
>
> Attachments: Stage Summary shows NaN undefined.png
>
>
> Spark Version: 3.0.1
> Description: The 'Shuffle Read Size / Records' field in Stage Summary metrics 
> was shown as '*NaN Undefined*' when the *readBytes* value is negative, as 
> the attachment shows.
> *curl 
> http://hadoop001:18081/api/v1/applications/application_1601774913550_0225/stages/2/0/taskSummary*
> {
>  "quantiles" : [ 0.05, 0.25, 0.5, 0.75, 0.95 ],
>  "executorDeserializeTime" : [ 7.0, 357.0, 390.0, 484.0, 492.0 ],
>   ...
>  "shuffleReadMetrics" :
> { "*readBytes*" : [ *-2.0*, 1775984.0, 1779759.0, 1781727.0, 1788426.0 ], 
> "readRecords" : [ 2001.0, 2002.0, 2002.0, 2002.0, 2002.0 ],  ... }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33132) The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 'NaN Undefind'

2020-10-13 Thread echohlne (Jira)
echohlne created SPARK-33132:


 Summary: The 'Shuffle Read Size / Records' field in Stage Summary 
metrics was shown as 'NaN Undefind'
 Key: SPARK-33132
 URL: https://issues.apache.org/jira/browse/SPARK-33132
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.1
Reporter: echohlne
 Fix For: 3.0.2


Spark Version: 3.0.1

Description: The 'Shuffle Read Size / Records' field in Stage Summary metrics 
was shown as '*NaN Undefined*' when the *readBytes* value is negative, as the 
attachment shows.

*curl 
http://hadoop001:18081/api/v1/applications/application_1601774913550_0225/stages/2/0/taskSummary*

{
 "quantiles" : [ 0.05, 0.25, 0.5, 0.75, 0.95 ],
 "executorDeserializeTime" : [ 7.0, 357.0, 390.0, 484.0, 492.0 ],
  ...
 "shuffleReadMetrics" :

{ "*readBytes*" : [ *-2.0*, 1775984.0, 1779759.0, 1781727.0, 1788426.0 ], 
"readRecords" : [ 2001.0, 2002.0, 2002.0, 2002.0, 2002.0 ],  ... }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33131) Fix grouping sets with having clause can not resolve qualified col name

2020-10-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213068#comment-17213068
 ] 

Apache Spark commented on SPARK-33131:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/30029

> Fix grouping sets with having clause can not resolve qualified col name
> ---
>
> Key: SPARK-33131
> URL: https://issues.apache.org/jira/browse/SPARK-33131
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>
> The method `ResolveAggregateFunctions.resolveFilterCondInAggregate` aims to 
> do two things:
> 1. resolve the expressions in the having clause.
> 2. push the extra aggregate expressions needed by the having clause into 
> `Aggregate`.
> However, we currently only handle case 2. If the having clause is resolved 
> successfully but no extra aggregate expression exists, we ignore the 
> resolution. Here is an example:
> {code:java}
> -- Works resolved by ResolveReferences
> select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having c1 = 
> 1
> -- Works because of the extra expression c1
> select c1 as c2 from values (1) as t1(c1) group by grouping sets(t1.c1) 
> having t1.c1 = 1
> -- Failed
> select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having 
> t1.c1 = 1{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33131) Fix grouping sets with having clause can not resolve qualified col name

2020-10-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33131:


Assignee: Apache Spark

> Fix grouping sets with having clause can not resolve qualified col name
> ---
>
> Key: SPARK-33131
> URL: https://issues.apache.org/jira/browse/SPARK-33131
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Assignee: Apache Spark
>Priority: Minor
>
> The method `ResolveAggregateFunctions.resolveFilterCondInAggregate` aims to 
> do two things:
> 1. resolve the expressions in the having clause.
> 2. push the extra aggregate expressions needed by the having clause into 
> `Aggregate`.
> However, we currently only handle case 2. If the having clause is resolved 
> successfully but no extra aggregate expression exists, we ignore the 
> resolution. Here is an example:
> {code:java}
> -- Works resolved by ResolveReferences
> select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having c1 = 
> 1
> -- Works because of the extra expression c1
> select c1 as c2 from values (1) as t1(c1) group by grouping sets(t1.c1) 
> having t1.c1 = 1
> -- Failed
> select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having 
> t1.c1 = 1{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33131) Fix grouping sets with having clause can not resolve qualified col name

2020-10-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33131:


Assignee: (was: Apache Spark)

> Fix grouping sets with having clause can not resolve qualified col name
> ---
>
> Key: SPARK-33131
> URL: https://issues.apache.org/jira/browse/SPARK-33131
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>
> The method `ResolveAggregateFunctions.resolveFilterCondInAggregate` aims to 
> do two things:
> 1. resolve the expressions in the having clause.
> 2. push the extra aggregate expressions needed by the having clause into 
> `Aggregate`.
> However, we currently only handle case 2. If the having clause is resolved 
> successfully but no extra aggregate expression exists, we ignore the 
> resolution. Here is an example:
> {code:java}
> -- Works resolved by ResolveReferences
> select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having c1 = 
> 1
> -- Works because of the extra expression c1
> select c1 as c2 from values (1) as t1(c1) group by grouping sets(t1.c1) 
> having t1.c1 = 1
> -- Failed
> select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having 
> t1.c1 = 1{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33131) Fix grouping sets with having clause can not resolve qualified col name

2020-10-13 Thread ulysses you (Jira)
ulysses you created SPARK-33131:
---

 Summary: Fix grouping sets with having clause can not resolve 
qualified col name
 Key: SPARK-33131
 URL: https://issues.apache.org/jira/browse/SPARK-33131
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: ulysses you


The method `ResolveAggregateFunctions.resolveFilterCondInAggregate` aims to do 
two things:
1. resolve the expressions in the having clause.
2. push the extra aggregate expressions needed by the having clause into `Aggregate`.

However, we currently only handle case 2. If the having clause is resolved 
successfully but no extra aggregate expression exists, we ignore the resolution. 
Here is an example:


{code:java}
-- Works resolved by ResolveReferences
select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having c1 = 1

-- Works because of the extra expression c1
select c1 as c2 from values (1) as t1(c1) group by grouping sets(t1.c1) having 
t1.c1 = 1

-- Failed
select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having t1.c1 
= 1{code}
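
To make the described pattern concrete, here is a small, self-contained Scala sketch. Every name and type in it is a simplified stand-in for illustration, not the actual `ResolveAggregateFunctions` implementation:

{code:scala}
object HavingResolutionSketch {
  // Stand-in for the result of resolving a HAVING condition: the rewritten
  // condition plus any extra aggregate expressions it needs pushed into Aggregate.
  final case class ResolvedHaving(condition: String, extraAggExprs: Seq[String])

  // Step 1: resolve qualified names such as t1.c1 against the child plan (stubbed).
  // Step 2: collect extra aggregate expressions the condition needs (none here).
  def resolveFilterCondInAggregate(havingCond: String): ResolvedHaving =
    ResolvedHaving(havingCond.replace("t1.c1", "c1"), extraAggExprs = Seq.empty)

  // The reported pattern: the caller keeps the result only when extra aggregate
  // expressions were produced, so a successfully resolved condition is discarded.
  def applyRule(havingCond: String): String = {
    val resolved = resolveFilterCondInAggregate(havingCond)
    if (resolved.extraAggExprs.nonEmpty) resolved.condition // resolution is used
    else havingCond                                         // resolution is ignored
  }

  def main(args: Array[String]): Unit = {
    println(applyRule("t1.c1 = 1")) // prints "t1.c1 = 1" instead of "c1 = 1"
  }
}
{code}

Under this pattern the third query above keeps its unresolved `t1.c1` reference, which matches the reported failure; keeping the resolved condition even when no extra aggregate expression is pushed down is the direction the description points at.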



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-32681) PySpark type hints support

2020-10-13 Thread echohlne (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

echohlne updated SPARK-32681:
-
Comment: was deleted

(was: (y))

> PySpark type hints support
> --
>
> Key: SPARK-32681
> URL: https://issues.apache.org/jira/browse/SPARK-32681
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Maciej Szymkiewicz
>Priority: Critical
>
>  https://github.com/zero323/pyspark-stubs demonstrates many ways to 
> improve usability in PySpark by leveraging Python type hints.
> By having the type hints in PySpark we can, for example:
> - automatically document the input and output types
> - leverage IDEs for error detection and auto-completion
> - have cleaner definitions that are easier to understand.
> This is an umbrella JIRA that targets porting 
> https://github.com/zero323/pyspark-stubs and related items so they run 
> smoothly within PySpark.
> It was also discussed in the dev mailing list:  
> http://apache-spark-developers-list.1001551.n3.nabble.com/Re-PySpark-Revisiting-PySpark-type-annotations-td26232.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32681) PySpark type hints support

2020-10-13 Thread echohlne (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213065#comment-17213065
 ] 

echohlne commented on SPARK-32681:
--

(y)

> PySpark type hints support
> --
>
> Key: SPARK-32681
> URL: https://issues.apache.org/jira/browse/SPARK-32681
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Maciej Szymkiewicz
>Priority: Critical
>
>  https://github.com/zero323/pyspark-stubs demonstrates many ways to 
> improve usability in PySpark by leveraging Python type hints.
> By having the type hints in PySpark we can, for example:
> - automatically document the input and output types
> - leverage IDEs for error detection and auto-completion
> - have cleaner definitions that are easier to understand.
> This is an umbrella JIRA that targets porting 
> https://github.com/zero323/pyspark-stubs and related items so they run 
> smoothly within PySpark.
> It was also discussed in the dev mailing list:  
> http://apache-spark-developers-list.1001551.n3.nabble.com/Re-PySpark-Revisiting-PySpark-type-annotations-td26232.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32295) Add not null and size > 0 filters before inner explode to benefit from predicate pushdown

2020-10-13 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-32295.
--
Fix Version/s: 3.1.0
 Assignee: Tanel Kiis
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/29092

> Add not null and size > 0 filters before inner explode to benefit from 
> predicate pushdown
> -
>
> Key: SPARK-32295
> URL: https://issues.apache.org/jira/browse/SPARK-32295
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 3.1.0
>Reporter: Tanel Kiis
>Assignee: Tanel Kiis
>Priority: Major
>  Labels: performance
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33130) Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MsSqlServer dialect)

2020-10-13 Thread Prashant Sharma (Jira)
Prashant Sharma created SPARK-33130:
---

 Summary: Support ALTER TABLE in JDBC v2 Table Catalog: add, update 
type and nullability of columns (MsSqlServer dialect)
 Key: SPARK-33130
 URL: https://issues.apache.org/jira/browse/SPARK-33130
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Prashant Sharma


Override the default SQL strings for:
ALTER TABLE RENAME COLUMN
ALTER TABLE UPDATE COLUMN NULLABILITY
in the MsSqlServer JDBC dialect, according to the official documentation.
Write MsSqlServer integration tests for JDBC.
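
As a rough sketch of the dialect-specific SQL this asks for, the snippet below generates the two statements in Scala. The method names are assumptions shaped like a JDBC dialect override, not the actual MsSqlServerDialect code; the generated SQL follows SQL Server's documented syntax (`sp_rename` for column renames, and `ALTER TABLE ... ALTER COLUMN` with an explicit data type for nullability changes):

{code:scala}
// Hypothetical sketch, not the actual Spark MsSqlServerDialect implementation.
object MsSqlServerAlterTableSketch {
  // SQL Server has no ALTER TABLE ... RENAME COLUMN; renames go through sp_rename.
  def renameColumn(table: String, oldName: String, newName: String): String =
    s"EXEC sp_rename '$table.$oldName', '$newName', 'COLUMN'"

  // Changing nullability requires restating the column's data type.
  def updateColumnNullability(table: String, column: String,
                              dataType: String, nullable: Boolean): String = {
    val nullability = if (nullable) "NULL" else "NOT NULL"
    s"ALTER TABLE $table ALTER COLUMN $column $dataType $nullability"
  }

  def main(args: Array[String]): Unit = {
    println(renameColumn("people", "fullname", "full_name"))
    println(updateColumnNullability("people", "age", "INT", nullable = false))
  }
}
{code}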



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21708) Migrate build to sbt 1.3.13

2020-10-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-21708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213033#comment-17213033
 ] 

Apache Spark commented on SPARK-21708:
--

User 'ScrapCodes' has created a pull request for this issue:
https://github.com/apache/spark/pull/30028

>  Migrate build to sbt 1.3.13
> 
>
> Key: SPARK-21708
> URL: https://issues.apache.org/jira/browse/SPARK-21708
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: PJ Fanning
>Assignee: Denis Pyshev
>Priority: Major
> Fix For: 3.1.0
>
>
> Should improve sbt build times.
> http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html
> According to https://github.com/sbt/sbt/issues/3424, we will need to change 
> the HTTP location where we get the sbt-launch jar.
> Other related issues:
> SPARK-14401
> https://github.com/typesafehub/sbteclipse/issues/343
> https://github.com/jrudolph/sbt-dependency-graph/issues/134
> https://github.com/AlpineNow/junit_xml_listener/issues/6
> https://github.com/spray/sbt-revolver/issues/62
> https://github.com/ihji/sbt-antlr4/issues/14



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21708) Migrate build to sbt 1.3.13

2020-10-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-21708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213032#comment-17213032
 ] 

Apache Spark commented on SPARK-21708:
--

User 'ScrapCodes' has created a pull request for this issue:
https://github.com/apache/spark/pull/30028

>  Migrate build to sbt 1.3.13
> 
>
> Key: SPARK-21708
> URL: https://issues.apache.org/jira/browse/SPARK-21708
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: PJ Fanning
>Assignee: Denis Pyshev
>Priority: Major
> Fix For: 3.1.0
>
>
> Should improve sbt build times.
> http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html
> According to https://github.com/sbt/sbt/issues/3424, we will need to change 
> the HTTP location where we get the sbt-launch jar.
> Other related issues:
> SPARK-14401
> https://github.com/typesafehub/sbteclipse/issues/343
> https://github.com/jrudolph/sbt-dependency-graph/issues/134
> https://github.com/AlpineNow/junit_xml_listener/issues/6
> https://github.com/spray/sbt-revolver/issues/62
> https://github.com/ihji/sbt-antlr4/issues/14



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33129) Since the sbt version is now upgraded, old `test-only` needs to be replaced with `testOnly`

2020-10-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213031#comment-17213031
 ] 

Apache Spark commented on SPARK-33129:
--

User 'ScrapCodes' has created a pull request for this issue:
https://github.com/apache/spark/pull/30028

> Since the sbt version is now upgraded, old `test-only` needs to be replaced 
> with `testOnly`
> ---
>
> Key: SPARK-33129
> URL: https://issues.apache.org/jira/browse/SPARK-33129
> Project: Spark
>  Issue Type: Bug
>  Components: Build, docs
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Priority: Major
>
> Follow-up to SPARK-21708: update the references to `test-only` to `testOnly`, 
> as the older syntax no longer works.
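
For reference, the rename only changes how the task is invoked; a typical invocation (the suite name below is just an example) changes as follows:

{noformat}
# sbt 0.13.x syntax, no longer accepted after the upgrade:
build/sbt "core/test-only org.apache.spark.rdd.RDDSuite"

# sbt 1.x syntax:
build/sbt "core/testOnly org.apache.spark.rdd.RDDSuite"
{noformat}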



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33129) Since the sbt version is now upgraded, old `test-only` needs to be replaced with `testOnly`

2020-10-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33129:


Assignee: (was: Apache Spark)

> Since the sbt version is now upgraded, old `test-only` needs to be replaced 
> with `testOnly`
> ---
>
> Key: SPARK-33129
> URL: https://issues.apache.org/jira/browse/SPARK-33129
> Project: Spark
>  Issue Type: Bug
>  Components: Build, docs
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Priority: Major
>
> Follow-up to SPARK-21708: update the references to `test-only` to `testOnly`, 
> as the older syntax no longer works.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33129) Since the sbt version is now upgraded, old `test-only` needs to be replaced with `testOnly`

2020-10-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33129:


Assignee: Apache Spark

> Since the sbt version is now upgraded, old `test-only` needs to be replaced 
> with `testOnly`
> ---
>
> Key: SPARK-33129
> URL: https://issues.apache.org/jira/browse/SPARK-33129
> Project: Spark
>  Issue Type: Bug
>  Components: Build, docs
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Assignee: Apache Spark
>Priority: Major
>
> Follow-up to SPARK-21708: update the references to `test-only` to `testOnly`, 
> as the older syntax no longer works.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33129) Since the sbt version is now upgraded, old `test-only` needs to be replaced with `testOnly`

2020-10-13 Thread Prashant Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-33129:

Description: Follow-up to SPARK-21708: update the references to `test-only` to 
`testOnly`, as the older syntax no longer works.

> Since the sbt version is now upgraded, old `test-only` needs to be replaced 
> with `testOnly`
> ---
>
> Key: SPARK-33129
> URL: https://issues.apache.org/jira/browse/SPARK-33129
> Project: Spark
>  Issue Type: Bug
>  Components: Build, docs
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Priority: Major
>
> Follow-up to SPARK-21708: update the references to `test-only` to `testOnly`, 
> as the older syntax no longer works.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33129) Since the sbt version is now upgraded, old `test-only` needs to be replaced with `testOnly`

2020-10-13 Thread Prashant Sharma (Jira)
Prashant Sharma created SPARK-33129:
---

 Summary: Since the sbt version is now upgraded, old `test-only` 
needs to be replaced with `testOnly`
 Key: SPARK-33129
 URL: https://issues.apache.org/jira/browse/SPARK-33129
 Project: Spark
  Issue Type: Bug
  Components: Build, docs
Affects Versions: 3.1.0
Reporter: Prashant Sharma






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32069) Improve error message on reading unexpected directory which is not a table partition

2020-10-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32069:


Assignee: (was: Apache Spark)

> Improve error message on reading unexpected directory which is not a table 
> partition
> 
>
> Key: SPARK-32069
> URL: https://issues.apache.org/jira/browse/SPARK-32069
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Minor
>  Labels: starter
>
> To reproduce:
> {code:java}
> spark-sql> create table test(i long);
> spark-sql> insert into test values(1);
> {code}
> {code:java}
> bash $ mkdir ./spark-warehouse/test/data
> {code}
> There will be an error message like:
> {code:java}
> java.io.IOException: Not a file: 
> file:/Users/gengliang.wang/projects/spark/spark-warehouse/test/data
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2173)
>   at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412)
>   at 
> org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:76)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:490)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:282)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   

[jira] [Assigned] (SPARK-32069) Improve error message on reading unexpected directory which is not a table partition

2020-10-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32069:


Assignee: Apache Spark

> Improve error message on reading unexpected directory which is not a table 
> partition
> 
>
> Key: SPARK-32069
> URL: https://issues.apache.org/jira/browse/SPARK-32069
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>
> To reproduce:
> {code:java}
> spark-sql> create table test(i long);
> spark-sql> insert into test values(1);
> {code}
> {code:java}
> bash $ mkdir ./spark-warehouse/test/data
> {code}
> There will be an error message like:
> {code:java}
> java.io.IOException: Not a file: 
> file:/Users/gengliang.wang/projects/spark/spark-warehouse/test/data
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2173)
>   at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412)
>   at 
> org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:76)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:490)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:282)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQL

[jira] [Commented] (SPARK-32069) Improve error message on reading unexpected directory which is not a table partition

2020-10-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213012#comment-17213012
 ] 

Apache Spark commented on SPARK-32069:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/30027

> Improve error message on reading unexpected directory which is not a table 
> partition
> 
>
> Key: SPARK-32069
> URL: https://issues.apache.org/jira/browse/SPARK-32069
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Minor
>  Labels: starter
>
> To reproduce:
> {code:java}
> spark-sql> create table test(i long);
> spark-sql> insert into test values(1);
> {code}
> {code:java}
> bash $ mkdir ./spark-warehouse/test/data
> {code}
> There will be an error message like:
> {code:java}
> java.io.IOException: Not a file: 
> file:/Users/gengliang.wang/projects/spark/spark-warehouse/test/data
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2173)
>   at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412)
>   at 
> org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:76)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:490)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.s

[jira] [Commented] (SPARK-27733) Upgrade to Avro 1.10.0

2020-10-13 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213010#comment-17213010
 ] 

Ismaël Mejía commented on SPARK-27733:
--

[~csun] This is excellent news! We will get this done and knowing that we can 
have someone from the Hive side for help is definitely appreciated. Let's 
continue the discussion then in HIVE-21737 until we get it ready there.

> Upgrade to Avro 1.10.0
> --
>
> Key: SPARK-27733
> URL: https://issues.apache.org/jira/browse/SPARK-27733
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.1.0
>Reporter: Ismaël Mejía
>Priority: Minor
>
> Avro 1.9.2 was released with many nice improvements, including a reduced size 
> (1 MB less), removed dependencies (no paranamer, no shaded Guava), and security 
> updates, so it is probably a worthwhile upgrade.
> Avro 1.10.0 has since been released and this is still not done.
> At the moment (2020/08) there is still a blocker: Hive-related transitive 
> dependencies bring in older versions of Avro, so this remains blocked until 
> HIVE-21737 is solved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32978) Incorrect number of dynamic part metric

2020-10-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213002#comment-17213002
 ] 

Apache Spark commented on SPARK-32978:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/30026

> Incorrect number of dynamic part metric
> ---
>
> Key: SPARK-32978
> URL: https://issues.apache.org/jira/browse/SPARK-32978
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: screenshot-1.png
>
>
> How to reproduce this issue:
> {code:sql}
> create table dynamic_partition(i bigint, part bigint) using parquet 
> partitioned by (part);
> insert overwrite table dynamic_partition partition(part) select id, id % 50 
> as part  from range(1);
> {code}
> The number of dynamic partitions should be 50, but it is 800 on the web UI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32978) Incorrect number of dynamic part metric

2020-10-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32978:


Assignee: Apache Spark

> Incorrect number of dynamic part metric
> ---
>
> Key: SPARK-32978
> URL: https://issues.apache.org/jira/browse/SPARK-32978
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
> Attachments: screenshot-1.png
>
>
> How to reproduce this issue:
> {code:sql}
> create table dynamic_partition(i bigint, part bigint) using parquet 
> partitioned by (part);
> insert overwrite table dynamic_partition partition(part) select id, id % 50 
> as part  from range(1);
> {code}
> The number of dynamic partitions should be 50, but it is 800 on the web UI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32978) Incorrect number of dynamic part metric

2020-10-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213001#comment-17213001
 ] 

Apache Spark commented on SPARK-32978:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/30026

> Incorrect number of dynamic part metric
> ---
>
> Key: SPARK-32978
> URL: https://issues.apache.org/jira/browse/SPARK-32978
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: screenshot-1.png
>
>
> How to reproduce this issue:
> {code:sql}
> create table dynamic_partition(i bigint, part bigint) using parquet 
> partitioned by (part);
> insert overwrite table dynamic_partition partition(part) select id, id % 50 
> as part  from range(1);
> {code}
> The number of dynamic partitions should be 50, but it is 800 on the web UI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32978) Incorrect number of dynamic part metric

2020-10-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32978:


Assignee: (was: Apache Spark)

> Incorrect number of dynamic part metric
> ---
>
> Key: SPARK-32978
> URL: https://issues.apache.org/jira/browse/SPARK-32978
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: screenshot-1.png
>
>
> How to reproduce this issue:
> {code:sql}
> create table dynamic_partition(i bigint, part bigint) using parquet 
> partitioned by (part);
> insert overwrite table dynamic_partition partition(part) select id, id % 50 
> as part  from range(1);
> {code}
> The number of dynamic partitions should be 50, but it is 800 on the web UI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33128) mismatched input since Spark 3.0

2020-10-13 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-33128:
---

 Summary: mismatched input since Spark 3.0
 Key: SPARK-33128
 URL: https://issues.apache.org/jira/browse/SPARK-33128
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1, 3.0.0, 3.1.0
Reporter: Yuming Wang


Spark 2.4:
{noformat}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_221)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("SELECT 1 UNION SELECT 1 UNION ALL SELECT 1").show
+---+
|  1|
+---+
|  1|
|  1|
+---+
{noformat}


Spark 3.x:
{noformat}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.0-SNAPSHOT
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 14.0.1)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("SELECT 1 UNION SELECT 1 UNION ALL SELECT 1").show
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'SELECT' expecting {<EOF>, ';'}(line 1, pos 15)

== SQL ==
SELECT 1 UNION SELECT 1 UNION ALL SELECT 1
---------------^^^

  at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130)
  at 
org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:51)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:81)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:610)
  at 
org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:610)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:769)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:607)
  ... 47 elided
{noformat}





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33095) Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MySQL dialect)

2020-10-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33095:


Assignee: Apache Spark

> Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and 
> nullability of columns (MySQL dialect)
> -
>
> Key: SPARK-33095
> URL: https://issues.apache.org/jira/browse/SPARK-33095
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Assignee: Apache Spark
>Priority: Major
>
> Override the default SQL strings for:
> ALTER TABLE UPDATE COLUMN TYPE
> ALTER TABLE UPDATE COLUMN NULLABILITY
> in the MySQL JDBC dialect, according to the official documentation.
> Write MySQL integration tests for JDBC.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33095) Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MySQL dialect)

2020-10-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33095:


Assignee: (was: Apache Spark)

> Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and 
> nullability of columns (MySQL dialect)
> -
>
> Key: SPARK-33095
> URL: https://issues.apache.org/jira/browse/SPARK-33095
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Priority: Major
>
> Override the default SQL strings for:
> ALTER TABLE UPDATE COLUMN TYPE
> ALTER TABLE UPDATE COLUMN NULLABILITY
> in the MySQL JDBC dialect, according to the official documentation.
> Write MySQL integration tests for JDBC.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33095) Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MySQL dialect)

2020-10-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212989#comment-17212989
 ] 

Apache Spark commented on SPARK-33095:
--

User 'ScrapCodes' has created a pull request for this issue:
https://github.com/apache/spark/pull/30025

> Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and 
> nullability of columns (MySQL dialect)
> -
>
> Key: SPARK-33095
> URL: https://issues.apache.org/jira/browse/SPARK-33095
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Priority: Major
>
> Override the default SQL strings for:
> ALTER TABLE UPDATE COLUMN TYPE
> ALTER TABLE UPDATE COLUMN NULLABILITY
> in the MySQL JDBC dialect, according to the official documentation.
> Write MySQL integration tests for JDBC.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33095) Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MySQL dialect)

2020-10-13 Thread Prashant Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-33095:

Description: 
Override the default SQL strings for:
ALTER TABLE UPDATE COLUMN TYPE
ALTER TABLE UPDATE COLUMN NULLABILITY
in the MySQL JDBC dialect, according to the official documentation.
Write MySQL integration tests for JDBC.

  was:
Override the default SQL strings for:
ALTER TABLE ADD COLUMN
ALTER TABLE UPDATE COLUMN TYPE
ALTER TABLE UPDATE COLUMN NULLABILITY
in the MySQL JDBC dialect, according to the official documentation.
Write MySQL integration tests for JDBC.


> Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and 
> nullability of columns (MySQL dialect)
> -
>
> Key: SPARK-33095
> URL: https://issues.apache.org/jira/browse/SPARK-33095
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Priority: Major
>
> Override the default SQL strings for:
> ALTER TABLE UPDATE COLUMN TYPE
> ALTER TABLE UPDATE COLUMN NULLABILITY
> in the MySQL JDBC dialect, according to the official documentation.
> Write MySQL integration tests for JDBC.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32229) Application entry parsing fails because DriverWrapper registered instead of the normal driver

2020-10-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32229:


Assignee: (was: Apache Spark)

> Application entry parsing fails because DriverWrapper registered instead of 
> the normal driver
> -
>
> Key: SPARK-32229
> URL: https://issues.apache.org/jira/browse/SPARK-32229
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> In some cases DriverWrapper is registered by DriverRegistry, which causes an 
> exception in PostgresConnectionProvider:
> https://github.com/apache/spark/blob/371b35d2e0ab08ebd853147c6673de3adfad0553/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/DriverRegistry.scala#L53
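
A minimal Scala sketch of the mismatch, under the assumption (hypothetical names, not the actual Spark classes) that the connection provider recognises its driver by class name:

{code:scala}
object DriverWrapperSketch {
  trait SketchDriver
  final class PostgresSketchDriver extends SketchDriver
  // Stand-in for the wrapper that DriverRegistry can register instead of the
  // underlying driver.
  final class SketchDriverWrapper(val wrapped: SketchDriver) extends SketchDriver

  // A canHandle-style check keyed on the registered driver's class name (assumption).
  def canHandle(registered: SketchDriver): Boolean =
    registered.getClass.getName.endsWith("PostgresSketchDriver")

  def main(args: Array[String]): Unit = {
    println(canHandle(new PostgresSketchDriver))                          // true
    println(canHandle(new SketchDriverWrapper(new PostgresSketchDriver))) // false: lookup fails
  }
}
{code}

Any fix presumably has to unwrap the wrapper (or compare against the wrapped driver) before matching.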



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32229) Application entry parsing fails because DriverWrapper registered instead of the normal driver

2020-10-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212968#comment-17212968
 ] 

Apache Spark commented on SPARK-32229:
--

User 'gaborgsomogyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/30024

> Application entry parsing fails because DriverWrapper registered instead of 
> the normal driver
> -
>
> Key: SPARK-32229
> URL: https://issues.apache.org/jira/browse/SPARK-32229
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> In some cases DriverWrapper is registered by DriverRegistry, which causes an 
> exception in PostgresConnectionProvider:
> https://github.com/apache/spark/blob/371b35d2e0ab08ebd853147c6673de3adfad0553/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/DriverRegistry.scala#L53



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32229) Application entry parsing fails because DriverWrapper registered instead of the normal driver

2020-10-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32229:


Assignee: Apache Spark

> Application entry parsing fails because DriverWrapper registered instead of 
> the normal driver
> -
>
> Key: SPARK-32229
> URL: https://issues.apache.org/jira/browse/SPARK-32229
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Assignee: Apache Spark
>Priority: Major
>
> In some cases DriverWrapper is registered by DriverRegistry, which causes an 
> exception in PostgresConnectionProvider:
> https://github.com/apache/spark/blob/371b35d2e0ab08ebd853147c6673de3adfad0553/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/DriverRegistry.scala#L53



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



  1   2   >