[jira] [Commented] (SPARK-33085) "Master removed our application" error leads to FAILED driver status instead of KILLED driver status
[ https://issues.apache.org/jira/browse/SPARK-33085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213647#comment-17213647 ] Hyukjin Kwon commented on SPARK-33085: -- Can you show the reproducible steps? > "Master removed our application" error leads to FAILED driver status instead > of KILLED driver status > > > Key: SPARK-33085 > URL: https://issues.apache.org/jira/browse/SPARK-33085 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.4.6 >Reporter: t oo >Priority: Major > > > driver-20200930160855-0316 exited with status FAILED > > I am using Spark Standalone scheduler with spot ec2 workers. I confirmed that > myip.87 EC2 instance was terminated at 2020-09-30 16:16 > > *I would expect the overall driver status to be KILLED but instead it was > FAILED*, my goal is to interpret FAILED status as 'don't rerun as > non-transient error faced' but KILLED/ERROR status as 'yes, rerun as > transient error faced'. But it looks like FAILED status is being set in below > case of transient error: > > Below are driver logs > {code:java} > 2020-09-30 16:12:41,183 [main] INFO > com.yotpo.metorikku.output.writers.file.FileOutputWriter - Writing file to > s3a://redacted2020-09-30 16:12:41,183 [main] INFO > com.yotpo.metorikku.output.writers.file.FileOutputWriter - Writing file to > s3a://redacted20-09-30 16:16:40,366 [dispatcher-event-loop-15] ERROR > org.apache.spark.scheduler.TaskSchedulerImpl - Lost executor 0 on myip.87: > Remote RPC client disassociated. Likely due to containers exceeding > thresholds, or network issues. Check driver logs for WARN messages.2020-09-30 > 16:16:40,372 [dispatcher-event-loop-15] WARN > org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 6.0 (TID > 6, myip.87, executor 0): ExecutorLostFailure (executor 0 exited caused by one > of the running tasks) Reason: Remote RPC client disassociated. Likely due to > containers exceeding thresholds, or network issues. 
Check driver logs for > WARN messages.2020-09-30 16:16:40,376 [dispatcher-event-loop-13] WARN > org.apache.spark.storage.BlockManagerMasterEndpoint - No more replicas > available for rdd_3_0 !2020-09-30 16:16:40,398 [dispatcher-event-loop-2] INFO > org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor > app-20200930160902-0895/0 removed: Worker shutting down2020-09-30 > 16:16:40,399 [dispatcher-event-loop-2] INFO > org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted > executor ID app-20200930160902-0895/1 on hostPort myip.87:11647 with 2 > core(s), 5.0 GB RAM2020-09-30 16:16:40,401 [dispatcher-event-loop-5] INFO > org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor > app-20200930160902-0895/1 removed: java.lang.IllegalStateException: Shutdown > hooks cannot be modified during shutdown.2020-09-30 16:16:40,402 > [dispatcher-event-loop-5] INFO > org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted > executor ID app-20200930160902-0895/2 on hostPort myip.87:11647 with 2 > core(s), 5.0 GB RAM2020-09-30 16:16:40,403 [dispatcher-event-loop-11] INFO > org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor > app-20200930160902-0895/2 removed: java.lang.IllegalStateException: Shutdown > hooks cannot be modified during shutdown.2020-09-30 16:16:40,404 > [dispatcher-event-loop-11] INFO > org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted > executor ID app-20200930160902-0895/3 on hostPort myip.87:11647 with 2 > core(s), 5.0 GB RAM2020-09-30 16:16:40,405 [dispatcher-event-loop-1] INFO > org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor > app-20200930160902-0895/3 removed: java.lang.IllegalStateException: Shutdown > hooks cannot be modified during shutdown.2020-09-30 16:16:40,406 > [dispatcher-event-loop-1] INFO > org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted > executor ID app-20200930160902-0895/4 on hostPort myip.87:11647 with 2 > core(s), 5.0 GB RAM2020-09-30 16:16:40,407 [dispatcher-event-loop-12] INFO > org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor > app-20200930160902-0895/4 removed: java.lang.IllegalStateException: Shutdown > hooks cannot be modified during shutdown.2020-09-30 16:16:40,408 > [dispatcher-event-loop-12] INFO > org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted > executor ID app-20200930160902-0895/5 on hostPort myip.87:11647 with 2 > core(s), 5.0 GB RAM2020-09-30 16:16:40,409 [dispatcher-event-loop-4] INFO > org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor > app-20200930160902-0895/5 re
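For what it's worth, the retry policy described in the report above can be written down as a tiny, purely hypothetical Scala helper (the DriverRetryPolicy object is illustrative, not a Spark API): FAILED is treated as a non-transient failure that should not be resubmitted, while KILLED and ERROR are treated as transient and eligible for rerun.

{code:scala}
// Hypothetical sketch of the retry policy described in the report above.
// KILLED, ERROR and FAILED are the final driver statuses the report discusses;
// the policy object itself is not part of Spark.
object DriverRetryPolicy {
  sealed trait Decision
  case object Rerun extends Decision  // transient failure: resubmit the driver
  case object GiveUp extends Decision // non-transient failure: do not resubmit

  def decide(finalStatus: String): Decision = finalStatus match {
    case "KILLED" | "ERROR" => Rerun  // e.g. spot instance / worker lost
    case "FAILED"           => GiveUp // e.g. application-level error
    case other =>
      throw new IllegalArgumentException(s"Unexpected final driver status: $other")
  }
}
{code}

The report is that a lost worker ("Master removed our application") currently surfaces as FAILED, which breaks exactly this kind of classification.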
[jira] [Commented] (SPARK-33113) [SparkR] gapply works with arrow disabled, fails with arrow enabled stringsAsFactors=TRUE
[ https://issues.apache.org/jira/browse/SPARK-33113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213643#comment-17213643 ] Hyukjin Kwon commented on SPARK-33113: -- It works with Spark 3.0.0 too. Can you show your versions of R and Arrow? > [SparkR] gapply works with arrow disabled, fails with arrow enabled > stringsAsFactors=TRUE > - > > Key: SPARK-33113 > URL: https://issues.apache.org/jira/browse/SPARK-33113 > Project: Spark > Issue Type: Bug > Components: R >Affects Versions: 3.0.0, 3.0.1 >Reporter: Jacek Pliszka >Priority: Major > > Running in databricks on Azure > {code} > library("arrow") > library("SparkR") > df <- as.DataFrame(list("A", "B", "C"), schema="ColumnA") > udf <- function(key, x) data.frame(out=c("dfs")) > {code} > > This works: > {code} > sparkR.session(master = "local[*]", > sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "false")) > df1 <- gapply(df, c("ColumnA"), udf, "out String") > collect(df1) > {code} > This fails: > {code} > sparkR.session(master = "local[*]", > sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "true")) > df2 <- gapply(df, c("ColumnA"), udf, "out String") > collect(df2) > {code} > > with error > {code} > Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : }}Error > in readBin(con, raw(), as.integer(dataLen), endian = "big") : invalid 'n' > argument > Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : invalid > 'n' argument In addition: Warning messages: 1: Use 'read_ipc_stream' or > 'read_feather' instead. 2: Use 'read_ipc_stream' or 'read_feather' instead. > {code} > > Clicking through Failed Stages to Failure Reason: > > {code} > Job aborted due to stage failure: Task 49 in stage 1843.0 failed 4 times, > most recent failure: Lost task 49.3 in stage 1843.0 (TID 89810, 10.99.0.5, > executor 0): java.lang.UnsupportedOperationException > at > org.apache.spark.sql.vectorized.ArrowColumnVector$ArrowVectorAccessor.getUTF8String(ArrowColumnVector.java:233) > at > org.apache.spark.sql.vectorized.ArrowColumnVector.getUTF8String(ArrowColumnVector.java:109) > at > org.apache.spark.sql.vectorized.ColumnarBatchRow.getUTF8String(ColumnarBatch.java:220) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) > at > org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.$anonfun$next$1(ArrowConverters.scala:131) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559) > at > org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.next(ArrowConverters.scala:140) > at > org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.next(ArrowConverters.scala:115) > at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.Iterator.foreach$(Iterator.scala:941) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62) > at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53) > at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105) > at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49) > at 
scala.collection.TraversableOnce.to(TraversableOnce.scala:315) > at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313) > at scala.collection.AbstractIterator.to(Iterator.scala:1429) > at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307) > at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429) > at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294) > at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1429) > at > org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToR$3(Dataset.scala:3589) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144) > at org.apache.spark.scheduler.Task.run(Task.scala:117) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:639) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:642) > at > java.util.concurrent.ThreadP
[jira] [Commented] (SPARK-33113) [SparkR] gapply works with arrow disabled, fails with arrow enabled stringsAsFactors=TRUE
[ https://issues.apache.org/jira/browse/SPARK-33113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213640#comment-17213640 ] Hyukjin Kwon commented on SPARK-33113: -- It works in my local in Spark dev branch: {code:java} > df <- as.DataFrame(list("A", "B", "C"), schema="ColumnA") > udf <- function(key, x) data.frame(out=c("dfs")) > sparkR.session(master = "local[*]", > sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "false")) Java ref type org.apache.spark.sql.SparkSession id 1 > df1 <- gapply(df, c("ColumnA"), udf, "out String") > collect(df1) out 1 dfs 2 dfs 3 dfs > sparkR.session(master = "local[*]", > sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "true")) Java ref type org.apache.spark.sql.SparkSession id 1 > df2 <- gapply(df, c("ColumnA"), udf, "out String") > collect(df2) out 1 dfs 2 dfs 3 dfs {code} > [SparkR] gapply works with arrow disabled, fails with arrow enabled > stringsAsFactors=TRUE > - > > Key: SPARK-33113 > URL: https://issues.apache.org/jira/browse/SPARK-33113 > Project: Spark > Issue Type: Bug > Components: R >Affects Versions: 3.0.0, 3.0.1 >Reporter: Jacek Pliszka >Priority: Major > > Running in databricks on Azure > {code} > library("arrow") > library("SparkR") > df <- as.DataFrame(list("A", "B", "C"), schema="ColumnA") > udf <- function(key, x) data.frame(out=c("dfs")) > {code} > > This works: > {code} > sparkR.session(master = "local[*]", > sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "false")) > df1 <- gapply(df, c("ColumnA"), udf, "out String") > collect(df1) > {code} > This fails: > {code} > sparkR.session(master = "local[*]", > sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "true")) > df2 <- gapply(df, c("ColumnA"), udf, "out String") > collect(df2) > {code} > > with error > {code} > Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : }}Error > in readBin(con, raw(), as.integer(dataLen), endian = "big") : invalid 'n' > argument > Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : invalid > 'n' argument In addition: Warning messages: 1: Use 'read_ipc_stream' or > 'read_feather' instead. 2: Use 'read_ipc_stream' or 'read_feather' instead. 
> {code} > > Clicking through Failed Stages to Failure Reason: > > {code} > Job aborted due to stage failure: Task 49 in stage 1843.0 failed 4 times, > most recent failure: Lost task 49.3 in stage 1843.0 (TID 89810, 10.99.0.5, > executor 0): java.lang.UnsupportedOperationException > at > org.apache.spark.sql.vectorized.ArrowColumnVector$ArrowVectorAccessor.getUTF8String(ArrowColumnVector.java:233) > at > org.apache.spark.sql.vectorized.ArrowColumnVector.getUTF8String(ArrowColumnVector.java:109) > at > org.apache.spark.sql.vectorized.ColumnarBatchRow.getUTF8String(ColumnarBatch.java:220) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) > at > org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.$anonfun$next$1(ArrowConverters.scala:131) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559) > at > org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.next(ArrowConverters.scala:140) > at > org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.next(ArrowConverters.scala:115) > at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.Iterator.foreach$(Iterator.scala:941) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62) > at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53) > at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105) > at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49) > at scala.collection.TraversableOnce.to(TraversableOnce.scala:315) > at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313) > at scala.collection.AbstractIterator.to(Iterator.scala:1429) > at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307) > at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429) > at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294) > at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288) > at scala.colle
[jira] [Updated] (SPARK-33113) [SparkR] gapply works with arrow disabled, fails with arrow enabled stringsAsFactors=TRUE
[ https://issues.apache.org/jira/browse/SPARK-33113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33113: - Description: Running in databricks on Azure {code} library("arrow") library("SparkR") df <- as.DataFrame(list("A", "B", "C"), schema="ColumnA") udf <- function(key, x) data.frame(out=c("dfs")) {code} This works: {code} sparkR.session(master = "local[*]", sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "false")) df1 <- gapply(df, c("ColumnA"), udf, "out String") collect(df1) {code} This fails: {code} sparkR.session(master = "local[*]", sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "true")) df2 <- gapply(df, c("ColumnA"), udf, "out String") collect(df2) {code} with error {code} Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : }}Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : invalid 'n' argument Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : invalid 'n' argument In addition: Warning messages: 1: Use 'read_ipc_stream' or 'read_feather' instead. 2: Use 'read_ipc_stream' or 'read_feather' instead. {cpde} Clicking through Failed Stages to Failure Reason: {code} Job aborted due to stage failure: Task 49 in stage 1843.0 failed 4 times, most recent failure: Lost task 49.3 in stage 1843.0 (TID 89810, 10.99.0.5, executor 0): java.lang.UnsupportedOperationException at org.apache.spark.sql.vectorized.ArrowColumnVector$ArrowVectorAccessor.getUTF8String(ArrowColumnVector.java:233) at org.apache.spark.sql.vectorized.ArrowColumnVector.getUTF8String(ArrowColumnVector.java:109) at org.apache.spark.sql.vectorized.ColumnarBatchRow.getUTF8String(ColumnarBatch.java:220) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.$anonfun$next$1(ArrowConverters.scala:131) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559) at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.next(ArrowConverters.scala:140) at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.next(ArrowConverters.scala:115) at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62) at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49) at scala.collection.TraversableOnce.to(TraversableOnce.scala:315) at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313) at scala.collection.AbstractIterator.to(Iterator.scala:1429) at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307) at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429) at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294) at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288) at 
scala.collection.AbstractIterator.toArray(Iterator.scala:1429) at org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToR$3(Dataset.scala:3589) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144) at org.apache.spark.scheduler.Task.run(Task.scala:117) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:639) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:642) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Driver stacktrace: {code} was: Running in databricks on Azure {code} library("arrow") library("SparkR") df <- as.DataFrame(list("A", "B", "C"), schema="ColumnA") udf <- function(key, x) data.frame(out=c("dfs")) {code} This works: {code} sparkR.session(master = "local[*]", sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "false")) df1 <- gapply(df, c("ColumnA"), udf, "out String") collect(df1) {code} This fails: {code} sparkR.session(master = "local[*]", sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "true")) df2 <- gapply(
[jira] [Updated] (SPARK-33113) [SparkR] gapply works with arrow disabled, fails with arrow enabled stringsAsFactors=TRUE
[ https://issues.apache.org/jira/browse/SPARK-33113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33113: - Description: Running in databricks on Azure {code} library("arrow") library("SparkR") df <- as.DataFrame(list("A", "B", "C"), schema="ColumnA") udf <- function(key, x) data.frame(out=c("dfs")) {code} This works: {code} sparkR.session(master = "local[*]", sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "false")) df1 <- gapply(df, c("ColumnA"), udf, "out String") collect(df1) {code} This fails: {code} sparkR.session(master = "local[*]", sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "true")) df2 <- gapply(df, c("ColumnA"), udf, "out String") collect(df2) {code} with error {code} Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : }}Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : invalid 'n' argument Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : invalid 'n' argument In addition: Warning messages: 1: Use 'read_ipc_stream' or 'read_feather' instead. 2: Use 'read_ipc_stream' or 'read_feather' instead. {code} Clicking through Failed Stages to Failure Reason: {code} Job aborted due to stage failure: Task 49 in stage 1843.0 failed 4 times, most recent failure: Lost task 49.3 in stage 1843.0 (TID 89810, 10.99.0.5, executor 0): java.lang.UnsupportedOperationException at org.apache.spark.sql.vectorized.ArrowColumnVector$ArrowVectorAccessor.getUTF8String(ArrowColumnVector.java:233) at org.apache.spark.sql.vectorized.ArrowColumnVector.getUTF8String(ArrowColumnVector.java:109) at org.apache.spark.sql.vectorized.ColumnarBatchRow.getUTF8String(ColumnarBatch.java:220) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.$anonfun$next$1(ArrowConverters.scala:131) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559) at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.next(ArrowConverters.scala:140) at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.next(ArrowConverters.scala:115) at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62) at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49) at scala.collection.TraversableOnce.to(TraversableOnce.scala:315) at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313) at scala.collection.AbstractIterator.to(Iterator.scala:1429) at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307) at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429) at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294) at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288) at 
scala.collection.AbstractIterator.toArray(Iterator.scala:1429) at org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToR$3(Dataset.scala:3589) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144) at org.apache.spark.scheduler.Task.run(Task.scala:117) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:639) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:642) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Driver stacktrace: {code} was: Running in databricks on Azure {code} library("arrow") library("SparkR") df <- as.DataFrame(list("A", "B", "C"), schema="ColumnA") udf <- function(key, x) data.frame(out=c("dfs")) {code} This works: {code} sparkR.session(master = "local[*]", sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "false")) df1 <- gapply(df, c("ColumnA"), udf, "out String") collect(df1) {code} This fails: {code} sparkR.session(master = "local[*]", sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "true")) df2 <- gapply(d
[jira] [Updated] (SPARK-33113) [SparkR] gapply works with arrow disabled, fails with arrow enabled stringsAsFactors=TRUE
[ https://issues.apache.org/jira/browse/SPARK-33113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33113: - Description: Running in databricks on Azure {code} library("arrow") library("SparkR") df <- as.DataFrame(list("A", "B", "C"), schema="ColumnA") udf <- function(key, x) data.frame(out=c("dfs")) {code} This works: {code} sparkR.session(master = "local[*]", sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "false")) df1 <- gapply(df, c("ColumnA"), udf, "out String") collect(df1) {code} This fails: {code} sparkR.session(master = "local[*]", sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "true")) df2 <- gapply(df, c("ColumnA"), udf, "out String") collect(df2) {code} with error \{{ Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : }}Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : invalid 'n' argument Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : invalid 'n' argument In addition: Warning messages: 1: Use 'read_ipc_stream' or 'read_feather' instead. 2: Use 'read_ipc_stream' or 'read_feather' instead. Clicking through Failed Stages to Failure Reason: {code} Job aborted due to stage failure: Task 49 in stage 1843.0 failed 4 times, most recent failure: Lost task 49.3 in stage 1843.0 (TID 89810, 10.99.0.5, executor 0): java.lang.UnsupportedOperationException at org.apache.spark.sql.vectorized.ArrowColumnVector$ArrowVectorAccessor.getUTF8String(ArrowColumnVector.java:233) at org.apache.spark.sql.vectorized.ArrowColumnVector.getUTF8String(ArrowColumnVector.java:109) at org.apache.spark.sql.vectorized.ColumnarBatchRow.getUTF8String(ColumnarBatch.java:220) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.$anonfun$next$1(ArrowConverters.scala:131) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559) at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.next(ArrowConverters.scala:140) at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.next(ArrowConverters.scala:115) at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62) at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49) at scala.collection.TraversableOnce.to(TraversableOnce.scala:315) at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313) at scala.collection.AbstractIterator.to(Iterator.scala:1429) at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307) at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429) at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294) at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288) at 
scala.collection.AbstractIterator.toArray(Iterator.scala:1429) at org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToR$3(Dataset.scala:3589) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144) at org.apache.spark.scheduler.Task.run(Task.scala:117) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:639) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:642) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Driver stacktrace: {code} was: Running in databricks on Azure library("arrow") library("SparkR") df <- as.DataFrame(list("A", "B", "C"), schema="ColumnA") udf <- function(key, x) data.frame(out=c("dfs")) This works: sparkR.session(master = "local[*]", sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "false")) df1 <- gapply(df, c("ColumnA"), udf, "out String") collect(df1) This fails: sparkR.session(master = "local[*]", sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "true")) df2 <- gapply(df, c("ColumnA"), udf, "out String")
[jira] [Commented] (SPARK-33120) Lazy Load of SparkContext.addFiles
[ https://issues.apache.org/jira/browse/SPARK-33120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213635#comment-17213635 ] Hyukjin Kwon commented on SPARK-33120: -- Why don't you just upload your files into HDFS or something else and use it? You could also leverage binaryFile source, etc. You could also think about using fuse if you should access it like a local file system. In my case, when I did some geographical stuff before, I used fuse instead of passing files over addFiles. So each task can random access and do some topographic correction and angle correction from the original large image. {{SparkContext.addFiles}} is usually for passing some meta stuff to handle data like jars. > Lazy Load of SparkContext.addFiles > -- > > Key: SPARK-33120 > URL: https://issues.apache.org/jira/browse/SPARK-33120 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 > Environment: Mac OS X (2 systems), workload to eventually be run on > Amazon EMR. > Java 11 application. >Reporter: Taylor Smock >Priority: Minor > > In my spark job, I may have various random files that may or may not be used > by each task. > I would like to avoid copying all of the files to every executor until it is > actually needed. > > What I've tried: > * SparkContext.addFiles w/ SparkFiles.get . In testing, all files were > distributed to all clients. > * Broadcast variables. Since I _don't_ know what files I'm going to need > until I have started the task, I have to broadcast all the data at once, > which leads to nodes getting data, and then caching it to disk. In short, the > same issues as SparkContext.addFiles, but with the added benefit of having > the ability to create a mapping of paths to files. > What I would like to see: > * SparkContext.addFiles(file, Enum.LazyLoad) w/ SparkFiles.get(file, > Enum.WaitForAvailability) or Future future = SparkFiles.get(file) > > > Notes: > https://issues.apache.org/jira/browse/SPARK-4290?focusedCommentId=14205346&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14205346 > indicated that `SparkFiles.get` would be required to get the data on the > local driver, but in my testing that did not appear to be the case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
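As a concrete sketch of the binaryFile suggestion in the comment above (the HDFS path and glob below are assumptions): Spark 3.x ships a binaryFile data source where each matching file becomes one row with path, modificationTime, length and content columns, so a job can list and filter files cheaply and only pull the bytes for the rows it actually needs.

{code:scala}
// Minimal sketch of the binaryFile data source mentioned above (Spark 3.x).
// Each matching file becomes a row: path, modificationTime, length, content.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("binaryFileSketch").getOrCreate()

val files = spark.read
  .format("binaryFile")
  .option("pathGlobFilter", "*.tif")   // optional filename filter
  .load("hdfs:///data/assets")         // assumed location of the "random files"

// Selecting only metadata columns avoids materializing the file bytes.
files.select("path", "length").show(truncate = false)
{code}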
[jira] [Commented] (SPARK-33133) History server fails when loading invalid rolling event logs
[ https://issues.apache.org/jira/browse/SPARK-33133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213629#comment-17213629 ] Hyukjin Kwon commented on SPARK-33133: -- cc [~kabhwan] FYI > History server fails when loading invalid rolling event logs > > > Key: SPARK-33133 > URL: https://issues.apache.org/jira/browse/SPARK-33133 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Adam Binford >Priority: Major > > We have run into an issue where our history server fails to load new > applications, and when restarted, fails to load any applications at all. This > happens when it encounters invalid rolling event log files. We encounter this > with long running streaming applications. There seems to be two issues here > that lead to problems: > * It looks like our long running streaming applications event log directory > is being cleaned up. The next time the application logs event data, it > recreates the event log directory but without recreating the "appstatus" > file. I don't know the full extent of this behavior or if something "wrong" > is happening here. > * The history server then reads this new folder, and throws an exception > because the "appstatus" file doesn't exist in the rolling event log folder. > This exception breaks the entire listing process, so no new applications will > be read, and if restarted no applications at all will be read. > There seems like a couple ways to go about fixing this, and I'm curious > anyone's thoughts who knows more about how the history server works, > specifically with rolling event logs: > * Don't completely fail checking for new applications if one bad rolling > event log folder is encountered. This seems like the simplest fix and makes > sense to me, it already checks for a few other errors and ignores them. It > doesn't necessarily fix the underlying issue that leads to this happening > though. > * Figure out why the in progress event log folder is being deleted and make > sure that doesn't happen. Maybe this is supposed to happen? Or maybe we don't > want to delete the top level folder and only delete event log files within > the folders? Again I don't know the exact current behavior here with this. > * When writing new event log data, make sure the folder and appstatus file > exist every time, creating them again if not. > Here's the stack trace we encounter when this happens, from 3.0.1 with a > couple extra MRs backported that I hoped would fix the issue: > {{2020-10-13 12:10:31,751 ERROR history.FsHistoryProvider: Exception in > checking for event log updates2020-10-13 12:10:31,751 ERROR > history.FsHistoryProvider: Exception in checking for event log > updatesjava.lang.IllegalArgumentException: requirement failed: Log directory > must contain an appstatus file! 
at scala.Predef$.require(Predef.scala:281) at > org.apache.spark.deploy.history.RollingEventLogFilesFileReader.files$lzycompute(EventLogFileReaders.scala:214) > at > org.apache.spark.deploy.history.RollingEventLogFilesFileReader.files(EventLogFileReaders.scala:211) > at > org.apache.spark.deploy.history.RollingEventLogFilesFileReader.eventLogFiles$lzycompute(EventLogFileReaders.scala:221) > at > org.apache.spark.deploy.history.RollingEventLogFilesFileReader.eventLogFiles(EventLogFileReaders.scala:220) > at > org.apache.spark.deploy.history.RollingEventLogFilesFileReader.lastEventLogFile(EventLogFileReaders.scala:272) > at > org.apache.spark.deploy.history.RollingEventLogFilesFileReader.fileSizeForLastIndex(EventLogFileReaders.scala:240) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7(FsHistoryProvider.scala:524) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7$adapted(FsHistoryProvider.scala:466) > at > scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:256) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at > scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255) at > scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249) at > scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108) at > scala.collection.TraversableLike.filter(TraversableLike.scala:347) at > scala.collection.TraversableLike.filter$(TraversableLike.scala:347) at > scala.collection.AbstractTraversable.filter(Traversable.scala:108) at > org.apache.spark.deploy.history.FsHistoryProvider.checkForLogs(FsHistoryProvider.scala:466) > at > org.apache.spark.deploy.history.FsHistoryProvider.$
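As a rough illustration of the first option listed above (skip a bad rolling event log directory instead of failing the whole listing), here is a small Scala sketch against the Hadoop FileSystem API. It is not the FsHistoryProvider code; the log root path and the appstatus file-name check are assumptions about the on-disk layout.

{code:scala}
// Illustrative only: partition rolling event log directories into those that
// contain an "appstatus" marker file and those that do not, and skip the
// latter with a warning instead of aborting the entire listing pass.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val logRoot = new Path("hdfs:///spark-events")             // assumed event log root
val fs: FileSystem = logRoot.getFileSystem(new Configuration())

val candidateDirs = fs.listStatus(logRoot).filter(_.isDirectory)

val (valid, invalid) = candidateDirs.partition { dir =>
  fs.listStatus(dir.getPath).exists(_.getPath.getName.startsWith("appstatus"))
}

invalid.foreach(d => println(s"Skipping incomplete event log dir: ${d.getPath}"))
// Only `valid` directories would be handed to the normal replay/parsing logic.
{code}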
[jira] [Commented] (SPARK-33136) Handling nullability for complex types is broken during resolution of V2 write command
[ https://issues.apache.org/jira/browse/SPARK-33136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213612#comment-17213612 ] Jungtaek Lim commented on SPARK-33136: -- Note that AppendData in branch-2.4 is broken in the same way, but the usage of AppendData was reverted in [{{b6e4aca}}|https://github.com/apache/spark/commit/b6e4aca0be7f3b863c326063a3c02aa8a1c266a3] for branch-2.4, and that revert shipped in Spark 2.4.0. (That said, no Spark 2.x version is affected.) So while the AppendData code in branch-2.4 is broken as well, it is dead code. > Handling nullability for complex types is broken during resolution of V2 > write command > -- > > Key: SPARK-33136 > URL: https://issues.apache.org/jira/browse/SPARK-33136 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.1.0 >Reporter: Jungtaek Lim >Priority: Major > > I found that Spark 3.x cannot write to a nullable complex-type column if the > matching column in the DataFrame is non-nullable. > For example, > {code:java} > case class StructData(a: String, b: Int) > case class Data(col_b: Boolean, col_i: Int, col_l: Long, col_f: Float, col_d: > Double, col_s: String, col_fi: Array[Byte], col_bi: Array[Byte], col_de: > Double, col_st: StructData, col_li: Seq[String], col_ma: Map[Int, > String]){code} > `col_st.b` would be non-nullable in the DataFrame, which should not matter when > we insert from the DataFrame into a table that has `col_st.b` as nullable > (writing non-nullable data into a nullable column should be allowed). > This looks to be broken in the V2 write command. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
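To make the nullability mismatch above concrete, here is a small sketch (assuming a local SparkSession; the Record wrapper class is added just for the example) showing the schema Spark derives from the case class: the nested Int field `b` comes out non-nullable, and writing that into a table whose `col_st.b` is nullable is what the V2 write resolution currently rejects.

{code:scala}
// Sketch of the nullability mismatch described above. Fields backed by Scala
// primitives (Int, Boolean, ...) are non-nullable in the derived schema, while
// the target table may declare the same nested field as nullable.
import org.apache.spark.sql.SparkSession

case class StructData(a: String, b: Int)
case class Record(col_st: StructData)   // hypothetical wrapper for the example

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(Record(StructData("x", 1))).toDF()
df.printSchema()
// root
//  |-- col_st: struct (nullable = true)
//  |    |-- a: string (nullable = true)
//  |    |-- b: integer (nullable = false)  <- non-nullable; inserting this into
//                                             a nullable target column should
//                                             still be allowed
{code}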
[jira] [Comment Edited] (SPARK-29536) PySpark does not work with Python 3.8.0
[ https://issues.apache.org/jira/browse/SPARK-29536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213558#comment-17213558 ] Dongjoon Hyun edited comment on SPARK-29536 at 10/14/20, 3:18 AM: -- Hi, [~hyukjin.kwon]. Apache Spark 2.4.7 also fails. I will update the affected version. {code} $ bin/pyspark Python 3.8.5 (default, Sep 10 2020, 11:46:28) [Clang 11.0.0 (clang-1100.0.33.16)] on darwin Type "help", "copyright", "credits" or "license" for more information. Traceback (most recent call last): File "/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/shell.py", line 31, in from pyspark import SparkConf File "/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/__init__.py", line 51, in from pyspark.context import SparkContext File "/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/context.py", line 31, in from pyspark import accumulators File "/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/accumulators.py", line 97, in from pyspark.serializers import read_int, PickleSerializer File "/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/serializers.py", line 72, in from pyspark import cloudpickle File "/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/cloudpickle.py", line 145, in _cell_set_template_code = _make_cell_set_template_code() File "/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/cloudpickle.py", line 126, in _make_cell_set_template_code return types.CodeType( TypeError: an integer is required (got type bytes) >>> {code} was (Author: dongjoon): Hi, [~hyukjin.kwon]. Apache Spark 2.4.7 also fails. I will update the affected version. {code} $ current_pyspark Python 3.8.5 (default, Sep 10 2020, 11:46:28) [Clang 11.0.0 (clang-1100.0.33.16)] on darwin Type "help", "copyright", "credits" or "license" for more information. 
Traceback (most recent call last): File "/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/shell.py", line 31, in from pyspark import SparkConf File "/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/__init__.py", line 51, in from pyspark.context import SparkContext File "/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/context.py", line 31, in from pyspark import accumulators File "/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/accumulators.py", line 97, in from pyspark.serializers import read_int, PickleSerializer File "/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/serializers.py", line 72, in from pyspark import cloudpickle File "/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/cloudpickle.py", line 145, in _cell_set_template_code = _make_cell_set_template_code() File "/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/cloudpickle.py", line 126, in _make_cell_set_template_code return types.CodeType( TypeError: an integer is required (got type bytes) >>> {code} > PySpark does not work with Python 3.8.0 > --- > > Key: SPARK-29536 > URL: https://issues.apache.org/jira/browse/SPARK-29536 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > Fix For: 3.0.0 > > > You open a shell and run arbitrary codes: > {code} > File "/.../3.8/lib/python3.8/runpy.py", line 183, in _run_module_as_main > mod_name, mod_spec, code = _get_module_details(mod_name, _Error) > File "/.../3.8/lib/python3.8/runpy.py", line 109, in _get_module_details > __import__(pkg_name) > File /.../workspace/forked/spark/python/pyspark/__init__.py", line 51, in > > from pyspark.context import SparkContext > File "/.../spark/python/pyspark/context.py", line 31, in > from pyspark import accumulators > File "/.../python/pyspark/accumulators.py", line 97, in > from pyspark.serializers import read_int, PickleSerializer > File "/.../python/pyspark/serializers.py", line 71, in > from pyspark import cloudpickle > File "/.../python/pyspark/cloudpickle.py", line 152, in > _cell_set_template_code = _make_cell_set_template_code() > File "/.../spark/python/pyspark/cloudpickle.py", line 133, in > _make_cell_set_template_code > return types.CodeType( > TypeError: an integer is required (got type bytes) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (SPARK-29536) PySpark does not work with Python 3.8.0
[ https://issues.apache.org/jira/browse/SPARK-29536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29536: -- Affects Version/s: 2.4.7 > PySpark does not work with Python 3.8.0 > --- > > Key: SPARK-29536 > URL: https://issues.apache.org/jira/browse/SPARK-29536 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 2.4.7, 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > Fix For: 3.0.0 > > > You open a shell and run arbitrary codes: > {code} > File "/.../3.8/lib/python3.8/runpy.py", line 183, in _run_module_as_main > mod_name, mod_spec, code = _get_module_details(mod_name, _Error) > File "/.../3.8/lib/python3.8/runpy.py", line 109, in _get_module_details > __import__(pkg_name) > File /.../workspace/forked/spark/python/pyspark/__init__.py", line 51, in > > from pyspark.context import SparkContext > File "/.../spark/python/pyspark/context.py", line 31, in > from pyspark import accumulators > File "/.../python/pyspark/accumulators.py", line 97, in > from pyspark.serializers import read_int, PickleSerializer > File "/.../python/pyspark/serializers.py", line 71, in > from pyspark import cloudpickle > File "/.../python/pyspark/cloudpickle.py", line 152, in > _cell_set_template_code = _make_cell_set_template_code() > File "/.../spark/python/pyspark/cloudpickle.py", line 133, in > _make_cell_set_template_code > return types.CodeType( > TypeError: an integer is required (got type bytes) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29536) PySpark does not work with Python 3.8.0
[ https://issues.apache.org/jira/browse/SPARK-29536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213558#comment-17213558 ] Dongjoon Hyun commented on SPARK-29536: --- Hi, [~hyukjin.kwon]. Apache Spark 2.4.7 also fails. I will update the affected version. {code} $ current_pyspark Python 3.8.5 (default, Sep 10 2020, 11:46:28) [Clang 11.0.0 (clang-1100.0.33.16)] on darwin Type "help", "copyright", "credits" or "license" for more information. Traceback (most recent call last): File "/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/shell.py", line 31, in from pyspark import SparkConf File "/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/__init__.py", line 51, in from pyspark.context import SparkContext File "/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/context.py", line 31, in from pyspark import accumulators File "/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/accumulators.py", line 97, in from pyspark.serializers import read_int, PickleSerializer File "/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/serializers.py", line 72, in from pyspark import cloudpickle File "/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/cloudpickle.py", line 145, in _cell_set_template_code = _make_cell_set_template_code() File "/Users/dongjoon/APACHE/spark-release/spark-2.4.7-bin-hadoop2.7/python/pyspark/cloudpickle.py", line 126, in _make_cell_set_template_code return types.CodeType( TypeError: an integer is required (got type bytes) >>> {code} > PySpark does not work with Python 3.8.0 > --- > > Key: SPARK-29536 > URL: https://issues.apache.org/jira/browse/SPARK-29536 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > Fix For: 3.0.0 > > > You open a shell and run arbitrary codes: > {code} > File "/.../3.8/lib/python3.8/runpy.py", line 183, in _run_module_as_main > mod_name, mod_spec, code = _get_module_details(mod_name, _Error) > File "/.../3.8/lib/python3.8/runpy.py", line 109, in _get_module_details > __import__(pkg_name) > File /.../workspace/forked/spark/python/pyspark/__init__.py", line 51, in > > from pyspark.context import SparkContext > File "/.../spark/python/pyspark/context.py", line 31, in > from pyspark import accumulators > File "/.../python/pyspark/accumulators.py", line 97, in > from pyspark.serializers import read_int, PickleSerializer > File "/.../python/pyspark/serializers.py", line 71, in > from pyspark import cloudpickle > File "/.../python/pyspark/cloudpickle.py", line 152, in > _cell_set_template_code = _make_cell_set_template_code() > File "/.../spark/python/pyspark/cloudpickle.py", line 133, in > _make_cell_set_template_code > return types.CodeType( > TypeError: an integer is required (got type bytes) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33134) Incorrect nested complex JSON fields raise an exception
[ https://issues.apache.org/jira/browse/SPARK-33134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33134. -- Fix Version/s: 3.0.2 3.1.0 Resolution: Fixed Issue resolved by pull request 30032 [https://github.com/apache/spark/pull/30032] > Incorrect nested complex JSON fields raise an exception > --- > > Key: SPARK-33134 > URL: https://issues.apache.org/jira/browse/SPARK-33134 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.2, 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.1.0, 3.0.2 > > > The code below: > {code:scala} > val pokerhand_raw = Seq("""[{"cards": [19], "playerId": > 123456}]""").toDF("events") > val event = new StructType() > .add("playerId", LongType) > .add("cards", ArrayType( > new StructType() > .add("id", LongType) > .add("rank", StringType))) > val pokerhand_events = pokerhand_raw > .select(explode(from_json($"events", ArrayType(event))).as("event")) > pokerhand_events.show > {code} > throw the exception in the PERMISSIVE mode (default): > {code:java} > Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to > org.apache.spark.sql.catalyst.util.ArrayData > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48) > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48) > at > org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195) > at > org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560) > at > org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461) > at > org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313) > at > org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108) > {code} > The same works in Spark 2.4: > {code:scala} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.4.6 > /_/ > Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265) > ... > scala> pokerhand_events.show() > +-+ > |event| > +-+ > +-+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33134) Incorrect nested complex JSON fields raise an exception
[ https://issues.apache.org/jira/browse/SPARK-33134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33134: Assignee: Maxim Gekk > Incorrect nested complex JSON fields raise an exception > --- > > Key: SPARK-33134 > URL: https://issues.apache.org/jira/browse/SPARK-33134 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.2, 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > The code below: > {code:scala} > val pokerhand_raw = Seq("""[{"cards": [19], "playerId": > 123456}]""").toDF("events") > val event = new StructType() > .add("playerId", LongType) > .add("cards", ArrayType( > new StructType() > .add("id", LongType) > .add("rank", StringType))) > val pokerhand_events = pokerhand_raw > .select(explode(from_json($"events", ArrayType(event))).as("event")) > pokerhand_events.show > {code} > throw the exception in the PERMISSIVE mode (default): > {code:java} > Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to > org.apache.spark.sql.catalyst.util.ArrayData > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48) > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48) > at > org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195) > at > org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560) > at > org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461) > at > org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313) > at > org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108) > {code} > The same works in Spark 2.4: > {code:scala} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.4.6 > /_/ > Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265) > ... > scala> pokerhand_events.show() > +-+ > |event| > +-+ > +-+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33138) unify temp view and permanent view behaviors
[ https://issues.apache.org/jira/browse/SPARK-33138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213550#comment-17213550 ] Thejdeep commented on SPARK-33138: -- Can I look into this, please? > unify temp view and permanent view behaviors > > > Key: SPARK-33138 > URL: https://issues.apache.org/jira/browse/SPARK-33138 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Leanken.Lin >Priority: Major > > Currently, the temp view catalog stores a mapping from a temp view's name to its > logicalPlan, while a permanent view stored in the HMS keeps its original SQL text. > So for a permanent view, each time it is referenced its SQL text is > parsed, analyzed, optimized and planned again with the current SQLConf and > SparkSession context, so its result can change whenever that SQLConf and context > differ. > To unify the behaviors of temp views and permanent views, it is proposed that > we keep the original SQL text for both temp and permanent views, and also > record the SQLConf at the time the view was created. Each time the view is > referenced, we use that snapshot SQLConf to parse-analyze-optimize-plan the > SQL text; this way we can make sure the output of the created view stays > stable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
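To show why recording the creating SQLConf matters, a minimal sketch (assuming a local session; permanent views are re-analyzed from their stored SQL text at read time, as the description above says): changing a config that affects analysis can change the result of an already-created view.

{code:scala}
// Sketch of the behavior the proposal targets: a permanent view is re-parsed
// and re-analyzed from its stored SQL text on every read, so a config that
// changes expression semantics changes the view's output after creation.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("CREATE OR REPLACE VIEW v AS SELECT CAST('abc' AS INT) AS c")
spark.sql("SELECT * FROM v").show()   // c is NULL under non-ANSI cast semantics

spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT * FROM v").show()   // the same view now fails the cast at
                                      // runtime, because the SQL text is
                                      // re-analyzed with the current SQLConf
{code}

Capturing the SQLConf at creation time, as proposed, would pin the first behavior for both temp and permanent views.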
[jira] [Created] (SPARK-33142) SQL temp view should store SQL text as well
Leanken.Lin created SPARK-33142: --- Summary: SQL temp view should store SQL text as well Key: SPARK-33142 URL: https://issues.apache.org/jira/browse/SPARK-33142 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Leanken.Lin TODO -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33141) capture SQL configs when creating permanent views
Leanken.Lin created SPARK-33141: --- Summary: capture SQL configs when creating permanent views Key: SPARK-33141 URL: https://issues.apache.org/jira/browse/SPARK-33141 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Leanken.Lin TODO -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33140) make Analyzer and Optimizer rules using SQLConf.get
Leanken.Lin created SPARK-33140: --- Summary: make Analyzer and Optimizer rules using SQLConf.get Key: SPARK-33140 URL: https://issues.apache.org/jira/browse/SPARK-33140 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Leanken.Lin TODO -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33139) protect setActiveSession and clearActiveSession
Leanken.Lin created SPARK-33139: --- Summary: protect setActiveSession and clearActiveSession Key: SPARK-33139 URL: https://issues.apache.org/jira/browse/SPARK-33139 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Leanken.Lin TODO -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33138) unify temp view and permanent view behaviors
[ https://issues.apache.org/jira/browse/SPARK-33138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin updated SPARK-33138: Description: Currently, temp view store mapping of temp view name and its logicalPlan, and permanent view store in HMS stores its origin SQL text. So for permanent view, when try to refer the permanent view, its SQL text will be parse-analyze-optimize-plan again with current SQLConf and SparkSession context, so it might keep changing when the SQLConf and context is different each time. In order the unify the behaviors of temp view and permanent view, proposed that we keep its origin SQLText for both temp and permanent view, and also keep record of the SQLConf when the view was created. Each time we try to refer the view, we using the Snapshot SQLConf to parse-analyze-optimize-plan the SQLText, in this way, we can make sure the output of the created view to be stable. was:Currently, temp view store mapping of temp view name and its logicalPlan, and permanent view store in HMS stores its origin SQL text. So for permanent view, when try to refer the permanent view, its SQL text will be parse-analyze-optimize-plan again with current SQLConf and SparkSession context, so it might keep changing the SQLConf and context is different each time. So, in order the unify the behaviors of temp view and permanent view, propose that we keep its origin SQLText for both temp and permanent view, and also keep record of the SQLConf when the view was created. Each time we try to refer the view, we using the Snapshot SQLConf to parse-analyze-optimize-plan the SQLText, in this way, we can make sure the output of the created view to be stable. > unify temp view and permanent view behaviors > > > Key: SPARK-33138 > URL: https://issues.apache.org/jira/browse/SPARK-33138 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Leanken.Lin >Priority: Major > > Currently, temp view store mapping of temp view name and its logicalPlan, and > permanent view store in HMS stores its origin SQL text. > So for permanent view, when try to refer the permanent view, its SQL text > will be parse-analyze-optimize-plan again with current SQLConf and > SparkSession context, so it might keep changing when the SQLConf and context > is different each time. > In order the unify the behaviors of temp view and permanent view, proposed > that we keep its origin SQLText for both temp and permanent view, and also > keep record of the SQLConf when the view was created. Each time we try to > refer the view, we using the Snapshot SQLConf to parse-analyze-optimize-plan > the SQLText, in this way, we can make sure the output of the created view to > be stable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33138) unify temp view and permanent view behaviors
Leanken.Lin created SPARK-33138: --- Summary: unify temp view and permanent view behaviors Key: SPARK-33138 URL: https://issues.apache.org/jira/browse/SPARK-33138 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Environment: Currently, a temp view stores a mapping from the view name to its logicalPlan, while a permanent view stored in the HMS keeps its original SQL text. So when a permanent view is referenced, its SQL text is parsed, analyzed, optimized, and planned again with the current SQLConf and SparkSession context, so the result may keep changing whenever the SQLConf and context differ. To unify the behaviors of temp views and permanent views, it is proposed that we keep the original SQL text for both temp and permanent views, and also record the SQLConf in effect when the view was created. Each time the view is referenced, we use that snapshot SQLConf to parse-analyze-optimize-plan the SQL text; this way, the output of the created view stays stable. Reporter: Leanken.Lin -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33138) unify temp view and permanent view behaviors
[ https://issues.apache.org/jira/browse/SPARK-33138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin updated SPARK-33138: Description: Currently, temp view store mapping of temp view name and its logicalPlan, and permanent view store in HMS stores its origin SQL text. So for permanent view, when try to refer the permanent view, its SQL text will be parse-analyze-optimize-plan again with current SQLConf and SparkSession context, so it might keep changing the SQLConf and context is different each time. So, in order the unify the behaviors of temp view and permanent view, propose that we keep its origin SQLText for both temp and permanent view, and also keep record of the SQLConf when the view was created. Each time we try to refer the view, we using the Snapshot SQLConf to parse-analyze-optimize-plan the SQLText, in this way, we can make sure the output of the created view to be stable. Environment: (was: Currently, temp view store mapping of temp view name and its logicalPlan, and permanent view store in HMS stores its origin SQL text. So for permanent view, when try to refer the permanent view, its SQL text will be parse-analyze-optimize-plan again with current SQLConf and SparkSession context, so it might keep changing the SQLConf and context is different each time. So, in order the unify the behaviors of temp view and permanent view, propose that we keep its origin SQLText for both temp and permanent view, and also keep record of the SQLConf when the view was created. Each time we try to refer the view, we using the Snapshot SQLConf to parse-analyze-optimize-plan the SQLText, in this way, we can make sure the output of the created view to be stable.) > unify temp view and permanent view behaviors > > > Key: SPARK-33138 > URL: https://issues.apache.org/jira/browse/SPARK-33138 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Leanken.Lin >Priority: Major > > Currently, temp view store mapping of temp view name and its logicalPlan, and > permanent view store in HMS stores its origin SQL text. So for permanent > view, when try to refer the permanent view, its SQL text will be > parse-analyze-optimize-plan again with current SQLConf and SparkSession > context, so it might keep changing the SQLConf and context is different each > time. So, in order the unify the behaviors of temp view and permanent view, > propose that we keep its origin SQLText for both temp and permanent view, and > also keep record of the SQLConf when the view was created. Each time we try > to refer the view, we using the Snapshot SQLConf to > parse-analyze-optimize-plan the SQLText, in this way, we can make sure the > output of the created view to be stable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33137) Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (PostgreSQL dialect)
Huaxin Gao created SPARK-33137: -- Summary: Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (PostgreSQL dialect) Key: SPARK-33137 URL: https://issues.apache.org/jira/browse/SPARK-33137 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Huaxin Gao Override the default SQL strings for ALTER TABLE UPDATE COLUMN TYPE and ALTER TABLE UPDATE COLUMN NULLABILITY in the PostgreSQL JDBC dialect according to the official PostgreSQL documentation, and write PostgreSQL integration tests for JDBC. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
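For reference, the two statements the dialect needs to emit are standard PostgreSQL ALTER TABLE syntax. The helper functions below are illustrative only; they are not the Spark dialect API (the actual override points live in the org.apache.spark.sql.jdbc package), and the table/column names in the usage comment are placeholders.

{code:scala}
// Hypothetical helpers (names are ours, not Spark's) that build the PostgreSQL
// statements corresponding to the two ALTER TABLE operations named above.
def updateColumnTypeSql(table: String, column: String, newType: String): String =
  s"""ALTER TABLE $table ALTER COLUMN "$column" TYPE $newType"""

def updateColumnNullabilitySql(table: String, column: String, nullable: Boolean): String = {
  val action = if (nullable) "DROP NOT NULL" else "SET NOT NULL"
  s"""ALTER TABLE $table ALTER COLUMN "$column" $action"""
}

// updateColumnTypeSql("public.users", "age", "BIGINT")
//   ==> ALTER TABLE public.users ALTER COLUMN "age" TYPE BIGINT
// updateColumnNullabilitySql("public.users", "age", nullable = false)
//   ==> ALTER TABLE public.users ALTER COLUMN "age" SET NOT NULL
{code}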
[jira] [Commented] (SPARK-33136) Handling nullability for complex types is broken during resolution of V2 write command
[ https://issues.apache.org/jira/browse/SPARK-33136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213461#comment-17213461 ] Apache Spark commented on SPARK-33136: -- User 'HeartSaVioR' has created a pull request for this issue: https://github.com/apache/spark/pull/30033 > Handling nullability for complex types is broken during resolution of V2 > write command > -- > > Key: SPARK-33136 > URL: https://issues.apache.org/jira/browse/SPARK-33136 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.1.0 >Reporter: Jungtaek Lim >Priority: Major > > I figured out Spark 3.x cannot write to complex type with nullable if > matching column type in DataFrame is non-nullable. > For example, > {code:java} > case class StructData(a: String, b: Int) > case class Data(col_b: Boolean, col_i: Int, col_l: Long, col_f: Float, col_d: > Double, col_s: String, col_fi: Array[Byte], col_bi: Array[Byte], col_de: > Double, col_st: StructData, col_li: Seq[String], col_ma: Map[Int, > String]){code} > `col_st.b` would be non-nullable in DataFrame, which should not matter when > we insert from DataFrame to the table which has `col_st.b` as nullable. > (non-nullable to nullable should be possible) > This looks to be broken in V2 write command. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33136) Handling nullability for complex types is broken during resolution of V2 write command
[ https://issues.apache.org/jira/browse/SPARK-33136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33136: Assignee: Apache Spark > Handling nullability for complex types is broken during resolution of V2 > write command > -- > > Key: SPARK-33136 > URL: https://issues.apache.org/jira/browse/SPARK-33136 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.1.0 >Reporter: Jungtaek Lim >Assignee: Apache Spark >Priority: Major > > I figured out Spark 3.x cannot write to complex type with nullable if > matching column type in DataFrame is non-nullable. > For example, > {code:java} > case class StructData(a: String, b: Int) > case class Data(col_b: Boolean, col_i: Int, col_l: Long, col_f: Float, col_d: > Double, col_s: String, col_fi: Array[Byte], col_bi: Array[Byte], col_de: > Double, col_st: StructData, col_li: Seq[String], col_ma: Map[Int, > String]){code} > `col_st.b` would be non-nullable in DataFrame, which should not matter when > we insert from DataFrame to the table which has `col_st.b` as nullable. > (non-nullable to nullable should be possible) > This looks to be broken in V2 write command. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33136) Handling nullability for complex types is broken during resolution of V2 write command
[ https://issues.apache.org/jira/browse/SPARK-33136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33136: Assignee: (was: Apache Spark) > Handling nullability for complex types is broken during resolution of V2 > write command > -- > > Key: SPARK-33136 > URL: https://issues.apache.org/jira/browse/SPARK-33136 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.1.0 >Reporter: Jungtaek Lim >Priority: Major > > I figured out Spark 3.x cannot write to complex type with nullable if > matching column type in DataFrame is non-nullable. > For example, > {code:java} > case class StructData(a: String, b: Int) > case class Data(col_b: Boolean, col_i: Int, col_l: Long, col_f: Float, col_d: > Double, col_s: String, col_fi: Array[Byte], col_bi: Array[Byte], col_de: > Double, col_st: StructData, col_li: Seq[String], col_ma: Map[Int, > String]){code} > `col_st.b` would be non-nullable in DataFrame, which should not matter when > we insert from DataFrame to the table which has `col_st.b` as nullable. > (non-nullable to nullable should be possible) > This looks to be broken in V2 write command. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33136) Handling nullability for complex types is broken during resolution of V2 write command
[ https://issues.apache.org/jira/browse/SPARK-33136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213448#comment-17213448 ] Jungtaek Lim commented on SPARK-33136: -- will submit a PR soon. > Handling nullability for complex types is broken during resolution of V2 > write command > -- > > Key: SPARK-33136 > URL: https://issues.apache.org/jira/browse/SPARK-33136 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.1.0 >Reporter: Jungtaek Lim >Priority: Major > > I figured out Spark 3.x cannot write to complex type with nullable if > matching column type in DataFrame is non-nullable. > For example, > {code:java} > case class StructData(a: String, b: Int) > case class Data(col_b: Boolean, col_i: Int, col_l: Long, col_f: Float, col_d: > Double, col_s: String, col_fi: Array[Byte], col_bi: Array[Byte], col_de: > Double, col_st: StructData, col_li: Seq[String], col_ma: Map[Int, > String]){code} > `col_st.b` would be non-nullable in DataFrame, which should not matter when > we insert from DataFrame to the table which has `col_st.b` as nullable. > (non-nullable to nullable should be possible) > This looks to be broken in V2 write command. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33136) Handling nullability for complex types is broken during resolution of V2 write command
Jungtaek Lim created SPARK-33136: Summary: Handling nullability for complex types is broken during resolution of V2 write command Key: SPARK-33136 URL: https://issues.apache.org/jira/browse/SPARK-33136 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.1, 3.0.0, 3.1.0 Reporter: Jungtaek Lim I figured out Spark 3.x cannot write to complex type with nullable if matching column type in DataFrame is non-nullable. For example, {code:java} case class StructData(a: String, b: Int) case class Data(col_b: Boolean, col_i: Int, col_l: Long, col_f: Float, col_d: Double, col_s: String, col_fi: Array[Byte], col_bi: Array[Byte], col_de: Double, col_st: StructData, col_li: Seq[String], col_ma: Map[Int, String]){code} `col_st.b` would be non-nullable in DataFrame, which should not matter when we insert from DataFrame to the table which has `col_st.b` as nullable. (non-nullable to nullable should be possible) This looks to be broken in V2 write command. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
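Because the report above hinges on the DataFrame-side schema being stricter (non-nullable) than the table's, one user-side workaround is to rebuild the DataFrame with an everywhere-nullable schema before writing. This is only a sketch under the assumption that relaxing write-side nullability is acceptable; it is not the fix being discussed in the ticket.

{code:scala}
// Workaround sketch: recursively mark every field, array element, and map value
// as nullable, then rebuild the DataFrame with that relaxed schema.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types._

def asNullable(dt: DataType): DataType = dt match {
  case s: StructType =>
    StructType(s.fields.map(f => f.copy(dataType = asNullable(f.dataType), nullable = true)))
  case a: ArrayType =>
    a.copy(elementType = asNullable(a.elementType), containsNull = true)
  case m: MapType =>
    m.copy(keyType = asNullable(m.keyType), valueType = asNullable(m.valueType), valueContainsNull = true)
  case other => other
}

def withNullableSchema(spark: SparkSession, df: DataFrame): DataFrame =
  spark.createDataFrame(df.rdd, asNullable(df.schema).asInstanceOf[StructType])
{code}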
[jira] [Commented] (SPARK-29625) Spark Structure Streaming Kafka Wrong Reset Offset twice
[ https://issues.apache.org/jira/browse/SPARK-29625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213336#comment-17213336 ] L. C. Hsieh commented on SPARK-29625: - How did you specify the starting offset? > Spark Structure Streaming Kafka Wrong Reset Offset twice > > > Key: SPARK-29625 > URL: https://issues.apache.org/jira/browse/SPARK-29625 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Sandish Kumar HN >Priority: Major > > Spark Structure Streaming Kafka Reset Offset twice, once with right offsets > and second time with very old offsets > {code} > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-151 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-118 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-85 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-52 to offset 122677634. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-19 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-52 to offset 120504922.* > [2019-10-28 19:27:40,153] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO ContextCleaner: Cleaned accumulator 810 > {code} > which is causing a Data loss issue. > {code} > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > ERROR StreamExecution: Query [id = d62ca9e4-6650-454f-8691-a3d576d1e4ba, > runId = 3946389f-222b-495c-9ab2-832c0422cbbb] terminated with error > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - > java.lang.IllegalStateException: Partition topic-52's offset was changed from > 122677598 to 120504922, some data may have been missed. > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - Some data may have > been lost because they are not available in Kafka any more; either the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - data was aged out > by Kafka or the topic may have been deleted before all the data in the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - topic was > processed. If you don't want your streaming query to fail on such cases, set > the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - source option > "failOnDataLoss" to "false". 
> [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource.org$apache$spark$sql$kafka010$KafkaSource$$reportDataLoss(KafkaSource.scala:329) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:283) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:281) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$class.filter(TraversableLike.scala:259) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} I
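For context on the question above, both the starting offset and the data-loss behaviour are plain options of the Structured Streaming Kafka source. A minimal sketch follows; the broker address, topic name, and checkpoint path are placeholders.

{code:scala}
// Minimal Kafka source with the two options relevant to this report:
// startingOffsets (what the comment above asks about) and failOnDataLoss
// (the option named in the quoted error message).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-offsets-example").getOrCreate()

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "topic")
  .option("startingOffsets", "earliest") // or "latest", or a JSON map of partition -> offset
  .option("failOnDataLoss", "false")     // keep the query running on aged-out offsets
  .load()

val query = stream.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/kafka-offsets-example")
  .start()
{code}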
[jira] [Updated] (SPARK-33134) Incorrect nested complex JSON fields raise an exception
[ https://issues.apache.org/jira/browse/SPARK-33134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-33134: --- Description: The code below: {code:scala} val pokerhand_raw = Seq("""[{"cards": [19], "playerId": 123456}]""").toDF("events") val event = new StructType() .add("playerId", LongType) .add("cards", ArrayType( new StructType() .add("id", LongType) .add("rank", StringType))) val pokerhand_events = pokerhand_raw .select(explode(from_json($"events", ArrayType(event))).as("event")) pokerhand_events.show {code} throw the exception in the PERMISSIVE mode (default): {code:java} Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48) at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48) at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195) at org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560) at org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461) at org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313) at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108) {code} The same works in Spark 2.4: {code:scala} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.4.6 /_/ Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265) ... scala> pokerhand_events.show() +-+ |event| +-+ +-+ {code} was: The code below: {code:scala} val pokerhand_raw = Seq("""[{"cards": [11], "playerId": 583651}]""").toDF("events") val event = new StructType() .add("playerId", LongType) .add("cards", ArrayType( new StructType() .add("id", LongType) .add("rank", StringType))) val pokerhand_events = pokerhand_raw .select(explode(from_json($"events", ArrayType(event))).as("event")) pokerhand_events.show {code} throw the exception in the PERMISSIVE mode (default): {code:java} Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48) at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48) at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195) at org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560) at org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461) at org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313) at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108) {code} The same works in Spark 2.4: {code:scala} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.4.6 /_/ Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265) ... 
scala> pokerhand_events.show() +-+ |event| +-+ +-+ {code} > Incorrect nested complex JSON fields raise an exception > --- > > Key: SPARK-33134 > URL: https://issues.apache.org/jira/browse/SPARK-33134 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.2, 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > The code below: > {code:scala} > val pokerhand_raw = Seq("""[{"cards": [19], "playerId": > 123456}]""").toDF("events") > val event = new StructType() > .add("playerId", LongType) > .add("cards", ArrayType( > new StructType() > .add("id", LongType) > .add("rank", StringType))) > val pokerhand_events = pokerhand_raw > .select(explode(from_json($"events", ArrayType(event))).as("event")) > pokerhand_events.show > {code} > throw the exception in the PERMISSIVE mode (default): > {code:java} > Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to > org.apache.spark.sql.catalyst.util.ArrayData > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48) > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.ge
[jira] [Commented] (SPARK-33071) Join with ambiguous column succeeding but giving wrong output
[ https://issues.apache.org/jira/browse/SPARK-33071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213312#comment-17213312 ] Aoyuan Liao commented on SPARK-33071: - The master branch has the same bug. > Join with ambiguous column succeeding but giving wrong output > - > > Key: SPARK-33071 > URL: https://issues.apache.org/jira/browse/SPARK-33071 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.0.1 >Reporter: George >Priority: Major > Labels: correctness > > When joining two datasets where one column in each dataset is sourced from > the same input dataset, the join successfully runs, but does not select the > correct columns, leading to incorrect output. > Repro using pyspark: > {code:java} > sc.version > import pyspark.sql.functions as F > d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, 'units' > : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', 'sales': 1, > 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}] > input_df = spark.createDataFrame(d) > df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales')) > df2 = input_df.groupBy("key").agg(F.sum('units').alias('units')) > df1 = df1.filter(F.col("key") != F.lit("c")) > df2 = df2.filter(F.col("key") != F.lit("d")) > ret = df1.join(df2, df1.key == df2.key, "full").select( > df1["key"].alias("df1_key"), > df2["key"].alias("df2_key"), > df1["sales"], > df2["units"], > F.coalesce(df1["key"], df2["key"]).alias("key")) > ret.show() > ret.explain(){code} > output for 2.4.4: > {code:java} > >>> sc.version > u'2.4.4' > >>> import pyspark.sql.functions as F > >>> d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, > >>> 'units' : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', > >>> 'sales': 1, 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}] > >>> input_df = spark.createDataFrame(d) > >>> df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales')) > >>> df2 = input_df.groupBy("key").agg(F.sum('units').alias('units')) > >>> df1 = df1.filter(F.col("key") != F.lit("c")) > >>> df2 = df2.filter(F.col("key") != F.lit("d")) > >>> ret = df1.join(df2, df1.key == df2.key, "full").select( > ... df1["key"].alias("df1_key"), > ... df2["key"].alias("df2_key"), > ... df1["sales"], > ... df2["units"], > ... F.coalesce(df1["key"], df2["key"]).alias("key")) > 20/10/05 15:46:14 WARN Column: Constructing trivially true equals predicate, > 'key#213 = key#213'. Perhaps you need to use aliases. 
> >>> ret.show() > +---+---+-+-++ > |df1_key|df2_key|sales|units| key| > +---+---+-+-++ > | d| d|3| null| d| > | null| null| null|2|null| > | b| b|5| 10| b| > | a| a|3|6| a| > +---+---+-+-++>>> ret.explain() > == Physical Plan == > *(5) Project [key#213 AS df1_key#258, key#213 AS df2_key#259, sales#223L, > units#230L, coalesce(key#213, key#213) AS key#260] > +- SortMergeJoin [key#213], [key#237], FullOuter >:- *(2) Sort [key#213 ASC NULLS FIRST], false, 0 >: +- *(2) HashAggregate(keys=[key#213], functions=[sum(sales#214L)]) >: +- Exchange hashpartitioning(key#213, 200) >:+- *(1) HashAggregate(keys=[key#213], > functions=[partial_sum(sales#214L)]) >: +- *(1) Project [key#213, sales#214L] >: +- *(1) Filter (isnotnull(key#213) && NOT (key#213 = c)) >: +- Scan ExistingRDD[key#213,sales#214L,units#215L] >+- *(4) Sort [key#237 ASC NULLS FIRST], false, 0 > +- *(4) HashAggregate(keys=[key#237], functions=[sum(units#239L)]) > +- Exchange hashpartitioning(key#237, 200) > +- *(3) HashAggregate(keys=[key#237], > functions=[partial_sum(units#239L)]) >+- *(3) Project [key#237, units#239L] > +- *(3) Filter (isnotnull(key#237) && NOT (key#237 = d)) > +- Scan ExistingRDD[key#237,sales#238L,units#239L] > {code} > output for 3.0.1: > {code:java} > // code placeholder > >>> sc.version > u'3.0.1' > >>> import pyspark.sql.functions as F > >>> d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, > >>> 'units' : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', > >>> 'sales': 1, 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}] > >>> input_df = spark.createDataFrame(d) > /usr/local/lib/python2.7/site-packages/pyspark/sql/session.py:381: > UserWarning: inferring schema from dict is deprecated,please use > pyspark.sql.Row instead > warnings.warn("inferring schema from dict is deprecated," > >>> df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales')) > >>> df2 = input_df.groupBy("key").agg(F.sum('units').alias('units')) > >>> df1 = df1.filt
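Until the ambiguity itself is addressed, the usual way to express this join without the trivially-true predicate is to alias both sides and resolve columns through the aliases instead of through df1["key"] / df2["key"], which here resolve to the same attribute. The sketch below redoes the repro in Scala and rebuilds the input data so it is self-contained (it assumes a spark session, as in spark-shell).

{code:scala}
// Alias-based formulation of the same join; columns are referenced via the
// "l" and "r" aliases rather than via the originating DataFrames.
import org.apache.spark.sql.functions.{coalesce, col, lit, sum}
import spark.implicits._

val input_df = Seq(("a", 1, 2), ("a", 2, 4), ("b", 5, 10), ("c", 1, 2), ("d", 3, 6))
  .toDF("key", "sales", "units")

val df1 = input_df.groupBy("key").agg(sum("sales").as("sales")).filter(col("key") =!= lit("c"))
val df2 = input_df.groupBy("key").agg(sum("units").as("units")).filter(col("key") =!= lit("d"))

val ret = df1.alias("l").join(df2.alias("r"), col("l.key") === col("r.key"), "full")
  .select(
    col("l.key").as("df1_key"),
    col("r.key").as("df2_key"),
    col("l.sales"),
    col("r.units"),
    coalesce(col("l.key"), col("r.key")).as("key"))

ret.show()
{code}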
[jira] [Commented] (SPARK-33134) Incorrect nested complex JSON fields raise an exception
[ https://issues.apache.org/jira/browse/SPARK-33134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213224#comment-17213224 ] Apache Spark commented on SPARK-33134: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/30032 > Incorrect nested complex JSON fields raise an exception > --- > > Key: SPARK-33134 > URL: https://issues.apache.org/jira/browse/SPARK-33134 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.2, 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > The code below: > {code:scala} > val pokerhand_raw = Seq("""[{"cards": [11], "playerId": > 583651}]""").toDF("events") > val event = new StructType() > .add("playerId", LongType) > .add("cards", ArrayType( > new StructType() > .add("id", LongType) > .add("rank", StringType))) > val pokerhand_events = pokerhand_raw > .select(explode(from_json($"events", ArrayType(event))).as("event")) > pokerhand_events.show > {code} > throw the exception in the PERMISSIVE mode (default): > {code:java} > Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to > org.apache.spark.sql.catalyst.util.ArrayData > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48) > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48) > at > org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195) > at > org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560) > at > org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461) > at > org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313) > at > org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108) > {code} > The same works in Spark 2.4: > {code:scala} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.4.6 > /_/ > Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265) > ... > scala> pokerhand_events.show() > +-+ > |event| > +-+ > +-+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
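For contrast with the repro above, the same from_json + explode pipeline works when the declared schema matches the sample data, where "cards" is an array of longs rather than an array of structs. This sketch only illustrates what an "incorrect nested field" means here; it is not the fix under discussion, and it assumes a spark session as in spark-shell.

{code:scala}
// Same pipeline as the repro, but with a schema that matches the JSON payload.
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types._
import spark.implicits._

val raw = Seq("""[{"cards": [11], "playerId": 583651}]""").toDF("events")

val matchingEvent = new StructType()
  .add("playerId", LongType)
  .add("cards", ArrayType(LongType)) // matches the data, unlike ArrayType(StructType(...))

raw.select(explode(from_json(col("events"), ArrayType(matchingEvent))).as("event"))
  .show(false) // prints one row containing the parsed struct, no exception
{code}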
[jira] [Assigned] (SPARK-33132) The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 'NaN Undefined'
[ https://issues.apache.org/jira/browse/SPARK-33132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33132: - Assignee: akiyamaneko > The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as > 'NaN Undefined' > - > > Key: SPARK-33132 > URL: https://issues.apache.org/jira/browse/SPARK-33132 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: akiyamaneko >Assignee: akiyamaneko >Priority: Minor > Attachments: Stage Summary shows NaN undefined.png > > > Spark Version: 3.0.1 > Description: The 'Shuffle Read Size / Records' field in Stage Summary metrics > was shown as '*NaN Undefind*' when the *readBytes* value is negative, as > the attachment shows. > *curl > http:/hadoop001:18081/api/v1/applications/application_1601774913550_0225/stages/2/0/taskSummary* > { > "quantiles" : [ 0.05, 0.25, 0.5, 0.75, 0.95 ], > "executorDeserializeTime" : [ 7.0, 357.0, 390.0, 484.0, 492.0 ], > ... > "shuffleReadMetrics" : > { {color:#de350b}"*readBytes*"{color} : [ {color:#de350b}*-2.0*{color}, > 1775984.0, 1779759.0, 1781727.0, 1788426.0 ], "readRecords" : [ 2001.0, > 2002.0, 2002.0, 2002.0, 2002.0 ], ... } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33132) The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 'NaN Undefined'
[ https://issues.apache.org/jira/browse/SPARK-33132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33132. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30030 [https://github.com/apache/spark/pull/30030] > The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as > 'NaN Undefined' > - > > Key: SPARK-33132 > URL: https://issues.apache.org/jira/browse/SPARK-33132 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: akiyamaneko >Assignee: akiyamaneko >Priority: Minor > Fix For: 3.1.0 > > Attachments: Stage Summary shows NaN undefined.png > > > Spark Version: 3.0.1 > Description: The 'Shuffle Read Size / Records' field in Stage Summary metrics > was shown as '*NaN Undefind*' when the *readBytes* value is negative, as > the attachment shows. > *curl > http:/hadoop001:18081/api/v1/applications/application_1601774913550_0225/stages/2/0/taskSummary* > { > "quantiles" : [ 0.05, 0.25, 0.5, 0.75, 0.95 ], > "executorDeserializeTime" : [ 7.0, 357.0, 390.0, 484.0, 492.0 ], > ... > "shuffleReadMetrics" : > { {color:#de350b}"*readBytes*"{color} : [ {color:#de350b}*-2.0*{color}, > 1775984.0, 1779759.0, 1781727.0, 1788426.0 ], "readRecords" : [ 2001.0, > 2002.0, 2002.0, 2002.0, 2002.0 ], ... } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33071) Join with ambiguous column succeeding but giving wrong output
[ https://issues.apache.org/jira/browse/SPARK-33071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213205#comment-17213205 ] Wenchen Fan commented on SPARK-33071: - To confirm: is this a long-standing bug that 2.4, 3.0, and the master branch all give wrong (but same) result? > Join with ambiguous column succeeding but giving wrong output > - > > Key: SPARK-33071 > URL: https://issues.apache.org/jira/browse/SPARK-33071 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.0.1 >Reporter: George >Priority: Major > Labels: correctness > > When joining two datasets where one column in each dataset is sourced from > the same input dataset, the join successfully runs, but does not select the > correct columns, leading to incorrect output. > Repro using pyspark: > {code:java} > sc.version > import pyspark.sql.functions as F > d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, 'units' > : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', 'sales': 1, > 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}] > input_df = spark.createDataFrame(d) > df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales')) > df2 = input_df.groupBy("key").agg(F.sum('units').alias('units')) > df1 = df1.filter(F.col("key") != F.lit("c")) > df2 = df2.filter(F.col("key") != F.lit("d")) > ret = df1.join(df2, df1.key == df2.key, "full").select( > df1["key"].alias("df1_key"), > df2["key"].alias("df2_key"), > df1["sales"], > df2["units"], > F.coalesce(df1["key"], df2["key"]).alias("key")) > ret.show() > ret.explain(){code} > output for 2.4.4: > {code:java} > >>> sc.version > u'2.4.4' > >>> import pyspark.sql.functions as F > >>> d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, > >>> 'units' : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', > >>> 'sales': 1, 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}] > >>> input_df = spark.createDataFrame(d) > >>> df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales')) > >>> df2 = input_df.groupBy("key").agg(F.sum('units').alias('units')) > >>> df1 = df1.filter(F.col("key") != F.lit("c")) > >>> df2 = df2.filter(F.col("key") != F.lit("d")) > >>> ret = df1.join(df2, df1.key == df2.key, "full").select( > ... df1["key"].alias("df1_key"), > ... df2["key"].alias("df2_key"), > ... df1["sales"], > ... df2["units"], > ... F.coalesce(df1["key"], df2["key"]).alias("key")) > 20/10/05 15:46:14 WARN Column: Constructing trivially true equals predicate, > 'key#213 = key#213'. Perhaps you need to use aliases. 
> >>> ret.show() > +---+---+-+-++ > |df1_key|df2_key|sales|units| key| > +---+---+-+-++ > | d| d|3| null| d| > | null| null| null|2|null| > | b| b|5| 10| b| > | a| a|3|6| a| > +---+---+-+-++>>> ret.explain() > == Physical Plan == > *(5) Project [key#213 AS df1_key#258, key#213 AS df2_key#259, sales#223L, > units#230L, coalesce(key#213, key#213) AS key#260] > +- SortMergeJoin [key#213], [key#237], FullOuter >:- *(2) Sort [key#213 ASC NULLS FIRST], false, 0 >: +- *(2) HashAggregate(keys=[key#213], functions=[sum(sales#214L)]) >: +- Exchange hashpartitioning(key#213, 200) >:+- *(1) HashAggregate(keys=[key#213], > functions=[partial_sum(sales#214L)]) >: +- *(1) Project [key#213, sales#214L] >: +- *(1) Filter (isnotnull(key#213) && NOT (key#213 = c)) >: +- Scan ExistingRDD[key#213,sales#214L,units#215L] >+- *(4) Sort [key#237 ASC NULLS FIRST], false, 0 > +- *(4) HashAggregate(keys=[key#237], functions=[sum(units#239L)]) > +- Exchange hashpartitioning(key#237, 200) > +- *(3) HashAggregate(keys=[key#237], > functions=[partial_sum(units#239L)]) >+- *(3) Project [key#237, units#239L] > +- *(3) Filter (isnotnull(key#237) && NOT (key#237 = d)) > +- Scan ExistingRDD[key#237,sales#238L,units#239L] > {code} > output for 3.0.1: > {code:java} > // code placeholder > >>> sc.version > u'3.0.1' > >>> import pyspark.sql.functions as F > >>> d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, > >>> 'units' : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', > >>> 'sales': 1, 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}] > >>> input_df = spark.createDataFrame(d) > /usr/local/lib/python2.7/site-packages/pyspark/sql/session.py:381: > UserWarning: inferring schema from dict is deprecated,please use > pyspark.sql.Row instead > warnings.warn("inferring schema from dict is deprecated," > >>> df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales')) > >>> df2 = inp
[jira] [Assigned] (SPARK-33135) Use listLocatedStatus from FileSystem implementations
[ https://issues.apache.org/jira/browse/SPARK-33135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33135: Assignee: Apache Spark > Use listLocatedStatus from FileSystem implementations > - > > Key: SPARK-33135 > URL: https://issues.apache.org/jira/browse/SPARK-33135 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Chao Sun >Assignee: Apache Spark >Priority: Major > > {{HadoopFsUtils.parallelListLeafFiles}} currently only calls > {{listLocatedStatus}} when the {{FileSystem}} impl is > {{DistributedFileSystem}} or {{ViewFileSystem}}. For other types of > {{FileSystem}}, it calls {{listStatus}} and then subsequently calls > {{getFileBlockLocations}} on all the result {{FileStatus}}es. > In Hadoop client, {{listLocatedStatus}} is a well-defined API and in fact it > is often overridden by specific file system implementations, such as S3A. The > default {{listLocatedStatus}} also has similar behavior as it's done in Spark. > Therefore, instead of re-implement the logic in Spark itself, it's better to > rely on the {{FileSystem}}-specific implementation for {{listLocatedStatus}}, > which could include its own optimizations in the code path. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33129) Since the sbt version is now upgraded, old `test-only` needs to be replaced with `testOnly`
[ https://issues.apache.org/jira/browse/SPARK-33129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33129. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30028 [https://github.com/apache/spark/pull/30028] > Since the sbt version is now upgraded, old `test-only` needs to be replaced > with `testOnly` > --- > > Key: SPARK-33129 > URL: https://issues.apache.org/jira/browse/SPARK-33129 > Project: Spark > Issue Type: Bug > Components: Build, docs >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Assignee: Prashant Sharma >Priority: Major > Fix For: 3.1.0 > > > Follow up to SPARK-21708, updating the references to test-only with testOnly. > As the older syntax no longer works. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33129) Since the sbt version is now upgraded, old `test-only` needs to be replaced with `testOnly`
[ https://issues.apache.org/jira/browse/SPARK-33129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33129: - Assignee: Prashant Sharma > Since the sbt version is now upgraded, old `test-only` needs to be replaced > with `testOnly` > --- > > Key: SPARK-33129 > URL: https://issues.apache.org/jira/browse/SPARK-33129 > Project: Spark > Issue Type: Bug > Components: Build, docs >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Assignee: Prashant Sharma >Priority: Major > > Follow up to SPARK-21708, updating the references to test-only with testOnly. > As the older syntax no longer works. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
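For anyone still using the old syntax, the change is only in the task name; the project prefix and suite filter stay the same. The command lines below are illustrative (the suite pattern is just an example).

{noformat}
# sbt 0.13-era syntax that the docs previously showed (no longer works after the upgrade):
build/sbt "sql/test-only *SQLQuerySuite"

# sbt 1.x replacement:
build/sbt "sql/testOnly *SQLQuerySuite"
{noformat}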
[jira] [Assigned] (SPARK-33135) Use listLocatedStatus from FileSystem implementations
[ https://issues.apache.org/jira/browse/SPARK-33135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33135: Assignee: (was: Apache Spark) > Use listLocatedStatus from FileSystem implementations > - > > Key: SPARK-33135 > URL: https://issues.apache.org/jira/browse/SPARK-33135 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Chao Sun >Priority: Major > > {{HadoopFsUtils.parallelListLeafFiles}} currently only calls > {{listLocatedStatus}} when the {{FileSystem}} impl is > {{DistributedFileSystem}} or {{ViewFileSystem}}. For other types of > {{FileSystem}}, it calls {{listStatus}} and then subsequently calls > {{getFileBlockLocations}} on all the result {{FileStatus}}es. > In Hadoop client, {{listLocatedStatus}} is a well-defined API and in fact it > is often overridden by specific file system implementations, such as S3A. The > default {{listLocatedStatus}} also has similar behavior as it's done in Spark. > Therefore, instead of re-implement the logic in Spark itself, it's better to > rely on the {{FileSystem}}-specific implementation for {{listLocatedStatus}}, > which could include its own optimizations in the code path. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33135) Use listLocatedStatus from FileSystem implementations
[ https://issues.apache.org/jira/browse/SPARK-33135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213203#comment-17213203 ] Apache Spark commented on SPARK-33135: -- User 'sunchao' has created a pull request for this issue: https://github.com/apache/spark/pull/30019 > Use listLocatedStatus from FileSystem implementations > - > > Key: SPARK-33135 > URL: https://issues.apache.org/jira/browse/SPARK-33135 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Chao Sun >Priority: Major > > {{HadoopFsUtils.parallelListLeafFiles}} currently only calls > {{listLocatedStatus}} when the {{FileSystem}} impl is > {{DistributedFileSystem}} or {{ViewFileSystem}}. For other types of > {{FileSystem}}, it calls {{listStatus}} and then subsequently calls > {{getFileBlockLocations}} on all the result {{FileStatus}}es. > In Hadoop client, {{listLocatedStatus}} is a well-defined API and in fact it > is often overridden by specific file system implementations, such as S3A. The > default {{listLocatedStatus}} also has similar behavior as it's done in Spark. > Therefore, instead of re-implement the logic in Spark itself, it's better to > rely on the {{FileSystem}}-specific implementation for {{listLocatedStatus}}, > which could include its own optimizations in the code path. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33135) Use listLocatedStatus from FileSystem implementations
Chao Sun created SPARK-33135: Summary: Use listLocatedStatus from FileSystem implementations Key: SPARK-33135 URL: https://issues.apache.org/jira/browse/SPARK-33135 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.1 Reporter: Chao Sun {{HadoopFsUtils.parallelListLeafFiles}} currently only calls {{listLocatedStatus}} when the {{FileSystem}} impl is {{DistributedFileSystem}} or {{ViewFileSystem}}. For other types of {{FileSystem}}, it calls {{listStatus}} and then subsequently calls {{getFileBlockLocations}} on all the resulting {{FileStatus}}es. In the Hadoop client, {{listLocatedStatus}} is a well-defined API and is in fact often overridden by specific file system implementations, such as S3A. The default {{listLocatedStatus}} also behaves much like what Spark does today. Therefore, instead of re-implementing the logic in Spark itself, it is better to rely on the {{FileSystem}}-specific implementation of {{listLocatedStatus}}, which can include its own optimizations in that code path. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
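The two listing strategies described above look like this against the plain Hadoop FileSystem API. This is a sketch, not Spark's actual listing code, and the path is a placeholder; the ticket's point is to prefer the first form so implementations such as S3A can optimize it.

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{LocatedFileStatus, Path}
import scala.collection.mutable.ArrayBuffer

val path = new Path("s3a://bucket/table/")
val fs = path.getFileSystem(new Configuration())

// 1) Single call that returns each status together with its block locations.
val located = ArrayBuffer.empty[LocatedFileStatus]
val it = fs.listLocatedStatus(path)
while (it.hasNext) located += it.next()

// 2) What the generic code path effectively does today: list first, then ask
//    for block locations one file at a time.
val statuses = fs.listStatus(path)
val withLocations = statuses.filter(_.isFile).map { st =>
  st -> fs.getFileBlockLocations(st, 0, st.getLen)
}
{code}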
[jira] [Updated] (SPARK-33134) Incorrect nested complex JSON fields raise an exception
[ https://issues.apache.org/jira/browse/SPARK-33134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-33134: --- Description: The code below: {code:scala} val pokerhand_raw = Seq("""[{"cards": [11], "playerId": 583651}]""").toDF("events") val event = new StructType() .add("playerId", LongType) .add("cards", ArrayType( new StructType() .add("id", LongType) .add("rank", StringType))) val pokerhand_events = pokerhand_raw .select(explode(from_json($"events", ArrayType(event))).as("event")) pokerhand_events.show {code} throw the exception in the PERMISSIVE mode (default): {code:java} Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48) at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48) at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195) at org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560) at org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461) at org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313) at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108) {code} The same works in Spark 2.4: {code:scala} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.4.6 /_/ Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265) ... scala> pokerhand_events.show() +-+ |event| +-+ +-+ {code} was: The code below: {code:scala} val pokerhand_raw = Seq("""[{"cards": [11], "playerId": 583651}]""").toDF("events") val event = new StructType() .add("playerId", LongType) .add("cards", ArrayType( new StructType() .add("id", LongType) .add("rank", StringType))) val pokerhand_events = pokerhand_raw .select(explode(from_json($"events", ArrayType(event))).as("event")) pokerhand_events.show {code} throw the exception in the PERMISSIVE mode (default): {code:java} Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48) at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48) at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195) at org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560) at org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461) at org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313) at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108) {code} The same works in Spark 2.4: {code:scala} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.4.6 /_/ Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265) ... 
scala> pokerhand_events.show() +-+ |event| +-+ +-+ {code} > Incorrect nested complex JSON fields raise an exception > --- > > Key: SPARK-33134 > URL: https://issues.apache.org/jira/browse/SPARK-33134 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.2, 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > The code below: > {code:scala} > val pokerhand_raw = Seq("""[{"cards": [11], "playerId": > 583651}]""").toDF("events") > val event = new StructType() > .add("playerId", LongType) > .add("cards", ArrayType( > new StructType() > .add("id", LongType) > .add("rank", StringType))) > val pokerhand_events = pokerhand_raw > .select(explode(from_json($"events", ArrayType(event))).as("event")) > pokerhand_events.show > {code} > throw the exception in the PERMISSIVE mode (default): > {code:java} > Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to > org.apache.spark.sql.catalyst.util.ArrayData > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48) > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRo
[jira] [Commented] (SPARK-25547) Pluggable jdbc connection factory
[ https://issues.apache.org/jira/browse/SPARK-25547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213152#comment-17213152 ] Gabor Somogyi commented on SPARK-25547: --- [~fsauer65] The JDBC connection provider API has been added here: https://github.com/apache/spark/blob/dc697a8b598aea922ee6620d87f3ace2f7947231/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcConnectionProvider.scala#L36 Do you think we can close this Jira? > Pluggable jdbc connection factory > - > > Key: SPARK-25547 > URL: https://issues.apache.org/jira/browse/SPARK-25547 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 2.3.1 > Reporter: Frank Sauer > Priority: Major > > The ability to provide a custom connectionFactoryProvider via JDBCOptions, so that JdbcUtils.createConnectionFactory can produce a custom connection factory, would be very useful. In our case we needed the ability to load-balance connections to an AWS Aurora Postgres cluster by round-robining through the endpoints of the read replicas, since their own load balancing was insufficient. We got away with it by copying most of the spark jdbc package, providing this feature there, and changing the format from jdbc to our new package. However, it would be nice if this were supported out of the box via a new option in JDBCOptions providing the class name of a ConnectionFactoryProvider. I'm creating this Jira in order to submit a PR which I have ready to go. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
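A rough sketch of how the linked API could cover the round-robin use case from the description. The exact members of JdbcConnectionProvider differ across Spark versions (for example, a name method was added later), so the signatures below are an approximation and should be checked against the linked file for the version you build against. The endpoint names, database, and URL check are made up.

{code:scala}
// Sketch only: signatures approximated from the linked JdbcConnectionProvider API.
import java.sql.{Connection, Driver}
import java.util.Properties
import java.util.concurrent.atomic.AtomicInteger

import org.apache.spark.sql.jdbc.JdbcConnectionProvider

class RoundRobinAuroraProvider extends JdbcConnectionProvider {
  private val endpoints = Seq("replica-1.example:5432", "replica-2.example:5432")
  private val next = new AtomicInteger(0)

  // Claim only the URLs this provider should load-balance (placeholder check).
  override def canHandle(driver: Driver, options: Map[String, String]): Boolean =
    options.get("url").exists(_.startsWith("jdbc:postgresql://aurora"))

  // Rotate through the read-replica endpoints on each new connection.
  override def getConnection(driver: Driver, options: Map[String, String]): Connection = {
    val host = endpoints(math.abs(next.getAndIncrement()) % endpoints.size)
    driver.connect(s"jdbc:postgresql://$host/mydb", new Properties())
  }
}

// Such a provider is registered through Java's ServiceLoader, i.e. a
// META-INF/services/org.apache.spark.sql.jdbc.JdbcConnectionProvider file
// naming the class above.
{code}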
[jira] [Commented] (SPARK-33128) mismatched input since Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-33128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213137#comment-17213137 ] Yang Jie commented on SPARK-33128: -- set `spark.sql.ansi.enabled` to true can work with Spark 3.0,I'm not sure if this is a bug. cc [~maropu] > mismatched input since Spark 3.0 > > > Key: SPARK-33128 > URL: https://issues.apache.org/jira/browse/SPARK-33128 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.1.0 >Reporter: Yuming Wang >Priority: Major > > Spark 2.4: > {noformat} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.4.4 > /_/ > Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java > 1.8.0_221) > Type in expressions to have them evaluated. > Type :help for more information. > scala> spark.sql("SELECT 1 UNION SELECT 1 UNION ALL SELECT 1").show > +---+ > | 1| > +---+ > | 1| > | 1| > +---+ > {noformat} > Spark 3.x: > {noformat} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.1.0-SNAPSHOT > /_/ > Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 14.0.1) > Type in expressions to have them evaluated. > Type :help for more information. > scala> spark.sql("SELECT 1 UNION SELECT 1 UNION ALL SELECT 1").show > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input 'SELECT' expecting {, ';'}(line 1, pos 15) > == SQL == > SELECT 1 UNION SELECT 1 UNION ALL SELECT 1 > ---^^^ > at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130) > at > org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:51) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:81) > at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:610) > at > org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) > at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:610) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:769) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:607) > ... 47 elided > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
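To reproduce the comparison in a single Spark 3.x spark-shell session, including the workaround mentioned in the comment above (whether this setting should be required at all is exactly what the ticket questions):

{code:scala}
val q = "SELECT 1 UNION SELECT 1 UNION ALL SELECT 1"

// With default settings on 3.0/3.1 this throws the ParseException quoted above:
// spark.sql(q).show()

spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql(q).show() // per the comment, this parses; the expected result is the two rows shown in the 2.4 output
{code}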
[jira] [Assigned] (SPARK-13860) TPCDS query 39 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-13860: --- Assignee: Leanken.Lin > TPCDS query 39 returns wrong results compared to TPC official result set > - > > Key: SPARK-13860 > URL: https://issues.apache.org/jira/browse/SPARK-13860 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.1.1, 2.2.0 >Reporter: JESSE CHEN >Assignee: Leanken.Lin >Priority: Major > Labels: bulk-closed, tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 39 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > q39a - 3 extra rows in SparkSQL output (eg. > [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733]) ; q39b > - 3 extra rows in SparkSQL output (eg. > [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733]) > Actual results 39a: > {noformat} > [1,265,1,324.75,1.2438391781531353,1,265,2,329.0,1.0151581328149208] > [1,363,1,499.5,1.031941572270649,1,363,2,321.0,1.1411766752007977] > [1,679,1,373.75,1.0955498064867504,1,679,2,417.5,1.042970994259454] > [1,695,1,450.75,1.0835888283564505,1,695,2,368.75,1.1356494125569416] > [1,789,1,357.25,1.03450938027956,1,789,2,410.0,1.0284221852702604] > [1,815,1,216.5,1.1702270938111008,1,815,2,150.5,1.3057281471249382] > [1,827,1,271.75,1.1046890134130438,1,827,2,424.75,1.1653198631238286] > [1,1041,1,382.5,1.284808399803008,1,1041,2,424.75,1.000577271456812] > [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733] > [1,1569,1,212.0,1.630213519639535,1,1569,2,239.25,1.2641513267800557] > [1,1623,1,338.25,1.1285483279713715,1,1623,2,261.3,1.2717809002195564] > [1,2581,1,448.5,1.060429041250449,1,2581,2,476.25,1.0362984739390064] > [1,2705,1,246.25,1.0120308357959693,1,2705,2,294.7,1.0742134101583702] > [1,3131,1,393.75,1.0037613982687346,1,3131,2,480.5,1.0669144981482768] > [1,3291,1,374.5,1.195189833087008,1,3291,2,265.25,1.572972106948466] > [1,3687,1,279.75,1.4260909081999698,1,3687,2,157.25,1.4534340882531784] > [1,4955,1,495.25,1.0318296151625301,1,4955,2,322.5,1.1693842343776149] > [1,5627,1,282.75,1.5657032366359889,1,5627,2,297.5,1.2084286841430678] > [1,7017,1,175.5,1.0427454215644427,1,7017,2,321.3,1.0183356932936254] > [1,7317,1,366.3,1.025466403613547,1,7317,2,378.0,1.2172513189920555] > [1,7569,1,430.5,1.0874396852180854,1,7569,2,360.25,1.047005559314515] > [1,7999,1,166.25,1.7924231710846223,1,7999,2,375.3,1.008092263550718] > [1,8319,1,306.75,1.1615378040478215,1,8319,2,276.0,1.1420996385609428] > [1,8443,1,327.75,1.256718374192724,1,8443,2,332.5,1.0044167259988928] > [1,8583,1,319.5,1.024108893111539,1,8583,2,310.25,1.2358813775861328] > [1,8591,1,398.0,1.1478168692042447,1,8591,2,355.75,1.0024472149348966] > [1,8611,1,300.5,1.5191545184147954,1,8611,2,243.75,1.2342122780960432] > [1,9081,1,367.0,1.0878932141280895,1,9081,2,435.0,1.0330530776324107] > [1,9357,1,351.7,1.1902922622025887,1,9357,2,427.0,1.0438583026358363] > [1,9449,1,406.25,1.0183183104803557,1,9449,2,175.0,1.0544779796296408] > [1,9713,1,242.5,1.1035044355064203,1,9713,2,393.0,1.208474608738988] > [1,9809,1,479.0,1.0189602512117633,1,9809,2,317.5,1.0614142074924882] > [1,9993,1,417.75,1.0099832672435247,1,9993,2,204.5,1.552870745350107] > [1,10127,1,239.75,1.0561770587198123,1,10127,2,359.25,1.1857980403742183] > [1,11159,1,407.25,1.0785507154337637,1,11159,2,250.0,1.334757905639321] > [1,11277,1,211.25,1.2615858275316627,1,11277,2,330.75,1.0808767951625093] > 
[1,11937,1,344.5,1.085804026843784,1,11937,2,200.34,1.0638527063883725] > [1,12373,1,387.75,1.1014904822941258,1,12373,2,306.0,1.0761744390394028] > [1,12471,1,365.25,1.0607570183728479,1,12471,2,327.25,1.0547560580567852] > [1,12625,1,279.0,1.3016560542373208,1,12625,2,443.25,1.0604958838068959] > [1,12751,1,280.75,1.10833057888089,1,12751,2,369.3,1.3416504398884601] > [1,12779,1,331.0,1.041690207320035,1,12779,2,359.0,1.028978056175258] > [1,13077,1,367.7,1.345523904195734,1,13077,2,358.7,1.5132429058096555] > [1,13191,1,260.25,1.063569632291568,1,13191,2,405.0,1.0197999172180061] > [1,13561,1,335.25,1.2609616961776389,1,13561,2,240.0,1.0513604502245155] > [1,13935,1,311.75,1.0399289695412326,1,13935,2,275.0,1.0367527180321774] > [1,14687,1,358.0,1.4369356919381713,1,14687,2,187.0,1.5493631531474956] > [1,14719,1,209.0,1.0411509639707628,1,14719,2,489.0,1.376616882800804] > [1,15345,1,148.5,1.5295784035794024,1,15345,2,246.5,1.5087987747231526] > [1,15427,1,482.75,1.0124238928335043,1,15427,2,333.25,1.2724770126308678] > [1,15647,1,201.66,1.2857931876095743,1
[jira] [Reopened] (SPARK-13860) TPCDS query 39 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reopened SPARK-13860: - > TPCDS query 39 returns wrong results compared to TPC official result set > - > > Key: SPARK-13860 > URL: https://issues.apache.org/jira/browse/SPARK-13860 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.1.1, 2.2.0 >Reporter: JESSE CHEN >Assignee: Leanken.Lin >Priority: Major > Labels: bulk-closed, tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 39 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > q39a - 3 extra rows in SparkSQL output (eg. > [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733]) ; q39b > - 3 extra rows in SparkSQL output (eg. > [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733]) > Actual results 39a: > {noformat} > [1,265,1,324.75,1.2438391781531353,1,265,2,329.0,1.0151581328149208] > [1,363,1,499.5,1.031941572270649,1,363,2,321.0,1.1411766752007977] > [1,679,1,373.75,1.0955498064867504,1,679,2,417.5,1.042970994259454] > [1,695,1,450.75,1.0835888283564505,1,695,2,368.75,1.1356494125569416] > [1,789,1,357.25,1.03450938027956,1,789,2,410.0,1.0284221852702604] > [1,815,1,216.5,1.1702270938111008,1,815,2,150.5,1.3057281471249382] > [1,827,1,271.75,1.1046890134130438,1,827,2,424.75,1.1653198631238286] > [1,1041,1,382.5,1.284808399803008,1,1041,2,424.75,1.000577271456812] > [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733] > [1,1569,1,212.0,1.630213519639535,1,1569,2,239.25,1.2641513267800557] > [1,1623,1,338.25,1.1285483279713715,1,1623,2,261.3,1.2717809002195564] > [1,2581,1,448.5,1.060429041250449,1,2581,2,476.25,1.0362984739390064] > [1,2705,1,246.25,1.0120308357959693,1,2705,2,294.7,1.0742134101583702] > [1,3131,1,393.75,1.0037613982687346,1,3131,2,480.5,1.0669144981482768] > [1,3291,1,374.5,1.195189833087008,1,3291,2,265.25,1.572972106948466] > [1,3687,1,279.75,1.4260909081999698,1,3687,2,157.25,1.4534340882531784] > [1,4955,1,495.25,1.0318296151625301,1,4955,2,322.5,1.1693842343776149] > [1,5627,1,282.75,1.5657032366359889,1,5627,2,297.5,1.2084286841430678] > [1,7017,1,175.5,1.0427454215644427,1,7017,2,321.3,1.0183356932936254] > [1,7317,1,366.3,1.025466403613547,1,7317,2,378.0,1.2172513189920555] > [1,7569,1,430.5,1.0874396852180854,1,7569,2,360.25,1.047005559314515] > [1,7999,1,166.25,1.7924231710846223,1,7999,2,375.3,1.008092263550718] > [1,8319,1,306.75,1.1615378040478215,1,8319,2,276.0,1.1420996385609428] > [1,8443,1,327.75,1.256718374192724,1,8443,2,332.5,1.0044167259988928] > [1,8583,1,319.5,1.024108893111539,1,8583,2,310.25,1.2358813775861328] > [1,8591,1,398.0,1.1478168692042447,1,8591,2,355.75,1.0024472149348966] > [1,8611,1,300.5,1.5191545184147954,1,8611,2,243.75,1.2342122780960432] > [1,9081,1,367.0,1.0878932141280895,1,9081,2,435.0,1.0330530776324107] > [1,9357,1,351.7,1.1902922622025887,1,9357,2,427.0,1.0438583026358363] > [1,9449,1,406.25,1.0183183104803557,1,9449,2,175.0,1.0544779796296408] > [1,9713,1,242.5,1.1035044355064203,1,9713,2,393.0,1.208474608738988] > [1,9809,1,479.0,1.0189602512117633,1,9809,2,317.5,1.0614142074924882] > [1,9993,1,417.75,1.0099832672435247,1,9993,2,204.5,1.552870745350107] > [1,10127,1,239.75,1.0561770587198123,1,10127,2,359.25,1.1857980403742183] > [1,11159,1,407.25,1.0785507154337637,1,11159,2,250.0,1.334757905639321] > [1,11277,1,211.25,1.2615858275316627,1,11277,2,330.75,1.0808767951625093] > 
[1,11937,1,344.5,1.085804026843784,1,11937,2,200.34,1.0638527063883725] > [1,12373,1,387.75,1.1014904822941258,1,12373,2,306.0,1.0761744390394028] > [1,12471,1,365.25,1.0607570183728479,1,12471,2,327.25,1.0547560580567852] > [1,12625,1,279.0,1.3016560542373208,1,12625,2,443.25,1.0604958838068959] > [1,12751,1,280.75,1.10833057888089,1,12751,2,369.3,1.3416504398884601] > [1,12779,1,331.0,1.041690207320035,1,12779,2,359.0,1.028978056175258] > [1,13077,1,367.7,1.345523904195734,1,13077,2,358.7,1.5132429058096555] > [1,13191,1,260.25,1.063569632291568,1,13191,2,405.0,1.0197999172180061] > [1,13561,1,335.25,1.2609616961776389,1,13561,2,240.0,1.0513604502245155] > [1,13935,1,311.75,1.0399289695412326,1,13935,2,275.0,1.0367527180321774] > [1,14687,1,358.0,1.4369356919381713,1,14687,2,187.0,1.5493631531474956] > [1,14719,1,209.0,1.0411509639707628,1,14719,2,489.0,1.376616882800804] > [1,15345,1,148.5,1.5295784035794024,1,15345,2,246.5,1.5087987747231526] > [1,15427,1,482.75,1.0124238928335043,1,15427,2,333.25,1.2724770126308678] > [1,15647,1,201.66,1.2857931876095743,1,15647,2,249.25,1.3648172990142
[jira] [Resolved] (SPARK-13860) TPCDS query 39 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-13860. - Fix Version/s: 3.1.0 Resolution: Fixed > TPCDS query 39 returns wrong results compared to TPC official result set > - > > Key: SPARK-13860 > URL: https://issues.apache.org/jira/browse/SPARK-13860 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.1.1, 2.2.0 >Reporter: JESSE CHEN >Assignee: Leanken.Lin >Priority: Major > Labels: bulk-closed, tpcds-result-mismatch > Fix For: 3.1.0 > > > Testing Spark SQL using TPC queries. Query 39 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > q39a - 3 extra rows in SparkSQL output (eg. > [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733]) ; q39b > - 3 extra rows in SparkSQL output (eg. > [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733]) > Actual results 39a: > {noformat} > [1,265,1,324.75,1.2438391781531353,1,265,2,329.0,1.0151581328149208] > [1,363,1,499.5,1.031941572270649,1,363,2,321.0,1.1411766752007977] > [1,679,1,373.75,1.0955498064867504,1,679,2,417.5,1.042970994259454] > [1,695,1,450.75,1.0835888283564505,1,695,2,368.75,1.1356494125569416] > [1,789,1,357.25,1.03450938027956,1,789,2,410.0,1.0284221852702604] > [1,815,1,216.5,1.1702270938111008,1,815,2,150.5,1.3057281471249382] > [1,827,1,271.75,1.1046890134130438,1,827,2,424.75,1.1653198631238286] > [1,1041,1,382.5,1.284808399803008,1,1041,2,424.75,1.000577271456812] > [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733] > [1,1569,1,212.0,1.630213519639535,1,1569,2,239.25,1.2641513267800557] > [1,1623,1,338.25,1.1285483279713715,1,1623,2,261.3,1.2717809002195564] > [1,2581,1,448.5,1.060429041250449,1,2581,2,476.25,1.0362984739390064] > [1,2705,1,246.25,1.0120308357959693,1,2705,2,294.7,1.0742134101583702] > [1,3131,1,393.75,1.0037613982687346,1,3131,2,480.5,1.0669144981482768] > [1,3291,1,374.5,1.195189833087008,1,3291,2,265.25,1.572972106948466] > [1,3687,1,279.75,1.4260909081999698,1,3687,2,157.25,1.4534340882531784] > [1,4955,1,495.25,1.0318296151625301,1,4955,2,322.5,1.1693842343776149] > [1,5627,1,282.75,1.5657032366359889,1,5627,2,297.5,1.2084286841430678] > [1,7017,1,175.5,1.0427454215644427,1,7017,2,321.3,1.0183356932936254] > [1,7317,1,366.3,1.025466403613547,1,7317,2,378.0,1.2172513189920555] > [1,7569,1,430.5,1.0874396852180854,1,7569,2,360.25,1.047005559314515] > [1,7999,1,166.25,1.7924231710846223,1,7999,2,375.3,1.008092263550718] > [1,8319,1,306.75,1.1615378040478215,1,8319,2,276.0,1.1420996385609428] > [1,8443,1,327.75,1.256718374192724,1,8443,2,332.5,1.0044167259988928] > [1,8583,1,319.5,1.024108893111539,1,8583,2,310.25,1.2358813775861328] > [1,8591,1,398.0,1.1478168692042447,1,8591,2,355.75,1.0024472149348966] > [1,8611,1,300.5,1.5191545184147954,1,8611,2,243.75,1.2342122780960432] > [1,9081,1,367.0,1.0878932141280895,1,9081,2,435.0,1.0330530776324107] > [1,9357,1,351.7,1.1902922622025887,1,9357,2,427.0,1.0438583026358363] > [1,9449,1,406.25,1.0183183104803557,1,9449,2,175.0,1.0544779796296408] > [1,9713,1,242.5,1.1035044355064203,1,9713,2,393.0,1.208474608738988] > [1,9809,1,479.0,1.0189602512117633,1,9809,2,317.5,1.0614142074924882] > [1,9993,1,417.75,1.0099832672435247,1,9993,2,204.5,1.552870745350107] > [1,10127,1,239.75,1.0561770587198123,1,10127,2,359.25,1.1857980403742183] > [1,11159,1,407.25,1.0785507154337637,1,11159,2,250.0,1.334757905639321] > [1,11277,1,211.25,1.2615858275316627,1,11277,2,330.75,1.0808767951625093] > 
[1,11937,1,344.5,1.085804026843784,1,11937,2,200.34,1.0638527063883725] > [1,12373,1,387.75,1.1014904822941258,1,12373,2,306.0,1.0761744390394028] > [1,12471,1,365.25,1.0607570183728479,1,12471,2,327.25,1.0547560580567852] > [1,12625,1,279.0,1.3016560542373208,1,12625,2,443.25,1.0604958838068959] > [1,12751,1,280.75,1.10833057888089,1,12751,2,369.3,1.3416504398884601] > [1,12779,1,331.0,1.041690207320035,1,12779,2,359.0,1.028978056175258] > [1,13077,1,367.7,1.345523904195734,1,13077,2,358.7,1.5132429058096555] > [1,13191,1,260.25,1.063569632291568,1,13191,2,405.0,1.0197999172180061] > [1,13561,1,335.25,1.2609616961776389,1,13561,2,240.0,1.0513604502245155] > [1,13935,1,311.75,1.0399289695412326,1,13935,2,275.0,1.0367527180321774] > [1,14687,1,358.0,1.4369356919381713,1,14687,2,187.0,1.5493631531474956] > [1,14719,1,209.0,1.0411509639707628,1,14719,2,489.0,1.376616882800804] > [1,15345,1,148.5,1.5295784035794024,1,15345,2,246.5,1.5087987747231526] > [1,15427,1,482.75,1.0124238928335043,1,15427,2,333.25,1.2724770126308678] >
[jira] [Assigned] (SPARK-33134) Incorrect nested complex JSON fields raise an exception
[ https://issues.apache.org/jira/browse/SPARK-33134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33134: Assignee: (was: Apache Spark) > Incorrect nested complex JSON fields raise an exception > --- > > Key: SPARK-33134 > URL: https://issues.apache.org/jira/browse/SPARK-33134 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.2, 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > The code below: > {code:scala} > val pokerhand_raw = Seq("""[{"cards": [11], "playerId": > 583651}]""").toDF("events") > val event = new StructType() > .add("playerId", LongType) > .add("cards", ArrayType( > new StructType() > .add("id", LongType) > .add("rank", StringType))) > val pokerhand_events = pokerhand_raw > .select(explode(from_json($"events", ArrayType(event))).as("event")) > pokerhand_events.show > {code} > throw the exception in the PERMISSIVE mode (default): > {code:java} > Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to > org.apache.spark.sql.catalyst.util.ArrayData > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48) > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48) > at > org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195) > at > org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560) > at > org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461) > at > org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313) > at > org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108) > {code} > The same works in Spark 2.4: > {code:scala} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.4.6 > /_/ > Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265) > ... > scala> pokerhand_events.show() > +-+ > |event| > +-+ > +-+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33134) Incorrect nested complex JSON fields raise an exception
[ https://issues.apache.org/jira/browse/SPARK-33134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33134: Assignee: Apache Spark > Incorrect nested complex JSON fields raise an exception > --- > > Key: SPARK-33134 > URL: https://issues.apache.org/jira/browse/SPARK-33134 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.2, 3.1.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > The code below: > {code:scala} > val pokerhand_raw = Seq("""[{"cards": [11], "playerId": > 583651}]""").toDF("events") > val event = new StructType() > .add("playerId", LongType) > .add("cards", ArrayType( > new StructType() > .add("id", LongType) > .add("rank", StringType))) > val pokerhand_events = pokerhand_raw > .select(explode(from_json($"events", ArrayType(event))).as("event")) > pokerhand_events.show > {code} > throw the exception in the PERMISSIVE mode (default): > {code:java} > Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to > org.apache.spark.sql.catalyst.util.ArrayData > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48) > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48) > at > org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195) > at > org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560) > at > org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461) > at > org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313) > at > org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108) > {code} > The same works in Spark 2.4: > {code:scala} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.4.6 > /_/ > Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265) > ... > scala> pokerhand_events.show() > +-+ > |event| > +-+ > +-+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33134) Incorrect nested complex JSON fields raise an exception
[ https://issues.apache.org/jira/browse/SPARK-33134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213109#comment-17213109 ] Apache Spark commented on SPARK-33134: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/30031 > Incorrect nested complex JSON fields raise an exception > --- > > Key: SPARK-33134 > URL: https://issues.apache.org/jira/browse/SPARK-33134 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.2, 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > The code below: > {code:scala} > val pokerhand_raw = Seq("""[{"cards": [11], "playerId": > 583651}]""").toDF("events") > val event = new StructType() > .add("playerId", LongType) > .add("cards", ArrayType( > new StructType() > .add("id", LongType) > .add("rank", StringType))) > val pokerhand_events = pokerhand_raw > .select(explode(from_json($"events", ArrayType(event))).as("event")) > pokerhand_events.show > {code} > throw the exception in the PERMISSIVE mode (default): > {code:java} > Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to > org.apache.spark.sql.catalyst.util.ArrayData > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48) > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48) > at > org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195) > at > org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560) > at > org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461) > at > org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313) > at > org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108) > {code} > The same works in Spark 2.4: > {code:scala} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.4.6 > /_/ > Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265) > ... > scala> pokerhand_events.show() > +-+ > |event| > +-+ > +-+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33134) Incorrect nested complex JSON fields raise an exception
[ https://issues.apache.org/jira/browse/SPARK-33134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213108#comment-17213108 ] Apache Spark commented on SPARK-33134: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/30031 > Incorrect nested complex JSON fields raise an exception > --- > > Key: SPARK-33134 > URL: https://issues.apache.org/jira/browse/SPARK-33134 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.2, 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > The code below: > {code:scala} > val pokerhand_raw = Seq("""[{"cards": [11], "playerId": > 583651}]""").toDF("events") > val event = new StructType() > .add("playerId", LongType) > .add("cards", ArrayType( > new StructType() > .add("id", LongType) > .add("rank", StringType))) > val pokerhand_events = pokerhand_raw > .select(explode(from_json($"events", ArrayType(event))).as("event")) > pokerhand_events.show > {code} > throw the exception in the PERMISSIVE mode (default): > {code:java} > Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to > org.apache.spark.sql.catalyst.util.ArrayData > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48) > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48) > at > org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195) > at > org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560) > at > org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461) > at > org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313) > at > org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108) > {code} > The same works in Spark 2.4: > {code:scala} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.4.6 > /_/ > Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265) > ... > scala> pokerhand_events.show() > +-+ > |event| > +-+ > +-+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33132) The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 'NaN Undefined'
[ https://issues.apache.org/jira/browse/SPARK-33132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-33132: Fix Version/s: (was: 3.0.2) > The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as > 'NaN Undefined' > - > > Key: SPARK-33132 > URL: https://issues.apache.org/jira/browse/SPARK-33132 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: akiyamaneko >Priority: Minor > Attachments: Stage Summary shows NaN undefined.png > > > Spark Version: 3.0.1 > Description: The 'Shuffle Read Size / Records' field in Stage Summary metrics > was shown as '*NaN Undefind*' when the *readBytes* value is negative, as > the attachment shows. > *curl > http:/hadoop001:18081/api/v1/applications/application_1601774913550_0225/stages/2/0/taskSummary* > { > "quantiles" : [ 0.05, 0.25, 0.5, 0.75, 0.95 ], > "executorDeserializeTime" : [ 7.0, 357.0, 390.0, 484.0, 492.0 ], > ... > "shuffleReadMetrics" : > { {color:#de350b}"*readBytes*"{color} : [ {color:#de350b}*-2.0*{color}, > 1775984.0, 1779759.0, 1781727.0, 1788426.0 ], "readRecords" : [ 2001.0, > 2002.0, 2002.0, 2002.0, 2002.0 ], ... } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33125) Improve the error when Lead and Lag are not allowed to specify window frame
[ https://issues.apache.org/jira/browse/SPARK-33125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-33125: --- Assignee: jiaan.geng > Improve the error when Lead and Lag are not allowed to specify window frame > --- > > Key: SPARK-33125 > URL: https://issues.apache.org/jira/browse/SPARK-33125 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > > Except for PostgreSQL, other data sources (for example: Vertica, Oracle, > Redshift, MySQL, Presto) do not allow specifying a window frame for the Lead > and Lag functions. > But the current error message is not clear enough: > {code:java} > Window Frame $f must match the required frame > {code} > The following error message would be clearer: > {code:java} > Cannot specify window frame for lead function > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33125) Improve the error when Lead and Lag are not allowed to specify window frame
[ https://issues.apache.org/jira/browse/SPARK-33125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33125. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30021 [https://github.com/apache/spark/pull/30021] > Improve the error when Lead and Lag are not allowed to specify window frame > --- > > Key: SPARK-33125 > URL: https://issues.apache.org/jira/browse/SPARK-33125 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.1.0 > > > Except for PostgreSQL, other data sources (for example: Vertica, Oracle, > Redshift, MySQL, Presto) do not allow specifying a window frame for the Lead > and Lag functions. > But the current error message is not clear enough: > {code:java} > Window Frame $f must match the required frame > {code} > The following error message would be clearer: > {code:java} > Cannot specify window frame for lead function > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
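A note for readers following this issue: the sketch below (assuming a spark-shell session, so a `spark` session is in scope) shows the query shape that hits this check, namely `lead` with an explicit window frame. It is expected to fail analysis; the pull request above only changes the error message that comes back.

{code:scala}
// Hypothetical repro of the rejected query shape: lead() with an explicit
// ROWS frame. Lead/Lag must use their default offset frame, so the analyzer
// raises the error whose wording this ticket improves.
spark.sql(
  """SELECT c1,
    |       lead(c1, 1) OVER (ORDER BY c1 ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS next_c1
    |FROM VALUES (1), (2), (3) AS t(c1)""".stripMargin).show()
{code}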
[jira] [Created] (SPARK-33134) Incorrect nested complex JSON fields raise an exception
Maxim Gekk created SPARK-33134: -- Summary: Incorrect nested complex JSON fields raise an exception Key: SPARK-33134 URL: https://issues.apache.org/jira/browse/SPARK-33134 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.2, 3.1.0 Reporter: Maxim Gekk The code below: {code:scala} val pokerhand_raw = Seq("""[{"cards": [11], "playerId": 583651}]""").toDF("events") val event = new StructType() .add("playerId", LongType) .add("cards", ArrayType( new StructType() .add("id", LongType) .add("rank", StringType))) val pokerhand_events = pokerhand_raw .select(explode(from_json($"events", ArrayType(event))).as("event")) pokerhand_events.show {code} throw the exception in the PERMISSIVE mode (default): {code:java} Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48) at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48) at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195) at org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560) at org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461) at org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313) at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108) {code} The same works in Spark 2.4: {code:scala} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.4.6 /_/ Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265) ... scala> pokerhand_events.show() +-+ |event| +-+ +-+ {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
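For comparison with the snippet in the description, here is a minimal sketch (assuming a spark-shell session, so `spark.implicits._` is already in scope) in which the declared schema matches the JSON payload; the report's point is that the mismatched schema above should degrade gracefully under PERMISSIVE mode rather than throw.

{code:scala}
// Hypothetical variant of the reported snippet: "cards" is declared as an
// array of longs, matching the JSON, so from_json parses the row without
// hitting the ClassCastException.
import org.apache.spark.sql.functions.{explode, from_json}
import org.apache.spark.sql.types._

val raw = Seq("""[{"cards": [11], "playerId": 583651}]""").toDF("events")

val matchingEvent = new StructType()
  .add("playerId", LongType)
  .add("cards", ArrayType(LongType))

raw.select(explode(from_json($"events", ArrayType(matchingEvent))).as("event")).show(false)
{code}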
[jira] [Resolved] (SPARK-33081) Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (DB2 dialect)
[ https://issues.apache.org/jira/browse/SPARK-33081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33081. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29972 [https://github.com/apache/spark/pull/29972] > Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of > columns (DB2 dialect) > -- > > Key: SPARK-33081 > URL: https://issues.apache.org/jira/browse/SPARK-33081 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.1.0 > > > Override the default SQL strings for: > * ALTER TABLE UPDATE COLUMN TYPE > * ALTER TABLE UPDATE COLUMN NULLABILITY > in the following DB2 JDBC dialect according to official documentation. > Write DB2 integration tests for JDBC. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33081) Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (DB2 dialect)
[ https://issues.apache.org/jira/browse/SPARK-33081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-33081: --- Assignee: Huaxin Gao > Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of > columns (DB2 dialect) > -- > > Key: SPARK-33081 > URL: https://issues.apache.org/jira/browse/SPARK-33081 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > > Override the default SQL strings for: > * ALTER TABLE UPDATE COLUMN TYPE > * ALTER TABLE UPDATE COLUMN NULLABILITY > in the following DB2 JDBC dialect according to official documentation. > Write DB2 integration tests for JDBC. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
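Roughly what the DB2 override amounts to, as an illustrative sketch (not the merged dialect code; the helper names here are hypothetical): DB2 expects `SET DATA TYPE` and `SET NOT NULL`/`DROP NOT NULL` clauses instead of the generic SQL strings.

{code:scala}
// Illustrative only: the kind of SQL strings a DB2 dialect would emit for the
// two ALTER TABLE operations named above.
def db2UpdateColumnType(table: String, col: String, newType: String): String =
  s"ALTER TABLE $table ALTER COLUMN $col SET DATA TYPE $newType"

def db2UpdateColumnNullability(table: String, col: String, nullable: Boolean): String = {
  val nullClause = if (nullable) "DROP NOT NULL" else "SET NOT NULL"
  s"ALTER TABLE $table ALTER COLUMN $col $nullClause"
}
{code}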
[jira] [Updated] (SPARK-33133) History server fails when loading invalid rolling event logs
[ https://issues.apache.org/jira/browse/SPARK-33133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam Binford updated SPARK-33133: - Description: We have run into an issue where our history server fails to load new applications, and when restarted, fails to load any applications at all. This happens when it encounters invalid rolling event log files. We encounter this with long running streaming applications. There seems to be two issues here that lead to problems: * It looks like our long running streaming applications event log directory is being cleaned up. The next time the application logs event data, it recreates the event log directory but without recreating the "appstatus" file. I don't know the full extent of this behavior or if something "wrong" is happening here. * The history server then reads this new folder, and throws an exception because the "appstatus" file doesn't exist in the rolling event log folder. This exception breaks the entire listing process, so no new applications will be read, and if restarted no applications at all will be read. There seems like a couple ways to go about fixing this, and I'm curious anyone's thoughts who knows more about how the history server works, specifically with rolling event logs: * Don't completely fail checking for new applications if one bad rolling event log folder is encountered. This seems like the simplest fix and makes sense to me, it already checks for a few other errors and ignores them. It doesn't necessarily fix the underlying issue that leads to this happening though. * Figure out why the in progress event log folder is being deleted and make sure that doesn't happen. Maybe this is supposed to happen? Or maybe we don't want to delete the top level folder and only delete event log files within the folders? Again I don't know the exact current behavior here with this. * When writing new event log data, make sure the folder and appstatus file exist every time, creating them again if not. Here's the stack trace we encounter when this happens, from 3.0.1 with a couple extra MRs backported that I hoped would fix the issue: {{2020-10-13 12:10:31,751 ERROR history.FsHistoryProvider: Exception in checking for event log updates2020-10-13 12:10:31,751 ERROR history.FsHistoryProvider: Exception in checking for event log updatesjava.lang.IllegalArgumentException: requirement failed: Log directory must contain an appstatus file! 
at scala.Predef$.require(Predef.scala:281) at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.files$lzycompute(EventLogFileReaders.scala:214) at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.files(EventLogFileReaders.scala:211) at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.eventLogFiles$lzycompute(EventLogFileReaders.scala:221) at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.eventLogFiles(EventLogFileReaders.scala:220) at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.lastEventLogFile(EventLogFileReaders.scala:272) at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.fileSizeForLastIndex(EventLogFileReaders.scala:240) at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7(FsHistoryProvider.scala:524) at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7$adapted(FsHistoryProvider.scala:466) at scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:256) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255) at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249) at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108) at scala.collection.TraversableLike.filter(TraversableLike.scala:347) at scala.collection.TraversableLike.filter$(TraversableLike.scala:347) at scala.collection.AbstractTraversable.filter(Traversable.scala:108) at org.apache.spark.deploy.history.FsHistoryProvider.checkForLogs(FsHistoryProvider.scala:466) at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$startPolling$3(FsHistoryProvider.scala:287) at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1302) at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$getRunner$1(FsHistoryProvider.scala:210) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at jav
[jira] [Resolved] (SPARK-32858) UnwrapCastInBinaryComparison: support other numeric types
[ https://issues.apache.org/jira/browse/SPARK-32858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32858. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29792 [https://github.com/apache/spark/pull/29792] > UnwrapCastInBinaryComparison: support other numeric types > - > > Key: SPARK-32858 > URL: https://issues.apache.org/jira/browse/SPARK-32858 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.1.0 > > > After SPARK-24994 is done, we need to follow-up and support more types other > than integral, e.g., float, double, decimal. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32858) UnwrapCastInBinaryComparison: support other numeric types
[ https://issues.apache.org/jira/browse/SPARK-32858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32858: --- Assignee: Chao Sun > UnwrapCastInBinaryComparison: support other numeric types > - > > Key: SPARK-32858 > URL: https://issues.apache.org/jira/browse/SPARK-32858 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > > After SPARK-24994 is done, we need to follow-up and support more types other > than integral, e.g., float, double, decimal. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
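For readers unfamiliar with the rule, an illustrative sketch (assuming a spark-shell session) of the predicate shape it targets: a comparison that casts a narrow numeric column up to a wider type. Unwrapping the cast lets the comparison run against the original column type, which keeps the filter pushable to data sources; this follow-up extends that to floating-point and decimal types.

{code:scala}
// Illustrative only: a filter of the form cast(<narrow column> AS DOUBLE) > <literal>,
// the pattern UnwrapCastInBinaryComparison rewrites. Inspect the optimized plan
// with explain().
val df = spark.range(100).selectExpr("CAST(id AS SHORT) AS short_col")
df.where("CAST(short_col AS DOUBLE) > 5.0").explain(true)
{code}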
[jira] [Created] (SPARK-33133) History server fails when loading invalid rolling event logs
Adam Binford created SPARK-33133: Summary: History server fails when loading invalid rolling event logs Key: SPARK-33133 URL: https://issues.apache.org/jira/browse/SPARK-33133 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.1 Reporter: Adam Binford We have run into an issue where our history server fails to load new applications, and when restarted, fails to load any applications at all. This happens when it encounters invalid rolling event log files. We encounter this with long running streaming applications. There seems to be two issues here that lead to problems: * It looks like our long running streaming applications event log directory is being cleaned up. The next time the application logs event data, it recreates the event log directory but without recreating the "appstatus" file. I don't know the full extent of this behavior or if something "wrong" is happening here. * The history server then reads this new folder, and throws an exception because the "appstatus" file doesn't exist in the rolling event log folder. This exception breaks the entire listing process, so no new applications will be read, and if restarted no applications at all will be read. There seems like a couple ways to go about fixing this, and I'm curious anyone's thoughts who knows more about how the history server works, specifically with rolling event logs: * Don't completely fail checking for new applications if one bad rolling event log folder is encountered. This seems like the simplest fix and makes sense to me, it already checks for a few other errors and ignores them. * Figure out why the in progress event log folder is being deleted and make sure that doesn't happen. Maybe this is supposed to happen? Or maybe we don't want to delete the top level folder and only delete event log files within the folders? Again I don't know the exact current behavior here with this. * When writing new event log data, make sure the folder and appstatus file exist every time, creating them again if not. Here's the stack trace we encounter when this happens, from 3.0.1 with a couple extra MRs backported that I hoped would fix the issue: {{2020-10-13 12:10:31,751 ERROR history.FsHistoryProvider: Exception in checking for event log updates2020-10-13 12:10:31,751 ERROR history.FsHistoryProvider: Exception in checking for event log updatesjava.lang.IllegalArgumentException: requirement failed: Log directory must contain an appstatus file! 
at scala.Predef$.require(Predef.scala:281) at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.files$lzycompute(EventLogFileReaders.scala:214) at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.files(EventLogFileReaders.scala:211) at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.eventLogFiles$lzycompute(EventLogFileReaders.scala:221) at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.eventLogFiles(EventLogFileReaders.scala:220) at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.lastEventLogFile(EventLogFileReaders.scala:272) at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.fileSizeForLastIndex(EventLogFileReaders.scala:240) at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7(FsHistoryProvider.scala:524) at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7$adapted(FsHistoryProvider.scala:466) at scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:256) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255) at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249) at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108) at scala.collection.TraversableLike.filter(TraversableLike.scala:347) at scala.collection.TraversableLike.filter$(TraversableLike.scala:347) at scala.collection.AbstractTraversable.filter(Traversable.scala:108) at org.apache.spark.deploy.history.FsHistoryProvider.checkForLogs(FsHistoryProvider.scala:466) at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$startPolling$3(FsHistoryProvider.scala:287) at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1302) at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$getRunner$1(FsHistoryProvider.scala:210) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.ut
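A minimal, self-contained sketch of the first option listed above (not the actual FsHistoryProvider code, whose internals differ): treat a directory whose reader fails validation as having nothing to load, so a single bad rolling event-log folder cannot abort the whole listing pass.

{code:scala}
// Hypothetical helper: keep only the candidates that pass validation, logging
// and skipping the ones that throw (for example the "Log directory must
// contain an appstatus file!" requirement failure seen in the stack trace).
def filterReadableLogs[A](candidates: Seq[A])(sizeOf: A => Long): Seq[(A, Long)] =
  candidates.flatMap { candidate =>
    try {
      Some(candidate -> sizeOf(candidate))
    } catch {
      case e: IllegalArgumentException =>
        Console.err.println(s"Skipping invalid event log $candidate: ${e.getMessage}")
        None
    }
  }
{code}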
[jira] [Resolved] (SPARK-33115) `kvstore` and `unsafe` doc tasks fail
[ https://issues.apache.org/jira/browse/SPARK-33115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33115. -- Fix Version/s: 3.1.0 3.0.2 Assignee: Denis Pyshev Resolution: Fixed Fixed in https://github.com/apache/spark/pull/30007 > `kvstore` and `unsafe` doc tasks fail > - > > Key: SPARK-33115 > URL: https://issues.apache.org/jira/browse/SPARK-33115 > Project: Spark > Issue Type: Bug > Components: Build, Documentation >Affects Versions: 3.1.0 >Reporter: Denis Pyshev >Assignee: Denis Pyshev >Priority: Minor > Fix For: 3.0.2, 3.1.0 > > > `build/sbt publishLocal` task fails in two modules: > {code:java} > [error] stack trace is suppressed; run last kvstore / Compile / doc for the > full output > [error] stack trace is suppressed; run last unsafe / Compile / doc for the > full output > {code} > {code:java} > sbt:spark-parent> kvstore/Compile/doc > [info] Main Java API documentation to > /home/gemelen/work/src/spark/common/kvstore/target/scala-2.12/api... > [error] > /home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1: > error: malformed HTML > [error] * An alias class for the type > "ConcurrentHashMap, Boolean>", which is used > [error] ^ > [error] > /home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1: > error: unknown tag: Object > [error] * An alias class for the type > "ConcurrentHashMap, Boolean>", which is used > [error] ^ > [error] > /home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1: > error: bad use of '>' > [error] * An alias class for the type > "ConcurrentHashMap, Boolean>", which is used > [error] > ^ > {code} > {code:java} > sbt:spark-parent> unsafe/Compile/doc > [info] Main Java API documentation to > /home/gemelen/work/src/spark/common/unsafe/target/scala-2.12/api... > [error] > /home/gemelen/work/src/spark/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:566:1: > error: malformed HTML > [error] * Trims whitespaces (<= ASCII 32) from both ends of this string. > [error] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33132) The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 'NaN Undefined'
[ https://issues.apache.org/jira/browse/SPARK-33132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33132: Assignee: Apache Spark > The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as > 'NaN Undefined' > - > > Key: SPARK-33132 > URL: https://issues.apache.org/jira/browse/SPARK-33132 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: echohlne >Assignee: Apache Spark >Priority: Minor > Fix For: 3.0.2 > > Attachments: Stage Summary shows NaN undefined.png > > > Spark Version: 3.0.1 > Description: The 'Shuffle Read Size / Records' field in Stage Summary metrics > was shown as '*NaN Undefind*' when the *readBytes* value is negative, as > the attachment shows. > *curl > http:/hadoop001:18081/api/v1/applications/application_1601774913550_0225/stages/2/0/taskSummary* > { > "quantiles" : [ 0.05, 0.25, 0.5, 0.75, 0.95 ], > "executorDeserializeTime" : [ 7.0, 357.0, 390.0, 484.0, 492.0 ], > ... > "shuffleReadMetrics" : > { {color:#de350b}"*readBytes*"{color} : [ {color:#de350b}*-2.0*{color}, > 1775984.0, 1779759.0, 1781727.0, 1788426.0 ], "readRecords" : [ 2001.0, > 2002.0, 2002.0, 2002.0, 2002.0 ], ... } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33132) The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 'NaN Undefined'
[ https://issues.apache.org/jira/browse/SPARK-33132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213073#comment-17213073 ] Apache Spark commented on SPARK-33132: -- User 'akiyamaneko' has created a pull request for this issue: https://github.com/apache/spark/pull/30030 > The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as > 'NaN Undefined' > - > > Key: SPARK-33132 > URL: https://issues.apache.org/jira/browse/SPARK-33132 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: echohlne >Priority: Minor > Fix For: 3.0.2 > > Attachments: Stage Summary shows NaN undefined.png > > > Spark Version: 3.0.1 > Description: The 'Shuffle Read Size / Records' field in Stage Summary metrics > was shown as '*NaN Undefind*' when the *readBytes* value is negative, as > the attachment shows. > *curl > http:/hadoop001:18081/api/v1/applications/application_1601774913550_0225/stages/2/0/taskSummary* > { > "quantiles" : [ 0.05, 0.25, 0.5, 0.75, 0.95 ], > "executorDeserializeTime" : [ 7.0, 357.0, 390.0, 484.0, 492.0 ], > ... > "shuffleReadMetrics" : > { {color:#de350b}"*readBytes*"{color} : [ {color:#de350b}*-2.0*{color}, > 1775984.0, 1779759.0, 1781727.0, 1788426.0 ], "readRecords" : [ 2001.0, > 2002.0, 2002.0, 2002.0, 2002.0 ], ... } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33132) The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 'NaN Undefined'
[ https://issues.apache.org/jira/browse/SPARK-33132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33132: Assignee: (was: Apache Spark) > The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as > 'NaN Undefined' > - > > Key: SPARK-33132 > URL: https://issues.apache.org/jira/browse/SPARK-33132 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: echohlne >Priority: Minor > Fix For: 3.0.2 > > Attachments: Stage Summary shows NaN undefined.png > > > Spark Version: 3.0.1 > Description: The 'Shuffle Read Size / Records' field in Stage Summary metrics > was shown as '*NaN Undefind*' when the *readBytes* value is negative, as > the attachment shows. > *curl > http:/hadoop001:18081/api/v1/applications/application_1601774913550_0225/stages/2/0/taskSummary* > { > "quantiles" : [ 0.05, 0.25, 0.5, 0.75, 0.95 ], > "executorDeserializeTime" : [ 7.0, 357.0, 390.0, 484.0, 492.0 ], > ... > "shuffleReadMetrics" : > { {color:#de350b}"*readBytes*"{color} : [ {color:#de350b}*-2.0*{color}, > 1775984.0, 1779759.0, 1781727.0, 1788426.0 ], "readRecords" : [ 2001.0, > 2002.0, 2002.0, 2002.0, 2002.0 ], ... } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33132) The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 'NaN Undefined'
[ https://issues.apache.org/jira/browse/SPARK-33132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] echohlne updated SPARK-33132: - Summary: The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 'NaN Undefined' (was: The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 'NaN Undefind') > The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as > 'NaN Undefined' > - > > Key: SPARK-33132 > URL: https://issues.apache.org/jira/browse/SPARK-33132 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: echohlne >Priority: Minor > Fix For: 3.0.2 > > Attachments: Stage Summary shows NaN undefined.png > > > Spark Version: 3.0.1 > Description: The 'Shuffle Read Size / Records' field in Stage Summary metrics > was shown as '*NaN Undefind*' when the *readBytes* value is negative, as > the attachment shows. > *curl > http:/hadoop001:18081/api/v1/applications/application_1601774913550_0225/stages/2/0/taskSummary* > { > "quantiles" : [ 0.05, 0.25, 0.5, 0.75, 0.95 ], > "executorDeserializeTime" : [ 7.0, 357.0, 390.0, 484.0, 492.0 ], > ... > "shuffleReadMetrics" : > { {color:#de350b}"*readBytes*"{color} : [ {color:#de350b}*-2.0*{color}, > 1775984.0, 1779759.0, 1781727.0, 1788426.0 ], "readRecords" : [ 2001.0, > 2002.0, 2002.0, 2002.0, 2002.0 ], ... } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33132) The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 'NaN Undefind'
[ https://issues.apache.org/jira/browse/SPARK-33132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] echohlne updated SPARK-33132: - Description: Spark Version: 3.0.1 Description: The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as '*NaN Undefind*' when the *readBytes* value is negative, as the attachment shows. *curl http:/hadoop001:18081/api/v1/applications/application_1601774913550_0225/stages/2/0/taskSummary* { "quantiles" : [ 0.05, 0.25, 0.5, 0.75, 0.95 ], "executorDeserializeTime" : [ 7.0, 357.0, 390.0, 484.0, 492.0 ], ... "shuffleReadMetrics" : { {color:#de350b}"*readBytes*"{color} : [ {color:#de350b}*-2.0*{color}, 1775984.0, 1779759.0, 1781727.0, 1788426.0 ], "readRecords" : [ 2001.0, 2002.0, 2002.0, 2002.0, 2002.0 ], ... } was: Spark Version: 3.0.1 Description: The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as '*NaN Undefind*' when the *readBytes* value is negative, as the attachment shows. *curl http:/hadoop001:18081/api/v1/applications/application_1601774913550_0225/stages/2/0/taskSummary* { "quantiles" : [ 0.05, 0.25, 0.5, 0.75, 0.95 ], "executorDeserializeTime" : [ 7.0, 357.0, 390.0, 484.0, 492.0 ], ... "shuffleReadMetrics" : { "*readBytes*" : [ *-2.0*, 1775984.0, 1779759.0, 1781727.0, 1788426.0 ], "readRecords" : [ 2001.0, 2002.0, 2002.0, 2002.0, 2002.0 ], ... } > The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as > 'NaN Undefind' > > > Key: SPARK-33132 > URL: https://issues.apache.org/jira/browse/SPARK-33132 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: echohlne >Priority: Minor > Fix For: 3.0.2 > > Attachments: Stage Summary shows NaN undefined.png > > > Spark Version: 3.0.1 > Description: The 'Shuffle Read Size / Records' field in Stage Summary metrics > was shown as '*NaN Undefind*' when the *readBytes* value is negative, as > the attachment shows. > *curl > http:/hadoop001:18081/api/v1/applications/application_1601774913550_0225/stages/2/0/taskSummary* > { > "quantiles" : [ 0.05, 0.25, 0.5, 0.75, 0.95 ], > "executorDeserializeTime" : [ 7.0, 357.0, 390.0, 484.0, 492.0 ], > ... > "shuffleReadMetrics" : > { {color:#de350b}"*readBytes*"{color} : [ {color:#de350b}*-2.0*{color}, > 1775984.0, 1779759.0, 1781727.0, 1788426.0 ], "readRecords" : [ 2001.0, > 2002.0, 2002.0, 2002.0, 2002.0 ], ... } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33132) The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 'NaN Undefind'
[ https://issues.apache.org/jira/browse/SPARK-33132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] echohlne updated SPARK-33132: - Attachment: Stage Summary shows NaN undefined.png > The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as > 'NaN Undefind' > > > Key: SPARK-33132 > URL: https://issues.apache.org/jira/browse/SPARK-33132 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: echohlne >Priority: Minor > Fix For: 3.0.2 > > Attachments: Stage Summary shows NaN undefined.png > > > Spark Version: 3.0.1 > Description: The 'Shuffle Read Size / Records' field in Stage Summary metrics > was shown as '*NaN Undefind*' when the *readBytes* value is negative, as > the attachment shows. > *curl > http:/hadoop001:18081/api/v1/applications/application_1601774913550_0225/stages/2/0/taskSummary* > { > "quantiles" : [ 0.05, 0.25, 0.5, 0.75, 0.95 ], > "executorDeserializeTime" : [ 7.0, 357.0, 390.0, 484.0, 492.0 ], > ... > "shuffleReadMetrics" : > { "*readBytes*" : [ *-2.0*, 1775984.0, 1779759.0, 1781727.0, 1788426.0 ], > "readRecords" : [ 2001.0, 2002.0, 2002.0, 2002.0, 2002.0 ], ... } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33132) The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 'NaN Undefind'
echohlne created SPARK-33132: Summary: The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as 'NaN Undefind' Key: SPARK-33132 URL: https://issues.apache.org/jira/browse/SPARK-33132 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.1 Reporter: echohlne Fix For: 3.0.2 Spark Version: 3.0.1 Description: The 'Shuffle Read Size / Records' field in Stage Summary metrics was shown as '*NaN Undefind*' when the *readBytes* value is negative, as the attachment shows. *curl http:/hadoop001:18081/api/v1/applications/application_1601774913550_0225/stages/2/0/taskSummary* { "quantiles" : [ 0.05, 0.25, 0.5, 0.75, 0.95 ], "executorDeserializeTime" : [ 7.0, 357.0, 390.0, 484.0, 492.0 ], ... "shuffleReadMetrics" : { "*readBytes*" : [ *-2.0*, 1775984.0, 1779759.0, 1781727.0, 1788426.0 ], "readRecords" : [ 2001.0, 2002.0, 2002.0, 2002.0, 2002.0 ], ... } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
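A sketch of the display-side guard this report implies (illustrative only; the real Spark UI formats these values in its own utility code): clamp a negative byte count before converting it to a human-readable unit, so a bad metric such as readBytes = -2.0 renders as "0.0 B" rather than "NaN Undefined".

{code:scala}
// Hypothetical formatter: negative inputs are clamped to zero instead of
// producing NaN during the unit conversion.
def formatBytes(bytes: Double): String = {
  val units = Seq("B", "KiB", "MiB", "GiB", "TiB")
  var value = math.max(bytes, 0.0)
  var idx = 0
  while (value >= 1024.0 && idx < units.length - 1) {
    value /= 1024.0
    idx += 1
  }
  f"$value%.1f ${units(idx)}"
}
{code}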
[jira] [Commented] (SPARK-33131) Fix grouping sets with having clause can not resolve qualified col name
[ https://issues.apache.org/jira/browse/SPARK-33131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213068#comment-17213068 ] Apache Spark commented on SPARK-33131: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/30029 > Fix grouping sets with having clause can not resolve qualified col name > --- > > Key: SPARK-33131 > URL: https://issues.apache.org/jira/browse/SPARK-33131 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Priority: Minor > > The method `ResolveAggregateFunctions.resolveFilterCondInAggregate` aims to > do two things: > 1. resolve the expressions in the having clause; > 2. push any extra aggregate expressions from the having clause into `Aggregate`. > However, only (2) is handled at the moment. If the having clause resolves > successfully but there is no extra aggregate expression, the resolution is ignored. > Here is an example: > {code:java} > -- Works: resolved by ResolveReferences > select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having c1 = > 1 > -- Works because of the extra expression c1 > select c1 as c2 from values (1) as t1(c1) group by grouping sets(t1.c1) > having t1.c1 = 1 > -- Fails > select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having > t1.c1 = 1{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33131) Fix grouping sets with having clause can not resolve qualified col name
[ https://issues.apache.org/jira/browse/SPARK-33131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33131: Assignee: Apache Spark > Fix grouping sets with having clause can not resolve qualified col name > --- > > Key: SPARK-33131 > URL: https://issues.apache.org/jira/browse/SPARK-33131 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Assignee: Apache Spark >Priority: Minor > > The method `ResolveAggregateFunctions.resolveFilterCondInAggregate` aims to > do two things: > 1. resolve the expressions in the having clause; > 2. push any extra aggregate expressions from the having clause into `Aggregate`. > However, only (2) is handled at the moment. If the having clause resolves > successfully but there is no extra aggregate expression, the resolution is ignored. > Here is an example: > {code:java} > -- Works: resolved by ResolveReferences > select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having c1 = > 1 > -- Works because of the extra expression c1 > select c1 as c2 from values (1) as t1(c1) group by grouping sets(t1.c1) > having t1.c1 = 1 > -- Fails > select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having > t1.c1 = 1{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33131) Fix grouping sets with having clause can not resolve qualified col name
[ https://issues.apache.org/jira/browse/SPARK-33131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33131: Assignee: (was: Apache Spark) > Fix grouping sets with having clause can not resolve qualified col name > --- > > Key: SPARK-33131 > URL: https://issues.apache.org/jira/browse/SPARK-33131 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Priority: Minor > > The method `ResolveAggregateFunctions.resolveFilterCondInAggregate` aims to do two things: > 1. resolve the expressions in the having clause; > 2. push the extra aggregate expressions from the having clause into `Aggregate`. > However, only (2) is handled at the moment: if the having clause is resolved successfully but there is no extra aggregate expression, the resolution is discarded. Here is an example: > {code:java} > -- Works: resolved by ResolveReferences > select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having c1 = 1 > -- Works: because of the extra expression c1 > select c1 as c2 from values (1) as t1(c1) group by grouping sets(t1.c1) having t1.c1 = 1 > -- Failed > select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having t1.c1 = 1{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33131) Fix grouping sets with having clause can not resolve qualified col name
ulysses you created SPARK-33131: --- Summary: Fix grouping sets with having clause can not resolve qualified col name Key: SPARK-33131 URL: https://issues.apache.org/jira/browse/SPARK-33131 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: ulysses you The method `ResolveAggregateFunctions.resolveFilterCondInAggregate` aims to do two things: 1. resolve the expressions in the having clause; 2. push the extra aggregate expressions from the having clause into `Aggregate`. However, only (2) is handled at the moment: if the having clause is resolved successfully but there is no extra aggregate expression, the resolution is discarded. Here is an example: {code:java} -- Works: resolved by ResolveReferences select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having c1 = 1 -- Works: because of the extra expression c1 select c1 as c2 from values (1) as t1(c1) group by grouping sets(t1.c1) having t1.c1 = 1 -- Failed select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having t1.c1 = 1{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
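For completeness, a small spark-shell reproduction of the failing case from the description above. The expected behaviour is that the qualified reference t1.c1 in the HAVING clause resolves just like the unqualified c1; this is only a repro sketch against an affected 3.1.0 build, not part of the fix.

{code:scala}
// Reproduction sketch for the failing case in the description.
// On an affected build this throws an AnalysisException because the qualified
// column t1.c1 in HAVING is not resolved; after the fix it should return one row.
val failing =
  """select c1 from values (1) as t1(c1)
    |group by grouping sets(t1.c1)
    |having t1.c1 = 1""".stripMargin

spark.sql(failing).show()
{code}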
[jira] [Issue Comment Deleted] (SPARK-32681) PySpark type hints support
[ https://issues.apache.org/jira/browse/SPARK-32681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] echohlne updated SPARK-32681: - Comment: was deleted (was: (y)) > PySpark type hints support > -- > > Key: SPARK-32681 > URL: https://issues.apache.org/jira/browse/SPARK-32681 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Maciej Szymkiewicz >Priority: Critical > > https://github.com/zero323/pyspark-stubs demonstrates a lot of benefits to > improve usability in PySpark by leveraging Python type hints. > By having the type hints in PySpark we can, for example: > - automatically document the input and output types > - leverage IDE for error detection and auto-completion > - have a cleaner definition and easier to understand. > This JIRA is an umbrella JIRA that targets to port > https://github.com/zero323/pyspark-stubs and related items to smoothly run > within PySpark. > It was also discussed in the dev mailing list: > http://apache-spark-developers-list.1001551.n3.nabble.com/Re-PySpark-Revisiting-PySpark-type-annotations-td26232.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32681) PySpark type hints support
[ https://issues.apache.org/jira/browse/SPARK-32681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213065#comment-17213065 ] echohlne commented on SPARK-32681: -- (y) > PySpark type hints support > -- > > Key: SPARK-32681 > URL: https://issues.apache.org/jira/browse/SPARK-32681 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Maciej Szymkiewicz >Priority: Critical > > https://github.com/zero323/pyspark-stubs demonstrates a lot of benefits to > improve usability in PySpark by leveraging Python type hints. > By having the type hints in PySpark we can, for example: > - automatically document the input and output types > - leverage IDE for error detection and auto-completion > - have a cleaner definition and easier to understand. > This JIRA is an umbrella JIRA that targets to port > https://github.com/zero323/pyspark-stubs and related items to smoothly run > within PySpark. > It was also discussed in the dev mailing list: > http://apache-spark-developers-list.1001551.n3.nabble.com/Re-PySpark-Revisiting-PySpark-type-annotations-td26232.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32295) Add not null and size > 0 filters before inner explode to benefit from predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-32295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-32295. -- Fix Version/s: 3.1.0 Assignee: Tanel Kiis Resolution: Fixed Resolved by https://github.com/apache/spark/pull/29092 > Add not null and size > 0 filters before inner explode to benefit from > predicate pushdown > - > > Key: SPARK-32295 > URL: https://issues.apache.org/jira/browse/SPARK-32295 > Project: Spark > Issue Type: Improvement > Components: Optimizer, SQL >Affects Versions: 3.1.0 >Reporter: Tanel Kiis >Assignee: Tanel Kiis >Priority: Major > Labels: performance > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33130) Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MsSqlServer dialect)
Prashant Sharma created SPARK-33130: --- Summary: Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MsSqlServer dialect) Key: SPARK-33130 URL: https://issues.apache.org/jira/browse/SPARK-33130 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Prashant Sharma Override the default SQL strings for: ALTER TABLE RENAME COLUMN ALTER TABLE UPDATE COLUMN NULLABILITY in the MsSqlServer JDBC dialect according to the official documentation. Write MsSqlServer integration tests for JDBC. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
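As a rough illustration of the SQL strings the MsSqlServer dialect would need to emit for the two statements listed above. Only the T-SQL syntax is taken from SQL Server documentation; the object and method names below are hypothetical and do not reflect the actual JdbcDialect hooks in Spark 3.1.

{code:scala}
// Sketch of the T-SQL that an MsSqlServer dialect override would generate.
object MsSqlServerDdlSketch {
  // SQL Server renames columns via the sp_rename stored procedure.
  def renameColumn(table: String, oldName: String, newName: String): String =
    s"EXEC sp_rename '$table.$oldName', '$newName', 'COLUMN'"

  // Changing nullability requires restating the column's data type.
  def updateColumnNullability(table: String, column: String,
                              dataType: String, nullable: Boolean): String = {
    val constraint = if (nullable) "NULL" else "NOT NULL"
    s"ALTER TABLE $table ALTER COLUMN $column $dataType $constraint"
  }
}

// Example output:
//   EXEC sp_rename 'people.age', 'years', 'COLUMN'
//   ALTER TABLE people ALTER COLUMN years INT NOT NULL
{code}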
[jira] [Commented] (SPARK-21708) Migrate build to sbt 1.3.13
[ https://issues.apache.org/jira/browse/SPARK-21708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213033#comment-17213033 ] Apache Spark commented on SPARK-21708: -- User 'ScrapCodes' has created a pull request for this issue: https://github.com/apache/spark/pull/30028 > Migrate build to sbt 1.3.13 > > > Key: SPARK-21708 > URL: https://issues.apache.org/jira/browse/SPARK-21708 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: PJ Fanning >Assignee: Denis Pyshev >Priority: Major > Fix For: 3.1.0 > > > Should improve sbt build times. > http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html > According to https://github.com/sbt/sbt/issues/3424, we will need to change > the HTTP location where we get the sbt-launch jar. > Other related issues: > SPARK-14401 > https://github.com/typesafehub/sbteclipse/issues/343 > https://github.com/jrudolph/sbt-dependency-graph/issues/134 > https://github.com/AlpineNow/junit_xml_listener/issues/6 > https://github.com/spray/sbt-revolver/issues/62 > https://github.com/ihji/sbt-antlr4/issues/14 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21708) Migrate build to sbt 1.3.13
[ https://issues.apache.org/jira/browse/SPARK-21708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213032#comment-17213032 ] Apache Spark commented on SPARK-21708: -- User 'ScrapCodes' has created a pull request for this issue: https://github.com/apache/spark/pull/30028 > Migrate build to sbt 1.3.13 > > > Key: SPARK-21708 > URL: https://issues.apache.org/jira/browse/SPARK-21708 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: PJ Fanning >Assignee: Denis Pyshev >Priority: Major > Fix For: 3.1.0 > > > Should improve sbt build times. > http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html > According to https://github.com/sbt/sbt/issues/3424, we will need to change > the HTTP location where we get the sbt-launch jar. > Other related issues: > SPARK-14401 > https://github.com/typesafehub/sbteclipse/issues/343 > https://github.com/jrudolph/sbt-dependency-graph/issues/134 > https://github.com/AlpineNow/junit_xml_listener/issues/6 > https://github.com/spray/sbt-revolver/issues/62 > https://github.com/ihji/sbt-antlr4/issues/14 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33129) Since the sbt version is now upgraded, old `test-only` needs to be replaced with `testOnly`
[ https://issues.apache.org/jira/browse/SPARK-33129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213031#comment-17213031 ] Apache Spark commented on SPARK-33129: -- User 'ScrapCodes' has created a pull request for this issue: https://github.com/apache/spark/pull/30028 > Since the sbt version is now upgraded, old `test-only` needs to be replaced > with `testOnly` > --- > > Key: SPARK-33129 > URL: https://issues.apache.org/jira/browse/SPARK-33129 > Project: Spark > Issue Type: Bug > Components: Build, docs >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Priority: Major > > Follow up to SPARK-21708, updating the references to test-only with testOnly. > As the older syntax no longer works. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33129) Since the sbt version is now upgraded, old `test-only` needs to be replaced with `testOnly`
[ https://issues.apache.org/jira/browse/SPARK-33129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33129: Assignee: (was: Apache Spark) > Since the sbt version is now upgraded, old `test-only` needs to be replaced > with `testOnly` > --- > > Key: SPARK-33129 > URL: https://issues.apache.org/jira/browse/SPARK-33129 > Project: Spark > Issue Type: Bug > Components: Build, docs >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Priority: Major > > Follow up to SPARK-21708, updating the references to test-only with testOnly. > As the older syntax no longer works. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33129) Since the sbt version is now upgraded, old `test-only` needs to be replaced with `testOnly`
[ https://issues.apache.org/jira/browse/SPARK-33129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33129: Assignee: Apache Spark > Since the sbt version is now upgraded, old `test-only` needs to be replaced > with `testOnly` > --- > > Key: SPARK-33129 > URL: https://issues.apache.org/jira/browse/SPARK-33129 > Project: Spark > Issue Type: Bug > Components: Build, docs >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Assignee: Apache Spark >Priority: Major > > Follow up to SPARK-21708, updating the references to test-only with testOnly. > As the older syntax no longer works. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33129) Since the sbt version is now upgraded, old `test-only` needs to be replaced with `testOnly`
[ https://issues.apache.org/jira/browse/SPARK-33129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-33129: Description: Follow-up to SPARK-21708: update references from `test-only` to `testOnly`, as the older syntax no longer works. > Since the sbt version is now upgraded, old `test-only` needs to be replaced > with `testOnly` > --- > > Key: SPARK-33129 > URL: https://issues.apache.org/jira/browse/SPARK-33129 > Project: Spark > Issue Type: Bug > Components: Build, docs >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Priority: Major > > Follow-up to SPARK-21708: update references from `test-only` to `testOnly`, as the older syntax no longer works. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33129) Since the sbt version is now upgraded, old `test-only` needs to be replaced with `testOnly`
Prashant Sharma created SPARK-33129: --- Summary: Since the sbt version is now upgraded, old `test-only` needs to be replaced with `testOnly` Key: SPARK-33129 URL: https://issues.apache.org/jira/browse/SPARK-33129 Project: Spark Issue Type: Bug Components: Build, docs Affects Versions: 3.1.0 Reporter: Prashant Sharma -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
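For anyone hitting SPARK-33129 locally, the change in practice is just the task name: sbt 1.x removed the hyphenated task aliases that sbt 0.13 still accepted. The invocations below are illustrative; the suite glob is only an example.

{noformat}
# sbt 0.13.x syntax (no longer works after the SPARK-21708 upgrade)
build/sbt "core/test-only *SparkContextSuite"

# sbt 1.x syntax
build/sbt "core/testOnly *SparkContextSuite"
{noformat}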
[jira] [Assigned] (SPARK-32069) Improve error message on reading unexpected directory which is not a table partition
[ https://issues.apache.org/jira/browse/SPARK-32069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32069: Assignee: (was: Apache Spark) > Improve error message on reading unexpected directory which is not a table > partition > > > Key: SPARK-32069 > URL: https://issues.apache.org/jira/browse/SPARK-32069 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Minor > Labels: starter > > To reproduce: > {code:java} > spark-sql> create table test(i long); > spark-sql> insert into test values(1); > {code} > {code:java} > bash $ mkdir ./spark-warehouse/test/data > {code} > There will be such error messge > {code:java} > java.io.IOException: Not a file: > file:/Users/gengliang.wang/projects/spark/spark-warehouse/test/data > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:296) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:296) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:296) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:296) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:296) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2173) > at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) > at org.apache.spark.rdd.RDD.collect(RDD.scala:1029) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412) > at > org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:76) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65) > at > 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.Iterator.foreach$(Iterator.scala:941) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:490) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:282) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) >
[jira] [Assigned] (SPARK-32069) Improve error message on reading unexpected directory which is not a table partition
[ https://issues.apache.org/jira/browse/SPARK-32069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32069: Assignee: Apache Spark > Improve error message on reading unexpected directory which is not a table > partition > > > Key: SPARK-32069 > URL: https://issues.apache.org/jira/browse/SPARK-32069 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Minor > Labels: starter > > To reproduce: > {code:java} > spark-sql> create table test(i long); > spark-sql> insert into test values(1); > {code} > {code:java} > bash $ mkdir ./spark-warehouse/test/data > {code} > There will be such error messge > {code:java} > java.io.IOException: Not a file: > file:/Users/gengliang.wang/projects/spark/spark-warehouse/test/data > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:296) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:296) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:296) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:296) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:296) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2173) > at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) > at org.apache.spark.rdd.RDD.collect(RDD.scala:1029) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412) > at > org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:76) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65) > at > 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.Iterator.foreach$(Iterator.scala:941) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:490) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:282) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQL
[jira] [Commented] (SPARK-32069) Improve error message on reading unexpected directory which is not a table partition
[ https://issues.apache.org/jira/browse/SPARK-32069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213012#comment-17213012 ] Apache Spark commented on SPARK-32069: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/30027 > Improve error message on reading unexpected directory which is not a table > partition > > > Key: SPARK-32069 > URL: https://issues.apache.org/jira/browse/SPARK-32069 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Minor > Labels: starter > > To reproduce: > {code:java} > spark-sql> create table test(i long); > spark-sql> insert into test values(1); > {code} > {code:java} > bash $ mkdir ./spark-warehouse/test/data > {code} > There will be such error messge > {code:java} > java.io.IOException: Not a file: > file:/Users/gengliang.wang/projects/spark/spark-warehouse/test/data > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:296) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:296) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:296) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:296) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:296) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2173) > at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) > at org.apache.spark.rdd.RDD.collect(RDD.scala:1029) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412) > at > org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:76) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) > at > 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.Iterator.foreach$(Iterator.scala:941) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:490) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.s
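One way to picture the improvement requested in SPARK-32069: catch the low-level Hadoop IOException and rethrow it with a message that names the offending sub-directory and a suggested remedy. This is a rough sketch of the desired message only, under the assumption that wrapping at query time is acceptable; the actual fix may live elsewhere in the read path, and the helper name is hypothetical.

{code:scala}
// Sketch of a friendlier error for the "Not a file" case described above.
import java.io.IOException

def withFriendlyDirectoryError[T](tableName: String)(body: => T): T =
  try {
    body
  } catch {
    case e: IOException if e.getMessage != null && e.getMessage.startsWith("Not a file") =>
      throw new IOException(
        s"Table '$tableName' contains a sub-directory that is not a partition directory: " +
        s"${e.getMessage.stripPrefix("Not a file: ")}. Remove it, or enable recursive file " +
        "listing if sub-directories are expected.", e)
  }

// Usage sketch:
// withFriendlyDirectoryError("test") { spark.table("test").collect() }
{code}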
[jira] [Commented] (SPARK-27733) Upgrade to Avro 1.10.0
[ https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213010#comment-17213010 ] Ismaël Mejía commented on SPARK-27733: -- [~csun] This is excellent news! We will get this done, and knowing that we can have someone from the Hive side to help is definitely appreciated. Let's continue the discussion in HIVE-21737 until we get it ready there. > Upgrade to Avro 1.10.0 > -- > > Key: SPARK-27733 > URL: https://issues.apache.org/jira/browse/SPARK-27733 > Project: Spark > Issue Type: Improvement > Components: Build, SQL >Affects Versions: 3.1.0 >Reporter: Ismaël Mejía >Priority: Minor > > Avro 1.9.2 was released with many nice features: a reduced size (1 MB less), removed dependencies (no paranamer, no shaded Guava), and security updates, so it is probably a worthwhile upgrade. > Avro 1.10.0 has since been released, and this upgrade is still not done. > As of 2020/08 there is still a blocker: Hive-related transitive dependencies bring in older versions of Avro, so this is effectively blocked until HIVE-21737 is solved. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32978) Incorrect number of dynamic part metric
[ https://issues.apache.org/jira/browse/SPARK-32978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213002#comment-17213002 ] Apache Spark commented on SPARK-32978: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/30026 > Incorrect number of dynamic part metric > --- > > Key: SPARK-32978 > URL: https://issues.apache.org/jira/browse/SPARK-32978 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > Attachments: screenshot-1.png > > > How to reproduce this issue: > {code:sql} > create table dynamic_partition(i bigint, part bigint) using parquet > partitioned by (part); > insert overwrite table dynamic_partition partition(part) select id, id % 50 > as part from range(1); > {code} > The number of dynamic part should be 50, but it is 800 on web UI. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32978) Incorrect number of dynamic part metric
[ https://issues.apache.org/jira/browse/SPARK-32978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32978: Assignee: Apache Spark > Incorrect number of dynamic part metric > --- > > Key: SPARK-32978 > URL: https://issues.apache.org/jira/browse/SPARK-32978 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > Attachments: screenshot-1.png > > > How to reproduce this issue: > {code:sql} > create table dynamic_partition(i bigint, part bigint) using parquet > partitioned by (part); > insert overwrite table dynamic_partition partition(part) select id, id % 50 > as part from range(1); > {code} > The number of dynamic part should be 50, but it is 800 on web UI. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32978) Incorrect number of dynamic part metric
[ https://issues.apache.org/jira/browse/SPARK-32978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213001#comment-17213001 ] Apache Spark commented on SPARK-32978: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/30026 > Incorrect number of dynamic part metric > --- > > Key: SPARK-32978 > URL: https://issues.apache.org/jira/browse/SPARK-32978 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > Attachments: screenshot-1.png > > > How to reproduce this issue: > {code:sql} > create table dynamic_partition(i bigint, part bigint) using parquet > partitioned by (part); > insert overwrite table dynamic_partition partition(part) select id, id % 50 > as part from range(1); > {code} > The number of dynamic part should be 50, but it is 800 on web UI. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32978) Incorrect number of dynamic part metric
[ https://issues.apache.org/jira/browse/SPARK-32978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32978: Assignee: (was: Apache Spark) > Incorrect number of dynamic part metric > --- > > Key: SPARK-32978 > URL: https://issues.apache.org/jira/browse/SPARK-32978 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > Attachments: screenshot-1.png > > > How to reproduce this issue: > {code:sql} > create table dynamic_partition(i bigint, part bigint) using parquet > partitioned by (part); > insert overwrite table dynamic_partition partition(part) select id, id % 50 > as part from range(1); > {code} > The number of dynamic part should be 50, but it is 800 on web UI. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
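A quick way to cross-check the metric from the SPARK-32978 reproduction above: compute the number of distinct dynamic partitions actually written and compare it with the "number of dynamic part" shown on the web UI. This assumes the dynamic_partition table from the description exists and was populated with enough rows to yield 50 distinct values of part; the range(1) in the snippet looks truncated, so something like range(10000) is assumed here.

{code:scala}
// Expected value of the "number of dynamic part" metric for the repro table.
spark.sql("select count(distinct part) as expected_dynamic_parts from dynamic_partition").show()
// Should print 50, while the affected UI shows 800.
{code}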
[jira] [Created] (SPARK-33128) mismatched input since Spark 3.0
Yuming Wang created SPARK-33128: --- Summary: mismatched input since Spark 3.0 Key: SPARK-33128 URL: https://issues.apache.org/jira/browse/SPARK-33128 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.1, 3.0.0, 3.1.0 Reporter: Yuming Wang Spark 2.4: {noformat} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.4.4 /_/ Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_221) Type in expressions to have them evaluated. Type :help for more information. scala> spark.sql("SELECT 1 UNION SELECT 1 UNION ALL SELECT 1").show +---+ | 1| +---+ | 1| | 1| +---+ {noformat} Spark 3.x: {noformat} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.1.0-SNAPSHOT /_/ Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 14.0.1) Type in expressions to have them evaluated. Type :help for more information. scala> spark.sql("SELECT 1 UNION SELECT 1 UNION ALL SELECT 1").show org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'SELECT' expecting {, ';'}(line 1, pos 15) == SQL == SELECT 1 UNION SELECT 1 UNION ALL SELECT 1 ---^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:51) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:81) at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:610) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:610) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:769) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:607) ... 47 elided {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
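Until the parser regression in SPARK-33128 is resolved, one rewrite worth trying is to make the grouping of the set operations explicit, for example with a CTE. The sketch below assumes the intended result is the 2.4 output shown above (two rows of 1, i.e. left-to-right grouping); it is a workaround to experiment with, not a fix for the parser itself.

{code:scala}
// Workaround sketch: group the set operations explicitly via a CTE.
spark.sql("""
  WITH u AS (SELECT 1 UNION SELECT 1)
  SELECT * FROM u
  UNION ALL
  SELECT 1
""").show()
{code}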
[jira] [Assigned] (SPARK-33095) Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MySQL dialect)
[ https://issues.apache.org/jira/browse/SPARK-33095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33095: Assignee: Apache Spark > Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and > nullability of columns (MySQL dialect) > - > > Key: SPARK-33095 > URL: https://issues.apache.org/jira/browse/SPARK-33095 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Assignee: Apache Spark >Priority: Major > > Override the default SQL strings for: > ALTER TABLE UPDATE COLUMN TYPE > ALTER TABLE UPDATE COLUMN NULLABILITY > in the following MySQL JDBC dialect according to official documentation. > Write MySQL integration tests for JDBC. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33095) Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MySQL dialect)
[ https://issues.apache.org/jira/browse/SPARK-33095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33095: Assignee: (was: Apache Spark) > Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and > nullability of columns (MySQL dialect) > - > > Key: SPARK-33095 > URL: https://issues.apache.org/jira/browse/SPARK-33095 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Priority: Major > > Override the default SQL strings for: > ALTER TABLE UPDATE COLUMN TYPE > ALTER TABLE UPDATE COLUMN NULLABILITY > in the following MySQL JDBC dialect according to official documentation. > Write MySQL integration tests for JDBC. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33095) Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MySQL dialect)
[ https://issues.apache.org/jira/browse/SPARK-33095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212989#comment-17212989 ] Apache Spark commented on SPARK-33095: -- User 'ScrapCodes' has created a pull request for this issue: https://github.com/apache/spark/pull/30025 > Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and > nullability of columns (MySQL dialect) > - > > Key: SPARK-33095 > URL: https://issues.apache.org/jira/browse/SPARK-33095 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Priority: Major > > Override the default SQL strings for: > ALTER TABLE UPDATE COLUMN TYPE > ALTER TABLE UPDATE COLUMN NULLABILITY > in the following MySQL JDBC dialect according to official documentation. > Write MySQL integration tests for JDBC. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33095) Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MySQL dialect)
[ https://issues.apache.org/jira/browse/SPARK-33095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-33095: Description: Override the default SQL strings for: ALTER TABLE UPDATE COLUMN TYPE ALTER TABLE UPDATE COLUMN NULLABILITY in the following MySQL JDBC dialect according to official documentation. Write MySQL integration tests for JDBC. was: Override the default SQL strings for: ALTER TABLE ADD COLUMN ALTER TABLE UPDATE COLUMN TYPE ALTER TABLE UPDATE COLUMN NULLABILITY in the following MySQL JDBC dialect according to official documentation. Write MySQL integration tests for JDBC. > Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and > nullability of columns (MySQL dialect) > - > > Key: SPARK-33095 > URL: https://issues.apache.org/jira/browse/SPARK-33095 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Priority: Major > > Override the default SQL strings for: > ALTER TABLE UPDATE COLUMN TYPE > ALTER TABLE UPDATE COLUMN NULLABILITY > in the following MySQL JDBC dialect according to official documentation. > Write MySQL integration tests for JDBC. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
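Analogous to the MsSqlServer sketch earlier, these are roughly the MySQL statements the dialect would need for the two overrides listed in SPARK-33095. MySQL restates the full column definition in MODIFY COLUMN, so type and nullability changes share the same statement shape; the helper names are hypothetical, not the JdbcDialect API.

{code:scala}
// MySQL DDL sketch for the two overrides in the description.
def mysqlUpdateColumnType(table: String, column: String, newType: String): String =
  s"ALTER TABLE $table MODIFY COLUMN $column $newType"

def mysqlUpdateColumnNullability(table: String, column: String,
                                 dataType: String, nullable: Boolean): String =
  s"ALTER TABLE $table MODIFY COLUMN $column $dataType " +
    (if (nullable) "NULL" else "NOT NULL")

// e.g. ALTER TABLE people MODIFY COLUMN age BIGINT NOT NULL
{code}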
[jira] [Assigned] (SPARK-32229) Application entry parsing fails because DriverWrapper registered instead of the normal driver
[ https://issues.apache.org/jira/browse/SPARK-32229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32229: Assignee: (was: Apache Spark) > Application entry parsing fails because DriverWrapper registered instead of > the normal driver > - > > Key: SPARK-32229 > URL: https://issues.apache.org/jira/browse/SPARK-32229 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > > In some cases DriverWrapper registered by DriverRegistry which causes > exception in PostgresConnectionProvider: > https://github.com/apache/spark/blob/371b35d2e0ab08ebd853147c6673de3adfad0553/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/DriverRegistry.scala#L53 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32229) Application entry parsing fails because DriverWrapper registered instead of the normal driver
[ https://issues.apache.org/jira/browse/SPARK-32229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212968#comment-17212968 ] Apache Spark commented on SPARK-32229: -- User 'gaborgsomogyi' has created a pull request for this issue: https://github.com/apache/spark/pull/30024 > Application entry parsing fails because DriverWrapper registered instead of > the normal driver > - > > Key: SPARK-32229 > URL: https://issues.apache.org/jira/browse/SPARK-32229 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > > In some cases DriverWrapper registered by DriverRegistry which causes > exception in PostgresConnectionProvider: > https://github.com/apache/spark/blob/371b35d2e0ab08ebd853147c6673de3adfad0553/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/DriverRegistry.scala#L53 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32229) Application entry parsing fails because DriverWrapper registered instead of the normal driver
[ https://issues.apache.org/jira/browse/SPARK-32229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32229: Assignee: Apache Spark > Application entry parsing fails because DriverWrapper registered instead of > the normal driver > - > > Key: SPARK-32229 > URL: https://issues.apache.org/jira/browse/SPARK-32229 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Assignee: Apache Spark >Priority: Major > > In some cases DriverWrapper registered by DriverRegistry which causes > exception in PostgresConnectionProvider: > https://github.com/apache/spark/blob/371b35d2e0ab08ebd853147c6673de3adfad0553/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/DriverRegistry.scala#L53 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
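To make the SPARK-32229 situation concrete: when a JDBC driver is loaded through Spark's DriverRegistry, the class registered with java.sql.DriverManager can be Spark's internal DriverWrapper rather than, say, org.postgresql.Driver, so any code that matches on the driver's class name has to look through the wrapper. The sketch below uses reflection so it does not depend on Spark-internal visibility; the field name "wrapped" is an assumption about Spark's internal DriverWrapper, and this illustrates the problem rather than the fix in the linked pull request.

{code:scala}
// Sketch: resolve the effective driver class name even when DriverWrapper is registered.
import java.sql.{Driver, DriverManager}
import scala.collection.JavaConverters._

def effectiveDriverClassName(d: Driver): String =
  if (d.getClass.getName.endsWith("DriverWrapper")) {
    // DriverWrapper keeps the real driver in a field (assumed to be named `wrapped`);
    // read it reflectively for the sake of this sketch.
    val field = d.getClass.getDeclaredField("wrapped")
    field.setAccessible(true)
    field.get(d).asInstanceOf[Driver].getClass.getName
  } else {
    d.getClass.getName
  }

DriverManager.getDrivers.asScala.map(effectiveDriverClassName).foreach(println)
{code}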