[jira] [Updated] (SPARK-39198) Cannot refer to nested CTE within a nested CTE in a subquery.

2022-08-23 Thread Jarno Rajala (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jarno Rajala updated SPARK-39198:
-
Affects Version/s: 3.3.0

> Cannot refer to nested CTE within a nested CTE in a subquery.
> -
>
> Key: SPARK-39198
> URL: https://issues.apache.org/jira/browse/SPARK-39198
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0
> Environment: Tested on
>  * Databricks runtime 10.4
>  * Spark 3.2.1 from [https://spark.apache.org/downloads.html]
>  * GitHub apache/spark 'master' commit 17b85ff9
>Reporter: Jarno Rajala
>Priority: Major
>
> The following query fails with "Table or view not found: cte1":
> {code:java}
> set spark.sql.legacy.ctePrecedencePolicy=CORRECTED;
> with
> cte1 as (select 1)
> select * from (
>   with
>     cte2 as (select * from cte1)
>     select * from cte2
> ); {code}
> On Spark 3.1.1 it returns 1 as expected.
> This is related to SPARK-38404, but different, since the query fails with 
> Spark built from 'master' (commit 17b85ff9). The [PR 
> #36146|https://github.com/apache/spark/pull/36146] therefore does not fix 
> this issue.
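
A possible workaround sketch (assuming the goal is simply to reuse cte1): hoist the
inner CTE into the outer WITH clause so that no nested WITH is needed, which should
avoid the failing lookup.

{code:scala}
// Hedged workaround sketch: define both CTEs at the outer level instead of nesting WITH.
spark.sql("""
  WITH
    cte1 AS (SELECT 1),
    cte2 AS (SELECT * FROM cte1)
  SELECT * FROM cte2
""").show()
{code}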



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40186) mergedShuffleCleaner should have been shutdown before db closed

2022-08-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40186:


Assignee: Apache Spark

> mergedShuffleCleaner should have been shutdown before db closed
> ---
>
> Key: SPARK-40186
> URL: https://issues.apache.org/jira/browse/SPARK-40186
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>
> We should ensure `RemoteBlockPushResolver#mergedShuffleCleaner` has been shut 
> down before `RemoteBlockPushResolver#db` is closed; otherwise, 
> `RemoteBlockPushResolver#applicationRemoved` may perform delete operations on 
> a closed db.
>  
> https://github.com/apache/spark/pull/37610#discussion_r951185256
>  
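
A minimal sketch of the intended ordering, assuming the cleaner is a plain
ExecutorService (illustrative names only, not the actual RemoteBlockPushResolver code):

{code:scala}
import java.util.concurrent.{Executors, TimeUnit}

// Hypothetical stand-ins for RemoteBlockPushResolver#mergedShuffleCleaner and #db.
val mergedShuffleCleaner = Executors.newSingleThreadExecutor()
def closeDb(): Unit = ()  // stand-in for db.close()

def close(): Unit = {
  // 1. Shut down the cleaner first so no pending applicationRemoved() task
  //    can issue deletes against an already-closed DB.
  mergedShuffleCleaner.shutdown()
  if (!mergedShuffleCleaner.awaitTermination(10, TimeUnit.SECONDS)) {
    mergedShuffleCleaner.shutdownNow()
  }
  // 2. Only then close the DB.
  closeDb()
}
{code}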



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40186) mergedShuffleCleaner should have been shutdown before db closed

2022-08-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583435#comment-17583435
 ] 

Apache Spark commented on SPARK-40186:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37624

> mergedShuffleCleaner should have been shutdown before db closed
> ---
>
> Key: SPARK-40186
> URL: https://issues.apache.org/jira/browse/SPARK-40186
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> We should ensure `RemoteBlockPushResolver#mergedShuffleCleaner` has been shut 
> down before `RemoteBlockPushResolver#db` is closed; otherwise, 
> `RemoteBlockPushResolver#applicationRemoved` may perform delete operations on 
> a closed db.
>  
> https://github.com/apache/spark/pull/37610#discussion_r951185256
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40186) mergedShuffleCleaner should have been shutdown before db closed

2022-08-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40186:


Assignee: (was: Apache Spark)

> mergedShuffleCleaner should have been shutdown before db closed
> ---
>
> Key: SPARK-40186
> URL: https://issues.apache.org/jira/browse/SPARK-40186
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> We should ensure `RemoteBlockPushResolver#mergedShuffleCleaner` has been shut 
> down before `RemoteBlockPushResolver#db` is closed; otherwise, 
> `RemoteBlockPushResolver#applicationRemoved` may perform delete operations on 
> a closed db.
>  
> https://github.com/apache/spark/pull/37610#discussion_r951185256
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40188) Spark Direct Streaming: Read messages of a certain bytes or count in batches from Kafka is not working.

2022-08-23 Thread Madhav Madhu (Jira)
Madhav Madhu created SPARK-40188:


 Summary: Spark Direct Streaming: Read messages of a certain bytes 
or count in batches from Kafka is not working.
 Key: SPARK-40188
 URL: https://issues.apache.org/jira/browse/SPARK-40188
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 3.2.1
 Environment: Spark Version: 3.2.1

Kafka version: 3.2.0

 
Reporter: Madhav Madhu


The Spark Kafka consumer is unable to limit the messages it reads per batch to a 
certain byte size or record count. I have tried a few approaches as described in 
the Kafka docs, but with no success. Here is a link to Stack Overflow where I 
asked the same question with no response; I think this is a possible bug. The 
same configuration works fine when the consumer is plain Java code.
https://stackoverflow.com/questions/73398533/spark-streaming-context-kafka-consumer-read-messages-of-a-certain-byte-size-in

Here is the consumer code that fetches data from Kafka:


{code:scala}
val streamingContext = new StreamingContext(sparkSession.sparkContext, 
Seconds(10))
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "test",
"fetch.max.bytes" -> "65536",
"max.partition.fetch.bytes" -> "8192",
"max.poll.records" -> "100",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean),
"sasl.jaas.config"-> "org.apache.kafka.common.security.plain.PlainLoginModule 
required username=\"admin\" password=\"admin\";",
"sasl.mechanism" -> "PLAIN",
"security.protocol" -> "SASL_PLAINTEXT",
  )

val topics = Array("test.topic") 
val stream = KafkaUtils.createDirectStream[String, String](
    streamingContext,
    PreferConsistent,
    Subscribe[String, String](topics, kafkaParams)
)

stream.foreachRDD {
rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // Print each offset range; the trailing "()" in the output is the Unit result of foreach.
  println(offsetRanges.foreach(a => println(a.topic + ":" + a.partition + ":" +
    a.fromOffset + ":" + a.untilOffset + ":" + a.count())))

  val df = rdd.map(a => a.value().split(",")).toDF()
  val selectCols = columns.indices.map(i => $"value"(i))
  var newDF = df.select(selectCols: _*).toDF(columns: _*)

  // Some business operations here and then write to back to kafka.
  
  newDF.write
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "topic.ouput")
    .option("kafka.sasl.jaas.config", 
"org.apache.kafka.common.security.plain.PlainLoginModule required 
username=\"admin\" password=\"admin\";")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.security.protocol", "SASL_PLAINTEXT")
    .save()

  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)

  sparkSession.catalog.clearCache()
}

streamingContext.start()
streamingContext.awaitTermination()

{code}

Output:

{code:java}

test.topic:6:1345075:4163058:2817983
test.topic:0:1339456:4144190:2804734
test.topic:3:1354266:4189336:2835070
test.topic:7:1353542:4186148:2832606
test.topic:5:1355140:4189071:2833931
test.topic:2:1351162:4173375:2822213
test.topic:1:1352801:4184073:2831272
test.topic:4:1348558:4166749:2818191
()
test.topic:6:4163058:4163058:0
test.topic:0:4144190:4144190:0
test.topic:3:4189336:4189336:0
test.topic:7:4186148:4186148:0
test.topic:5:4189071:4189071:0
test.topic:2:4173375:4173375:0
test.topic:1:4184073:4184073:0
test.topic:4:4166749:4166749:0
{code}
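
For reference, with the direct stream the number of records per batch is normally
bounded via spark.streaming.kafka.maxRatePerPartition (records per second per
partition) rather than the Kafka consumer fetch.* / max.poll.records settings; a
minimal sketch, assuming the standard spark-streaming-kafka-0-10 integration:

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hedged sketch: ~100 records/sec/partition, so a 10s batch is capped at
// roughly 1000 records per partition, regardless of fetch.max.bytes.
val conf = new SparkConf()
  .setAppName("rate-limited-direct-stream")
  .setMaster("local[*]")
  .set("spark.streaming.kafka.maxRatePerPartition", "100")
  // Optionally let the backpressure controller adapt the rate after the first batch.
  .set("spark.streaming.backpressure.enabled", "true")

val streamingContext = new StreamingContext(conf, Seconds(10))
// KafkaUtils.createDirectStream(...) as in the snippet above; the offset ranges
// computed for each batch will then respect the configured maximum rate.
{code}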


 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40188) Spark Direct Streaming: Read messages of a certain bytes or count in batches from Kafka is not working.

2022-08-23 Thread Madhav Madhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Madhav Madhu updated SPARK-40188:
-
Description: 
The Spark Kafka consumer is unable to limit the messages it reads per batch to a 
certain byte size or record count. I have tried a few approaches as described in 
the Kafka docs, but with no success. Here is a link to Stack Overflow where I 
asked the same question with no response; I think this is a possible bug. The 
same configuration works fine when the consumer is plain Java code.
https://stackoverflow.com/questions/73398533/spark-streaming-context-kafka-consumer-read-messages-of-a-certain-byte-size-in

Here is the consumer code that fetches data from Kafka:


{code:scala}
val streamingContext = new StreamingContext(sparkSession.sparkContext, 
Seconds(10))
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "test",
"fetch.max.bytes" -> "65536",
"max.partition.fetch.bytes" -> "8192",
"max.poll.records" -> "100",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean),
"sasl.jaas.config"-> "org.apache.kafka.common.security.plain.PlainLoginModule 
required username=\"admin\" password=\"admin\";",
"sasl.mechanism" -> "PLAIN",
"security.protocol" -> "SASL_PLAINTEXT",
  )

val topics = Array("test.topic") 
val stream = KafkaUtils.createDirectStream[String, String](
    streamingContext,
    PreferConsistent,
    Subscribe[String, String](topics, kafkaParams)
)

stream.foreachRDD {
rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // Print each offset range; the trailing "()" in the output is the Unit result of foreach.
  println(offsetRanges.foreach(a => println(a.topic + ":" + a.partition + ":" +
    a.fromOffset + ":" + a.untilOffset + ":" + a.count())))

  val df = rdd.map(a => a.value().split(",")).toDF()
  val selectCols = columns.indices.map(i => $"value"(i))
  var newDF = df.select(selectCols: _*).toDF(columns: _*)

  // Some business operations here and then write to back to kafka.
  
  newDF.write
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "topic.ouput")
    .option("kafka.sasl.jaas.config", 
"org.apache.kafka.common.security.plain.PlainLoginModule required 
username=\"admin\" password=\"admin\";")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.security.protocol", "SASL_PLAINTEXT")
    .save()

  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)

  sparkSession.catalog.clearCache()
}

streamingContext.start()
streamingContext.awaitTermination()

{code}

Output:

{code:java}

test.topic:6:1345075:4163058:2817983
test.topic:0:1339456:4144190:2804734
test.topic:3:1354266:4189336:2835070
test.topic:7:1353542:4186148:2832606
test.topic:5:1355140:4189071:2833931
test.topic:2:1351162:4173375:2822213
test.topic:1:1352801:4184073:2831272
test.topic:4:1348558:4166749:2818191
()
test.topic:6:4163058:4163058:0
test.topic:0:4144190:4144190:0
test.topic:3:4189336:4189336:0
test.topic:7:4186148:4186148:0
test.topic:5:4189071:4189071:0
test.topic:2:4173375:4173375:0
test.topic:1:4184073:4184073:0
test.topic:4:4166749:4166749:0
{code}

I tried the following options:

Option 1:
Topic partitions: 8
Streaming context batch interval: 1 sec
"fetch.max.bytes" -> "65536"  // 64 KB
"max.partition.fetch.bytes" -> "8192"  // 8 KB
"max.poll.records" -> "100"

DataFrame count read from Kafka in the very first batch: 120

Option 2:
Topic partitions: 1
Streaming context batch interval: 1 sec
"fetch.max.bytes" -> "65536"
"max.partition.fetch.bytes" -> "8192"
"max.poll.records" -> "100"

Kafka lag: 126360469
DataFrame count read from Kafka in the very first batch: 126360469
 

  was:
The Spark Kafka consumer is unable to limit the messages it reads per batch to a 
certain byte size or record count. I have tried a few approaches as described in 
the Kafka docs, but with no success. Here is a link to Stack Overflow where I 
asked the same question with no response; I think this is a possible bug. The 
same configuration works fine when the consumer is plain Java code.
https://stackoverflow.com/questions/73398533/spark-streaming-context-kafka-consumer-read-messages-of-a-certain-byte-size-in

Here is the consumer code which fetches data from Kafka,


{code:scala}
val streamingContext = new StreamingContext(sparkSession.sparkContext, 
Seconds(10))
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "test",
"fetch.max.bytes" -> "65536",
"max.partition.fetch.bytes" -> "8192",
"max.poll.records" -> "100",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean),
"sasl.jaas.config"-> "org.apache.kafka.common.security.plain.PlainLoginModule 
required username=\"admin\" password=\"admin\";",
"sasl.mechanism" -> "PLAIN",
"security.protocol" -> "SASL_PLAI

[jira] [Assigned] (SPARK-40173) Make pyspark.taskcontext examples self-contained

2022-08-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40173:


Assignee: Hyukjin Kwon

> Make pyspark.taskcontext examples self-contained
> 
>
> Key: SPARK-40173
> URL: https://issues.apache.org/jira/browse/SPARK-40173
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark, Spark Core
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40173) Make pyspark.taskcontext examples self-contained

2022-08-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40173.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37623
[https://github.com/apache/spark/pull/37623]

> Make pyspark.taskcontext examples self-contained
> 
>
> Key: SPARK-40173
> URL: https://issues.apache.org/jira/browse/SPARK-40173
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark, Spark Core
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40177) Simplify join condition of form (a==b) || (a==null&&b==null) to a<=>b

2022-08-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40177:


Assignee: (was: Apache Spark)

> Simplify join condition of form (a==b) || (a==null&&b==null) to a<=>b
> -
>
> Key: SPARK-40177
> URL: https://issues.apache.org/jira/browse/SPARK-40177
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Ayushi Agarwal
>Priority: Major
> Fix For: 3.3.1
>
>
> If the join condition has the form key1==key2 || (key1==null && key2==null), 
> the join is executed as a Broadcast Nested Loop Join because the condition is 
> not recognized as an equi-join. BNLJ takes more time than a sort-merge or 
> broadcast hash join. The condition can be rewritten as key1<=>key2 so that the 
> join executes as a broadcast hash or sort-merge join.
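
For illustration, a minimal sketch of the rewrite in the DataFrame API
(hypothetical df1/df2 inputs; <=> is Spark's null-safe equality operator):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("null-safe-join").getOrCreate()
import spark.implicits._

// Hypothetical inputs, only to show the two plan shapes.
val df1 = Seq(Some(1), Some(2), None).toDF("key1")
val df2 = Seq(Some(2), None).toDF("key2")

// Original form: not recognized as an equi-join, so it runs as BroadcastNestedLoopJoin.
val slow = df1.join(df2,
  df1("key1") === df2("key2") || (df1("key1").isNull && df2("key2").isNull))

// Rewritten form: <=> is extracted as an equi-join key, enabling sort-merge/broadcast hash join.
val fast = df1.join(df2, df1("key1") <=> df2("key2"))

slow.explain()
fast.explain()
{code}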



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40177) Simplify join condition of form (a==b) || (a==null&&b==null) to a<=>b

2022-08-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40177:


Assignee: Apache Spark

> Simplify join condition of form (a==b) || (a==null&&b==null) to a<=>b
> -
>
> Key: SPARK-40177
> URL: https://issues.apache.org/jira/browse/SPARK-40177
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Ayushi Agarwal
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.3.1
>
>
> If the join condition has the form key1==key2 || (key1==null && key2==null), 
> the join is executed as a Broadcast Nested Loop Join because the condition is 
> not recognized as an equi-join. BNLJ takes more time than a sort-merge or 
> broadcast hash join. The condition can be rewritten as key1<=>key2 so that the 
> join executes as a broadcast hash or sort-merge join.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40177) Simplify join condition of form (a==b) || (a==null&&b==null) to a<=>b

2022-08-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583450#comment-17583450
 ] 

Apache Spark commented on SPARK-40177:
--

User 'ayushi-agarwal' has created a pull request for this issue:
https://github.com/apache/spark/pull/37625

> Simplify join condition of form (a==b) || (a==null&&b==null) to a<=>b
> -
>
> Key: SPARK-40177
> URL: https://issues.apache.org/jira/browse/SPARK-40177
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Ayushi Agarwal
>Priority: Major
> Fix For: 3.3.1
>
>
> If the join condition has the form key1==key2 || (key1==null && key2==null), 
> the join is executed as a Broadcast Nested Loop Join because the condition is 
> not recognized as an equi-join. BNLJ takes more time than a sort-merge or 
> broadcast hash join. The condition can be rewritten as key1<=>key2 so that the 
> join executes as a broadcast hash or sort-merge join.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40177) Simplify join condition of form (a==b) || (a==null&&b==null) to a<=>b

2022-08-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583454#comment-17583454
 ] 

Apache Spark commented on SPARK-40177:
--

User 'ayushi-agarwal' has created a pull request for this issue:
https://github.com/apache/spark/pull/37625

> Simplify join condition of form (a==b) || (a==null&&b==null) to a<=>b
> -
>
> Key: SPARK-40177
> URL: https://issues.apache.org/jira/browse/SPARK-40177
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Ayushi Agarwal
>Priority: Major
> Fix For: 3.3.1
>
>
> If the join condition has the form key1==key2 || (key1==null && key2==null), 
> the join is executed as a Broadcast Nested Loop Join because the condition is 
> not recognized as an equi-join. BNLJ takes more time than a sort-merge or 
> broadcast hash join. The condition can be rewritten as key1<=>key2 so that the 
> join executes as a broadcast hash or sort-merge join.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40189) Support json_array_get/json_array_length function

2022-08-23 Thread melin (Jira)
melin created SPARK-40189:
-

 Summary: Support json_array_get/json_array_length function
 Key: SPARK-40189
 URL: https://issues.apache.org/jira/browse/SPARK-40189
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: melin


Presto provides these two functions, which are frequently used:

https://prestodb.io/docs/current/functions/json.html#json-functions
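
Until such built-ins exist, a rough workaround sketch with existing Spark functions
(illustrative only; assumes the JSON value is an array):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").appName("json-array-workaround").getOrCreate()
import spark.implicits._

val df = Seq("""["a", "b", "c"]""").toDF("js")

df.select(
  // ~ json_array_length(js)
  size(from_json($"js", ArrayType(StringType))).as("len"),
  // ~ json_array_get(js, 1)  (0-based index, like Presto)
  get_json_object($"js", "$[1]").as("second_element")
).show()
{code}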



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40190) Support json_array_get and json_array_length function

2022-08-23 Thread melin (Jira)
melin created SPARK-40190:
-

 Summary: Support json_array_get and json_array_length function
 Key: SPARK-40190
 URL: https://issues.apache.org/jira/browse/SPARK-40190
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: melin


Presto provides these two functions, which are often used:

https://prestodb.io/docs/current/functions/json.html#json-functions



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40190) Support json_array_get and json_array_length function

2022-08-23 Thread melin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

melin resolved SPARK-40190.
---
Resolution: Duplicate

> Support json_array_get and json_array_length function
> -
>
> Key: SPARK-40190
> URL: https://issues.apache.org/jira/browse/SPARK-40190
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: melin
>Priority: Major
>
> Presto provides these two functions, which are often used:
> https://prestodb.io/docs/current/functions/json.html#json-functions



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40189) Support json_array_get/json_array_length function

2022-08-23 Thread melin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583473#comment-17583473
 ] 

melin commented on SPARK-40189:
---

[~maxgekk] 

> Support json_array_get/json_array_length function
> -
>
> Key: SPARK-40189
> URL: https://issues.apache.org/jira/browse/SPARK-40189
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: melin
>Priority: Major
>
> Presto provides these two functions, which are frequently used:
> https://prestodb.io/docs/current/functions/json.html#json-functions



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40152) Codegen compilation error when using split_part

2022-08-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583498#comment-17583498
 ] 

Apache Spark commented on SPARK-40152:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/37626

> Codegen compilation error when using split_part
> ---
>
> Key: SPARK-40152
> URL: https://issues.apache.org/jira/browse/SPARK-40152
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
>
> The following query throws an error:
> {noformat}
> create or replace temp view v1 as
> select * from values
> ('11.12.13', '.', 3)
> as v1(col1, col2, col3);
> cache table v1;
> SELECT split_part(col1, col2, col3)
> from v1;
> {noformat}
> The error is:
> {noformat}
> 22/08/19 14:25:14 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 42, Column 1: Expression "project_isNull_0 = false" is not a type
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 42, Column 1: Expression "project_isNull_0 = false" is not a type
>   at 
> org.codehaus.janino.Java$Atom.toTypeOrCompileException(Java.java:3934)
>   at org.codehaus.janino.Parser.parseBlockStatement(Parser.java:1887)
>   at org.codehaus.janino.Parser.parseBlockStatements(Parser.java:1811)
>   at org.codehaus.janino.Parser.parseBlock(Parser.java:1792)
>   at 
> {noformat}
> In the end, {{split_part}} does successfully execute, although in interpreted 
> mode.
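
If the compile-error log noise is a concern before the fix lands, a possible
mitigation sketch is to turn off whole-stage codegen for the session (the query
already falls back to interpreted execution, so results are unaffected):

{code:scala}
// Hedged mitigation sketch, not the actual fix from PR 37626.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
spark.sql("SELECT split_part(col1, col2, col3) FROM v1").show()
{code}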



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40152) Codegen compilation error when using split_part

2022-08-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583500#comment-17583500
 ] 

Apache Spark commented on SPARK-40152:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/37626

> Codegen compilation error when using split_part
> ---
>
> Key: SPARK-40152
> URL: https://issues.apache.org/jira/browse/SPARK-40152
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
>
> The following query throws an error:
> {noformat}
> create or replace temp view v1 as
> select * from values
> ('11.12.13', '.', 3)
> as v1(col1, col2, col3);
> cache table v1;
> SELECT split_part(col1, col2, col3)
> from v1;
> {noformat}
> The error is:
> {noformat}
> 22/08/19 14:25:14 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 42, Column 1: Expression "project_isNull_0 = false" is not a type
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 42, Column 1: Expression "project_isNull_0 = false" is not a type
>   at 
> org.codehaus.janino.Java$Atom.toTypeOrCompileException(Java.java:3934)
>   at org.codehaus.janino.Parser.parseBlockStatement(Parser.java:1887)
>   at org.codehaus.janino.Parser.parseBlockStatements(Parser.java:1811)
>   at org.codehaus.janino.Parser.parseBlock(Parser.java:1792)
>   at 
> {noformat}
> In the end, {{split_part}} does successfully execute, although in interpreted 
> mode.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40191) Make pyspark.resource examples self-contained

2022-08-23 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-40191:


 Summary: Make pyspark.resource examples self-contained
 Key: SPARK-40191
 URL: https://issues.apache.org/jira/browse/SPARK-40191
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark, Spark Core
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40191) Make pyspark.resource examples self-contained

2022-08-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40191:


Assignee: (was: Apache Spark)

> Make pyspark.resource examples self-contained
> -
>
> Key: SPARK-40191
> URL: https://issues.apache.org/jira/browse/SPARK-40191
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark, Spark Core
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40191) Make pyspark.resource examples self-contained

2022-08-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583510#comment-17583510
 ] 

Apache Spark commented on SPARK-40191:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/37627

> Make pyspark.resource examples self-contained
> -
>
> Key: SPARK-40191
> URL: https://issues.apache.org/jira/browse/SPARK-40191
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark, Spark Core
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40191) Make pyspark.resource examples self-contained

2022-08-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40191:


Assignee: Apache Spark

> Make pyspark.resource examples self-contained
> -
>
> Key: SPARK-40191
> URL: https://issues.apache.org/jira/browse/SPARK-40191
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark, Spark Core
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40191) Make pyspark.resource examples self-contained

2022-08-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40191:


Assignee: Apache Spark

> Make pyspark.resource examples self-contained
> -
>
> Key: SPARK-40191
> URL: https://issues.apache.org/jira/browse/SPARK-40191
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark, Spark Core
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40192) Remove redundant groupby

2022-08-23 Thread deshanxiao (Jira)
deshanxiao created SPARK-40192:
--

 Summary: Remove redundant groupby
 Key: SPARK-40192
 URL: https://issues.apache.org/jira/browse/SPARK-40192
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: deshanxiao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40192) Remove redundant groupby

2022-08-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583532#comment-17583532
 ] 

Apache Spark commented on SPARK-40192:
--

User 'deshanxiao' has created a pull request for this issue:
https://github.com/apache/spark/pull/37628

> Remove redundant groupby
> 
>
> Key: SPARK-40192
> URL: https://issues.apache.org/jira/browse/SPARK-40192
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: deshanxiao
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40192) Remove redundant groupby

2022-08-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40192:


Assignee: (was: Apache Spark)

> Remove redundant groupby
> 
>
> Key: SPARK-40192
> URL: https://issues.apache.org/jira/browse/SPARK-40192
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: deshanxiao
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40192) Remove redundant groupby

2022-08-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40192:


Assignee: Apache Spark

> Remove redundant groupby
> 
>
> Key: SPARK-40192
> URL: https://issues.apache.org/jira/browse/SPARK-40192
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: deshanxiao
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40160) Make pyspark.broadcast examples self-contained

2022-08-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583556#comment-17583556
 ] 

Apache Spark commented on SPARK-40160:
--

User 'dcoliversun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37629

> Make pyspark.broadcast examples self-contained
> --
>
> Key: SPARK-40160
> URL: https://issues.apache.org/jira/browse/SPARK-40160
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Qian Sun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40160) Make pyspark.broadcast examples self-contained

2022-08-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40160:


Assignee: (was: Apache Spark)

> Make pyspark.broadcast examples self-contained
> --
>
> Key: SPARK-40160
> URL: https://issues.apache.org/jira/browse/SPARK-40160
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Qian Sun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40160) Make pyspark.broadcast examples self-contained

2022-08-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40160:


Assignee: Apache Spark

> Make pyspark.broadcast examples self-contained
> --
>
> Key: SPARK-40160
> URL: https://issues.apache.org/jira/browse/SPARK-40160
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Qian Sun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40193) Merge different filters when merging subquery plans

2022-08-23 Thread Peter Toth (Jira)
Peter Toth created SPARK-40193:
--

 Summary: Merge different filters when merging subquery plans
 Key: SPARK-40193
 URL: https://issues.apache.org/jira/browse/SPARK-40193
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Peter Toth


We could improve SPARK-34079 to be able to merge different filters.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40193) Merge subquery plans with different filters

2022-08-23 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-40193:
---
Summary: Merge subquery plans with different filters  (was: Merge different 
filters when merging subquery plans)

> Merge subquery plans with different filters
> ---
>
> Key: SPARK-40193
> URL: https://issues.apache.org/jira/browse/SPARK-40193
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Peter Toth
>Priority: Major
>
> We could improve SPARK-34079 to be able to merge different filters.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40193) Merge subquery plans with different filters

2022-08-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583658#comment-17583658
 ] 

Apache Spark commented on SPARK-40193:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/37630

> Merge subquery plans with different filters
> ---
>
> Key: SPARK-40193
> URL: https://issues.apache.org/jira/browse/SPARK-40193
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Peter Toth
>Priority: Major
>
> We could improve SPARK-34079 to be able to merge different filters.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40193) Merge subquery plans with different filters

2022-08-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40193:


Assignee: (was: Apache Spark)

> Merge subquery plans with different filters
> ---
>
> Key: SPARK-40193
> URL: https://issues.apache.org/jira/browse/SPARK-40193
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Peter Toth
>Priority: Major
>
> We could improve SPARK-34079 to be able to merge different filters.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40193) Merge subquery plans with different filters

2022-08-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40193:


Assignee: Apache Spark

> Merge subquery plans with different filters
> ---
>
> Key: SPARK-40193
> URL: https://issues.apache.org/jira/browse/SPARK-40193
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Peter Toth
>Assignee: Apache Spark
>Priority: Major
>
> We could improve SPARK-34079 to be able to merge different filters.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40183) Use error class NUMERIC_VALUE_OUT_OF_RANGE for overflow in decimal conversion

2022-08-23 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-40183.

Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37620
[https://github.com/apache/spark/pull/37620]

> Use error class NUMERIC_VALUE_OUT_OF_RANGE for overflow in decimal conversion
> -
>
> Key: SPARK-40183
> URL: https://issues.apache.org/jira/browse/SPARK-40183
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.4.0
>
>
> Use error class NUMERIC_VALUE_OUT_OF_RANGE for overflow in decimal 
> conversion, instead of the confusing error class 
> `CANNOT_CHANGE_DECIMAL_PRECISION`.
> Also, use `decimal.toPlainString` instead of `decimal.toDebugString` in the 
> error message.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40089) Sorting of at least Decimal(20, 2) fails for some values near the max.

2022-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-40089:
-

Assignee: Robert Joseph Evans  (was: Apache Spark)

> Sorting of at least Decimal(20, 2) fails for some values near the max.
> --
>
> Key: SPARK-40089
> URL: https://issues.apache.org/jira/browse/SPARK-40089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0, 3.4.0
>Reporter: Robert Joseph Evans
>Assignee: Robert Joseph Evans
>Priority: Major
> Fix For: 3.1.4, 3.4.0, 3.3.1, 3.2.3
>
> Attachments: input.parquet
>
>
> I have been doing some testing with Decimal values for the RAPIDS Accelerator 
> for Apache Spark. I have been trying to add in new corner cases and when I 
> tried to enable the maximum supported value for a sort I started to get 
> failures.  On closer inspection it looks like the CPU is sorting things 
> incorrectly.  Specifically anything that is "99.50" or above 
> is placed as a chunk in the wrong location in the outputs.
>  In local mode with 12 tasks.
> {code:java}
> spark.read.parquet("input.parquet").orderBy(col("a")).collect.foreach(System.err.println)
>  {code}
>  
> Here you will notice that the last entry printed is 
> {{[99.49]}}, and {{[99.99]}} is near the top 
> near {{[-99.99]}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40172) Temporarily disable flaky test cases in ImageFileFormatSuite

2022-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-40172:
--
Fix Version/s: 3.3.1
   3.2.3

> Temporarily disable flaky test cases in ImageFileFormatSuite
> 
>
> Key: SPARK-40172
> URL: https://issues.apache.org/jira/browse/SPARK-40172
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 3.4.0, 3.3.1, 3.2.3
>
>
> 3 test cases in ImageFileFormatSuite become flaky in the GitHub action tests:
> [https://github.com/apache/spark/runs/7941765326?check_suite_focus=true]
> Before they are fixed (https://issues.apache.org/jira/browse/SPARK-40171), I 
> suggest disabling them in OSS.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40171) Fix flaky tests in ImageFileFormatSuite

2022-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-40171:
--
Affects Version/s: 3.3.1
   3.2.3

> Fix flaky tests in ImageFileFormatSuite
> ---
>
> Key: SPARK-40171
> URL: https://issues.apache.org/jira/browse/SPARK-40171
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.4.0, 3.3.1, 3.2.3
>Reporter: Gengliang Wang
>Priority: Major
>
> There are 3 test cases that become flaky in the GitHub action tests:
> [https://github.com/apache/spark/runs/7941765326?check_suite_focus=true]
> We should fix them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40172) Temporarily disable flaky test cases in ImageFileFormatSuite

2022-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-40172:
--
Affects Version/s: 3.2.2
   3.3.0

> Temporarily disable flaky test cases in ImageFileFormatSuite
> 
>
> Key: SPARK-40172
> URL: https://issues.apache.org/jira/browse/SPARK-40172
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 3.3.0, 3.2.2, 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 3.4.0, 3.3.1, 3.2.3
>
>
> 3 test cases in ImageFileFormatSuite become flaky in the GitHub action tests:
> [https://github.com/apache/spark/runs/7941765326?check_suite_focus=true]
> Before they are fixed (https://issues.apache.org/jira/browse/SPARK-40171), I 
> suggest disabling them in OSS.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40171) Fix flaky tests in ImageFileFormatSuite

2022-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-40171:
--
Affects Version/s: 3.2.2
   3.3.0
   (was: 3.3.1)
   (was: 3.2.3)

> Fix flaky tests in ImageFileFormatSuite
> ---
>
> Key: SPARK-40171
> URL: https://issues.apache.org/jira/browse/SPARK-40171
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.3.0, 3.2.2, 3.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> There are 3 test cases that become flaky in the GitHub action tests:
> [https://github.com/apache/spark/runs/7941765326?check_suite_focus=true]
> We should fix them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34631) Caught Hive MetaException when query by partition (partition col start with ‘$’)

2022-08-23 Thread Brendan Morin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583785#comment-17583785
 ] 

Brendan Morin commented on SPARK-34631:
---

For other users encountering this issue suddenly: make sure you check all 
partitions, not just the partition you want to query. I'm encountering this 
error and am able to reproduce it in the following scenario:

Trying to query an external hive table that is partitioned on date, e.g.:

 
{code:java}
df = spark.sql("select a, b from my_db.my_table where date = '2022-08-10'")
df.show()

>>> java.lang.RuntimeException:  Caught Hive MetaException attempting to get 
>>> partition...{code}
Confirm the data type of the columns:

 

 
{code:java}
spark.sql("select * from my_db.my_table").printSchema() 

>>> root
>>>  |-- a: string (nullable = true)
>>>  |-- b: string (nullable = true)  
>>>  |-- date: date (nullable = true){code}
Check the partitions:

 

 
{code:java}
spark.sql("show partitions my_db.my_table").show(20, False) 

>>> +---+ 
>>> |partition  | 
>>> +---+ 
>>> |date=2022-08-07| 
>>> |date=2022-08-08| 
>>> |date=2022-08-08_tmp|   # Note the malformed partition
>>> |date=2022-08-09| 
>>> |date=2022-08-10| 
>>> |date=2022-08-11| 
>>> |date=2022-08-12| 
>>> +---+{code}
This was the problem in my case. There was a date partition (note: the problem 
partition was not the one I was querying for) that was malformed in the HDFS 
directory where the Hive external table data was located. The partition string 
could not be parsed into the column's data type. Removing this partition from 
HDFS, then dropping and recreating the table with MSCK REPAIR, solved the issue.
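
A sketch of the cleanup steps described above (hypothetical database, table, and
path names, not the actual ones):

{code:scala}
// 1. Remove the malformed partition directory from HDFS (outside Spark), e.g.:
//      hdfs dfs -rm -r /warehouse/my_db.db/my_table/date=2022-08-08_tmp
// 2. Drop and recreate the external table, then rebuild the partition metadata.
spark.sql("DROP TABLE IF EXISTS my_db.my_table")
spark.sql("""
  CREATE EXTERNAL TABLE my_db.my_table (a STRING, b STRING)
  PARTITIONED BY (`date` DATE)
  STORED AS PARQUET
  LOCATION '/warehouse/my_db.db/my_table'
""")
spark.sql("MSCK REPAIR TABLE my_db.my_table")
{code}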

For additional context, my_db.my_table was managed as an external table. Table 
updates were done by writing parquet files as partitions, and then running drop 
table, create table, and MSCK repair on the table. For some reason, this 
write/update process did not fail due to the malformed partition, so additional 
partitions were able to continue to be added. The problem only manifested on 
read.

I think that the root cause in my case is actually an overly broad catch by 
spark, and I think the error handling logic could be refined to identify this 
root cause, or clue users in that the issue may be a malformed partition name 
that does not parse correctly into the expected data type (date in this case).

 

> Caught Hive MetaException when query by partition (partition col start with 
> ‘$’)
> 
>
> Key: SPARK-34631
> URL: https://issues.apache.org/jira/browse/SPARK-34631
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Java API
>Affects Versions: 2.4.4
>Reporter: zhouyuan
>Priority: Critical
>
> Create a table, set its location to the Parquet data, and run MSCK REPAIR TABLE 
> to load the partitions. But when querying by the partition column, we get errors 
> (adding backticks does not help):
> {code:java}
> // code placeholder
> {code}
> select count from some_table where `$partition_date` = '2015-01-01'
>  
> {panel:title=error:}
> java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARK
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:772)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:679)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:677)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:677)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1221)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1214)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1214)
>  

[jira] [Comment Edited] (SPARK-34631) Caught Hive MetaException when query by partition (partition col start with ‘$’)

2022-08-23 Thread Brendan Morin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583785#comment-17583785
 ] 

Brendan Morin edited comment on SPARK-34631 at 8/23/22 6:18 PM:


For other users encountering this issue suddenly: make sure you check all 
partitions, not just the partition you want to query. I'm encountering this 
error and am able to reproduce it in the following scenario:

Trying to query an external hive table that is partitioned on date, e.g.:
{code:java}
df = spark.sql("select a, b from my_db.my_table where date = '2022-08-10'")
df.show()

>>> java.lang.RuntimeException:  Caught Hive MetaException attempting to get 
>>> partition...
>>> Caused by: MetaException(message:Filtering is supported only on partition 
>>> keys of type string){code}
Confirm the data type of the columns:
{code:java}
spark.sql("select * from my_db.my_table").printSchema() 

>>> root
>>>  |-- a: string (nullable = true)
>>>  |-- b: string (nullable = true)  
>>>  |-- date: date (nullable = true){code}
Check the partitions:
{code:java}
spark.sql("show partitions my_db.my_table").show(20, False) 

>>> +---+ 
>>> |partition  | 
>>> +---+ 
>>> |date=2022-08-07| 
>>> |date=2022-08-08| 
>>> |date=2022-08-08_tmp|   # Note the malformed partition
>>> |date=2022-08-09| 
>>> |date=2022-08-10| 
>>> |date=2022-08-11| 
>>> |date=2022-08-12| 
>>> +---+{code}
This was the problem in my case. There was a date partition (note: the problem 
partition was not the one I was querying for) that was malformed in the HDFS 
directory where the Hive external table data was located. The partition string 
could not be parsed into the column's data type. Removing this partition from 
HDFS, then dropping and recreating the table with MSCK REPAIR, solved the issue.

For additional context, my_db.my_table was managed as an external table. Table 
updates were done by writing parquet files as partitions, and then running drop 
table, create table, and MSCK repair on the table. For some reason, this 
write/update process did not fail due to the malformed partition, so additional 
partitions were able to continue to be added. The problem only manifested on 
read.

I think that the root cause in my case is actually an overly broad catch by 
spark, and I think the error handling logic could be refined to identify this 
root cause, or clue users in that the issue may be a malformed partition name 
that does not parse correctly into the expected data type (date in this case).

The specific error:
{code:java}
Caused by: MetaException(message:Filtering is supported only on partition keys 
of type string){code}
is a bit of a red herring, as this is not true, and searching this error will 
lead you down a rabbit hole of incorrect root cause/unrelated issues.


was (Author: brendanjmorin):
For other users encountering this issue suddenly: make sure you check all 
partitions, not just the partition you want to query. I'm encountering this 
error and am able to reproduce it in the following scenario:

Trying to query an external hive table that is partitioned on date, e.g.:

 
{code:java}
df = spark.sql("select a, b from my_db.my_table where date = '2022-08-10'")
df.show()

>>> java.lang.RuntimeException:  Caught Hive MetaException attempting to get 
>>> partition...{code}
Confirm the data type of the columns:

 

 
{code:java}
spark.sql("select * from my_db.my_table").printSchema() 

>>> root
>>>  |-- a: string (nullable = true)
>>>  |-- b: string (nullable = true)  
>>>  |-- date: date (nullable = true){code}
Check the partitions:

 

 
{code:java}
spark.sql("show partitions my_db.my_table").show(20, False) 

>>> +---+ 
>>> |partition  | 
>>> +---+ 
>>> |date=2022-08-07| 
>>> |date=2022-08-08| 
>>> |date=2022-08-08_tmp|   # Note the malformed partition
>>> |date=2022-08-09| 
>>> |date=2022-08-10| 
>>> |date=2022-08-11| 
>>> |date=2022-08-12| 
>>> +---+{code}
This was the problem in my case. There was a date partition (Note: the problem 
partition was not the one I was querying) that was malformed in the HDFS 
directory where the hive external table data was located. The string format was 
unable to be properly parsed into the data type. Removing this partition from 
HDFS, dropping and recreating the table with MSCK repair solved the issue.

For additional context, my_db.my_table was managed as an external table. Table 
updates were done by writing parquet files as partitions, and then running drop 
table, create table, and MSCK repair on the table. For some reason, this 
write/update process did not fail due to the malformed partition, so additional 
partitions were able to continue to be added. The problem only manifested on 
read.

I think that the root cause in my case is actually a

[jira] [Comment Edited] (SPARK-34631) Caught Hive MetaException when query by partition (partition col start with ‘$’)

2022-08-23 Thread Brendan Morin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583785#comment-17583785
 ] 

Brendan Morin edited comment on SPARK-34631 at 8/23/22 6:52 PM:


For other users encountering this issue suddenly: make sure you check all 
partitions, not just the partition you want to query. I'm encountering, and able 
to duplicate, this error in the following scenario:

Trying to query an external hive table that is partitioned on date, e.g.:
{code:java}
df = spark.sql("select a, b from my_db.my_table where date = '2022-08-10'")
df.show()

>>> java.lang.RuntimeException:  Caught Hive MetaException attempting to get 
>>> partition...
>>> Caused by: MetaException(message:Filtering is supported only on partition 
>>> keys of type string){code}
Confirm the data type of the columns:
{code:java}
spark.sql("select * from my_db.my_table").printSchema() 

>>> root
>>>  |-- a: string (nullable = true)
>>>  |-- b: string (nullable = true)  
>>>  |-- date: date (nullable = true){code}
Check the partitions:
{code:java}
spark.sql("show partitions my_db.my_table").show(20, False) 

>>> +---+ 
>>> |partition  | 
>>> +---+ 
>>> |date=2022-08-07| 
>>> |date=2022-08-08| 
>>> |date=2022-08-08_tmp|   # Note the malformed partition
>>> |date=2022-08-09| 
>>> |date=2022-08-10| 
>>> |date=2022-08-11| 
>>> |date=2022-08-12| 
>>> +---+{code}
This was the problem in my case. There was a date partition (note: the problem 
partition was not the one I was querying) that was malformed in the HDFS 
directory where the Hive external table data was located. Its string value could 
not be parsed into the declared partition data type (date). Removing this 
partition from HDFS, then dropping and recreating the table and running MSCK 
REPAIR, solved the issue.

For additional context, my_db.my_table was managed as an external table. Table 
updates were done by writing Parquet files as partitions and then running drop 
table, create table, and MSCK REPAIR on the table. For some reason this 
write/update process did not fail on the malformed partition, so additional 
partitions continued to be added; the problem only manifested on read.

I think the root cause in my case is actually an overly broad catch by Spark. 
The error-handling logic could be refined to identify this root cause, or at 
least clue users in that the issue may be a malformed partition name that does 
not parse correctly into the expected data type (date in this case).

The specific error:
{code:java}
Caused by: MetaException(message:Filtering is supported only on partition keys 
of type string){code}
is a bit of a red herring, since it is not actually the case here, and searching 
for this error will lead you down a rabbit hole of incorrect root causes and 
unrelated issues.


was (Author: brendanjmorin):
For other users encountering this issue suddenly: make sure you check all 
partitions, not just the partition you want to query. I'm encountering, and able 
to duplicate, this error in the following scenario:

Trying to query an external hive table that is partitioned on date, e.g.:
{code:java}
df = spark.sql("select a, b from my_db.my_table where date = '2022-08-10'")
df.show()

>>> java.lang.RuntimeException:  Caught Hive MetaException attempting to get 
>>> partition...
>>> Caused by: MetaException(message:Filtering is supported only on partition 
>>> keys of type string){code}
Confirm the data type of the columns:
{code:java}
spark.sql("select * from my_db.my_table").printSchema() 

>>> root
>>>  |-- a: string (nullable = true)
>>>  |-- b: string (nullable = true)  
>>>  |-- date: date (nullable = true){code}
Check the partitions:
{code:java}
spark.sql("show partitions my_db.my_table").show(20, False) 

>>> +---+ 
>>> |partition  | 
>>> +---+ 
>>> |date=2022-08-07| 
>>> |date=2022-08-08| 
>>> |date=2022-08-08_tmp|   # Note the malformed partition
>>> |date=2022-08-09| 
>>> |date=2022-08-10| 
>>> |date=2022-08-11| 
>>> |date=2022-08-12| 
>>> +---+{code}
This was the problem in my case. There was a date partition (Note: the problem 
partition was not the one I was querying) that was malformed in the HDFS 
directory where the hive external table data was located. The string format was 
unable to be properly parsed into the data type. Removing this partition from 
HDFS, dropping and recreating the table with MSCK repair solved the issue.

For additional context, my_db.my_table was managed as an external table. Table 
updates were done by writing parquet files as partitions, and then running drop 
table, create table, and MSCK repair on the table. For some reason, this 
write/update process did not fail due to the malformed partition, so additional 
partitions were able to continue to be added.

[jira] [Created] (SPARK-40194) SPLIT function on empty regex should truncate trailing empty string.

2022-08-23 Thread Vitalii Li (Jira)
Vitalii Li created SPARK-40194:
--

 Summary: SPLIT function on empty regex should truncate trailing 
empty string.
 Key: SPARK-40194
 URL: https://issues.apache.org/jira/browse/SPARK-40194
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Vitalii Li


E.g. `select split('hello', '')` should return `['h', 'e', 'l', 'l', 'o']` 
instead of `['h', 'e', 'l', 'l', 'o', '']`. An explicit `limit` parameter would 
then be required to preserve the trailing empty string.
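For illustration, a quick check from PySpark (a sketch; the outputs in the 
comments reflect the current behavior described by this issue and the proposed 
change, not verified results):
{code:python}
# Current behavior: splitting on an empty regex keeps a trailing empty string.
spark.sql("select split('hello', '')").first()[0]
# ['h', 'e', 'l', 'l', 'o', '']

# Proposed behavior: the trailing empty string is truncated by default; keeping
# it would require an explicit limit argument, e.g.:
spark.sql("select split('hello', '', -1)").first()[0]
{code}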



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40194) SPLIT function on empty regex should truncate trailing empty string.

2022-08-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40194:


Assignee: (was: Apache Spark)

> SPLIT function on empty regex should truncate trailing empty string.
> 
>
> Key: SPARK-40194
> URL: https://issues.apache.org/jira/browse/SPARK-40194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vitalii Li
>Priority: Major
>
> E.g. `select split('hello', '')` should convert to `['h', 'e', 'l', 'l', 
> 'o']` instead of `['h', 'e', 'l', 'l', 'o', '']`. Requires explicit `limit` 
> parameter to preserve trailing empty string.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40194) SPLIT function on empty regex should truncate trailing empty string.

2022-08-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583877#comment-17583877
 ] 

Apache Spark commented on SPARK-40194:
--

User 'vitaliili-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/37631

> SPLIT function on empty regex should truncate trailing empty string.
> 
>
> Key: SPARK-40194
> URL: https://issues.apache.org/jira/browse/SPARK-40194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vitalii Li
>Priority: Major
>
> E.g. `select split('hello', '')` should convert to `['h', 'e', 'l', 'l', 
> 'o']` instead of `['h', 'e', 'l', 'l', 'o', '']`. Requires explicit `limit` 
> parameter to preserve trailing empty string.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40194) SPLIT function on empty regex should truncate trailing empty string.

2022-08-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40194:


Assignee: Apache Spark

> SPLIT function on empty regex should truncate trailing empty string.
> 
>
> Key: SPARK-40194
> URL: https://issues.apache.org/jira/browse/SPARK-40194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vitalii Li
>Assignee: Apache Spark
>Priority: Major
>
> E.g. `select split('hello', '')` should convert to `['h', 'e', 'l', 'l', 
> 'o']` instead of `['h', 'e', 'l', 'l', 'o', '']`. Requires explicit `limit` 
> parameter to preserve trailing empty string.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40195) Add PrunedScanWithAQESuite

2022-08-23 Thread Kazuyuki Tanimura (Jira)
Kazuyuki Tanimura created SPARK-40195:
-

 Summary: Add PrunedScanWithAQESuite
 Key: SPARK-40195
 URL: https://issues.apache.org/jira/browse/SPARK-40195
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 3.4.0
Reporter: Kazuyuki Tanimura


Currently `PrunedScanSuite` assumes that AQE is never applied. We should also 
test with AQE forcibly applied.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40196) Consolidate `lit` function with NumPy input in sql and pandas module

2022-08-23 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-40196:


 Summary: Consolidate `lit` function with NumPy input in sql and 
pandas module
 Key: SPARK-40196
 URL: https://issues.apache.org/jira/browse/SPARK-40196
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


The `lit` function with NumPy input has different implementations in the sql and 
pandas modules; as a result, the sql module produces a less precise result than 
pandas.

We should make their results consistent, preferring the more precise behavior.
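A sketch of the kind of comparison involved (assuming `lit` accepts NumPy 
scalars as this ticket implies; the value below is illustrative):
{code:python}
import numpy as np
from pyspark.sql import functions as F

# Check which Spark type the sql module maps a NumPy scalar literal to; per this
# ticket, it can be less precise than the mapping used on the pandas-on-Spark path.
value = np.float64(0.12345678901234567)
df = spark.range(1).select(F.lit(value).alias("v"))
df.printSchema()
df.show(truncate=False)
{code}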



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40196) Consolidate `lit` function with NumPy input in sql and pandas module

2022-08-23 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-40196:
-
Description: 
Per [https://github.com/apache/spark/pull/37560#discussion_r952882996,]

the `lit` function with NumPy input has different implementations in the sql and 
pandas modules; as a result, the sql module produces a less precise result than 
pandas.

We should make their results consistent, preferring the more precise behavior.

  was:
The `lit` function with NumPy input has different implementations in the sql and 
pandas modules; as a result, the sql module produces a less precise result than 
pandas.

We should make their results consistent, preferring the more precise behavior.


> Consolidate `lit` function with NumPy input in sql and pandas module
> 
>
> Key: SPARK-40196
> URL: https://issues.apache.org/jira/browse/SPARK-40196
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Per [https://github.com/apache/spark/pull/37560#discussion_r952882996,]
> function `lit` with NumPy input in sql and pandas module have different 
> implementations, thus, sql has a less precise result than pandas.
> We shall make their result consistent, the more precise, the better.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40191) Make pyspark.resource examples self-contained

2022-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-40191.
---
Fix Version/s: 3.4.0
 Assignee: Hyukjin Kwon  (was: Apache Spark)
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/37627

> Make pyspark.resource examples self-contained
> -
>
> Key: SPARK-40191
> URL: https://issues.apache.org/jira/browse/SPARK-40191
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark, Spark Core
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40197) Replace query plan with context for MULTI_VALUE_SUBQUERY_ERROR

2022-08-23 Thread Vitalii Li (Jira)
Vitalii Li created SPARK-40197:
--

 Summary: Replace query plan with context for 
MULTI_VALUE_SUBQUERY_ERROR
 Key: SPARK-40197
 URL: https://issues.apache.org/jira/browse/SPARK-40197
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Vitalii Li


Output the subquery context instead of the full query plan.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40197) Replace query plan with context for MULTI_VALUE_SUBQUERY_ERROR

2022-08-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40197:


Assignee: (was: Apache Spark)

> Replace query plan with context for MULTI_VALUE_SUBQUERY_ERROR
> --
>
> Key: SPARK-40197
> URL: https://issues.apache.org/jira/browse/SPARK-40197
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vitalii Li
>Priority: Major
>
> Instead of a query plan - output subquery context.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40197) Replace query plan with context for MULTI_VALUE_SUBQUERY_ERROR

2022-08-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583901#comment-17583901
 ] 

Apache Spark commented on SPARK-40197:
--

User 'vitaliili-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/37632

> Replace query plan with context for MULTI_VALUE_SUBQUERY_ERROR
> --
>
> Key: SPARK-40197
> URL: https://issues.apache.org/jira/browse/SPARK-40197
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vitalii Li
>Priority: Major
>
> Instead of a query plan - output subquery context.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40197) Replace query plan with context for MULTI_VALUE_SUBQUERY_ERROR

2022-08-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40197:


Assignee: Apache Spark

> Replace query plan with context for MULTI_VALUE_SUBQUERY_ERROR
> --
>
> Key: SPARK-40197
> URL: https://issues.apache.org/jira/browse/SPARK-40197
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vitalii Li
>Assignee: Apache Spark
>Priority: Major
>
> Instead of a query plan - output subquery context.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40198) Enable spark.storage.decommission.(rdd|shuffle)Blocks.enabled by default

2022-08-23 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-40198:
-

 Summary: Enable 
spark.storage.decommission.(rdd|shuffle)Blocks.enabled by default
 Key: SPARK-40198
 URL: https://issues.apache.org/jira/browse/SPARK-40198
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40198) Enable spark.storage.decommission.(rdd|shuffle)Blocks.enabled by default

2022-08-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583903#comment-17583903
 ] 

Apache Spark commented on SPARK-40198:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37633

> Enable spark.storage.decommission.(rdd|shuffle)Blocks.enabled by default
> 
>
> Key: SPARK-40198
> URL: https://issues.apache.org/jira/browse/SPARK-40198
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40198) Enable spark.storage.decommission.(rdd|shuffle)Blocks.enabled by default

2022-08-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40198:


Assignee: Apache Spark

> Enable spark.storage.decommission.(rdd|shuffle)Blocks.enabled by default
> 
>
> Key: SPARK-40198
> URL: https://issues.apache.org/jira/browse/SPARK-40198
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40198) Enable spark.storage.decommission.(rdd|shuffle)Blocks.enabled by default

2022-08-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40198:


Assignee: (was: Apache Spark)

> Enable spark.storage.decommission.(rdd|shuffle)Blocks.enabled by default
> 
>
> Key: SPARK-40198
> URL: https://issues.apache.org/jira/browse/SPARK-40198
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40078) Make pyspark.sql.column examples self-contained

2022-08-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40078.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37521
[https://github.com/apache/spark/pull/37521]

> Make pyspark.sql.column examples self-contained
> ---
>
> Key: SPARK-40078
> URL: https://issues.apache.org/jira/browse/SPARK-40078
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Qian Sun
>Assignee: Qian Sun
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40078) Make pyspark.sql.column examples self-contained

2022-08-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40078:


Assignee: Qian Sun

> Make pyspark.sql.column examples self-contained
> ---
>
> Key: SPARK-40078
> URL: https://issues.apache.org/jira/browse/SPARK-40078
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Qian Sun
>Assignee: Qian Sun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39150) Remove `# doctest: +SKIP` of SPARK-38947/SPARK-39326 when infra dump pandas to 1.4+

2022-08-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-39150:


Assignee: (was: Yikun Jiang)

> Remove `# doctest: +SKIP` of SPARK-38947/SPARK-39326 when infra dump pandas 
> to 1.4+
> ---
>
> Key: SPARK-39150
> URL: https://issues.apache.org/jira/browse/SPARK-39150
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> [https://github.com/apache/spark/blob/fe85d7912f86c3e337aa93b23bfa7e7e01c0a32e/python/pyspark/pandas/groupby.py#L2333]
> [https://github.com/apache/spark/blob/fe85d7912f86c3e337aa93b23bfa7e7e01c0a32e/python/pyspark/pandas/groupby.py#L2265]
> all doctest in https://github.com/apache/spark/pull/36712



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39150) Remove `# doctest: +SKIP` of SPARK-38947/SPARK-39326 when infra dump pandas to 1.4+

2022-08-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-39150:
-
Fix Version/s: (was: 3.4.0)

> Remove `# doctest: +SKIP` of SPARK-38947/SPARK-39326 when infra dump pandas 
> to 1.4+
> ---
>
> Key: SPARK-39150
> URL: https://issues.apache.org/jira/browse/SPARK-39150
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
>
> [https://github.com/apache/spark/blob/fe85d7912f86c3e337aa93b23bfa7e7e01c0a32e/python/pyspark/pandas/groupby.py#L2333]
> [https://github.com/apache/spark/blob/fe85d7912f86c3e337aa93b23bfa7e7e01c0a32e/python/pyspark/pandas/groupby.py#L2265]
> all doctest in https://github.com/apache/spark/pull/36712



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-39150) Remove `# doctest: +SKIP` of SPARK-38947/SPARK-39326 when infra dump pandas to 1.4+

2022-08-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-39150:
--

> Remove `# doctest: +SKIP` of SPARK-38947/SPARK-39326 when infra dump pandas 
> to 1.4+
> ---
>
> Key: SPARK-39150
> URL: https://issues.apache.org/jira/browse/SPARK-39150
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0
>
>
> [https://github.com/apache/spark/blob/fe85d7912f86c3e337aa93b23bfa7e7e01c0a32e/python/pyspark/pandas/groupby.py#L2333]
> [https://github.com/apache/spark/blob/fe85d7912f86c3e337aa93b23bfa7e7e01c0a32e/python/pyspark/pandas/groupby.py#L2265]
> all doctest in https://github.com/apache/spark/pull/36712



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39150) Remove `# doctest: +SKIP` of SPARK-38947/SPARK-39326 when infra dump pandas to 1.4+

2022-08-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-39150.
--
Resolution: Not A Problem

Reverted at 
https://github.com/apache/spark/commit/d32a67f92cfcc7c67f44e682d4c3612d60ba1b3a

> Remove `# doctest: +SKIP` of SPARK-38947/SPARK-39326 when infra dump pandas 
> to 1.4+
> ---
>
> Key: SPARK-39150
> URL: https://issues.apache.org/jira/browse/SPARK-39150
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0
>
>
> [https://github.com/apache/spark/blob/fe85d7912f86c3e337aa93b23bfa7e7e01c0a32e/python/pyspark/pandas/groupby.py#L2333]
> [https://github.com/apache/spark/blob/fe85d7912f86c3e337aa93b23bfa7e7e01c0a32e/python/pyspark/pandas/groupby.py#L2265]
> all doctest in https://github.com/apache/spark/pull/36712



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40124) Update TPCDS v1.4 q32 for Plan Stability tests

2022-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-40124:
--
Fix Version/s: 3.3.1

> Update TPCDS v1.4 q32 for Plan Stability tests
> --
>
> Key: SPARK-40124
> URL: https://issues.apache.org/jira/browse/SPARK-40124
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kapil Singh
>Assignee: Kapil Singh
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40199) Spark throws NPE without useful message when NULL value appears in non-null schema

2022-08-23 Thread Erik Krogen (Jira)
Erik Krogen created SPARK-40199:
---

 Summary: Spark throws NPE without useful message when NULL value 
appears in non-null schema
 Key: SPARK-40199
 URL: https://issues.apache.org/jira/browse/SPARK-40199
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.2
Reporter: Erik Krogen


Currently, in some cases, if Spark encounters a NULL value where the schema 
indicates that the column/field should be non-null, it throws a 
{{NullPointerException}} with no message and thus no way to debug further. This 
can happen, for example, if a UDF is erroneously marked non-nullable via 
{{asNonNullable()}}, or if you read input data whose actual values don't match 
the schema (which could happen e.g. with Avro if the reader provides a schema 
declaring non-null fields although the data was written with null values).

As an example of how to reproduce:
{code:scala}
val badUDF = spark.udf.register[String, Int]("bad_udf", in => 
null).asNonNullable()
Seq(1, 2).toDF("c1").select(badUDF($"c1")).collect()
{code}

This throws an exception like:
{code}
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 
1) (xx executor driver): java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
at org.apache.spark.scheduler.Task.run(Task.scala:139)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1490)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{code}

As a user, it is very confusing -- it looks like there is a bug in Spark. We 
have had many users report such problems, and though we can guide them to a 
schema-data mismatch, there is no indication of what field might contain the 
bad values, so a laborious data exploration process is required to find and 
remedy it.

We should provide a better error message in such cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-39150) Remove `# doctest: +SKIP` of SPARK-38947/SPARK-39326 when infra dump pandas to 1.4+

2022-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-39150.
-

> Remove `# doctest: +SKIP` of SPARK-38947/SPARK-39326 when infra dump pandas 
> to 1.4+
> ---
>
> Key: SPARK-39150
> URL: https://issues.apache.org/jira/browse/SPARK-39150
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> [https://github.com/apache/spark/blob/fe85d7912f86c3e337aa93b23bfa7e7e01c0a32e/python/pyspark/pandas/groupby.py#L2333]
> [https://github.com/apache/spark/blob/fe85d7912f86c3e337aa93b23bfa7e7e01c0a32e/python/pyspark/pandas/groupby.py#L2265]
> all doctest in https://github.com/apache/spark/pull/36712



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40199) Spark throws NPE without useful message when NULL value appears in non-null schema

2022-08-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40199:


Assignee: Apache Spark

> Spark throws NPE without useful message when NULL value appears in non-null 
> schema
> --
>
> Key: SPARK-40199
> URL: https://issues.apache.org/jira/browse/SPARK-40199
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.2
>Reporter: Erik Krogen
>Assignee: Apache Spark
>Priority: Major
>
> Currently in some cases, if Spark encounters a NULL value where the schema 
> indicates that the column/field should be non-null, it will throw a 
> {{NullPointerException}} with no message and thus no way to debug further. 
> This can happen, for example, if you have a UDF which is erroneously marked 
> as {{asNonNullable()}}, or if you read input data where the actual values 
> don't match the schema (which could happen e.g. with Avro if the reader 
> provides a schema declaring non-null although the data was written with null 
> values).
> As an example of how to reproduce:
> {code:scala}
> val badUDF = spark.udf.register[String, Int]("bad_udf", in => 
> null).asNonNullable()
> Seq(1, 2).toDF("c1").select(badUDF($"c1")).collect()
> {code}
> This throws an exception like:
> {code}
> Driver stacktrace:
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 
> (TID 1) (xx executor driver): java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
>   at org.apache.spark.scheduler.Task.run(Task.scala:139)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1490)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> As a user, it is very confusing -- it looks like there is a bug in Spark. We 
> have had many users report such problems, and though we can guide them to a 
> schema-data mismatch, there is no indication of what field might contain the 
> bad values, so a laborious data exploration process is required to find and 
> remedy it.
> We should provide a better error message in such cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40199) Spark throws NPE without useful message when NULL value appears in non-null schema

2022-08-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40199:


Assignee: (was: Apache Spark)

> Spark throws NPE without useful message when NULL value appears in non-null 
> schema
> --
>
> Key: SPARK-40199
> URL: https://issues.apache.org/jira/browse/SPARK-40199
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.2
>Reporter: Erik Krogen
>Priority: Major
>
> Currently in some cases, if Spark encounters a NULL value where the schema 
> indicates that the column/field should be non-null, it will throw a 
> {{NullPointerException}} with no message and thus no way to debug further. 
> This can happen, for example, if you have a UDF which is erroneously marked 
> as {{asNonNullable()}}, or if you read input data where the actual values 
> don't match the schema (which could happen e.g. with Avro if the reader 
> provides a schema declaring non-null although the data was written with null 
> values).
> As an example of how to reproduce:
> {code:scala}
> val badUDF = spark.udf.register[String, Int]("bad_udf", in => 
> null).asNonNullable()
> Seq(1, 2).toDF("c1").select(badUDF($"c1")).collect()
> {code}
> This throws an exception like:
> {code}
> Driver stacktrace:
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 
> (TID 1) (xx executor driver): java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
>   at org.apache.spark.scheduler.Task.run(Task.scala:139)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1490)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> As a user, it is very confusing -- it looks like there is a bug in Spark. We 
> have had many users report such problems, and though we can guide them to a 
> schema-data mismatch, there is no indication of what field might contain the 
> bad values, so a laborious data exploration process is required to find and 
> remedy it.
> We should provide a better error message in such cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40199) Spark throws NPE without useful message when NULL value appears in non-null schema

2022-08-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583916#comment-17583916
 ] 

Apache Spark commented on SPARK-40199:
--

User 'xkrogen' has created a pull request for this issue:
https://github.com/apache/spark/pull/37634

> Spark throws NPE without useful message when NULL value appears in non-null 
> schema
> --
>
> Key: SPARK-40199
> URL: https://issues.apache.org/jira/browse/SPARK-40199
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.2
>Reporter: Erik Krogen
>Priority: Major
>
> Currently in some cases, if Spark encounters a NULL value where the schema 
> indicates that the column/field should be non-null, it will throw a 
> {{NullPointerException}} with no message and thus no way to debug further. 
> This can happen, for example, if you have a UDF which is erroneously marked 
> as {{asNonNullable()}}, or if you read input data where the actual values 
> don't match the schema (which could happen e.g. with Avro if the reader 
> provides a schema declaring non-null although the data was written with null 
> values).
> As an example of how to reproduce:
> {code:scala}
> val badUDF = spark.udf.register[String, Int]("bad_udf", in => 
> null).asNonNullable()
> Seq(1, 2).toDF("c1").select(badUDF($"c1")).collect()
> {code}
> This throws an exception like:
> {code}
> Driver stacktrace:
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 
> (TID 1) (xx executor driver): java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
>   at org.apache.spark.scheduler.Task.run(Task.scala:139)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1490)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> As a user, it is very confusing -- it looks like there is a bug in Spark. We 
> have had many users report such problems, and though we can guide them to a 
> schema-data mismatch, there is no indication of what field might contain the 
> bad values, so a laborious data exploration process is required to find and 
> remedy it.
> We should provide a better error message in such cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40131) Support NumPy ndarray in built-in functions

2022-08-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583917#comment-17583917
 ] 

Apache Spark commented on SPARK-40131:
--

User 'xinrong-meng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37635

> Support NumPy ndarray in built-in functions
> ---
>
> Key: SPARK-40131
> URL: https://issues.apache.org/jira/browse/SPARK-40131
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Per [https://github.com/apache/spark/pull/37560#discussion_r948572473]
> we want to support NumPy ndarray in built-in functions
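A hypothetical example of what this would enable (the direct ndarray call is 
the desired behavior, not something guaranteed to work today):
{code:python}
import numpy as np
from pyspark.sql import functions as F

arr = np.array([1, 2, 3])

# Works today by converting explicitly, element by element:
spark.range(1).select(F.array(*[F.lit(int(x)) for x in arr]).alias("xs")).show()

# Desired: pass the ndarray directly to built-in functions, e.g. F.lit(arr).
{code}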



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40131) Support NumPy ndarray in built-in functions

2022-08-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40131:


Assignee: (was: Apache Spark)

> Support NumPy ndarray in built-in functions
> ---
>
> Key: SPARK-40131
> URL: https://issues.apache.org/jira/browse/SPARK-40131
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Per [https://github.com/apache/spark/pull/37560#discussion_r948572473]
> we want to support NumPy ndarray in built-in functions



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40200) unpersist cascades with Kryo, MEMORY_AND_DISK_SER and monotonically_increasing_id

2022-08-23 Thread Calvin Pietersen (Jira)
Calvin Pietersen created SPARK-40200:


 Summary: unpersist cascades with Kryo, MEMORY_AND_DISK_SER and 
monotonically_increasing_id
 Key: SPARK-40200
 URL: https://issues.apache.org/jira/browse/SPARK-40200
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.3.0
 Environment: spark-3.3.0
Reporter: Calvin Pietersen


Unpersist of a parent dataset which has a column from 
`monotonically_increasing_id` cascades to a child dataset when
 * joined on another dataset
 * kryo serialization is enabled
 * storage level is MEMORY_AND_DISK_SER
 * not all rows join

 

```
import org.apache.spark.sql.functions.monotonically_increasing_id
import org.apache.spark.storage.StorageLevel

case class a(value: String, id: Long)

val storageLevel = StorageLevel.MEMORY_AND_DISK_SER // cascades
//val storageLevel = StorageLevel.MEMORY_ONLY // doesn't cascade

val acc = sc.longAccumulator("acc")

val parent1DS = spark
.createDataset(Seq("a", "b", "c"))
.withColumn("id", monotonically_increasing_id)
.as[a]
.persist(storageLevel)

val parent2DS = spark
.createDataset(Seq(1, 2, 3)) // 0,1,2 doesn't cascade
.persist(storageLevel)

val childDS = parent1DS
.joinWith(parent2DS, parent1DS("id") === parent2DS("value"))
.map(i => {
acc.add(1)
i
}).persist(storageLevel)

childDS.count
parent1DS.unpersist
childDS.count

acc.value should be(2)
```
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40131) Support NumPy ndarray in built-in functions

2022-08-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40131:


Assignee: Apache Spark

> Support NumPy ndarray in built-in functions
> ---
>
> Key: SPARK-40131
> URL: https://issues.apache.org/jira/browse/SPARK-40131
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> Per [https://github.com/apache/spark/pull/37560#discussion_r948572473]
> we want to support NumPy ndarray in built-in functions



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40200) unpersist cascades with Kryo, MEMORY_AND_DISK_SER and monotonically_increasing_id

2022-08-23 Thread Calvin Pietersen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Calvin Pietersen updated SPARK-40200:
-
Description: 
Unpersist of a parent dataset which has a column from 
`monotonically_increasing_id` cascades to a child dataset when
 * joined on another dataset
 * kryo serialization is enabled
 * storage level is MEMORY_AND_DISK_SER
 * not all rows join

 

 

 

 

```
import org.apache.spark.sql.functions.monotonically_increasing_id
import org.apache.spark.storage.StorageLevel

case class a(value: String, id: Long)

val storageLevel = StorageLevel.MEMORY_AND_DISK_SER // cascades
//val storageLevel = StorageLevel.MEMORY_ONLY // doesn't cascade

val acc = sc.longAccumulator("acc")

val parent1DS = spark
.createDataset(Seq("a", "b", "c"))
.withColumn("id", monotonically_increasing_id)
.as[a]
.persist(storageLevel)

val parent2DS = spark
.createDataset(Seq(1, 2, 3)) // 0,1,2 doesn't cascade
.persist(storageLevel)

val childDS = parent1DS
.joinWith(parent2DS, parent1DS("id") === parent2DS("value"))
.map(i =>

{ acc.add(1) i }

).persist(storageLevel)

childDS.count
parent1DS.unpersist
childDS.count

acc.value should be(2)
```
 

  was:
Unpersist of a parent dataset which has a column from 
`monotonically_increasing_id` cascades to a child dataset when
 * joined on another dataset
 * kryo serialization is enabled
 * storage level is MEMORY_AND_DISK_SER
 * not all rows join

 

```
import org.apache.spark.sql.functions.monotonically_increasing_id
import org.apache.spark.storage.StorageLevel

case class a(value: String, id: Long)

val storageLevel = StorageLevel.MEMORY_AND_DISK_SER // cascades
//val storageLevel = StorageLevel.MEMORY_ONLY // doesn't cascade

val acc = sc.longAccumulator("acc")

val parent1DS = spark
.createDataset(Seq("a", "b", "c"))
.withColumn("id", monotonically_increasing_id)
.as[a]
.persist(storageLevel)

val parent2DS = spark
.createDataset(Seq(1, 2, 3)) // 0,1,2 doesn't cascade
.persist(storageLevel)

val childDS = parent1DS
.joinWith(parent2DS, parent1DS("id") === parent2DS("value"))
.map(i => {
acc.add(1)
i
}).persist(storageLevel)

childDS.count
parent1DS.unpersist
childDS.count

acc.value should be(2)
```
 


> unpersist cascades with Kryo, MEMORY_AND_DISK_SER and 
> monotonically_increasing_id
> -
>
> Key: SPARK-40200
> URL: https://issues.apache.org/jira/browse/SPARK-40200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
> Environment: spark-3.3.0
>Reporter: Calvin Pietersen
>Priority: Major
>
> Unpersist of a parent dataset which has a column from 
> `monotonically_increasing_id` cascades to a child dataset when
>  * joined on another dataset
>  * kryo serialization is enabled
>  * storage level is MEMORY_AND_DISK_SER
>  * not all rows join
>  
>  
>  
>  
> ```
> import org.apache.spark.sql.functions.monotonically_increasing_id
> import org.apache.spark.storage.StorageLevel
> case class a(value: String, id: Long)
> val storageLevel = StorageLevel.MEMORY_AND_DISK_SER // cascades
> //val storageLevel = StorageLevel.MEMORY_ONLY // doesn't cascade
> val acc = sc.longAccumulator("acc")
> val parent1DS = spark
> .createDataset(Seq("a", "b", "c"))
> .withColumn("id", monotonically_increasing_id)
> .as[a]
> .persist(storageLevel)
> val parent2DS = spark
> .createDataset(Seq(1, 2, 3)) // 0,1,2 doesn't cascade
> .persist(storageLevel)
> val childDS = parent1DS
> .joinWith(parent2DS, parent1DS("id") === parent2DS("value"))
> .map(i =>
> { acc.add(1) i }
> ).persist(storageLevel)
> childDS.count
> parent1DS.unpersist
> childDS.count
> acc.value should be(2)
> ```
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40200) unpersist cascades with Kryo, MEMORY_AND_DISK_SER and monotonically_increasing_id

2022-08-23 Thread Calvin Pietersen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Calvin Pietersen updated SPARK-40200:
-
Description: 
Unpersist of a parent dataset which has a column from 
`monotonically_increasing_id` cascades to a child dataset when
 * joined on another dataset
 * kryo serialization is enabled
 * storage level is MEMORY_AND_DISK_SER
 * not all rows join

 
{code:java}
import org.apache.spark.sql.functions.monotonically_increasing_id
import org.apache.spark.storage.StorageLevel

case class a(value: String, id: Long)

val storageLevel = StorageLevel.MEMORY_AND_DISK_SER // cascades
//val storageLevel = StorageLevel.MEMORY_ONLY // doesn't cascade

val acc = sc.longAccumulator("acc")
val parent1DS = spark.createDataset(Seq("a", "b", "c"))
 .withColumn("id", monotonically_increasing_id)
 .as[a]
 .persist(storageLevel)

val parent2DS = spark.createDataset(Seq(1, 2, 3)) // 0,1,2 doesn't cascade
 .persist(storageLevel)

val childDS = parent1DS.joinWith(parent2DS, parent1DS("id") === 
parent2DS("value"))
   .map(i =>{ 
  acc.add(1) 
  i
}).persist(storageLevel)

childDS.count
parent1DS.unpersist
childDS.count

acc.value should be(2) {code}

  was:
Unpersist of a parent dataset which has a column from 
`monotonically_increasing_id` cascades to a child dataset when
 * joined on another dataset
 * kryo serialization is enabled
 * storage level is MEMORY_AND_DISK_SER
 * not all rows join

 

 

 

 

```
import org.apache.spark.sql.functions.monotonically_increasing_id
import org.apache.spark.storage.StorageLevel

case class a(value: String, id: Long)

val storageLevel = StorageLevel.MEMORY_AND_DISK_SER // cascades
//val storageLevel = StorageLevel.MEMORY_ONLY // doesn't cascade

val acc = sc.longAccumulator("acc")

val parent1DS = spark
.createDataset(Seq("a", "b", "c"))
.withColumn("id", monotonically_increasing_id)
.as[a]
.persist(storageLevel)

val parent2DS = spark
.createDataset(Seq(1, 2, 3)) // 0,1,2 doesn't cascade
.persist(storageLevel)

val childDS = parent1DS
.joinWith(parent2DS, parent1DS("id") === parent2DS("value"))
.map(i =>

{ acc.add(1) i }

).persist(storageLevel)

childDS.count
parent1DS.unpersist
childDS.count

acc.value should be(2)
```
 


> unpersist cascades with Kryo, MEMORY_AND_DISK_SER and 
> monotonically_increasing_id
> -
>
> Key: SPARK-40200
> URL: https://issues.apache.org/jira/browse/SPARK-40200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
> Environment: spark-3.3.0
>Reporter: Calvin Pietersen
>Priority: Major
>
> Unpersist of a parent dataset which has a column from 
> `monotonically_increasing_id` cascades to a child dataset when
>  * joined on another dataset
>  * kryo serialization is enabled
>  * storage level is MEMORY_AND_DISK_SER
>  * not all rows join
>  
> {code:java}
> import org.apache.spark.sql.functions.monotonically_increasing_id
> import org.apache.spark.storage.StorageLevel
> case class a(value: String, id: Long)
> val storageLevel = StorageLevel.MEMORY_AND_DISK_SER // cascades
> //val storageLevel = StorageLevel.MEMORY_ONLY // doesn't cascade
> val acc = sc.longAccumulator("acc")
> val parent1DS = spark.createDataset(Seq("a", "b", "c"))
>  .withColumn("id", monotonically_increasing_id)
>  .as[a]
>  .persist(storageLevel)
> val parent2DS = spark.createDataset(Seq(1, 2, 3)) // 0,1,2 doesn't cascade
>  .persist(storageLevel)
> val childDS = parent1DS.joinWith(parent2DS, parent1DS("id") === 
> parent2DS("value"))
>.map(i =>{ 
>   acc.add(1) 
>   i
> }).persist(storageLevel)
> childDS.count
> parent1DS.unpersist
> childDS.count
> acc.value should be(2) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40200) unpersist cascades with Kryo, MEMORY_AND_DISK_SER and monotonically_increasing_id

2022-08-23 Thread Calvin Pietersen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Calvin Pietersen updated SPARK-40200:
-
Description: 
Unpersist of a parent dataset which has a column from 
_*monotonically_increasing_id*_ cascades to a child dataset when
 * joined on another dataset
 * kryo serialization is enabled
 * storage level is MEMORY_AND_DISK_SER
 * not all rows join

 
{code:java}
import org.apache.spark.sql.functions.monotonically_increasing_id
import org.apache.spark.storage.StorageLevel

case class a(value: String, id: Long)

val storageLevel = StorageLevel.MEMORY_AND_DISK_SER // cascades
//val storageLevel = StorageLevel.MEMORY_ONLY // doesn't cascade

val acc = sc.longAccumulator("acc")
val parent1DS = spark.createDataset(Seq("a", "b", "c"))
 .withColumn("id", monotonically_increasing_id)
 .as[a]
 .persist(storageLevel)

val parent2DS = spark.createDataset(Seq(1, 2, 3)) // 0,1,2 doesn't cascade
 .persist(storageLevel)

val childDS = parent1DS.joinWith(parent2DS, parent1DS("id") === 
parent2DS("value"))
   .map(i =>{ 
  acc.add(1) 
  i
}).persist(storageLevel)

childDS.count
parent1DS.unpersist
childDS.count

acc.value should be(2) {code}

  was:
Unpersist of a parent dataset which has a column from 
`monotonically_increasing_id` cascades to a child dataset when
 * joined on another dataset
 * kryo serialization is enabled
 * storage level is MEMORY_AND_DISK_SER
 * not all rows join

 
{code:java}
import org.apache.spark.sql.functions.monotonically_increasing_id
import org.apache.spark.storage.StorageLevel

case class a(value: String, id: Long)

val storageLevel = StorageLevel.MEMORY_AND_DISK_SER // cascades
//val storageLevel = StorageLevel.MEMORY_ONLY // doesn't cascade

val acc = sc.longAccumulator("acc")
val parent1DS = spark.createDataset(Seq("a", "b", "c"))
 .withColumn("id", monotonically_increasing_id)
 .as[a]
 .persist(storageLevel)

val parent2DS = spark.createDataset(Seq(1, 2, 3)) // 0,1,2 doesn't cascade
 .persist(storageLevel)

val childDS = parent1DS.joinWith(parent2DS, parent1DS("id") === 
parent2DS("value"))
   .map(i =>{ 
  acc.add(1) 
  i
}).persist(storageLevel)

childDS.count
parent1DS.unpersist
childDS.count

acc.value should be(2) {code}


> unpersist cascades with Kryo, MEMORY_AND_DISK_SER and 
> monotonically_increasing_id
> -
>
> Key: SPARK-40200
> URL: https://issues.apache.org/jira/browse/SPARK-40200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
> Environment: spark-3.3.0
>Reporter: Calvin Pietersen
>Priority: Major
>
> Unpersist of a parent dataset which has a column from 
> _*monotonically_increasing_id*_ cascades to a child dataset when
>  * joined on another dataset
>  * kryo serialization is enabled
>  * storage level is MEMORY_AND_DISK_SER
>  * not all rows join
>  
> {code:java}
> import org.apache.spark.sql.functions.monotonically_increasing_id
> import org.apache.spark.storage.StorageLevel
> case class a(value: String, id: Long)
> val storageLevel = StorageLevel.MEMORY_AND_DISK_SER // cascades
> //val storageLevel = StorageLevel.MEMORY_ONLY // doesn't cascade
> val acc = sc.longAccumulator("acc")
> val parent1DS = spark.createDataset(Seq("a", "b", "c"))
>  .withColumn("id", monotonically_increasing_id)
>  .as[a]
>  .persist(storageLevel)
> val parent2DS = spark.createDataset(Seq(1, 2, 3)) // 0,1,2 doesn't cascade
>  .persist(storageLevel)
> val childDS = parent1DS.joinWith(parent2DS, parent1DS("id") === 
> parent2DS("value"))
>.map(i =>{ 
>   acc.add(1) 
>   i
> }).persist(storageLevel)
> childDS.count
> parent1DS.unpersist
> childDS.count
> acc.value should be(2) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40200) unpersist cascades with Kryo, MEMORY_AND_DISK_SER and monotonically_increasing_id

2022-08-23 Thread Calvin Pietersen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Calvin Pietersen updated SPARK-40200:
-
Affects Version/s: 3.2.1

> unpersist cascades with Kryo, MEMORY_AND_DISK_SER and 
> monotonically_increasing_id
> -
>
> Key: SPARK-40200
> URL: https://issues.apache.org/jira/browse/SPARK-40200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.1, 3.3.0
> Environment: spark-3.3.0
>Reporter: Calvin Pietersen
>Priority: Major
>
> Unpersist of a parent dataset which has a column from 
> _*monotonically_increasing_id*_ cascades to a child dataset when
>  * joined on another dataset
>  * kryo serialization is enabled
>  * storage level is MEMORY_AND_DISK_SER
>  * not all rows join
>  
> {code:java}
> import org.apache.spark.sql.functions.monotonically_increasing_id
> import org.apache.spark.storage.StorageLevel
> case class a(value: String, id: Long)
> val storageLevel = StorageLevel.MEMORY_AND_DISK_SER // cascades
> //val storageLevel = StorageLevel.MEMORY_ONLY // doesn't cascade
> val acc = sc.longAccumulator("acc")
> val parent1DS = spark.createDataset(Seq("a", "b", "c"))
>  .withColumn("id", monotonically_increasing_id)
>  .as[a]
>  .persist(storageLevel)
> val parent2DS = spark.createDataset(Seq(1, 2, 3)) // 0,1,2 doesn't cascade
>  .persist(storageLevel)
> val childDS = parent1DS.joinWith(parent2DS, parent1DS("id") === 
> parent2DS("value"))
>.map(i =>{ 
>   acc.add(1) 
>   i
> }).persist(storageLevel)
> childDS.count
> parent1DS.unpersist
> childDS.count
> acc.value should be(2) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40152) Codegen compilation error when using split_part

2022-08-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583923#comment-17583923
 ] 

Apache Spark commented on SPARK-40152:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/37637

> Codegen compilation error when using split_part
> ---
>
> Key: SPARK-40152
> URL: https://issues.apache.org/jira/browse/SPARK-40152
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
>
> The following query throws an error:
> {noformat}
> create or replace temp view v1 as
> select * from values
> ('11.12.13', '.', 3)
> as v1(col1, col2, col3);
> cache table v1;
> SELECT split_part(col1, col2, col3)
> from v1;
> {noformat}
> The error is:
> {noformat}
> 22/08/19 14:25:14 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 42, Column 1: Expression "project_isNull_0 = false" is not a type
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 42, Column 1: Expression "project_isNull_0 = false" is not a type
>   at 
> org.codehaus.janino.Java$Atom.toTypeOrCompileException(Java.java:3934)
>   at org.codehaus.janino.Parser.parseBlockStatement(Parser.java:1887)
>   at org.codehaus.janino.Parser.parseBlockStatements(Parser.java:1811)
>   at org.codehaus.janino.Parser.parseBlock(Parser.java:1792)
>   at 
> {noformat}
> In the end, {{split_part}} does successfully execute, although in interpreted 
> mode.
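
The fallback is easy to see from a spark-shell session; the sketch below only assumes the view from the repro above, and the codegen toggle is purely a diagnostic (spark.sql.codegen.wholeStage), not a fix:

{code:java}
// Recreate the repro, then compare with whole-stage codegen disabled.
spark.sql("""create or replace temp view v1 as
             select * from values ('11.12.13', '.', 3) as v1(col1, col2, col3)""")
spark.sql("cache table v1")

// Default: the CodeGenerator error is logged, but the query still returns
// "13" through the interpreted fallback.
spark.sql("SELECT split_part(col1, col2, col3) FROM v1").show()

// Diagnostic only: with codegen off the failing compilation is skipped entirely.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
spark.sql("SELECT split_part(col1, col2, col3) FROM v1").show()
{code}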



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40152) Codegen compilation error when using split_part

2022-08-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583924#comment-17583924
 ] 

Apache Spark commented on SPARK-40152:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/37637

> Codegen compilation error when using split_part
> ---
>
> Key: SPARK-40152
> URL: https://issues.apache.org/jira/browse/SPARK-40152
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
>
> The following query throws an error:
> {noformat}
> create or replace temp view v1 as
> select * from values
> ('11.12.13', '.', 3)
> as v1(col1, col2, col3);
> cache table v1;
> SELECT split_part(col1, col2, col3)
> from v1;
> {noformat}
> The error is:
> {noformat}
> 22/08/19 14:25:14 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 42, Column 1: Expression "project_isNull_0 = false" is not a type
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 42, Column 1: Expression "project_isNull_0 = false" is not a type
>   at 
> org.codehaus.janino.Java$Atom.toTypeOrCompileException(Java.java:3934)
>   at org.codehaus.janino.Parser.parseBlockStatement(Parser.java:1887)
>   at org.codehaus.janino.Parser.parseBlockStatements(Parser.java:1811)
>   at org.codehaus.janino.Parser.parseBlock(Parser.java:1792)
>   at 
> {noformat}
> In the end, {{split_part}} does successfully execute, although in interpreted 
> mode.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40156) url_decode() exposes a Java error

2022-08-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583925#comment-17583925
 ] 

Apache Spark commented on SPARK-40156:
--

User 'ming95' has created a pull request for this issue:
https://github.com/apache/spark/pull/37636

> url_decode() exposes a Java error
> -
>
> Key: SPARK-40156
> URL: https://issues.apache.org/jira/browse/SPARK-40156
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Major
>
> Given a badly encoded string, Spark returns a raw Java error.
> It should return an ERROR_CLASS instead.
> spark-sql> SELECT url_decode('http%3A%2F%2spark.apache.org');
> 22/08/20 17:17:20 ERROR SparkSQLDriver: Failed in [SELECT 
> url_decode('http%3A%2F%2spark.apache.org')]
> java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in 
> escape (%) pattern - Error at index 1 in: "2s"
>  at java.base/java.net.URLDecoder.decode(URLDecoder.java:232)
>  at java.base/java.net.URLDecoder.decode(URLDecoder.java:142)
>  at 
> org.apache.spark.sql.catalyst.expressions.UrlCodec$.decode(urlExpressions.scala:113)
>  at 
> org.apache.spark.sql.catalyst.expressions.UrlCodec.decode(urlExpressions.scala)
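
The shape of the fix being asked for is roughly the pattern below: catch the JDK exception at the decode site and rethrow it as a Spark error with an error class. This is a sketch only; the error class name CANNOT_DECODE_URL is an assumption, not an existing entry in error-classes.json, and the SparkIllegalArgumentException constructor is the one quoted under SPARK-38752 further down:

{code:java}
import java.net.URLDecoder
import org.apache.spark.SparkIllegalArgumentException

def decodeUrl(url: String, charset: String): String =
  try {
    URLDecoder.decode(url, charset)
  } catch {
    case e: IllegalArgumentException =>
      // Surface an error class instead of the raw URLDecoder message.
      throw new SparkIllegalArgumentException(
        errorClass = "CANNOT_DECODE_URL",
        messageParameters = Array(url, e.getMessage))
  }
{code}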



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40156) url_decode() exposes a Java error

2022-08-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40156:


Assignee: Apache Spark

> url_decode() exposes a Java error
> -
>
> Key: SPARK-40156
> URL: https://issues.apache.org/jira/browse/SPARK-40156
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Assignee: Apache Spark
>Priority: Major
>
> Given a badly encoded string, Spark returns a raw Java error.
> It should return an ERROR_CLASS instead.
> spark-sql> SELECT url_decode('http%3A%2F%2spark.apache.org');
> 22/08/20 17:17:20 ERROR SparkSQLDriver: Failed in [SELECT 
> url_decode('http%3A%2F%2spark.apache.org')]
> java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in 
> escape (%) pattern - Error at index 1 in: "2s"
>  at java.base/java.net.URLDecoder.decode(URLDecoder.java:232)
>  at java.base/java.net.URLDecoder.decode(URLDecoder.java:142)
>  at 
> org.apache.spark.sql.catalyst.expressions.UrlCodec$.decode(urlExpressions.scala:113)
>  at 
> org.apache.spark.sql.catalyst.expressions.UrlCodec.decode(urlExpressions.scala)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40156) url_decode() exposes a Java error

2022-08-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40156:


Assignee: (was: Apache Spark)

> url_decode() exposes a Java error
> -
>
> Key: SPARK-40156
> URL: https://issues.apache.org/jira/browse/SPARK-40156
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Major
>
> Given a badly encoded string, Spark returns a raw Java error.
> It should return an ERROR_CLASS instead.
> spark-sql> SELECT url_decode('http%3A%2F%2spark.apache.org');
> 22/08/20 17:17:20 ERROR SparkSQLDriver: Failed in [SELECT 
> url_decode('http%3A%2F%2spark.apache.org')]
> java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in 
> escape (%) pattern - Error at index 1 in: "2s"
>  at java.base/java.net.URLDecoder.decode(URLDecoder.java:232)
>  at java.base/java.net.URLDecoder.decode(URLDecoder.java:142)
>  at 
> org.apache.spark.sql.catalyst.expressions.UrlCodec$.decode(urlExpressions.scala:113)
>  at 
> org.apache.spark.sql.catalyst.expressions.UrlCodec.decode(urlExpressions.scala)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33573) Server side metrics related to push-based shuffle

2022-08-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583926#comment-17583926
 ] 

Apache Spark commented on SPARK-33573:
--

User 'rmcyang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37638

> Server side metrics related to push-based shuffle
> -
>
> Key: SPARK-33573
> URL: https://issues.apache.org/jira/browse/SPARK-33573
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
>
> Server-side shuffle metrics for push-based shuffle.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33573) Server side metrics related to push-based shuffle

2022-08-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583927#comment-17583927
 ] 

Apache Spark commented on SPARK-33573:
--

User 'rmcyang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37638

> Server side metrics related to push-based shuffle
> -
>
> Key: SPARK-33573
> URL: https://issues.apache.org/jira/browse/SPARK-33573
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
>
> Server-side shuffle metrics for push-based shuffle.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40165) Update test plugins to latest versions

2022-08-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583933#comment-17583933
 ] 

Apache Spark commented on SPARK-40165:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37639

> Update test plugins to latest versions
> --
>
> Key: SPARK-40165
> URL: https://issues.apache.org/jira/browse/SPARK-40165
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Trivial
>
> Include:
>  * 1.scalacheck (from 1.15.4 to 1.16.0)
>  * 2.maven-surefire-plugin (from 3.0.0-M5 to 3.0.0-M7)
>  * 3.maven-dependency-plugin (from 3.1.1 to 3.3.0)
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40165) Update test plugins to latest versions

2022-08-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583934#comment-17583934
 ] 

Apache Spark commented on SPARK-40165:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37639

> Update test plugins to latest versions
> --
>
> Key: SPARK-40165
> URL: https://issues.apache.org/jira/browse/SPARK-40165
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Trivial
>
> Include:
>  * 1.scalacheck (from 1.15.4 to 1.16.0)
>  * 2.maven-surefire-plugin (from 3.0.0-M5 to 3.0.0-M7)
>  * 3.maven-dependency-plugin (from 3.1.1 to 3.3.0)
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38752) Test the error class: UNSUPPORTED_DATATYPE

2022-08-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38752:


Assignee: Apache Spark

> Test the error class: UNSUPPORTED_DATATYPE
> --
>
> Key: SPARK-38752
> URL: https://issues.apache.org/jira/browse/SPARK-38752
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>
> Add a test for the error classes *UNSUPPORTED_DATATYPE* to 
> QueryExecutionErrorsSuite. The test should cover the exception thrown in 
> QueryExecutionErrors:
> {code:scala}
>   def dataTypeUnsupportedError(dataType: String, failure: String): Throwable 
> = {
> new SparkIllegalArgumentException(errorClass = "UNSUPPORTED_DATATYPE",
>   messageParameters = Array(dataType + failure))
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must have a check of:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38752) Test the error class: UNSUPPORTED_DATATYPE

2022-08-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583943#comment-17583943
 ] 

Apache Spark commented on SPARK-38752:
--

User 'lvshaokang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37640

> Test the error class: UNSUPPORTED_DATATYPE
> --
>
> Key: SPARK-38752
> URL: https://issues.apache.org/jira/browse/SPARK-38752
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Add a test for the error classes *UNSUPPORTED_DATATYPE* to 
> QueryExecutionErrorsSuite. The test should cover the exception thrown in 
> QueryExecutionErrors:
> {code:scala}
>   def dataTypeUnsupportedError(dataType: String, failure: String): Throwable 
> = {
> new SparkIllegalArgumentException(errorClass = "UNSUPPORTED_DATATYPE",
>   messageParameters = Array(dataType + failure))
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must have a check of:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38752) Test the error class: UNSUPPORTED_DATATYPE

2022-08-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583944#comment-17583944
 ] 

Apache Spark commented on SPARK-38752:
--

User 'lvshaokang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37640

> Test the error class: UNSUPPORTED_DATATYPE
> --
>
> Key: SPARK-38752
> URL: https://issues.apache.org/jira/browse/SPARK-38752
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Add a test for the error classes *UNSUPPORTED_DATATYPE* to 
> QueryExecutionErrorsSuite. The test should cover the exception thrown in 
> QueryExecutionErrors:
> {code:scala}
>   def dataTypeUnsupportedError(dataType: String, failure: String): Throwable 
> = {
> new SparkIllegalArgumentException(errorClass = "UNSUPPORTED_DATATYPE",
>   messageParameters = Array(dataType + failure))
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must have a check of:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38752) Test the error class: UNSUPPORTED_DATATYPE

2022-08-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38752:


Assignee: (was: Apache Spark)

> Test the error class: UNSUPPORTED_DATATYPE
> --
>
> Key: SPARK-38752
> URL: https://issues.apache.org/jira/browse/SPARK-38752
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Add a test for the error classes *UNSUPPORTED_DATATYPE* to 
> QueryExecutionErrorsSuite. The test should cover the exception thrown in 
> QueryExecutionErrors:
> {code:scala}
>   def dataTypeUnsupportedError(dataType: String, failure: String): Throwable 
> = {
> new SparkIllegalArgumentException(errorClass = "UNSUPPORTED_DATATYPE",
>   messageParameters = Array(dataType + failure))
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must have a check of:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40201) Improve v1 write test coverage

2022-08-23 Thread XiDuo You (Jira)
XiDuo You created SPARK-40201:
-

 Summary: Improve v1 write test coverage
 Key: SPARK-40201
 URL: https://issues.apache.org/jira/browse/SPARK-40201
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: XiDuo You


Make the v1 write tests work across all SQL tests



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40201) Improve v1 write test coverage

2022-08-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583972#comment-17583972
 ] 

Apache Spark commented on SPARK-40201:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/37641

> Improve v1 write test coverage
> --
>
> Key: SPARK-40201
> URL: https://issues.apache.org/jira/browse/SPARK-40201
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>
> Make the v1 write tests work across all SQL tests



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40201) Improve v1 write test coverage

2022-08-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40201:


Assignee: Apache Spark

> Improve v1 write test coverage
> --
>
> Key: SPARK-40201
> URL: https://issues.apache.org/jira/browse/SPARK-40201
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: Apache Spark
>Priority: Major
>
> Make the v1 write tests work across all SQL tests



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40201) Improve v1 write test coverage

2022-08-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40201:


Assignee: (was: Apache Spark)

> Improve v1 write test coverage
> --
>
> Key: SPARK-40201
> URL: https://issues.apache.org/jira/browse/SPARK-40201
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>
> Make the v1 write tests work across all SQL tests



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40198) Enable spark.storage.decommission.(rdd|shuffle)Blocks.enabled by default

2022-08-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-40198.
---
Fix Version/s: 3.4.0
 Assignee: Dongjoon Hyun
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/37633

> Enable spark.storage.decommission.(rdd|shuffle)Blocks.enabled by default
> 
>
> Key: SPARK-40198
> URL: https://issues.apache.org/jira/browse/SPARK-40198
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40202) Allow a map in SparkSession.config in PySpark

2022-08-23 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-40202:


 Summary: Allow a map in SparkSession.config in PySpark
 Key: SPARK-40202
 URL: https://issues.apache.org/jira/browse/SPARK-40202
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon


SPARK-40163 added a new signature to SparkSession.config that accepts a map of
options. We should have the same one in PySpark too.
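
For reference, the Scala-side usage this would mirror looks roughly like the sketch below; the exact SPARK-40163 signature (a Builder.config overload taking a whole Map) is an assumption here rather than a quote of the merged API:

{code:scala}
import org.apache.spark.sql.SparkSession

// Assumed Map-based overload on SparkSession.Builder; PySpark would accept a dict
// in the same position on SparkSession.builder.config(...).
val spark = SparkSession.builder()
  .master("local[*]")
  .config(Map(
    "spark.sql.shuffle.partitions" -> "4",
    "spark.ui.enabled" -> "false"))
  .getOrCreate()
{code}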



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40202) Allow a dictionary in SparkSession.config in PySpark

2022-08-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40202:
-
Summary: Allow a dictionary in SparkSession.config in PySpark  (was: Allow 
a map in SparkSession.config in PySpark)

> Allow a dictionary in SparkSession.config in PySpark
> 
>
> Key: SPARK-40202
> URL: https://issues.apache.org/jira/browse/SPARK-40202
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> SPARK-40163 added a new signature to SparkSession.config that accepts a map of
> options. We should have the same one in PySpark too.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


