[jira] [Assigned] (SPARK-42396) Upgrade Apache Kafka to 3.4.0
[ https://issues.apache.org/jira/browse/SPARK-42396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42396: - Assignee: Bjørn Jørgensen > Upgrade Apache Kafka to 3.4.0 > - > > Key: SPARK-42396 > URL: https://issues.apache.org/jira/browse/SPARK-42396 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.5.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > > [CVE-2023-25194|https://www.cve.org/CVERecord?id=CVE-2023-25194] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42396) Upgrade Apache Kafka to 3.4.0
[ https://issues.apache.org/jira/browse/SPARK-42396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42396. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 39969 [https://github.com/apache/spark/pull/39969] > Upgrade Apache Kafka to 3.4.0 > - > > Key: SPARK-42396 > URL: https://issues.apache.org/jira/browse/SPARK-42396 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.5.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Fix For: 3.5.0 > > > [CVE-2023-25194|https://www.cve.org/CVERecord?id=CVE-2023-25194]
[jira] [Updated] (SPARK-42339) Improve Kryo Serializer Support
[ https://issues.apache.org/jira/browse/SPARK-42339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-42339: -- Summary: Improve Kryo Serializer Support (was: Improve Kryo Serialize Support) > Improve Kryo Serializer Support > --- > > Key: SPARK-42339 > URL: https://issues.apache.org/jira/browse/SPARK-42339 > Project: Spark > Issue Type: Umbrella > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: releasenotes > Fix For: 3.4.0 > >
[jira] [Updated] (SPARK-42408) Register DoubleType to KryoSerializer
[ https://issues.apache.org/jira/browse/SPARK-42408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-42408: -- Parent: SPARK-42339 Issue Type: Sub-task (was: Improvement) > Register DoubleType to KryoSerializer > - > > Key: SPARK-42408 > URL: https://issues.apache.org/jira/browse/SPARK-42408 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Minor > Fix For: 3.4.0 > >
[jira] [Assigned] (SPARK-42408) Register DoubleType to KryoSerializer
[ https://issues.apache.org/jira/browse/SPARK-42408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42408: - Assignee: William Hyun > Register DoubleType to KryoSerializer > - > > Key: SPARK-42408 > URL: https://issues.apache.org/jira/browse/SPARK-42408 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Minor > Fix For: 3.4.0 > >
[jira] [Resolved] (SPARK-42408) Register DoubleType to KryoSerializer
[ https://issues.apache.org/jira/browse/SPARK-42408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42408. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39978 [https://github.com/apache/spark/pull/39978] > Register DoubleType to KryoSerializer > - > > Key: SPARK-42408 > URL: https://issues.apache.org/jira/browse/SPARK-42408 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: William Hyun >Priority: Minor > Fix For: 3.4.0 > >
[jira] [Assigned] (SPARK-42327) Assign name to _LEGACY_ERROR_TEMP_2177
[ https://issues.apache.org/jira/browse/SPARK-42327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42327: Assignee: Apache Spark > Assign name to _LEGACY_ERROR_TEMP_2177 > - > > Key: SPARK-42327 > URL: https://issues.apache.org/jira/browse/SPARK-42327 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-42327) Assign name to _LEGACY_ERROR_TEMP_2177
[ https://issues.apache.org/jira/browse/SPARK-42327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687519#comment-17687519 ] Apache Spark commented on SPARK-42327: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/39980 > Assign name to _LEGACY_ERROR_TEMP_2177 > - > > Key: SPARK-42327 > URL: https://issues.apache.org/jira/browse/SPARK-42327 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major >
[jira] [Assigned] (SPARK-42327) Assign name to _LEGACY_ERROR_TEMP_2177
[ https://issues.apache.org/jira/browse/SPARK-42327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42327: Assignee: (was: Apache Spark) > Assign name to _LEGACY_ERROR_TEMP_2177 > - > > Key: SPARK-42327 > URL: https://issues.apache.org/jira/browse/SPARK-42327 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major >
[jira] [Assigned] (SPARK-42326) Assign name to _LEGACY_ERROR_TEMP_2099
[ https://issues.apache.org/jira/browse/SPARK-42326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42326: Assignee: (was: Apache Spark) > Assign name to _LEGACY_ERROR_TEMP_2099 > -- > > Key: SPARK-42326 > URL: https://issues.apache.org/jira/browse/SPARK-42326 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major >
[jira] [Assigned] (SPARK-42326) Assign name to _LEGACY_ERROR_TEMP_2099
[ https://issues.apache.org/jira/browse/SPARK-42326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42326: Assignee: Apache Spark > Assign name to _LEGACY_ERROR_TEMP_2099 > -- > > Key: SPARK-42326 > URL: https://issues.apache.org/jira/browse/SPARK-42326 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-42326) Assign name to _LEGACY_ERROR_TEMP_2099
[ https://issues.apache.org/jira/browse/SPARK-42326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687517#comment-17687517 ] Apache Spark commented on SPARK-42326: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/39979 > Assign name to _LEGACY_ERROR_TEMP_2099 > -- > > Key: SPARK-42326 > URL: https://issues.apache.org/jira/browse/SPARK-42326 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major >
[jira] [Assigned] (SPARK-42408) Register DoubleType to KryoSerializer
[ https://issues.apache.org/jira/browse/SPARK-42408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42408: Assignee: (was: Apache Spark) > Register DoubleType to KryoSerializer > - > > Key: SPARK-42408 > URL: https://issues.apache.org/jira/browse/SPARK-42408 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: William Hyun >Priority: Minor >
[jira] [Assigned] (SPARK-42408) Register DoubleType to KryoSerializer
[ https://issues.apache.org/jira/browse/SPARK-42408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42408: Assignee: Apache Spark > Register DoubleType to KryoSerializer > - > > Key: SPARK-42408 > URL: https://issues.apache.org/jira/browse/SPARK-42408 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: William Hyun >Assignee: Apache Spark >Priority: Minor >
[jira] [Commented] (SPARK-42408) Register DoubleType to KryoSerializer
[ https://issues.apache.org/jira/browse/SPARK-42408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687516#comment-17687516 ] Apache Spark commented on SPARK-42408: -- User 'williamhyun' has created a pull request for this issue: https://github.com/apache/spark/pull/39978 > Register DoubleType to KryoSerializer > - > > Key: SPARK-42408 > URL: https://issues.apache.org/jira/browse/SPARK-42408 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: William Hyun >Priority: Minor >
[jira] [Resolved] (SPARK-42377) Test Framework for Connect Scala Client
[ https://issues.apache.org/jira/browse/SPARK-42377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-42377. --- Fix Version/s: 3.4.0 Assignee: Herman van Hövell Resolution: Fixed > Test Framework for Connect Scala Client > --- > > Key: SPARK-42377 > URL: https://issues.apache.org/jira/browse/SPARK-42377 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.4.0 > >
[jira] [Created] (SPARK-42408) Register DoubleType to KryoSerializer
William Hyun created SPARK-42408: Summary: Register DoubleType to KryoSerializer Key: SPARK-42408 URL: https://issues.apache.org/jira/browse/SPARK-42408 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Affects Versions: 3.4.0 Reporter: William Hyun
[jira] [Resolved] (SPARK-42390) Upgrade buf from 1.13.1 to 1.14.0
[ https://issues.apache.org/jira/browse/SPARK-42390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42390. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39959 [https://github.com/apache/spark/pull/39959] > Upgrade buf from 1.13.1 to 1.14.0 > - > > Key: SPARK-42390 > URL: https://issues.apache.org/jira/browse/SPARK-42390 > Project: Spark > Issue Type: Improvement > Components: Build, Connect >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.4.0 > >
[jira] [Assigned] (SPARK-42390) Upgrade buf from 1.13.1 to 1.14.0
[ https://issues.apache.org/jira/browse/SPARK-42390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42390: Assignee: BingKun Pan > Upgrade buf from 1.13.1 to 1.14.0 > - > > Key: SPARK-42390 > URL: https://issues.apache.org/jira/browse/SPARK-42390 > Project: Spark > Issue Type: Improvement > Components: Build, Connect >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor >
[jira] [Resolved] (SPARK-42310) Assign name to _LEGACY_ERROR_TEMP_1289
[ https://issues.apache.org/jira/browse/SPARK-42310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42310. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39946 [https://github.com/apache/spark/pull/39946] > Assign name to _LEGACY_ERROR_TEMP_1289 > -- > > Key: SPARK-42310 > URL: https://issues.apache.org/jira/browse/SPARK-42310 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > >
[jira] [Assigned] (SPARK-42310) Assign name to _LEGACY_ERROR_TEMP_1289
[ https://issues.apache.org/jira/browse/SPARK-42310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42310: Assignee: Haejoon Lee > Assign name to _LEGACY_ERROR_TEMP_1289 > -- > > Key: SPARK-42310 > URL: https://issues.apache.org/jira/browse/SPARK-42310 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major >
[jira] [Assigned] (SPARK-42402) Support parameterized SQL by sql()
[ https://issues.apache.org/jira/browse/SPARK-42402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42402: Assignee: Takuya Ueshin > Support parameterized SQL by sql() > -- > > Key: SPARK-42402 > URL: https://issues.apache.org/jira/browse/SPARK-42402 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major >
[jira] [Resolved] (SPARK-42402) Support parameterized SQL by sql()
[ https://issues.apache.org/jira/browse/SPARK-42402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42402. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39971 [https://github.com/apache/spark/pull/39971] > Support parameterized SQL by sql() > -- > > Key: SPARK-42402 > URL: https://issues.apache.org/jira/browse/SPARK-42402 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.4.0 > >
[jira] [Updated] (SPARK-42407) `with as` executed again
[ https://issues.apache.org/jira/browse/SPARK-42407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yiku123 updated SPARK-42407: Summary: `with as` executed again (was: with as ) > `with as` executed again > > > Key: SPARK-42407 > URL: https://issues.apache.org/jira/browse/SPARK-42407 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.3 >Reporter: yiku123 >Priority: Critical > > When a 'with as' subquery is referenced multiple times, it is executed again on > each reference rather than its result being saved, resulting in low efficiency. > Will you consider improving this behavior of 'with as'? >
[jira] [Created] (SPARK-42407) with as
yiku123 created SPARK-42407: --- Summary: with as Key: SPARK-42407 URL: https://issues.apache.org/jira/browse/SPARK-42407 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.3 Reporter: yiku123 When a 'with as' subquery is referenced multiple times, it is executed again on each reference rather than its result being saved, resulting in low efficiency. Will you consider improving this behavior of 'with as'?
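The SPARK-42407 report concerns a CTE ('with as') referenced more than once in a query. The sketch below shows the query shape in question; it uses the stdlib sqlite3 module purely so it runs anywhere, and the table and column names are made up for illustration. In Spark SQL, each of the two references to the CTE re-runs its body; the usual workaround is to compute the intermediate result as a DataFrame and call .cache() on it before reusing it.

```python
import sqlite3

# Toy illustration of a CTE referenced twice (hypothetical table/columns).
# In Spark SQL, both references to `expensive` would recompute its body.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, 10), (1, 20), (2, 5)])

query = """
WITH expensive AS (
    SELECT user_id, SUM(amount) AS total
    FROM events
    GROUP BY user_id
)
SELECT a.user_id, a.total
FROM expensive a
JOIN expensive b ON a.total >= b.total  -- second reference: re-executed in Spark
GROUP BY a.user_id, a.total
"""
rows = conn.execute(query).fetchall()
```

Note that SQLite's planner is not Spark's; the snippet only demonstrates the query shape, not Spark's recomputation behavior itself.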
[jira] [Commented] (SPARK-42406) [PROTOBUF] Recursive field handling is incompatible with delta
[ https://issues.apache.org/jira/browse/SPARK-42406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687489#comment-17687489 ] Raghu Angadi commented on SPARK-42406: -- cc: [~sanysand...@gmail.com] PTAL.
> [PROTOBUF] Recursive field handling is incompatible with delta
> -- > > Key: SPARK-42406 > URL: https://issues.apache.org/jira/browse/SPARK-42406 > Project: Spark > Issue Type: Bug > Components: Protobuf >Affects Versions: 3.4.0 >Reporter: Raghu Angadi >Priority: Major > Fix For: 3.4.1 > >
> The Protobuf deserializer (the `from_protobuf()` function) optionally supports
> recursive fields by limiting the depth to a certain level. See the example below.
> It assigns a 'NullType' to such a field when the allowed depth is reached.
> This causes a few issues. E.g. a repeated field, as in the following example,
> results in an Array field with 'NullType'. Delta does not support null type in
> a complex type, and `Array[NullType]` is not really useful anyway.
> Proposed fix: drop the recursive field when the limit is reached rather than
> using a NullType. The example below makes it clear.
> Consider a recursive Protobuf:
> {code}
> message TreeNode {
>   string value = 1;
>   repeated TreeNode children = 2;
> }
> {code}
> Allow a depth of 2:
> {code:python}
> df.select(
>     from_protobuf('proto',
>                   messageName = 'TreeNode',
>                   options = { ... "recursive.fields.max.depth" : "2" })
> ).printSchema()
> {code}
> The schema looks like this:
> {noformat}
> root
>  |-- from_protobuf(proto): struct (nullable = true)
>  |    |-- value: string (nullable = true)
>  |    |-- children: array (nullable = false)
>  |    |    |-- element: struct (containsNull = false)
>  |    |    |    |-- value: string (nullable = true)
>  |    |    |    |-- children: array (nullable = false)
>  |    |    |    |    |-- element: struct (containsNull = false)
>  |    |    |    |    |    |-- value: string (nullable = true)
>  |    |    |    |    |    |-- children: array (nullable = false)   [ === Proposed fix: Drop this field === ]
>  |    |    |    |    |    |    |-- element: void (containsNull = false)   [ === NOTICE 'void' HERE === ]
> {noformat}
> When we try to write this to a delta table, we get an error:
> {noformat}
> AnalysisException: Found nested NullType in column
> from_protobuf(proto).children which is of ArrayType. Delta doesn't support
> writing NullType in complex types.
> {noformat}
> We could just drop the field 'element' when the recursion depth is reached. It
> is simpler and does not need to deal with NullType. We are ignoring the value
> anyway. There is no use in keeping the field.
[jira] [Updated] (SPARK-42406) [PROTOBUF] Recursive field handling is incompatible with delta
[ https://issues.apache.org/jira/browse/SPARK-42406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghu Angadi updated SPARK-42406: - Description:
The Protobuf deserializer (the `from_protobuf()` function) optionally supports recursive fields by limiting the depth to a certain level. See the example below. It assigns a 'NullType' to such a field when the allowed depth is reached. This causes a few issues. E.g. a repeated field, as in the following example, results in an Array field with 'NullType'. Delta does not support null type in a complex type, and `Array[NullType]` is not really useful anyway. Proposed fix: drop the recursive field when the limit is reached rather than using a NullType. The example below makes it clear. Consider a recursive Protobuf:
{code}
message TreeNode {
  string value = 1;
  repeated TreeNode children = 2;
}
{code}
Allow a depth of 2:
{code:python}
df.select(
    from_protobuf('proto',
                  messageName = 'TreeNode',
                  options = { ... "recursive.fields.max.depth" : "2" })
).printSchema()
{code}
The schema looks like this:
{noformat}
root
 |-- from_protobuf(proto): struct (nullable = true)
 |    |-- value: string (nullable = true)
 |    |-- children: array (nullable = false)
 |    |    |-- element: struct (containsNull = false)
 |    |    |    |-- value: string (nullable = true)
 |    |    |    |-- children: array (nullable = false)
 |    |    |    |    |-- element: struct (containsNull = false)
 |    |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |    |    |-- children: array (nullable = false)   [ === Proposed fix: Drop this field === ]
 |    |    |    |    |    |    |-- element: void (containsNull = false)   [ === NOTICE 'void' HERE === ]
{noformat}
When we try to write this to a delta table, we get an error:
{noformat}
AnalysisException: Found nested NullType in column from_protobuf(proto).children which is of ArrayType. Delta doesn't support writing NullType in complex types.
{noformat}
We could just drop the field 'element' when the recursion depth is reached. It is simpler and does not need to deal with NullType.
We are ignoring the value anyway. There is no use in keeping the field. was: Protobuf deserializer (`from_protobuf()` function()) optionally supports recursive fields by limiting the depth to certain level. See example below. It assigns a 'NullType' for such a field when allowed depth is reached. It causes a few issues. E.g. a repeated field as in the following example results in a Array field with 'NullType'. Delta does not support null type in a complex type. Actually `Array[NullType]` is not really useful anyway. How about this fix: Drop the recursive field when the limit reached rather than using a NullType. The example below makes it clear: Consider a recursive Protobuf: ``` message TreeNode { string value = 1; repeated TreeNode children = 2; } ``` Allow depth of 2: ```python df.select( 'proto', messageName = 'TreeNode', options = { ... "recursive.fields.max.depth" : "2" } ).printSchema() ``` Schema looks like this: ``` root |-- from_protobuf(proto): struct (nullable = true) ||-- value: string (nullable = true) ||-- children: array (nullable = false) |||-- element: struct (containsNull = false) ||||-- value: string (nullable = true) ||||-- children: array (nullable = false) |||||-- element: struct (containsNull = false) ||||||-- value: string (nullable = true) ||||||-- children: array (nullable = false). [ === Proposed fix: Drop this field === ] |||||||-- element: void (containsNull = false) [ === NOTICE 'void' HERE === ] ``` When we try to write this to a delta table, we get an error: ``` AnalysisException: Found nested NullType in column from_protobuf(proto).children which is of ArrayType. Delta doesn't support writing NullType in complex types. ``` We could just drop the field 'element' when recursion depth is reached. It is simpler and does not need to deal with NullType. We are ignoring the value anyway. There is no use in keeping the field. 
> [PROTOBUF] Recursive field handling is incompatible with delta > -- > > Key: SPARK-42406 > URL: https://issues.apache.org/jira/browse/SPARK-42406 > Project: Spark > Issue Type: Bug > Components: Protobuf >Affects Versions: 3.4.0 >Reporter: Raghu Angadi >Priority: Major > Fix For: 3.4.1 > > > Protobuf deserializer (`from_protobuf()` function()) optionally supports > recursive fields by limiting the depth to certain level. See example below. > It assigns a 'NullType' for such a field when allowed depth is reached. > It causes a few issues. E.g. a repeated field as in the following example > results in a Array field with 'NullType'. Delta does not support null
[jira] [Created] (SPARK-42406) [PROTOBUF] Recursive field handling is incompatible with delta
Raghu Angadi created SPARK-42406: Summary: [PROTOBUF] Recursive field handling is incompatible with delta Key: SPARK-42406 URL: https://issues.apache.org/jira/browse/SPARK-42406 Project: Spark Issue Type: Bug Components: Protobuf Affects Versions: 3.4.0 Reporter: Raghu Angadi Fix For: 3.4.1
The Protobuf deserializer (the `from_protobuf()` function) optionally supports recursive fields by limiting the depth to a certain level. See the example below. It assigns a 'NullType' to such a field when the allowed depth is reached. This causes a few issues. E.g. a repeated field, as in the following example, results in an Array field with 'NullType'. Delta does not support null type in a complex type, and `Array[NullType]` is not really useful anyway. Proposed fix: drop the recursive field when the limit is reached rather than using a NullType. The example below makes it clear. Consider a recursive Protobuf:
```
message TreeNode {
  string value = 1;
  repeated TreeNode children = 2;
}
```
Allow a depth of 2:
```python
df.select(
    from_protobuf('proto',
                  messageName = 'TreeNode',
                  options = { ... "recursive.fields.max.depth" : "2" })
).printSchema()
```
Schema looks like this:
```
root
 |-- from_protobuf(proto): struct (nullable = true)
 |    |-- value: string (nullable = true)
 |    |-- children: array (nullable = false)
 |    |    |-- element: struct (containsNull = false)
 |    |    |    |-- value: string (nullable = true)
 |    |    |    |-- children: array (nullable = false)
 |    |    |    |    |-- element: struct (containsNull = false)
 |    |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |    |    |-- children: array (nullable = false)   [ === Proposed fix: Drop this field === ]
 |    |    |    |    |    |    |-- element: void (containsNull = false)   [ === NOTICE 'void' HERE === ]
```
When we try to write this to a delta table, we get an error:
```
AnalysisException: Found nested NullType in column from_protobuf(proto).children which is of ArrayType. Delta doesn't support writing NullType in complex types.
```
We could just drop the field 'element' when the recursion depth is reached. It is simpler and does not need to deal with NullType. We are ignoring the value anyway. There is no use in keeping the field.
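The fix proposed in SPARK-42406 can be sketched in isolation. The toy function below is a hypothetical stand-in for the real Protobuf-to-Catalyst schema converter: a plain dict plays the role of the schema, and the field names mirror the TreeNode example above. It shows the proposed behavior, where the recursive field is omitted at the depth limit instead of being emitted as a NullType placeholder, so no Array[NullType] ever reaches a writer such as Delta.

```python
# Hypothetical sketch of the proposed fix for recursive Protobuf schemas.
# A plain dict stands in for the converted schema.

def tree_node_schema(max_depth: int) -> dict:
    """Build the schema for `TreeNode`, recursing at most `max_depth` levels."""
    fields = {"value": "string"}
    if max_depth > 1:
        # Still under the limit: recurse into the repeated `children` field.
        fields["children"] = {"array_of": tree_node_schema(max_depth - 1)}
    # At the limit, `children` is simply omitted -- no NullType placeholder.
    return fields

schema = tree_node_schema(2)
```

With `max_depth=2`, the innermost struct contains only `value`; the `children` field is dropped entirely rather than typed as void.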
[jira] [Assigned] (SPARK-42034) QueryExecutionListener and Observation API, df.observe do not work with `foreach` action.
[ https://issues.apache.org/jira/browse/SPARK-42034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42034: Assignee: Apache Spark > QueryExecutionListener and Observation API, df.observe do not work with > `foreach` action. > - > > Key: SPARK-42034 > URL: https://issues.apache.org/jira/browse/SPARK-42034 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.3, 3.2.2, 3.3.1 > Environment: I tested it locally and on YARN in cluster mode. > Spark 3.3.1, 3.2.2 and 3.1.1. > Yarn 2.9.2 and 3.2.1. >Reporter: Nick Hryhoriev >Assignee: Apache Spark >Priority: Major > Labels: sql-api > > The Observation API, the {{observe}} dataframe transformation, and custom > QueryExecutionListeners do not work with {{foreach}} or {{foreachPartition}} actions. > This is because QueryExecutionListener callbacks are not triggered for queries > whose action is {{foreach}} or {{foreachPartition}}. > The Spark UI SQL tab, however, still shows such a query as a SQL query, including > its query plans. > Here is the code to reproduce it: > https://gist.github.com/GrigorievNick/e7cf9ec5584b417d9719e2812722e6d3
[jira] [Commented] (SPARK-42034) QueryExecutionListener and Observation API, df.observe do not work with `foreach` action.
[ https://issues.apache.org/jira/browse/SPARK-42034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687465#comment-17687465 ] Apache Spark commented on SPARK-42034: -- User 'ming95' has created a pull request for this issue: https://github.com/apache/spark/pull/39976 > QueryExecutionListener and Observation API, df.observe do not work with > `foreach` action. > - > > Key: SPARK-42034 > URL: https://issues.apache.org/jira/browse/SPARK-42034 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.3, 3.2.2, 3.3.1 > Environment: I tested it locally and on YARN in cluster mode. > Spark 3.3.1, 3.2.2 and 3.1.1. > Yarn 2.9.2 and 3.2.1. >Reporter: Nick Hryhoriev >Priority: Major > Labels: sql-api > > The Observation API, the {{observe}} dataframe transformation, and custom > QueryExecutionListeners do not work with {{foreach}} or {{foreachPartition}} actions. > This is because QueryExecutionListener callbacks are not triggered for queries > whose action is {{foreach}} or {{foreachPartition}}. > The Spark UI SQL tab, however, still shows such a query as a SQL query, including > its query plans. > Here is the code to reproduce it: > https://gist.github.com/GrigorievNick/e7cf9ec5584b417d9719e2812722e6d3
[jira] [Assigned] (SPARK-42034) QueryExecutionListener and Observation API, df.observe do not work with `foreach` action.
[ https://issues.apache.org/jira/browse/SPARK-42034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42034: Assignee: (was: Apache Spark) > QueryExecutionListener and Observation API, df.observe do not work with > `foreach` action. > - > > Key: SPARK-42034 > URL: https://issues.apache.org/jira/browse/SPARK-42034 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.3, 3.2.2, 3.3.1 > Environment: I tested it locally and on YARN in cluster mode. > Spark 3.3.1, 3.2.2 and 3.1.1. > Yarn 2.9.2 and 3.2.1. >Reporter: Nick Hryhoriev >Priority: Major > Labels: sql-api > > The Observation API, the {{observe}} dataframe transformation, and custom > QueryExecutionListeners do not work with {{foreach}} or {{foreachPartition}} actions. > This is because QueryExecutionListener callbacks are not triggered for queries > whose action is {{foreach}} or {{foreachPartition}}. > The Spark UI SQL tab, however, still shows such a query as a SQL query, including > its query plans. > Here is the code to reproduce it: > https://gist.github.com/GrigorievNick/e7cf9ec5584b417d9719e2812722e6d3
[jira] [Created] (SPARK-42405) Better documentation of array_insert function
Daniel Davies created SPARK-42405: - Summary: Better documentation of array_insert function Key: SPARK-42405 URL: https://issues.apache.org/jira/browse/SPARK-42405 Project: Spark Issue Type: Documentation Components: SQL Affects Versions: 3.4.0 Reporter: Daniel Davies See the following thread for discussion: https://github.com/apache/spark/pull/38867#discussion_r1097054656 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42235) Missing typing for pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-42235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42235: Assignee: (was: Apache Spark) > Missing typing for pandas_udf > - > > Key: SPARK-42235 > URL: https://issues.apache.org/jira/browse/SPARK-42235 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Donggu Kang >Priority: Minor > > The typing stub {{site-packages/pyspark/sql/pandas/functions.pyi}} has a list > of possible signatures of {{pandas_udf}}. It is missing a case: > {code:python} > import pyspark.sql.functions as F > # PySpark3's typing stub error > @F.pandas_udf( > returnType=LongType() > ) > def my_udf(dummy_col: pd.Series) -> pd.Series: > ... > {code} > The stub defines {{pandas_udf(f, returnType, functionType)}} but not > {{pandas_udf(f, returnType)}}. The official documentation recommends using a > return type hint instead of {{returnType}}. > > {code:python} > # defined > @overload > def pandas_udf( f: PandasScalarToScalarFunction, returnType: > Union[AtomicDataTypeOrString, ArrayType], functionType: PandasScalarUDFType, > ) -> UserDefinedFunctionLike: ... > > # not defined > @overload > def pandas_udf( f: PandasScalarToScalarFunction, returnType: > Union[AtomicDataTypeOrString, ArrayType]) -> UserDefinedFunctionLike: ... > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42235) Missing typing for pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-42235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42235: Assignee: Apache Spark > Missing typing for pandas_udf > - > > Key: SPARK-42235 > URL: https://issues.apache.org/jira/browse/SPARK-42235 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Donggu Kang >Assignee: Apache Spark >Priority: Minor > > The typing stub {{site-packages/pyspark/sql/pandas/functions.pyi}} has a list > of possible signatures of {{pandas_udf}}. It is missing a case: > {code:python} > import pyspark.sql.functions as F > # PySpark3's typing stub error > @F.pandas_udf( > returnType=LongType() > ) > def my_udf(dummy_col: pd.Series) -> pd.Series: > ... > {code} > The stub defines {{pandas_udf(f, returnType, functionType)}} but not > {{pandas_udf(f, returnType)}}. The official documentation recommends using a > return type hint instead of {{returnType}}. > > {code:python} > # defined > @overload > def pandas_udf( f: PandasScalarToScalarFunction, returnType: > Union[AtomicDataTypeOrString, ArrayType], functionType: PandasScalarUDFType, > ) -> UserDefinedFunctionLike: ... > > # not defined > @overload > def pandas_udf( f: PandasScalarToScalarFunction, returnType: > Union[AtomicDataTypeOrString, ArrayType]) -> UserDefinedFunctionLike: ... > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42235) Missing typing for pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-42235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687447#comment-17687447 ] Apache Spark commented on SPARK-42235: -- User 'wayneguow' has created a pull request for this issue: https://github.com/apache/spark/pull/39974 > Missing typing for pandas_udf > - > > Key: SPARK-42235 > URL: https://issues.apache.org/jira/browse/SPARK-42235 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Donggu Kang >Priority: Minor > > The typing stub {{site-packages/pyspark/sql/pandas/functions.pyi}} has a list > of possible signatures of {{pandas_udf}}. It is missing a case: > {code:python} > import pyspark.sql.functions as F > # PySpark3's typing stub error > @F.pandas_udf( > returnType=LongType() > ) > def my_udf(dummy_col: pd.Series) -> pd.Series: > ... > {code} > The stub defines {{pandas_udf(f, returnType, functionType)}} but not > {{pandas_udf(f, returnType)}}. The official documentation recommends using a > return type hint instead of {{returnType}}. > > {code:python} > # defined > @overload > def pandas_udf( f: PandasScalarToScalarFunction, returnType: > Union[AtomicDataTypeOrString, ArrayType], functionType: PandasScalarUDFType, > ) -> UserDefinedFunctionLike: ... > > # not defined > @overload > def pandas_udf( f: PandasScalarToScalarFunction, returnType: > Union[AtomicDataTypeOrString, ArrayType]) -> UserDefinedFunctionLike: ... > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
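A sketch of the two-argument overload the stub would need, using {{typing.overload}}. The type aliases below are hypothetical stand-ins for the real types in {{functions.pyi}}, and the runtime body is a trivial placeholder, since a stub's overloads carry no behaviour:

```python
# Sketch of the missing pandas_udf(f, returnType) overload (hypothetical
# simplified aliases; the real stub uses pyspark-specific types).

from typing import Any, Callable, Union, overload

AtomicDataTypeOrString = str
ArrayType = list
PandasScalarUDFType = int
PandasScalarToScalarFunction = Callable[..., Any]
UserDefinedFunctionLike = Callable[..., Any]


@overload
def pandas_udf(
    f: PandasScalarToScalarFunction,
    returnType: Union[AtomicDataTypeOrString, ArrayType],
    functionType: PandasScalarUDFType,
) -> UserDefinedFunctionLike: ...


@overload  # the overload the reporter says is missing
def pandas_udf(
    f: PandasScalarToScalarFunction,
    returnType: Union[AtomicDataTypeOrString, ArrayType],
) -> UserDefinedFunctionLike: ...


def pandas_udf(f, returnType, functionType=None):
    # Placeholder runtime body: the overloads above only matter to a
    # static type checker, so just return the function unchanged.
    return f
```

With both overloads declared, a static checker accepts `pandas_udf(my_func, "long")` without a `functionType` argument, which matches the usage the documentation recommends.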
[jira] [Assigned] (SPARK-39142) Type overloads in `pandas_udf`
[ https://issues.apache.org/jira/browse/SPARK-39142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39142: Assignee: (was: Apache Spark) > Type overloads in `pandas_udf` > --- > > Key: SPARK-39142 > URL: https://issues.apache.org/jira/browse/SPARK-39142 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Philip Kahn >Priority: Minor > Original Estimate: 1h > Remaining Estimate: 1h > > It seems that `returnType` in the type overloads for `pandas_udf` is never > given a generic for PySpark SQL types, nor are those types explicitly listed: > > [https://github.com/apache/spark/blob/f84018a4810867afa84658fec76494aaae6d57fc/python/pyspark/sql/pandas/functions.pyi] > > This results in static type checkers flagging the type of the decorated > functions (and their parameters) as incorrect; see > [https://github.com/microsoft/pylance-release/issues/2789] as an example. > > For someone familiar with the code base, this should be a very fast patch. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39142) Type overloads in `pandas_udf`
[ https://issues.apache.org/jira/browse/SPARK-39142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39142: Assignee: Apache Spark > Type overloads in `pandas_udf` > --- > > Key: SPARK-39142 > URL: https://issues.apache.org/jira/browse/SPARK-39142 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Philip Kahn >Assignee: Apache Spark >Priority: Minor > Original Estimate: 1h > Remaining Estimate: 1h > > It seems that `returnType` in the type overloads for `pandas_udf` is never > given a generic for PySpark SQL types, nor are those types explicitly listed: > > [https://github.com/apache/spark/blob/f84018a4810867afa84658fec76494aaae6d57fc/python/pyspark/sql/pandas/functions.pyi] > > This results in static type checkers flagging the type of the decorated > functions (and their parameters) as incorrect; see > [https://github.com/microsoft/pylance-release/issues/2789] as an example. > > For someone familiar with the code base, this should be a very fast patch. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39142) Type overloads in `pandas_udf`
[ https://issues.apache.org/jira/browse/SPARK-39142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687445#comment-17687445 ] Apache Spark commented on SPARK-39142: -- User 'wayneguow' has created a pull request for this issue: https://github.com/apache/spark/pull/39974 > Type overloads in `pandas_udf` > --- > > Key: SPARK-39142 > URL: https://issues.apache.org/jira/browse/SPARK-39142 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Philip Kahn >Priority: Minor > Original Estimate: 1h > Remaining Estimate: 1h > > It seems that `returnType` in the type overloads for `pandas_udf` is never > given a generic for PySpark SQL types, nor are those types explicitly listed: > > [https://github.com/apache/spark/blob/f84018a4810867afa84658fec76494aaae6d57fc/python/pyspark/sql/pandas/functions.pyi] > > This results in static type checkers flagging the type of the decorated > functions (and their parameters) as incorrect; see > [https://github.com/microsoft/pylance-release/issues/2789] as an example. > > For someone familiar with the code base, this should be a very fast patch. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org