[jira] [Assigned] (SPARK-50235) Clean up ColumnVector resource after processing all rows in ColumnarToRowExec
[ https://issues.apache.org/jira/browse/SPARK-50235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-50235: Assignee: L. C. Hsieh > Clean up ColumnVector resource after processing all rows in ColumnarToRowExec > - > > Key: SPARK-50235 > URL: https://issues.apache.org/jira/browse/SPARK-50235 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 4.0.0, 3.4.4, 3.5.3 > Reporter: L. C. Hsieh > Assignee: L. C. Hsieh > Priority: Major > Labels: pull-request-available > > Currently we only assign null to the ColumnarBatch object, but that doesn't release the resources held by the vectors in the batch. For OnHeapColumnVector, the Java arrays may be automatically collected by the JVM, but for OffHeapColumnVector, the allocated off-heap memory will be leaked. > For custom ColumnVector implementations such as Arrow-based ones, this can also cause memory-safety issues if the underlying buffers are reused across batches, because when ColumnarToRowExec begins to fill values for the next batch, the arrays from the previous batch are still held. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
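A minimal sketch of the cleanup the issue calls for (the method and variable names here are illustrative assumptions, not the actual generated code in ColumnarToRowExec): once the last row of a batch has been consumed, the batch should be closed rather than merely dereferenced, since ColumnarBatch.close() closes every ColumnVector it holds.
{code:scala}
import org.apache.spark.sql.vectorized.ColumnarBatch

// Hypothetical consumer loop; `processBatch` and its locals are made up.
def processBatch(batch: ColumnarBatch): Unit = {
  val numRows = batch.numRows()
  var rowIdx = 0
  while (rowIdx < numRows) {
    // ... convert batch.getRow(rowIdx) into an InternalRow ...
    rowIdx += 1
  }
  // close() releases the vectors' resources: off-heap memory for
  // OffHeapColumnVector, underlying buffers for Arrow-backed vectors.
  // Assigning null to the batch reference alone does neither.
  batch.close()
}
{code}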
[jira] [Resolved] (SPARK-50235) Clean up ColumnVector resource after processing all rows in ColumnarToRowExec
[ https://issues.apache.org/jira/browse/SPARK-50235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-50235. -- Fix Version/s: 3.5.4 4.0.0 Resolution: Fixed Issue resolved by pull request 48767 [https://github.com/apache/spark/pull/48767] > Clean up ColumnVector resource after processing all rows in ColumnarToRowExec > - > > Key: SPARK-50235 > URL: https://issues.apache.org/jira/browse/SPARK-50235 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 4.0.0, 3.4.4, 3.5.3 > Reporter: L. C. Hsieh > Assignee: L. C. Hsieh > Priority: Major > Labels: pull-request-available > Fix For: 3.5.4, 4.0.0 > > > Currently we only assign null to the ColumnarBatch object, but that doesn't release the resources held by the vectors in the batch. For OnHeapColumnVector, the Java arrays may be automatically collected by the JVM, but for OffHeapColumnVector, the allocated off-heap memory will be leaked. > For custom ColumnVector implementations such as Arrow-based ones, this can also cause memory-safety issues if the underlying buffers are reused across batches, because when ColumnarToRowExec begins to fill values for the next batch, the arrays from the previous batch are still held. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-50224) The replacements of IsValidUTF8|ValidateUTF8|TryValidateUTF8|MakeValidUTF8 shall be NullIntolerant
[ https://issues.apache.org/jira/browse/SPARK-50224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-50224. -- Fix Version/s: 4.0.0 Target Version/s: 4.0.0 Resolution: Fixed Issue resolved by https://github.com/apache/spark/pull/48758 > The replacements of IsValidUTF8|ValidateUTF8|TryValidateUTF8|MakeValidUTF8 > shall be NullIntolerant > --- > > Key: SPARK-50224 > URL: https://issues.apache.org/jira/browse/SPARK-50224 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-50224) The replacements of IsValidUTF8|ValidateUTF8|TryValidateUTF8|MakeValidUTF8 shall be NullIntolerant
Kent Yao created SPARK-50224: Summary: The replacements of IsValidUTF8|ValidateUTF8|TryValidateUTF8|MakeValidUTF8 shall be NullIntolerant Key: SPARK-50224 URL: https://issues.apache.org/jira/browse/SPARK-50224 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-50223) RuntimeReplaceable lost NullIntolerant optimization
[ https://issues.apache.org/jira/browse/SPARK-50223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-50223: Assignee: Kent Yao > RuntimeReplaceable lost NullIntolerant optimization > --- > > Key: SPARK-50223 > URL: https://issues.apache.org/jira/browse/SPARK-50223 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-50223) RuntimeReplaceable lost NullIntolerant optimization
Kent Yao created SPARK-50223: Summary: RuntimeReplaceable lost NullIntolerant optimization Key: SPARK-50223 URL: https://issues.apache.org/jira/browse/SPARK-50223 Project: Spark Issue Type: Umbrella Components: SQL Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-50201) Perf improvement for cryptographic hash functions
Kent Yao created SPARK-50201: Summary: Perf improvement for cryptographic hash functions Key: SPARK-50201 URL: https://issues.apache.org/jira/browse/SPARK-50201 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-50123) Move BitmapExpressionUtilsSuite & ExpressionImplUtilsSuite from java to scala test sources folder
Kent Yao created SPARK-50123: Summary: Move BitmapExpressionUtilsSuite & ExpressionImplUtilsSuite from java to scala test sources folder Key: SPARK-50123 URL: https://issues.apache.org/jira/browse/SPARK-50123 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.5.3 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-50080) Add benchmark cases for parquet adaptive bloom filter in BloomFilterBenchmark
Kent Yao created SPARK-50080: Summary: Add benchmark cases for parquet adaptive bloom filter in BloomFilterBenchmark Key: SPARK-50080 URL: https://issues.apache.org/jira/browse/SPARK-50080 Project: Spark Issue Type: Test Components: SQL, Tests Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49944) Fix Javascript and image imports in the documentation site
[ https://issues.apache.org/jira/browse/SPARK-49944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-49944. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 48438 [https://github.com/apache/spark/pull/48438] > Fix Javascript and image imports in the documentation site > -- > > Key: SPARK-49944 > URL: https://issues.apache.org/jira/browse/SPARK-49944 > Project: Spark > Issue Type: Improvement > Components: Documentation > Affects Versions: 4.0.0 > Reporter: Neil Ramaswamy > Assignee: Neil Ramaswamy > Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > In [SPARK-49378|https://github.com/apache/spark/pull/47864/], we introduced a change to break apart the Structured Streaming programming guide. This created several pages under `/streaming`. > To make this change work, we had to modify some file paths in script files; previously, they relied on the fact that all pages were siblings, but after the nesting under `/streaming` was added, this assumption no longer held. We introduced the `rel_path_to_root` Jekyll variable to make this work, and we used it in most places. > However, we [inadvertently modified|https://github.com/apache/spark/pull/47864/files#diff-729ad9c4e852768f70b7c45195e7e5f8271a7f3146df73045e441d024b907819R201-R202] the paths to our main Javascript file and AnchorJS import to be absolute. These should instead be prefixed with `rel_path_to_root`. (The net effect is that the language-specific code blocks aren't rendering properly on any of the pages.) > Also, the images in the Structured Streaming programming guide use `/img`, which is not correct. Those also need `rel_path_to_root`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49944) Fix Javascript and image imports in the documentation site
[ https://issues.apache.org/jira/browse/SPARK-49944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-49944: Assignee: Neil Ramaswamy > Fix Javascript and image imports in the documentation site > -- > > Key: SPARK-49944 > URL: https://issues.apache.org/jira/browse/SPARK-49944 > Project: Spark > Issue Type: Improvement > Components: Documentation > Affects Versions: 4.0.0 > Reporter: Neil Ramaswamy > Assignee: Neil Ramaswamy > Priority: Major > Labels: pull-request-available > > In [SPARK-49378|https://github.com/apache/spark/pull/47864/], we introduced a change to break apart the Structured Streaming programming guide. This created several pages under `/streaming`. > To make this change work, we had to modify some file paths in script files; previously, they relied on the fact that all pages were siblings, but after the nesting under `/streaming` was added, this assumption no longer held. We introduced the `rel_path_to_root` Jekyll variable to make this work, and we used it in most places. > However, we [inadvertently modified|https://github.com/apache/spark/pull/47864/files#diff-729ad9c4e852768f70b7c45195e7e5f8271a7f3146df73045e441d024b907819R201-R202] the paths to our main Javascript file and AnchorJS import to be absolute. These should instead be prefixed with `rel_path_to_root`. (The net effect is that the language-specific code blocks aren't rendering properly on any of the pages.) > Also, the images in the Structured Streaming programming guide use `/img`, which is not correct. Those also need `rel_path_to_root`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49996) Upgrade `mysql-connector-j` to 9.1.0
[ https://issues.apache.org/jira/browse/SPARK-49996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-49996: Assignee: BingKun Pan > Upgrade `mysql-connector-j` to 9.1.0 > > > Key: SPARK-49996 > URL: https://issues.apache.org/jira/browse/SPARK-49996 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49996) Upgrade `mysql-connector-j` to 9.1.0
[ https://issues.apache.org/jira/browse/SPARK-49996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-49996. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 48504 [https://github.com/apache/spark/pull/48504] > Upgrade `mysql-connector-j` to 9.1.0 > > > Key: SPARK-49996 > URL: https://issues.apache.org/jira/browse/SPARK-49996 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49991) Make HadoopMapReduceCommitProtocol respect 'mapreduce.output.basename' to generate file names
Kent Yao created SPARK-49991: Summary: Make HadoopMapReduceCommitProtocol respect 'mapreduce.output.basename' to generate file names Key: SPARK-49991 URL: https://issues.apache.org/jira/browse/SPARK-49991 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49915) Handle zeros and ones in ReorderAssociativeOperator
[ https://issues.apache.org/jira/browse/SPARK-49915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-49915: Assignee: Kent Yao > Handle zeros and ones in ReorderAssociativeOperator > --- > > Key: SPARK-49915 > URL: https://issues.apache.org/jira/browse/SPARK-49915 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49915) Handle zeros and ones in ReorderAssociativeOperator
[ https://issues.apache.org/jira/browse/SPARK-49915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-49915. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 48395 [https://github.com/apache/spark/pull/48395] > Handle zeros and ones in ReorderAssociativeOperator > --- > > Key: SPARK-49915 > URL: https://issues.apache.org/jira/browse/SPARK-49915 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49915) Handle zeros and ones in ReorderAssociativeOperator
Kent Yao created SPARK-49915: Summary: Handle zeros and ones in ReorderAssociativeOperator Key: SPARK-49915 URL: https://issues.apache.org/jira/browse/SPARK-49915 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49495) Document and Feature Preview on master branch via Live GitHub Pages Updates
[ https://issues.apache.org/jira/browse/SPARK-49495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-49495. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 48175 [https://github.com/apache/spark/pull/48175] > Document and Feature Preview on master branch via Live GitHub Pages Updates > --- > > Key: SPARK-49495 > URL: https://issues.apache.org/jira/browse/SPARK-49495 > Project: Spark > Issue Type: Documentation > Components: Documentation, Project Infra >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49495) Document and Feature Preview on master branch via Live GitHub Pages Updates
[ https://issues.apache.org/jira/browse/SPARK-49495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-49495: Assignee: Kent Yao > Document and Feature Preview on master branch via Live GitHub Pages Updates > --- > > Key: SPARK-49495 > URL: https://issues.apache.org/jira/browse/SPARK-49495 > Project: Spark > Issue Type: Documentation > Components: Documentation, Project Infra >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49495) Document and Feature Preview on master branch via Live GitHub Pages Updates
[ https://issues.apache.org/jira/browse/SPARK-49495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-49495. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47968 [https://github.com/apache/spark/pull/47968] > Document and Feature Preview on master branch via Live GitHub Pages Updates > --- > > Key: SPARK-49495 > URL: https://issues.apache.org/jira/browse/SPARK-49495 > Project: Spark > Issue Type: Documentation > Components: Documentation, Project Infra >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49495) Document and Feature Preview on master branch via Live GitHub Pages Updates
[ https://issues.apache.org/jira/browse/SPARK-49495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-49495: Assignee: Kent Yao > Document and Feature Preview on master branch via Live GitHub Pages Updates > --- > > Key: SPARK-49495 > URL: https://issues.apache.org/jira/browse/SPARK-49495 > Project: Spark > Issue Type: Documentation > Components: Documentation, Project Infra >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-49508) Optimized hadoop-aws dependency, aws-java-sdk-bundle jar is too large
[ https://issues.apache.org/jira/browse/SPARK-49508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879827#comment-17879827 ] Kent Yao commented on SPARK-49508: -- That's what the 'provided' scope is for. Spark already provides a lot of official images, and you can build yours with them as base images. > Optimized hadoop-aws dependency, aws-java-sdk-bundle jar is too large > - > > Key: SPARK-49508 > URL: https://issues.apache.org/jira/browse/SPARK-49508 > Project: Spark > Issue Type: Improvement > Components: Build > Affects Versions: 4.0.0, 3.5.2 > Reporter: melin > Priority: Major > Attachments: image-2024-09-06-17-29-33-066.png > > > The aws-java-sdk-bundle jar is too large; it doubles the size of the Spark image. hadoop-aws only requires aws-java-sdk-s3 and aws-java-sdk-dynamodb.
> {code:xml}
> <dependency>
>   <groupId>org.apache.hadoop</groupId>
>   <artifactId>hadoop-aws</artifactId>
>   <version>${hadoop.version}</version>
>   <exclusions>
>     <exclusion>
>       <groupId>com.amazonaws</groupId>
>       <artifactId>aws-java-sdk-bundle</artifactId>
>     </exclusion>
>   </exclusions>
> </dependency>
> <dependency>
>   <groupId>com.amazonaws</groupId>
>   <artifactId>aws-java-sdk-s3</artifactId>
>   <version>${awssdk.v1.version}</version>
> </dependency>
> <dependency>
>   <groupId>com.amazonaws</groupId>
>   <artifactId>aws-java-sdk-dynamodb</artifactId>
>   <version>${awssdk.v1.version}</version>
> </dependency>
> {code}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-49508) Optimized hadoop-aws dependency, aws-java-sdk-bundle jar is too large
[ https://issues.apache.org/jira/browse/SPARK-49508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879818#comment-17879818 ] Kent Yao commented on SPARK-49508: -- I don't see this in the tar.gz files. Are they only pulled in when you enable the hadoop-cloud profile? > Optimized hadoop-aws dependency, aws-java-sdk-bundle jar is too large > - > > Key: SPARK-49508 > URL: https://issues.apache.org/jira/browse/SPARK-49508 > Project: Spark > Issue Type: Improvement > Components: Build > Affects Versions: 4.0.0, 3.5.2 > Reporter: melin > Priority: Major > > The aws-java-sdk-bundle jar is too large; it doubles the size of the Spark image. hadoop-aws only requires aws-java-sdk-s3 and aws-java-sdk-dynamodb.
> {code:xml}
> <dependency>
>   <groupId>org.apache.hadoop</groupId>
>   <artifactId>hadoop-aws</artifactId>
>   <version>${hadoop.version}</version>
>   <exclusions>
>     <exclusion>
>       <groupId>com.amazonaws</groupId>
>       <artifactId>aws-java-sdk-bundle</artifactId>
>     </exclusion>
>   </exclusions>
> </dependency>
> <dependency>
>   <groupId>com.amazonaws</groupId>
>   <artifactId>aws-java-sdk-s3</artifactId>
>   <version>${awssdk.v1.version}</version>
> </dependency>
> <dependency>
>   <groupId>com.amazonaws</groupId>
>   <artifactId>aws-java-sdk-dynamodb</artifactId>
>   <version>${awssdk.v1.version}</version>
> </dependency>
> {code}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49408) Poor performance in ProjectingInternalRow
[ https://issues.apache.org/jira/browse/SPARK-49408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-49408. -- Fix Version/s: 3.4.4 4.0.0 3.5.3 Resolution: Fixed Issue resolved by pull request 47890 [https://github.com/apache/spark/pull/47890] > Poor performance in ProjectingInternalRow > - > > Key: SPARK-49408 > URL: https://issues.apache.org/jira/browse/SPARK-49408 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.5.2 > Reporter: Frank Wong > Assignee: Frank Wong > Priority: Major > Labels: pull-request-available > Fix For: 3.4.4, 4.0.0, 3.5.3 > > Attachments: 20240827-172739.html > > > In {*}ProjectingInternalRow{*}, *colOrdinals* is passed as a {_}List{_}. According to the Scala documentation, the {{apply}} method of {{List}} has linear time complexity, and it is used in every method of ProjectingInternalRow for every row, which can have a significant impact on performance. > The following flame graph was captured during a {*}MERGE INTO{*} SQL statement. A considerable amount of time was spent in {{List.apply}}; changing this to {{IndexedSeq}} would improve the performance. > [^20240827-172739.html] > [https://docs.scala-lang.org/overviews/collections-2.13/performance-characteristics.html] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
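For context, the cost difference is easy to see outside Spark; a self-contained sketch (the collection contents and sizes here are made up, not taken from the patch):
{code:scala}
// List.apply(i) walks i cons cells, so positional lookups in a per-row
// hot path add up to quadratic work overall. IndexedSeq implementations
// such as Vector provide effectively constant-time indexed access.
val colOrdinals: List[Int] = (0 until 1000).toList
val indexed: IndexedSeq[Int] = colOrdinals.toIndexedSeq

colOrdinals(999) // linear: traverses 999 elements first
indexed(999)     // effectively constant time
{code}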
[jira] [Assigned] (SPARK-49408) Poor performance in ProjectingInternalRow
[ https://issues.apache.org/jira/browse/SPARK-49408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-49408: Assignee: Frank Wong > Poor performance in ProjectingInternalRow > - > > Key: SPARK-49408 > URL: https://issues.apache.org/jira/browse/SPARK-49408 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.5.2 > Reporter: Frank Wong > Assignee: Frank Wong > Priority: Major > Labels: pull-request-available > Attachments: 20240827-172739.html > > > In {*}ProjectingInternalRow{*}, *colOrdinals* is passed as a {_}List{_}. According to the Scala documentation, the {{apply}} method of {{List}} has linear time complexity, and it is used in every method of ProjectingInternalRow for every row, which can have a significant impact on performance. > The following flame graph was captured during a {*}MERGE INTO{*} SQL statement. A considerable amount of time was spent in {{List.apply}}; changing this to {{IndexedSeq}} would improve the performance. > [^20240827-172739.html] > [https://docs.scala-lang.org/overviews/collections-2.13/performance-characteristics.html] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49445) Support show tooltip in the progress bar of UI
[ https://issues.apache.org/jira/browse/SPARK-49445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-49445. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47908 [https://github.com/apache/spark/pull/47908] > Support show tooltip in the progress bar of UI > -- > > Key: SPARK-49445 > URL: https://issues.apache.org/jira/browse/SPARK-49445 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 4.0.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49470) Update dataTables from 1.13.5 to 1.13.11
[ https://issues.apache.org/jira/browse/SPARK-49470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-49470: Assignee: Kent Yao > Update dataTables from 1.13.5 to 1.13.11 > > > Key: SPARK-49470 > URL: https://issues.apache.org/jira/browse/SPARK-49470 > Project: Spark > Issue Type: Dependency upgrade > Components: Web UI >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49470) Update dataTables from 1.13.5 to 1.13.11
[ https://issues.apache.org/jira/browse/SPARK-49470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-49470. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47938 [https://github.com/apache/spark/pull/47938] > Update dataTables from 1.13.5 to 1.13.11 > > > Key: SPARK-49470 > URL: https://issues.apache.org/jira/browse/SPARK-49470 > Project: Spark > Issue Type: Dependency upgrade > Components: Web UI >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49495) Document and Feature Preview on master branch via Live GitHub Pages Updates
Kent Yao created SPARK-49495: Summary: Document and Feature Preview on master branch via Live GitHub Pages Updates Key: SPARK-49495 URL: https://issues.apache.org/jira/browse/SPARK-49495 Project: Spark Issue Type: Documentation Components: Documentation, Project Infra Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49494) Using spark logos from spark-website in docs
Kent Yao created SPARK-49494: Summary: Using spark logos from spark-website in docs Key: SPARK-49494 URL: https://issues.apache.org/jira/browse/SPARK-49494 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49378) Break apart the Structured Streaming Programming Guide
[ https://issues.apache.org/jira/browse/SPARK-49378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-49378. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47864 [https://github.com/apache/spark/pull/47864] > Break apart the Structured Streaming Programming Guide > -- > > Key: SPARK-49378 > URL: https://issues.apache.org/jira/browse/SPARK-49378 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Neil Ramaswamy >Assignee: Neil Ramaswamy >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > As discussed in [this email > thread,|https://lists.apache.org/thread/tbqxg28w4njsp4ws5gbssfckx5zydbdj] we > should break apart the Structured Streaming programming guide to make it > easier for readers to consume (it will also help SEO). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49378) Break apart the Structured Streaming Programming Guide
[ https://issues.apache.org/jira/browse/SPARK-49378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-49378: Assignee: Neil Ramaswamy > Break apart the Structured Streaming Programming Guide > -- > > Key: SPARK-49378 > URL: https://issues.apache.org/jira/browse/SPARK-49378 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Neil Ramaswamy >Assignee: Neil Ramaswamy >Priority: Major > Labels: pull-request-available > > As discussed in [this email > thread,|https://lists.apache.org/thread/tbqxg28w4njsp4ws5gbssfckx5zydbdj] we > should break apart the Structured Streaming programming guide to make it > easier for readers to consume (it will also help SEO). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49470) Update dataTables from 1.13.5 to 1.13.11
Kent Yao created SPARK-49470: Summary: Update dataTables from 1.13.5 to 1.13.11 Key: SPARK-49470 URL: https://issues.apache.org/jira/browse/SPARK-49470 Project: Spark Issue Type: Dependency upgrade Components: Web UI Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49459) Support CRC32C for Shuffle Checksum
Kent Yao created SPARK-49459: Summary: Support CRC32C for Shuffle Checksum Key: SPARK-49459 URL: https://issues.apache.org/jira/browse/SPARK-49459 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
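For context, java.util.zip has shipped a CRC32C implementation since Java 9, and it is typically hardware-accelerated; a minimal usage sketch (how it would be wired into the shuffle checksum path is not shown and is outside the scope of this example):
{code:scala}
import java.util.zip.{CRC32C, Checksum}

// CRC32C uses the Castagnoli polynomial; on modern CPUs it maps to a
// dedicated instruction (e.g. SSE4.2 crc32), making it cheaper than CRC32.
val checksum: Checksum = new CRC32C()
checksum.update("shuffle block bytes".getBytes("UTF-8"))
val value: Long = checksum.getValue
{code}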
[jira] [Updated] (SPARK-46037) When Left Join builds the left side, ShuffledHashJoinExec may produce incorrect results
[ https://issues.apache.org/jira/browse/SPARK-46037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-46037: - Priority: Blocker (was: Minor) > When Left Join builds the left side, ShuffledHashJoinExec may produce incorrect results > --- > > Key: SPARK-46037 > URL: https://issues.apache.org/jira/browse/SPARK-46037 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.5.0 > Reporter: mcdull_zhang > Priority: Blocker > Labels: correctness, pull-request-available > > When a left join builds the left side and codegen is turned off, ShuffledHashJoinExec may produce incorrect results. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
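A hypothetical way to steer the planner toward the affected shape (the configuration key and join hint are real Spark APIs, but the query is illustrative, not taken from the report, and whether the left side actually becomes the build side depends on the planned physical plan):
{code:scala}
// Turn off whole-stage codegen, the condition under which the wrong
// results were reported.
spark.conf.set("spark.sql.codegen.wholeStage", "false")

val left = spark.range(0, 1000).toDF("k")
val right = spark.range(0, 10).toDF("k")

// The SHUFFLE_HASH hint on the left relation asks Spark to plan a
// ShuffledHashJoin that builds its hash table from that side.
left.hint("SHUFFLE_HASH").join(right, Seq("k"), "left").show()
{code}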
[jira] [Resolved] (SPARK-49405) Restrict charsets in JsonOptions
[ https://issues.apache.org/jira/browse/SPARK-49405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-49405. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47887 [https://github.com/apache/spark/pull/47887] > Restrict charsets in JsonOptions > > > Key: SPARK-49405 > URL: https://issues.apache.org/jira/browse/SPARK-49405 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49405) Restrict charsets in JsonOptions
Kent Yao created SPARK-49405: Summary: Restrict charsets in JsonOptions Key: SPARK-49405 URL: https://issues.apache.org/jira/browse/SPARK-49405 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49314) Upgrade `h2` to 2.3.232, `postgresql` to 42.7.4 and `mssql` to 12.8.1.jre11
[ https://issues.apache.org/jira/browse/SPARK-49314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-49314. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47810 [https://github.com/apache/spark/pull/47810] > Upgrade `h2` to 2.3.232, `postgresql` to 42.7.4 and `mssql` to 12.8.1.jre11 > --- > > Key: SPARK-49314 > URL: https://issues.apache.org/jira/browse/SPARK-49314 > Project: Spark > Issue Type: Improvement > Components: Build, Tests >Affects Versions: 4.0.0 >Reporter: Wei Guo >Assignee: Wei Guo >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49314) Upgrade `h2` to 2.3.232, `postgresql` to 42.7.4 and `mssql` to 12.8.1.jre11
[ https://issues.apache.org/jira/browse/SPARK-49314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-49314: Assignee: Wei Guo > Upgrade `h2` to 2.3.232, `postgresql` to 42.7.4 and `mssql` to 12.8.1.jre11 > --- > > Key: SPARK-49314 > URL: https://issues.apache.org/jira/browse/SPARK-49314 > Project: Spark > Issue Type: Improvement > Components: Build, Tests >Affects Versions: 4.0.0 >Reporter: Wei Guo >Assignee: Wei Guo >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49180) Optimize spark-website and doc release size
[ https://issues.apache.org/jira/browse/SPARK-49180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-49180. -- Assignee: Kent Yao Resolution: Fixed The repo size is reduced from 17G to 6.5G now, so I mark this issue as resolved.
{code:shell}
du -sh .
 17G    .
du -sh .
6.5G    .
{code}
> Optimize spark-website and doc release size > --- > > Key: SPARK-49180 > URL: https://issues.apache.org/jira/browse/SPARK-49180 > Project: Spark > Issue Type: Umbrella > Components: Documentation > Affects Versions: 4.0.0 > Reporter: Kent Yao > Assignee: Kent Yao > Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49209) Archive Spark Documentations in Apache Archives
[ https://issues.apache.org/jira/browse/SPARK-49209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-49209. -- Assignee: Kent Yao Resolution: Fixed Archived at https://archive.apache.org/dist/spark/docs/ > Archive Spark Documentations in Apache Archives > --- > > Key: SPARK-49209 > URL: https://issues.apache.org/jira/browse/SPARK-49209 > Project: Spark > Issue Type: Sub-task > Components: Documentation > Affects Versions: 4.0.0 > Reporter: Kent Yao > Assignee: Kent Yao > Priority: Major > > To address the issue of the Spark website repository size > reaching the storage limit for GitHub-hosted runners [1], I suggest > enhancing step [2] in our release process by relocating the > documentation releases from the dev[3] directory to the release > directory[4]. Then it would be captured by the Apache Archives > service[5] to create permanent links, which would be alternative > endpoints for our documentation, like > https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-docs/_site/index.html > for > https://spark.apache.org/docs/3.5.2/index.html > Note that the previous example still uses the staging repository, > which will become > https://archive.apache.org/dist/spark/docs/3.5.2/index.html. > For older releases hosted on the Spark website [6], we also need to > upload them via SVN manually. > After that, when we reach the threshold again, we can delete some of > the old ones on page [6], and update their links on page [7] or use > redirection. > [1] https://lists.apache.org/thread/o0w4gqoks23xztdmjjj26jkp1yyg2bvq > [2] > https://spark.apache.org/release-process.html#upload-to-apache-release-directory > [3] https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-docs/ > [4] https://dist.apache.org/repos/dist/release/spark/docs/3.5.2 > [5] https://archive.apache.org/dist/spark/ > [6] https://github.com/apache/spark-website/tree/asf-site/site/docs > [7] https://spark.apache.org/documentation.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49039) Reset checkbox when executor metrics are loaded in the Stages tab
[ https://issues.apache.org/jira/browse/SPARK-49039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-49039: Assignee: dzcxzl > Reset checkbox when executor metrics are loaded in the Stages tab > - > > Key: SPARK-49039 > URL: https://issues.apache.org/jira/browse/SPARK-49039 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.0, 3.2.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49039) Reset checkbox when executor metrics are loaded in the Stages tab
[ https://issues.apache.org/jira/browse/SPARK-49039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-49039. -- Fix Version/s: 3.4.4 4.0.0 3.5.3 Resolution: Fixed Issue resolved by pull request 47519 [https://github.com/apache/spark/pull/47519] > Reset checkbox when executor metrics are loaded in the Stages tab > - > > Key: SPARK-49039 > URL: https://issues.apache.org/jira/browse/SPARK-49039 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.0, 3.2.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > Labels: pull-request-available > Fix For: 3.4.4, 4.0.0, 3.5.3 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45590) okio-1.15.0 CVE-2023-3635
[ https://issues.apache.org/jira/browse/SPARK-45590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-45590: - Fix Version/s: 4.0.0 3.4.4 > okio-1.15.0 CVE-2023-3635 > - > > Key: SPARK-45590 > URL: https://issues.apache.org/jira/browse/SPARK-45590 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.4.0, 3.5.0, 4.0.0 >Reporter: Colm O hEigeartaigh >Assignee: Gabor Roczei >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0, 3.4.4, 3.5.3 > > > CVE-2023-3635 is being flagged against okio-1.15.0 present in the Spark 3.5.0 > build: > * ./spark-3.5.0-bin-without-hadoop/jars/okio-1.15.0.jar > * ./spark-3.5.0-bin-hadoop3/jars/okio-1.15.0.jar > I don't see okio in the dependency tree, it must be coming in via some > profile. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45590) okio-1.15.0 CVE-2023-3635
[ https://issues.apache.org/jira/browse/SPARK-45590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-45590. -- Fix Version/s: 3.5.3 Resolution: Fixed Issue resolved by pull request 47769 [https://github.com/apache/spark/pull/47769] > okio-1.15.0 CVE-2023-3635 > - > > Key: SPARK-45590 > URL: https://issues.apache.org/jira/browse/SPARK-45590 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.4.0, 3.5.0, 4.0.0 >Reporter: Colm O hEigeartaigh >Assignee: Gabor Roczei >Priority: Minor > Labels: pull-request-available > Fix For: 3.5.3 > > > CVE-2023-3635 is being flagged against okio-1.15.0 present in the Spark 3.5.0 > build: > * ./spark-3.5.0-bin-without-hadoop/jars/okio-1.15.0.jar > * ./spark-3.5.0-bin-hadoop3/jars/okio-1.15.0.jar > I don't see okio in the dependency tree, it must be coming in via some > profile. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45590) okio-1.15.0 CVE-2023-3635
[ https://issues.apache.org/jira/browse/SPARK-45590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-45590: Assignee: Gabor Roczei > okio-1.15.0 CVE-2023-3635 > - > > Key: SPARK-45590 > URL: https://issues.apache.org/jira/browse/SPARK-45590 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.4.0, 3.5.0, 4.0.0 >Reporter: Colm O hEigeartaigh >Assignee: Gabor Roczei >Priority: Minor > Labels: pull-request-available > > CVE-2023-3635 is being flagged against okio-1.15.0 present in the Spark 3.5.0 > build: > * ./spark-3.5.0-bin-without-hadoop/jars/okio-1.15.0.jar > * ./spark-3.5.0-bin-hadoop3/jars/okio-1.15.0.jar > I don't see okio in the dependency tree, it must be coming in via some > profile. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49242) Upgrade commons-cli to 1.9.0
[ https://issues.apache.org/jira/browse/SPARK-49242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-49242. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47764 [https://github.com/apache/spark/pull/47764] > Upgrade commons-cli to 1.9.0 > > > Key: SPARK-49242 > URL: https://issues.apache.org/jira/browse/SPARK-49242 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48947) Use lowercased charset name to decrease cache misses in Charset.forName
[ https://issues.apache.org/jira/browse/SPARK-48947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-48947. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47420 [https://github.com/apache/spark/pull/47420] > Use lowercased charset name to decrease cache misses in Charset.forName > > > Key: SPARK-48947 > URL: https://issues.apache.org/jira/browse/SPARK-48947 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 4.0.0 > Reporter: Kent Yao > Assignee: Kent Yao > Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
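A sketch of the idea (an assumption about the shape of the change, not the actual Spark patch): Charset.forName keeps a small cache keyed by the exact name string it is given, so normalizing callers' strings to one canonical case keeps repeated lookups hitting that cache.
{code:scala}
import java.nio.charset.Charset
import java.util.Locale

// "UTF-8", "utf-8" and "Utf-8" all resolve to the same Charset, but as
// distinct lookup keys they can miss Charset.forName's name cache.
// Normalizing first keeps the key stable across call sites.
def toCharset(name: String): Charset =
  Charset.forName(name.toLowerCase(Locale.ROOT))
{code}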
[jira] [Resolved] (SPARK-49205) KeyGroupedPartitioning should inherit HashPartitioningLike
[ https://issues.apache.org/jira/browse/SPARK-49205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-49205. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47734 [https://github.com/apache/spark/pull/47734] > KeyGroupedPartitioning should inherit HashPartitioningLike > -- > > Key: SPARK-49205 > URL: https://issues.apache.org/jira/browse/SPARK-49205 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49205) KeyGroupedPartitioning should inherit HashPartitioningLike
[ https://issues.apache.org/jira/browse/SPARK-49205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-49205: Assignee: XiDuo You > KeyGroupedPartitioning should inherit HashPartitioningLike > -- > > Key: SPARK-49205 > URL: https://issues.apache.org/jira/browse/SPARK-49205 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49209) Archive Spark Documentations in Apache Archives
[ https://issues.apache.org/jira/browse/SPARK-49209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-49209: - Description: To address the issue of the Spark website repository size reaching the storage limit for GitHub-hosted runners [1], I suggest enhancing step [2] in our release process by relocating the documentation releases from the dev[3] directory to the release directory[4]. Then it would be captured by the Apache Archives service[5] to create permanent links, which would be alternative endpoints for our documentation, like https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-docs/_site/index.html for https://spark.apache.org/docs/3.5.2/index.html Note that the previous example still uses the staging repository, which will become https://archive.apache.org/dist/spark/docs/3.5.2/index.html. For older releases hosted on the Spark website [6], we also need to upload them via SVN manually. After that, when we reach the threshold again, we can delete some of the old ones on page [6], and update their links on page [7] or use redirection. [1] https://lists.apache.org/thread/o0w4gqoks23xztdmjjj26jkp1yyg2bvq [2] https://spark.apache.org/release-process.html#upload-to-apache-release-directory [3] https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-docs/ [4] https://dist.apache.org/repos/dist/release/spark/docs/3.5.2 [5] https://archive.apache.org/dist/spark/ [6] https://github.com/apache/spark-website/tree/asf-site/site/docs [7] https://spark.apache.org/documentation.html > Archive Spark Documentations in Apache Archives > --- > > Key: SPARK-49209 > URL: https://issues.apache.org/jira/browse/SPARK-49209 > Project: Spark > Issue Type: Sub-task > Components: Documentation > Affects Versions: 4.0.0 > Reporter: Kent Yao > Priority: Major > > To address the issue of the Spark website repository size > reaching the storage limit for GitHub-hosted runners [1], I suggest > enhancing step [2] in our release process by relocating the > documentation releases from the dev[3] directory to the release > directory[4]. Then it would be captured by the Apache Archives > service[5] to create permanent links, which would be alternative > endpoints for our documentation, like > https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-docs/_site/index.html > for > https://spark.apache.org/docs/3.5.2/index.html > Note that the previous example still uses the staging repository, > which will become > https://archive.apache.org/dist/spark/docs/3.5.2/index.html. > For older releases hosted on the Spark website [6], we also need to > upload them via SVN manually. > After that, when we reach the threshold again, we can delete some of > the old ones on page [6], and update their links on page [7] or use > redirection. > [1] https://lists.apache.org/thread/o0w4gqoks23xztdmjjj26jkp1yyg2bvq > [2] > https://spark.apache.org/release-process.html#upload-to-apache-release-directory > [3] https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-docs/ > [4] https://dist.apache.org/repos/dist/release/spark/docs/3.5.2 > [5] https://archive.apache.org/dist/spark/ > [6] https://github.com/apache/spark-website/tree/asf-site/site/docs > [7] https://spark.apache.org/documentation.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49209) Archive Spark Documentations in Apache Archives
Kent Yao created SPARK-49209: Summary: Archive Spark Documentations in Apache Archives Key: SPARK-49209 URL: https://issues.apache.org/jira/browse/SPARK-49209 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49181) Remove site/docs/{version}/api/python/_sources folder
[ https://issues.apache.org/jira/browse/SPARK-49181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-49181: Assignee: Kent Yao > Remove site/docs/{version}/api/python/_sources folder > - > > Key: SPARK-49181 > URL: https://issues.apache.org/jira/browse/SPARK-49181 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Time Spent: 2h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49181) Remove site/docs/{version}/api/python/_sources folder
[ https://issues.apache.org/jira/browse/SPARK-49181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-49181. -- Resolution: Fixed > Remove site/docs/{version}/api/python/_sources folder > - > > Key: SPARK-49181 > URL: https://issues.apache.org/jira/browse/SPARK-49181 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Time Spent: 2h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11583) Make MapStatus use less memory
[ https://issues.apache.org/jira/browse/SPARK-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-11583: Assignee: Kent Yao (was: Davies Liu) > Make MapStatus use less memory > --- > > Key: SPARK-11583 > URL: https://issues.apache.org/jira/browse/SPARK-11583 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core > Reporter: Kent Yao 2 > Assignee: Kent Yao > Priority: Major > Fix For: 1.6.0 > > > In the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, as I said, using BitSet can save ≈20% memory usage compared to RoaringBitMap. For a Spark job that contains quite a lot of tasks, 20% seems like a drop in the ocean. > Essentially, BitSet uses long[]. For example, a BitSet[200k] = long[3125]. So we can use a HashSet[Int] to store reduceIds instead (when non-empty blocks are dense, store the reduceIds of empty blocks; when sparse, store the non-empty ones). > For dense cases: if HashSet[Int](numNonEmptyBlocks).size < BitSet[totalBlockNum], I use MapStatusTrackingNoEmptyBlocks. > For sparse cases: if HashSet[Int](numEmptyBlocks).size < BitSet[totalBlockNum], I use MapStatusTrackingEmptyBlocks. > Sparse case (299/300 are empty): sc.makeRDD(1 to 3, 3000).groupBy(x=>x).top(5) > Dense case (no block is empty): sc.makeRDD(1 to 900, 3000).groupBy(x=>x).top(5) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
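A back-of-the-envelope check of the trade-off described above (the block count is from the description; the per-entry HashSet cost is a rough JVM estimate, not a measured number):
{code:scala}
// A BitSet over N blocks always costs ceil(N / 64) longs, however many
// bits are actually set:
val totalBlocks = 200000
val bitSetWords = (totalBlocks + 63) / 64 // 3125 longs, about 25 KB

// A HashSet[Int] costs very roughly ~48 bytes per stored element, so
// tracking explicit ids only wins while the tracked side is small:
val bytesPerEntry = 48
val breakEven = bitSetWords * 8 / bytesPerEntry // ~520 entries
{code}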
[jira] [Updated] (SPARK-49182) Stop publish site/docs/{version}/api/python/_sources
[ https://issues.apache.org/jira/browse/SPARK-49182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-49182: - Fix Version/s: 3.5.3 (was: 3.5.2) > Stop publish site/docs/{version}/api/python/_sources > > > Key: SPARK-49182 > URL: https://issues.apache.org/jira/browse/SPARK-49182 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.4.4, 3.5.3 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49182) Stop publish site/docs/{version}/api/python/_sources
[ https://issues.apache.org/jira/browse/SPARK-49182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-49182: Assignee: Kent Yao > Stop publish site/docs/{version}/api/python/_sources > > > Key: SPARK-49182 > URL: https://issues.apache.org/jira/browse/SPARK-49182 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49182) Stop publish site/docs/{version}/api/python/_sources
[ https://issues.apache.org/jira/browse/SPARK-49182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-49182. -- Fix Version/s: 3.4.4 3.5.2 4.0.0 Resolution: Fixed Issue resolved by pull request 47686 [https://github.com/apache/spark/pull/47686] > Stop publish site/docs/{version}/api/python/_sources > > > Key: SPARK-49182 > URL: https://issues.apache.org/jira/browse/SPARK-49182 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 3.4.4, 3.5.2, 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49174) Exclude the dir `docs/util` from `_site`
[ https://issues.apache.org/jira/browse/SPARK-49174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-49174. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47678 [https://github.com/apache/spark/pull/47678] > Exclude the dir `docs/util` from `_site` > > > Key: SPARK-49174 > URL: https://issues.apache.org/jira/browse/SPARK-49174 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49182) Stop publish site/docs/{version}/api/python/_sources
Kent Yao created SPARK-49182: Summary: Stop publish site/docs/{version}/api/python/_sources Key: SPARK-49182 URL: https://issues.apache.org/jira/browse/SPARK-49182 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49181) Remove site/docs/{version}/api/python/_sources folder
Kent Yao created SPARK-49181: Summary: Remove site/docs/{version}/api/python/_sources folder Key: SPARK-49181 URL: https://issues.apache.org/jira/browse/SPARK-49181 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49180) Optimize spark-website and doc release size
Kent Yao created SPARK-49180: Summary: Optimize spark-website and doc release size Key: SPARK-49180 URL: https://issues.apache.org/jira/browse/SPARK-49180 Project: Spark Issue Type: Umbrella Components: Documentation Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49147) Mark KryoRegistrator as DeveloperApi interface
[ https://issues.apache.org/jira/browse/SPARK-49147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-49147. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47657 [https://github.com/apache/spark/pull/47657] > Mark KryoRegistrator as DeveloperApi interface > -- > > Key: SPARK-49147 > URL: https://issues.apache.org/jira/browse/SPARK-49147 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Rob Reeves >Assignee: Rob Reeves >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > The trait org.apache.spark.serializer.KryoRegistrator is a public interface > because it has been exposed via the config "spark.kryo.registrator" since version > 0.5.0. It should have an annotation to describe its stability. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
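For reference, the typical shape of a user-supplied registrator (the class name and registrations below are illustrative):
{code:java}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Spark instantiates this reflectively via the spark.kryo.registrator config.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[Array[Double]])
    kryo.register(classOf[java.time.LocalDate])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[MyRegistrator].getName)
{code}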
[jira] [Assigned] (SPARK-49147) Mark KryoRegistrator as DeveloperApi interface
[ https://issues.apache.org/jira/browse/SPARK-49147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-49147: Assignee: Rob Reeves > Mark KryoRegistrator as DeveloperApi interface > -- > > Key: SPARK-49147 > URL: https://issues.apache.org/jira/browse/SPARK-49147 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Rob Reeves >Assignee: Rob Reeves >Priority: Major > Labels: pull-request-available > > The trait org.apache.spark.serializer.KryoRegistrator is a public interface > because it has been exposed via the config "spark.kryo.registrator" since version > 0.5.0. It should have an annotation to describe its stability. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49154) Upgrade `Volcano` to 1.9.0
[ https://issues.apache.org/jira/browse/SPARK-49154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-49154. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47661 [https://github.com/apache/spark/pull/47661] > Upgrade `Volcano` to 1.9.0 > -- > > Key: SPARK-49154 > URL: https://issues.apache.org/jira/browse/SPARK-49154 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Kubernetes, Project Infra, Tests >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49124) Upgrade tink to 1.14.1
[ https://issues.apache.org/jira/browse/SPARK-49124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-49124. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47632 [https://github.com/apache/spark/pull/47632] > Upgrade tink to 1.14.1 > -- > > Key: SPARK-49124 > URL: https://issues.apache.org/jira/browse/SPARK-49124 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Wei Guo >Assignee: Wei Guo >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49124) Upgrade tink to 1.14.1
[ https://issues.apache.org/jira/browse/SPARK-49124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-49124: Assignee: Wei Guo > Upgrade tink to 1.14.1 > -- > > Key: SPARK-49124 > URL: https://issues.apache.org/jira/browse/SPARK-49124 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Wei Guo >Assignee: Wei Guo >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49141) Mark variant as hive incompatible data type
Kent Yao created SPARK-49141: Summary: Mark variant as hive incompatible data type Key: SPARK-49141 URL: https://issues.apache.org/jira/browse/SPARK-49141 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49134) Support retry for deploying artifacts to Nexus staging repository
Kent Yao created SPARK-49134: Summary: Support retry for deploying artifacts to Nexus staging repository Key: SPARK-49134 URL: https://issues.apache.org/jira/browse/SPARK-49134 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49120) Bump Gson 2.11.0
[ https://issues.apache.org/jira/browse/SPARK-49120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-49120. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47627 [https://github.com/apache/spark/pull/47627] > Bump Gson 2.11.0 > > > Key: SPARK-49120 > URL: https://issues.apache.org/jira/browse/SPARK-49120 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 4.0.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49099) Refactor CatalogManager.setCurrentNamespace
[ https://issues.apache.org/jira/browse/SPARK-49099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-49099: - Affects Version/s: 3.5.1 > Refactor CatalogManager.setCurrentNamespace > --- > > Key: SPARK-49099 > URL: https://issues.apache.org/jira/browse/SPARK-49099 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0, 3.5.1 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49099) Refactor CatalogManager.setCurrentNamespace
[ https://issues.apache.org/jira/browse/SPARK-49099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-49099: - Issue Type: Bug (was: Improvement) > Refactor CatalogManager.setCurrentNamespace > --- > > Key: SPARK-49099 > URL: https://issues.apache.org/jira/browse/SPARK-49099 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49099) Refactor CatalogManager.setCurrentNamespace
[ https://issues.apache.org/jira/browse/SPARK-49099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-49099. -- Fix Version/s: 4.0.0 3.5.2 Resolution: Fixed > Refactor CatalogManager.setCurrentNamespace > --- > > Key: SPARK-49099 > URL: https://issues.apache.org/jira/browse/SPARK-49099 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49091) Some broadcasts cannot be cleared from memory storage
[ https://issues.apache.org/jira/browse/SPARK-49091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-49091. -- Assignee: Zhen Wang Resolution: Not A Problem > Some broadcasts cannot be cleared from memory storage > - > > Key: SPARK-49091 > URL: https://issues.apache.org/jira/browse/SPARK-49091 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.4.0, 3.5.0, 4.0.0 >Reporter: Yuming Wang >Assignee: Zhen Wang >Priority: Major > Attachments: SPARK-49091.patch, driver heap.png, > image-2024-08-02-20-45-48-252.png, image-2024-08-02-20-52-33-896.png > > > Please apply this patch([^SPARK-49091.patch]) to reproduce this issue. This > issue may cause driver memory leak. > !driver heap.png|thumbnail! > This issue was introduced by SPARK-41914. > Before SPARK-41914: > {noformat} > [info] BroadcastCleanerSuite: > 10:30:16.228 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load > native-hadoop library for your platform... using builtin-java classes where > applicable > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 0, names: > [info] - Test broadcast cleaner (1 minute, 4 seconds) > 10:31:21.552 WARN org.apache.spark.sql.BroadcastCleanerSuite: > {noformat} > After SPARK-41914: > {noformat} > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > 
broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 2, names: broadcast_1_piece0, broadcast_1 > entries size: 2, names: broadcast_1_piece0, broadcast_1 > entries size: 2, names: broadcast_1_piece0, broadcast_1 > entries size: 2, names: broadcast_1_piece0, broadcast_1 > entries size: 2, names: broadcast_1_piece0, broadcast_1 > entries size: 2, names: broadcast_1_piece0, broadcast_1 > entries size: 2, names: broadcast_1_piece0, broadcast_1 > entries size: 2, names: broadcast_1_piece0, broadcast_1 > entries size: 2, names: broadcast_1_piece0, broadcast_1 > entries size: 2, names: broadcast_1_piece0, broadcast_1 > entries size: 2, names: broadcast_1_piece0, broadcast_1 > entries size: 2,
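For context, the explicit cleanup APIs involved, as a minimal sketch; the report itself concerns the automatic path, where the ContextCleaner should remove the broadcast_* blocks once a broadcast becomes unreachable:
{code:java}
val data = Array.fill(1 << 20)(0L)
val b = spark.sparkContext.broadcast(data)
// ... use b in one or more jobs ...
b.unpersist(blocking = true) // drops cached copies on the executors
b.destroy()                  // drops all state, including driver-side blocks
{code}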
[jira] [Resolved] (SPARK-49107) ROUTINE_ALREADY_EXISTS supports RoutineType
[ https://issues.apache.org/jira/browse/SPARK-49107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-49107. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47600 [https://github.com/apache/spark/pull/47600] > ROUTINE_ALREADY_EXISTS supports RoutineType > --- > > Key: SPARK-49107 > URL: https://issues.apache.org/jira/browse/SPARK-49107 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49108) Add `submit_pi.sh` REST API example
[ https://issues.apache.org/jira/browse/SPARK-49108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-49108. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47601 [https://github.com/apache/spark/pull/47601] > Add `submit_pi.sh` REST API example > --- > > Key: SPARK-49108 > URL: https://issues.apache.org/jira/browse/SPARK-49108 > Project: Spark > Issue Type: Sub-task > Components: Examples >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-49078) Support show columns syntax in v2 table
[ https://issues.apache.org/jira/browse/SPARK-49078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-49078: Assignee: xy > Support show columns syntax in v2 table > --- > > Key: SPARK-49078 > URL: https://issues.apache.org/jira/browse/SPARK-49078 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: xy >Assignee: xy >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Support show columns syntax in v2 table -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-49078) Support show columns syntax in v2 table
[ https://issues.apache.org/jira/browse/SPARK-49078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-49078. -- Resolution: Fixed Issue resolved by pull request 47568 [https://github.com/apache/spark/pull/47568] > Support show columns syntax in v2 table > --- > > Key: SPARK-49078 > URL: https://issues.apache.org/jira/browse/SPARK-49078 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: xy >Assignee: xy >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Support show columns syntax in v2 table -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
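With this change, the statement also works against catalog-qualified v2 tables, e.g. (catalog, namespace, and table names are placeholders):
{code:java}
spark.sql("SHOW COLUMNS FROM my_catalog.ns.tbl").show()
{code}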
[jira] [Updated] (SPARK-49094) ignoreCorruptFiles file source option is partially supported for orc format
[ https://issues.apache.org/jira/browse/SPARK-49094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-49094: - Issue Type: Bug (was: Improvement) > ignoreCorruptFiles file source option is partially supported for orc format > --- > > Key: SPARK-49094 > URL: https://issues.apache.org/jira/browse/SPARK-49094 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0, 3.5.1, 3.4.3 >Reporter: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
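For context, corrupt-file skipping can be requested in two ways; the issue reports that the per-scan option was only partially honored for the ORC format (the path below is a placeholder):
{code:java}
// Session-wide configuration.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

// Per-scan data source option.
val df = spark.read
  .option("ignoreCorruptFiles", "true")
  .orc("/path/to/orc")
{code}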
[jira] [Commented] (SPARK-49091) Some broadcasts cannot be cleared from memory storage
[ https://issues.apache.org/jira/browse/SPARK-49091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870519#comment-17870519 ] Kent Yao commented on SPARK-49091: -- Thank you [~wforget] for the verification, sounds reasonable to me > Some broadcasts cannot be cleared from memory storage > - > > Key: SPARK-49091 > URL: https://issues.apache.org/jira/browse/SPARK-49091 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.4.0, 3.5.0, 4.0.0 >Reporter: Yuming Wang >Priority: Major > Attachments: SPARK-49091.patch, driver heap.png, > image-2024-08-02-20-45-48-252.png, image-2024-08-02-20-52-33-896.png > > > Please apply this patch([^SPARK-49091.patch]) to reproduce this issue. This > issue may cause driver memory leak. > !driver heap.png|thumbnail! > This issue was introduced by SPARK-41914. > Before SPARK-41914: > {noformat} > [info] BroadcastCleanerSuite: > 10:30:16.228 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load > native-hadoop library for your platform... using builtin-java classes where > applicable > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 0, names: > [info] - Test broadcast cleaner (1 minute, 4 seconds) > 10:31:21.552 WARN org.apache.spark.sql.BroadcastCleanerSuite: > {noformat} > After SPARK-41914: > {noformat} > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: 
broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, > broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3 > entries size: 2, names: broadcast_1_piece0, broadcast_1 > entries size: 2, names: broadcast_1_piece0, broadcast_1 > entries size: 2, names: broadcast_1_piece0, broadcast_1 > entries size: 2, names: broadcast_1_piece0, broadcast_1 > entries size: 2, names: broadcast_1_piece0, broadcast_1 > entries size: 2, names: broadcast_1_piece0, broadcast_1 > entries size: 2, names: broadcast_1_piece0, broadcast_1 > entries size: 2, names: broadcast_1_piece0, broadcast_1 > entries size: 2, names: broadcast_1_piece0, broadcast_1 > entries size: 2, names: broadcast_1_piece0, broadcast_1 > entries size: 2, names: broadcast_1_pi
[jira] [Updated] (SPARK-49094) ignoreCorruptFiles file source option is partially supported for orc format
[ https://issues.apache.org/jira/browse/SPARK-49094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-49094: - Affects Version/s: 3.4.3 3.5.1 > ignoreCorruptFiles file source option is partially supported for orc format > --- > > Key: SPARK-49094 > URL: https://issues.apache.org/jira/browse/SPARK-49094 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0, 3.5.1, 3.4.3 >Reporter: Kent Yao >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49094) ignoreCorruptFiles file source option is partially supported for orc format
Kent Yao created SPARK-49094: Summary: ignoreCorruptFiles file source option is partially supported for orc format Key: SPARK-49094 URL: https://issues.apache.org/jira/browse/SPARK-49094 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49000) Aggregation with DISTINCT gives wrong results when dealing with literals
[ https://issues.apache.org/jira/browse/SPARK-49000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-49000: - Fix Version/s: 3.4.4 > Aggregation with DISTINCT gives wrong results when dealing with literals > > > Key: SPARK-49000 > URL: https://issues.apache.org/jira/browse/SPARK-49000 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.3, 3.2.4, 3.5.1, 3.3.4, 3.4.3 >Reporter: Uroš Bojanić >Assignee: Uroš Bojanić >Priority: Critical > Labels: correctness, pull-request-available > Fix For: 4.0.0, 3.4.4 > > > Aggregation with *DISTINCT* gives wrong results when dealing with literals. > It appears that this bug affects all (or most) released versions of Spark. > > For example: > {code:java} > select count(distinct 1) from t{code} > returns 1, while the correct result should be 0. > > For reference: > {code:java} > select count(1) from t{code} > returns 0, which is the correct and expected result. > > In these examples, suppose that *t* is an empty table (with any columns). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
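A minimal reproduction sketch following the description, assuming an empty table named t:
{code:java}
spark.sql("CREATE TABLE t (c INT) USING parquet")
spark.sql("SELECT count(1) FROM t").show()          // 0, correct
spark.sql("SELECT count(distinct 1) FROM t").show() // 1 on affected versions; should be 0
{code}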
[jira] [Commented] (SPARK-49030) Self join of a CTE seems non-deterministic
[ https://issues.apache.org/jira/browse/SPARK-49030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870341#comment-17870341 ] Kent Yao commented on SPARK-49030: -- I retargeted this to version 3.5.3 because we have reached a consensus that the priority of this issue is low, although we still have different technical options. > Self join of a CTE seems non-deterministic > -- > > Key: SPARK-49030 > URL: https://issues.apache.org/jira/browse/SPARK-49030 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 > Environment: Tested with Spark 3.4.1, 3.5.1, and 4.0.0-preview. >Reporter: Jihoon Son >Priority: Minor > Fix For: 3.5.3 > > Attachments: screenshot-1.png > > > {code:java} > WITH c AS (SELECT * FROM customer LIMIT 10) > SELECT count(*) > FROM c c1, c c2 > WHERE c1.c_customer_sk > c2.c_customer_sk{code} > Suppose a self join query on a CTE such as the one above. > Spark generates a physical plan like the one below for this query. > {code:java} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[], functions=[count(1)], output=[count(1)#194L]) > +- HashAggregate(keys=[], functions=[partial_count(1)], > output=[count#233L]) > +- Project > +- BroadcastNestedLoopJoin BuildRight, Inner, (c_customer_sk#0 > > c_customer_sk#214) > :- Filter isnotnull(c_customer_sk#0) > : +- GlobalLimit 10, 0 > : +- Exchange SinglePartition, ENSURE_REQUIREMENTS, > [plan_id=256] > : +- LocalLimit 10 > : +- FileScan parquet [c_customer_sk#0] Batched: true, > DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 > paths)[file:/some/path/customer], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > +- BroadcastExchange IdentityBroadcastMode, [plan_id=263] > +- Filter isnotnull(c_customer_sk#214) > +- GlobalLimit 10, 0 > +- Exchange SinglePartition, ENSURE_REQUIREMENTS, > [plan_id=259] > +- LocalLimit 10 > +- FileScan parquet [c_customer_sk#214] Batched: > true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 > paths)[file:/some/path/customer], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct{code} > Evaluating this plan produces non-deterministic result because the limit is > independently pushed into the two sides of the join. Each limit can produce > different data, and thus the join can produce results that vary across runs. > I understand that the query in question is not deterministic (and thus not > very practical) as, due to the nature of the limit in distributed engines, it > is not expected to produce the same result anyway across repeated runs. > However, I would still expect that the query plan evaluation remains > deterministic. > Per extended analysis as seen below, it seems that the query plan has changed > at some point during optimization. 
> {code:java} > == Analyzed Logical Plan == > count(1): bigint > WithCTE > :- CTERelationDef 2, false > : +- SubqueryAlias c > : +- GlobalLimit 10 > : +- LocalLimit 10 > : +- Project [c_customer_sk#0, c_customer_id#1, > c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, > c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, > c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, > c_birth_month#12L, c_birth_year#13L, c_birth_country#14, c_login#15, > c_email_address#16, c_last_review_date_sk#17] > : +- SubqueryAlias customer > : +- View (`customer`, [c_customer_sk#0, c_customer_id#1, > c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, > c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, > c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, > c_birth_month#12L, c_birth_year#13L, c_birth_country#14, c_login#15, > c_email_address#16, c_last_review_date_sk#17]) > : +- Relation > [c_customer_sk#0,c_customer_id#1,c_current_cdemo_sk#2,c_current_hdemo_sk#3,c_current_addr_sk#4,c_first_shipto_date_sk#5,c_first_sales_date_sk#6,c_salutation#7,c_first_name#8,c_last_name#9,c_preferred_cust_flag#10,c_birth_day#11L,c_birth_month#12L,c_birth_year#13L,c_birth_country#14,c_login#15,c_email_address#16,c_last_review_date_sk#17] > parquet > +- Aggregate [count(1) AS count(1)#194L] > +- Filter (c_customer_sk#0 > c_customer_sk#176) > +- Join Inner > :- SubqueryAlias c1 > : +- SubqueryAlias c > : +- CTERelationRef 2, true, [c_customer_sk#0, c_customer_id#1, > c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current
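The reported query can be replayed as below against any `customer` table; on affected versions the count may vary across runs, because the limit is evaluated independently on each side of the self join:
{code:java}
val q =
  """WITH c AS (SELECT * FROM customer LIMIT 10)
    |SELECT count(*) FROM c c1, c c2
    |WHERE c1.c_customer_sk > c2.c_customer_sk""".stripMargin
// Repeated runs may print different counts on affected versions.
Seq.fill(3)(spark.sql(q).first.getLong(0)).foreach(println)
{code}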
[jira] [Updated] (SPARK-49030) Self join of a CTE seems non-deterministic
[ https://issues.apache.org/jira/browse/SPARK-49030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-49030: - Fix Version/s: 3.5.3 > Self join of a CTE seems non-deterministic > -- > > Key: SPARK-49030 > URL: https://issues.apache.org/jira/browse/SPARK-49030 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 > Environment: Tested with Spark 3.4.1, 3.5.1, and 4.0.0-preview. >Reporter: Jihoon Son >Priority: Minor > Fix For: 3.5.3 > > Attachments: screenshot-1.png > > > {code:java} > WITH c AS (SELECT * FROM customer LIMIT 10) > SELECT count(*) > FROM c c1, c c2 > WHERE c1.c_customer_sk > c2.c_customer_sk{code} > Suppose a self join query on a CTE such as the one above. > Spark generates a physical plan like the one below for this query. > {code:java} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[], functions=[count(1)], output=[count(1)#194L]) > +- HashAggregate(keys=[], functions=[partial_count(1)], > output=[count#233L]) > +- Project > +- BroadcastNestedLoopJoin BuildRight, Inner, (c_customer_sk#0 > > c_customer_sk#214) > :- Filter isnotnull(c_customer_sk#0) > : +- GlobalLimit 10, 0 > : +- Exchange SinglePartition, ENSURE_REQUIREMENTS, > [plan_id=256] > : +- LocalLimit 10 > : +- FileScan parquet [c_customer_sk#0] Batched: true, > DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 > paths)[file:/some/path/customer], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > +- BroadcastExchange IdentityBroadcastMode, [plan_id=263] > +- Filter isnotnull(c_customer_sk#214) > +- GlobalLimit 10, 0 > +- Exchange SinglePartition, ENSURE_REQUIREMENTS, > [plan_id=259] > +- LocalLimit 10 > +- FileScan parquet [c_customer_sk#214] Batched: > true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 > paths)[file:/some/path/customer], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct{code} > Evaluating this plan produces non-deterministic result because the limit is > independently pushed into the two sides of the join. Each limit can produce > different data, and thus the join can produce results that vary across runs. > I understand that the query in question is not deterministic (and thus not > very practical) as, due to the nature of the limit in distributed engines, it > is not expected to produce the same result anyway across repeated runs. > However, I would still expect that the query plan evaluation remains > deterministic. > Per extended analysis as seen below, it seems that the query plan has changed > at some point during optimization. 
> {code:java} > == Analyzed Logical Plan == > count(1): bigint > WithCTE > :- CTERelationDef 2, false > : +- SubqueryAlias c > : +- GlobalLimit 10 > : +- LocalLimit 10 > : +- Project [c_customer_sk#0, c_customer_id#1, > c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, > c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, > c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, > c_birth_month#12L, c_birth_year#13L, c_birth_country#14, c_login#15, > c_email_address#16, c_last_review_date_sk#17] > : +- SubqueryAlias customer > : +- View (`customer`, [c_customer_sk#0, c_customer_id#1, > c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, > c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, > c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, > c_birth_month#12L, c_birth_year#13L, c_birth_country#14, c_login#15, > c_email_address#16, c_last_review_date_sk#17]) > : +- Relation > [c_customer_sk#0,c_customer_id#1,c_current_cdemo_sk#2,c_current_hdemo_sk#3,c_current_addr_sk#4,c_first_shipto_date_sk#5,c_first_sales_date_sk#6,c_salutation#7,c_first_name#8,c_last_name#9,c_preferred_cust_flag#10,c_birth_day#11L,c_birth_month#12L,c_birth_year#13L,c_birth_country#14,c_login#15,c_email_address#16,c_last_review_date_sk#17] > parquet > +- Aggregate [count(1) AS count(1)#194L] > +- Filter (c_customer_sk#0 > c_customer_sk#176) > +- Join Inner > :- SubqueryAlias c1 > : +- SubqueryAlias c > : +- CTERelationRef 2, true, [c_customer_sk#0, c_customer_id#1, > c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, > c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, > c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, > c_birth_month#12L, c_birth_year#13L,
[jira] [Commented] (SPARK-48950) Corrupt data from parquet scans
[ https://issues.apache.org/jira/browse/SPARK-48950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870316#comment-17870316 ] Kent Yao commented on SPARK-48950: -- Thank you [~dongjoon]. Hi [~Tom_Newton], please let us know if you have new findings, thank you. > Corrupt data from parquet scans > --- > > Key: SPARK-48950 > URL: https://issues.apache.org/jira/browse/SPARK-48950 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.5.0, 4.0.0, 3.5.1 > Environment: Spark 3.5.0 > Running on kubernetes > Using Azure Blob storage with hierarchical namespace enabled >Reporter: Thomas Newton >Priority: Major > Labels: correctness > Attachments: example_task_errors.txt, job_dag.png, sql_query_plan.png > > > It's very rare and non-deterministic, but since Spark 3.5.0 we have started > seeing a correctness bug in parquet scans when using the vectorized reader. > We've noticed this on double type columns where occasionally small groups > (typically 10s to 100s) of rows are replaced with crazy values like > `-1.29996470e+029, 3.56717569e-184, 7.23323243e+307, -1.05929677e+045, > -7.60562076e+240, -3.1806e-064, 2.89435993e-116`. I think this is the > result of interpreting uniform random bits as a double type. Most of my > testing has been on an array-of-double column, but we have also seen it > on un-nested plain double columns. > I've been testing this by adding a filter that should return zero results but > will return non-zero if the parquet scan has problems. I've attached > screenshots of this from the Spark UI. > I did a `git bisect` and found that the problem starts with > [https://github.com/apache/spark/pull/39950], but I haven't yet understood > why. It's possible that this change is fine but reveals a problem > elsewhere. I did also notice [https://github.com/apache/spark/pull/44853], > which appears to be a different implementation of the same thing, so maybe > that could help. > It's not a major problem by itself, but another symptom appears to be that > Parquet scan tasks fail at a rate of approximately 0.03% with errors like > those in the attached `example_task_errors.txt`. If I revert > [https://github.com/apache/spark/pull/39950] I get exactly 0 task failures on > the same test. > > The problem seems to be a bit dependent on how the parquet files happen to be > organised on blob storage, so I don't yet have a reproducer that I can share > that doesn't depend on private data. > I tested on a pre-release 4.0.0 and the problem was still present. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
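A sketch of the detection approach described above (path and column name are placeholders): filter for magnitudes that cannot occur in the real data, so any hit indicates corrupt doubles from the scan.
{code:java}
import org.apache.spark.sql.functions._

val hits = spark.read.parquet("/data/table")
  .filter(abs(col("value")) > 1e300) // values like 7.2e+307 should never appear
  .count()
require(hits == 0, s"parquet scan returned $hits corrupt-looking rows")
{code}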
[jira] [Updated] (SPARK-44638) Unable to read from JDBC data sources when using custom schema containing varchar
[ https://issues.apache.org/jira/browse/SPARK-44638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-44638: - Fix Version/s: 3.4.4 > Unable to read from JDBC data sources when using custom schema containing > varchar > - > > Key: SPARK-44638 > URL: https://issues.apache.org/jira/browse/SPARK-44638 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0, 3.2.4, 3.3.2, 3.4.1 >Reporter: Michael Said >Assignee: Kent Yao >Priority: Critical > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2, 3.4.4 > > > When querying data from JDBC databases with a custom schema containing > varchar, I got this error: > {code:java} > [23/07/14 06:12:19 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1) ( > executor 1): java.sql.SQLException: Unsupported type varchar(100) at > org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedJdbcTypeError(QueryExecutionErrors.scala:818) > 23/07/14 06:12:21 INFO TaskSetManager: Lost task 0.1 in stage 1.0 (TID 2) on > , executor 0: java.sql.SQLException (Unsupported type varchar(100)){code} > Code example: > {code:java} > CUSTOM_SCHEMA="ID Integer, NAME VARCHAR(100)" > df = spark.read.format("jdbc") > .option("url", "jdbc:oracle:thin:@0.0.0.0:1521:db") > .option("driver", "oracle.jdbc.OracleDriver") > .option("dbtable", "table") > .option("customSchema", CUSTOM_SCHEMA) > .option("user", "user") > .option("password", "password") > .load() > df.show(){code} > I tried to set {{spark.sql.legacy.charVarcharAsString = true}} to restore the > behavior before Spark 3.1, but it doesn't help. > The issue occurs in version 3.1.0 and above. I believe that this issue is > caused by https://issues.apache.org/jira/browse/SPARK-33480 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
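On affected versions, a commonly suggested workaround (an unverified sketch, repeating the reporter's connection settings) is to declare the column as STRING in customSchema instead of VARCHAR:
{code:java}
val df = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@0.0.0.0:1521:db")
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("dbtable", "table")
  .option("customSchema", "ID INTEGER, NAME STRING") // STRING instead of VARCHAR(100)
  .option("user", "user")
  .option("password", "password")
  .load()
{code}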
[jira] [Commented] (SPARK-48950) Corrupt data from parquet scans
[ https://issues.apache.org/jira/browse/SPARK-48950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870190#comment-17870190 ] Kent Yao commented on SPARK-48950: -- Thank you [~Tom_Newton] for the additional inputs. I’d retarget this to 3.5.3 to unblock 3.5.2. WDYT? [~dongjoon] > Corrupt data from parquet scans > --- > > Key: SPARK-48950 > URL: https://issues.apache.org/jira/browse/SPARK-48950 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.5.0, 4.0.0, 3.5.1 > Environment: Spark 3.5.0 > Running on kubernetes > Using Azure Blob storage with hierarchical namespace enabled >Reporter: Thomas Newton >Priority: Major > Labels: correctness > Attachments: example_task_errors.txt, job_dag.png, sql_query_plan.png > > > It's very rare and non-deterministic, but since Spark 3.5.0 we have started > seeing a correctness bug in parquet scans when using the vectorized reader. > We've noticed this on double type columns where occasionally small groups > (typically 10s to 100s) of rows are replaced with crazy values like > `-1.29996470e+029, 3.56717569e-184, 7.23323243e+307, -1.05929677e+045, > -7.60562076e+240, -3.1806e-064, 2.89435993e-116`. I think this is the > result of interpreting uniform random bits as a double type. Most of my > testing has been on an array-of-double column, but we have also seen it > on un-nested plain double columns. > I've been testing this by adding a filter that should return zero results but > will return non-zero if the parquet scan has problems. I've attached > screenshots of this from the Spark UI. > I did a `git bisect` and found that the problem starts with > [https://github.com/apache/spark/pull/39950], but I haven't yet understood > why. It's possible that this change is fine but reveals a problem > elsewhere. I did also notice [https://github.com/apache/spark/pull/44853], > which appears to be a different implementation of the same thing, so maybe > that could help. > It's not a major problem by itself, but another symptom appears to be that > Parquet scan tasks fail at a rate of approximately 0.03% with errors like > those in the attached `example_task_errors.txt`. If I revert > [https://github.com/apache/spark/pull/39950] I get exactly 0 task failures on > the same test. > > The problem seems to be a bit dependent on how the parquet files happen to be > organised on blob storage, so I don't yet have a reproducer that I can share > that doesn't depend on private data. > I tested on a pre-release 4.0.0 and the problem was still present. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-48950) Corrupt data from parquet scans
[ https://issues.apache.org/jira/browse/SPARK-48950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870183#comment-17870183 ] Kent Yao commented on SPARK-48950: -- Did your app run with ignoreCorruptFiles? BTW, I wonder if we could have a reproducible case on OSS Spark. > Corrupt data from parquet scans > --- > > Key: SPARK-48950 > URL: https://issues.apache.org/jira/browse/SPARK-48950 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.5.0, 4.0.0, 3.5.1 > Environment: Spark 3.5.0 > Running on kubernetes > Using Azure Blob storage with hierarchical namespace enabled >Reporter: Thomas Newton >Priority: Major > Labels: correctness > Attachments: example_task_errors.txt, job_dag.png, sql_query_plan.png > > > It's very rare and non-deterministic, but since Spark 3.5.0 we have started > seeing a correctness bug in parquet scans when using the vectorized reader. > We've noticed this on double type columns where occasionally small groups > (typically 10s to 100s) of rows are replaced with crazy values like > `-1.29996470e+029, 3.56717569e-184, 7.23323243e+307, -1.05929677e+045, > -7.60562076e+240, -3.1806e-064, 2.89435993e-116`. I think this is the > result of interpreting uniform random bits as a double type. Most of my > testing has been on an array-of-double column, but we have also seen it > on un-nested plain double columns. > I've been testing this by adding a filter that should return zero results but > will return non-zero if the parquet scan has problems. I've attached > screenshots of this from the Spark UI. > I did a `git bisect` and found that the problem starts with > [https://github.com/apache/spark/pull/39950], but I haven't yet understood > why. It's possible that this change is fine but reveals a problem > elsewhere. I did also notice [https://github.com/apache/spark/pull/44853], > which appears to be a different implementation of the same thing, so maybe > that could help. > It's not a major problem by itself, but another symptom appears to be that > Parquet scan tasks fail at a rate of approximately 0.03% with errors like > those in the attached `example_task_errors.txt`. If I revert > [https://github.com/apache/spark/pull/39950] I get exactly 0 task failures on > the same test. > > The problem seems to be a bit dependent on how the parquet files happen to be > organised on blob storage, so I don't yet have a reproducer that I can share > that doesn't depend on private data. > I tested on a pre-release 4.0.0 and the problem was still present. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-48950) Corrupt data from parquet scans
[ https://issues.apache.org/jira/browse/SPARK-48950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870152#comment-17870152 ] Kent Yao commented on SPARK-48950: -- According to the error stacks you provided, ``` Caused by: java.io.IOException: FAILED_TO_UNCOMPRESS(5) at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:112) at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) at org.xerial.snappy.Snappy.uncompress(Snappy.java:554) at org.apache.parquet.hadoop.codec.SnappyDecompressor.uncompress(SnappyDecompressor.java:30) at org.apache.parquet.hadoop.codec.NonBlockedDecompressor.decompress(NonBlockedDecompressor.java:73) at org.apache.parquet.hadoop.codec.NonBlockedDecompressorStream.read(NonBlockedDecompressorStream.java:51) at java.base/java.io.DataInputStream.readFully(DataInputStream.java:201) at java.base/java.io.DataInputStream.readFully(DataInputStream.java:172) at org.apache.parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:286) at org.apache.parquet.bytes.BytesInput.toByteBuffer(BytesInput.java:237) at org.apache.parquet.bytes.BytesInput.toInputStream(BytesInput.java:246) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPageV1(VectorizedColumnReader.java:367) ... 41 more ``` The root cause doesn't lie directly in Spark, right? > Corrupt data from parquet scans > --- > > Key: SPARK-48950 > URL: https://issues.apache.org/jira/browse/SPARK-48950 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.5.0, 4.0.0, 3.5.1 > Environment: Spark 3.5.0 > Running on kubernetes > Using Azure Blob storage with hierarchical namespace enabled >Reporter: Thomas Newton >Priority: Major > Labels: correctness > Attachments: example_task_errors.txt, job_dag.png, sql_query_plan.png > > > It's very rare and non-deterministic, but since Spark 3.5.0 we have started > seeing a correctness bug in parquet scans when using the vectorized reader. > We've noticed this on double type columns where occasionally small groups > (typically 10s to 100s) of rows are replaced with crazy values like > `-1.29996470e+029, 3.56717569e-184, 7.23323243e+307, -1.05929677e+045, > -7.60562076e+240, -3.1806e-064, 2.89435993e-116`. I think this is the > result of interpreting uniform random bits as a double type. Most of my > testing has been on an array-of-double column, but we have also seen it > on un-nested plain double columns. > I've been testing this by adding a filter that should return zero results but > will return non-zero if the parquet scan has problems. I've attached > screenshots of this from the Spark UI. > I did a `git bisect` and found that the problem starts with > [https://github.com/apache/spark/pull/39950], but I haven't yet understood > why. It's possible that this change is fine but reveals a problem > elsewhere. I did also notice [https://github.com/apache/spark/pull/44853], > which appears to be a different implementation of the same thing, so maybe > that could help. > It's not a major problem by itself, but another symptom appears to be that > Parquet scan tasks fail at a rate of approximately 0.03% with errors like > those in the attached `example_task_errors.txt`. If I revert > [https://github.com/apache/spark/pull/39950] I get exactly 0 task failures on > the same test. > > The problem seems to be a bit dependent on how the parquet files happen to be > organised on blob storage, so I don't yet have a reproducer that I can share > that doesn't depend on private data. > I tested on a pre-release 4.0.0 and the problem was still present. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-49030) Self join of a CTE seems non-deterministic
[ https://issues.apache.org/jira/browse/SPARK-49030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870147#comment-17870147 ] Kent Yao commented on SPARK-49030: -- Thank you [~ulysses] for the verification > Self join of a CTE seems non-deterministic > -- > > Key: SPARK-49030 > URL: https://issues.apache.org/jira/browse/SPARK-49030 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 > Environment: Tested with Spark 3.4.1, 3.5.1, and 4.0.0-preview. >Reporter: Jihoon Son >Priority: Minor > > {code:java} > WITH c AS (SELECT * FROM customer LIMIT 10) > SELECT count(*) > FROM c c1, c c2 > WHERE c1.c_customer_sk > c2.c_customer_sk{code} > Suppose a self join query on a CTE such as the one above. > Spark generates a physical plan like the one below for this query. > {code:java} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[], functions=[count(1)], output=[count(1)#194L]) > +- HashAggregate(keys=[], functions=[partial_count(1)], > output=[count#233L]) > +- Project > +- BroadcastNestedLoopJoin BuildRight, Inner, (c_customer_sk#0 > > c_customer_sk#214) > :- Filter isnotnull(c_customer_sk#0) > : +- GlobalLimit 10, 0 > : +- Exchange SinglePartition, ENSURE_REQUIREMENTS, > [plan_id=256] > : +- LocalLimit 10 > : +- FileScan parquet [c_customer_sk#0] Batched: true, > DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 > paths)[file:/some/path/customer], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > +- BroadcastExchange IdentityBroadcastMode, [plan_id=263] > +- Filter isnotnull(c_customer_sk#214) > +- GlobalLimit 10, 0 > +- Exchange SinglePartition, ENSURE_REQUIREMENTS, > [plan_id=259] > +- LocalLimit 10 > +- FileScan parquet [c_customer_sk#214] Batched: > true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 > paths)[file:/some/path/customer], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct{code} > Evaluating this plan produces non-deterministic result because the limit is > independently pushed into the two sides of the join. Each limit can produce > different data, and thus the join can produce results that vary across runs. > I understand that the query in question is not deterministic (and thus not > very practical) as, due to the nature of the limit in distributed engines, it > is not expected to produce the same result anyway across repeated runs. > However, I would still expect that the query plan evaluation remains > deterministic. > Per extended analysis as seen below, it seems that the query plan has changed > at some point during optimization. 
> {code:java} > == Analyzed Logical Plan == > count(1): bigint > WithCTE > :- CTERelationDef 2, false > : +- SubqueryAlias c > : +- GlobalLimit 10 > : +- LocalLimit 10 > : +- Project [c_customer_sk#0, c_customer_id#1, > c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, > c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, > c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, > c_birth_month#12L, c_birth_year#13L, c_birth_country#14, c_login#15, > c_email_address#16, c_last_review_date_sk#17] > : +- SubqueryAlias customer > : +- View (`customer`, [c_customer_sk#0, c_customer_id#1, > c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, > c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, > c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, > c_birth_month#12L, c_birth_year#13L, c_birth_country#14, c_login#15, > c_email_address#16, c_last_review_date_sk#17]) > : +- Relation > [c_customer_sk#0,c_customer_id#1,c_current_cdemo_sk#2,c_current_hdemo_sk#3,c_current_addr_sk#4,c_first_shipto_date_sk#5,c_first_sales_date_sk#6,c_salutation#7,c_first_name#8,c_last_name#9,c_preferred_cust_flag#10,c_birth_day#11L,c_birth_month#12L,c_birth_year#13L,c_birth_country#14,c_login#15,c_email_address#16,c_last_review_date_sk#17] > parquet > +- Aggregate [count(1) AS count(1)#194L] > +- Filter (c_customer_sk#0 > c_customer_sk#176) > +- Join Inner > :- SubqueryAlias c1 > : +- SubqueryAlias c > : +- CTERelationRef 2, true, [c_customer_sk#0, c_customer_id#1, > c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, > c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, > c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, > c_birth_month#12L, c_birth_year#13
[jira] [Commented] (SPARK-49000) Aggregation with DISTINCT gives wrong results when dealing with literals
[ https://issues.apache.org/jira/browse/SPARK-49000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870120#comment-17870120 ] Kent Yao commented on SPARK-49000: -- The commit has been reverted to fix CI; please send a backport PR for 3.5. > Aggregation with DISTINCT gives wrong results when dealing with literals > > > Key: SPARK-49000 > URL: https://issues.apache.org/jira/browse/SPARK-49000 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.3, 3.2.4, 3.5.1, 3.3.4, 3.4.3 >Reporter: Uroš Bojanić >Assignee: Uroš Bojanić >Priority: Critical > Labels: correctness, pull-request-available > Fix For: 4.0.0 > > > Aggregation with *DISTINCT* gives wrong results when dealing with literals. > It appears that this bug affects all (or most) released versions of Spark. > > For example: > {code:java} > select count(distinct 1) from t{code} > returns 1, while the correct result should be 0. > > For reference: > {code:java} > select count(1) from t{code} > returns 0, which is the correct and expected result. > > In these examples, suppose that *t* is an empty table (with any columns). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49000) Aggregation with DISTINCT gives wrong results when dealing with literals
[ https://issues.apache.org/jira/browse/SPARK-49000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-49000: - Fix Version/s: (was: 3.5.2) > Aggregation with DISTINCT gives wrong results when dealing with literals > > > Key: SPARK-49000 > URL: https://issues.apache.org/jira/browse/SPARK-49000 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.3, 3.2.4, 3.5.1, 3.3.4, 3.4.3 >Reporter: Uroš Bojanić >Assignee: Uroš Bojanić >Priority: Critical > Labels: correctness, pull-request-available > Fix For: 4.0.0 > > > Aggregation with *DISTINCT* gives wrong results when dealing with literals. > It appears that this bug affects all (or most) released versions of Spark. > > For example: > {code:java} > select count(distinct 1) from t{code} > returns 1, while the correct result should be 0. > > For reference: > {code:java} > select count(1) from t{code} > returns 0, which is the correct and expected result. > > In these examples, suppose that *t* is an empty table (with any columns). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-49054) Column default value should support current_* functions
[ https://issues.apache.org/jira/browse/SPARK-49054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-49054: - Fix Version/s: 4.0.0 > Column default value should support current_* functions > --- > > Key: SPARK-49054 > URL: https://issues.apache.org/jira/browse/SPARK-49054 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0, 3.5.2 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
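The ticket carries no description, but the summary points at usage along these lines; the table name, schema, and choice of current_timestamp() below are illustrative assumptions, not taken from the ticket:

{code:scala}
// Hedged sketch of the feature named in the summary: a current_* function
// used as a column default. Table name and schema are illustrative only.
spark.sql("""
  CREATE TABLE events (
    id BIGINT,
    created_at TIMESTAMP DEFAULT current_timestamp()
  ) USING parquet
""")

// Omitting created_at should fill in the default at insertion time.
spark.sql("INSERT INTO events (id) VALUES (1)")
{code}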
[jira] [Assigned] (SPARK-49067) Move utf-8 literal into the methods of UrlCodec class
[ https://issues.apache.org/jira/browse/SPARK-49067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-49067: Assignee: Zhen Wang > Move utf-8 literal into the methods of UrlCodec class > - > > Key: SPARK-49067 > URL: https://issues.apache.org/jira/browse/SPARK-49067 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Zhen Wang >Assignee: Zhen Wang >Priority: Major > Labels: pull-request-available > > Move utf-8 literals in url encode/decode functions to internal methods of > UrlCodec class -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
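A minimal sketch of the refactoring direction, with simplified signatures (the real UrlCodec operates on Spark's internal types and error handling, which are omitted here):

{code:scala}
import java.net.{URLDecoder, URLEncoder}

// Hedged sketch: signatures simplified to plain String. The point of the
// change is that the "UTF-8" literal lives inside UrlCodec's methods rather
// than being passed in by every caller of encode/decode.
object UrlCodec {
  def encode(src: String): String = URLEncoder.encode(src, "UTF-8")
  def decode(src: String): String = URLDecoder.decode(src, "UTF-8")
}
{code}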
[jira] [Resolved] (SPARK-49067) Move utf-8 literal into the methods of UrlCodec class
[ https://issues.apache.org/jira/browse/SPARK-49067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-49067. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47544 [https://github.com/apache/spark/pull/47544] > Move utf-8 literal into the methods of UrlCodec class > - > > Key: SPARK-49067 > URL: https://issues.apache.org/jira/browse/SPARK-49067 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Zhen Wang >Assignee: Zhen Wang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Move utf-8 literals in url encode/decode functions to internal methods of > UrlCodec class -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48608) Spark 3.5: fails to build with value defaultValueNotConstantError is not a member of object org.apache.spark.sql.errors.QueryCompilationErrors
[ https://issues.apache.org/jira/browse/SPARK-48608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-48608. -- Fix Version/s: 3.5.2 Resolution: Fixed Issue resolved by https://github.com/apache/spark/pull/46978 > Spark 3.5: fails to build with value defaultValueNotConstantError is not a > member of object org.apache.spark.sql.errors.QueryCompilationErrors > --- > > Key: SPARK-48608 > URL: https://issues.apache.org/jira/browse/SPARK-48608 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.2 >Reporter: Thomas Graves >Priority: Blocker > Fix For: 3.5.2 > > > PR [https://github.com/apache/spark/pull/46594] seems to have broken the > Spark 3.5 build. > [ERROR] [Error] > ...sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala:299: > value defaultValueNotConstantError is not a member of object > org.apache.spark.sql.errors.QueryCompilationErrors > I don't see that definition on the 3.5 branch - > [https://github.com/apache/spark/blob/branch-3.5/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala] > I see it defined on master by > https://issues.apache.org/jira/browse/SPARK-46905, which only went into 4.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
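For context, a hedged sketch of the shape of the helper the call site expects; the actual method and error class on master may differ from this guess:

{code:scala}
// Hedged sketch only: the signature and error class are assumptions inferred
// from the compile error above. branch-3.5 lacks this helper, so call sites
// referencing it fail to build there.
import org.apache.spark.sql.AnalysisException

def defaultValueNotConstantError(
    statement: String,
    colName: String,
    defaultValue: String): Throwable =
  new AnalysisException(
    errorClass = "INVALID_DEFAULT_VALUE.NOT_CONSTANT",
    messageParameters = Map(
      "statement" -> statement,
      "colName" -> colName,
      "defaultValue" -> defaultValue))
{code}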
[jira] [Updated] (SPARK-48308) Unify getting data schema without partition columns in FileSourceStrategy
[ https://issues.apache.org/jira/browse/SPARK-48308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-48308: - Fix Version/s: 3.5.2 > Unify getting data schema without partition columns in FileSourceStrategy > - > > Key: SPARK-48308 > URL: https://issues.apache.org/jira/browse/SPARK-48308 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.1 >Reporter: Johan Lasperas >Assignee: Johan Lasperas >Priority: Trivial > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2 > > > In > [FileSourceStrategy,|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L191] > the schema of the data excluding partition columns is computed twice in > slightly different ways: > > {code:java} > val dataColumnsWithoutPartitionCols = > dataColumns.filterNot(partitionSet.contains) {code} > vs > {code:java} > val readDataColumns = dataColumns > .filterNot(partitionColumns.contains) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
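A hedged sketch of the unification, using stand-in types so the shape of the fix is runnable in isolation (in FileSourceStrategy these are Attribute/AttributeSet, not String/Set[String]):

{code:scala}
// Stand-in types: String for Attribute, Set[String] for AttributeSet.
val dataColumns: Seq[String] = Seq("a", "b", "p")
val partitionSet: Set[String] = Set("p")

// Compute the partition-free data columns once...
val dataColumnsWithoutPartitionCols = dataColumns.filterNot(partitionSet.contains)
// ...and reuse the result instead of re-filtering against a second,
// slightly different collection, which is the duplication the ticket calls out.
val readDataColumns = dataColumnsWithoutPartitionCols
{code}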
[jira] [Updated] (SPARK-48991) FileStreamSink.hasMetadata handles invalid path
[ https://issues.apache.org/jira/browse/SPARK-48991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-48991: - Fix Version/s: 3.5.3 (was: 3.5.2) > FileStreamSink.hasMetadata handles invalid path > --- > > Key: SPARK-48991 > URL: https://issues.apache.org/jira/browse/SPARK-48991 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0, 3.5.1, 3.4.3 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.4.4, 3.5.3 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48991) FileStreamSink.hasMetadata handles invalid path
Kent Yao created SPARK-48991: Summary: FileStreamSink.hasMetadata handles invalid path Key: SPARK-48991 URL: https://issues.apache.org/jira/browse/SPARK-48991 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.3, 3.5.1, 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org