[jira] [Assigned] (SPARK-50235) Clean up ColumnVector resource after processing all rows in ColumnarToRowExec

2024-11-06 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-50235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-50235:


Assignee: L. C. Hsieh

> Clean up ColumnVector resource after processing all rows in ColumnarToRowExec
> -
>
> Key: SPARK-50235
> URL: https://issues.apache.org/jira/browse/SPARK-50235
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0, 3.4.4, 3.5.3
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>  Labels: pull-request-available
>
> Currently we only assign null to the ColumnarBatch object, but that doesn't 
> release the resources held by the vectors in the batch. For OnHeapColumnVector, 
> the Java arrays may be automatically collected by the JVM, but for 
> OffHeapColumnVector, the allocated off-heap memory is leaked.
> For custom ColumnVector implementations such as Arrow-based ones, this can also 
> cause memory-safety issues if the underlying buffers are reused across 
> batches, because when ColumnarToRowExec begins to fill values for the next 
> batch, the arrays from the previous batch are still held.
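
For illustration, a minimal sketch of the intended cleanup (not the actual
SPARK-50235 patch; it assumes the batch is fully consumed before closing):

{code:scala}
// Minimal sketch, not the SPARK-50235 patch itself. ColumnarBatch.close()
// closes every ColumnVector it holds, so OffHeapColumnVector memory is
// actually freed instead of merely dropping a reference that frees nothing.
import org.apache.spark.sql.vectorized.ColumnarBatch

def drainBatch(batch: ColumnarBatch): Unit = {
  val rows = batch.rowIterator()
  while (rows.hasNext) {
    val row = rows.next()
    // ... consume the row (e.g. copy values out) ...
  }
  // Release the vectors' resources once all rows are processed.
  batch.close()
}
{code}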



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-50235) Clean up ColumnVector resource after processing all rows in ColumnarToRowExec

2024-11-06 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-50235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-50235.
--
Fix Version/s: 3.5.4
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 48767
[https://github.com/apache/spark/pull/48767]

> Clean up ColumnVector resource after processing all rows in ColumnarToRowExec
> -
>
> Key: SPARK-50235
> URL: https://issues.apache.org/jira/browse/SPARK-50235
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0, 3.4.4, 3.5.3
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.4, 4.0.0
>
>
> Currently we only assign null to the ColumnarBatch object, but that doesn't 
> release the resources held by the vectors in the batch. For OnHeapColumnVector, 
> the Java arrays may be automatically collected by the JVM, but for 
> OffHeapColumnVector, the allocated off-heap memory is leaked.
> For custom ColumnVector implementations such as Arrow-based ones, this can also 
> cause memory-safety issues if the underlying buffers are reused across 
> batches, because when ColumnarToRowExec begins to fill values for the next 
> batch, the arrays from the previous batch are still held.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-50224) The replacements of IsValidUTF8|ValidateUTF8|TryValidateUTF8|MakeValidUTF8 shall be NullIntolerant

2024-11-06 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-50224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-50224.
--
   Fix Version/s: 4.0.0
Target Version/s: 4.0.0
  Resolution: Fixed

Issue resolved by https://github.com/apache/spark/pull/48758

> The replacements of IsValidUTF8|ValidateUTF8|TryValidateUTF8|MakeValidUTF8 
> shall be NullIntolerant 
> ---
>
> Key: SPARK-50224
> URL: https://issues.apache.org/jira/browse/SPARK-50224
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-50224) The replacements of IsValidUTF8|ValidateUTF8|TryValidateUTF8|MakeValidUTF8 shall be NullIntolerant

2024-11-04 Thread Kent Yao (Jira)
Kent Yao created SPARK-50224:


 Summary: The replacements of 
IsValidUTF8|ValidateUTF8|TryValidateUTF8|MakeValidUTF8 shall be NullIntolerant 
 Key: SPARK-50224
 URL: https://issues.apache.org/jira/browse/SPARK-50224
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-50223) RuntimeReplaceable lost NullIntolerant optimization

2024-11-04 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-50223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-50223:


Assignee: Kent Yao

> RuntimeReplaceable lost NullIntolerant optimization
> ---
>
> Key: SPARK-50223
> URL: https://issues.apache.org/jira/browse/SPARK-50223
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-50223) RuntimeReplaceable lost NullIntolerant optimization

2024-11-04 Thread Kent Yao (Jira)
Kent Yao created SPARK-50223:


 Summary: RuntimeReplaceable lost NullIntolerant optimization
 Key: SPARK-50223
 URL: https://issues.apache.org/jira/browse/SPARK-50223
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-50201) Perf improvement for cryptographic hash functions

2024-10-31 Thread Kent Yao (Jira)
Kent Yao created SPARK-50201:


 Summary: Perf improvement for cryptographic hash functions
 Key: SPARK-50201
 URL: https://issues.apache.org/jira/browse/SPARK-50201
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-50123) Move BitmapExpressionUtilsSuite & ExpressionImplUtilsSuite from java to scala test sources folder

2024-10-25 Thread Kent Yao (Jira)
Kent Yao created SPARK-50123:


 Summary: Move BitmapExpressionUtilsSuite & 
ExpressionImplUtilsSuite from java to scala test sources folder
 Key: SPARK-50123
 URL: https://issues.apache.org/jira/browse/SPARK-50123
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.5.3
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-50080) Add benchmark cases for parquet adaptive bloom filter in BloomFilterBenchmark

2024-10-22 Thread Kent Yao (Jira)
Kent Yao created SPARK-50080:


 Summary: Add benchmark cases for parquet adaptive bloom filter in 
BloomFilterBenchmark
 Key: SPARK-50080
 URL: https://issues.apache.org/jira/browse/SPARK-50080
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49944) Fix Javascript and image imports in the documentation site

2024-10-21 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-49944.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 48438
[https://github.com/apache/spark/pull/48438]

> Fix Javascript and image imports in the documentation site
> --
>
> Key: SPARK-49944
> URL: https://issues.apache.org/jira/browse/SPARK-49944
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Neil Ramaswamy
>Assignee: Neil Ramaswamy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> In [SPARK-49378|https://github.com/apache/spark/pull/47864/], we introduced a 
> change to break apart the Structured Streaming programming guide. This 
> created several pages under `/streaming`.
> To make this change work, we had to modify some file paths in script files; 
> previously, they relied on the fact that all pages were siblings. But 
> after the nesting under `/streaming` was added, this assumption was broken. We 
> introduced the `rel_path_to_root` Jekyll variable to make this work, and we 
> used it in most places.
> However, we [inadvertently 
> modified|https://github.com/apache/spark/pull/47864/files#diff-729ad9c4e852768f70b7c45195e7e5f8271a7f3146df73045e441d024b907819R201-R202]
>  the paths to our main Javascript file and AnchorJS import to be absolute. 
> These should instead be prefixed with `rel_path_to_root`. (The net effect is 
> that the language-specific code blocks aren't rendering properly on any of 
> the pages.)
> Also, the images in the Structured Streaming programming guide use `/img`, 
> which is not correct. Those also need `rel_path_to_root`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-49944) Fix Javascript and image imports in the documentation site

2024-10-21 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-49944:


Assignee: Neil Ramaswamy

> Fix Javascript and image imports in the documentation site
> --
>
> Key: SPARK-49944
> URL: https://issues.apache.org/jira/browse/SPARK-49944
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Neil Ramaswamy
>Assignee: Neil Ramaswamy
>Priority: Major
>  Labels: pull-request-available
>
> In [SPARK-49378|https://github.com/apache/spark/pull/47864/], we introduced a 
> change to break apart the Structured Streaming programming guide. This 
> created several pages under `/streaming`.
> To make this change work, we had to modify some file paths in script files; 
> previously, they relied on the fact that all pages were siblings. But 
> after the nesting under `/streaming` was added, this assumption was broken. We 
> introduced the `rel_path_to_root` Jekyll variable to make this work, and we 
> used it in most places.
> However, we [inadvertently 
> modified|https://github.com/apache/spark/pull/47864/files#diff-729ad9c4e852768f70b7c45195e7e5f8271a7f3146df73045e441d024b907819R201-R202]
>  the paths to our main Javascript file and AnchorJS import to be absolute. 
> These should instead be prefixed with `rel_path_to_root`. (The net effect is 
> that the language-specific code blocks aren't rendering properly on any of 
> the pages.)
> Also, the images in the Structured Streaming programming guide use `/img`, 
> which is not correct. Those also need `rel_path_to_root`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-49996) Upgrade `mysql-connector-j` to 9.1.0

2024-10-16 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-49996:


Assignee: BingKun Pan

> Upgrade `mysql-connector-j` to 9.1.0
> 
>
> Key: SPARK-49996
> URL: https://issues.apache.org/jira/browse/SPARK-49996
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49996) Upgrade `mysql-connector-j` to 9.1.0

2024-10-16 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-49996.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 48504
[https://github.com/apache/spark/pull/48504]

> Upgrade `mysql-connector-j` to 9.1.0
> 
>
> Key: SPARK-49996
> URL: https://issues.apache.org/jira/browse/SPARK-49996
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-49991) Make HadoopMapReduceCommitProtocol respect 'mapreduce.output.basename' to generate file names

2024-10-16 Thread Kent Yao (Jira)
Kent Yao created SPARK-49991:


 Summary: Make HadoopMapReduceCommitProtocol respect 
'mapreduce.output.basename' to generate file names
 Key: SPARK-49991
 URL: https://issues.apache.org/jira/browse/SPARK-49991
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-49915) Handle zeros and ones in ReorderAssociativeOperator

2024-10-11 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-49915:


Assignee: Kent Yao

> Handle zeros and ones in ReorderAssociativeOperator
> ---
>
> Key: SPARK-49915
> URL: https://issues.apache.org/jira/browse/SPARK-49915
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49915) Handle zeros and ones in ReorderAssociativeOperator

2024-10-11 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-49915.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 48395
[https://github.com/apache/spark/pull/48395]

> Handle zeros and ones in ReorderAssociativeOperator
> ---
>
> Key: SPARK-49915
> URL: https://issues.apache.org/jira/browse/SPARK-49915
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-49915) Handle zeros and ones in ReorderAssociativeOperator

2024-10-09 Thread Kent Yao (Jira)
Kent Yao created SPARK-49915:


 Summary: Handle zeros and ones in ReorderAssociativeOperator
 Key: SPARK-49915
 URL: https://issues.apache.org/jira/browse/SPARK-49915
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49495) Document and Feature Preview on master branch via Live GitHub Pages Updates

2024-09-21 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-49495.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 48175
[https://github.com/apache/spark/pull/48175]

> Document and Feature Preview on master branch via Live GitHub Pages Updates
> ---
>
> Key: SPARK-49495
> URL: https://issues.apache.org/jira/browse/SPARK-49495
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Project Infra
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-49495) Document and Feature Preview on master branch via Live GitHub Pages Updates

2024-09-21 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-49495:


Assignee: Kent Yao

> Document and Feature Preview on master branch via Live GitHub Pages Updates
> ---
>
> Key: SPARK-49495
> URL: https://issues.apache.org/jira/browse/SPARK-49495
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Project Infra
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49495) Document and Feature Preview on master branch via Live GitHub Pages Updates

2024-09-17 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-49495.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47968
[https://github.com/apache/spark/pull/47968]

> Document and Feature Preview on master branch via Live GitHub Pages Updates
> ---
>
> Key: SPARK-49495
> URL: https://issues.apache.org/jira/browse/SPARK-49495
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Project Infra
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-49495) Document and Feature Preview on master branch via Live GitHub Pages Updates

2024-09-17 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-49495:


Assignee: Kent Yao

> Document and Feature Preview on master branch via Live GitHub Pages Updates
> ---
>
> Key: SPARK-49495
> URL: https://issues.apache.org/jira/browse/SPARK-49495
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Project Infra
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-49508) Optimized hadoop-aws dependency, aws-java-sdk-bundle jar is too large

2024-09-06 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-49508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879827#comment-17879827
 ] 

Kent Yao commented on SPARK-49508:
--

That's what the 'provided' scope is for. Spark already provides a lot of official 
images, and you can build yours with them as base images.

> Optimized hadoop-aws dependency, aws-java-sdk-bundle jar is too large
> -
>
> Key: SPARK-49508
> URL: https://issues.apache.org/jira/browse/SPARK-49508
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0, 3.5.2
>Reporter: melin
>Priority: Major
> Attachments: image-2024-09-06-17-29-33-066.png
>
>
> The aws-java-sdk-bundle jar is too large; it doubles the size of the Spark 
> image. hadoop-aws only requires aws-java-sdk-s3 and 
> aws-java-sdk-dynamodb.
>  
> {code:xml}
> <dependency>
>     <groupId>org.apache.hadoop</groupId>
>     <artifactId>hadoop-aws</artifactId>
>     <version>${hadoop.version}</version>
>     <exclusions>
>         <exclusion>
>             <groupId>com.amazonaws</groupId>
>             <artifactId>aws-java-sdk-bundle</artifactId>
>         </exclusion>
>     </exclusions>
> </dependency>
> <dependency>
>     <groupId>com.amazonaws</groupId>
>     <artifactId>aws-java-sdk-s3</artifactId>
>     <version>${awssdk.v1.version}</version>
> </dependency>
> <dependency>
>     <groupId>com.amazonaws</groupId>
>     <artifactId>aws-java-sdk-dynamodb</artifactId>
>     <version>${awssdk.v1.version}</version>
> </dependency>
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-49508) Optimized hadoop-aws dependency, aws-java-sdk-bundle jar is too large

2024-09-06 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-49508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879818#comment-17879818
 ] 

Kent Yao commented on SPARK-49508:
--

I don't see these in the tar.gz files. Are they only pulled in when you turn on 
the hadoop-cloud profile?

> Optimized hadoop-aws dependency, aws-java-sdk-bundle jar is too large
> -
>
> Key: SPARK-49508
> URL: https://issues.apache.org/jira/browse/SPARK-49508
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0, 3.5.2
>Reporter: melin
>Priority: Major
>
> The aws-java-sdk-bundle jar is too large; it doubles the size of the Spark 
> image. hadoop-aws only requires aws-java-sdk-s3 and 
> aws-java-sdk-dynamodb.
>  
> {code:xml}
> <dependency>
>     <groupId>org.apache.hadoop</groupId>
>     <artifactId>hadoop-aws</artifactId>
>     <version>${hadoop.version}</version>
>     <exclusions>
>         <exclusion>
>             <groupId>com.amazonaws</groupId>
>             <artifactId>aws-java-sdk-bundle</artifactId>
>         </exclusion>
>     </exclusions>
> </dependency>
> <dependency>
>     <groupId>com.amazonaws</groupId>
>     <artifactId>aws-java-sdk-s3</artifactId>
>     <version>${awssdk.v1.version}</version>
> </dependency>
> <dependency>
>     <groupId>com.amazonaws</groupId>
>     <artifactId>aws-java-sdk-dynamodb</artifactId>
>     <version>${awssdk.v1.version}</version>
> </dependency>
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49408) Poor performance in ProjectingInternalRow

2024-09-05 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-49408.
--
Fix Version/s: 3.4.4
   4.0.0
   3.5.3
   Resolution: Fixed

Issue resolved by pull request 47890
[https://github.com/apache/spark/pull/47890]

> Poor performance in ProjectingInternalRow
> -
>
> Key: SPARK-49408
> URL: https://issues.apache.org/jira/browse/SPARK-49408
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.2
>Reporter: Frank Wong
>Assignee: Frank Wong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.4, 4.0.0, 3.5.3
>
> Attachments: 20240827-172739.html
>
>
> In {*}ProjectingInternalRow{*}, *colOrdinals* is passed as a {_}List{_}. 
> According to the Scala documentation, the _{{apply}}_ method of _{{List}}_ 
> has linear time complexity, and it is used in every method of 
> ProjectingInternalRow, for every row. This can have a significant impact on 
> performance.
> The following flame graph was captured during a {*}MERGE INTO{*} SQL statement. 
> A considerable amount of time was spent in {{{}List.apply{}}}. Changing this to 
> _{{IndexedSeq}}_ would improve performance.
>  
> [^20240827-172739.html]
> [https://docs.scala-lang.org/overviews/collections-2.13/performance-characteristics.html]
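
As a quick illustration of the indexing cost the report describes (a sketch,
not code from the PR):

{code:scala}
// List.apply(i) walks i cons cells (O(n)); an IndexedSeq indexes in
// effectively constant time. Sketch only, not code from pull request 47890.
val n = 100000
val asList: List[Int] = (0 until n).toList
val asIndexed: IndexedSeq[Int] = (0 until n).toIndexedSeq

asList(n - 1)     // traverses ~100k nodes to reach the last element
asIndexed(n - 1)  // direct lookup, which is why IndexedSeq helps here
{code}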



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-49408) Poor performance in ProjectingInternalRow

2024-09-05 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-49408:


Assignee: Frank Wong

> Poor performance in ProjectingInternalRow
> -
>
> Key: SPARK-49408
> URL: https://issues.apache.org/jira/browse/SPARK-49408
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.2
>Reporter: Frank Wong
>Assignee: Frank Wong
>Priority: Major
>  Labels: pull-request-available
> Attachments: 20240827-172739.html
>
>
> In {*}ProjectingInternalRow{*}, *colOrdinals* is passed as a {_}List{_}. 
> According to the Scala documentation, the _{{apply}}_ method of _{{List}}_ 
> has linear time complexity, and it is used in every method of 
> ProjectingInternalRow, for every row. This can have a significant impact on 
> performance.
> The following flame graph was captured during a {*}MERGE INTO{*} SQL statement. 
> A considerable amount of time was spent in {{{}List.apply{}}}. Changing this to 
> _{{IndexedSeq}}_ would improve performance.
>  
> [^20240827-172739.html]
> [https://docs.scala-lang.org/overviews/collections-2.13/performance-characteristics.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49445) Support show tooltip in the progress bar of UI

2024-09-04 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-49445.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47908
[https://github.com/apache/spark/pull/47908]

> Support show tooltip in the progress bar of UI
> --
>
> Key: SPARK-49445
> URL: https://issues.apache.org/jira/browse/SPARK-49445
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 4.0.0
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-49470) Update dataTables from 1.13.5 to 1.13.11

2024-09-03 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-49470:


Assignee: Kent Yao

> Update dataTables from 1.13.5 to 1.13.11
> 
>
> Key: SPARK-49470
> URL: https://issues.apache.org/jira/browse/SPARK-49470
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Web UI
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49470) Update dataTables from 1.13.5 to 1.13.11

2024-09-03 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-49470.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47938
[https://github.com/apache/spark/pull/47938]

> Update dataTables from 1.13.5 to 1.13.11
> 
>
> Key: SPARK-49470
> URL: https://issues.apache.org/jira/browse/SPARK-49470
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Web UI
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-49495) Document and Feature Preview on master branch via Live GitHub Pages Updates

2024-09-02 Thread Kent Yao (Jira)
Kent Yao created SPARK-49495:


 Summary: Document and Feature Preview on master branch via Live 
GitHub Pages Updates
 Key: SPARK-49495
 URL: https://issues.apache.org/jira/browse/SPARK-49495
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, Project Infra
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-49494) Using spark logos from spark-website in docs

2024-09-02 Thread Kent Yao (Jira)
Kent Yao created SPARK-49494:


 Summary: Using spark logos from spark-website in docs
 Key: SPARK-49494
 URL: https://issues.apache.org/jira/browse/SPARK-49494
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49378) Break apart the Structured Streaming Programming Guide

2024-08-30 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-49378.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47864
[https://github.com/apache/spark/pull/47864]

> Break apart the Structured Streaming Programming Guide
> --
>
> Key: SPARK-49378
> URL: https://issues.apache.org/jira/browse/SPARK-49378
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Neil Ramaswamy
>Assignee: Neil Ramaswamy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> As discussed in [this email 
> thread|https://lists.apache.org/thread/tbqxg28w4njsp4ws5gbssfckx5zydbdj], we 
> should break apart the Structured Streaming programming guide to make it 
> easier for readers to consume (it will also help SEO). 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-49378) Break apart the Structured Streaming Programming Guide

2024-08-30 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-49378:


Assignee: Neil Ramaswamy

> Break apart the Structured Streaming Programming Guide
> --
>
> Key: SPARK-49378
> URL: https://issues.apache.org/jira/browse/SPARK-49378
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Neil Ramaswamy
>Assignee: Neil Ramaswamy
>Priority: Major
>  Labels: pull-request-available
>
> As discussed in [this email 
> thread|https://lists.apache.org/thread/tbqxg28w4njsp4ws5gbssfckx5zydbdj], we 
> should break apart the Structured Streaming programming guide to make it 
> easier for readers to consume (it will also help SEO). 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-49470) Update dataTables from 1.13.5 to 1.13.11

2024-08-30 Thread Kent Yao (Jira)
Kent Yao created SPARK-49470:


 Summary: Update dataTables from 1.13.5 to 1.13.11
 Key: SPARK-49470
 URL: https://issues.apache.org/jira/browse/SPARK-49470
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Web UI
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-49459) Support CRC32C for Shuffle Checksum

2024-08-29 Thread Kent Yao (Jira)
Kent Yao created SPARK-49459:


 Summary: Support CRC32C for Shuffle Checksum
 Key: SPARK-49459
 URL: https://issues.apache.org/jira/browse/SPARK-49459
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 4.0.0
Reporter: Kent Yao
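
No description is attached; as context, a hedged sketch of the JDK 9+ API such
support would presumably build on (java.util.zip.CRC32C shares the Checksum
interface with CRC32):

{code:scala}
// Sketch only; not Spark code. java.util.zip.CRC32C (JDK 9+) implements the
// same Checksum interface as CRC32 and typically maps to hardware CRC32C
// instructions on modern CPUs.
import java.util.zip.{CRC32, CRC32C, Checksum}

def checksumOf(data: Array[Byte], algorithm: String): Long = {
  val c: Checksum = if (algorithm == "CRC32C") new CRC32C() else new CRC32()
  c.update(data, 0, data.length)
  c.getValue
}
{code}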






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46037) When Left Join build Left, ShuffledHashJoinExec may result in incorrect results

2024-08-28 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-46037:
-
Priority: Blocker  (was: Minor)

> When Left Join build Left, ShuffledHashJoinExec may result in incorrect 
> results
> ---
>
> Key: SPARK-46037
> URL: https://issues.apache.org/jira/browse/SPARK-46037
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: mcdull_zhang
>Priority: Blocker
>  Labels: correctness, pull-request-available
>
> When a left join uses the left side as the build side and codegen is turned 
> off, ShuffledHashJoinExec may produce incorrect results.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49405) Restrict charsets in JsonOptions

2024-08-27 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-49405.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47887
[https://github.com/apache/spark/pull/47887]

> Restrict charsets in JsonOptions
> 
>
> Key: SPARK-49405
> URL: https://issues.apache.org/jira/browse/SPARK-49405
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-49405) Restrict charsets in JsonOptions

2024-08-27 Thread Kent Yao (Jira)
Kent Yao created SPARK-49405:


 Summary: Restrict charsets in JsonOptions
 Key: SPARK-49405
 URL: https://issues.apache.org/jira/browse/SPARK-49405
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49314) Upgrade `h2` to 2.3.232, `postgresql` to 42.7.4 and `mssql` to 12.8.1.jre11

2024-08-26 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-49314.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47810
[https://github.com/apache/spark/pull/47810]

> Upgrade `h2` to 2.3.232, `postgresql` to 42.7.4 and `mssql` to 12.8.1.jre11
> ---
>
> Key: SPARK-49314
> URL: https://issues.apache.org/jira/browse/SPARK-49314
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Assignee: Wei Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-49314) Upgrade `h2` to 2.3.232, `postgresql` to 42.7.4 and `mssql` to 12.8.1.jre11

2024-08-26 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-49314:


Assignee: Wei Guo

> Upgrade `h2` to 2.3.232, `postgresql` to 42.7.4 and `mssql` to 12.8.1.jre11
> ---
>
> Key: SPARK-49314
> URL: https://issues.apache.org/jira/browse/SPARK-49314
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Assignee: Wei Guo
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49180) Optimize spark-website and doc release size

2024-08-25 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-49180.
--
  Assignee: Kent Yao
Resolution: Fixed

The repo size has been reduced from 17G to 6.5G, so I am marking this issue as resolved:
{code:shell}
# before
du -sh .
17G .
# after
du -sh .
6.5G .
{code}


> Optimize spark-website and doc release size
> ---
>
> Key: SPARK-49180
> URL: https://issues.apache.org/jira/browse/SPARK-49180
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49209) Archive Spark Documentations in Apache Archives

2024-08-25 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-49209.
--
  Assignee: Kent Yao
Resolution: Fixed

Archived at https://archive.apache.org/dist/spark/docs/

> Archive Spark Documentations in Apache Archives
> ---
>
> Key: SPARK-49209
> URL: https://issues.apache.org/jira/browse/SPARK-49209
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> To address the issue of the Spark website repository size 
> reaching the storage limit for GitHub-hosted runners [1], I suggest 
> enhancing step [2] in our release process by relocating the 
> documentation releases from the dev[3] directory to the release 
> directory[4]. Then they would be captured by the Apache Archives 
> service[5] to create permanent links, which would be alternative 
> endpoints for our documentation, like
> https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-docs/_site/index.html 
> for 
> https://spark.apache.org/docs/3.5.2/index.html
> Note that the previous example still uses the staging repository, 
> which will become
> https://archive.apache.org/dist/spark/docs/3.5.2/index.html.
> For older releases hosted on the Spark website [6], we also need to
> upload them via SVN manually.
> After that, when we reach the threshold again, we can delete some of 
> the old ones on page [6], and update their links on page [7] or use
> redirection.
> [1] https://lists.apache.org/thread/o0w4gqoks23xztdmjjj26jkp1yyg2bvq
> [2] 
> https://spark.apache.org/release-process.html#upload-to-apache-release-directory
> [3] https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-docs/
> [4] https://dist.apache.org/repos/dist/release/spark/docs/3.5.2
> [5] https://archive.apache.org/dist/spark/
> [6] https://github.com/apache/spark-website/tree/asf-site/site/docs
> [7] https://spark.apache.org/documentation.html
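
For illustration, the relocation in step [2] could look like the following
server-side SVN move (a sketch based on links [3] and [4] above, not the
official release-process command):

{code:shell}
# Sketch only, derived from links [3] and [4]; not the documented release step.
# A server-side move publishes the staged docs into the release tree, where
# the Apache Archives service [5] then mirrors them permanently.
svn move -m "Publish 3.5.2 docs to the release directory" \
  https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-docs \
  https://dist.apache.org/repos/dist/release/spark/docs/3.5.2
{code}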



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-49039) Reset checkbox when executor metrics are loaded in the Stages tab

2024-08-16 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-49039:


Assignee: dzcxzl

> Reset checkbox when executor metrics are loaded in the Stages tab
> -
>
> Key: SPARK-49039
> URL: https://issues.apache.org/jira/browse/SPARK-49039
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.0, 3.2.0
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Trivial
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49039) Reset checkbox when executor metrics are loaded in the Stages tab

2024-08-16 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-49039.
--
Fix Version/s: 3.4.4
   4.0.0
   3.5.3
   Resolution: Fixed

Issue resolved by pull request 47519
[https://github.com/apache/spark/pull/47519]

> Reset checkbox when executor metrics are loaded in the Stages tab
> -
>
> Key: SPARK-49039
> URL: https://issues.apache.org/jira/browse/SPARK-49039
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.0, 3.2.0
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 3.4.4, 4.0.0, 3.5.3
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45590) okio-1.15.0 CVE-2023-3635

2024-08-15 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-45590:
-
Fix Version/s: 4.0.0
   3.4.4

> okio-1.15.0 CVE-2023-3635
> -
>
> Key: SPARK-45590
> URL: https://issues.apache.org/jira/browse/SPARK-45590
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 3.4.0, 3.5.0, 4.0.0
>Reporter: Colm O hEigeartaigh
>Assignee: Gabor Roczei
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.4.4, 3.5.3
>
>
> CVE-2023-3635 is being flagged against okio-1.15.0 present in the Spark 3.5.0 
> build:
>  * ./spark-3.5.0-bin-without-hadoop/jars/okio-1.15.0.jar
>  * ./spark-3.5.0-bin-hadoop3/jars/okio-1.15.0.jar
> I don't see okio in the dependency tree; it must be coming in via some 
> profile.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45590) okio-1.15.0 CVE-2023-3635

2024-08-15 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-45590.
--
Fix Version/s: 3.5.3
   Resolution: Fixed

Issue resolved by pull request 47769
[https://github.com/apache/spark/pull/47769]

> okio-1.15.0 CVE-2023-3635
> -
>
> Key: SPARK-45590
> URL: https://issues.apache.org/jira/browse/SPARK-45590
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 3.4.0, 3.5.0, 4.0.0
>Reporter: Colm O hEigeartaigh
>Assignee: Gabor Roczei
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.5.3
>
>
> CVE-2023-3635 is being flagged against okio-1.15.0 present in the Spark 3.5.0 
> build:
>  * ./spark-3.5.0-bin-without-hadoop/jars/okio-1.15.0.jar
>  * ./spark-3.5.0-bin-hadoop3/jars/okio-1.15.0.jar
> I don't see okio in the dependency tree; it must be coming in via some 
> profile.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45590) okio-1.15.0 CVE-2023-3635

2024-08-15 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-45590:


Assignee: Gabor Roczei

> okio-1.15.0 CVE-2023-3635
> -
>
> Key: SPARK-45590
> URL: https://issues.apache.org/jira/browse/SPARK-45590
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 3.4.0, 3.5.0, 4.0.0
>Reporter: Colm O hEigeartaigh
>Assignee: Gabor Roczei
>Priority: Minor
>  Labels: pull-request-available
>
> CVE-2023-3635 is being flagged against okio-1.15.0 present in the Spark 3.5.0 
> build:
>  * ./spark-3.5.0-bin-without-hadoop/jars/okio-1.15.0.jar
>  * ./spark-3.5.0-bin-hadoop3/jars/okio-1.15.0.jar
> I don't see okio in the dependency tree; it must be coming in via some 
> profile.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49242) Upgrade commons-cli to 1.9.0

2024-08-14 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-49242.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47764
[https://github.com/apache/spark/pull/47764]

> Upgrade commons-cli to 1.9.0
> 
>
> Key: SPARK-49242
> URL: https://issues.apache.org/jira/browse/SPARK-49242
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48947) Use lowercased charset name to decrease cache missing in Charset.forName

2024-08-14 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-48947.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47420
[https://github.com/apache/spark/pull/47420]

> Use lowercased charset name to decrease cache missing in Charset.forName
> 
>
> Key: SPARK-48947
> URL: https://issues.apache.org/jira/browse/SPARK-48947
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
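
The issue has no description; for context, a hedged sketch of the technique the
title names (assuming the JDK behavior that Charset.forName keeps a small cache
keyed by the exact name string passed in):

{code:scala}
// Sketch only; not the SPARK-48947 patch. Charset.forName caches recent
// lookups by the literal name string, so normalizing the case first keeps
// "UTF-8", "utf-8", "Utf-8", ... from thrashing that cache.
import java.nio.charset.Charset
import java.util.Locale

def toCharset(name: String): Charset =
  Charset.forName(name.toLowerCase(Locale.ROOT))
{code}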




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49205) KeyGroupedPartitioning should inherit HashPartitioningLike

2024-08-13 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-49205.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47734
[https://github.com/apache/spark/pull/47734]

> KeyGroupedPartitioning should inherit HashPartitioningLike
> --
>
> Key: SPARK-49205
> URL: https://issues.apache.org/jira/browse/SPARK-49205
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-49205) KeyGroupedPartitioning should inherit HashPartitioningLike

2024-08-13 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-49205:


Assignee: XiDuo You

> KeyGroupedPartitioning should inherit HashPartitioningLike
> --
>
> Key: SPARK-49205
> URL: https://issues.apache.org/jira/browse/SPARK-49205
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-49209) Archive Spark Documentations in Apache Archives

2024-08-12 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-49209:
-
Description: 

To address the issue of the Spark website repository size 
reaching the storage limit for GitHub-hosted runners [1], I suggest 
enhancing step [2] in our release process by relocating the 
documentation releases from the dev[3] directory to the release 
directory[4]. Then they would be captured by the Apache Archives 
service[5] to create permanent links, which would be alternative 
endpoints for our documentation, like

https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-docs/_site/index.html 
for 
https://spark.apache.org/docs/3.5.2/index.html

Note that the previous example still uses the staging repository, 
which will become
https://archive.apache.org/dist/spark/docs/3.5.2/index.html.

For older releases hosted on the Spark website [6], we also need to
upload them via SVN manually.

After that, when we reach the threshold again, we can delete some of 
the old ones on page [6], and update their links on page [7] or use
redirection.

[1] https://lists.apache.org/thread/o0w4gqoks23xztdmjjj26jkp1yyg2bvq
[2] 
https://spark.apache.org/release-process.html#upload-to-apache-release-directory
[3] https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-docs/
[4] https://dist.apache.org/repos/dist/release/spark/docs/3.5.2
[5] https://archive.apache.org/dist/spark/
[6] https://github.com/apache/spark-website/tree/asf-site/site/docs
[7] https://spark.apache.org/documentation.html

> Archive Spark Documentations in Apache Archives
> ---
>
> Key: SPARK-49209
> URL: https://issues.apache.org/jira/browse/SPARK-49209
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>
> To address the issue of the Spark website repository size 
> reaching the storage limit for GitHub-hosted runners [1], I suggest 
> enhancing step [2] in our release process by relocating the 
> documentation releases from the dev[3] directory to the release 
> directory[4]. Then they would be captured by the Apache Archives 
> service[5] to create permanent links, which would be alternative 
> endpoints for our documentation, like
> https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-docs/_site/index.html 
> for 
> https://spark.apache.org/docs/3.5.2/index.html
> Note that the previous example still uses the staging repository, 
> which will become
> https://archive.apache.org/dist/spark/docs/3.5.2/index.html.
> For older releases hosted on the Spark website [6], we also need to
> upload them via SVN manually.
> After that, when we reach the threshold again, we can delete some of 
> the old ones on page [6], and update their links on page [7] or use
> redirection.
> [1] https://lists.apache.org/thread/o0w4gqoks23xztdmjjj26jkp1yyg2bvq
> [2] 
> https://spark.apache.org/release-process.html#upload-to-apache-release-directory
> [3] https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-docs/
> [4] https://dist.apache.org/repos/dist/release/spark/docs/3.5.2
> [5] https://archive.apache.org/dist/spark/
> [6] https://github.com/apache/spark-website/tree/asf-site/site/docs
> [7] https://spark.apache.org/documentation.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-49209) Archive Spark Documentations in Apache Archives

2024-08-12 Thread Kent Yao (Jira)
Kent Yao created SPARK-49209:


 Summary: Archive Spark Documentations in Apache Archives
 Key: SPARK-49209
 URL: https://issues.apache.org/jira/browse/SPARK-49209
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-49181) Remove site/docs/{version}/api/python/_sources folder

2024-08-10 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-49181:


Assignee: Kent Yao

> Remove site/docs/{version}/api/python/_sources folder
> -
>
> Key: SPARK-49181
> URL: https://issues.apache.org/jira/browse/SPARK-49181
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49181) Remove site/docs/{version}/api/python/_sources folder

2024-08-10 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-49181.
--
Resolution: Fixed

> Remove site/docs/{version}/api/python/_sources folder
> -
>
> Key: SPARK-49181
> URL: https://issues.apache.org/jira/browse/SPARK-49181
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11583) Make MapStatus use less memory usage

2024-08-10 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-11583:


Assignee: Kent Yao  (was: Davies Liu)

> Make MapStatus use less memory usage
> ---
>
> Key: SPARK-11583
> URL: https://issues.apache.org/jira/browse/SPARK-11583
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Reporter: Kent Yao 2
>Assignee: Kent Yao
>Priority: Major
> Fix For: 1.6.0
>
>
> In the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, as I 
> said, using BitSet can save ≈20% memory usage compared to RoaringBitMap. 
> For a Spark job that contains quite a lot of tasks, 20% seems a drop in the ocean. 
> Essentially, BitSet uses long[]; for example, a BitSet[200k] = long[3125].
> So we can instead use a HashSet[Int] to store reduceIds (when non-empty blocks 
> are dense, store the reduceIds of empty blocks; when sparse, store the 
> non-empty ones). 
> For dense cases: if HashSet[Int](numNonEmptyBlocks).size < 
> BitSet[totalBlockNum], I use MapStatusTrackingNoEmptyBlocks.
> For sparse cases: if HashSet[Int](numEmptyBlocks).size < 
> BitSet[totalBlockNum], I use MapStatusTrackingEmptyBlocks.
> Sparse case (299/300 blocks are empty):
> sc.makeRDD(1 to 3, 3000).groupBy(x=>x).top(5)
> Dense case (no block is empty):
> sc.makeRDD(1 to 900, 3000).groupBy(x=>x).top(5)
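
A minimal Scala sketch of the dense/sparse selection described above (the class names and the per-entry byte cost are illustrative assumptions, not the actual Spark internals):

{code:scala}
// Illustrative only: pick the cheaper way to record which reduce blocks are empty.
sealed trait EmptyBlockTracker { def isEmpty(reduceId: Int): Boolean }

// Dense output (few empty blocks): remember the empty ids.
final case class TrackEmpty(empty: Set[Int]) extends EmptyBlockTracker {
  def isEmpty(reduceId: Int): Boolean = empty.contains(reduceId)
}

// Sparse output (few non-empty blocks): remember the non-empty ids.
final case class TrackNonEmpty(nonEmpty: Set[Int]) extends EmptyBlockTracker {
  def isEmpty(reduceId: Int): Boolean = !nonEmpty.contains(reduceId)
}

object EmptyBlockTracker {
  private val BytesPerEntry = 40 // assumed HashSet[Int] cost per entry
  def apply(totalBlocks: Int, emptyIds: Set[Int]): EmptyBlockTracker = {
    val bitSetBytes = totalBlocks / 8 // a BitSet costs one bit per block
    if (emptyIds.size * BytesPerEntry < bitSetBytes) {
      TrackEmpty(emptyIds)
    } else {
      // A real implementation would fall back to a BitSet when neither set wins.
      TrackNonEmpty((0 until totalBlocks).toSet -- emptyIds)
    }
  }
}
{code}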



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-49182) Stop publishing site/docs/{version}/api/python/_sources

2024-08-09 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-49182:
-
Fix Version/s: 3.5.3
   (was: 3.5.2)

> Stop publishing site/docs/{version}/api/python/_sources
> 
>
> Key: SPARK-49182
> URL: https://issues.apache.org/jira/browse/SPARK-49182
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.4.4, 3.5.3
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-49182) Stop publishing site/docs/{version}/api/python/_sources

2024-08-09 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-49182:


Assignee: Kent Yao

> Stop publishing site/docs/{version}/api/python/_sources
> 
>
> Key: SPARK-49182
> URL: https://issues.apache.org/jira/browse/SPARK-49182
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49182) Stop publishing site/docs/{version}/api/python/_sources

2024-08-09 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-49182.
--
Fix Version/s: 3.4.4
   3.5.2
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 47686
[https://github.com/apache/spark/pull/47686]

> Stop publishing site/docs/{version}/api/python/_sources
> 
>
> Key: SPARK-49182
> URL: https://issues.apache.org/jira/browse/SPARK-49182
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.4, 3.5.2, 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49174) Exclude the dir `docs/util` from `_site`

2024-08-09 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-49174.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47678
[https://github.com/apache/spark/pull/47678]

> Exclude the dir `docs/util` from `_site`
> 
>
> Key: SPARK-49174
> URL: https://issues.apache.org/jira/browse/SPARK-49174
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-49182) Stop publishing site/docs/{version}/api/python/_sources

2024-08-09 Thread Kent Yao (Jira)
Kent Yao created SPARK-49182:


 Summary: Stop publishing site/docs/{version}/api/python/_sources
 Key: SPARK-49182
 URL: https://issues.apache.org/jira/browse/SPARK-49182
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-49181) Remove site/docs/{version}/api/python/_sources folder

2024-08-09 Thread Kent Yao (Jira)
Kent Yao created SPARK-49181:


 Summary: Remove site/docs/{version}/api/python/_sources folder
 Key: SPARK-49181
 URL: https://issues.apache.org/jira/browse/SPARK-49181
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-49180) Optimize spark-website and doc release size

2024-08-09 Thread Kent Yao (Jira)
Kent Yao created SPARK-49180:


 Summary: Optimize spark-website and doc release size
 Key: SPARK-49180
 URL: https://issues.apache.org/jira/browse/SPARK-49180
 Project: Spark
  Issue Type: Umbrella
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49147) Mark KryoRegistrator as DeveloperApi interface

2024-08-08 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-49147.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47657
[https://github.com/apache/spark/pull/47657]

> Mark KryoRegistrator as DeveloperApi interface
> --
>
> Key: SPARK-49147
> URL: https://issues.apache.org/jira/browse/SPARK-49147
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Rob Reeves
>Assignee: Rob Reeves
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Trait org.apache.spark.serializer.KryoRegistrator is a public interface 
> because it is exposed via config "spark.kryo.registrator" since version 
> 0.5.0. It should have an annotation to describe its stability.
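
For context, a minimal sketch of how user code reaches this trait (the Point class and registrator name are illustrative):

{code:scala}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

case class Point(x: Double, y: Double) // hypothetical user type

class PointRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[Point]) // registered classes serialize without full class names
  }
}

// The trait is effectively public API because it is wired up purely via config:
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[PointRegistrator].getName)
{code}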



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-49147) Mark KryoRegistrator as DeveloperApi interface

2024-08-08 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-49147:


Assignee: Rob Reeves

> Mark KryoRegistrator as DeveloperApi interface
> --
>
> Key: SPARK-49147
> URL: https://issues.apache.org/jira/browse/SPARK-49147
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Rob Reeves
>Assignee: Rob Reeves
>Priority: Major
>  Labels: pull-request-available
>
> Trait org.apache.spark.serializer.KryoRegistrator is a public interface 
> because it is exposed via config "spark.kryo.registrator" since version 
> 0.5.0. It should have an annotation to describe its stability.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49154) Upgrade `Volcano` to 1.9.0

2024-08-07 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-49154.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47661
[https://github.com/apache/spark/pull/47661]

> Upgrade `Volcano` to 1.9.0
> --
>
> Key: SPARK-49154
> URL: https://issues.apache.org/jira/browse/SPARK-49154
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Kubernetes, Project Infra, Tests
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49124) Upgrade tink to 1.14.1

2024-08-07 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-49124.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47632
[https://github.com/apache/spark/pull/47632]

> Upgrade tink to 1.14.1
> --
>
> Key: SPARK-49124
> URL: https://issues.apache.org/jira/browse/SPARK-49124
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Assignee: Wei Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-49124) Upgrade tink to 1.14.1

2024-08-07 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-49124:


Assignee: Wei Guo

> Upgrade tink to 1.14.1
> --
>
> Key: SPARK-49124
> URL: https://issues.apache.org/jira/browse/SPARK-49124
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Assignee: Wei Guo
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-49141) Mark variant as Hive-incompatible data type

2024-08-07 Thread Kent Yao (Jira)
Kent Yao created SPARK-49141:


 Summary: Mark variant as Hive-incompatible data type
 Key: SPARK-49141
 URL: https://issues.apache.org/jira/browse/SPARK-49141
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-49134) Support retry for deploying artifacts to Nexus staging repository

2024-08-06 Thread Kent Yao (Jira)
Kent Yao created SPARK-49134:


 Summary: Support retry for deploying artifacts to Nexus staging 
repository
 Key: SPARK-49134
 URL: https://issues.apache.org/jira/browse/SPARK-49134
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49120) Bump Gson 2.11.0

2024-08-06 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-49120.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47627
[https://github.com/apache/spark/pull/47627]

> Bump Gson 2.11.0
> 
>
> Key: SPARK-49120
> URL: https://issues.apache.org/jira/browse/SPARK-49120
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-49099) Refactor CatalogManager.setCurrentNamespace

2024-08-06 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-49099:
-
Affects Version/s: 3.5.1

> Refactor CatalogManager.setCurrentNamespace
> ---
>
> Key: SPARK-49099
> URL: https://issues.apache.org/jira/browse/SPARK-49099
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.1
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-49099) Refactor CatalogManager.setCurrentNamespace

2024-08-06 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-49099:
-
Issue Type: Bug  (was: Improvement)

> Refactor CatalogManager.setCurrentNamespace
> ---
>
> Key: SPARK-49099
> URL: https://issues.apache.org/jira/browse/SPARK-49099
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49099) Refactor CatalogManager.setCurrentNamespace

2024-08-06 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-49099.
--
Fix Version/s: 4.0.0
   3.5.2
   Resolution: Fixed

> Refactor CatalogManager.setCurrentNamespace
> ---
>
> Key: SPARK-49099
> URL: https://issues.apache.org/jira/browse/SPARK-49099
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49091) Some broadcasts cannot be cleared from memory storage

2024-08-05 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-49091.
--
  Assignee: Zhen Wang
Resolution: Not A Problem

> Some broadcasts cannot be cleared from memory storage
> -
>
> Key: SPARK-49091
> URL: https://issues.apache.org/jira/browse/SPARK-49091
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0, 3.5.0, 4.0.0
>Reporter: Yuming Wang
>Assignee: Zhen Wang
>Priority: Major
> Attachments: SPARK-49091.patch, driver heap.png, 
> image-2024-08-02-20-45-48-252.png, image-2024-08-02-20-52-33-896.png
>
>
> Please apply this patch ([^SPARK-49091.patch]) to reproduce this issue. This 
> issue may cause a driver memory leak.
>  !driver heap.png|thumbnail!
> This issue was introduced by SPARK-41914.
> Before SPARK-41914:
> {noformat}
> [info] BroadcastCleanerSuite:
> 10:30:16.228 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load 
> native-hadoop library for your platform... using builtin-java classes where 
> applicable
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 0, names: 
> [info] - Test broadcast cleaner (1 minute, 4 seconds)
> 10:31:21.552 WARN org.apache.spark.sql.BroadcastCleanerSuite:
> {noformat}
> After SPARK-41914:
> {noformat}
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 2, names: broadcast_1_piece0, broadcast_1
> entries size: 2, names: broadcast_1_piece0, broadcast_1
> entries size: 2, names: broadcast_1_piece0, broadcast_1
> entries size: 2, names: broadcast_1_piece0, broadcast_1
> entries size: 2, names: broadcast_1_piece0, broadcast_1
> entries size: 2, names: broadcast_1_piece0, broadcast_1
> entries size: 2, names: broadcast_1_piece0, broadcast_1
> entries size: 2, names: broadcast_1_piece0, broadcast_1
> entries size: 2, names: broadcast_1_piece0, broadcast_1
> entries size: 2, names: broadcast_1_piece0, broadcast_1
> entries size: 2, names: broadcast_1_piece0, broadcast_1
> entries size: 2,
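
The quoted log is truncated in the archive. For context, a minimal sketch of the broadcast lifecycle such a suite exercises, using only the standard Broadcast API (the payload size is arbitrary):

{code:scala}
// Creating and using a broadcast materializes broadcast_N / broadcast_N_piece0 blocks.
val bc = spark.sparkContext.broadcast(Array.fill(1 << 20)(0.0))
println(bc.value.length)

bc.unpersist(blocking = true) // drops the cached copies on the executors
bc.destroy()                  // also releases driver-side state; bc is unusable afterwards
{code}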

[jira] [Resolved] (SPARK-49107) ROUTINE_ALREADY_EXISTS supports RoutineType

2024-08-05 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-49107.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47600
[https://github.com/apache/spark/pull/47600]

> ROUTINE_ALREADY_EXISTS supports RoutineType
> ---
>
> Key: SPARK-49107
> URL: https://issues.apache.org/jira/browse/SPARK-49107
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49108) Add `submit_pi.sh` REST API example

2024-08-05 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-49108.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47601
[https://github.com/apache/spark/pull/47601]

> Add `submit_pi.sh` REST API example
> ---
>
> Key: SPARK-49108
> URL: https://issues.apache.org/jira/browse/SPARK-49108
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-49078) Support show columns syntax in v2 table

2024-08-04 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-49078:


Assignee: xy

> Support show columns syntax in v2 table
> ---
>
> Key: SPARK-49078
> URL: https://issues.apache.org/jira/browse/SPARK-49078
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: xy
>Assignee: xy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Support show columns syntax in v2 table



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49078) Support show columns syntax in v2 table

2024-08-04 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-49078.
--
Resolution: Fixed

Issue resolved by pull request 47568
[https://github.com/apache/spark/pull/47568]

> Support show columns syntax in v2 table
> ---
>
> Key: SPARK-49078
> URL: https://issues.apache.org/jira/browse/SPARK-49078
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: xy
>Assignee: xy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Support show columns syntax in v2 table



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-49094) ignoreCorruptFiles file source option is partially supported for orc format

2024-08-02 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-49094:
-
Issue Type: Bug  (was: Improvement)

> ignoreCorruptFiles file source option is partially supported for orc format
> ---
>
> Key: SPARK-49094
> URL: https://issues.apache.org/jira/browse/SPARK-49094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.1, 3.4.3
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-49091) Some broadcasts cannot be cleared from memory storage

2024-08-02 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-49091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870519#comment-17870519
 ] 

Kent Yao commented on SPARK-49091:
--

Thank you [~wforget] for the verification; sounds reasonable to me.

> Some broadcasts cannot be cleared from memory storage
> -
>
> Key: SPARK-49091
> URL: https://issues.apache.org/jira/browse/SPARK-49091
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0, 3.5.0, 4.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: SPARK-49091.patch, driver heap.png, 
> image-2024-08-02-20-45-48-252.png, image-2024-08-02-20-52-33-896.png
>
>
> Please apply this patch ([^SPARK-49091.patch]) to reproduce this issue. This 
> issue may cause a driver memory leak.
>  !driver heap.png|thumbnail!
> This issue was introduced by SPARK-41914.
> Before SPARK-41914:
> {noformat}
> [info] BroadcastCleanerSuite:
> 10:30:16.228 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load 
> native-hadoop library for your platform... using builtin-java classes where 
> applicable
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 0, names: 
> [info] - Test broadcast cleaner (1 minute, 4 seconds)
> 10:31:21.552 WARN org.apache.spark.sql.BroadcastCleanerSuite:
> {noformat}
> After SPARK-41914:
> {noformat}
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 8, names: broadcast_0_piece0, broadcast_0, broadcast_1_piece0, 
> broadcast_2_piece0, broadcast_2, broadcast_1, broadcast_3_piece0, broadcast_3
> entries size: 2, names: broadcast_1_piece0, broadcast_1
> entries size: 2, names: broadcast_1_piece0, broadcast_1
> entries size: 2, names: broadcast_1_piece0, broadcast_1
> entries size: 2, names: broadcast_1_piece0, broadcast_1
> entries size: 2, names: broadcast_1_piece0, broadcast_1
> entries size: 2, names: broadcast_1_piece0, broadcast_1
> entries size: 2, names: broadcast_1_piece0, broadcast_1
> entries size: 2, names: broadcast_1_piece0, broadcast_1
> entries size: 2, names: broadcast_1_piece0, broadcast_1
> entries size: 2, names: broadcast_1_piece0, broadcast_1
> entries size: 2, names: broadcast_1_pi

[jira] [Updated] (SPARK-49094) ignoreCorruptFiles file source option is partially supported for orc format

2024-08-02 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-49094:
-
Affects Version/s: 3.4.3
   3.5.1

> ignoreCorruptFiles file source option is partially supported for orc format
> ---
>
> Key: SPARK-49094
> URL: https://issues.apache.org/jira/browse/SPARK-49094
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.1, 3.4.3
>Reporter: Kent Yao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-49094) ignoreCorruptFiles file source option is partially supported for orc format

2024-08-02 Thread Kent Yao (Jira)
Kent Yao created SPARK-49094:


 Summary: ignoreCorruptFiles file source option is partially 
supported for orc format
 Key: SPARK-49094
 URL: https://issues.apache.org/jira/browse/SPARK-49094
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-49000) Aggregation with DISTINCT gives wrong results when dealing with literals

2024-08-01 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-49000:
-
Fix Version/s: 3.4.4

> Aggregation with DISTINCT gives wrong results when dealing with literals
> 
>
> Key: SPARK-49000
> URL: https://issues.apache.org/jira/browse/SPARK-49000
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.3, 3.2.4, 3.5.1, 3.3.4, 3.4.3
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Critical
>  Labels: correctness, pull-request-available
> Fix For: 4.0.0, 3.4.4
>
>
> Aggregation with *DISTINCT* gives wrong results when dealing with literals. 
> It appears that this bug affects all (or most) released versions of Spark.
>  
> For example:
> {code:java}
> select count(distinct 1) from t{code}
> returns 1, while the correct result should be 0.
>  
> For reference:
> {code:java}
> select count(1) from t{code}
> returns 0, which is the correct and expected result.
>  
> In these examples, suppose that *t* is an empty table (with any columns).
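
A self-contained reproduction sketch of the two queries above, assuming an empty table t as described (the view definition and column name are illustrative):

{code:scala}
// Run in spark-shell: create an empty relation named t.
spark.sql("CREATE OR REPLACE TEMPORARY VIEW t AS SELECT 1 AS c WHERE false")

spark.sql("SELECT count(distinct 1) FROM t").show() // affected versions return 1 (wrong)
spark.sql("SELECT count(1) FROM t").show()          // returns 0 (correct)
{code}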



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-49030) Self join of a CTE seems non-deterministic

2024-08-01 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-49030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870341#comment-17870341
 ] 

Kent Yao commented on SPARK-49030:
--

I retargeted this to version 3.5.3 because we have reached a consensus that the 
priority of this issue is low, although we still have different technical 
options.

> Self join of a CTE seems non-deterministic
> --
>
> Key: SPARK-49030
> URL: https://issues.apache.org/jira/browse/SPARK-49030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
> Environment: Tested with Spark 3.4.1, 3.5.1, and 4.0.0-preview.
>Reporter: Jihoon Son
>Priority: Minor
> Fix For: 3.5.3
>
> Attachments: screenshot-1.png
>
>
> {code:java}
> WITH c AS (SELECT * FROM customer LIMIT 10)
> SELECT count(*)
> FROM c c1, c c2
> WHERE c1.c_customer_sk > c2.c_customer_sk{code}
> Suppose a self join query on a CTE such as the one above.
> Spark generates a physical plan like the one below for this query.
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[], functions=[count(1)], output=[count(1)#194L])
>    +- HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#233L])
>       +- Project
>          +- BroadcastNestedLoopJoin BuildRight, Inner, (c_customer_sk#0 > 
> c_customer_sk#214)
>             :- Filter isnotnull(c_customer_sk#0)
>             :  +- GlobalLimit 10, 0
>             :     +- Exchange SinglePartition, ENSURE_REQUIREMENTS, 
> [plan_id=256]
>             :        +- LocalLimit 10
>             :           +- FileScan parquet [c_customer_sk#0] Batched: true, 
> DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 
> paths)[file:/some/path/customer], PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct
>             +- BroadcastExchange IdentityBroadcastMode, [plan_id=263]
>                +- Filter isnotnull(c_customer_sk#214)
>                   +- GlobalLimit 10, 0
>                      +- Exchange SinglePartition, ENSURE_REQUIREMENTS, 
> [plan_id=259]
>                         +- LocalLimit 10
>                            +- FileScan parquet [c_customer_sk#214] Batched: 
> true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 
> paths)[file:/some/path/customer], PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct{code}
> Evaluating this plan produces a non-deterministic result because the limit is 
> independently pushed into the two sides of the join. Each limit can produce 
> different data, and thus the join can produce results that vary across runs.
> I understand that the query in question is not deterministic (and thus not 
> very practical) as, due to the nature of the limit in distributed engines, it 
> is not expected to produce the same result anyway across repeated runs. 
> However, I would still expect that the query plan evaluation remains 
> deterministic.
> Per extended analysis as seen below, it seems that the query plan has changed 
> at some point during optimization.
> {code:java}
> == Analyzed Logical Plan ==
> count(1): bigint
> WithCTE
> :- CTERelationDef 2, false
> :  +- SubqueryAlias c
> :     +- GlobalLimit 10
> :        +- LocalLimit 10
> :           +- Project [c_customer_sk#0, c_customer_id#1, 
> c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, 
> c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, 
> c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, 
> c_birth_month#12L, c_birth_year#13L, c_birth_country#14, c_login#15, 
> c_email_address#16, c_last_review_date_sk#17]
> :              +- SubqueryAlias customer
> :                 +- View (`customer`, [c_customer_sk#0, c_customer_id#1, 
> c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, 
> c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, 
> c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, 
> c_birth_month#12L, c_birth_year#13L, c_birth_country#14, c_login#15, 
> c_email_address#16, c_last_review_date_sk#17])
> :                    +- Relation 
> [c_customer_sk#0,c_customer_id#1,c_current_cdemo_sk#2,c_current_hdemo_sk#3,c_current_addr_sk#4,c_first_shipto_date_sk#5,c_first_sales_date_sk#6,c_salutation#7,c_first_name#8,c_last_name#9,c_preferred_cust_flag#10,c_birth_day#11L,c_birth_month#12L,c_birth_year#13L,c_birth_country#14,c_login#15,c_email_address#16,c_last_review_date_sk#17]
>  parquet
> +- Aggregate [count(1) AS count(1)#194L]
>    +- Filter (c_customer_sk#0 > c_customer_sk#176)
>       +- Join Inner
>          :- SubqueryAlias c1
>          :  +- SubqueryAlias c
>          :     +- CTERelationRef 2, true, [c_customer_sk#0, c_customer_id#1, 
> c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current

[jira] [Updated] (SPARK-49030) Self join of a CTE seems non-deterministic

2024-08-01 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-49030:
-
Fix Version/s: 3.5.3

> Self join of a CTE seems non-deterministic
> --
>
> Key: SPARK-49030
> URL: https://issues.apache.org/jira/browse/SPARK-49030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
> Environment: Tested with Spark 3.4.1, 3.5.1, and 4.0.0-preview.
>Reporter: Jihoon Son
>Priority: Minor
> Fix For: 3.5.3
>
> Attachments: screenshot-1.png
>
>
> {code:java}
> WITH c AS (SELECT * FROM customer LIMIT 10)
> SELECT count(*)
> FROM c c1, c c2
> WHERE c1.c_customer_sk > c2.c_customer_sk{code}
> Suppose a self join query on a CTE such as the one above.
> Spark generates a physical plan like the one below for this query.
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[], functions=[count(1)], output=[count(1)#194L])
>    +- HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#233L])
>       +- Project
>          +- BroadcastNestedLoopJoin BuildRight, Inner, (c_customer_sk#0 > 
> c_customer_sk#214)
>             :- Filter isnotnull(c_customer_sk#0)
>             :  +- GlobalLimit 10, 0
>             :     +- Exchange SinglePartition, ENSURE_REQUIREMENTS, 
> [plan_id=256]
>             :        +- LocalLimit 10
>             :           +- FileScan parquet [c_customer_sk#0] Batched: true, 
> DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 
> paths)[file:/some/path/customer], PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct
>             +- BroadcastExchange IdentityBroadcastMode, [plan_id=263]
>                +- Filter isnotnull(c_customer_sk#214)
>                   +- GlobalLimit 10, 0
>                      +- Exchange SinglePartition, ENSURE_REQUIREMENTS, 
> [plan_id=259]
>                         +- LocalLimit 10
>                            +- FileScan parquet [c_customer_sk#214] Batched: 
> true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 
> paths)[file:/some/path/customer], PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct{code}
> Evaluating this plan produces a non-deterministic result because the limit is 
> independently pushed into the two sides of the join. Each limit can produce 
> different data, and thus the join can produce results that vary across runs.
> I understand that the query in question is not deterministic (and thus not 
> very practical) as, due to the nature of the limit in distributed engines, it 
> is not expected to produce the same result anyway across repeated runs. 
> However, I would still expect that the query plan evaluation remains 
> deterministic.
> Per extended analysis as seen below, it seems that the query plan has changed 
> at some point during optimization.
> {code:java}
> == Analyzed Logical Plan ==
> count(1): bigint
> WithCTE
> :- CTERelationDef 2, false
> :  +- SubqueryAlias c
> :     +- GlobalLimit 10
> :        +- LocalLimit 10
> :           +- Project [c_customer_sk#0, c_customer_id#1, 
> c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, 
> c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, 
> c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, 
> c_birth_month#12L, c_birth_year#13L, c_birth_country#14, c_login#15, 
> c_email_address#16, c_last_review_date_sk#17]
> :              +- SubqueryAlias customer
> :                 +- View (`customer`, [c_customer_sk#0, c_customer_id#1, 
> c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, 
> c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, 
> c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, 
> c_birth_month#12L, c_birth_year#13L, c_birth_country#14, c_login#15, 
> c_email_address#16, c_last_review_date_sk#17])
> :                    +- Relation 
> [c_customer_sk#0,c_customer_id#1,c_current_cdemo_sk#2,c_current_hdemo_sk#3,c_current_addr_sk#4,c_first_shipto_date_sk#5,c_first_sales_date_sk#6,c_salutation#7,c_first_name#8,c_last_name#9,c_preferred_cust_flag#10,c_birth_day#11L,c_birth_month#12L,c_birth_year#13L,c_birth_country#14,c_login#15,c_email_address#16,c_last_review_date_sk#17]
>  parquet
> +- Aggregate [count(1) AS count(1)#194L]
>    +- Filter (c_customer_sk#0 > c_customer_sk#176)
>       +- Join Inner
>          :- SubqueryAlias c1
>          :  +- SubqueryAlias c
>          :     +- CTERelationRef 2, true, [c_customer_sk#0, c_customer_id#1, 
> c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, 
> c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, 
> c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, 
> c_birth_month#12L, c_birth_year#13L,

[jira] [Commented] (SPARK-48950) Corrupt data from parquet scans

2024-08-01 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870316#comment-17870316
 ] 

Kent Yao commented on SPARK-48950:
--

Thank you, [~dongjoon].

Hi [~Tom_Newton], please let us know if you have any new findings. Thank you!

> Corrupt data from parquet scans
> ---
>
> Key: SPARK-48950
> URL: https://issues.apache.org/jira/browse/SPARK-48950
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
> Environment: Spark 3.5.0
> Running on kubernetes
> Using Azure Blob storage with hierarchical namespace enabled 
>Reporter: Thomas Newton
>Priority: Major
>  Labels: correctness
> Attachments: example_task_errors.txt, job_dag.png, sql_query_plan.png
>
>
> It's very rare and non-deterministic, but since Spark 3.5.0 we have started 
> seeing a correctness bug in parquet scans when using the vectorized reader. 
> We've noticed this on double type columns where occasionally small groups 
> (typically 10s to 100s) of rows are replaced with crazy values like 
> `-1.29996470e+029, 3.56717569e-184, 7.23323243e+307, -1.05929677e+045, 
> -7.60562076e+240, -3.1806e-064, 2.89435993e-116`. I think this is the 
> result of interpreting uniform random bits as a double type. Most of my 
> testing has been on an array of double type column but we have also seen it 
> on un-nested plain double type columns. 
> I've been testing this by adding a filter that should return zero results but 
> will return non-zero if the parquet scan has problems. I've attached 
> screenshots of this from the Spark UI. 
> I did a `git bisect` and found that the problem starts with 
> [https://github.com/apache/spark/pull/39950], but I haven't yet understood 
> why. It's possible that this change is fine but it reveals a problem 
> elsewhere? I also noticed [https://github.com/apache/spark/pull/44853] 
> which appears to be a different implementation of the same thing so maybe 
> that could help. 
> It's not a major problem by itself, but another symptom appears to be that 
> Parquet scan tasks fail at a rate of approximately 0.03% with errors like 
> those in the attached `example_task_errors.txt`. If I revert 
> [https://github.com/apache/spark/pull/39950] I get exactly 0 task failures on 
> the same test. 
>  
> The problem seems to be a bit dependent on how the parquet files happen to be 
> organised on blob storage, so I don't yet have a reproducer that I can share 
> that doesn't depend on private data. 
> I tested on a pre-release 4.0.0 and the problem was still present. 
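
For reference, a sketch of the zero-result validation filter described in the report (the path, column name, and bound are assumptions, not taken from the reporter's setup):

{code:scala}
import org.apache.spark.sql.functions.{abs, col}

// Real values are known to be bounded, so an absurd magnitude can only come
// from random bits being reinterpreted as doubles by a broken scan.
val df = spark.read.parquet("/some/path/doubles")
val suspect = df.filter(abs(col("value")) > 1e100)

// Expected to be 0 on every run; a non-zero count flags the corruption.
println(s"suspect rows: ${suspect.count()}")
{code}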



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44638) Unable to read from JDBC data sources when using custom schema containing varchar

2024-08-01 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-44638:
-
Fix Version/s: 3.4.4

> Unable to read from JDBC data sources when using custom schema containing 
> varchar
> -
>
> Key: SPARK-44638
> URL: https://issues.apache.org/jira/browse/SPARK-44638
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.4, 3.3.2, 3.4.1
>Reporter: Michael Said
>Assignee: Kent Yao
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2, 3.4.4
>
>
> When querying data from JDBC databases with a custom schema containing 
> varchar, I got this error:
> {code:java}
> [23/07/14 06:12:19 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1) ( 
> executor 1): java.sql.SQLException: Unsupported type varchar(100) at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedJdbcTypeError(QueryExecutionErrors.scala:818)
>  23/07/14 06:12:21 INFO TaskSetManager: Lost task 0.1 in stage 1.0 (TID 2) on 
> , executor 0: java.sql.SQLException (Unsupported type varchar(100)){code}
> Code example: 
> {code:java}
> CUSTOM_SCHEMA="ID Integer, NAME VARCHAR(100)"
> df = spark.read.format("jdbc")
> .option("url", "jdbc:oracle:thin:@0.0.0.0:1521:db")
> .option("driver", "oracle.jdbc.OracleDriver")
> .option("dbtable", "table")
> .option("customSchema", CUSTOM_SCHEMA)
> .option("user", "user")
> .option("password", "password")
> .load()
> df.show(){code}
> I tried to set {{spark.sql.legacy.charVarcharAsString = true}} to restore the 
> behavior before Spark 3.1 but it doesn't help.
> The issue occurs in version 3.1.0 and above. I believe that this issue is 
> caused by https://issues.apache.org/jira/browse/SPARK-33480
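
A possible workaround sketch (an assumption, not verified against the reporter's setup): declare the column as STRING in customSchema so the JDBC reader never sees a varchar(n) type:

{code:scala}
val customSchema = "ID INTEGER, NAME STRING" // STRING instead of VARCHAR(100)
val df = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@0.0.0.0:1521:db")
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("dbtable", "table")
  .option("customSchema", customSchema)
  .option("user", "user")
  .option("password", "password")
  .load()
df.show()
{code}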



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48950) Corrupt data from parquet scans

2024-08-01 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870190#comment-17870190
 ] 

Kent Yao commented on SPARK-48950:
--

Thank you [~Tom_Newton] for the additional input.

I’d retarget this to 3.5.3 to unblock 3.5.2. WDYT, [~dongjoon]?

> Corrupt data from parquet scans
> ---
>
> Key: SPARK-48950
> URL: https://issues.apache.org/jira/browse/SPARK-48950
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
> Environment: Spark 3.5.0
> Running on kubernetes
> Using Azure Blob storage with hierarchical namespace enabled 
>Reporter: Thomas Newton
>Priority: Major
>  Labels: correctness
> Attachments: example_task_errors.txt, job_dag.png, sql_query_plan.png
>
>
> It's very rare and non-deterministic, but since Spark 3.5.0 we have started 
> seeing a correctness bug in parquet scans when using the vectorized reader. 
> We've noticed this on double type columns where occasionally small groups 
> (typically 10s to 100s) of rows are replaced with crazy values like 
> `-1.29996470e+029, 3.56717569e-184, 7.23323243e+307, -1.05929677e+045, 
> -7.60562076e+240, -3.1806e-064, 2.89435993e-116`. I think this is the 
> result of interpreting uniform random bits as a double type. Most of my 
> testing has been on an array of double type column but we have also seen it 
> on un-nested plain double type columns. 
> I've been testing this by adding a filter that should return zero results but 
> will return non-zero if the parquet scan has problems. I've attached 
> screenshots of this from the Spark UI. 
> I did a `git bisect` and found that the problem starts with 
> [https://github.com/apache/spark/pull/39950], but I haven't yet understood 
> why. It's possible that this change is fine but it reveals a problem 
> elsewhere? I also noticed [https://github.com/apache/spark/pull/44853] 
> which appears to be a different implementation of the same thing so maybe 
> that could help. 
> It's not a major problem by itself, but another symptom appears to be that 
> Parquet scan tasks fail at a rate of approximately 0.03% with errors like 
> those in the attached `example_task_errors.txt`. If I revert 
> [https://github.com/apache/spark/pull/39950] I get exactly 0 task failures on 
> the same test. 
>  
> The problem seems to be a bit dependent on how the parquet files happen to be 
> organised on blob storage, so I don't yet have a reproducer that I can share 
> that doesn't depend on private data. 
> I tested on a pre-release 4.0.0 and the problem was still present. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48950) Corrupt data from parquet scans

2024-08-01 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870183#comment-17870183
 ] 

Kent Yao commented on SPARK-48950:
--

Did your app run with ignoreCorruptFiles enabled?

BTW, I wonder if we could get a reproducible case on OSS Spark.

> Corrupt data from parquet scans
> ---
>
> Key: SPARK-48950
> URL: https://issues.apache.org/jira/browse/SPARK-48950
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
> Environment: Spark 3.5.0
> Running on kubernetes
> Using Azure Blob storage with hierarchical namespace enabled 
>Reporter: Thomas Newton
>Priority: Major
>  Labels: correctness
> Attachments: example_task_errors.txt, job_dag.png, sql_query_plan.png
>
>
> It's very rare and non-deterministic, but since Spark 3.5.0 we have started 
> seeing a correctness bug in parquet scans when using the vectorized reader. 
> We've noticed this on double type columns where occasionally small groups 
> (typically 10s to 100s) of rows are replaced with crazy values like 
> `-1.29996470e+029, 3.56717569e-184, 7.23323243e+307, -1.05929677e+045, 
> -7.60562076e+240, -3.1806e-064, 2.89435993e-116`. I think this is the 
> result of interpreting uniform random bits as a double type. Most of my 
> testing has been on an array of double type column but we have also seen it 
> on un-nested plain double type columns. 
> I've been testing this by adding a filter that should return zero results but 
> will return non-zero if the parquet scan has problems. I've attached 
> screenshots of this from the Spark UI. 
> I did a `git bisect` and found that the problem starts with 
> [https://github.com/apache/spark/pull/39950], but I haven't yet understood 
> why. It's possible that this change is fine but it reveals a problem 
> elsewhere? I also noticed [https://github.com/apache/spark/pull/44853] 
> which appears to be a different implementation of the same thing so maybe 
> that could help. 
> It's not a major problem by itself, but another symptom appears to be that 
> Parquet scan tasks fail at a rate of approximately 0.03% with errors like 
> those in the attached `example_task_errors.txt`. If I revert 
> [https://github.com/apache/spark/pull/39950] I get exactly 0 task failures on 
> the same test. 
>  
> The problem seems to be a bit dependent on how the parquet files happen to be 
> organised on blob storage, so I don't yet have a reproducer that I can share 
> that doesn't depend on private data. 
> I tested on a pre-release 4.0.0 and the problem was still present. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48950) Corrupt data from parquet scans

2024-08-01 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870152#comment-17870152
 ] 

Kent Yao commented on SPARK-48950:
--

According to the stack traces you provided:

```
Caused by: java.io.IOException: FAILED_TO_UNCOMPRESS(5)
at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:112)
at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
at org.xerial.snappy.Snappy.uncompress(Snappy.java:554)
at 
org.apache.parquet.hadoop.codec.SnappyDecompressor.uncompress(SnappyDecompressor.java:30)
at 
org.apache.parquet.hadoop.codec.NonBlockedDecompressor.decompress(NonBlockedDecompressor.java:73)
at 
org.apache.parquet.hadoop.codec.NonBlockedDecompressorStream.read(NonBlockedDecompressorStream.java:51)
at java.base/java.io.DataInputStream.readFully(DataInputStream.java:201)
at java.base/java.io.DataInputStream.readFully(DataInputStream.java:172)
at 
org.apache.parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:286)
at org.apache.parquet.bytes.BytesInput.toByteBuffer(BytesInput.java:237)
at 
org.apache.parquet.bytes.BytesInput.toInputStream(BytesInput.java:246)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPageV1(VectorizedColumnReader.java:367)
... 41 more
```

So the root cause isn't directly in Spark, right?

> Corrupt data from parquet scans
> ---
>
> Key: SPARK-48950
> URL: https://issues.apache.org/jira/browse/SPARK-48950
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
> Environment: Spark 3.5.0
> Running on kubernetes
> Using Azure Blob storage with hierarchical namespace enabled 
>Reporter: Thomas Newton
>Priority: Major
>  Labels: correctness
> Attachments: example_task_errors.txt, job_dag.png, sql_query_plan.png
>
>
> It's very rare and non-deterministic, but since Spark 3.5.0 we have started 
> seeing a correctness bug in parquet scans when using the vectorized reader. 
> We've noticed this on double type columns where occasionally small groups 
> (typically 10s to 100s) of rows are replaced with crazy values like 
> `-1.29996470e+029, 3.56717569e-184, 7.23323243e+307, -1.05929677e+045, 
> -7.60562076e+240, -3.1806e-064, 2.89435993e-116`. I think this is the 
> result of interpreting uniform random bits as a double type. Most of my 
> testing has been on an array of double type column but we have also seen it 
> on un-nested plain double type columns. 
> I've been testing this by adding a filter that should return zero results but 
> will return non-zero if the parquet scan has problems. I've attached 
> screenshots of this from the Spark UI. 
> I did a `git bisect` and found that the problem starts with 
> [https://github.com/apache/spark/pull/39950], but I haven't yet understood 
> why. It's possible that this change is fine but it reveals a problem 
> elsewhere? I also noticed [https://github.com/apache/spark/pull/44853] 
> which appears to be a different implementation of the same thing so maybe 
> that could help. 
> It's not a major problem by itself, but another symptom appears to be that 
> Parquet scan tasks fail at a rate of approximately 0.03% with errors like 
> those in the attached `example_task_errors.txt`. If I revert 
> [https://github.com/apache/spark/pull/39950] I get exactly 0 task failures on 
> the same test. 
>  
> The problem seems to be a bit dependent on how the parquet files happen to be 
> organised on blob storage, so I don't yet have a reproducer that I can share 
> that doesn't depend on private data. 
> I tested on a pre-release 4.0.0 and the problem was still present. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-49030) Self join of a CTE seems non-deterministic

2024-08-01 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-49030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870147#comment-17870147
 ] 

Kent Yao commented on SPARK-49030:
--

Thank you [~ulysses] for the verification

> Self join of a CTE seems non-deterministic
> --
>
> Key: SPARK-49030
> URL: https://issues.apache.org/jira/browse/SPARK-49030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
> Environment: Tested with Spark 3.4.1, 3.5.1, and 4.0.0-preview.
>Reporter: Jihoon Son
>Priority: Minor
>
> {code:java}
> WITH c AS (SELECT * FROM customer LIMIT 10)
> SELECT count(*)
> FROM c c1, c c2
> WHERE c1.c_customer_sk > c2.c_customer_sk{code}
> Suppose a self join query on a CTE such as the one above.
> Spark generates a physical plan like the one below for this query.
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[], functions=[count(1)], output=[count(1)#194L])
>    +- HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#233L])
>       +- Project
>          +- BroadcastNestedLoopJoin BuildRight, Inner, (c_customer_sk#0 > 
> c_customer_sk#214)
>             :- Filter isnotnull(c_customer_sk#0)
>             :  +- GlobalLimit 10, 0
>             :     +- Exchange SinglePartition, ENSURE_REQUIREMENTS, 
> [plan_id=256]
>             :        +- LocalLimit 10
>             :           +- FileScan parquet [c_customer_sk#0] Batched: true, 
> DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 
> paths)[file:/some/path/customer], PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct
>             +- BroadcastExchange IdentityBroadcastMode, [plan_id=263]
>                +- Filter isnotnull(c_customer_sk#214)
>                   +- GlobalLimit 10, 0
>                      +- Exchange SinglePartition, ENSURE_REQUIREMENTS, 
> [plan_id=259]
>                         +- LocalLimit 10
>                            +- FileScan parquet [c_customer_sk#214] Batched: 
> true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 
> paths)[file:/some/path/customer], PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct{code}
> Evaluating this plan produces non-deterministic results because the limit is 
> independently pushed into the two sides of the join. Each limit can produce 
> different data, and thus the join can produce results that vary across runs.
> I understand that the query in question is not deterministic (and thus not 
> very practical) as, due to the nature of the limit in distributed engines, it 
> is not expected to produce the same result across repeated runs anyway. 
> However, I would still expect the evaluation of a single query plan to 
> remain internally consistent.
> Per extended analysis as seen below, it seems that the query plan has changed 
> at some point during optimization.
> {code:java}
> == Analyzed Logical Plan ==
> count(1): bigint
> WithCTE
> :- CTERelationDef 2, false
> :  +- SubqueryAlias c
> :     +- GlobalLimit 10
> :        +- LocalLimit 10
> :           +- Project [c_customer_sk#0, c_customer_id#1, 
> c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, 
> c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, 
> c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, 
> c_birth_month#12L, c_birth_year#13L, c_birth_country#14, c_login#15, 
> c_email_address#16, c_last_review_date_sk#17]
> :              +- SubqueryAlias customer
> :                 +- View (`customer`, [c_customer_sk#0, c_customer_id#1, 
> c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, 
> c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, 
> c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, 
> c_birth_month#12L, c_birth_year#13L, c_birth_country#14, c_login#15, 
> c_email_address#16, c_last_review_date_sk#17])
> :                    +- Relation 
> [c_customer_sk#0,c_customer_id#1,c_current_cdemo_sk#2,c_current_hdemo_sk#3,c_current_addr_sk#4,c_first_shipto_date_sk#5,c_first_sales_date_sk#6,c_salutation#7,c_first_name#8,c_last_name#9,c_preferred_cust_flag#10,c_birth_day#11L,c_birth_month#12L,c_birth_year#13L,c_birth_country#14,c_login#15,c_email_address#16,c_last_review_date_sk#17]
>  parquet
> +- Aggregate [count(1) AS count(1)#194L]
>    +- Filter (c_customer_sk#0 > c_customer_sk#176)
>       +- Join Inner
>          :- SubqueryAlias c1
>          :  +- SubqueryAlias c
>          :     +- CTERelationRef 2, true, [c_customer_sk#0, c_customer_id#1, 
> c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, 
> c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, 
> c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, 
> c_birth_month#12L, c_birth_year#13
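
One hedged way to sidestep the independent limit pushdown described above, offered as an illustration rather than a fix from this ticket, is to materialize the limited rows once so both sides of the self join read the same data:

{code:scala}
// Workaround sketch (not from the ticket): cache the limited rows so both
// join sides scan one in-memory copy instead of re-running LIMIT twice.
// Table and column names follow the query in the description; assumes a
// `spark` SparkSession in scope.
val c = spark.sql("SELECT * FROM customer LIMIT 10").cache()
c.count() // force materialization before the join
c.createOrReplaceTempView("c_materialized")

spark.sql("""
  SELECT count(*)
  FROM c_materialized c1, c_materialized c2
  WHERE c1.c_customer_sk > c2.c_customer_sk
""").show()
{code}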

[jira] [Commented] (SPARK-49000) Aggregation with DISTINCT gives wrong results when dealing with literals

2024-08-01 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-49000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870120#comment-17870120
 ] 

Kent Yao commented on SPARK-49000:
--

The commit was reverted to fix CI; please send a backport PR for 3.5.

> Aggregation with DISTINCT gives wrong results when dealing with literals
> 
>
> Key: SPARK-49000
> URL: https://issues.apache.org/jira/browse/SPARK-49000
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.3, 3.2.4, 3.5.1, 3.3.4, 3.4.3
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Critical
>  Labels: correctness, pull-request-available
> Fix For: 4.0.0
>
>
> Aggregation with *DISTINCT* gives wrong results when dealing with literals. 
> It appears that this bug affects all (or most) released versions of Spark.
>  
> For example:
> {code:java}
> select count(distinct 1) from t{code}
> returns 1, while the correct result should be 0.
>  
> For reference:
> {code:java}
> select count(1) from t{code}
> returns 0, which is the correct and expected result.
>  
> In these examples, suppose that *t* is an empty table (with any columns).
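
A hedged repro sketch of the behaviour above, assuming an empty table *t*; the `id INT` column and parquet format are arbitrary placeholders:

{code:scala}
// Repro sketch: on an empty table both aggregates should return 0; before
// the fix, the DISTINCT variant incorrectly returned 1. The schema and
// format below are arbitrary placeholders. Assumes a `spark` SparkSession.
spark.sql("CREATE TABLE t (id INT) USING parquet")
spark.sql("SELECT count(distinct 1) FROM t").show() // buggy: 1, correct: 0
spark.sql("SELECT count(1) FROM t").show()          // 0, as expected
{code}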



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-49000) Aggregation with DISTINCT gives wrong results when dealing with literals

2024-08-01 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-49000:
-
Fix Version/s: (was: 3.5.2)

> Aggregation with DISTINCT gives wrong results when dealing with literals
> 
>
> Key: SPARK-49000
> URL: https://issues.apache.org/jira/browse/SPARK-49000
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.3, 3.2.4, 3.5.1, 3.3.4, 3.4.3
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Critical
>  Labels: correctness, pull-request-available
> Fix For: 4.0.0
>
>
> Aggregation with *DISTINCT* gives wrong results when dealing with literals. 
> It appears that this bug affects all (or most) released versions of Spark.
>  
> For example:
> {code:java}
> select count(distinct 1) from t{code}
> returns 1, while the correct result should be 0.
>  
> For reference:
> {code:java}
> select count(1) from t{code}
> returns 0, which is the correct and expected result.
>  
> In these examples, suppose that *t* is an empty table (with any columns).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-49054) Column default value should support current_* functions

2024-07-31 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-49054:
-
Fix Version/s: 4.0.0

> Column default value should support current_* functions
> ---
>
> Key: SPARK-49054
> URL: https://issues.apache.org/jira/browse/SPARK-49054
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.2
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-49067) Move utf-8 literal into the methods of UrlCodec class

2024-07-30 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-49067:


Assignee: Zhen Wang

> Move utf-8 literal into the methods of UrlCodec class
> -
>
> Key: SPARK-49067
> URL: https://issues.apache.org/jira/browse/SPARK-49067
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Zhen Wang
>Assignee: Zhen Wang
>Priority: Major
>  Labels: pull-request-available
>
> Move utf-8 literals in url encode/decode functions to internal methods of 
> UrlCodec class



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-49067) Move utf-8 literal into the methods of UrlCodec class

2024-07-30 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-49067.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47544
[https://github.com/apache/spark/pull/47544]

> Move utf-8 literal into the methods of UrlCodec class
> -
>
> Key: SPARK-49067
> URL: https://issues.apache.org/jira/browse/SPARK-49067
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Zhen Wang
>Assignee: Zhen Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Move utf-8 literals in url encode/decode functions to internal methods of 
> UrlCodec class
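
A hedged sketch of the shape such a refactor might take; the actual UrlCodec signatures in the PR may differ:

{code:scala}
// Sketch only; the real UrlCodec in the PR may differ. The point is that
// the utf-8 charset is fixed once inside the codec methods instead of
// being passed as a literal at every call site.
import java.net.{URLDecoder, URLEncoder}
import java.nio.charset.StandardCharsets

object UrlCodec {
  def encode(url: String): String =
    URLEncoder.encode(url, StandardCharsets.UTF_8.name())

  def decode(url: String): String =
    URLDecoder.decode(url, StandardCharsets.UTF_8.name())
}
{code}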



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48608) Spark 3.5: fails to build with value defaultValueNotConstantError is not a member of object org.apache.spark.sql.errors.QueryCompilationErrors

2024-07-29 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-48608.
--
Fix Version/s: 3.5.2
   Resolution: Fixed

Issue resolved by https://github.com/apache/spark/pull/46978

> Spark 3.5: fails to build with value defaultValueNotConstantError is not a 
> member of object org.apache.spark.sql.errors.QueryCompilationErrors 
> ---
>
> Key: SPARK-48608
> URL: https://issues.apache.org/jira/browse/SPARK-48608
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.2
>Reporter: Thomas Graves
>Priority: Blocker
> Fix For: 3.5.2
>
>
> PR [https://github.com/apache/spark/pull/46594] seems to have broken the 
> Spark 3.5 build.
> [ERROR] [Error] 
> ...sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala:299:
>  value defaultValueNotConstantError is not a member of object 
> org.apache.spark.sql.errors.QueryCompilationErrors
> I don't see that method defined on the 3.5 branch - 
> [https://github.com/apache/spark/blob/branch-3.5/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala]
> I see it defined on master by 
> https://issues.apache.org/jira/browse/SPARK-46905, which only went into 4.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48308) Unify getting data schema without partition columns in FileSourceStrategy

2024-07-25 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-48308:
-
Fix Version/s: 3.5.2

> Unify getting data schema without partition columns in FileSourceStrategy
> -
>
> Key: SPARK-48308
> URL: https://issues.apache.org/jira/browse/SPARK-48308
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
>Reporter: Johan Lasperas
>Assignee: Johan Lasperas
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2
>
>
> In 
> [FileSourceStrategy,|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L191]
>  the schema of the data excluding partition columns is computed twice, in 
> slightly different ways:
>  
> {code:java}
> val dataColumnsWithoutPartitionCols = 
> dataColumns.filterNot(partitionSet.contains) {code}
> vs 
> {code:java}
> val readDataColumns = dataColumns
>   .filterNot(partitionColumns.contains) {code}
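
A hedged, self-contained illustration (not the PR's code) of why the two calls above are not interchangeable: `partitionSet` is an AttributeSet, which matches by exprId (semantic equality), while `partitionColumns.contains` uses full object equality, so the two filters can disagree when anything like nullability differs:

{code:scala}
// Sketch under stated assumptions: demonstrates semantic vs. object
// equality for attributes using Spark's internal catalyst classes.
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, AttributeSet}
import org.apache.spark.sql.types.IntegerType

val a1 = AttributeReference("p", IntegerType)()
val a2 = a1.withNullability(false) // same exprId, different object

assert(AttributeSet(Seq(a1)).contains(a2)) // exprId match: true
assert(!Seq(a1).contains(a2))              // object equality: false
{code}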



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48991) FileStreamSink.hasMetadata handles invalid path

2024-07-24 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-48991:
-
Fix Version/s: 3.5.3
   (was: 3.5.2)

> FileStreamSink.hasMetadata handles invalid path
> ---
>
> Key: SPARK-48991
> URL: https://issues.apache.org/jira/browse/SPARK-48991
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.1, 3.4.3
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.4.4, 3.5.3
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48991) FileStreamSink.hasMetadata handles invalid path

2024-07-24 Thread Kent Yao (Jira)
Kent Yao created SPARK-48991:


 Summary: FileStreamSink.hasMetadata handles invalid path
 Key: SPARK-48991
 URL: https://issues.apache.org/jira/browse/SPARK-48991
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.3, 3.5.1, 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


