[jira] [Created] (SPARK-32732) Convert schema only once in OrcSerializer
Muhammad Samir Khan created SPARK-32732:
---

Summary: Convert schema only once in OrcSerializer
Key: SPARK-32732
URL: https://issues.apache.org/jira/browse/SPARK-32732
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.0.0
Reporter: Muhammad Samir Khan

This is to track a TODO item in the pull request (https://github.com/apache/spark/pull/29352) for SPARK-32532.

--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
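The idea named in the SPARK-32732 summary, converting a schema once rather than on every serialized row, can be illustrated with a small language-agnostic sketch. This is not the OrcSerializer code: the classes, the `convert` step, and the call counter are all hypothetical and exist only to make the difference in conversion count visible.

```python
from typing import Any, Dict, List, Tuple


class CountingConverter:
    """Hypothetical stand-in for a schema conversion step; counts its calls."""

    def __init__(self) -> None:
        self.calls = 0

    def convert(self, schema: List[str]) -> Tuple[str, ...]:
        self.calls += 1
        return tuple(name.upper() for name in schema)


class PerRowSerializer:
    """Converts the schema again for every row (the pattern to avoid)."""

    def __init__(self, schema: List[str], converter: CountingConverter) -> None:
        self.schema = schema
        self.converter = converter

    def serialize(self, row: List[Any]) -> Dict[str, Any]:
        target = self.converter.convert(self.schema)  # repeated work per row
        return dict(zip(target, row))


class ConvertOnceSerializer:
    """Converts the schema once at construction and reuses the result."""

    def __init__(self, schema: List[str], converter: CountingConverter) -> None:
        self._target = converter.convert(schema)  # done exactly once

    def serialize(self, row: List[Any]) -> Dict[str, Any]:
        return dict(zip(self._target, row))


rows = [[1, "x"], [2, "y"], [3, "z"]]

per_row_converter = CountingConverter()
per_row = PerRowSerializer(["id", "name"], per_row_converter)
per_row_out = [per_row.serialize(r) for r in rows]

once_converter = CountingConverter()
once = ConvertOnceSerializer(["id", "name"], once_converter)
once_out = [once.serialize(r) for r in rows]
```

Both serializers produce the same output; only the number of conversions differs (one per row versus one in total).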
[jira] [Updated] (SPARK-32731) Add tests for arrays/maps of nested structs to ReadSchemaSuite to test structs reuse
[ https://issues.apache.org/jira/browse/SPARK-32731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Muhammad Samir Khan updated SPARK-32731:
    Summary: Add tests for arrays/maps of nested structs to ReadSchemaSuite to test structs reuse (was: Added tests for arrays/maps of nested structs to ReadSchemaSuite to test structs reuse)

> Add tests for arrays/maps of nested structs to ReadSchemaSuite to test structs reuse
>
> Key: SPARK-32731
> URL: https://issues.apache.org/jira/browse/SPARK-32731
> Project: Spark
> Issue Type: Test
> Components: SQL, Tests
> Affects Versions: 3.0.0
> Reporter: Muhammad Samir Khan
> Priority: Major
>
> Splitting tests originally posted in the PR (https://github.com/apache/spark/pull/29352) for SPARK-32531. The added tests cover cases for maps and arrays of nested structs for different file formats. E.g., https://github.com/apache/spark/pull/29353 and https://github.com/apache/spark/pull/29354 add object reuse when reading ORC and Avro files. However, for dynamic data structures like arrays and maps, we do not know just by looking at the schema what the size of the data structure will be, so it has to be allocated when reading the data points. The added tests provide coverage so that objects are not accidentally reused when encountering maps and arrays.
> AFAIK this is not covered by existing tests.
[jira] [Created] (SPARK-32731) Added tests for arrays/maps of nested structs to ReadSchemaSuite to test structs reuse
Muhammad Samir Khan created SPARK-32731:
---

Summary: Added tests for arrays/maps of nested structs to ReadSchemaSuite to test structs reuse
Key: SPARK-32731
URL: https://issues.apache.org/jira/browse/SPARK-32731
Project: Spark
Issue Type: Test
Components: SQL, Tests
Affects Versions: 3.0.0
Reporter: Muhammad Samir Khan

Splitting tests originally posted in the PR (https://github.com/apache/spark/pull/29352) for SPARK-32531. The added tests cover cases for maps and arrays of nested structs for different file formats. E.g., https://github.com/apache/spark/pull/29353 and https://github.com/apache/spark/pull/29354 add object reuse when reading ORC and Avro files. However, for dynamic data structures like arrays and maps, we do not know just by looking at the schema what the size of the data structure will be, so it has to be allocated when reading the data points. The added tests provide coverage so that objects are not accidentally reused when encountering maps and arrays.

AFAIK this is not covered by existing tests.
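The hazard described in SPARK-32731 can be shown with a minimal, language-agnostic sketch. It is not Spark's reader code: `StructReader`, `read_array_reusing`, and `read_array_fresh` are hypothetical names. A struct has a size fixed by the schema, so one buffer can be reused across records; an array's length is only known from the data, and reusing that single buffer for its elements makes every element alias the last value read.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


class StructReader:
    """Fills one preallocated row buffer; its size is known from the schema."""

    def __init__(self, field_names: List[str]) -> None:
        self.field_names = field_names
        self._row: List[Any] = [None] * len(field_names)  # allocated once

    def read(self, record: Record) -> List[Any]:
        for i, name in enumerate(self.field_names):
            self._row[i] = record[name]
        return self._row  # the same list object on every call


def read_array_reusing(reader: StructReader, records: List[Record]) -> List[List[Any]]:
    """Incorrect for arrays: every element aliases the reader's single buffer."""
    return [reader.read(r) for r in records]


def read_array_fresh(field_names: List[str], records: List[Record]) -> List[List[Any]]:
    """Correct for arrays: each element gets its own freshly allocated row."""
    return [[r[name] for name in field_names] for r in records]


records = [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]
reader = StructReader(["a", "b"])
aliased = read_array_reusing(reader, records)   # both entries are the same list
fresh = read_array_fresh(["a", "b"], records)   # entries are independent lists
```

A test of the kind the issue proposes would compare the array read back against the expected distinct elements, which the aliased variant fails.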
[jira] [Created] (SPARK-32561) Allow DataSourceReadBenchmark to run for select formats
Muhammad Samir Khan created SPARK-32561:
---

Summary: Allow DataSourceReadBenchmark to run for select formats
Key: SPARK-32561
URL: https://issues.apache.org/jira/browse/SPARK-32561
Project: Spark
Issue Type: Test
Components: SQL, Tests
Affects Versions: 3.0.0
Reporter: Muhammad Samir Khan

Currently DataSourceReadBenchmark runs benchmarks for Parquet, ORC, CSV, and Json file formats and there is no way to specify at runtime a single format or a subset of formats, like there is for BuiltInDataSourceWriteBenchmark.
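One way to picture the request in SPARK-32561 is a small argument filter: run every format when no arguments are given, otherwise run only the named subset. The sketch below is an assumption about the shape of such a filter, not the DataSourceReadBenchmark or BuiltInDataSourceWriteBenchmark implementation; `select_formats`, the case-insensitive matching, and the error on unknown names are all hypothetical choices.

```python
from typing import List, Sequence

# Format names as listed in the issue description.
ALL_FORMATS = ("Parquet", "ORC", "CSV", "Json")


def select_formats(args: Sequence[str]) -> List[str]:
    """Return the formats to benchmark: all of them if no args were given."""
    if not args:
        return list(ALL_FORMATS)
    by_lower = {name.lower(): name for name in ALL_FORMATS}
    unknown = [a for a in args if a.lower() not in by_lower]
    if unknown:
        raise ValueError(f"Unknown format(s): {', '.join(unknown)}")
    # Preserve the order requested on the command line.
    return [by_lower[a.lower()] for a in args]


default_run = select_formats([])
subset_run = select_formats(["orc", "CSV"])
```

Rejecting unknown names up front is a design choice in this sketch; silently ignoring them would make a typo look like a successful, empty run.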
[jira] [Updated] (SPARK-32531) Add benchmarks for nested structs and arrays for different file formats
[ https://issues.apache.org/jira/browse/SPARK-32531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Muhammad Samir Khan updated SPARK-32531:
    Component/s: Tests

> Add benchmarks for nested structs and arrays for different file formats
> ---
>
> Key: SPARK-32531
> URL: https://issues.apache.org/jira/browse/SPARK-32531
> Project: Spark
> Issue Type: Test
> Components: SQL, Tests
> Affects Versions: 3.0.0
> Reporter: Muhammad Samir Khan
> Priority: Major
>
> We had found that Spark performance was slow as compared to PIG on some schemas in our pipelines. On investigation, it was found that Spark performance was slow for nested structs and array'd structs and these cases were not being profiled by the current benchmarks. I have some improvements for ORC (SPARK-32532) and Avro (SPARK-32533) file formats which improve the performance in these cases and will be putting up the PRs soon.
[jira] [Commented] (SPARK-32550) Make SpecificInternalRow constructors faster by using while loops instead of maps
[ https://issues.apache.org/jira/browse/SPARK-32550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17171868#comment-17171868 ]

Muhammad Samir Khan commented on SPARK-32550:
-

[~maropu] added some benchmark results in https://github.com/apache/spark/pull/29366.

For Avro, with a couple of benchmarks I am proposing to add in SPARK-32531 (PR: https://github.com/apache/spark/pull/29352) I get the following on master branch:

*Before:*
Nested Struct 75s average
Array of Structs 34s average

*After:*
Nested Struct 49s average
Array of Structs 19s average

> Make SpecificInternalRow constructors faster by using while loops instead of maps
> -
>
> Key: SPARK-32550
> URL: https://issues.apache.org/jira/browse/SPARK-32550
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Muhammad Samir Khan
> Priority: Major
>
> Two constructors in SpecificInternalRow can be made faster by using while loops instead of maps. This was originally noticed while working on SPARK-32532 and SPARK-32533 and will have impacts on performance of reading ORC and Avro files. Profiled the change using added benchmarks in SPARK-32531 for nested and array'd structs.
[jira] [Created] (SPARK-32550) Make SpecificInternalRow constructors faster by using while loops instead of maps
Muhammad Samir Khan created SPARK-32550:
---

Summary: Make SpecificInternalRow constructors faster by using while loops instead of maps
Key: SPARK-32550
URL: https://issues.apache.org/jira/browse/SPARK-32550
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.0.0
Reporter: Muhammad Samir Khan

Two constructors in SpecificInternalRow can be made faster by using while loops instead of maps. This was originally noticed while working on SPARK-32532 and SPARK-32533 and will have impacts on performance of reading ORC and Avro files. Profiled the change using added benchmarks in SPARK-32531 for nested and array'd structs.
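The shape of the rewrite named in SPARK-32550 (a `map` over the input replaced by an index-based while loop that fills a preallocated array) can be sketched as follows. This is an illustration of the transformation only, not the SpecificInternalRow constructors: `make_holder` and both builder functions are hypothetical, and the speedup reported in the issue concerns Scala on the JVM, so nothing here should be read as a performance claim about Python.

```python
from typing import Any, Dict, List, Optional, Sequence

Holder = Dict[str, Any]


def make_holder(type_name: str) -> Holder:
    """Hypothetical per-field mutable holder, one per column type."""
    return {"type": type_name, "value": None}


def holders_with_map(type_names: Sequence[str]) -> List[Holder]:
    """Original shape: build the holders with a map over the types."""
    return list(map(make_holder, type_names))


def holders_with_while(type_names: Sequence[str]) -> List[Holder]:
    """Rewritten shape: preallocate, then fill each slot by index."""
    n = len(type_names)
    holders: List[Optional[Holder]] = [None] * n
    i = 0
    while i < n:
        holders[i] = make_holder(type_names[i])
        i += 1
    return holders  # every slot has been filled by the loop above


types = ["int", "string", "double"]
from_map = holders_with_map(types)
from_while = holders_with_while(types)
```

Both builders return equal results, which is the property a refactoring of this kind needs to preserve.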
[jira] [Updated] (SPARK-32531) Add benchmarks for nested structs and arrays for different file formats
[ https://issues.apache.org/jira/browse/SPARK-32531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Muhammad Samir Khan updated SPARK-32531:
    Description: We had found that Spark performance was slow as compared to PIG on some schemas in our pipelines. On investigation, it was found that Spark performance was slow for nested structs and array'd structs and these cases were not being profiled by the current benchmarks. I have some improvements for ORC (SPARK-32532) and Avro (SPARK-32533) file formats which improve the performance in these cases and will be putting up the PRs soon. (was: Additions to benchmarks for different file formats for nested structs and arrays which are not being currently benchmarked. I have some improvements for ORC and Avro file formats which improve the performance in these cases. I will be putting up the PRs soon.)

> Add benchmarks for nested structs and arrays for different file formats
> ---
>
> Key: SPARK-32531
> URL: https://issues.apache.org/jira/browse/SPARK-32531
> Project: Spark
> Issue Type: Test
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Muhammad Samir Khan
> Priority: Major
>
> We had found that Spark performance was slow as compared to PIG on some schemas in our pipelines. On investigation, it was found that Spark performance was slow for nested structs and array'd structs and these cases were not being profiled by the current benchmarks. I have some improvements for ORC (SPARK-32532) and Avro (SPARK-32533) file formats which improve the performance in these cases and will be putting up the PRs soon.
[jira] [Updated] (SPARK-32531) Add benchmarks for nested structs and arrays for different file formats
[ https://issues.apache.org/jira/browse/SPARK-32531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Muhammad Samir Khan updated SPARK-32531:
    Summary: Add benchmarks for nested structs and arrays for different file formats (was: Add benchmarks for nested structs and arrays for different data types)

> Add benchmarks for nested structs and arrays for different file formats
> ---
>
> Key: SPARK-32531
> URL: https://issues.apache.org/jira/browse/SPARK-32531
> Project: Spark
> Issue Type: Test
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Muhammad Samir Khan
> Priority: Major
>
> Additions to benchmarks for different file formats for nested structs and arrays which are not being currently benchmarked. I have some improvements for ORC and Avro file formats which improve the performance in these cases. I will be putting up the PRs soon.
[jira] [Updated] (SPARK-32532) Improve ORC read/write performance on nested structs and array of structs
[ https://issues.apache.org/jira/browse/SPARK-32532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Muhammad Samir Khan updated SPARK-32532:
    Description:
Have some improvements for ORC file format to reduce time taken when reading/writing nested/array'd structs. Using benchmarks in SPARK-32531 was able to improve performance on branch-3.0 as follows (measurements in seconds):

Read:
Nested Structs: 184 -> 44
Array of Struct: 66 -> 15

Write
Nested Structs: 543 -> 39
Array of Struct: 330 -> 37

Will be putting up the PR soon with the changes.

  was:
Have some improvements for ORC file format to reduce time taken when reading/writing nested/array'd structs. Using benchmarks in [SPARK-32531] was able to improve performance as follows (measurements in seconds):

Read:
Nested Structs: 184 -> 44
Array of Struct: 66 -> 15

Write
Nested Structs: 543 -> 39
Array of Struct: 330 -> 37

Will be putting up the PR soon with the changes.

> Improve ORC read/write performance on nested structs and array of structs
> -
>
> Key: SPARK-32532
> URL: https://issues.apache.org/jira/browse/SPARK-32532
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Muhammad Samir Khan
> Priority: Major
>
> Have some improvements for ORC file format to reduce time taken when reading/writing nested/array'd structs. Using benchmarks in SPARK-32531 was able to improve performance on branch-3.0 as follows (measurements in seconds):
> Read:
> Nested Structs: 184 -> 44
> Array of Struct: 66 -> 15
> Write
> Nested Structs: 543 -> 39
> Array of Struct: 330 -> 37
> Will be putting up the PR soon with the changes.
[jira] [Created] (SPARK-32533) Improve Avro read/write performance on nested structs and array of structs
Muhammad Samir Khan created SPARK-32533:
---

Summary: Improve Avro read/write performance on nested structs and array of structs
Key: SPARK-32533
URL: https://issues.apache.org/jira/browse/SPARK-32533
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.0.0
Reporter: Muhammad Samir Khan

Have some improvements for Avro file format to reduce time taken when reading/writing nested/array'd structs. Using benchmarks in SPARK-32531 was able to improve performance on branch-3.0 as follows (measurements in seconds):

Read:
Nested Structs: 75 -> 46
Array of Struct: 47 -> 17

Write
Nested Structs: 147 -> 36
Array of Struct: 139 -> 34

Will be putting up the PR soon with the changes.
[jira] [Updated] (SPARK-32532) Improve ORC read/write performance on nested structs and array of structs
[ https://issues.apache.org/jira/browse/SPARK-32532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Muhammad Samir Khan updated SPARK-32532:
    Description:
Have some improvements for ORC file format to reduce time taken when reading/writing nested/array'd structs. Using benchmarks in [SPARK-32531] was able to improve performance as follows (measurements in seconds):

Read:
Nested Structs: 184 -> 44
Array of Struct: 66 -> 15

Write
Nested Structs: 543 -> 39
Array of Struct: 330 -> 37

Will be putting up the PR soon with the changes.

  was:
Have some improvements for ORC file format to reduce time taken when reading/writing nested/array'd structs. Using benchmarks in [SPARK-32071] was able to improve performance as follows (measurements in seconds):

Read:
Nested Structs: 184 -> 44
Array of Struct: 66 -> 15

Write
Nested Structs: 543 -> 39
Array of Struct: 330 -> 37

Will be putting up the PR soon with the changes.

> Improve ORC read/write performance on nested structs and array of structs
> -
>
> Key: SPARK-32532
> URL: https://issues.apache.org/jira/browse/SPARK-32532
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Muhammad Samir Khan
> Priority: Major
>
> Have some improvements for ORC file format to reduce time taken when reading/writing nested/array'd structs. Using benchmarks in [SPARK-32531] was able to improve performance as follows (measurements in seconds):
> Read:
> Nested Structs: 184 -> 44
> Array of Struct: 66 -> 15
> Write
> Nested Structs: 543 -> 39
> Array of Struct: 330 -> 37
> Will be putting up the PR soon with the changes.
[jira] [Created] (SPARK-32532) Improve ORC read/write performance on nested structs and array of structs
Muhammad Samir Khan created SPARK-32532:
---

Summary: Improve ORC read/write performance on nested structs and array of structs
Key: SPARK-32532
URL: https://issues.apache.org/jira/browse/SPARK-32532
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.0.0
Reporter: Muhammad Samir Khan

Have some improvements for ORC file format to reduce time taken when reading/writing nested/array'd structs. Using benchmarks in [SPARK-32071] was able to improve performance as follows (measurements in seconds):

Read:
Nested Structs: 184 -> 44
Array of Struct: 66 -> 15

Write
Nested Structs: 543 -> 39
Array of Struct: 330 -> 37

Will be putting up the PR soon with the changes.
[jira] [Created] (SPARK-32531) Add benchmarks for nested structs and arrays for different data types
Muhammad Samir Khan created SPARK-32531:
---

Summary: Add benchmarks for nested structs and arrays for different data types
Key: SPARK-32531
URL: https://issues.apache.org/jira/browse/SPARK-32531
Project: Spark
Issue Type: Test
Components: SQL
Affects Versions: 3.0.0
Reporter: Muhammad Samir Khan

Additions to benchmarks for different file formats for nested structs and arrays which are not being currently benchmarked. I have some improvements for ORC and Avro file formats which improve the performance in these cases. I will be putting up the PRs soon.