[jira] [Created] (SPARK-32732) Convert schema only once in OrcSerializer

2020-08-28 Thread Muhammad Samir Khan (Jira)
Muhammad Samir Khan created SPARK-32732:
---

 Summary: Convert schema only once in OrcSerializer
 Key: SPARK-32732
 URL: https://issues.apache.org/jira/browse/SPARK-32732
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Muhammad Samir Khan


This issue tracks a TODO item from the pull request
https://github.com/apache/spark/pull/29352 for SPARK-32532.
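
The issue text does not say where the repeated conversion happens, so the
following is only a minimal sketch of the general idea in the title: move a
schema conversion out of the per-row path and do it once at construction. It
is written in Python purely for illustration (Spark's OrcSerializer is Scala),
and every name in it (Field, to_target_schema, PerRowSerializer,
ConvertOnceSerializer) is hypothetical, not a Spark or ORC API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for one column of an engine-side schema."""
    name: str
    type_name: str


# Records every conversion so the two serializers can be compared.
calls = []


def to_target_schema(fields):
    """Pretend conversion from an engine schema to a file-format schema string."""
    calls.append(len(fields))
    return "struct<" + ",".join(f"{f.name}:{f.type_name}" for f in fields) + ">"


class PerRowSerializer:
    """Converts the schema again on every serialize() call."""

    def __init__(self, fields):
        self.fields = fields

    def serialize(self, row):
        schema = to_target_schema(self.fields)  # repeated work for each row
        return (schema, tuple(row))


class ConvertOnceSerializer:
    """Converts the schema once, at construction, and reuses the result."""

    def __init__(self, fields):
        self.schema = to_target_schema(fields)  # done exactly once

    def serialize(self, row):
        return (self.schema, tuple(row))
```

Both classes produce the same output for a given row; the only difference is
how often the conversion runs, which is what matters when millions of rows
pass through the serializer.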



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32731) Add tests for arrays/maps of nested structs to ReadSchemaSuite to test structs reuse

2020-08-28 Thread Muhammad Samir Khan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Muhammad Samir Khan updated SPARK-32731:

Summary: Add tests for arrays/maps of nested structs to ReadSchemaSuite to 
test structs reuse  (was: Added tests for arrays/maps of nested structs to 
ReadSchemaSuite to test structs reuse)

> Add tests for arrays/maps of nested structs to ReadSchemaSuite to test 
> structs reuse
> 
>
> Key: SPARK-32731
> URL: https://issues.apache.org/jira/browse/SPARK-32731
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Muhammad Samir Khan
>Priority: Major
>
> These tests are split out from the pull request for SPARK-32531
> (https://github.com/apache/spark/pull/29352). They cover maps and arrays of
> nested structs for different file formats. For example,
> https://github.com/apache/spark/pull/29353 and
> https://github.com/apache/spark/pull/29354 add object reuse when reading ORC
> and Avro files. For dynamic data structures such as arrays and maps, however,
> the schema alone does not tell us how large the structure will be, so it has
> to be allocated while reading the data. The added tests provide coverage so
> that objects are not accidentally reused when maps and arrays are
> encountered.
> AFAIK this is not covered by existing tests.
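
The difference between a struct that can safely be reused and array elements
that must be allocated per element can be illustrated with a minimal sketch.
It is written in Python purely for illustration (Spark's readers are Scala),
and the function names are hypothetical rather than Spark APIs. It shows the
kind of aliasing bug that tests of this sort are meant to catch.

```python
def read_structs_reusing(values):
    """Fixed-shape struct: one mutable holder is overwritten for each record.

    This is safe only if the caller consumes (or copies) each record before
    asking for the next one.
    """
    holder = {"x": None}
    for v in values:
        holder["x"] = v
        yield holder


def read_array_reusing(values):
    """Buggy for arrays: every element of the result aliases the same holder."""
    holder = {"x": None}
    out = []
    for v in values:
        holder["x"] = v
        out.append(holder)  # appends a reference, not a copy
    return out


def read_array_allocating(values):
    """Correct for arrays: the element count comes from the data, so each
    element gets its own freshly allocated object."""
    return [{"x": v} for v in values]
```

With the reusing array reader, all elements end up showing the last value
read, because they are the same object. The allocating reader keeps each
element distinct.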






[jira] [Created] (SPARK-32731) Added tests for arrays/maps of nested structs to ReadSchemaSuite to test structs reuse

2020-08-28 Thread Muhammad Samir Khan (Jira)
Muhammad Samir Khan created SPARK-32731:
---

 Summary: Added tests for arrays/maps of nested structs to 
ReadSchemaSuite to test structs reuse
 Key: SPARK-32731
 URL: https://issues.apache.org/jira/browse/SPARK-32731
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 3.0.0
Reporter: Muhammad Samir Khan


These tests are split out from the pull request for SPARK-32531
(https://github.com/apache/spark/pull/29352). They cover maps and arrays of
nested structs for different file formats. For example,
https://github.com/apache/spark/pull/29353 and
https://github.com/apache/spark/pull/29354 add object reuse when reading ORC
and Avro files. For dynamic data structures such as arrays and maps, however,
the schema alone does not tell us how large the structure will be, so it has to
be allocated while reading the data. The added tests provide coverage so that
objects are not accidentally reused when maps and arrays are encountered.

AFAIK this is not covered by existing tests.






[jira] [Created] (SPARK-32561) Allow DataSourceReadBenchmark to run for select formats

2020-08-06 Thread Muhammad Samir Khan (Jira)
Muhammad Samir Khan created SPARK-32561:
---

 Summary: Allow DataSourceReadBenchmark to run for select formats
 Key: SPARK-32561
 URL: https://issues.apache.org/jira/browse/SPARK-32561
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 3.0.0
Reporter: Muhammad Samir Khan


Currently DataSourceReadBenchmark runs benchmarks for the Parquet, ORC, CSV,
and JSON file formats, and there is no way to specify a single format or a
subset of formats at runtime, as there is for BuiltInDataSourceWriteBenchmark.
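
One way such a runtime selection could work is sketched below. This is an
assumption for illustration only: the issue does not describe how
BuiltInDataSourceWriteBenchmark receives its formats, so the sketch simply
treats command-line arguments as format names. It is written in Python rather
than Scala, and select_formats and ALL_FORMATS are hypothetical names.

```python
# The four formats named in the issue, in a fixed canonical order.
ALL_FORMATS = ("parquet", "orc", "csv", "json")


def select_formats(args):
    """Return the formats to benchmark.

    No arguments means "run everything"; otherwise only the requested
    formats run, in canonical order and without duplicates.
    """
    if not args:
        return list(ALL_FORMATS)
    requested = [a.lower() for a in args]
    unknown = [a for a in requested if a not in ALL_FORMATS]
    if unknown:
        raise ValueError("unknown format(s): " + ", ".join(unknown))
    return [f for f in ALL_FORMATS if f in requested]
```

Rejecting unknown names up front avoids silently running an empty benchmark
when a format is misspelled.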






[jira] [Updated] (SPARK-32531) Add benchmarks for nested structs and arrays for different file formats

2020-08-06 Thread Muhammad Samir Khan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Muhammad Samir Khan updated SPARK-32531:

Component/s: Tests

> Add benchmarks for nested structs and arrays for different file formats
> ---
>
> Key: SPARK-32531
> URL: https://issues.apache.org/jira/browse/SPARK-32531
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Muhammad Samir Khan
>Priority: Major
>
> We found that Spark was slower than Pig on some schemas in our pipelines. On
> investigation, Spark turned out to be slow for nested structs and arrays of
> structs, and these cases were not profiled by the current benchmarks. I have
> some improvements for the ORC (SPARK-32532) and Avro (SPARK-32533) file
> formats that improve performance in these cases, and will put up the PRs
> soon.






[jira] [Commented] (SPARK-32550) Make SpecificInternalRow constructors faster by using while loops instead of maps

2020-08-05 Thread Muhammad Samir Khan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17171868#comment-17171868
 ] 

Muhammad Samir Khan commented on SPARK-32550:
-

[~maropu] added some benchmark results in
https://github.com/apache/spark/pull/29366. For Avro, with a couple of the
benchmarks I am proposing to add in SPARK-32531 (PR:
https://github.com/apache/spark/pull/29352), I get the following on the master
branch:



*Before:*
Nested Struct: 75 s average
Array of Structs: 34 s average

*After:*
Nested Struct: 49 s average
Array of Structs: 19 s average

> Make SpecificInternalRow constructors faster by using while loops instead of 
> maps
> -
>
> Key: SPARK-32550
> URL: https://issues.apache.org/jira/browse/SPARK-32550
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Muhammad Samir Khan
>Priority: Major
>
> Two constructors in SpecificInternalRow can be made faster by using while
> loops instead of maps. This was originally noticed while working on
> SPARK-32532 and SPARK-32533, and it affects the performance of reading ORC
> and Avro files. I profiled the change using the benchmarks added in
> SPARK-32531 for nested structs and arrays of structs.






[jira] [Created] (SPARK-32550) Make SpecificInternalRow constructors faster by using while loops instead of maps

2020-08-05 Thread Muhammad Samir Khan (Jira)
Muhammad Samir Khan created SPARK-32550:
---

 Summary: Make SpecificInternalRow constructors faster by using 
while loops instead of maps
 Key: SPARK-32550
 URL: https://issues.apache.org/jira/browse/SPARK-32550
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Muhammad Samir Khan


Two constructors in SpecificInternalRow can be made faster by using while loops
instead of maps. This was originally noticed while working on SPARK-32532 and
SPARK-32533, and it affects the performance of reading ORC and Avro files. I
profiled the change using the benchmarks added in SPARK-32531 for nested
structs and arrays of structs.






[jira] [Updated] (SPARK-32531) Add benchmarks for nested structs and arrays for different file formats

2020-08-04 Thread Muhammad Samir Khan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Muhammad Samir Khan updated SPARK-32531:

Description: We found that Spark was slower than Pig on some schemas in our
pipelines. On investigation, Spark turned out to be slow for nested structs and
arrays of structs, and these cases were not profiled by the current benchmarks.
I have some improvements for the ORC (SPARK-32532) and Avro (SPARK-32533) file
formats that improve performance in these cases, and will put up the PRs
soon.  (was: 
Additions to benchmarks for different file formats for nested structs and 
arrays which are not being currently benchmarked. I have some improvements for 
ORC and Avro file formats which improve the performance in these cases.

I will be putting up the PRs soon.)

> Add benchmarks for nested structs and arrays for different file formats
> ---
>
> Key: SPARK-32531
> URL: https://issues.apache.org/jira/browse/SPARK-32531
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Muhammad Samir Khan
>Priority: Major
>
> We found that Spark was slower than Pig on some schemas in our pipelines. On
> investigation, Spark turned out to be slow for nested structs and arrays of
> structs, and these cases were not profiled by the current benchmarks. I have
> some improvements for the ORC (SPARK-32532) and Avro (SPARK-32533) file
> formats that improve performance in these cases, and will put up the PRs
> soon.






[jira] [Updated] (SPARK-32531) Add benchmarks for nested structs and arrays for different file formats

2020-08-04 Thread Muhammad Samir Khan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Muhammad Samir Khan updated SPARK-32531:

Summary: Add benchmarks for nested structs and arrays for different file 
formats  (was: Add benchmarks for nested structs and arrays for different data 
types)

> Add benchmarks for nested structs and arrays for different file formats
> ---
>
> Key: SPARK-32531
> URL: https://issues.apache.org/jira/browse/SPARK-32531
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Muhammad Samir Khan
>Priority: Major
>
> This issue adds benchmarks for nested structs and arrays across file
> formats; these cases are not currently benchmarked. I have some improvements
> for the ORC and Avro file formats that improve performance in these cases.
> I will put up the PRs soon.






[jira] [Updated] (SPARK-32532) Improve ORC read/write performance on nested structs and array of structs

2020-08-04 Thread Muhammad Samir Khan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Muhammad Samir Khan updated SPARK-32532:

Description: 
I have some improvements for the ORC file format that reduce the time taken to
read and write nested structs and arrays of structs. Using the benchmarks in
SPARK-32531, I was able to improve performance on branch-3.0 as follows
(measurements in seconds):

Read:
 Nested Structs: 184 -> 44
 Array of Structs: 66 -> 15

Write:
 Nested Structs: 543 -> 39
 Array of Structs: 330 -> 37

I will put up the PR with the changes soon.

  was:
Have some improvements for ORC file format to reduce time taken when 
reading/writing nested/array'd structs. Using benchmarks in [SPARK-32531] was 
able to improve performance as follows (measurements in seconds):

Read:
 Nested Structs: 184 -> 44
 Array of Struct: 66 -> 15

Write
 Nested Structs: 543 -> 39
 Array of Struct: 330 -> 37

Will be putting up the PR soon with the changes.


> Improve ORC read/write performance on nested structs and array of structs
> -
>
> Key: SPARK-32532
> URL: https://issues.apache.org/jira/browse/SPARK-32532
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Muhammad Samir Khan
>Priority: Major
>
> I have some improvements for the ORC file format that reduce the time taken
> to read and write nested structs and arrays of structs. Using the benchmarks
> in SPARK-32531, I was able to improve performance on branch-3.0 as follows
> (measurements in seconds):
> Read:
>  Nested Structs: 184 -> 44
>  Array of Structs: 66 -> 15
> Write:
>  Nested Structs: 543 -> 39
>  Array of Structs: 330 -> 37
> I will put up the PR with the changes soon.






[jira] [Created] (SPARK-32533) Improve Avro read/write performance on nested structs and array of structs

2020-08-04 Thread Muhammad Samir Khan (Jira)
Muhammad Samir Khan created SPARK-32533:
---

 Summary: Improve Avro read/write performance on nested structs and 
array of structs
 Key: SPARK-32533
 URL: https://issues.apache.org/jira/browse/SPARK-32533
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Muhammad Samir Khan


I have some improvements for the Avro file format that reduce the time taken to
read and write nested structs and arrays of structs. Using the benchmarks in
SPARK-32531, I was able to improve performance on branch-3.0 as follows
(measurements in seconds):

Read:
Nested Structs: 75 -> 46
Array of Structs: 47 -> 17

Write:
Nested Structs: 147 -> 36
Array of Structs: 139 -> 34

I will put up the PR with the changes soon.
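
To put these timings in relative terms, the before/after pairs can be turned
into speedup factors and percentage reductions. The short sketch below does
that arithmetic for the Avro figures in this message; the helper names are
hypothetical, and the numbers are copied from the lists above rather than
measured again.

```python
def speedup(before_s, after_s):
    """How many times faster the 'after' run is than the 'before' run."""
    return before_s / after_s


def time_saved_pct(before_s, after_s):
    """Percentage of the original wall-clock time that was eliminated."""
    return 100.0 * (before_s - after_s) / before_s


# (operation, case) -> (before, after), both in seconds.
AVRO_RESULTS = {
    ("read", "nested structs"): (75, 46),
    ("read", "array of structs"): (47, 17),
    ("write", "nested structs"): (147, 36),
    ("write", "array of structs"): (139, 34),
}


def summarize(results):
    """Format one line per benchmark with its speedup and time saved."""
    lines = []
    for (op, case), (before, after) in results.items():
        lines.append(
            f"{op:<5} {case:<17} {before:>4} -> {after:>3} s  "
            f"{speedup(before, after):.2f}x  "
            f"{time_saved_pct(before, after):.1f}% less time"
        )
    return lines


if __name__ == "__main__":
    print("\n".join(summarize(AVRO_RESULTS)))
```

By this calculation the reported Avro improvements range from roughly 1.6x
(nested struct reads) to roughly 4.1x (both write cases).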






[jira] [Updated] (SPARK-32532) Improve ORC read/write performance on nested structs and array of structs

2020-08-04 Thread Muhammad Samir Khan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Muhammad Samir Khan updated SPARK-32532:

Description: 
I have some improvements for the ORC file format that reduce the time taken to
read and write nested structs and arrays of structs. Using the benchmarks in
[SPARK-32531], I was able to improve performance as follows (measurements in
seconds):

Read:
 Nested Structs: 184 -> 44
 Array of Structs: 66 -> 15

Write:
 Nested Structs: 543 -> 39
 Array of Structs: 330 -> 37

I will put up the PR with the changes soon.

  was:
Have some improvements for ORC file format to reduce time taken when 
reading/writing nested/array'd structs. Using benchmarks in [SPARK-32071] was 
able to improve performance as follows (measurements in seconds):

Read:
Nested Structs: 184 -> 44
Array of Struct: 66 -> 15

Write
Nested Structs: 543 -> 39
Array of Struct: 330 -> 37

Will be putting up the PR soon with the changes.


> Improve ORC read/write performance on nested structs and array of structs
> -
>
> Key: SPARK-32532
> URL: https://issues.apache.org/jira/browse/SPARK-32532
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Muhammad Samir Khan
>Priority: Major
>
> I have some improvements for the ORC file format that reduce the time taken
> to read and write nested structs and arrays of structs. Using the benchmarks
> in [SPARK-32531], I was able to improve performance as follows (measurements
> in seconds):
> Read:
>  Nested Structs: 184 -> 44
>  Array of Structs: 66 -> 15
> Write:
>  Nested Structs: 543 -> 39
>  Array of Structs: 330 -> 37
> I will put up the PR with the changes soon.






[jira] [Created] (SPARK-32532) Improve ORC read/write performance on nested structs and array of structs

2020-08-04 Thread Muhammad Samir Khan (Jira)
Muhammad Samir Khan created SPARK-32532:
---

 Summary: Improve ORC read/write performance on nested structs and 
array of structs
 Key: SPARK-32532
 URL: https://issues.apache.org/jira/browse/SPARK-32532
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Muhammad Samir Khan


I have some improvements for the ORC file format that reduce the time taken to
read and write nested structs and arrays of structs. Using the benchmarks in
[SPARK-32071], I was able to improve performance as follows (measurements in
seconds):

Read:
Nested Structs: 184 -> 44
Array of Structs: 66 -> 15

Write:
Nested Structs: 543 -> 39
Array of Structs: 330 -> 37

I will put up the PR with the changes soon.






[jira] [Created] (SPARK-32531) Add benchmarks for nested structs and arrays for different data types

2020-08-04 Thread Muhammad Samir Khan (Jira)
Muhammad Samir Khan created SPARK-32531:
---

 Summary: Add benchmarks for nested structs and arrays for 
different data types
 Key: SPARK-32531
 URL: https://issues.apache.org/jira/browse/SPARK-32531
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.0.0
Reporter: Muhammad Samir Khan


This issue adds benchmarks for nested structs and arrays across file formats;
these cases are not currently benchmarked. I have some improvements for the ORC
and Avro file formats that improve performance in these cases.

I will put up the PRs soon.


