[jira] [Comment Edited] (SPARK-16841) Improves the row level metrics performance when reading Parquet table

2016-08-03 Thread Sean Zhong (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15406306#comment-15406306 ]

Sean Zhong edited comment on SPARK-16841 at 8/3/16 5:54 PM:


This JIRA was created after analyzing the performance impact of 
https://github.com/apache/spark/pull/12352, which added the row-level metrics 
and caused a 15% performance regression. I can reproduce the regression 
consistently by comparing the code before and after 
https://github.com/apache/spark/pull/12352.

The problem is that I cannot reproduce the same regression consistently on 
Spark trunk: the improvement after the fix varies a lot (sometimes 5%, 
sometimes 20%, sometimes not noticeable). What I observed is that when the 
same benchmark code is run 100 times in the same spark-shell session, the time 
taken by each run does not converge, so I cannot get an exact performance 
number.

For example, if we run the code below 100 times,
{code}
spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()
{code}
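For reference, a minimal sketch of the timing loop I would paste into spark-shell (the loop and the millisecond conversion are only illustrative; the query is the one above):
{code}
// Run the same query 100 times and record per-run wall time in milliseconds.
// Assumes a spark-shell session, where `spark` and the $-string implicits
// are already in scope.
val timesMs = (1 to 100).map { _ =>
  val start = System.nanoTime()
  spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()
  (System.nanoTime() - start) / 1000000  // nanoseconds -> milliseconds
}
println(timesMs.mkString(", "))
{code}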

I observed:
1. The first run may take more than 9000 ms.
2. The next few runs are much faster, around 4700 ms each.
3. After that, performance suddenly gets worse again, to around 8500 ms per run.

I guess the phenomenon has something to do with the Java JIT and our codegen 
logic (because of codegen, we generate a new class for each run in 
spark-shell, which may affect the JIT code cache).
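One way to sanity-check this guess is to look at the generated code and the JIT code cache directly (a sketch only; the debug import is Spark SQL's standard debug helper and the JVM flags are ordinary HotSpot options, neither comes from the benchmark above):
{code}
// Print the whole-stage generated code for the query; if a fresh class is
// generated for each run, the JIT has to recompile it every time.
import org.apache.spark.sql.execution.debug._
spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).debugCodegen()
{code}
Launching spark-shell with HotSpot options such as -XX:+PrintCodeCache (which reports code-cache usage at exit) or a larger -XX:ReservedCodeCacheSize would also help tell whether the code cache itself is involved.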

Since I cannot verify this improvement consistently on trunk, I am going to 
close this JIRA.



> Improves the row level metrics performance when reading Parquet table
> ----------------------------------------------------------------------
>
> Key: SPARK-16841
> URL: https://issues.apache.org/jira/browse/SPARK-16841
> Project: Spark
> Issue Type: Improvement
> Reporter: Sean Zhong
>
> When reading a Parquet table, Spark records row-level metrics such as recordsRead and bytesRead
> (https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93).
> The implementation is not very efficient: when the Parquet vectorized reader is not used, updating these metrics may take about 20% of the read time.
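For illustration only, here is a rough sketch of the kind of change the issue title points at: accumulating the count locally and flushing it to the shared metric in batches rather than on every record. The names below are hypothetical and are not the actual FileScanRDD code.
{code}
// Hypothetical wrapper: update the shared metric once per 1000 records
// instead of once per record.
class MetricCountingIterator[T](
    underlying: Iterator[T],
    addRecordsRead: Long => Unit) extends Iterator[T] {
  private var pending = 0L
  override def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more && pending > 0) { addRecordsRead(pending); pending = 0 }  // flush the remainder
    more
  }
  override def next(): T = {
    pending += 1
    if (pending == 1000) { addRecordsRead(pending); pending = 0 }  // flush a full batch
    underlying.next()
  }
}
{code}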



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

