[jira] [Comment Edited] (SPARK-16841) Improves the row level metrics performance when reading Parquet table
[ https://issues.apache.org/jira/browse/SPARK-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15406306#comment-15406306 ]

Sean Zhong edited comment on SPARK-16841 at 8/3/16 5:54 PM:

This jira was created after analyzing the performance impact of https://github.com/apache/spark/pull/12352, which added the row level metrics and caused a 15% performance regression. I can reproduce that regression consistently by comparing code before and after https://github.com/apache/spark/pull/12352.

The problem is that I cannot reproduce the same regression consistently on Spark trunk: the improvement after the fix on trunk varies a lot (sometimes 5%, sometimes 20%, sometimes not noticeable). What I observed is that when running the same benchmark code repeatedly in the same spark-shell 100 times, the time per run does not converge, so I cannot get an exact performance number. For example, if we run the code below 100 times:

{code}
spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()
{code}

I observed:
1. The first run may take > 9000 ms.
2. The next few runs are much faster, around 4700 ms.
3. After that, performance suddenly degrades again, to around 8500 ms per run.

I guess this has something to do with Java JIT and our codegen logic (because of codegen, we create a new class type for each run in spark-shell, which may affect the code cache). Since I cannot verify this improvement consistently on trunk, I am going to close this jira.

was (Author: clockfly):

This jira was created after analyzing the performance impact of https://github.com/apache/spark/pull/12352, which added the row level metrics and caused a 15% performance regression. I can reproduce that regression consistently by comparing the performance of code before and after https://github.com/apache/spark/pull/12352.

The problem is that I cannot reproduce the same regression consistently on Spark trunk: the improvement after the fix on trunk varies a lot (sometimes 5%, sometimes 20%, sometimes not noticeable). What I observed is that when running the same benchmark code repeatedly in the same spark-shell 100 times, the time per run does not converge, so I cannot get an exact performance number. For example, if we run the code below 100 times:

{code}
spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()
{code}

I observed:
1. The first run may take > 9000 ms.
2. The next few runs are much faster, around 4700 ms.
3. After that, performance suddenly degrades again, to around 8500 ms per run.

I guess this has something to do with Java JIT and our codegen logic (because of codegen, we create a new class type for each run in spark-shell, which may affect the code cache). Since I cannot verify this improvement consistently on trunk, I am going to close this jira.
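For context, a minimal sketch of how such a repeated benchmark could be driven from spark-shell. The path /tmp/data4 and the filter come from the comment above; the timing helper, variable names, and the final min/max summary are illustrative, not part of the original benchmark.

{code}
// Hypothetical benchmark driver for spark-shell: times each of the 100 runs
// described above so per-run variance (JIT warm-up, codegen class churn)
// becomes visible. Assumes a SparkSession named `spark`, as in spark-shell.
import spark.implicits._

val timingsMs = (1 to 100).map { i =>
  val start = System.nanoTime()
  spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()
  val elapsedMs = (System.nanoTime() - start) / 1000000
  println(s"run $i: $elapsedMs ms")
  elapsedMs
}

println(s"min = ${timingsMs.min} ms, max = ${timingsMs.max} ms")
{code}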
> Improves the row level metrics performance when reading Parquet table
> ----------------------------------------------------------------------
>
>                 Key: SPARK-16841
>                 URL: https://issues.apache.org/jira/browse/SPARK-16841
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Sean Zhong
>
> When reading a Parquet table, Spark adds row level metrics like recordsRead and bytesRead
> (https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93).
> The implementation is not very efficient. When the Parquet vectorized reader is not used, it may take 20% of the read time to update these metrics.
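To illustrate the kind of overhead described above, here is a hedged sketch, not the actual FileScanRDD code, of an iterator that updates a metric on every record versus one that accumulates locally and flushes in batches. The Metric class, method names, and batch size are invented for the example.

{code}
// Illustrative only: a toy metric and two iterator wrappers showing why a
// per-record metric update can be measurable on a hot read path. The real
// Spark code lives in FileScanRDD (linked above); these names are made up.
final class Metric { var value: Long = 0L; def add(n: Long): Unit = value += n }

// Updates the shared metric once per record.
def perRecord[T](it: Iterator[T], recordsRead: Metric): Iterator[T] =
  it.map { row => recordsRead.add(1); row }

// Accumulates locally and flushes every `batch` records, trading a small
// reporting delay for far fewer updates to the shared metric object.
def batched[T](it: Iterator[T], recordsRead: Metric, batch: Int = 1000): Iterator[T] =
  new Iterator[T] {
    private var pending = 0L
    def hasNext: Boolean = {
      if (!it.hasNext && pending > 0) { recordsRead.add(pending); pending = 0 }
      it.hasNext
    }
    def next(): T = {
      pending += 1
      if (pending >= batch) { recordsRead.add(pending); pending = 0 }
      it.next()
    }
  }
{code}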