[jira] [Created] (SPARK-33583) Query on large dataset with forEachPartitionAsync performance needs to improve
Miron created SPARK-33583:
-------------------------------

             Summary: Query on large dataset with forEachPartitionAsync performance needs to improve
                 Key: SPARK-33583
                 URL: https://issues.apache.org/jira/browse/SPARK-33583
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.4
         Environment: Spark 2.4.4
                      Scala 2.11.10
            Reporter: Miron

Repro steps:

1. Load 300GB of data from a JSON file into a table. Note that the table has a field ID that groups the rows into reasonably large sets, roughly 50,000 rows per ID.
2. Issue a query against this table, returning a DataFrame instance.
3. Harvest the rows via df.rdd.foreachPartitionAsync.
4. Place a logging statement as the first line of the outer lambda expression that iterates over partitions. Let's say it reads "Line #1 ( some timestamp with milliseconds )".
5. Place a logging statement in the nested lambda expression that reads rows, such that it runs only when the first row is accessed. Let's say it reads "Line #2 ( some timestamp with milliseconds )".
6. Once the query completes, take the difference in milliseconds between the timestamps logged by line #1 and line #2 above.

It would be fairly reasonable to assume that the time difference should be as close to 0 as possible. In reality the difference is more than 1 second, usually more than 2.

This really hurts query performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
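The measurement described above can be sketched as follows. This is a minimal, hedged reconstruction of the repro, not the reporter's actual code: the SparkSession setup and the table name "events" are hypothetical stand-ins, and the log lines go to the executor stdout logs since the lambdas run on executors.

```scala
import scala.concurrent.Await
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SparkSession

object ForeachPartitionLatencyRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SPARK-33583-repro").getOrCreate()

    // Hypothetical query; stands in for the reporter's query over the 300GB table.
    val df = spark.sql("SELECT * FROM events")

    val future = df.rdd.foreachPartitionAsync { rows =>
      // Line #1: logged as soon as the partition lambda is entered.
      val t1 = System.currentTimeMillis()
      println(s"Line #1 ( $t1 )")

      var first = true
      rows.foreach { row =>
        if (first) {
          // Line #2: logged only when the first row is actually read.
          // The reported issue is that t2 - t1 exceeds 1-2 seconds.
          val t2 = System.currentTimeMillis()
          println(s"Line #2 ( $t2 ), delta = ${t2 - t1} ms")
          first = false
        }
        // ... process the row ...
      }
    }

    // foreachPartitionAsync returns a FutureAction; block until it completes.
    Await.result(future, Duration.Inf)
    spark.stop()
  }
}
```

Comparing the two timestamps per partition, as in the delta above, is what yields the 1-2 second gap the report describes.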
[jira] [Updated] (SPARK-33583) Query on large dataset with foreachPartitionAsync performance needs to improve
[ https://issues.apache.org/jira/browse/SPARK-33583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Miron updated SPARK-33583:
--------------------------
    Summary: Query on large dataset with foreachPartitionAsync performance needs to improve  (was: Query on large dataset with forEachPartitionAsync performance needs to improve)