Miron created SPARK-33583:
-----------------------------

             Summary: Query on large dataset with foreachPartitionAsync performance needs to improve
                 Key: SPARK-33583
                 URL: https://issues.apache.org/jira/browse/SPARK-33583
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.4
         Environment: Spark 2.4.4
Scala 2.11.10
            Reporter: Miron

Repro steps:

1. Load 300 GB of data from a JSON file into a table. Note that this table has an ID field grouping rows into reasonably sized sets, roughly 50,000 rows per set.
2. Issue a query against this table, returning a DataFrame instance.
3. Harvest the rows via df.rdd.foreachPartitionAsync.
4. Place a logging line as the first statement of the outer lambda expression that iterates over partitions. Let's say it will read "Line #1 ( some timestamp with milliseconds )".
5. Place a logging line in the nested lambda expression that reads rows, such that it runs only when the first row is accessed. Let's say it will read "Line #2 ( some timestamp with milliseconds )".
6. Once the query completes, take the difference in milliseconds between the timestamps logged by line #1 and line #2.

It would be reasonable to expect this difference to be as close to 0 as possible. In reality the difference is more than 1 second, and usually more than 2. This seriously hurts query performance.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
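A minimal sketch of the repro steps above, assuming a JSON dataset at an illustrative path /data/input.json and an illustrative filter query; the object name, path, and query are assumptions, not taken from the report. It requires a Spark 2.4.x runtime to actually execute:

```scala
// Hypothetical repro sketch; paths, query, and names are illustrative.
import org.apache.spark.sql.SparkSession
import scala.concurrent.Await
import scala.concurrent.duration.Duration

object PartitionLatencyRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("foreachPartitionAsync latency repro")
      .getOrCreate()

    // Step 1: load JSON data (~300 GB in the report) into a table.
    spark.read.json("/data/input.json").createOrReplaceTempView("events")

    // Step 2: query the table, returning a DataFrame. The WHERE clause
    // here is an assumed placeholder for selecting one ID set.
    val df = spark.sql("SELECT * FROM events WHERE id = 42")

    // Step 3: harvest rows with foreachPartitionAsync.
    val job = df.rdd.foreachPartitionAsync { rows =>
      // Step 4, "Line #1": logged as soon as the partition lambda starts.
      println(s"Line #1 ( ${System.currentTimeMillis()} )")
      var first = true
      rows.foreach { row =>
        if (first) {
          // Step 5, "Line #2": logged when the first row is materialized.
          println(s"Line #2 ( ${System.currentTimeMillis()} )")
          first = false
        }
        // ... row harvesting ...
      }
    }

    // foreachPartitionAsync returns a FutureAction (a scala.concurrent.Future);
    // block until the job finishes, then compare the two timestamps (step 6).
    Await.result(job, Duration.Inf)
    spark.stop()
  }
}
```

The gap between the "Line #1" and "Line #2" timestamps is the latency the report describes as exceeding 1-2 seconds.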