Miron created SPARK-33583:
-----------------------------

             Summary: Query on large dataset with foreachPartitionAsync 
performance needs to improve
                 Key: SPARK-33583
                 URL: https://issues.apache.org/jira/browse/SPARK-33583
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.4
         Environment: Spark 2.4.4

Scala 2.11.10
            Reporter: Miron


Repro steps:

Load 300GB of data from JSON file into a table.

Note that this table has a field ID that groups the rows into reasonably 
sized sets, roughly 50,000 rows per set.

Issue a query against this table that returns a DataFrame instance.

Harvest the rows via df.rdd.foreachPartitionAsync.

Place a logging line as the first statement of the lambda expression that 
iterates over the partitions.

Let's say it will read "Line #1 ( some timestamp with milliseconds )"

Place a logging line into the nested lambda expression that reads the rows, 
such that it runs only when the first row is accessed.

Let's say it will read "Line #2 ( some timestamp with milliseconds )"

Once the query has completed, take the time difference in milliseconds 
between the timestamps noted in the logging records from line #1 and line #2 
above.

It would be fairly reasonable to expect this difference to be as close to 0 
as possible. In reality the difference is more than 1 second, and usually 
more than 2.

This really hurts query performance.
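The measurement described above can be sketched as follows. This is a minimal, self-contained harness in plain Scala: an ordinary Iterator stands in for an RDD partition, and an in-memory buffer stands in for the log file (in the actual repro, the outer lambda would be the one passed to df.rdd.foreachPartitionAsync). The object and method names here are illustrative, not from the reporter's code.

```scala
// Sketch of the first-row-delay measurement. The partition-level lambda
// logs "Line #1" on entry; the row-level code logs "Line #2" only when
// the first row is actually pulled from the iterator.
object FirstRowDelayRepro {
  // Collect (label, timestampMillis) pairs instead of writing to a log file.
  val log = scala.collection.mutable.ArrayBuffer.empty[(String, Long)]

  // In the real repro this function would be passed to
  // df.rdd.foreachPartitionAsync { rows => ... }.
  def processPartition(rows: Iterator[String]): Unit = {
    // Line #1: first statement of the partition-level lambda.
    log += (("Line #1", System.currentTimeMillis()))
    var first = true
    rows.foreach { row =>
      if (first) {
        // Line #2: runs only on access to the first row.
        log += (("Line #2", System.currentTimeMillis()))
        first = false
      }
      // ... per-row harvesting would go here ...
    }
  }

  // Difference between the two logged timestamps, in milliseconds.
  def firstRowDelayMillis(): Long = {
    val t1 = log.find(_._1 == "Line #1").map(_._2).get
    val t2 = log.find(_._1 == "Line #2").map(_._2).get
    t2 - t1
  }

  def main(args: Array[String]): Unit = {
    processPartition(Iterator("row1", "row2", "row3"))
    println(s"first-row delay: ${firstRowDelayMillis()} ms")
  }
}
```

With a plain in-memory iterator the delay is effectively 0 ms; the issue reported here is that against a 300 GB table the same two log points end up more than a second apart.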



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
