[jira] [Created] (SPARK-33583) Query on large dataset with forEachPartitionAsync performance needs to improve

2020-11-28 Thread Miron (Jira)
Miron created SPARK-33583:
-

 Summary: Query on large dataset with forEachPartitionAsync 
performance needs to improve
 Key: SPARK-33583
 URL: https://issues.apache.org/jira/browse/SPARK-33583
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.4
 Environment: Spark 2.4.4

Scala 2.11.10
Reporter: Miron


Repro steps:

Load 300GB of data from JSON file into a table.

Note that this table has an ID field that groups rows into reasonably sized 
sets, identified by ID, of roughly 50,000 rows each.

Issue a query against this table returning a DataFrame instance.

Harvest rows via df.rdd.foreachPartitionAsync.

Place a logging line as the first statement of the outer lambda expression 
that iterates over partitions.

Let's say it will read "Line #1 ( some timestamp with milliseconds )"

Place a logging line into the nested lambda expression that reads rows, such 
that it runs only when the first row is accessed.

Let's say it will read "Line #2 ( some timestamp with milliseconds )"

Once the query completes, take the time difference in milliseconds between 
the timestamps noted in the logging records from line #1 and line #2 above.

It would be fairly reasonable to assume that this difference should be as 
close to 0 as possible. In reality the difference is more than 1 second, 
usually more than 2.

This really hurts query performance.
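
A minimal sketch of the repro steps above (assumptions: the table name "events", 
the column "id", and the filter value are placeholders, and a SparkSession plus 
a running Spark 2.4.x cluster are available):

```scala
import scala.concurrent.Await
import scala.concurrent.duration.Duration

import org.apache.spark.sql.SparkSession

object Spark33583Repro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SPARK-33583-repro").getOrCreate()

    // Hypothetical table loaded from ~300GB of JSON; "events" and "id" are placeholder names.
    val df = spark.sql("SELECT * FROM events WHERE id = 42")

    val action = df.rdd.foreachPartitionAsync { rows =>
      // Line #1: logged as soon as the partition lambda starts.
      println(s"Line #1 ( ${System.currentTimeMillis()} )")
      var first = true
      rows.foreach { row =>
        if (first) {
          // Line #2: logged only when the first row is actually accessed.
          println(s"Line #2 ( ${System.currentTimeMillis()} )")
          first = false
        }
        // ... process row ...
      }
    }
    // foreachPartitionAsync returns a FutureAction; block until it finishes.
    Await.result(action, Duration.Inf)
    // Compare the Line #1 and Line #2 timestamps per partition: the report is
    // that the gap is often over 1-2 seconds rather than near 0 ms.
  }
}
```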



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33583) Query on large dataset with foreachPartitionAsync performance needs to improve

2020-11-28 Thread Miron (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miron updated SPARK-33583:
--
Summary: Query on large dataset with foreachPartitionAsync performance 
needs to improve  (was: Query on large dataset with forEachPartitionAsync 
performance needs to improve)

> Query on large dataset with foreachPartitionAsync performance needs to improve
> --
>
> Key: SPARK-33583
> URL: https://issues.apache.org/jira/browse/SPARK-33583
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
> Environment: Spark 2.4.4
> Scala 2.11.10
>Reporter: Miron
>Priority: Major
>
> Repro steps:
> Load 300GB of data from JSON file into a table.
> Note that this table has an ID field that groups rows into reasonably sized 
> sets, identified by ID, of roughly 50,000 rows each.
> Issue a query against this table returning a DataFrame instance.
> Harvest rows via df.rdd.foreachPartitionAsync.
> Place a logging line as the first statement of the outer lambda expression 
> that iterates over partitions.
> Let's say it will read "Line #1 ( some timestamp with milliseconds )"
> Place a logging line into the nested lambda expression that reads rows, such 
> that it runs only when the first row is accessed.
> Let's say it will read "Line #2 ( some timestamp with milliseconds )"
> Once the query completes, take the time difference in milliseconds between 
> the timestamps noted in the logging records from line #1 and line #2 above.
> It would be fairly reasonable to assume that this difference should be as 
> close to 0 as possible. In reality the difference is more than 1 second, 
> usually more than 2.
> This really hurts query performance.


