[jira] [Updated] (SPARK-29089) DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading large amount of S3 files

Arwin S Tio (Jira) Sun, 15 Sep 2019 03:59:27 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-29089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Arwin S Tio updated SPARK-29089:
--------------------------------
    Target Version/s:   (was: 2.4.5, 3.0.0)

> DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when 
> reading large amount of S3 files
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-29089
>                 URL: https://issues.apache.org/jira/browse/SPARK-29089
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.4
>            Reporter: Arwin S Tio
>            Priority: Minor
>
> When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
> noticed that it took about an hour for the files to be loaded on the driver.
>  
>  The log timestamps show the gap: nothing is logged between 07:45:40 and 
> 08:54:57, when the InMemoryFileIndex message appears:
> {quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
>  19/09/06 07:44:42 INFO SparkContext: Submitted application: 
> LoglineParquetGenerator
>  ...
>  19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
> StateStoreCoordinator endpoint
>  19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under: [300K files...]
> {quote}
>  
> A major source of the bottleneck is 
> DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] 
> and do a [FileSystem#exists|#L557] on all the paths in a single thread. On 
> S3, each of these is a slow network call.
> After a discussion on the mailing list [0], it was suggested that an 
> improvement could be to:
>   
>  * have SparkHadoopUtil differentiate between files returned by 
> globStatus(), which therefore exist, and those which it didn't glob for; 
> it will only need to check the latter.
>  * add parallel execution to the glob and existence checks
>   
> I am currently working on a patch that implements this improvement.
>  [0] 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]
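
The two suggestions in the list above can be sketched in a language-neutral way. The following is a minimal Python illustration, not the Spark patch itself: the function names, the `glob` and `exists` callables, the thread count, and the set of glob characters are all assumptions made for the example. It resolves each path on a thread pool and skips the existence check for anything a glob returned, since those results are already known to exist.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List

# Assumed set of characters that mark a path as a glob pattern.
GLOB_CHARS = set("{}[]*?\\")


def is_glob(path: str) -> bool:
    """Return True if the path contains any glob metacharacter."""
    return any(c in GLOB_CHARS for c in path)


def check_and_glob_paths(
    paths: Iterable[str],
    glob: Callable[[str], List[str]],   # hypothetical: expands a pattern
    exists: Callable[[str], bool],      # hypothetical: one network call
    num_threads: int = 8,               # hypothetical default
) -> List[str]:
    """Resolve all paths in parallel, preserving input order."""

    def resolve(path: str) -> List[str]:
        if is_glob(path):
            matches = glob(path)
            if not matches:
                raise FileNotFoundError(f"Path does not exist: {path}")
            # Returned by the glob, so no separate existence check needed.
            return matches
        # Literal path: this is the only case that needs an exists() call.
        if not exists(path):
            raise FileNotFoundError(f"Path does not exist: {path}")
        return [path]

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() keeps input order and re-raises the first worker exception.
        groups = list(pool.map(resolve, paths))
    return [p for group in groups for p in group]
```

With slow remote calls, the wall-clock time of this approach is bounded by the thread count rather than by the number of paths, and glob results cost no extra round trip.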



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

