[ https://issues.apache.org/jira/browse/SPARK-29089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arwin S Tio updated SPARK-29089:
--------------------------------
    Target Version/s:   (was: 2.4.5, 3.0.0)

> DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading large amount of S3 files
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-29089
>                 URL: https://issues.apache.org/jira/browse/SPARK-29089
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.4
>            Reporter: Arwin S Tio
>            Priority: Minor
>
> When using DataFrameReader#csv to read many S3 files (in my case 300k), I
> noticed that it took about an hour for the files to be loaded on the driver.
>
> The log timestamps show the gap: the InMemoryFileIndex line appears at 08:54,
> more than an hour after the preceding line at 07:45:
> {quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
> 19/09/06 07:44:42 INFO SparkContext: Submitted application: LoglineParquetGenerator
> ...
> 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
> 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories in parallel under: [300K files...]
> {quote}
>
> A major source of the bottleneck is DataSource#checkAndGlobPathIfNecessary,
> which will [(possibly) glob|#L549] and do a [FileSystem#exists|#L557] on all
> the paths in a single thread. On S3, each of these is a slow network call.
>
> After a discussion on the mailing list [0], it was suggested that an
> improvement could be to:
>
> * have SparkHadoopUtil differentiate between files returned by globStatus(),
> which therefore exist, and those it didn't glob for; it will only need to
> check the latter.
> * add parallel execution to the glob and existence checks
>
> I am currently working on a patch that implements this improvement.
>
> [0] [http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]
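
The two suggestions above can be illustrated with a small, language-agnostic sketch. It is not the patch mentioned in the issue and does not use Spark or Hadoop APIs: `FakeStore`, `resolve_one`, and `resolve_paths` are hypothetical names, the store is an in-memory stand-in for S3, and Python's `fnmatch` only approximates Hadoop glob syntax. It shows (1) skipping the existence check for paths that came back from a glob, and (2) running the per-path work on a thread pool instead of a single thread.

```python
import fnmatch
import threading
from concurrent.futures import ThreadPoolExecutor

# Characters treated as glob syntax in this sketch (an assumption; Hadoop's
# glob grammar is richer, e.g. it also supports {a,b} alternation).
GLOB_CHARS = frozenset("*?[")


class FakeStore:
    """In-memory stand-in for a remote object store (hypothetical)."""

    def __init__(self, keys):
        self._keys = sorted(keys)
        self._lock = threading.Lock()
        self.exists_calls = 0  # counts simulated existence round trips

    def glob(self, pattern):
        # A listing-based glob: every returned key is known to exist.
        return [k for k in self._keys if fnmatch.fnmatchcase(k, pattern)]

    def exists(self, path):
        with self._lock:
            self.exists_calls += 1
        return path in self._keys


def resolve_one(store, path):
    """Expand one input path; check existence only if it was not globbed."""
    if any(c in GLOB_CHARS for c in path):
        matches = store.glob(path)
        if not matches:
            raise FileNotFoundError(f"Path does not exist: {path}")
        return matches  # no exists() call needed for glob results
    if not store.exists(path):
        raise FileNotFoundError(f"Path does not exist: {path}")
    return [path]


def resolve_paths(store, paths, num_threads=8):
    """Resolve all input paths concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() yields results in input order and re-raises worker errors.
        groups = list(pool.map(lambda p: resolve_one(store, p), paths))
    return [p for group in groups for p in group]
```

For example, `resolve_paths(store, ["s3://b/logs/a*.csv", "s3://b/other/x.csv"])` issues one simulated glob and one simulated existence check, and the two calls can overlap in time. Threads suit this kind of work because it is dominated by network latency rather than CPU.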