[jira] [Updated] (SPARK-29089) DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading large amount of S3 files

2024-04-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-29089:
---
Labels: pull-request-available  (was: )

> DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when 
> reading large amount of S3 files
> --
>
> Key: SPARK-29089
> URL: https://issues.apache.org/jira/browse/SPARK-29089
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Arwin S Tio
>Assignee: Arwin S Tio
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.1.0
>
>
> When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
> noticed that it took about an hour for the files to be loaded on the driver.
>  
>  You can see the gap in the timestamps: the InMemoryFileIndex log line does 
> not appear until 08:54, more than an hour after startup at 07:45:
> {quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
>  19/09/06 07:44:42 INFO SparkContext: Submitted application: 
> LoglineParquetGenerator
>  ...
>  19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
> StateStoreCoordinator endpoint
>  19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under: [300K files...]
> {quote}
>  
> A major source of the bottleneck comes from 
> DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] 
> and do a [FileSystem#exists|#L557] on all the paths in a single thread. On 
> S3, these are slow network calls.
> After a discussion on the mailing list [0], it was suggested that an 
> improvement could be to:
>   
>  * have SparkHadoopUtil distinguish between paths returned by globStatus(), 
> which are therefore known to exist, and paths that were not globbed for; only 
> the latter still need an existence check.
>  * add parallel execution to the glob and existence checks
>   
> I am currently working on a patch that implements this improvement
>  [0] 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]
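
The following is an illustrative Scala sketch of the approach described above, 
not the actual SPARK-29089 patch: glob patterns are expanded with globStatus(), 
whose results are known to exist, while only the remaining literal paths get an 
explicit exists() call, and both phases run on an ExecutionContext instead of a 
single thread. The object and method names (PathCheckSketch, checkAndGlob) are 
hypothetical; the glob-character test mirrors SparkHadoopUtil.isGlobPath.

{code:scala}
import java.io.FileNotFoundException

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper, for illustration only (not the SPARK-29089 patch).
object PathCheckSketch {

  // Same character test used by Spark's SparkHadoopUtil.isGlobPath.
  private def isGlobPath(p: Path): Boolean =
    p.toString.exists("{}[]*?\\".contains(_))

  def checkAndGlob(paths: Seq[String], hadoopConf: Configuration)
                  (implicit ec: ExecutionContext): Seq[Path] = {
    val (globs, literals) = paths.map(new Path(_)).partition(isGlobPath)

    // Glob patterns: globStatus() already talks to the store, and every status
    // it returns is known to exist, so no separate exists() call is needed.
    val globbed: Future[Seq[Path]] = Future.traverse(globs) { p =>
      Future {
        val fs = p.getFileSystem(hadoopConf)
        Option(fs.globStatus(p)).map(_.toSeq.map(_.getPath)).getOrElse(
          throw new FileNotFoundException(s"Path does not exist: $p"))
      }
    }.map(_.flatten)

    // Literal paths: the only ones that still need an existence check, run
    // concurrently instead of one call at a time.
    val checked: Future[Seq[Path]] = Future.traverse(literals) { p =>
      Future {
        val fs = p.getFileSystem(hadoopConf)
        if (!fs.exists(p)) {
          throw new FileNotFoundException(s"Path does not exist: $p")
        }
        p
      }
    }

    Await.result(globbed.zip(checked).map { case (g, c) => g ++ c }, Duration.Inf)
  }
}
{code}

On S3A, every globStatus()/exists() call is at least one HTTP round trip, so the 
gain comes from issuing those round trips concurrently rather than serially on 
the driver.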



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29089) DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading large amount of S3 files

2019-09-16 Thread Arwin S Tio (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arwin S Tio updated SPARK-29089:

Component/s: (was: Spark Core)
 SQL

> DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when 
> reading large amount of S3 files
> --
>
> Key: SPARK-29089
> URL: https://issues.apache.org/jira/browse/SPARK-29089
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Arwin S Tio
>Priority: Minor
>
> When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
> noticed that it took about an hour for the files to be loaded on the driver.
>  
>  You can see the gap in the timestamps: the InMemoryFileIndex log line does 
> not appear until 08:54, more than an hour after startup at 07:45:
> {quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
>  19/09/06 07:44:42 INFO SparkContext: Submitted application: 
> LoglineParquetGenerator
>  ...
>  19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
> StateStoreCoordinator endpoint
>  19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under: [300K files...]
> {quote}
>  
> A major source of the bottleneck comes from 
> DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] 
> and do a [FileSystem#exists|#L557] on all the paths in a single thread. On 
> S3, these are slow network calls.
> After a discussion on the mailing list [0], it was suggested that an 
> improvement could be to:
>   
>  * have SparkHadoopUtil distinguish between paths returned by globStatus(), 
> which are therefore known to exist, and paths that were not globbed for; only 
> the latter still need an existence check.
>  * add parallel execution to the glob and existence checks
>   
> I am currently working on a patch that implements this improvement
>  [0] 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]
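
For context, the call pattern that triggers this code path is simply a 
multi-path read. Below is a minimal, hypothetical reproduction (the bucket name 
and key layout are invented): every explicit path handed to DataFrameReader#csv 
is globbed or existence-checked on the driver in 
DataSource#checkAndGlobPathIfNecessary before any scan is planned.

{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical reproduction of the slow driver-side path checking.
object ManyPathsRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ManyPathsRead").getOrCreate()

    // Invented bucket/key layout: ~300k explicit object paths, no glob pattern.
    val paths: Seq[String] =
      (0 until 300000).map(i => s"s3a://my-bucket/logs/part-$i.csv")

    // DataFrameReader#csv(paths: String*) -- each path is checked one at a time
    // on the driver before the scan is planned.
    val df = spark.read.option("header", "true").csv(paths: _*)

    println(df.schema.treeString)
    spark.stop()
  }
}
{code}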



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29089) DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading large amount of S3 files

2019-09-15 Thread Arwin S Tio (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arwin S Tio updated SPARK-29089:

Target Version/s:   (was: 2.4.5, 3.0.0)

> DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when 
> reading large amount of S3 files
> --
>
> Key: SPARK-29089
> URL: https://issues.apache.org/jira/browse/SPARK-29089
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Arwin S Tio
>Priority: Minor
>
> When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
> noticed that it took about an hour for the files to be loaded on the driver.
>  
>  You can see the gap in the timestamps: the InMemoryFileIndex log line does 
> not appear until 08:54, more than an hour after startup at 07:45:
> {quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
>  19/09/06 07:44:42 INFO SparkContext: Submitted application: 
> LoglineParquetGenerator
>  ...
>  19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
> StateStoreCoordinator endpoint
>  19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under: [300K files...]
> {quote}
>  
> A major source of the bottleneck comes from 
> DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] 
> and do a [FileSystem#exists|#L557] on all the paths in a single thread. On 
> S3, these are slow network calls.
> After a discussion on the mailing list [0], it was suggested that an 
> improvement could be to:
>   
>  * have SparkHadoopUtil distinguish between paths returned by globStatus(), 
> which are therefore known to exist, and paths that were not globbed for; only 
> the latter still need an existence check.
>  * add parallel execution to the glob and existence checks
>   
> I am currently working on a patch that implements this improvement
>  [0] 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29089) DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading large amount of S3 files

2019-09-15 Thread Arwin S Tio (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arwin S Tio updated SPARK-29089:

Description: 
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
 You can see the gap in the timestamps: the InMemoryFileIndex log line does 
not appear until 08:54, more than an hour after startup at 07:45:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
 ...
 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
in parallel under: [300K files...]
{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] and 
do a [FileSystem#exists|#L557] on all the paths in a single thread. On S3, 
these are slow network calls.

After a discussion on the mailing list [0], it was suggested that an 
improvement could be to:
  
 * have SparkHadoopUtil distinguish between paths returned by globStatus(), 
which are therefore known to exist, and paths that were not globbed for; only 
the latter still need an existence check.
 * add parallel execution to the glob and existence checks
  

I am currently working on a patch that implements this improvement

 [0] 
[http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]

  was:
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
 You can see the gap in the timestamps: the InMemoryFileIndex log line does 
not appear until 08:54, more than an hour after startup at 07:45:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
 ...
 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
in parallel under: [300K files...]
{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] and 
do a [FileSystem#exists|#L557] on all the paths in a single thread. On S3, 
these are slow network calls.

After a discussion on the mailing list [0], it was suggested that an 
improvement could be to:
  
 * have SparkHadoopUtil distinguish between paths returned by globStatus(), 
which are therefore known to exist, and paths that were not globbed for; only 
the latter still need an existence check.
 * add parallel execution to the glob and existence checks
  

I am currently working on a patch 

 [0] 
[http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]


> DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when 
> reading large amount of S3 files
> --
>
> Key: SPARK-29089
> URL: https://issues.apache.org/jira/browse/SPARK-29089
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Arwin S Tio
>Priority: Minor
>
> When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
> noticed that it took about an hour for the files to be loaded on the driver.
>  
>  You can see the gap in the timestamps: the InMemoryFileIndex log line does 
> not appear until 08:54, more than an hour after startup at 07:45:
> {quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
>  19/09/06 07:44:42 INFO SparkContext: Submitted application: 
> LoglineParquetGenerator
>  ...
>  19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
> StateStoreCoordinator endpoint
>  19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under: [300K files...]
> {quote}
>  
> A major source of the bottleneck comes from 
> DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] 
> and do a [FileSystem#exists|#L557] on all the paths in a single thread. On 
> S3, these are slow network calls.
> After a discussion on the mailing list [0], it was suggested that an 
> improvement could be to:
>   
>  * have SparkHadoopUtil distinguish between paths returned by globStatus(), 
> which are therefore known to exist, and paths that were not globbed for; only 
> the latter still need an existence check.
>  * add parallel execution to the glob and existence checks
>   
> I am currently working on a patch that implements this improvement
>  [0] 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]

[jira] [Updated] (SPARK-29089) DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading large amount of S3 files

2019-09-15 Thread Arwin S Tio (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arwin S Tio updated SPARK-29089:

Description: 
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
 You can see the timestamp difference when the log from InMemoryFileIndex 
occurs from 7:45 to 8:54:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
 ...
 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
in parallel under: [300K files...]
{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] and 
do a [FileSystem#exists|#L557] on all the paths in a single thread. On S3, 
these are slow network calls.

After a discussion on the mailing list [0], it was suggested that an 
improvement could be to:
  
 * have SparkHadoopUtils differentiate between files returned by globStatus(), 
and which therefore exist, and those which it didn't glob for -it will only 
need to check those. 
 * add parallel execution to the glob and existence checks
  

 [0] 
[http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]

  was:
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
 You can see the timestamp difference when the log from InMemoryFileIndex 
occurs from 7:45 to 8:54:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
 ...
 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
in parallel under: [300K files...]
{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] and 
do a [FileSystem#exists|#L557] on all the paths in a single thread. On S3, 
these are slow network calls.

After a discussion on the mailing list, it was suggested that an improvement 
could be to:
  
 * have SparkHadoopUtils differentiate between files returned by globStatus(), 
and which therefore exist, and those which it didn't glob for -it will only 
need to check those. 
 * add parallel execution to the glob and existence checks
  


> DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when 
> reading large amount of S3 files
> --
>
> Key: SPARK-29089
> URL: https://issues.apache.org/jira/browse/SPARK-29089
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Arwin S Tio
>Priority: Minor
>
> When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
> noticed that it took about an hour for the files to be loaded on the driver.
>  
>  You can see the timestamp difference when the log from InMemoryFileIndex 
> occurs from 7:45 to 8:54:
> {quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
>  19/09/06 07:44:42 INFO SparkContext: Submitted application: 
> LoglineParquetGenerator
>  ...
>  19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
> StateStoreCoordinator endpoint
>  19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under: [300K files...]
> {quote}
>  
> A major source of the bottleneck comes from 
> DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] 
> and do a [FileSystem#exists|#L557] on all the paths in a single thread. On 
> S3, these are slow network calls.
> After a discussion on the mailing list [0], it was suggested that an 
> improvement could be to:
>   
>  * have SparkHadoopUtils differentiate between files returned by 
> globStatus(), and which therefore exist, and those which it didn't glob for 
> -it will only need to check those. 
>  * add parallel execution to the glob and existence checks
>   
>  [0] 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-29089) DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading large amount of S3 files

2019-09-15 Thread Arwin S Tio (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arwin S Tio updated SPARK-29089:

Description: 
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
 You can see the timestamp difference when the log from InMemoryFileIndex 
occurs from 7:45 to 8:54:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
 ...
 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
in parallel under: [300K files...]
{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] and 
do a [FileSystem#exists|#L557] on all the paths in a single thread. On S3, 
these are slow network calls.

After a discussion on the mailing list [0], it was suggested that an 
improvement could be to:
  
 * have SparkHadoopUtils differentiate between files returned by globStatus(), 
and which therefore exist, and those which it didn't glob for -it will only 
need to check those. 
 * add parallel execution to the glob and existence checks
  

I am currently working on a patch 

 [0] 
[http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]

  was:
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
 You can see the timestamp difference when the log from InMemoryFileIndex 
occurs from 7:45 to 8:54:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
 ...
 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
in parallel under: [300K files...]
{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] and 
do a [FileSystem#exists|#L557] on all the paths in a single thread. On S3, 
these are slow network calls.

After a discussion on the mailing list [0], it was suggested that an 
improvement could be to:
  
 * have SparkHadoopUtils differentiate between files returned by globStatus(), 
and which therefore exist, and those which it didn't glob for -it will only 
need to check those. 
 * add parallel execution to the glob and existence checks
  

 [0] 
[http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]


> DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when 
> reading large amount of S3 files
> --
>
> Key: SPARK-29089
> URL: https://issues.apache.org/jira/browse/SPARK-29089
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Arwin S Tio
>Priority: Minor
>
> When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
> noticed that it took about an hour for the files to be loaded on the driver.
>  
>  You can see the timestamp difference when the log from InMemoryFileIndex 
> occurs from 7:45 to 8:54:
> {quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
>  19/09/06 07:44:42 INFO SparkContext: Submitted application: 
> LoglineParquetGenerator
>  ...
>  19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
> StateStoreCoordinator endpoint
>  19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under: [300K files...]
> {quote}
>  
> A major source of the bottleneck comes from 
> DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] 
> and do a [FileSystem#exists|#L557] on all the paths in a single thread. On 
> S3, these are slow network calls.
> After a discussion on the mailing list [0], it was suggested that an 
> improvement could be to:
>   
>  * have SparkHadoopUtils differentiate between files returned by 
> globStatus(), and which therefore exist, and those which it didn't glob for 
> -it will only need to check those. 
>  * add parallel execution to the glob and existence checks
>   
> I am currently working on a patch 
>  [0] 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]

[jira] [Updated] (SPARK-29089) DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading large amount of S3 files

2019-09-15 Thread Arwin S Tio (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arwin S Tio updated SPARK-29089:

Description: 
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
 You can see the timestamp difference when the log from InMemoryFileIndex 
occurs from 7:45 to 8:54:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
 ...
 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
in parallel under: [300K files...]
{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] and 
do a [FileSystem#exists|#L557] on all the paths in a single thread. On S3, 
these are slow network calls.

After a discussion on the mailing list, it was suggested that an improvement 
could be to:
  
 * have SparkHadoopUtils differentiate between files returned by globStatus(), 
and which therefore exist, and those which it didn't glob for -it will only 
need to check those. 
 * add parallel execution to the glob and existence checks
  

  was:
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
 You can see the timestamp difference when the log from InMemoryFileIndex 
occurs from 7:45 to 8:54:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
 ...
 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
in parallel under: [300K files...]
{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] and 
do a [FileSystem#exists|#L557]] on all the paths in a single thread. On S3, 
these are slow network calls.

After a discussion on the 
[mailing|[https://apache.com|http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
 it was suggested that an improvement could be to:
  
 * have SparkHadoopUtils differentiate between files returned by globStatus(), 
and which therefore exist, and those which it didn't glob for -it will only 
need to check those. 
 * add parallel execution to the glob and existence checks
  


> DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when 
> reading large amount of S3 files
> --
>
> Key: SPARK-29089
> URL: https://issues.apache.org/jira/browse/SPARK-29089
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Arwin S Tio
>Priority: Minor
>
> When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
> noticed that it took about an hour for the files to be loaded on the driver.
>  
>  You can see the timestamp difference when the log from InMemoryFileIndex 
> occurs from 7:45 to 8:54:
> {quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
>  19/09/06 07:44:42 INFO SparkContext: Submitted application: 
> LoglineParquetGenerator
>  ...
>  19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
> StateStoreCoordinator endpoint
>  19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under: [300K files...]
> {quote}
>  
> A major source of the bottleneck comes from 
> DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] 
> and do a [FileSystem#exists|#L557] on all the paths in a single thread. On 
> S3, these are slow network calls.
> After a discussion on the mailing list, it was suggested that an improvement 
> could be to:
>   
>  * have SparkHadoopUtils differentiate between files returned by 
> globStatus(), and which therefore exist, and those which it didn't glob for 
> -it will only need to check those. 
>  * add parallel execution to the glob and existence checks
>   



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29089) DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading large amount of S3 files

2019-09-15 Thread Arwin S Tio (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arwin S Tio updated SPARK-29089:

Description: 
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
 You can see the timestamp difference when the log from InMemoryFileIndex 
occurs from 7:45 to 8:54:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
 ...
 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
in parallel under: [300K files...]
{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] and 
do a [FileSystem#exists|#L557]] on all the paths in a single thread. On S3, 
these are slow network calls.

After a discussion on the 
[mailing|[https://apache.com|http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
 it was suggested that an improvement could be to:
  
 * have SparkHadoopUtils differentiate between files returned by globStatus(), 
and which therefore exist, and those which it didn't glob for -it will only 
need to check those. 
 * add parallel execution to the glob and existence checks
  

  was:
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
 You can see the timestamp difference when the log from InMemoryFileIndex 
occurs from 7:45 to 8:54:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
 ...
 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
in parallel under: [300K files...]
{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] and 
do a [FileSystem#exists|#L557]] on all the paths in a single thread. On S3, 
these are slow network calls.

After a discussion on the [mailing 
list|[https://apache.com|http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
 it was suggested that an improvement could be to:
  
 * have SparkHadoopUtils differentiate between files returned by globStatus(), 
and which therefore exist, and those which it didn't glob for -it will only 
need to check those. 
 * add parallel execution to the glob and existence checks
  


> DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when 
> reading large amount of S3 files
> --
>
> Key: SPARK-29089
> URL: https://issues.apache.org/jira/browse/SPARK-29089
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Arwin S Tio
>Priority: Minor
>
> When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
> noticed that it took about an hour for the files to be loaded on the driver.
>  
>  You can see the timestamp difference when the log from InMemoryFileIndex 
> occurs from 7:45 to 8:54:
> {quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
>  19/09/06 07:44:42 INFO SparkContext: Submitted application: 
> LoglineParquetGenerator
>  ...
>  19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
> StateStoreCoordinator endpoint
>  19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under: [300K files...]
> {quote}
>  
> A major source of the bottleneck comes from 
> DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] 
> and do a [FileSystem#exists|#L557]] on all the paths in a single thread. On 
> S3, these are slow network calls.
> After a discussion on the 
> [mailing|[https://apache.com|http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
>  it was suggested that an improvement could be to:
>   
>  * have SparkHadoopUtils differentiate between files returned by 
> globStatus(), and which therefore exist, and those which it didn't glob for 
> -it will only need to check those. 
>  * add parallel execution to the glob and existence checks
>   



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

[jira] [Updated] (SPARK-29089) DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading large amount of S3 files

2019-09-15 Thread Arwin S Tio (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arwin S Tio updated SPARK-29089:

Description: 
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
 You can see the timestamp difference when the log from InMemoryFileIndex 
occurs from 7:45 to 8:54:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
 ...
 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
in parallel under: [300K files...]
{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] and 
do a [FileSystem#exists|#L557]] on all the paths in a single thread. On S3, 
these are slow network calls.

After a discussion on the [mailing 
list|[https://apache.com|http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
 it was suggested that an improvement could be to:
  
 * have SparkHadoopUtils differentiate between files returned by globStatus(), 
and which therefore exist, and those which it didn't glob for -it will only 
need to check those. 
 * add parallel execution to the glob and existence checks
  

  was:
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
 You can see the timestamp difference when the log from InMemoryFileIndex 
occurs from 7:45 to 8:54:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
 ...
 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
in parallel under: [300K files...]
{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] and 
do a [FileSystem#exists|#L557]] on all the paths in a single thread. On S3, 
these are slow network calls.

After a discussion on the [mailing 
list|[http://apache.com|http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
 it was suggested that an improvement could be to:
  
 * have SparkHadoopUtils differentiate between files returned by globStatus(), 
and which therefore exist, and those which it didn't glob for -it will only 
need to check those. 
 * add parallel execution to the glob and existence checks
  


> DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when 
> reading large amount of S3 files
> --
>
> Key: SPARK-29089
> URL: https://issues.apache.org/jira/browse/SPARK-29089
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Arwin S Tio
>Priority: Minor
>
> When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
> noticed that it took about an hour for the files to be loaded on the driver.
>  
>  You can see the timestamp difference when the log from InMemoryFileIndex 
> occurs from 7:45 to 8:54:
> {quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
>  19/09/06 07:44:42 INFO SparkContext: Submitted application: 
> LoglineParquetGenerator
>  ...
>  19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
> StateStoreCoordinator endpoint
>  19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under: [300K files...]
> {quote}
>  
> A major source of the bottleneck comes from 
> DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] 
> and do a [FileSystem#exists|#L557]] on all the paths in a single thread. On 
> S3, these are slow network calls.
> After a discussion on the [mailing 
> list|[https://apache.com|http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
>  it was suggested that an improvement could be to:
>   
>  * have SparkHadoopUtils differentiate between files returned by 
> globStatus(), and which therefore exist, and those which it didn't glob for 
> -it will only need to check those. 
>  * add parallel execution to the glob and existence checks
>   



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

[jira] [Updated] (SPARK-29089) DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading large amount of S3 files

2019-09-15 Thread Arwin S Tio (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arwin S Tio updated SPARK-29089:

Description: 
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
 You can see the timestamp difference when the log from InMemoryFileIndex 
occurs from 7:45 to 8:54:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
 ...
 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
in parallel under: [300K files...]
{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] and 
do a [FileSystem#exists|#L557]] on all the paths in a single thread. On S3, 
these are slow network calls.

After a discussion on the [mailing 
list|[http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
 it was suggested that an improvement could be to:
  
 * have SparkHadoopUtils differentiate between files returned by globStatus(), 
and which therefore exist, and those which it didn't glob for -it will only 
need to check those. 
 * add parallel execution to the glob and existence checks
  

  was:
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
 You can see the timestamp difference when the log from InMemoryFileIndex 
occurs from 7:45 to 8:54:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
 ...
 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
in parallel under: [300K files...]
{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549]] and 
do a [FileSystem#exists|#L557]] on all the paths in a single thread. On S3, 
these are slow network calls.

After a discussion on the [mailing 
list|[http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
 it was suggested that an improvement could be to:
  
 * have SparkHadoopUtils differentiate between files returned by globStatus(), 
and which therefore exist, and those which it didn't glob for -it will only 
need to check those. 
 * add parallel execution to the glob and existence checks
  


> DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when 
> reading large amount of S3 files
> --
>
> Key: SPARK-29089
> URL: https://issues.apache.org/jira/browse/SPARK-29089
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Arwin S Tio
>Priority: Minor
>
> When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
> noticed that it took about an hour for the files to be loaded on the driver.
>  
>  You can see the timestamp difference when the log from InMemoryFileIndex 
> occurs from 7:45 to 8:54:
> {quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
>  19/09/06 07:44:42 INFO SparkContext: Submitted application: 
> LoglineParquetGenerator
>  ...
>  19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
> StateStoreCoordinator endpoint
>  19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under: [300K files...]
> {quote}
>  
> A major source of the bottleneck comes from 
> DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] 
> and do a [FileSystem#exists|#L557]] on all the paths in a single thread. On 
> S3, these are slow network calls.
> After a discussion on the [mailing 
> list|[http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
>  it was suggested that an improvement could be to:
>   
>  * have SparkHadoopUtils differentiate between files returned by 
> globStatus(), and which therefore exist, and those which it didn't glob for 
> -it will only need to check those. 
>  * add parallel execution to the glob and existence checks
>   



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (SPARK-29089) DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading large amount of S3 files

2019-09-15 Thread Arwin S Tio (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arwin S Tio updated SPARK-29089:

Description: 
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
 You can see the timestamp difference when the log from InMemoryFileIndex 
occurs from 7:45 to 8:54:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
 ...
 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
in parallel under: [300K files...]
{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] and 
do a [FileSystem#exists|#L557]] on all the paths in a single thread. On S3, 
these are slow network calls.

After a discussion on the [mailing 
list|[http://apache-spark-developers-list.1001551.n3.nabble.com|http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
 it was suggested that an improvement could be to:
  
 * have SparkHadoopUtils differentiate between files returned by globStatus(), 
and which therefore exist, and those which it didn't glob for -it will only 
need to check those. 
 * add parallel execution to the glob and existence checks
  

  was:
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
 You can see the timestamp difference when the log from InMemoryFileIndex 
occurs from 7:45 to 8:54:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
 ...
 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
in parallel under: [300K files...]
{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] and 
do a [FileSystem#exists|#L557]] on all the paths in a single thread. On S3, 
these are slow network calls.

After a discussion on the [mailing 
list|[http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
 it was suggested that an improvement could be to:
  
 * have SparkHadoopUtils differentiate between files returned by globStatus(), 
and which therefore exist, and those which it didn't glob for -it will only 
need to check those. 
 * add parallel execution to the glob and existence checks
  


> DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when 
> reading large amount of S3 files
> --
>
> Key: SPARK-29089
> URL: https://issues.apache.org/jira/browse/SPARK-29089
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Arwin S Tio
>Priority: Minor
>
> When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
> noticed that it took about an hour for the files to be loaded on the driver.
>  
>  You can see the timestamp difference when the log from InMemoryFileIndex 
> occurs from 7:45 to 8:54:
> {quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
>  19/09/06 07:44:42 INFO SparkContext: Submitted application: 
> LoglineParquetGenerator
>  ...
>  19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
> StateStoreCoordinator endpoint
>  19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under: [300K files...]
> {quote}
>  
> A major source of the bottleneck comes from 
> DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] 
> and do a [FileSystem#exists|#L557]] on all the paths in a single thread. On 
> S3, these are slow network calls.
> After a discussion on the [mailing 
> list|[http://apache-spark-developers-list.1001551.n3.nabble.com|http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
>  it was suggested that an improvement could be to:
>   
>  * have SparkHadoopUtils differentiate between files returned by 
> globStatus(), and which therefore exist, and those which it didn't glob for 
> -it will only need to check those. 
>  * add parallel execution to the glob and existence checks

[jira] [Updated] (SPARK-29089) DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading large amount of S3 files

2019-09-15 Thread Arwin S Tio (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arwin S Tio updated SPARK-29089:

Description: 
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
 You can see the timestamp difference when the log from InMemoryFileIndex 
occurs from 7:45 to 8:54:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
 ...
 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
in parallel under: [300K files...]
{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] and 
do a [FileSystem#exists|#L557]] on all the paths in a single thread. On S3, 
these are slow network calls.

After a discussion on the [mailing 
list|[http://apache-spark-developers-list.com|http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
 it was suggested that an improvement could be to:
  
 * have SparkHadoopUtils differentiate between files returned by globStatus(), 
and which therefore exist, and those which it didn't glob for -it will only 
need to check those. 
 * add parallel execution to the glob and existence checks
  

  was:
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
 You can see the timestamp difference when the log from InMemoryFileIndex 
occurs from 7:45 to 8:54:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
 ...
 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
in parallel under: [300K files...]
{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] and 
do a [FileSystem#exists|#L557]] on all the paths in a single thread. On S3, 
these are slow network calls.

After a discussion on the [mailing 
list|[http://apache-spark-developers-list.1001551.n3.nabble.com|http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
 it was suggested that an improvement could be to:
  
 * have SparkHadoopUtils differentiate between files returned by globStatus(), 
and which therefore exist, and those which it didn't glob for -it will only 
need to check those. 
 * add parallel execution to the glob and existence checks
  


> DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when 
> reading large amount of S3 files
> --
>
> Key: SPARK-29089
> URL: https://issues.apache.org/jira/browse/SPARK-29089
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Arwin S Tio
>Priority: Minor
>
> When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
> noticed that it took about an hour for the files to be loaded on the driver.
>  
>  You can see the timestamp difference when the log from InMemoryFileIndex 
> occurs from 7:45 to 8:54:
> {quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
>  19/09/06 07:44:42 INFO SparkContext: Submitted application: 
> LoglineParquetGenerator
>  ...
>  19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
> StateStoreCoordinator endpoint
>  19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under: [300K files...]
> {quote}
>  
> A major source of the bottleneck comes from 
> DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] 
> and do a [FileSystem#exists|#L557]] on all the paths in a single thread. On 
> S3, these are slow network calls.
> After a discussion on the [mailing 
> list|[http://apache-spark-developers-list.com|http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
>  it was suggested that an improvement could be to:
>   
>  * have SparkHadoopUtils differentiate between files returned by 
> globStatus(), and which therefore exist, and those which it didn't glob for 
> -it will only need to check those. 

[jira] [Updated] (SPARK-29089) DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading large amount of S3 files

2019-09-15 Thread Arwin S Tio (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arwin S Tio updated SPARK-29089:

Description: 
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
 You can see the timestamp difference when the log from InMemoryFileIndex 
occurs from 7:45 to 8:54:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
 ...
 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
in parallel under: [300K files...]
{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] and 
do a [FileSystem#exists|#L557]] on all the paths in a single thread. On S3, 
these are slow network calls.

After a discussion on the [mailing 
list|[http://apache.com|http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
 it was suggested that an improvement could be to:
  
 * have SparkHadoopUtils differentiate between files returned by globStatus(), 
and which therefore exist, and those which it didn't glob for -it will only 
need to check those. 
 * add parallel execution to the glob and existence checks
  

  was:
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
 You can see the timestamp difference when the log from InMemoryFileIndex 
occurs from 7:45 to 8:54:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
 ...
 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
in parallel under: [300K files...]
{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] and 
do a [FileSystem#exists|#L557]] on all the paths in a single thread. On S3, 
these are slow network calls.

After a discussion on the [mailing 
list|[http://apache-spark-developers-list.com|http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
 it was suggested that an improvement could be to:
  
 * have SparkHadoopUtils differentiate between files returned by globStatus(), 
and which therefore exist, and those which it didn't glob for -it will only 
need to check those. 
 * add parallel execution to the glob and existence checks
  


> DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when 
> reading large amount of S3 files
> --
>
> Key: SPARK-29089
> URL: https://issues.apache.org/jira/browse/SPARK-29089
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Arwin S Tio
>Priority: Minor
>
> When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
> noticed that it took about an hour for the files to be loaded on the driver.
>  
>  You can see the timestamp difference when the log from InMemoryFileIndex 
> occurs from 7:45 to 8:54:
> {quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
>  19/09/06 07:44:42 INFO SparkContext: Submitted application: 
> LoglineParquetGenerator
>  ...
>  19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
> StateStoreCoordinator endpoint
>  19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under: [300K files...]
> {quote}
>  
> A major source of the bottleneck comes from 
> DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] 
> and do a [FileSystem#exists|#L557]] on all the paths in a single thread. On 
> S3, these are slow network calls.
> After a discussion on the [mailing 
> list|[http://apache.com|http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
>  it was suggested that an improvement could be to:
>   
>  * have SparkHadoopUtils differentiate between files returned by 
> globStatus(), and which therefore exist, and those which it didn't glob for 
> -it will only need to check those. 
>  * add parallel execution to the glob and existence checks
> 

[jira] [Updated] (SPARK-29089) DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading large amount of S3 files

2019-09-15 Thread Arwin S Tio (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arwin S Tio updated SPARK-29089:

Description: 
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
 You can see the timestamp difference when the log from InMemoryFileIndex 
occurs from 7:45 to 8:54:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
 ...
 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
in parallel under: [300K files...]
{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549]] and 
do a [FileSystem#exists|#L557]] on all the paths in a single thread. On S3, 
these are slow network calls.

After a discussion on the [mailing list|http://test.com], it was suggested that 
an improvement could be to:
  
 * have SparkHadoopUtils differentiate between files returned by globStatus(), 
and which therefore exist, and those which it didn't glob for -it will only 
need to check those. 
 * add parallel execution to the glob and existence checks
  

  was:
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
 You can see the timestamp difference when the log from InMemoryFileIndex 
occurs from 7:45 to 8:54:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
 ...
 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
in parallel under: [300K files...]
{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549]] and 
do a [FileSystem#exists|#L557]] on all the paths in a single thread. On S3, 
these are slow network calls.

After a discussion on the [mailing 
list|[http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
 it was suggested that an improvement could be to:
  
 * have SparkHadoopUtils differentiate between files returned by globStatus(), 
and which therefore exist, and those which it didn't glob for -it will only 
need to check those. 
 * add parallel execution to the glob and existence checks
  


> DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when 
> reading large amount of S3 files
> --
>
> Key: SPARK-29089
> URL: https://issues.apache.org/jira/browse/SPARK-29089
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Arwin S Tio
>Priority: Minor
>
> When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
> noticed that it took about an hour for the files to be loaded on the driver.
>  
>  You can see the timestamp difference when the log from InMemoryFileIndex 
> occurs from 7:45 to 8:54:
> {quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
>  19/09/06 07:44:42 INFO SparkContext: Submitted application: 
> LoglineParquetGenerator
>  ...
>  19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
> StateStoreCoordinator endpoint
>  19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under: [300K files...]
> {quote}
>  
> A major source of the bottleneck comes from 
> DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549]] 
> and do a [FileSystem#exists|#L557]] on all the paths in a single thread. On 
> S3, these are slow network calls.
> After a discussion on the [mailing list|http://test.com], it was suggested 
> that an improvement could be to:
>   
>  * have SparkHadoopUtils differentiate between files returned by 
> globStatus(), and which therefore exist, and those which it didn't glob for 
> -it will only need to check those. 
>  * add parallel execution to the glob and existence checks
>   



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29089) DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading large amount of S3 files

2019-09-15 Thread Arwin S Tio (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arwin S Tio updated SPARK-29089:

Description: 
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
 You can see the timestamp difference when the log from InMemoryFileIndex 
occurs from 7:45 to 8:54:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
 ...
 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
in parallel under: [300K files...]
{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549]] and 
do a [FileSystem#exists|#L557]] on all the paths in a single thread. On S3, 
these are slow network calls.

After a discussion on the [mailing 
list|[http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
 it was suggested that an improvement could be to:
  
 * have SparkHadoopUtils differentiate between files returned by globStatus(), 
and which therefore exist, and those which it didn't glob for -it will only 
need to check those. 
 * add parallel execution to the glob and existence checks
  

  was:
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
 You can see the timestamp difference when the log from InMemoryFileIndex 
occurs from 7:45 to 8:54:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
 ...
 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
in parallel under: [300K files...]
{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549]] and 
do a [FileSystem#exists|#L557]] on all the paths in a single thread. On S3, 
these are slow network calls.

After a discussion on the [mailing list|http://test.com], it was suggested that 
an improvement could be to:
  
 * have SparkHadoopUtils differentiate between files returned by globStatus(), 
and which therefore exist, and those which it didn't glob for -it will only 
need to check those. 
 * add parallel execution to the glob and existence checks
  


> DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when 
> reading large amount of S3 files
> --
>
> Key: SPARK-29089
> URL: https://issues.apache.org/jira/browse/SPARK-29089
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Arwin S Tio
>Priority: Minor
>
> When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
> noticed that it took about an hour for the files to be loaded on the driver.
>  
>  You can see the timestamp difference when the log from InMemoryFileIndex 
> occurs from 7:45 to 8:54:
> {quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
>  19/09/06 07:44:42 INFO SparkContext: Submitted application: 
> LoglineParquetGenerator
>  ...
>  19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
> StateStoreCoordinator endpoint
>  19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under: [300K files...]
> {quote}
>  
> A major source of the bottleneck comes from 
> DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549]] 
> and do a [FileSystem#exists|#L557]] on all the paths in a single thread. On 
> S3, these are slow network calls.
> After a discussion on the [mailing 
> list|[http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
>  it was suggested that an improvement could be to:
>   
>  * have SparkHadoopUtils differentiate between files returned by 
> globStatus(), and which therefore exist, and those which it didn't glob for 
> -it will only need to check those. 
>  * add parallel execution to the glob and existence checks
>   



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-29089) DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading large amount of S3 files

2019-09-15 Thread Arwin S Tio (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arwin S Tio updated SPARK-29089:

Description: 
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
 You can see the timestamp difference when the log from InMemoryFileIndex 
occurs from 7:45 to 8:54:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
 ...
 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
in parallel under: [300K files...]
{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549]] and 
do a [FileSystem#exists|#L557]] on all the paths in a single thread. On S3, 
these are slow network calls.

After a discussion on the [mailing 
list|[http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
 it was suggested that an improvement could be to:
  
 * have SparkHadoopUtils differentiate between files returned by globStatus(), 
and which therefore exist, and those which it didn't glob for -it will only 
need to check those. 
 * add parallel execution to the glob and existence checks
  

  was:
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
 You can see the timestamp difference when the log from InMemoryFileIndex 
occurs from 7:45 to 8:54:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
 ...
 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
in parallel under: [300K files...]
{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549]] and 
do a [FileSystem#exists|#L557]] on all the paths in a single thread. On S3, 
these are slow network calls.

After a discussion on the [mailing 
list|[http://apache-spark-developers-list.1001551.n3.nabble.com|http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
 it was suggested that an improvement could be to:
  
 * have SparkHadoopUtils differentiate between files returned by globStatus(), 
and which therefore exist, and those which it didn't glob for -it will only 
need to check those. 
 * add parallel execution to the glob and existence checks
  


> DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when 
> reading large amount of S3 files
> --
>
> Key: SPARK-29089
> URL: https://issues.apache.org/jira/browse/SPARK-29089
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Arwin S Tio
>Priority: Minor
>
> When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
> noticed that it took about an hour for the files to be loaded on the driver.
>  
>  You can see the timestamp difference when the log from InMemoryFileIndex 
> occurs from 7:45 to 8:54:
> {quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
>  19/09/06 07:44:42 INFO SparkContext: Submitted application: 
> LoglineParquetGenerator
>  ...
>  19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
> StateStoreCoordinator endpoint
>  19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under: [300K files...]
> {quote}
>  
> A major source of the bottleneck comes from 
> DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549]] 
> and do a [FileSystem#exists|#L557]] on all the paths in a single thread. On 
> S3, these are slow network calls.
> After a discussion on the [mailing 
> list|[http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
>  it was suggested that an improvement could be to:
>   
>  * have SparkHadoopUtils differentiate between files returned by 
> globStatus(), and which therefore exist, and those which it didn't glob for 
> -it will only need to check those. 
>  * add parallel execution to the glob and existence checks
>   



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

[jira] [Updated] (SPARK-29089) DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading large amount of S3 files

2019-09-15 Thread Arwin S Tio (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arwin S Tio updated SPARK-29089:

Description: 
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
 You can see the timestamp difference when the log from InMemoryFileIndex 
occurs from 7:45 to 8:54:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
 ...
 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
in parallel under: [300K files...]
{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549]] and 
do a [FileSystem#exists|#L557]] on all the paths in a single thread. On S3, 
these are slow network calls.

After a discussion on the [mailing 
list|[http://apache-spark-developers-list.1001551.n3.nabble.com|http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
 it was suggested that an improvement could be to:
  
 * have SparkHadoopUtils differentiate between files returned by globStatus(), 
and which therefore exist, and those which it didn't glob for -it will only 
need to check those. 
 * add parallel execution to the glob and existence checks
  

  was:
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
 You can see the timestamp difference when the log from InMemoryFileIndex 
occurs from 7:45 to 8:54:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
 ...
 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
in parallel under: [300K files...]
{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549]] and 
do a [FileSystem#exists|#L557]] on all the paths in a single thread. On S3, 
these are slow network calls.

After a discussion on the [mailing 
list|[http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
 it was suggested that an improvement could be to:
  
 * have SparkHadoopUtils differentiate between files returned by globStatus(), 
and which therefore exist, and those which it didn't glob for -it will only 
need to check those. 
 * add parallel execution to the glob and existence checks
  


> DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when 
> reading large amount of S3 files
> --
>
> Key: SPARK-29089
> URL: https://issues.apache.org/jira/browse/SPARK-29089
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Arwin S Tio
>Priority: Minor
>
> When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
> noticed that it took about an hour for the files to be loaded on the driver.
>  
>  You can see the timestamp difference when the log from InMemoryFileIndex 
> occurs from 7:45 to 8:54:
> {quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
>  19/09/06 07:44:42 INFO SparkContext: Submitted application: 
> LoglineParquetGenerator
>  ...
>  19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
> StateStoreCoordinator endpoint
>  19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under: [300K files...]
> {quote}
>  
> A major source of the bottleneck comes from 
> DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549]] 
> and do a [FileSystem#exists|#L557]] on all the paths in a single thread. On 
> S3, these are slow network calls.
> After a discussion on the [mailing 
> list|[http://apache-spark-developers-list.1001551.n3.nabble.com|http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
>  it was suggested that an improvement could be to:
>   
>  * have SparkHadoopUtils differentiate between files returned by 
> globStatus(), and which therefore exist, and those which it didn't glob for 
> -it will only need to check those. 
>  * add parallel execution to the glob and existence checks

[jira] [Updated] (SPARK-29089) DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading large amount of S3 files

2019-09-15 Thread Arwin S Tio (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arwin S Tio updated SPARK-29089:

Description: 
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
 You can see the timestamp difference when the log from InMemoryFileIndex 
occurs from 7:45 to 8:54:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
 ...
 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
in parallel under: [300K files...]
{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549]] and 
do a [FileSystem#exists|#L557]] on all the paths in a single thread. On S3, 
these are slow network calls.

After a discussion on the [mailing 
list|[http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
 it was suggested that an improvement could be to:
  
 * have SparkHadoopUtils differentiate between files returned by globStatus(), 
and which therefore exist, and those which it didn't glob for -it will only 
need to check those. 
 * add parallel execution to the glob and existence checks
  

  was:
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
 You can see the timestamp difference when the log from InMemoryFileIndex 
occurs from 7:45 to 8:54:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
 ...
 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
in parallel under: [300K files...]
{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549]] and 
do a 
[FileSystem#exists|[https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L557]]
 on all the paths in a single thread. On S3, these are slow network calls.

After a discussion on the mailing list, [mailing 
list|[http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
 it was suggested that an improvement could be to:
  
 * have SparkHadoopUtils differentiate between files returned by globStatus(), 
and which therefore exist, and those which it didn't glob for -it will only 
need to check those. 
 * add parallel execution to the glob and existence checks
  


> DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when 
> reading large amount of S3 files
> --
>
> Key: SPARK-29089
> URL: https://issues.apache.org/jira/browse/SPARK-29089
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Arwin S Tio
>Priority: Minor
>
> When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
> noticed that it took about an hour for the files to be loaded on the driver.
>  
>  You can see the timestamp difference when the log from InMemoryFileIndex 
> occurs from 7:45 to 8:54:
> {quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
>  19/09/06 07:44:42 INFO SparkContext: Submitted application: 
> LoglineParquetGenerator
>  ...
>  19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
> StateStoreCoordinator endpoint
>  19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under: [300K files...]
> {quote}
>  
> A major source of the bottleneck comes from 
> DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549]] 
> and do a [FileSystem#exists|#L557]] on all the paths in a single thread. On 
> S3, these are slow network calls.
> After a discussion on the [mailing 
> list|[http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
>  it was suggested that an improvement could be to:
>   
>  * have SparkHadoopUtils differentiate between files returned by 
> globStatus(), and which therefore exist, and those which it didn't glob for 
> -it will only need to check those. 
>  * add parallel execution to the glob and existence checks

[jira] [Updated] (SPARK-29089) DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading large amount of S3 files

2019-09-15 Thread Arwin S Tio (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arwin S Tio updated SPARK-29089:

Description: 
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
 You can see the timestamp difference when the log from InMemoryFileIndex 
occurs from 7:45 to 8:54:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
 19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
 ...
 19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
 19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
in parallel under: [300K files...]
{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549]] and 
do a 
[FileSystem#exists|[https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L557]]
 on all the paths in a single thread. On S3, these are slow network calls.

After a discussion on the mailing list, [mailing 
list|[http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
 it was suggested that an improvement could be to:
  
 * have SparkHadoopUtils differentiate between files returned by globStatus(), 
and which therefore exist, and those which it didn't glob for -it will only 
need to check those. 
 * add parallel execution to the glob and existence checks
  

  was:
When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
noticed that it took about an hour for the files to be loaded on the driver.

 
You can see the timestamp difference when the log from InMemoryFileIndex occurs 
from 7:45 to 8:54:
{quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
19/09/06 07:44:42 INFO SparkContext: Submitted application: 
LoglineParquetGenerator
...
19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories in 
parallel under: [300K files...]{quote}
 

A major source of the bottleneck comes from 
DataSource#checkAndGlobPathIfNecessary, which will [(possibly) 
glob|[https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L549]]
 and do a [[FileSystem#exists||#exists] 
[https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L557]
 []|#exists] on all the paths in a single thread. On S3, these are slow network 
calls.

After a discussion on the mailing list, [mailing 
list|[http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]],
 it was suggested that an improvement could be to:
 
* have SparkHadoopUtils differentiate between files returned by globStatus(), 
and which therefore exist, and those which it didn't glob for -it will only 
need to check those. 
* add parallel execution to the glob and existence checks
 


> DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when 
> reading large amount of S3 files
> --
>
> Key: SPARK-29089
> URL: https://issues.apache.org/jira/browse/SPARK-29089
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Arwin S Tio
>Priority: Minor
>
> When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
> noticed that it took about an hour for the files to be loaded on the driver.
>  
>  You can see the timestamp difference when the log from InMemoryFileIndex 
> occurs from 7:45 to 8:54:
> {quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
>  19/09/06 07:44:42 INFO SparkContext: Submitted application: 
> LoglineParquetGenerator
>  ...
>  19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
> StateStoreCoordinator endpoint
>  19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under: [300K files...]
> {quote}
>  
> A major source of the bottleneck comes from 
> DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549]] 
> and do a 
> [FileSystem#exists|[https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L557]]
>  on all the paths in a single thread. On S3, these are slow network calls.
> After a