Re: DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading S3 files

2019-12-09 Thread Arwin Tio
On Fri, Sep 6, 2019 at 10:56 PM Arwin Tio <arwin@hotmail.com> wrote: > I think the problem is calling globStatus to expand all 300K files. …

Re: DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading S3 files

2019-09-23 Thread Arwin Tio
On Fri, Sep 6, 2019 at 10:56 PM Arwin Tio <arwin@hotmail.com> wrote: > I think the problem is calling globStatus to expand all 300K files. …
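
For reference, a hedged illustration of the Hadoop call named here (the bucket and pattern below are placeholders, not from the thread): FileSystem#globStatus expands a pattern into concrete FileStatus entries, and against S3A each expansion can translate into multiple LIST/HEAD requests, so running it once per input path on the driver scales badly with 300K paths:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

val conf = new Configuration()
val fs = FileSystem.get(new URI("s3a://my-bucket/"), conf)

// globStatus returns null when the pattern matches nothing, hence the Option.
val statuses: Array[FileStatus] =
  Option(fs.globStatus(new Path("s3a://my-bucket/data/*.csv"))).getOrElse(Array.empty)
statuses.foreach(s => println(s.getPath))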

Re: DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading S3 files

2019-09-07 Thread Steve Loughran
> …bs there as Hadoop JIRAs & PRs > Thanks, > Arwin …

Re: DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading S3 files

2019-09-06 Thread Arwin Tio
On Fri, Sep 6, 2019 at 2:50 PM Sean Owen <sro...@gmail.com> wrote: > I think the problem is calling globStatus to expand all 300K files. …

Re: DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading S3 files

2019-09-06 Thread Steve Loughran
On Fri, Sep 6, 2019 at 2:50 PM Sean Owen wrote: > I think the problem is calling globStatus to expand all 300K files. > This is a general problem for object stores and huge numbers of files. > Steve L. may have better thoughts on real solutions. But you might > consider, if possible, running a lot of .csv jobs in parallel to query subsets of all the files …

Re: DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading S3 files

2019-09-06 Thread Sean Owen
I think the problem is calling globStatus to expand all 300K files. This is a general problem for object stores and huge numbers of files. Steve L. may have better thoughts on real solutions. But you might consider, if possible, running a lot of .csv jobs in parallel to query subsets of all the files …
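
A rough sketch of that workaround, assuming Scala 2.12 (as shipped with Spark 2.4) and a pre-computed list of concrete S3 paths; allPaths and the chunk size are assumptions, not from the thread. Splitting the paths into chunks means each spark.read.csv call lists and validates only its own subset, and the driver-side glob/exists work overlaps instead of running serially:

import org.apache.spark.sql.{DataFrame, SparkSession}

// Read the input in chunks, each chunk as its own CSV read, then union.
def readInChunks(spark: SparkSession, allPaths: Seq[String], chunkSize: Int = 10000): DataFrame = {
  val chunks = allPaths.grouped(chunkSize).toSeq
  val frames = chunks.par.map(chunk => spark.read.csv(chunk: _*)).seq // .par: Scala 2.12 parallel collections
  frames.reduce(_ union _) // assumes all files share one schema
}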

DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading S3 files

2019-09-06 Thread Arwin Tio
Hello, On Spark 2.4.4, I am using DataFrameReader#csv to read about 300K files on S3, and I've noticed that it takes about an hour for the Driver to load the file listing. You can see the timestamp gap where the InMemoryFileIndex log lines run from 7:45 to 8:54: 19/09/06 07:44:42 INFO S…
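
For context, a minimal sketch of the pattern described above (the loadPathList helper and the path names are hypothetical, not from the thread). Each path handed to DataFrameReader#csv is resolved on the driver by DataSource#checkAndGlobPathIfNecessary before any job starts, which lines up with the hour-long gap in the log:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-listing-sketch").getOrCreate()

// ~300K fully qualified object paths, e.g. "s3a://bucket/data/part-00000.csv".
// loadPathList() is a hypothetical helper standing in for however the list is built.
val paths: Seq[String] = loadPathList()

val df = spark.read
  .option("header", "true")
  .csv(paths: _*) // the driver globs/validates every path and builds the InMemoryFileIndex here

// When the files share one prefix, passing the directory (or a single glob)
// lets the listing happen in bulk instead of once per path:
val df2 = spark.read.option("header", "true").csv("s3a://bucket/data/")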