Re: DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading S3 files

2019-12-09 Thread Arwin Tio
On Fri, Sep 6, 2019 at 10:56 PM Arwin Tio <arwin@hotmail.com> wrote: > I think the problem is calling globStatus to expand all 300K files. …

Re: DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading S3 files

2019-09-23 Thread Arwin Tio
On Fri, Sep 6, 2019 at 10:56 PM Arwin Tio <arwin@hotmail.com> wrote: > I think the problem is calling globStatus to expand all 300K files. …
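
For reference, a hedged illustration of the Hadoop call named here (the bucket and pattern below are placeholders, not from the thread): FileSystem#globStatus expands a pattern into concrete FileStatus entries, and against S3A each expansion can translate into multiple LIST/HEAD requests, so running it once per input path on the driver scales badly with 300K paths:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

val conf = new Configuration()
val fs = FileSystem.get(new URI("s3a://my-bucket/"), conf)

// globStatus returns null when the pattern matches nothing, hence the Option.
val statuses: Array[FileStatus] =
  Option(fs.globStatus(new Path("s3a://my-bucket/data/*.csv"))).getOrElse(Array.empty)
statuses.foreach(s => println(s.getPath))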

Re: DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading S3 files

2019-09-07 Thread Steve Loughran
> …bs there as Hadoop JIRAs & PRs > Thanks, > Arwin …

Re: DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading S3 files

2019-09-06 Thread Arwin Tio
On Fri, Sep 6, 2019 at 2:50 PM Sean Owen <sro...@gmail.com> wrote: > I think the problem is calling globStatus to expand all 300K files. …

Re: DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading S3 files

2019-09-06 Thread Steve Loughran
On Fri, Sep 6, 2019 at 2:50 PM Sean Owen wrote: > I think the problem is calling globStatus to expand all 300K files. > This is a general problem for object stores and huge numbers of files. > Steve L. may have better thoughts on real solutions. But you might > consider, if possible, running a lot of .csv jobs in parallel to query subsets of all the files …

Re: DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading S3 files

2019-09-06 Thread Sean Owen
I think the problem is calling globStatus to expand all 300K files. This is a general problem for object stores and huge numbers of files. Steve L. may have better thoughts on real solutions. But you might consider, if possible, running a lot of .csv jobs in parallel to query subsets of all the files …
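
A rough sketch of that workaround, assuming Scala 2.12 (as shipped with Spark 2.4) and a pre-computed list of concrete S3 paths; allPaths and the chunk size are assumptions, not from the thread. Splitting the paths into chunks means each spark.read.csv call lists and validates only its own subset, and the driver-side glob/exists work overlaps instead of running serially:

import org.apache.spark.sql.{DataFrame, SparkSession}

// Read the input in chunks, each chunk as its own CSV read, then union.
def readInChunks(spark: SparkSession, allPaths: Seq[String], chunkSize: Int = 10000): DataFrame = {
  val chunks = allPaths.grouped(chunkSize).toSeq
  val frames = chunks.par.map(chunk => spark.read.csv(chunk: _*)).seq // .par: Scala 2.12 parallel collections
  frames.reduce(_ union _) // assumes all files share one schema
}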

DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading S3 files

2019-09-06 Thread Arwin Tio
Hello, On Spark 2.4.4, I am using DataFrameReader#csv to read about 300K files on S3, and I've noticed that it takes about an hour for the Driver to load the file listing. You can see the timestamp gap where the InMemoryFileIndex log lines run from 7:45 to 8:54: 19/09/06 07:44:42 INFO S…
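
For context, a minimal sketch of the pattern described above (the loadPathList helper and the path names are hypothetical, not from the thread). Each path handed to DataFrameReader#csv is resolved on the driver by DataSource#checkAndGlobPathIfNecessary before any job starts, which lines up with the hour-long gap in the log:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-listing-sketch").getOrCreate()

// ~300K fully qualified object paths, e.g. "s3a://bucket/data/part-00000.csv".
// loadPathList() is a hypothetical helper standing in for however the list is built.
val paths: Seq[String] = loadPathList()

val df = spark.read
  .option("header", "true")
  .csv(paths: _*) // the driver globs/validates every path and builds the InMemoryFileIndex here

// When the files share one prefix, passing the directory (or a single glob)
// lets the listing happen in bulk instead of once per path:
val df2 = spark.read.option("header", "true").csv("s3a://bucket/data/")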