Sent: September 7, 2019 9:22 AM
To: Arwin Tio
Cc: Sean Owen; dev@spark.apache.org
Subject: Re: DataFrameReader bottleneck in
DataSource#checkAndGlobPathIfNecessary when reading S3 files
On Fri, Sep 6, 2019 at 10:56 PM Arwin Tio <arwin@hotmail.com> wrote:
> ...bs there as Hadoop JIRAs & PRs
>
> Thanks,
>
> Arwin
> --
> *From:* Steve Loughran
> *Sent:* September 6, 2019 4:15 PM
> *To:* Sean Owen
> *Cc:* Arwin Tio; dev@spark.apache.org
> *Subject:* Re: DataFrameReader bottleneck in
> DataSource#checkAndGlobPathIfNecessary when reading S3 files
Sent: September 6, 2019 4:15 PM
To: Sean Owen
Cc: Arwin Tio; dev@spark.apache.org
Subject: Re: DataFrameReader bottleneck in
DataSource#checkAndGlobPathIfNecessary when reading S3 files
On Fri, Sep 6, 2019 at 2:50 PM Sean Owen <sro...@gmail.com> wrote:
> I think the problem is calling globStatus to expand all 300K files.
> This is a general problem for object stores and huge numbers of files.
> Steve L. may have better thoughts on real solutions. But you might
> consider, if possible, running a lot of .csv jobs in parallel to query
> subsets of all the files...
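
One way that suggestion could look in practice; a rough sketch only, assuming the full list of object paths is already known up front (the path list, batch size, and variable names below are illustrative, not from the thread):

  import org.apache.spark.sql.{DataFrame, SparkSession}

  val spark = SparkSession.builder().getOrCreate()

  // Full list of S3 object paths, built elsewhere (placeholder here).
  val allPaths: Seq[String] = Seq("s3a://bucket/prefix/part-00000.csv" /* , ... */)

  // Split the paths into batches and issue one csv() read per batch in
  // parallel, so no single read has to expand every path before anything
  // else can start; the per-batch listings then run concurrently.
  val frames: Seq[DataFrame] =
    allPaths.grouped(10000).toSeq.par
      .map(batch => spark.read.csv(batch: _*))
      .seq

  // Combine the per-batch DataFrames back into one.
  val combined = frames.reduce(_ union _)

The batch size here only trades off how many concurrent listings the driver issues at once.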
Hello,
On Spark 2.4.4, I am using DataFrameReader#csv to read about 300,000 files on
S3, and I've noticed that it takes about an hour to load the data on the
Driver. You can see the gap in the timestamps of the InMemoryFileIndex log
lines, from 7:45 to 8:54:
19/09/06 07:44:42 INFO S
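
For reference, the call pattern that hits this code path is simply passing the whole path list to DataFrameReader#csv; a minimal sketch, with placeholder paths and options rather than the actual job:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().getOrCreate()

  // ~300K S3 object paths collected beforehand (placeholder here).
  val paths: Seq[String] = Seq("s3a://bucket/prefix/part-00000.csv" /* , ... */)

  // Before any data is read, every path is checked/globbed on the driver
  // (DataSource#checkAndGlobPathIfNecessary), which is where the hour goes.
  val df = spark.read
    .option("header", "true")
    .csv(paths: _*)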