Re: Exposing Spark parallelized directory listing & non-locality listing in core

2020-07-23 Thread Holden Karau
Awesome, that sounds great :)

On Thu, Jul 23, 2020 at 3:43 AM Steve Loughran  wrote:

>
>
> On Wed, 22 Jul 2020 at 18:50, Holden Karau  wrote:
>
>> Wonderful. To be clear the patch is more to start the discussion about
>> how we want to do it and less what I think is the right way.
>>
>>
> I'd be happy to give a quick online tour of ongoing work on S3A enhancements
> some time next week, and get feedback.
>
>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Exposing Spark parallelized directory listing & non-locality listing in core

2020-07-23 Thread Steve Loughran
On Wed, 22 Jul 2020 at 18:50, Holden Karau  wrote:

> Wonderful. To be clear the patch is more to start the discussion about how
> we want to do it and less what I think is the right way.
>
>
I'd be happy to give a quick online tour of ongoing work on S3A enhancements
some time next week, and get feedback.


Re: Exposing Spark parallelized directory listing & non-locality listing in core

2020-07-22 Thread Felix Cheung
+1


From: Holden Karau 
Sent: Wednesday, July 22, 2020 10:49:49 AM
To: Steve Loughran 
Cc: dev 
Subject: Re: Exposing Spark parallelized directory listing & non-locality 
listing in core

Wonderful. To be clear the patch is more to start the discussion about how we 
want to do it and less what I think is the right way.

On Wed, Jul 22, 2020 at 10:47 AM Steve Loughran <ste...@cloudera.com> wrote:


On Wed, 22 Jul 2020 at 00:51, Holden Karau <hol...@pigscanfly.ca> wrote:
Hi Folks,

In Spark SQL there is the ability to have Spark do its partition
discovery/file listing in parallel on the worker nodes and also avoid locality 
lookups. I'd like to expose this in core, but given the Hadoop APIs it's a bit 
more complicated to do right. I

That's ultimately fixable, if we can sort out what's good from the app side and
reconcile that with "what is not pathologically bad across both HDFS and object
stores".

Bad: globStatus, and anything else which returns an array rather than a remote
iterator, as that encourages treewalks.
Good: deep recursive listings, and remote iterator results, which allow
incremental/async fetch of the next page of a listing. Soon: an option for the
iterator, if cast to IOStatisticsSource, to actually serve up stats on IO
performance during the listing (e.g. number of list calls, mean time to get a
list response back, store throttle events).

Also look at LocatedFileStatus to see how it parallelises its work. It's not
perfect because wildcards are supported, which means globStatus gets used.

happy to talk about this some more, and I'll review the patch

-steve

made a quick POC and two potential different paths we could do for 
implementation and wanted to see if anyone had thoughts - 
https://github.com/apache/spark/pull/29179.

Cheers,

Holden

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Exposing Spark parallelized directory listing & non-locality listing in core

2020-07-22 Thread Holden Karau
Wonderful. To be clear the patch is more to start the discussion about how
we want to do it and less what I think is the right way.

On Wed, Jul 22, 2020 at 10:47 AM Steve Loughran  wrote:

>
>
> On Wed, 22 Jul 2020 at 00:51, Holden Karau  wrote:
>
>> Hi Folks,
>>
>> In Spark SQL there is the ability to have Spark do its partition
>> discovery/file listing in parallel on the worker nodes and also avoid
>> locality lookups. I'd like to expose this in core, but given the Hadoop
>> APIs it's a bit more complicated to do right. I
>>
>
> That's ultimately fixable, if we can sort out what's good from the app
> side and reconcile that with "what is not pathologically bad across both
> HDFS and object stores".
>
> Bad: globStatus, and anything else which returns an array rather than a
> remote iterator, as that encourages treewalks.
> Good: deep recursive listings, and remote iterator results, which allow
> incremental/async fetch of the next page of a listing. Soon: an option for
> the iterator, if cast to IOStatisticsSource, to actually serve up stats on
> IO performance during the listing (e.g. number of list calls, mean time to
> get a list response back, store throttle events).
>
> Also look at LocatedFileStatus to see how it parallelises its work. It's
> not perfect because wildcards are supported, which means globStatus gets
> used.
>
> happy to talk about this some more, and I'll review the patch
>
> -steve
>
>
>> made a quick POC and two potential different paths we could do for
>> implementation and wanted to see if anyone had thoughts -
>> https://github.com/apache/spark/pull/29179.
>>
>> Cheers,
>>
>> Holden
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Exposing Spark parallelized directory listing & non-locality listing in core

2020-07-22 Thread Steve Loughran
On Wed, 22 Jul 2020 at 00:51, Holden Karau  wrote:

> Hi Folks,
>
> In Spark SQL there is the ability to have Spark do its partition
> discovery/file listing in parallel on the worker nodes and also avoid
> locality lookups. I'd like to expose this in core, but given the Hadoop
> APIs it's a bit more complicated to do right. I
>

That's ultimately fixable, if we can sort out what's good from the app side
and reconcile that with "what is not pathologically bad across both HDFS
and object stores".

Bad: globStatus, and anything else which returns an array rather than a
remote iterator, as that encourages treewalks.
Good: deep recursive listings, and remote iterator results, which allow
incremental/async fetch of the next page of a listing. Soon: an option for
the iterator, if cast to IOStatisticsSource, to actually serve up stats on
IO performance during the listing (e.g. number of list calls, mean time to
get a list response back, store throttle events).
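
To make the array-versus-iterator distinction concrete, here is a rough
sketch. It is a local-filesystem analogy in Python, not the Hadoop API: in
Hadoop the corresponding calls are FileSystem.globStatus, which returns a
FileStatus[] array, and FileSystem.listFiles(path, recursive), which returns
a RemoteIterator<LocatedFileStatus>. The function names below are
hypothetical and exist only for illustration.

```python
import os
from typing import Iterator, List


def list_tree_as_array(root: str) -> List[str]:
    """'Array' style: walk the whole tree and return only once it is complete."""
    results = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            results.append(os.path.join(dirpath, name))
    return results


def list_tree_as_iterator(root: str) -> Iterator[str]:
    """'Iterator' style: a deep recursive listing that yields files as found.

    The caller can start processing the first entries while later
    directories have not been read yet.
    """
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    yield entry.path
```

Both functions find the same files. The difference is that the iterator
version hands results back incrementally, which is what lets a store fetch
the next page of a listing while the caller is still consuming the previous
one.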

Also look at LocatedFileStatus to see how it parallelises its work. It's not
perfect because wildcards are supported, which means globStatus gets used.

happy to talk about this some more, and I'll review the patch

-steve


> made a quick POC and two potential different paths we could do for
> implementation and wanted to see if anyone had thoughts -
> https://github.com/apache/spark/pull/29179.
>
> Cheers,
>
> Holden
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Exposing Spark parallelized directory listing & non-locality listing in core

2020-07-21 Thread Holden Karau
Hi Folks,

In Spark SQL there is the ability to have Spark do its partition
discovery/file listing in parallel on the worker nodes and also avoid
locality lookups. I'd like to expose this in core, but given the Hadoop
APIs it's a bit more complicated to do right. I made a quick POC with two
different potential paths we could take for the implementation and wanted to
see if anyone had thoughts - https://github.com/apache/spark/pull/29179.
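
For readers unfamiliar with the idea, the following is a minimal sketch of
parallel directory listing. It is not the code in the pull request and does
not use Spark or Hadoop: a Python thread pool over local directories stands
in for listing tasks running on worker nodes, and the names list_one_dir and
parallel_list are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def list_one_dir(path: str) -> Tuple[List[str], List[str]]:
    """List a single directory (non-recursively): returns (files, subdirs)."""
    files, subdirs = [], []
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                subdirs.append(entry.path)
            elif entry.is_file(follow_symlinks=False):
                files.append(entry.path)
    return files, subdirs


def parallel_list(roots: List[str], max_workers: int = 4) -> List[str]:
    """List all files under the given roots, one level of the tree at a time.

    Every directory in the current level is listed by a separate pool task;
    the subdirectories discovered become the next level.
    """
    all_files: List[str] = []
    pending = list(roots)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending:
            results = list(pool.map(list_one_dir, pending))
            pending = []
            for files, subdirs in results:
                all_files.extend(files)
                pending.extend(subdirs)
    return sorted(all_files)
```

Only file paths are collected here; no block-location (locality) information
is requested, which loosely mirrors the "avoid locality lookups" part of the
proposal.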

Cheers,

Holden

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau