Re: Spark-optimized Shuffle (SOS) any update?

2019-01-02 Thread marek-simunek

Hi, thanks for the reply. I finally found time to skim through the design doc.

It seems to have nothing to do with the paper I mentioned. The paper addresses
the problem that the number of I/O operations required for shuffle grows
quadratically with the number of tasks (shuffle files), so we need to
keep the number of tasks low.
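
A back-of-the-envelope sketch of that quadratic growth. It assumes every reduce task fetches one block from every map output, that map and reduce task counts are equal, and that the task count scales linearly with data size at a fixed task size. The function names and the 1 TiB figure are made up for illustration and are not taken from the paper or the talk.

```python
def shuffle_fetches(num_map_tasks: int, num_reduce_tasks: int) -> int:
    """Shuffle block fetches if every reducer reads one block from
    every map output (a simplifying assumption)."""
    return num_map_tasks * num_reduce_tasks


def tasks_for(data_bytes: int, task_bytes: int) -> int:
    """Tasks needed to cover data_bytes at task_bytes per task (ceiling)."""
    return -(-data_bytes // task_bytes)


MIB = 1024 ** 2
GIB = 1024 ** 3
data = 1024 * GIB  # hypothetical 1 TiB shuffle

for task_size in (256 * MIB, 2 * GIB):
    n = tasks_for(data, task_size)
    # Assumption: same number of map and reduce tasks.
    print(f"task size {task_size // MIB} MiB: {n} tasks, "
          f"{shuffle_fetches(n, n)} fetches")
```

Under these assumptions, an 8x larger task size means 8x fewer tasks and 64x fewer fetches, which is why a large task size helps as a workaround.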

Am I missing something?

"

Recently, the community has been actively working on this. The JIRA to
follow is:
https://issues.apache.org/jira/browse/SPARK-25299. Several
companies, including Bloomberg and Palantir, are working on a WIP
solution that implements a variant of Option #5 (which is elaborated
upon in the Google doc linked in the JIRA summary).

On Wed, Dec 19, 2018 at 5:20 AM <marek-simu...@seznam.cz> wrote:

"

Hi everyone,

we are facing the same problem Facebook had, where the shuffle service is a
bottleneck. For now we have worked around it with a large task size (2g) to
reduce shuffle I/O.

I saw a very nice presentation by Brian Cho on optimizing shuffle I/O at
large scale[1]. It is an implementation of a white paper[2].
At the end of the talk, Brian Cho kindly mentioned plans to
contribute it back to Spark[3]. I checked the mailing list and the Spark JIRA
and didn't find any ticket on this topic.

Does anyone have a contact at Facebook who might know
more about this? Or are there any plans to bring a similar optimization to
Spark?

[1] https://databricks.com/session/sos-optimizing-shuffle-i-o
[2] https://haoyuzhang.org/publications/riffle-eurosys18.pdf
[3] https://image.slidesharecdn.com/5brianchoerginseyfe-180613004126/95/sos-optimizing-shuffle-io-with-brian-cho-and-ergin-seyfe-30-638.jpg?cb=1528850545

"

"

Re: Spark-optimized Shuffle (SOS) any update?

2018-12-19 Thread Ilan Filonenko
Recently, the community has been actively working on this. The JIRA to
follow is:
https://issues.apache.org/jira/browse/SPARK-25299. Several
companies, including Bloomberg and Palantir, are working on a WIP
solution that implements a variant of Option #5 (which is elaborated
upon in the Google doc linked in the JIRA summary).

On Wed, Dec 19, 2018 at 5:20 AM  wrote:

> Hi everyone,
> we are facing the same problem Facebook had, where the shuffle service is
> a bottleneck. For now we have worked around it with a large task size (2g)
> to reduce shuffle I/O.
>
> I saw a very nice presentation by Brian Cho on optimizing shuffle I/O at
> large scale[1]. It is an implementation of a white paper[2].
> At the end of the talk, Brian Cho kindly mentioned plans to
> contribute it back to Spark[3]. I checked the mailing list and the Spark
> JIRA and didn't find any ticket on this topic.
>
> Does anyone have a contact at Facebook who might know
> more about this? Or are there any plans to bring a similar optimization to
> Spark?
>
> [1] https://databricks.com/session/sos-optimizing-shuffle-i-o
> [2] https://haoyuzhang.org/publications/riffle-eurosys18.pdf
> [3] https://image.slidesharecdn.com/5brianchoerginseyfe-180613004126/95/sos-optimizing-shuffle-io-with-brian-cho-and-ergin-seyfe-30-638.jpg?cb=1528850545
>


Spark-optimized Shuffle (SOS) any update?

2018-12-19 Thread marek-simunek

Hi everyone,

we are facing the same problem Facebook had, where the shuffle service is a
bottleneck. For now we have worked around it with a large task size (2g) to
reduce shuffle I/O.

I saw a very nice presentation by Brian Cho on optimizing shuffle I/O at
large scale[1]. It is an implementation of a white paper[2].
At the end of the talk, Brian Cho kindly mentioned plans to
contribute it back to Spark[3]. I checked the mailing list and the Spark JIRA
and didn't find any ticket on this topic.

Does anyone have a contact at Facebook who might know
more about this? Or are there any plans to bring a similar optimization to
Spark?

[1] https://databricks.com/session/sos-optimizing-shuffle-i-o
[2] https://haoyuzhang.org/publications/riffle-eurosys18.pdf
[3] https://image.slidesharecdn.com/5brianchoerginseyfe-180613004126/95/sos-optimizing-shuffle-io-with-brian-cho-and-ergin-seyfe-30-638.jpg?cb=1528850545