Re: Why does SortShuffleWriter write to disk always?

2015-05-03 Thread Pramod Biligiri

Thanks for the info. I agree, it makes sense the way it is designed.

Pramod


Why does SortShuffleWriter write to disk always?

2015-05-02 Thread Pramod Biligiri

Hi,
I was trying to see if I can make Spark avoid hitting the disk for small
jobs, but I see that the SortShuffleWriter.write() always writes to disk. I
found an older thread (
http://apache-spark-user-list.1001560.n3.nabble.com/How-does-shuffle-work-in-spark-td584.html)
saying that it doesn't call fsync on this write path.

My question is why does it always write to disk?
Does it mean the reduce phase reads the result from the disk as well?
Isn't it possible to read the data from map/buffer in ExternalSorter
directly during the reduce phase?

Thanks,
Pramod
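
To make the "writes to disk but doesn't call fsync" point concrete, here is a minimal sketch in Python (not Spark's own Scala/Java code). The helper names are hypothetical. It only illustrates that a plain write is handed to the operating system, which decides when the bytes reach the device, while an immediate read of the same file still sees the data.

```python
import os
import tempfile


def write_without_fsync(path: str, payload: bytes) -> None:
    """Write and close the file; the OS decides when to flush it to the device."""
    with open(path, "wb") as f:
        f.write(payload)
        # Deliberately no os.fsync() here: closing the file hands the data to
        # the OS but does not force it onto stable storage.


def write_with_fsync(path: str, payload: bytes) -> None:
    """Same write, but ask the OS to push the data to stable storage."""
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()              # drain Python's user-space buffer
        os.fsync(f.fileno())   # request a flush to the storage device


def read_back(path: str) -> bytes:
    with open(path, "rb") as f:
        return f.read()


def demo() -> bool:
    """Write a small payload without fsync and check it can be read right away."""
    payload = b"small shuffle output"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "shuffle.data")  # hypothetical file name
        write_without_fsync(path, payload)
        return read_back(path) == payload
```

Both helpers produce a file with the same contents; the difference is only in the durability guarantee requested, which a short-lived shuffle file does not obviously need.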

Re: Why does SortShuffleWriter write to disk always?

2015-05-02 Thread Reynold Xin

I've personally prototyped a completely in-memory shuffle for Spark three
times. However, it is unclear how big a gain it would be to put all of this
data in memory under newer file systems (ext4, xfs). If the shuffle data is
small, it is still in the file system buffer cache anyway. Note that network
throughput is often lower than disk throughput, so it won't be a problem to
read it from disk. And not having to keep all of this in memory
substantially simplifies memory management.
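
For context on "read it from disk": sort-based shuffle leaves each map task's output as a single data file plus an index of per-partition offsets, and a reducer's request is served by reading its byte range from that file. The following is a simplified Python sketch of that layout, not Spark's implementation; the function names and file layout details are hypothetical.

```python
import struct


def write_map_output(data_path: str, index_path: str, partitions: list) -> None:
    """Write one chunk of bytes per reduce partition, plus an offset index.

    The index holds len(partitions) + 1 big-endian 64-bit offsets, so
    partition i occupies bytes [offsets[i], offsets[i + 1]) of the data file.
    """
    offsets = [0]
    with open(data_path, "wb") as data:
        for chunk in partitions:
            data.write(chunk)
            offsets.append(offsets[-1] + len(chunk))
    with open(index_path, "wb") as index:
        index.write(struct.pack(f">{len(offsets)}q", *offsets))


def read_partition(data_path: str, index_path: str, reduce_id: int) -> bytes:
    """Return only the bytes that belong to one reduce partition."""
    with open(index_path, "rb") as index:
        index.seek(8 * reduce_id)  # each offset is 8 bytes
        start, end = struct.unpack(">2q", index.read(16))
    with open(data_path, "rb") as data:
        data.seek(start)
        return data.read(end - start)
```

If the map output was written recently and is small, a read like this would typically be served from the file system buffer cache rather than the physical disk, which is the point made above.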




Re: Why does SortShuffleWriter write to disk always?

2015-05-02 Thread Mridul Muralidharan

I agree, this is better handled by the filesystem cache, not to
mention being able to do zero-copy writes.

Regards,
Mridul
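
One common reading of "zero copy" in this context is letting the kernel move file bytes straight to a socket instead of copying them through user-space buffers. A minimal Python sketch of that idea follows, assuming a local socket pair stands in for the network connection; the function names are hypothetical and this is not Spark's code. `socket.sendfile` uses `os.sendfile` where the platform supports it and falls back to ordinary `send()` calls otherwise.

```python
import os
import socket
import tempfile


def send_file_zero_copy(path: str, sock: socket.socket) -> int:
    """Send the whole file over sock and return the number of bytes sent."""
    with open(path, "rb") as f:
        # Where os.sendfile is available, the kernel transfers the data from
        # the page cache to the socket without a user-space copy.
        return sock.sendfile(f)


def demo():
    """Send a small temporary file across a local socket pair."""
    payload = os.urandom(4096)  # small enough to fit in the socket buffer
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "shuffle.data")  # hypothetical file name
        with open(path, "wb") as f:
            f.write(payload)

        sender, receiver = socket.socketpair()
        with sender, receiver:
            sent = send_file_zero_copy(path, sender)
            sender.shutdown(socket.SHUT_WR)  # signal end of stream to the reader

            received = b""
            while True:
                chunk = receiver.recv(65536)
                if not chunk:
                    break
                received += chunk
    return sent, payload, received
```

The sending side never holds the file contents in a Python buffer; only the receiving side of the demo reads the bytes into memory, in order to check them.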



