os buffer cache does not cache shuffle output file

2014-05-15 Thread wxhsdp
Hi,
  Patrick said: "The intermediate shuffle output gets written to disk, but it
  often hits the OS-buffer cache since it's not explicitly fsync'ed, so in many
  cases it stays entirely in memory. The behavior of the shuffle is agnostic to
  whether the base RDD is in cache or in disk."

  I ran a test with one groupBy action and found that the intermediate shuffle
  files are written to disk even with sufficient free memory. The shuffle size
  is about 500MB and there is 1.5GB of free memory, yet disk usage increases by
  about 500MB during the process.

  Here is the vmstat log. The cache column increases when reading from disk,
  but the buff column is unchanged, so it looks like the data written to disk
  is not buffered:

procs -----------memory----------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0  10256 1616852   6664 557344    0    0     0 51380  972 2852 88  7  0  5
 1  0  10256 1592636   6664 580676    0    0     0     0  949 3777 91  9  0  0
 1  0  10256 1568228   6672 604016    0    0     0   576  923 3640 94  6  0  0
 2  0  10256 1545836   6672 627348    0    0     0     0  893 3261 95  5  0  0
 1  0  10256 1521552   6672 650668    0    0     0     0  884 3401 89 11  0  0
 2  0  10256 1497144   6672 674012    0    0     0     0  911 3275 91  9  0  0
 1  0  10256 1469260   6676 700728    0    0     4 60668 1044 3366 85 15  0  0
 1  0  10256 1453076   6684 702464    0    0     0   924  853 2596 97  3  0  0

  Is the buffer cache in write-through mode? Is there something I need to
  configure? My OS is Ubuntu 13.10, 64-bit.
  Thanks!
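
As a rough check on the log above, the numbers can be tallied with a short script. This is only a sketch: the rows are transcribed from the vmstat output in this message, assuming the standard vmstat column order (the cache, si, and so digits run together in the archived text), and parse_vmstat and summarize are hypothetical helper names, not part of any tool.

```python
# Standard vmstat column order (assumed; the archived header is garbled).
COLUMNS = ["r", "b", "swpd", "free", "buff", "cache", "si", "so",
           "bi", "bo", "in", "cs", "us", "sy", "id", "wa"]

# Rows transcribed from the log in this message, with the fused
# cache/si/so digits split according to COLUMNS (an assumption).
LOG = """\
2 0 10256 1616852 6664 557344 0 0 0 51380 972 2852 88 7 0 5
1 0 10256 1592636 6664 580676 0 0 0 0 949 3777 91 9 0 0
1 0 10256 1568228 6672 604016 0 0 0 576 923 3640 94 6 0 0
2 0 10256 1545836 6672 627348 0 0 0 0 893 3261 95 5 0 0
1 0 10256 1521552 6672 650668 0 0 0 0 884 3401 89 11 0 0
2 0 10256 1497144 6672 674012 0 0 0 0 911 3275 91 9 0 0
1 0 10256 1469260 6676 700728 0 0 4 60668 1044 3366 85 15 0 0
1 0 10256 1453076 6684 702464 0 0 0 924 853 2596 97 3 0 0
"""


def parse_vmstat(log):
    """Turn whitespace-separated vmstat rows into a list of dicts."""
    rows = []
    for line in log.splitlines():
        fields = line.split()
        # Skip header lines and anything that is not a full numeric row.
        if len(fields) != len(COLUMNS) or not all(f.isdigit() for f in fields):
            continue
        rows.append(dict(zip(COLUMNS, map(int, fields))))
    return rows


def summarize(rows):
    """Compare the first and last samples and total the blocks written."""
    first, last = rows[0], rows[-1]
    return {
        "cache_growth_kb": last["cache"] - first["cache"],
        "buff_growth_kb": last["buff"] - first["buff"],
        "free_drop_kb": first["free"] - last["free"],
        "blocks_out": sum(r["bo"] for r in rows),
    }


if __name__ == "__main__":
    print(summarize(parse_vmstat(LOG)))
```

Over these eight samples the cache column grows by 145120 (in the units vmstat reports) while buff grows by only 20.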




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/os-buffer-cache-does-not-cache-shuffle-output-file-tp5478.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: os buffer cache does not cache shuffle output file

2014-05-10 Thread Aaron Davidson
Seems the mailing list was broken when you sent your original question, so
I appended it to the end of this message.

"Buffers" is relatively unimportant in today's Linux kernel; "cache" is
used for both writing and reading [1].
What you are seeing seems to be the expected behavior: the data is written
to the page cache (increasing its size),
and also written out asynchronously to the disk. As long as there's room in
the page cache, the write should not
block on IO.

[1] 
http://stackoverflow.com/questions/6345020/linux-memory-buffer-vs-cache
(contains better citations)
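
To make the point about fsync concrete, here is a minimal Python sketch, not anything taken from Spark itself. It writes the same payload twice, once leaving the data in the page cache and once forcing it to disk with os.fsync, and then reads the Cached and Dirty fields from /proc/meminfo on Linux. The helper names timed_write and meminfo_kb and the 16 MiB payload size are assumptions made for illustration. Timings depend heavily on the filesystem (a tmpfs temp directory will show little difference), so treat the output as indicative only.

```python
import os
import tempfile
import time


def timed_write(path, payload, sync):
    """Write payload to path and return the elapsed seconds.

    Without sync, write() normally returns once the data is in the kernel
    page cache; the kernel writes it back to disk later on its own.
    """
    start = time.perf_counter()
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()                 # drain Python's user-space buffer
        if sync:
            os.fsync(f.fileno())  # wait for the kernel to flush to the device
    return time.perf_counter() - start


def meminfo_kb(text, key):
    """Extract one field (in kB) from /proc/meminfo-style text."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name == key:
            return int(rest.split()[0])
    raise KeyError(key)


if __name__ == "__main__":
    payload = os.urandom(16 * 1024 * 1024)  # 16 MiB, an arbitrary size
    with tempfile.TemporaryDirectory() as d:
        t_cached = timed_write(os.path.join(d, "cached.bin"), payload, sync=False)
        t_synced = timed_write(os.path.join(d, "synced.bin"), payload, sync=True)
        print(f"no fsync: {t_cached:.3f}s, fsync: {t_synced:.3f}s")
    if os.path.exists("/proc/meminfo"):     # Linux only
        with open("/proc/meminfo") as f:
            text = f.read()
        print("Cached kB:", meminfo_kb(text, "Cached"),
              "Dirty kB:", meminfo_kb(text, "Dirty"))
```

In both cases the file ends up on disk; the difference is only whether the writer waits for that to happen.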

"""
Hi,
  Patrick said: "The intermediate shuffle output gets written to disk, but it
  often hits the OS-buffer cache since it's not explicitly fsync'ed, so in many
  cases it stays entirely in memory. The behavior of the shuffle is agnostic to
  whether the base RDD is in cache or in disk."

  I ran a test with one groupBy action and found that the intermediate shuffle
  files are written to disk even with sufficient free memory. The shuffle size
  is about 500MB and there is 1.5GB of free memory, yet disk usage increases by
  about 500MB during the process.

  Here is the vmstat log. The cache column increases when reading from disk,
  but the buff column is unchanged, so it looks like the data written to disk
  is not buffered:

procs -----------memory----------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0  10256 1616852   6664 557344    0    0     0 51380  972 2852 88  7  0  5
 1  0  10256 1592636   6664 580676    0    0     0     0  949 3777 91  9  0  0
 1  0  10256 1568228   6672 604016    0    0     0   576  923 3640 94  6  0  0
 2  0  10256 1545836   6672 627348    0    0     0     0  893 3261 95  5  0  0
 1  0  10256 1521552   6672 650668    0    0     0     0  884 3401 89 11  0  0
 2  0  10256 1497144   6672 674012    0    0     0     0  911 3275 91  9  0  0
 1  0  10256 1469260   6676 700728    0    0     4 60668 1044 3366 85 15  0  0
 1  0  10256 1453076   6684 702464    0    0     0   924  853 2596 97  3  0  0

  Is the buffer cache in write-through mode? Is there something I need to
  configure? My OS is Ubuntu 13.10, 64-bit.
  Thanks!
"""
- wxhsdp


On Sat, May 10, 2014 at 4:41 PM, Koert Kuipers  wrote:

> yes it seems broken. i got only a few emails in last few days
>
>
> On Fri, May 9, 2014 at 7:24 AM, wxhsdp  wrote:
>
>> is there something wrong with the mailing list? very few people see my
>> thread
>
>


Re: os buffer cache does not cache shuffle output file

2014-05-10 Thread Koert Kuipers
Yes, it seems broken. I got only a few emails in the last few days.


On Fri, May 9, 2014 at 7:24 AM, wxhsdp  wrote:

> is there something wrong with the mailing list? very few people see my
> thread


Re: os buffer cache does not cache shuffle output file

2014-05-10 Thread wxhsdp
Is there something wrong with the mailing list? Very few people see my thread.


