os buffer cache does not cache shuffle output file
Hi,

Patrick said: "The intermediate shuffle output gets written to disk, but it often hits the OS-buffer cache since it's not explicitly fsync'ed, so in many cases it stays entirely in memory. The behavior of the shuffle is agnostic to whether the base RDD is in cache or in disk."

I ran a test with a single groupBy action and found that the intermediate shuffle files are written to disk even though there is sufficient free memory. The shuffle size is about 500MB and there is 1.5GB of free memory, and I noticed that disk usage increases by about 500MB during the process.

Here is the log from vmstat. You can see that the cache column increases when reading from disk, but the buff column is unchanged, so the data written to disk is not buffered:

procs -----------memory----------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0  10256 1616852   6664 557344    0    0     0 51380  972 2852 88  7  0  5
 1  0  10256 1592636   6664 580676    0    0     0     0  949 3777 91  9  0  0
 1  0  10256 1568228   6672 604016    0    0     0   576  923 3640 94  6  0  0
 2  0  10256 1545836   6672 627348    0    0     0     0  893 3261 95  5  0  0
 1  0  10256 1521552   6672 650668    0    0     0     0  884 3401 89 11  0  0
 2  0  10256 1497144   6672 674012    0    0     0     0  911 3275 91  9  0  0
 1  0  10256 1469260   6676 700728    0    0     4 60668 1044 3366 85 15  0  0
 1  0  10256 1453076   6684 702464    0    0     0   924  853 2596 97  3  0  0

Is the buffer cache in write-through mode? Is there something I need to configure? My OS is Ubuntu 13.10, 64-bit.

Thanks!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/os-buffer-cache-does-not-cache-shuffle-output-file-tp5478.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
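As a rough cross-check of the vmstat log above, the following sketch (not part of the original post) computes how much the free, buff, and cache columns changed between the first and last samples. It assumes vmstat's default memory unit of KiB and reads the cache column as running from 557344 to 702464, which is an interpretation of the log's formatting. The names `SAMPLES` and `deltas` are made up for illustration.

```python
# Memory columns copied from the vmstat log in the post.
# Assumption: values are in KiB (vmstat's default unit).
# Each tuple is (free, buff, cache) for one sample.
SAMPLES = [
    (1616852, 6664, 557344),
    (1592636, 6664, 580676),
    (1568228, 6672, 604016),
    (1545836, 6672, 627348),
    (1521552, 6672, 650668),
    (1497144, 6672, 674012),
    (1469260, 6676, 700728),
    (1453076, 6684, 702464),
]


def deltas(samples):
    """Return the change in each column between the first and last sample."""
    first, last = samples[0], samples[-1]
    return {
        "free_kib": last[0] - first[0],
        "buff_kib": last[1] - first[1],
        "cache_kib": last[2] - first[2],
    }


if __name__ == "__main__":
    for name, value in deltas(SAMPLES).items():
        print(f"{name}: {value:+d} KiB ({value / 1024:+.1f} MiB)")
```

Over these eight samples, cache grows by 145120 KiB (about 142 MiB) and free falls by 163776 KiB, while buff moves by only 20 KiB.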
Re: os buffer cache does not cache shuffle output file
Seems the mailing list was broken when you sent your original question, so I appended it to the end of this message.

"Buffers" is relatively unimportant in today's Linux kernel; "cache" is used for both writing and reading [1]. What you are seeing seems to be the expected behavior: the data is written to the page cache (increasing its size), and also written out asynchronously to the disk. As long as there's room in the page cache, the write should not block on IO.

[1] http://stackoverflow.com/questions/6345020/linux-memory-buffer-vs-cache (contains better citations)

"""
Hi,

Patrick said: "The intermediate shuffle output gets written to disk, but it often hits the OS-buffer cache since it's not explicitly fsync'ed, so in many cases it stays entirely in memory. The behavior of the shuffle is agnostic to whether the base RDD is in cache or in disk."

I ran a test with a single groupBy action and found that the intermediate shuffle files are written to disk even though there is sufficient free memory. The shuffle size is about 500MB and there is 1.5GB of free memory, and I noticed that disk usage increases by about 500MB during the process.

Here is the log from vmstat. You can see that the cache column increases when reading from disk, but the buff column is unchanged, so the data written to disk is not buffered:

procs -----------memory----------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0  10256 1616852   6664 557344    0    0     0 51380  972 2852 88  7  0  5
 1  0  10256 1592636   6664 580676    0    0     0     0  949 3777 91  9  0  0
 1  0  10256 1568228   6672 604016    0    0     0   576  923 3640 94  6  0  0
 2  0  10256 1545836   6672 627348    0    0     0     0  893 3261 95  5  0  0
 1  0  10256 1521552   6672 650668    0    0     0     0  884 3401 89 11  0  0
 2  0  10256 1497144   6672 674012    0    0     0     0  911 3275 91  9  0  0
 1  0  10256 1469260   6676 700728    0    0     4 60668 1044 3366 85 15  0  0
 1  0  10256 1453076   6684 702464    0    0     0   924  853 2596 97  3  0  0

Is the buffer cache in write-through mode? Is there something I need to configure? My OS is Ubuntu 13.10, 64-bit.

Thanks!
"""
- wxhsdp

On Sat, May 10, 2014 at 4:41 PM, Koert Kuipers wrote:
> yes it seems broken. i got only a few emails in last few days
>
> On Fri, May 9, 2014 at 7:24 AM, wxhsdp wrote:
>> is there something wrong with the mailing list? very few people see my
>> thread
Re: os buffer cache does not cache shuffle output file
Yes, it seems broken. I got only a few emails in the last few days.

On Fri, May 9, 2014 at 7:24 AM, wxhsdp wrote:
> is there something wrong with the mailing list? very few people see my
> thread
Re: os buffer cache does not cache shuffle output file
Is there something wrong with the mailing list? Very few people see my thread.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/os-buffer-cache-does-not-cache-shuffle-output-file-tp5478p5521.html