So I think a ramdisk is a simple way to try.
Besides, I think Reynold's suggestion is quite valid: on such a high-end
machine, putting everything in memory might not improve performance as much
as assumed, since the bottleneck will just shift elsewhere, e.g. to memory
bandwidth, NUMA effects, or CPU efficiency.
Yes, we see it on the final write. Our preference is to eliminate this.
On Fri, Apr 1, 2016, 7:25 PM Saisai Shao wrote:
Hi Michael, shuffle data (mapper output) has to be materialized to disk
eventually, no matter how much memory you have; that is by design in
Spark. In your scenario, since you have a lot of memory, shuffle spill
should not happen frequently, and most of the disk IO you see is probably
the final shuffle write.
Shuffling a 1 TB set of keys and values (i.e. a sort by key) results in
about 500 GB of IO to disk if compression is enabled. Is there any way to
eliminate the IO caused by shuffling?
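The 500 GB figure is consistent with a back-of-envelope estimate; a minimal sketch, assuming shuffle compression roughly halves the data (the 0.5 ratio is an assumption for illustration, not a measured value):

```python
# Rough estimate of shuffle bytes written to disk (hypothetical numbers).
shuffle_bytes = 1 * 1024**4      # 1 TiB of keys and values
compression_ratio = 0.5          # assumed ~2:1 codec ratio; workload-dependent
written_gib = shuffle_bytes * compression_ratio / 1024**3
print(written_gib)               # ~512 GiB, in line with the ~500 GB observed
```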
On Fri, Apr 1, 2016, 6:32 PM Reynold Xin wrote:
> Michael - I'm not sure if you actually read my email, but spill has
> nothing to do with the shuffle files on disk. It was for the partitioning
> (i.e. sorting) process. If that flag is off, Spark will just run out of
> memory when data doesn't fit in memory.
>
> [...]r a test writing to RAM Disk if that configuration is
> available.
>
> Thanks
>
> Yong
>
> --
> From: r...@databricks.com
> Date: Fri, 1 Apr 2016 15:32:23 -0700
> Subject: Re: Eliminating shuffle write and spill disk IO reads/writes in
> Spark
> To: slavi...@gmail.com
> CC: mri...@gmail.com; d...@spark.apache.org; user@spark.apache.org
Michael - I'm not sure if you actually read my email, but spill has nothing
to do with the shuffle files on disk. It was for the partitioning (i.e.
sorting) process. If that flag is off, Spark will just run out of memory
when data doesn't fit in memory.
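For reference, this is the flag being discussed; a spark-defaults.conf sketch (note that this setting was honored in Spark 1.5 and earlier, and is ignored from 1.6 onward, where sort-based shuffle always spills when needed):

```
# spark-defaults.conf (sketch)
# Honored by Spark <= 1.5; ignored as of Spark 1.6.
spark.shuffle.spill  false
```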
On Fri, Apr 1, 2016 at 3:28 PM, Michael wrote:
RAMdisk is a fine interim step, but there are a lot of layers that could
be eliminated by keeping things in memory unless there is a need for
spillover. At one time there was support for turning off spilling. That
was eliminated. Why?
On Fri, Apr 1, 2016, 6:05 PM Mridul Muralidharan wrote:
I think Reynold's suggestion of using a ram disk would be a good way to
test whether these are the bottlenecks or something else is.
For most practical purposes, pointing the local dir to a ramdisk should
effectively give you 'similar' performance to shuffling from memory.
Are there concerns with taking that approach?
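The suggestion above can be sketched as follows; a hedged example, assuming Linux tmpfs, where the mount point /mnt/spark-ramdisk and the 200g size are placeholders to adapt (mounting requires root, and tmpfs contents are lost on reboot):

```
# Create a tmpfs-backed ramdisk (size is a placeholder; requires root)
sudo mkdir -p /mnt/spark-ramdisk
sudo mount -t tmpfs -o size=200g tmpfs /mnt/spark-ramdisk

# Point Spark's shuffle/scratch space at it for one job
spark-submit --conf spark.local.dir=/mnt/spark-ramdisk ...
```

Note that tmpfs pages count against the machine's memory, so the ramdisk size has to be budgeted alongside executor heaps.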
I totally disagree that it's not a problem.
- Network fetch throughput on 40G Ethernet exceeds the throughput of NVMe
drives.
- What Spark is depending on is Linux's IO cache as an effective buffer
pool. This is fine for small jobs but not for jobs with datasets in the
TB/node range.
- On
spark.shuffle.spill actually has nothing to do with whether we write
shuffle files to disk. Currently it is not possible to not write shuffle
files to disk, and typically it is not a problem because the network fetch
throughput is lower than what disks can sustain. In most cases, especially
with
Hello;
I'm working on Spark with very large memory systems (2TB+) and notice that
Spark spills to disk in shuffle. Is there a way to force Spark to stay in
memory when doing shuffle operations? The goal is to keep the shuffle data
either in the heap or in off-heap memory (in 1.6.x) and
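On the off-heap point: Spark 1.6.x does expose off-heap execution memory through the unified memory manager; a sketch of the relevant settings (the size is a placeholder, given in bytes, and note these govern execution memory for sorting/aggregation, not the shuffle files themselves):

```
# spark-defaults.conf (sketch; size is a placeholder)
spark.memory.offHeap.enabled  true
# 64 GiB, expressed in bytes
spark.memory.offHeap.size     68719476736
```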