Hi Michael, shuffle data (the mapper output) has to be materialized to disk
eventually, no matter how much memory you have; that is by design in Spark.
In your scenario, since you have a lot of memory, shuffle spill should
not happen frequently; most of the disk IO you see is probably the final
shuffle file write.
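
If you want to confirm which of the two dominates, the task metrics already
separate them: shuffle bytes written (the final map output files) vs. bytes
spilled during the sort. Below is a minimal sketch of a listener that logs
both; the field names follow the current TaskMetrics API and may differ
slightly in older releases (e.g. where the shuffle write metrics are wrapped
in an Option), so treat it as illustrative.

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Logs, per task, the final shuffle output written versus the data
// spilled during the in-memory sort, so you can see which one is
// actually producing the disk IO.
class ShuffleIoListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {  // metrics can be missing for failed tasks
      println(s"task ${taskEnd.taskInfo.taskId}: " +
        s"shuffleWrite=${m.shuffleWriteMetrics.bytesWritten}B " +
        s"memorySpilled=${m.memoryBytesSpilled}B " +
        s"diskSpilled=${m.diskBytesSpilled}B")
    }
  }
}

// Register it before running the job:
//   sc.addSparkListener(new ShuffleIoListener)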

So if you want to avoid this disk IO, you could use a ramdisk as Reynold
suggested. If you want to avoid the filesystem overhead of a ramdisk, you
could try writing a new shuffle implementation, since the shuffle framework
is pluggable.
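
For the ramdisk route, the only Spark-side change is pointing
spark.local.dir (where shuffle files and spills land) at the tmpfs mount.
A minimal sketch, assuming a tmpfs is already mounted at /mnt/ramdisk on
every node (the path and size here are illustrative, not from this thread):

import org.apache.spark.{SparkConf, SparkContext}

// Shuffle output and spill files are written under spark.local.dir,
// so pointing it at a tmpfs keeps that IO off the physical disks.
// The mount itself must exist on every worker beforehand, e.g.
//   mount -t tmpfs -o size=200g tmpfs /mnt/ramdisk
val conf = new SparkConf()
  .setAppName("shuffle-on-ramdisk")
  .set("spark.local.dir", "/mnt/ramdisk/spark")

val sc = new SparkContext(conf)

Note that under YARN the node manager's local dirs override spark.local.dir,
so there you would point yarn.nodemanager.local-dirs at the ramdisk instead.
If you go the custom-shuffle route, the manager is selected via
spark.shuffle.manager, which also accepts a fully qualified class name, but
the ShuffleManager interface is internal to Spark, so expect to build
against Spark internals.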


On Sat, Apr 2, 2016 at 6:48 AM, Michael Slavitch <slavi...@gmail.com> wrote:

> As I mentioned earlier, this flag is now ignored.
>
>
> On Fri, Apr 1, 2016, 6:39 PM Michael Slavitch <slavi...@gmail.com> wrote:
>
>> Shuffling a 1 TB set of keys and values (aka sort by key) results in
>> about 500 GB of IO to disk if compression is enabled. Is there any way to
>> eliminate the IO caused by shuffling?
>>
>> On Fri, Apr 1, 2016, 6:32 PM Reynold Xin <r...@databricks.com> wrote:
>>
>>> Michael - I'm not sure if you actually read my email, but spill has
>>> nothing to do with the shuffle files on disk. It was for the partitioning
>>> (i.e. sorting) process. If that flag is off, Spark will just run out of
>>> memory when data doesn't fit in memory.
>>>
>>>
>>> On Fri, Apr 1, 2016 at 3:28 PM, Michael Slavitch <slavi...@gmail.com>
>>> wrote:
>>>
>>>> RAMdisk is a fine interim step, but a lot of layers are eliminated
>>>> by keeping things in memory unless there is a need for spillover. At one
>>>> time there was support for turning off spilling. That was eliminated.
>>>> Why?
>>>>
>>>>
>>>> On Fri, Apr 1, 2016, 6:05 PM Mridul Muralidharan <mri...@gmail.com>
>>>> wrote:
>>>>
>>>>> I think Reynold's suggestion of using a ramdisk would be a good way to
>>>>> test whether these are the bottlenecks or something else is.
>>>>> For most practical purposes, pointing the local dir to a ramdisk should
>>>>> effectively give you 'similar' performance to shuffling from memory.
>>>>>
>>>>> Are there concerns with taking that approach to test? (I don't see
>>>>> any, but I am not sure if I missed something.)
>>>>>
>>>>>
>>>>> Regards,
>>>>> Mridul
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Apr 1, 2016 at 2:10 PM, Michael Slavitch <slavi...@gmail.com>
>>>>> wrote:
>>>>> > I totally disagree that it’s not a problem.
>>>>> >
>>>>> > - Network fetch throughput on 40G Ethernet exceeds the throughput of
>>>>> NVME
>>>>> > drives.
>>>>> > - What Spark depends on is Linux’s IO cache as an effective
>>>>> buffer pool.
>>>>> > This is fine for small jobs but not for jobs with datasets in the
>>>>> TB/node
>>>>> > range.
>>>>> > - On larger jobs flushing the cache causes Linux to block.
>>>>> > - On a modern 56-hyperthread 2-socket host the latency caused by
>>>>> multiple
>>>>> > executors writing out to disk increases greatly.
>>>>> >
>>>>> > I thought the whole point of Spark was in-memory computing? It’s in
>>>>> fact
>>>>> > in-memory for some things but uses spark.local.dir as a buffer pool
>>>>> for
>>>>> > others.
>>>>> >
>>>>> > Hence, the performance of  Spark is gated by the performance of
>>>>> > spark.local.dir, even on large memory systems.
>>>>> >
>>>>> > "Currently it is not possible to not write shuffle files to disk.”
>>>>> >
>>>>> > What changes >would< make it possible?
>>>>> >
>>>>> > The only one that seems possible is to clone the shuffle service and
>>>>> make it
>>>>> > in-memory.
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > On Apr 1, 2016, at 4:57 PM, Reynold Xin <r...@databricks.com> wrote:
>>>>> >
>>>>> > spark.shuffle.spill actually has nothing to do with whether we write
>>>>> shuffle
>>>>> > files to disk. Currently it is not possible to not write shuffle
>>>>> files to
>>>>> > disk, and typically it is not a problem because the network fetch
>>>>> throughput
>>>>> > is lower than what disks can sustain. In most cases, especially with
>>>>> SSDs,
>>>>> > there is little difference between putting all of those in memory
>>>>> and on
>>>>> > disk.
>>>>> >
>>>>> > However, it is becoming more common to run Spark on a small number of
>>>>> beefy
>>>>> > nodes (e.g. 2 nodes each with 1TB of RAM). We do want to look into
>>>>> improving
>>>>> > performance for those. In the meantime, you can set up local ramdisks on
>>>>> each node
>>>>> > for shuffle writes.
>>>>> >
>>>>> >
>>>>> >
>>>>> > On Fri, Apr 1, 2016 at 11:32 AM, Michael Slavitch <
>>>>> slavi...@gmail.com>
>>>>> > wrote:
>>>>> >>
>>>>> >> Hello;
>>>>> >>
>>>>> >> I’m working on Spark with very large memory systems (2TB+) and
>>>>> notice that
>>>>> >> Spark spills to disk in shuffle. Is there a way to force Spark to
>>>>> stay in
>>>>> >> memory when doing shuffle operations?   The goal is to keep the
>>>>> shuffle data
>>>>> >> either in the heap or in off-heap memory (in 1.6.x) and never touch
>>>>> the IO
>>>>> >> subsystem.  I am willing to have the job fail if it runs out of RAM.
>>>>> >>
>>>>> >> spark.shuffle.spill true  is deprecated in 1.6 and does not work in
>>>>> >> Tungsten sort in 1.5.x
>>>>> >>
>>>>> >> "WARN UnsafeShuffleManager: spark.shuffle.spill was set to false,
>>>>> but this
>>>>> >> is ignored by the tungsten-sort shuffle manager; its optimized
>>>>> shuffles will
>>>>> >> continue to spill to disk when necessary.”
>>>>> >>
>>>>> >> If this is impossible via configuration changes what code changes
>>>>> would be
>>>>> >> needed to accomplish this?
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >
>>>>> >
>>>>>
>>>> --
>>>> Michael Slavitch
>>>> 62 Renfrew Ave.
>>>> Ottawa Ontario
>>>> K1S 1Z5
>>>>
>>>
>>> --
>> Michael Slavitch
>> 62 Renfrew Ave.
>> Ottawa Ontario
>> K1S 1Z5
>>
> --
> Michael Slavitch
> 62 Renfrew Ave.
> Ottawa Ontario
> K1S 1Z5
>
