So I think a ramdisk is a simple way to try. Besides, I think Reynold's suggestion is quite valid: with such a high-end machine, putting everything in memory might not improve performance as much as assumed, since the bottleneck will just shift elsewhere, e.g. to memory bandwidth, NUMA effects, or CPU efficiency (serialization/deserialization, data processing, and so on). The code design should also take such a usage scenario into account, to use the resources more efficiently.

Thanks
Saisai
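Concretely, the ramdisk route can be as small as pointing spark.local.dir at a tmpfs mount. A minimal sketch, assuming a tmpfs already mounted at /mnt/spark-ramdisk (the mount point and size are assumptions, and on YARN the NodeManager's local dirs override this setting):

    // Minimal sketch: route shuffle scratch space to a tmpfs mount.
    // Assumes the operator has created it beforehand, e.g. (hypothetical):
    //   mount -t tmpfs -o size=200g tmpfs /mnt/spark-ramdisk
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("shuffle-on-ramdisk")             // master comes from spark-submit
      .set("spark.local.dir", "/mnt/spark-ramdisk") // shuffle files and spills land here

    val sc = new SparkContext(conf)

Shuffle output then lives in RAM but still passes through the filesystem layer, which is the residual overhead discussed below.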
On Sat, Apr 2, 2016 at 7:27 AM, Michael Slavitch <slavi...@gmail.com> wrote:

> Yes, we see it on the final write. Our preference is to eliminate this.
>
>
> On Fri, Apr 1, 2016, 7:25 PM Saisai Shao <sai.sai.s...@gmail.com> wrote:
>
>> Hi Michael, shuffle data (mapper output) has to be materialized to disk
>> eventually, no matter how much memory you have; that is how Spark is
>> designed. In your scenario, since you have a lot of memory, shuffle
>> spill should not happen frequently; most of the disk IO you see might
>> be the final shuffle file write.
>>
>> So if you want to avoid this disk IO, you could use a ramdisk as
>> Reynold suggested. If you want to avoid the filesystem overhead of the
>> ramdisk, you could try to hack together a new shuffle implementation,
>> since the shuffle framework is pluggable.
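For what that route would look like, a rough sketch: in Spark 1.x, spark.shuffle.manager also accepts a fully qualified class name, but the class named below is hypothetical and the trait it would implement is an internal API:

    // Hypothetical sketch of the "pluggable shuffle" route. The class name
    // below is made up; a real one must implement the internal
    // (private[spark]) org.apache.spark.shuffle.ShuffleManager trait --
    // registerShuffle, getWriter, getReader, unregisterShuffle, stop --
    // and therefore has to be compiled inside the org.apache.spark package
    // tree. Internal APIs like this can change between releases.
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.shuffle.manager",
        "org.apache.spark.shuffle.inmem.InMemoryShuffleManager") // hypothetical class

That avoids the filesystem layer entirely, at the cost of maintaining code against internals.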
>>
>> On Sat, Apr 2, 2016 at 6:48 AM, Michael Slavitch <slavi...@gmail.com> wrote:
>>
>>> As I mentioned earlier, this flag is now ignored.
>>>
>>>
>>> On Fri, Apr 1, 2016, 6:39 PM Michael Slavitch <slavi...@gmail.com> wrote:
>>>
>>>> Shuffling a 1 TB set of keys and values (aka sort by key) results in
>>>> about 500 GB of IO to disk if compression is enabled. Is there any
>>>> way to eliminate the IO caused by shuffling?
>>>>
>>>> On Fri, Apr 1, 2016, 6:32 PM Reynold Xin <r...@databricks.com> wrote:
>>>>
>>>>> Michael - I'm not sure if you actually read my email, but spill has
>>>>> nothing to do with the shuffle files on disk. It was for the
>>>>> partitioning (i.e. sorting) process. If that flag is off, Spark will
>>>>> just run out of memory when data doesn't fit in memory.
>>>>>
>>>>>
>>>>> On Fri, Apr 1, 2016 at 3:28 PM, Michael Slavitch <slavi...@gmail.com> wrote:
>>>>>
>>>>>> RAMdisk is a fine interim step, but there are a lot of layers that
>>>>>> could be eliminated by keeping things in memory unless there is a
>>>>>> need for spillover. At one time there was support for turning off
>>>>>> spilling. That was eliminated. Why?
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 1, 2016, 6:05 PM Mridul Muralidharan <mri...@gmail.com> wrote:
>>>>>>
>>>>>>> I think Reynold's suggestion of using a ramdisk would be a good
>>>>>>> way to test whether these are the bottlenecks or something else is.
>>>>>>> For most practical purposes, pointing the local dir to a ramdisk
>>>>>>> should effectively give you 'similar' performance to shuffling
>>>>>>> from memory.
>>>>>>>
>>>>>>> Are there concerns with taking that approach to test? (I don't see
>>>>>>> any, but I am not sure if I missed something.)
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Mridul
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Apr 1, 2016 at 2:10 PM, Michael Slavitch <slavi...@gmail.com> wrote:
>>>>>>> > I totally disagree that it's not a problem.
>>>>>>> >
>>>>>>> > - Network fetch throughput on 40G Ethernet exceeds the
>>>>>>> > throughput of NVMe drives.
>>>>>>> > - What Spark depends on is Linux's IO cache as an effective
>>>>>>> > buffer pool. This is fine for small jobs but not for jobs with
>>>>>>> > datasets in the TB/node range.
>>>>>>> > - On larger jobs, flushing the cache causes Linux to block.
>>>>>>> > - On a modern 56-hyperthread 2-socket host, the latency caused
>>>>>>> > by multiple executors writing out to disk increases greatly.
>>>>>>> >
>>>>>>> > I thought the whole point of Spark was in-memory computing? It
>>>>>>> > is in fact in-memory for some things, but it uses spark.local.dir
>>>>>>> > as a buffer pool for others.
>>>>>>> >
>>>>>>> > Hence, the performance of Spark is gated by the performance of
>>>>>>> > spark.local.dir, even on large-memory systems.
>>>>>>> >
>>>>>>> > "Currently it is not possible to not write shuffle files to disk."
>>>>>>> >
>>>>>>> > What changes >would< make it possible?
>>>>>>> >
>>>>>>> > The only one that seems possible is to clone the shuffle service
>>>>>>> > and make it in-memory.
>>>>>>> >
>>>>>>> >
>>>>>>> > On Apr 1, 2016, at 4:57 PM, Reynold Xin <r...@databricks.com> wrote:
>>>>>>> >
>>>>>>> > spark.shuffle.spill actually has nothing to do with whether we
>>>>>>> > write shuffle files to disk. Currently it is not possible to not
>>>>>>> > write shuffle files to disk, and typically it is not a problem
>>>>>>> > because the network fetch throughput is lower than what disks
>>>>>>> > can sustain. In most cases, especially with SSDs, there is little
>>>>>>> > difference between putting all of those in memory and on disk.
>>>>>>> >
>>>>>>> > However, it is becoming more common to run Spark on a small
>>>>>>> > number of beefy nodes (e.g. 2 nodes, each with 1 TB of RAM). We
>>>>>>> > do want to look into improving performance for those. In the
>>>>>>> > meantime, you can set up local ramdisks on each node for shuffle
>>>>>>> > writes.
>>>>>>> >
>>>>>>> >
>>>>>>> > On Fri, Apr 1, 2016 at 11:32 AM, Michael Slavitch <slavi...@gmail.com> wrote:
>>>>>>> >>
>>>>>>> >> Hello;
>>>>>>> >>
>>>>>>> >> I'm working on Spark with very large memory systems (2 TB+) and
>>>>>>> >> notice that Spark spills to disk in shuffle. Is there a way to
>>>>>>> >> force Spark to stay in memory when doing shuffle operations?
>>>>>>> >> The goal is to keep the shuffle data either in the heap or in
>>>>>>> >> off-heap memory (in 1.6.x) and never touch the IO subsystem. I
>>>>>>> >> am willing to have the job fail if it runs out of RAM.
>>>>>>> >>
>>>>>>> >> spark.shuffle.spill true is deprecated in 1.6 and does not work
>>>>>>> >> with the Tungsten sort in 1.5.x:
>>>>>>> >>
>>>>>>> >> "WARN UnsafeShuffleManager: spark.shuffle.spill was set to
>>>>>>> >> false, but this is ignored by the tungsten-sort shuffle manager;
>>>>>>> >> its optimized shuffles will continue to spill to disk when
>>>>>>> >> necessary."
>>>>>>> >>
>>>>>>> >> If this is impossible via configuration changes, what code
>>>>>>> >> changes would be needed to accomplish this?
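To put numbers on the disk-versus-ramdisk question, a toy, scaled-down sketch of the sort-by-key workload described above; the app name, record count, and partition count are illustrative:

    // Toy sketch of the shuffle-heavy pattern under discussion: sortByKey
    // forces a full shuffle of every key/value pair. Run it once with
    // spark.local.dir on disk and once on a ramdisk and compare timings.
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("shuffle-bench")) // master via spark-submit

    val data = sc.parallelize(0L until 100000000L, 512)
      .map(i => (scala.util.Random.nextLong(), i)) // random keys defeat any ordering

    val t0 = System.nanoTime()
    data.sortByKey().count() // triggers the full shuffle
    println(s"sort took ${(System.nanoTime() - t0) / 1e9} s")

If the ramdisk run is not much faster, that points at the other bottlenecks mentioned above (serialization, memory bandwidth, NUMA) rather than the disk itself.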