Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

Michael Slavitch Fri, 01 Apr 2016 16:28:36 -0700

Yes we see it on final write.  Our preference is to eliminate this.

On Fri, Apr 1, 2016, 7:25 PM Saisai Shao <sai.sai.s...@gmail.com> wrote:


> Hi Michael, shuffle data (mapper output) has to be materialized to disk
> in the end, no matter how much memory you have; that is by design in
> Spark. In your scenario, since you have a lot of memory, shuffle spill
> should not happen frequently, so most of the disk IO you see is probably
> the final shuffle file write.
>
> So if you want to avoid this disk IO, you could use a ramdisk as Reynold
> suggested. If you want to avoid the FS overhead of a ramdisk, you could try
> to hack a new shuffle implementation, since the shuffle framework is
> pluggable.
>
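
A minimal sketch of how the two options above might look as
spark-defaults.conf entries. spark.local.dir and spark.shuffle.manager are
real Spark properties; the ramdisk path and the shuffle manager class name
below are hypothetical placeholders, not values taken from this thread.

```python
# Sketch: render spark-defaults.conf entries for the two options above.
# RAMDISK_DIR and CUSTOM_SHUFFLE_MANAGER are hypothetical placeholders.

RAMDISK_DIR = "/mnt/ramdisk/spark"  # assumed tmpfs mount point
CUSTOM_SHUFFLE_MANAGER = "com.example.shuffle.InMemoryShuffleManager"  # hypothetical class


def render_conf(props):
    """Format properties as whitespace-separated spark-defaults.conf lines."""
    width = max(len(key) for key in props)
    return "\n".join(f"{key.ljust(width)}  {value}" for key, value in props.items())


def ramdisk_conf(local_dir=RAMDISK_DIR):
    # Option 1: keep the stock shuffle, but place its files on a RAM-backed directory.
    return {"spark.local.dir": local_dir}


def custom_shuffle_conf(manager_class=CUSTOM_SHUFFLE_MANAGER):
    # Option 2: plug in a different shuffle implementation by class name.
    return {"spark.shuffle.manager": manager_class}


if __name__ == "__main__":
    print(render_conf(ramdisk_conf()))
    print(render_conf(custom_shuffle_conf()))
```

Note that cluster managers can override spark.local.dir (SPARK_LOCAL_DIRS on
standalone and Mesos, LOCAL_DIRS on YARN), so the ramdisk path may need to be
set there instead.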
>
> On Sat, Apr 2, 2016 at 6:48 AM, Michael Slavitch <slavi...@gmail.com>
> wrote:
>
>> As I mentioned earlier this flag is now ignored.
>>
>>
>> On Fri, Apr 1, 2016, 6:39 PM Michael Slavitch <slavi...@gmail.com> wrote:
>>
>>> Shuffling a 1 TB set of keys and values (aka sort by key) results in
>>> about 500 GB of IO to disk if compression is enabled. Is there any way to
>>> stop shuffling from causing IO?
>>>
>>> On Fri, Apr 1, 2016, 6:32 PM Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> Michael - I'm not sure if you actually read my email, but spill has
>>>> nothing to do with the shuffle files on disk. It was for the partitioning
>>>> (i.e. sorting) process. If that flag is off, Spark will just run out of
>>>> memory when data doesn't fit in memory.
>>>>
>>>>
>>>> On Fri, Apr 1, 2016 at 3:28 PM, Michael Slavitch <slavi...@gmail.com>
>>>> wrote:
>>>>
>>>>> RAMdisk is a fine interim step but there are a lot of layers eliminated
>>>>> by keeping things in memory unless there is a need for spillover.  At one
>>>>> time there was support for turning off spilling.  That was eliminated.
>>>>> Why?
>>>>>
>>>>>
>>>>> On Fri, Apr 1, 2016, 6:05 PM Mridul Muralidharan <mri...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I think Reynold's suggestion of using ram disk would be a good way to
>>>>>> test if these are the bottlenecks or something else is.
>>>>>> For most practical purposes, pointing local dir to ramdisk should
>>>>>> effectively give you 'similar' performance as shuffling from memory.
>>>>>>
>>>>>> Are there concerns with taking that approach to test? (I don't see
>>>>>> any, but I am not sure if I missed something.)
>>>>>>
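
Before running such a test it may help to confirm that the local dir really
is RAM-backed. The sketch below assumes Linux and the usual /proc/mounts
layout (device, mount point, fs type, ...); the helper names and the example
paths are made up for illustration, and escaped characters in mount points
are not handled.

```python
import os


def mount_fs_type(path, mounts_text):
    """Return the filesystem type of the mount that contains `path`.

    `mounts_text` is the content of /proc/mounts: one mount per line,
    with fields device, mount point, fs type, options, dump, pass.
    """
    path = os.path.abspath(path)
    best_point, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        point, fs_type = fields[1], fields[2]
        prefix = point if point.endswith("/") else point + "/"
        inside = path == point or path.startswith(prefix)
        # The longest matching mount point is the one that holds the path;
        # on ties, the later entry wins because it was mounted on top.
        if inside and len(point) >= len(best_point):
            best_point, best_type = point, fs_type
    return best_type


def is_ram_backed(path, mounts_text):
    """True if `path` lives on a tmpfs or ramfs mount."""
    return mount_fs_type(path, mounts_text) in ("tmpfs", "ramfs")


if __name__ == "__main__":
    # Sample mount table for illustration; on a real node, read /proc/mounts.
    sample = (
        "/dev/nvme0n1p1 / ext4 rw 0 0\n"
        "tmpfs /mnt/ramdisk tmpfs rw,size=512g 0 0\n"
    )
    print(is_ram_backed("/mnt/ramdisk/spark", sample))
    print(is_ram_backed("/var/tmp/spark", sample))
```

On a real node the second argument would be the text read from /proc/mounts,
and the first would be each directory listed in spark.local.dir.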
>>>>>>
>>>>>> Regards,
>>>>>> Mridul
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 1, 2016 at 2:10 PM, Michael Slavitch <slavi...@gmail.com>
>>>>>> wrote:
>>>>>> > I totally disagree that it’s not a problem.
>>>>>> >
>>>>>> > - Network fetch throughput on 40G Ethernet exceeds the throughput of
>>>>>> >   NVMe drives.
>>>>>> > - What Spark is depending on is Linux’s IO cache as an effective buffer
>>>>>> >   pool. This is fine for small jobs but not for jobs with datasets in
>>>>>> >   the TB/node range.
>>>>>> > - On larger jobs flushing the cache causes Linux to block.
>>>>>> > - On a modern 56-hyperthread 2-socket host the latency caused by
>>>>>> >   multiple executors writing out to disk increases greatly.
>>>>>> >
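
The first point can be put into rough numbers. This is a back-of-the-envelope
sketch only: 40 Gbit/s is taken as raw line rate with no protocol overhead,
the NVMe write rate is an assumed illustrative figure rather than a
measurement, and the 500 GB volume is the compressed shuffle size quoted
elsewhere in this thread.

```python
def gbit_to_gbyte_per_s(gbit_per_s):
    """Convert a link rate in Gbit/s to GB/s (8 bits per byte, no overhead)."""
    return gbit_per_s / 8.0


def seconds_to_move(gigabytes, gbyte_per_s):
    """Time to move `gigabytes` at a sustained rate of `gbyte_per_s`."""
    return gigabytes / gbyte_per_s


NETWORK_GBYTE_PER_S = gbit_to_gbyte_per_s(40)  # 40G Ethernet line rate: 5.0 GB/s
NVME_WRITE_GBYTE_PER_S = 2.0  # ASSUMED sustained write rate of one drive
SHUFFLE_GB = 500  # compressed shuffle volume mentioned in the thread

if __name__ == "__main__":
    net = seconds_to_move(SHUFFLE_GB, NETWORK_GBYTE_PER_S)
    disk = seconds_to_move(SHUFFLE_GB, NVME_WRITE_GBYTE_PER_S)
    print(f"network: {net:.0f} s, single drive: {disk:.0f} s")
```

With these assumptions the drive, not the network, is the slower stage;
striping across several drives or a different assumed write rate changes the
picture, so the numbers should be replaced with measured values.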
>>>>>> > I thought the whole point of Spark was in-memory computing?  It’s in
>>>>>> > fact in-memory for some things but uses spark.local.dir as a buffer
>>>>>> > pool for others.
>>>>>> >
>>>>>> > Hence, the performance of Spark is gated by the performance of
>>>>>> > spark.local.dir, even on large memory systems.
>>>>>> >
>>>>>> > “Currently it is not possible to not write shuffle files to disk.”
>>>>>> >
>>>>>> > What changes >would< make it possible?
>>>>>> >
>>>>>> > The only one that seems possible is to clone the shuffle service and
>>>>>> > make it in-memory.
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > On Apr 1, 2016, at 4:57 PM, Reynold Xin <r...@databricks.com> wrote:
>>>>>> >
>>>>>> > spark.shuffle.spill actually has nothing to do with whether we write
>>>>>> > shuffle files to disk. Currently it is not possible to not write
>>>>>> > shuffle files to disk, and typically it is not a problem because the
>>>>>> > network fetch throughput is lower than what disks can sustain. In most
>>>>>> > cases, especially with SSDs, there is little difference between
>>>>>> > putting all of those in memory and on disk.
>>>>>> >
>>>>>> > However, it is becoming more common to run Spark on a small number of
>>>>>> > beefy nodes (e.g. 2 nodes each with 1TB of RAM). We do want to look
>>>>>> > into improving performance for those. In the meantime, you can set up
>>>>>> > local ramdisks on each node for shuffle writes.
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > On Fri, Apr 1, 2016 at 11:32 AM, Michael Slavitch <slavi...@gmail.com>
>>>>>> > wrote:
>>>>>> >>
>>>>>> >> Hello;
>>>>>> >>
>>>>>> >> I’m working on Spark with very large memory systems (2TB+) and notice
>>>>>> >> that Spark spills to disk in shuffle.  Is there a way to force Spark
>>>>>> >> to stay in memory when doing shuffle operations?   The goal is to keep
>>>>>> >> the shuffle data either in the heap or in off-heap memory (in 1.6.x)
>>>>>> >> and never touch the IO subsystem.  I am willing to have the job fail
>>>>>> >> if it runs out of RAM.
>>>>>> >>
>>>>>> >> spark.shuffle.spill true  is deprecated in 1.6 and does not work in
>>>>>> >> Tungsten sort in 1.5.x
>>>>>> >>
>>>>>> >> “WARN UnsafeShuffleManager: spark.shuffle.spill was set to false, but
>>>>>> >> this is ignored by the tungsten-sort shuffle manager; its optimized
>>>>>> >> shuffles will continue to spill to disk when necessary.”
>>>>>> >>
>>>>>> >> If this is impossible via configuration changes what code changes
>>>>>> >> would be needed to accomplish this?
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >
>>>>>> >
>>>>>>
>>>>> --
>>>>> Michael Slavitch
>>>>> 62 Renfrew Ave.
>>>>> Ottawa Ontario
>>>>> K1S 1Z5
>>>>>
>>>>
