So I think a ramdisk is a simple way to try. Besides, I think Reynold's suggestion is quite valid: with such a high-end machine, putting everything in memory might not improve performance as much as assumed, since the bottleneck will just shift elsewhere, e.g. to memory bandwidth, NUMA effects, or CPU efficiency (serialization/deserialization, data processing, and so on). The code design should also take such a usage scenario into account, to use the resources more efficiently.

Thanks
Saisai
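Concretely, the ramdisk route can be as small as pointing spark.local.dir at a tmpfs mount. A minimal sketch, assuming a tmpfs already mounted at /mnt/spark-ramdisk (the mount point and size are assumptions, and on YARN the NodeManager's local dirs override this setting):

    // Minimal sketch: route shuffle scratch space to a tmpfs mount.
    // Assumes the operator has created it beforehand, e.g. (hypothetical):
    //   mount -t tmpfs -o size=200g tmpfs /mnt/spark-ramdisk
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("shuffle-on-ramdisk")             // master comes from spark-submit
      .set("spark.local.dir", "/mnt/spark-ramdisk") // shuffle files and spills land here

    val sc = new SparkContext(conf)

Shuffle output then lives in RAM but still passes through the filesystem layer, which is the residual overhead discussed below.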
On Sat, Apr 2, 2016 at 7:27 AM, Michael Slavitch <slavi...@gmail.com> wrote:

> Yes, we see it on the final write. Our preference is to eliminate this.
>
>
> On Fri, Apr 1, 2016, 7:25 PM Saisai Shao <sai.sai.s...@gmail.com> wrote:
>
>> Hi Michael, shuffle data (mapper output) has to be materialized to disk
>> eventually, no matter how much memory you have; that is how Spark is
>> designed. In your scenario, since you have a lot of memory, shuffle
>> spill should not happen frequently; most of the disk IO you see might
>> be the final shuffle file write.
>>
>> So if you want to avoid this disk IO, you could use a ramdisk as
>> Reynold suggested. If you want to avoid the filesystem overhead of the
>> ramdisk, you could try to hack together a new shuffle implementation,
>> since the shuffle framework is pluggable.
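For what that route would look like, a rough sketch: in Spark 1.x, spark.shuffle.manager also accepts a fully qualified class name, but the class named below is hypothetical and the trait it would implement is an internal API:

    // Hypothetical sketch of the "pluggable shuffle" route. The class name
    // below is made up; a real one must implement the internal
    // (private[spark]) org.apache.spark.shuffle.ShuffleManager trait --
    // registerShuffle, getWriter, getReader, unregisterShuffle, stop --
    // and therefore has to be compiled inside the org.apache.spark package
    // tree. Internal APIs like this can change between releases.
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.shuffle.manager",
        "org.apache.spark.shuffle.inmem.InMemoryShuffleManager") // hypothetical class

That avoids the filesystem layer entirely, at the cost of maintaining code against internals.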
>>
>> On Sat, Apr 2, 2016 at 6:48 AM, Michael Slavitch <slavi...@gmail.com> wrote:
>>
>>> As I mentioned earlier, this flag is now ignored.
>>>
>>>
>>> On Fri, Apr 1, 2016, 6:39 PM Michael Slavitch <slavi...@gmail.com> wrote:
>>>
>>>> Shuffling a 1 TB set of keys and values (aka sort by key) results in
>>>> about 500 GB of IO to disk if compression is enabled. Is there any
>>>> way to eliminate the IO caused by shuffling?
>>>>
>>>> On Fri, Apr 1, 2016, 6:32 PM Reynold Xin <r...@databricks.com> wrote:
>>>>
>>>>> Michael - I'm not sure if you actually read my email, but spill has
>>>>> nothing to do with the shuffle files on disk. It was for the
>>>>> partitioning (i.e. sorting) process. If that flag is off, Spark will
>>>>> just run out of memory when data doesn't fit in memory.
>>>>>
>>>>>
>>>>> On Fri, Apr 1, 2016 at 3:28 PM, Michael Slavitch <slavi...@gmail.com> wrote:
>>>>>
>>>>>> RAMdisk is a fine interim step, but there are a lot of layers that
>>>>>> could be eliminated by keeping things in memory unless there is a
>>>>>> need for spillover. At one time there was support for turning off
>>>>>> spilling. That was eliminated. Why?
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 1, 2016, 6:05 PM Mridul Muralidharan <mri...@gmail.com> wrote:
>>>>>>
>>>>>>> I think Reynold's suggestion of using a ramdisk would be a good
>>>>>>> way to test whether these are the bottlenecks or something else is.
>>>>>>> For most practical purposes, pointing the local dir to a ramdisk
>>>>>>> should effectively give you 'similar' performance to shuffling
>>>>>>> from memory.
>>>>>>>
>>>>>>> Are there concerns with taking that approach to test? (I don't see
>>>>>>> any, but I am not sure if I missed something.)
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Mridul
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Apr 1, 2016 at 2:10 PM, Michael Slavitch <slavi...@gmail.com> wrote:
>>>>>>> > I totally disagree that it's not a problem.
>>>>>>> >
>>>>>>> > - Network fetch throughput on 40G Ethernet exceeds the
>>>>>>> > throughput of NVMe drives.
>>>>>>> > - What Spark depends on is Linux's IO cache as an effective
>>>>>>> > buffer pool. This is fine for small jobs but not for jobs with
>>>>>>> > datasets in the TB/node range.
>>>>>>> > - On larger jobs, flushing the cache causes Linux to block.
>>>>>>> > - On a modern 56-hyperthread 2-socket host, the latency caused
>>>>>>> > by multiple executors writing out to disk increases greatly.
>>>>>>> >
>>>>>>> > I thought the whole point of Spark was in-memory computing? It
>>>>>>> > is in fact in-memory for some things, but it uses spark.local.dir
>>>>>>> > as a buffer pool for others.
>>>>>>> >
>>>>>>> > Hence, the performance of Spark is gated by the performance of
>>>>>>> > spark.local.dir, even on large-memory systems.
>>>>>>> >
>>>>>>> > "Currently it is not possible to not write shuffle files to disk."
>>>>>>> >
>>>>>>> > What changes >would< make it possible?
>>>>>>> >
>>>>>>> > The only one that seems possible is to clone the shuffle service
>>>>>>> > and make it in-memory.
>>>>>>> >
>>>>>>> >
>>>>>>> > On Apr 1, 2016, at 4:57 PM, Reynold Xin <r...@databricks.com> wrote:
>>>>>>> >
>>>>>>> > spark.shuffle.spill actually has nothing to do with whether we
>>>>>>> > write shuffle files to disk. Currently it is not possible to not
>>>>>>> > write shuffle files to disk, and typically it is not a problem
>>>>>>> > because the network fetch throughput is lower than what disks
>>>>>>> > can sustain. In most cases, especially with SSDs, there is little
>>>>>>> > difference between putting all of those in memory and on disk.
>>>>>>> >
>>>>>>> > However, it is becoming more common to run Spark on a small
>>>>>>> > number of beefy nodes (e.g. 2 nodes, each with 1 TB of RAM). We
>>>>>>> > do want to look into improving performance for those. In the
>>>>>>> > meantime, you can set up local ramdisks on each node for shuffle
>>>>>>> > writes.
>>>>>>> >
>>>>>>> >
>>>>>>> > On Fri, Apr 1, 2016 at 11:32 AM, Michael Slavitch <slavi...@gmail.com> wrote:
>>>>>>> >>
>>>>>>> >> Hello;
>>>>>>> >>
>>>>>>> >> I'm working on Spark with very large memory systems (2 TB+) and
>>>>>>> >> notice that Spark spills to disk in shuffle. Is there a way to
>>>>>>> >> force Spark to stay in memory when doing shuffle operations?
>>>>>>> >> The goal is to keep the shuffle data either in the heap or in
>>>>>>> >> off-heap memory (in 1.6.x) and never touch the IO subsystem. I
>>>>>>> >> am willing to have the job fail if it runs out of RAM.
>>>>>>> >>
>>>>>>> >> spark.shuffle.spill true is deprecated in 1.6 and does not work
>>>>>>> >> with the Tungsten sort in 1.5.x:
>>>>>>> >>
>>>>>>> >> "WARN UnsafeShuffleManager: spark.shuffle.spill was set to
>>>>>>> >> false, but this is ignored by the tungsten-sort shuffle manager;
>>>>>>> >> its optimized shuffles will continue to spill to disk when
>>>>>>> >> necessary."
>>>>>>> >>
>>>>>>> >> If this is impossible via configuration changes, what code
>>>>>>> >> changes would be needed to accomplish this?
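To put numbers on the disk-versus-ramdisk question, a toy, scaled-down sketch of the sort-by-key workload described above; the app name, record count, and partition count are illustrative:

    // Toy sketch of the shuffle-heavy pattern under discussion: sortByKey
    // forces a full shuffle of every key/value pair. Run it once with
    // spark.local.dir on disk and once on a ramdisk and compare timings.
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("shuffle-bench")) // master via spark-submit

    val data = sc.parallelize(0L until 100000000L, 512)
      .map(i => (scala.util.Random.nextLong(), i)) // random keys defeat any ordering

    val t0 = System.nanoTime()
    data.sortByKey().count() // triggers the full shuffle
    println(s"sort took ${(System.nanoTime() - t0) / 1e9} s")

If the ramdisk run is not much faster, that points at the other bottlenecks mentioned above (serialization, memory bandwidth, NUMA) rather than the disk itself.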