Yeah, it's an open question. I'm willing to create some benchmarks, but I'd first like to know that the feature would be accepted, assuming the results are reasonable. Can a committer give me a thumbs up?
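A shuffle-heavy job along the following lines could serve as a starting point for such a benchmark (rough sketch only; the record, key, and partition counts are placeholders, and the class name is made up):

    import org.apache.spark.{SparkConf, SparkContext}

    // Rough benchmark sketch, not a definitive harness: generate key/value
    // pairs and force a wide shuffle, then report wall-clock time.
    object ShuffleBench {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("shuffle-bench"))

        val numRecords    = 100000000L // placeholder
        val numKeys       = 1000000    // placeholder
        val numPartitions = 1000       // placeholder

        val pairs = sc.parallelize(0L until numRecords, numPartitions)
          .map(i => (i % numKeys, i))

        val start = System.nanoTime()
        // groupByKey forces all map output to be written and then fetched.
        val numGroups = pairs.groupByKey(numPartitions).count()
        val seconds = (System.nanoTime() - start) / 1e9

        println(s"Shuffled $numRecords records into $numGroups keys in $seconds s")
        sc.stop()
      }
    }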
On Thu, Apr 28, 2016 at 11:17 AM, Reynold Xin <r...@databricks.com> wrote:

> Hm, while this is an attractive idea in theory, in practice I think you are substantially overestimating HDFS' ability to handle a lot of small, ephemeral files. It has never really been optimized for that use case.
>
> On Thu, Apr 28, 2016 at 11:15 AM, Michael Gummelt <mgumm...@mesosphere.io> wrote:
>
>> > if after a work-load burst your cluster dynamically changes from 10000 workers to 1000, will the typical HDFS replication factor be sufficient to retain access to the shuffle files in HDFS
>>
>> HDFS isn't resizing. Spark is. HDFS files should be HA and durable.
>>
>> On Thu, Apr 28, 2016 at 11:08 AM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>
>>> Yes, replicated and distributed shuffle materializations are a key requirement for maintaining performance in a fully elastic cluster, where Executors aren't just reallocated across an essentially fixed number of Worker nodes but the number of Workers itself is dynamic. Retaining the file interface to those shuffle materializations while also using HDFS for the spark.local.dirs has a certain amount of attraction, but I also wonder whether a typical HDFS deployment is really sufficient to handle this kind of elastic cluster scaling. For instance, assuming HDFS is co-located on the worker nodes: if after a work-load burst your cluster dynamically changes from 10000 workers to 1000, will the typical HDFS replication factor be sufficient to retain access to the shuffle files in HDFS, or will we instead see numerous FetchFailure exceptions, Tasks recomputed or Stages aborted, etc., so that the net effect is not all that much different than if the shuffle files had not been relocated to HDFS and the Executors or ShuffleService instances had just disappeared along with the worker nodes?
>>>
>>> On Thu, Apr 28, 2016 at 10:46 AM, Michael Gummelt <mgumm...@mesosphere.io> wrote:
>>>
>>>> > Why would you run the shuffle service on 10K nodes but Spark executors on just 100 nodes? Wouldn't you also run that service just on the 100 nodes?
>>>>
>>>> We have to start the service beforehand, out of band, and we don't know a priori where the Spark executors will land. Those 100 executors could land on any of the 10K nodes.
>>>>
>>>> > What does plumbing it through HDFS buy you in comparison?
>>>>
>>>> It drops the shuffle service requirement, which is HUGE. It means Spark can completely vacate the machine when it's not in use, which is crucial for a large, multi-tenant cluster. ShuffledRDDs can now read the map files from HDFS rather than from the ancestor executors, which means we can shut executors down immediately after the shuffle files are written.
>>>>
>>>> > There's some additional overhead and if anything you lose some control over locality, in a context where I presume HDFS itself is storing data on much more than the 100 Spark nodes.
>>>>
>>>> Write locality would be sacrificed, but the descendant executors were already doing a remote read (they have to read from multiple ancestor executors), so there's no additional cost in read locality. In fact, if we take advantage of HDFS's favored-node feature, we could make it likely that all map files for a given partition land on the same node, so the descendant executor would never have to do a remote read! We'd effectively shift the remote IO from the read side to the write side, for theoretically no change in performance.
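(An aside on the favored-node idea: if I'm remembering the HDFS client API right, DistributedFileSystem has a create() overload that accepts an array of favored nodes, which HBase uses for block placement. A minimal sketch of how a map task might pin its output near the host a reducer is expected to run on could look like the following; the writeMapOutput helper is hypothetical and the datanode port is a placeholder, nothing like this exists in Spark today.)

    import java.net.InetSocketAddress

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.fs.permission.FsPermission
    import org.apache.hadoop.hdfs.DistributedFileSystem

    // Hypothetical helper: write one map-output file and ask HDFS to place
    // its blocks on the node where we expect this partition's reducer to run.
    def writeMapOutput(fs: DistributedFileSystem,
                       path: Path,
                       bytes: Array[Byte],
                       favoredHost: String): Unit = {
      // 50010 is the default datanode transfer port; treat it as a placeholder.
      val favoredNodes = Array(new InetSocketAddress(favoredHost, 50010))
      val out = fs.create(
        path,
        FsPermission.getFileDefault(),
        true,                                           // overwrite
        fs.getConf.getInt("io.file.buffer.size", 4096), // buffer size
        fs.getDefaultReplication(path),                 // replication factor
        fs.getDefaultBlockSize(path),                   // block size
        null,                                           // no Progressable
        favoredNodes)
      try { out.write(bytes) } finally { out.close() }
    }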
>>>>
>>>> In summary:
>>>>
>>>> Advantages:
>>>> - No shuffle service dependency (increased utilization, decreased management cost)
>>>> - Shut executors down immediately after shuffle files are written, rather than waiting for a timeout (increased utilization)
>>>> - HDFS is HA, so shuffle files survive a node failure, which isn't true for the shuffle service (decreased latency during failures)
>>>> - Potential ability to parallelize shuffle file reads if we write a new shuffle iterator (decreased latency)
>>>>
>>>> Disadvantages:
>>>> - Increased write latency (though potentially not, if we implement it efficiently; see above)
>>>> - Would need some sort of GC on HDFS shuffle files
>>>>
>>>> On Thu, Apr 28, 2016 at 1:36 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>
>>>>> Why would you run the shuffle service on 10K nodes but Spark executors on just 100 nodes? Wouldn't you also run that service just on the 100 nodes?
>>>>>
>>>>> What does plumbing it through HDFS buy you in comparison? There's some additional overhead, and if anything you lose some control over locality, in a context where I presume HDFS itself is storing data on much more than the 100 Spark nodes.
>>>>>
>>>>> On Thu, Apr 28, 2016 at 1:34 AM, Michael Gummelt <mgumm...@mesosphere.io> wrote:
>>>>> >> Are you suggesting to have the shuffle service persist and fetch data with HDFS, or skip the shuffle service altogether and just write to HDFS?
>>>>> >
>>>>> > Skip the shuffle service altogether. Write to HDFS.
>>>>> >
>>>>> > Mesos environments tend to be multi-tenant, and running the shuffle service on all nodes could be extremely wasteful. If you're running a 10K-node cluster and you'd like to run a Spark job that consumes 100 nodes, you would have to run the shuffle service on all 10K nodes out of band of Spark (e.g. via Marathon). I'd like a solution for dynamic allocation that doesn't require this overhead.
>>>>> >
>>>>> > I'll look at SPARK-1529.
>>>>> >
>>>>> > On Wed, Apr 27, 2016 at 10:24 AM, Steve Loughran <ste...@hortonworks.com> wrote:
>>>>> >>
>>>>> >> > On 27 Apr 2016, at 04:59, Takeshi Yamamuro <linguin....@gmail.com> wrote:
>>>>> >> >
>>>>> >> > Hi, all
>>>>> >> >
>>>>> >> > See SPARK-1529 for related discussion.
>>>>> >> >
>>>>> >> > // maropu
>>>>> >>
>>>>> >> I'd not seen that discussion.
>>>>> >>
>>>>> >> I'm actually curious about why there's a 15% difference in performance between the Java NIO and Hadoop FS APIs and, if that is the case (Hadoop still uses the pre-NIO libraries), *has anyone thought of just fixing the Hadoop local FS codepath?*
>>>>> >>
>>>>> >> It's not as if JIRAs haven't been filed on that ... it's just that nothing has ever got to a state where it was considered ready to adopt, where "ready" means: passes all unit and load tests against Linux, Unix, and Windows filesystems. There have been some attempts, but they never quite got much engagement or support, especially as NIO wasn't there properly until Java 7, and Hadoop was stuck on Java 6 support until 2015. That's no longer a constraint: someone could do the work, using the existing JIRAs as starting points.
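(An aside on that 15% number: with the usual caveats about page cache and JIT warm-up, a throwaway comparison of the two local read paths could look roughly like this sketch; the input is whatever large local file is handy.)

    import java.nio.file.{Files, Paths}

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Throwaway comparison sketch: read the same local file once through
    // java.nio and once through Hadoop's (checksummed) LocalFileSystem.
    object LocalReadCompare {
      def time[T](label: String)(body: => T): T = {
        val start = System.nanoTime()
        val result = body
        println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
        result
      }

      def main(args: Array[String]): Unit = {
        val file = args(0) // e.g. a few-hundred-MB local file

        time("java.nio Files.readAllBytes") {
          Files.readAllBytes(Paths.get(file))
        }

        val fs = FileSystem.getLocal(new Configuration()) // ChecksumFileSystem wrapper
        time("hadoop LocalFileSystem read") {
          val in = fs.open(new Path(file))
          val buf = new Array[Byte](64 * 1024)
          try { while (in.read(buf) != -1) {} } finally { in.close() }
        }
      }
    }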
>>>>> >> If someone did do this in RawLocalFS, it'd be nice if the patch also allowed you to turn off CRC creation and checking.
>>>>> >>
>>>>> >> That's not only part of the overhead; it also means that flush() doesn't actually flush until you reach the end of a CRC32 block ... so breaking what few durability guarantees POSIX offers.

--
Michael Gummelt
Software Engineer
Mesosphere
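P.S. On the CRC point: today those knobs live on the ChecksumFileSystem wrapper rather than in RawLocalFS itself, and Spark's shuffle writes don't go through these wrappers, so the following is only an illustration of what the Hadoop client API already exposes (assuming I have the method names right):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Sketch: two existing ways to avoid the .crc side files on the local FS.
    object NoCrcExample {
      def main(args: Array[String]): Unit = {
        val local = FileSystem.getLocal(new Configuration()) // LocalFileSystem wraps RawLocalFileSystem

        // 1) Keep the wrapper but switch off checksum creation and verification.
        local.setWriteChecksum(false)
        local.setVerifyChecksum(false)

        // 2) Or bypass the wrapper and talk to RawLocalFileSystem directly.
        val raw = local.getRawFileSystem

        val out = raw.create(new Path("/tmp/no-crc-example"), true)
        out.writeBytes("no .crc side file for this one\n")
        out.close()
      }
    }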