Hm while this is an attractive idea in theory, in practice I think you are
substantially overestimating HDFS' ability to handle a lot of small,
ephemeral files. It has never really been optimized for that use case.

On Thu, Apr 28, 2016 at 11:15 AM, Michael Gummelt <mgumm...@mesosphere.io>
wrote:

> > if after a work-load burst your cluster dynamically changes from 10000
> workers to 1000, will the typical HDFS replication factor be sufficient to
> retain access to the shuffle files in HDFS
>
> HDFS isn't resizing.  Spark is.  HDFS files should be HA and durable.
>
> On Thu, Apr 28, 2016 at 11:08 AM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
>> Yes, replicated and distributed shuffle materializations are key
>> requirement to maintain performance in a fully elastic cluster where
>> Executors aren't just reallocated across an essentially fixed number of
>> Worker nodes, but rather the number of Workers itself is dynamic.
>> Retaining the file interface to those shuffle materializations while also
>> using HDFS for the spark.local.dirs has a certain amount of attraction, but
>> I also wonder whether a typical HDFS deployment is really sufficient to
>> handle this kind of elastic cluster scaling.  For instance and assuming
>> HDFS co-located on worker nodes, if after a work-load burst your cluster
>> dynamically changes from 10000 workers to 1000, will the typical HDFS
>> replication factor be sufficient to retain access to the shuffle files in
>> HDFS, or will we instead be seeing numerous FetchFailure exceptions, Tasks
>> recomputed or Stages aborted, etc. so that the net effect is not all that
>> much different than if the shuffle files had not been relocated to HDFS and
>> the Executors or ShuffleService instances had just disappeared along with
>> the worker nodes?
>>
>> On Thu, Apr 28, 2016 at 10:46 AM, Michael Gummelt <mgumm...@mesosphere.io
>> > wrote:
>>
>>> > Why would you run the shuffle service on 10K nodes but Spark executors
>>> on just 100 nodes? wouldn't you also run that service just on the 100
>>> nodes?
>>>
>>> We have to start the service beforehand, out of band, and we don't know
>>> a priori where the Spark executors will land.  Those 100 executors could
>>> land on any of the 10K nodes.
>>>
>>> > What does plumbing it through HDFS buy you in comparison?
>>>
>>> It drops the shuffle service requirement, which is HUGE.  It means Spark
>>> can completely vacate the machine when it's not in use, which is crucial
>>> for a large, multi-tenant cluster.  ShuffledRDDs can now read the map files
>>> from HDFS, rather than the ancestor executors, which means we can shut
>>> executors down immediately after the shuffle files are written.
>>>
>>> > There's some additional overhead and if anything you lose some control
>>> over locality, in a context where I presume HDFS itself is storing data on
>>> much more than the 100 Spark nodes.
>>>
>>> Write locality would be sacrificed, but the descendent executors were
>>> already doing a remote read (they have to read from multiple ancestor
>>> executors), so there's no additional cost in read locality.  In fact, if we
>>> take advantage of HDFS's favored node feature, we could make it likely that
>>> all map files for a given partition land on the same node, so the
>>> descendent executor would never have to do a remote read!  We'd effectively
>>> shift the remote IO from read side to write side, for theoretically no
>>> change in performance.
>>>
>>> In summary:
>>>
>>> Advantages:
>>> - No shuffle service dependency (increased utilization, decreased
>>> management cost)
>>> - Shut executors down immediately after shuffle files are written,
>>> rather than waiting for a timeout (increased utilization)
>>> - HDFS is HA, so shuffle files survive a node failure, which isn't true
>>> for the shuffle service (decreased latency during failures)
>>> - Potential ability to parallelize shuffle file reads if we write a new
>>> shuffle iterator (decreased latency)
>>>
>>> Disadvantages
>>> - Increased write latency (but potentially not if we implement it
>>> efficiently.  See above).
>>> - Would need some sort of GC on HDFS shuffle files
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Apr 28, 2016 at 1:36 AM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> Why would you run the shuffle service on 10K nodes but Spark executors
>>>> on just 100 nodes? wouldn't you also run that service just on the 100
>>>> nodes?
>>>>
>>>> What does plumbing it through HDFS buy you in comparison? There's some
>>>> additional overhead and if anything you lose some control over
>>>> locality, in a context where I presume HDFS itself is storing data on
>>>> much more than the 100 Spark nodes.
>>>>
>>>> On Thu, Apr 28, 2016 at 1:34 AM, Michael Gummelt <
>>>> mgumm...@mesosphere.io> wrote:
>>>> >> Are you suggesting to have shuffle service persist and fetch data
>>>> with
>>>> >> hdfs, or skip shuffle service altogether and just write to hdfs?
>>>> >
>>>> > Skip shuffle service altogether.  Write to HDFS.
>>>> >
>>>> > Mesos environments tend to be multi-tenant, and running the shuffle
>>>> service
>>>> > on all nodes could be extremely wasteful.  If you're running a 10K
>>>> node
>>>> > cluster, and you'd like to run a Spark job that consumes 100 nodes,
>>>> you
>>>> > would have to run the shuffle service on all 10K nodes out of band of
>>>> Spark
>>>> > (e.g. marathon).  I'd like a solution for dynamic allocation that
>>>> doesn't
>>>> > require this overhead.
>>>> >
>>>> > I'll look at SPARK-1529.
>>>> >
>>>> > On Wed, Apr 27, 2016 at 10:24 AM, Steve Loughran <
>>>> ste...@hortonworks.com>
>>>> > wrote:
>>>> >>
>>>> >>
>>>> >> > On 27 Apr 2016, at 04:59, Takeshi Yamamuro <linguin....@gmail.com>
>>>> >> > wrote:
>>>> >> >
>>>> >> > Hi, all
>>>> >> >
>>>> >> > See SPARK-1529 for related discussion.
>>>> >> >
>>>> >> > // maropu
>>>> >>
>>>> >>
>>>> >> I'd not seen that discussion.
>>>> >>
>>>> >> I'm actually curious about why the 15% diff in performance between
>>>> Java
>>>> >> NIO and Hadoop FS APIs, and, if it is the case (Hadoop still uses the
>>>> >> pre-NIO libraries, *has anyone thought of just fixing Hadoop Local FS
>>>> >> codepath?*
>>>> >>
>>>> >> It's not like anyone hasn't filed JIRAs on that ... it's just that
>>>> nothing
>>>> >> has ever got to a state where it was considered ready to adopt, where
>>>> >> "ready" means: passes all unit and load tests against Linux, Unix,
>>>> Windows
>>>> >> filesystems. There's been some attempts, but they never quite got
>>>> much
>>>> >> engagement or support, especially as nio wasn't there properly until
>>>> Java 7,
>>>> >> —and Hadoop was stuck on java 6 support until 2015. That's no longer
>>>> a
>>>> >> constraint: someone could do the work, using the existing JIRAs as
>>>> starting
>>>> >> points.
>>>> >>
>>>> >>
>>>> >> If someone did do this in RawLocalFS, it'd be nice if the patch also
>>>> >> allowed you to turn off CRC creation and checking.
>>>> >>
>>>> >> That's not only part of the overhead, it means that flush() doesn't,
>>>> not
>>>> >> until you reach the end of a CRC32 block ... so breaking what few
>>>> durability
>>>> >> guarantees POSIX offers.
>>>> >>
>>>> >>
>>>> >>
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Michael Gummelt
>>>> > Software Engineer
>>>> > Mesosphere
>>>>
>>>
>>>
>>>
>>> --
>>> Michael Gummelt
>>> Software Engineer
>>> Mesosphere
>>>
>>
>>
>
>
> --
> Michael Gummelt
> Software Engineer
> Mesosphere
>

Reply via email to