Yea, it's an open question.  I'm willing to create some benchmarks, but I'd
first like to know that the feature would be accepted assuming the results
are reasonable.  Can a committer give me a thumbs up?

On Thu, Apr 28, 2016 at 11:17 AM, Reynold Xin <r...@databricks.com> wrote:

> Hm while this is an attractive idea in theory, in practice I think you are
> substantially overestimating HDFS' ability to handle a lot of small,
> ephemeral files. It has never really been optimized for that use case.
>
> On Thu, Apr 28, 2016 at 11:15 AM, Michael Gummelt <mgumm...@mesosphere.io>
> wrote:
>
>> > if after a work-load burst your cluster dynamically changes from 10000
>> workers to 1000, will the typical HDFS replication factor be sufficient to
>> retain access to the shuffle files in HDFS
>>
>> HDFS isn't resizing.  Spark is.  HDFS files should be HA and durable.
>>
>> On Thu, Apr 28, 2016 at 11:08 AM, Mark Hamstra <m...@clearstorydata.com>
>> wrote:
>>
>>> Yes, replicated and distributed shuffle materializations are a key
>>> requirement for maintaining performance in a fully elastic cluster, where
>>> Executors aren't just reallocated across an essentially fixed number of
>>> Worker nodes, but rather the number of Workers itself is dynamic.
>>> Retaining the file interface to those shuffle materializations while also
>>> using HDFS for spark.local.dir has a certain amount of attraction, but I
>>> also wonder whether a typical HDFS deployment is really sufficient to
>>> handle this kind of elastic cluster scaling.  For instance, assuming HDFS
>>> is co-located on the worker nodes: if, after a work-load burst, your
>>> cluster dynamically shrinks from 10000 workers to 1000, will the typical
>>> HDFS replication factor be sufficient to retain access to the shuffle
>>> files in HDFS?  Or will we instead see numerous FetchFailure exceptions,
>>> recomputed Tasks, aborted Stages, etc., so that the net effect is not all
>>> that much different than if the shuffle files had not been relocated to
>>> HDFS and the Executors or ShuffleService instances had just disappeared
>>> along with the worker nodes?
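>>>
>>> (Rough back-of-the-envelope, assuming the default replication factor of 3
>>> and roughly uniform block placement: if 9000 of 10000 datanodes vanish
>>> without decommissioning, the chance that all three replicas of a given
>>> block were on removed nodes is about (9000/10000) * (8999/9999) *
>>> (8998/9998) ≈ 0.73, i.e. roughly three quarters of the shuffle blocks
>>> would be unreadable until re-replication catches up.)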
>>>
>>> On Thu, Apr 28, 2016 at 10:46 AM, Michael Gummelt <
>>> mgumm...@mesosphere.io> wrote:
>>>
>>>> > Why would you run the shuffle service on 10K nodes but Spark executors
>>>> on just 100 nodes? wouldn't you also run that service just on the 100
>>>> nodes?
>>>>
>>>> We have to start the service beforehand, out of band, and we don't know
>>>> a priori where the Spark executors will land.  Those 100 executors could
>>>> land on any of the 10K nodes.
>>>>
>>>> > What does plumbing it through HDFS buy you in comparison?
>>>>
>>>> It drops the shuffle service requirement, which is HUGE.  It means
>>>> Spark can completely vacate the machine when it's not in use, which is
>>>> crucial for a large, multi-tenant cluster.  ShuffledRDDs can now read the
>>>> map files from HDFS, rather than the ancestor executors, which means we can
>>>> shut executors down immediately after the shuffle files are written.
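>>>>
>>>> To make that concrete, here's a rough sketch of the idea (this is not an
>>>> existing Spark code path; the HDFS layout and helper names below are made
>>>> up for illustration).  The map side writes its output for (shuffleId,
>>>> mapId) to a well-known HDFS location instead of the executor's local
>>>> dirs, and the reduce side reads it back from there, with no shuffle
>>>> service in the middle:
>>>>
>>>>   import org.apache.hadoop.conf.Configuration
>>>>   import org.apache.hadoop.fs.{FileSystem, Path}
>>>>
>>>>   val fs = FileSystem.get(new Configuration())  // assumes fs.defaultFS points at HDFS
>>>>
>>>>   // Map task: persist the sorted output for (shuffleId, mapId) to HDFS.
>>>>   def writeMapOutput(appId: String, shuffleId: Int, mapId: Int,
>>>>                      bytes: Array[Byte]): Unit = {
>>>>     val path = new Path(s"/spark-shuffle/$appId/$shuffleId/map_$mapId.data")
>>>>     val out = fs.create(path)
>>>>     out.write(bytes)
>>>>     out.close()
>>>>   }
>>>>
>>>>   // Reduce task: read a map output straight from HDFS.
>>>>   def readMapOutput(appId: String, shuffleId: Int, mapId: Int): Array[Byte] = {
>>>>     val path = new Path(s"/spark-shuffle/$appId/$shuffleId/map_$mapId.data")
>>>>     val buf = new Array[Byte](fs.getFileStatus(path).getLen.toInt)
>>>>     val in = fs.open(path)
>>>>     in.readFully(buf)
>>>>     in.close()
>>>>     buf
>>>>   }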
>>>>
>>>> > There's some additional overhead and if anything you lose some
>>>> control over locality, in a context where I presume HDFS itself is storing
>>>> data on much more than the 100 Spark nodes.
>>>>
>>>> Write locality would be sacrificed, but the descendant executors were
>>>> already doing a remote read (they have to read from multiple ancestor
>>>> executors), so there's no additional cost in read locality.  In fact, if
>>>> we take advantage of HDFS's favored-nodes feature, we could make it
>>>> likely that all map files for a given partition land on the same node,
>>>> so the descendant executor would never have to do a remote read!  We'd
>>>> effectively shift the remote IO from the read side to the write side,
>>>> for theoretically no change in performance.
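>>>>
>>>> The hook for that already exists on the HDFS client side.  A minimal
>>>> sketch, assuming the create() overload on DistributedFileSystem that
>>>> accepts an array of favored nodes (the variant HBase uses for region
>>>> placement; the exact signature can differ between Hadoop versions, and
>>>> the host/port and path below are placeholders):
>>>>
>>>>   import java.net.InetSocketAddress
>>>>   import org.apache.hadoop.conf.Configuration
>>>>   import org.apache.hadoop.fs.{FileSystem, Path}
>>>>   import org.apache.hadoop.fs.permission.FsPermission
>>>>   import org.apache.hadoop.hdfs.DistributedFileSystem
>>>>
>>>>   // Assumes fs.defaultFS is an HDFS cluster, so the cast is safe.
>>>>   val dfs = FileSystem.get(new Configuration()).asInstanceOf[DistributedFileSystem]
>>>>
>>>>   // Hint to the namenode: place replicas of this map file on the node
>>>>   // where we expect the descendant (reduce-side) executor to run.
>>>>   val favored = Array(new InetSocketAddress("reducer-host.example.com", 50010))
>>>>   val out = dfs.create(
>>>>     new Path("/spark-shuffle/app_1/0/map_0.data"),
>>>>     FsPermission.getFileDefault,
>>>>     true,                        // overwrite
>>>>     4096,                        // buffer size
>>>>     dfs.getDefaultReplication,   // replication factor
>>>>     dfs.getDefaultBlockSize,     // block size
>>>>     null,                        // no Progressable
>>>>     favored)
>>>>   out.close()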
>>>>
>>>> In summary:
>>>>
>>>> Advantages:
>>>> - No shuffle service dependency (increased utilization, decreased
>>>> management cost)
>>>> - Shut executors down immediately after shuffle files are written,
>>>> rather than waiting for a timeout (increased utilization)
>>>> - HDFS is HA, so shuffle files survive a node failure, which isn't true
>>>> for the shuffle service (decreased latency during failures)
>>>> - Potential ability to parallelize shuffle file reads if we write a new
>>>> shuffle iterator (decreased latency)
>>>>
>>>> Disadvantages:
>>>> - Increased write latency (though potentially not, if we implement it
>>>> efficiently; see above)
>>>> - Would need some sort of GC on HDFS shuffle files (see the sketch below)
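>>>>
>>>> For that last point, a TTL-based sweep is probably enough to start with.
>>>> A minimal sketch against the plain FileSystem API (the base path, layout,
>>>> and TTL are made up; a real implementation would more likely key off
>>>> application-completion events than directory modification times):
>>>>
>>>>   import org.apache.hadoop.conf.Configuration
>>>>   import org.apache.hadoop.fs.{FileSystem, Path}
>>>>
>>>>   // Delete per-application shuffle directories that haven't been modified
>>>>   // within the TTL, assuming a layout like /spark-shuffle/<appId>/...
>>>>   def cleanShuffleDirs(fs: FileSystem, base: Path, ttlMs: Long): Unit = {
>>>>     val cutoff = System.currentTimeMillis() - ttlMs
>>>>     for (status <- fs.listStatus(base)
>>>>          if status.isDirectory && status.getModificationTime < cutoff) {
>>>>       fs.delete(status.getPath, true)  // recursively drop the app's shuffle files
>>>>     }
>>>>   }
>>>>
>>>>   val fs = FileSystem.get(new Configuration())
>>>>   cleanShuffleDirs(fs, new Path("/spark-shuffle"), 24L * 60 * 60 * 1000)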
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Apr 28, 2016 at 1:36 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>
>>>>> Why would you run the shuffle service on 10K nodes but Spark executors
>>>>> on just 100 nodes? wouldn't you also run that service just on the 100
>>>>> nodes?
>>>>>
>>>>> What does plumbing it through HDFS buy you in comparison? There's some
>>>>> additional overhead and if anything you lose some control over
>>>>> locality, in a context where I presume HDFS itself is storing data on
>>>>> much more than the 100 Spark nodes.
>>>>>
>>>>> On Thu, Apr 28, 2016 at 1:34 AM, Michael Gummelt <
>>>>> mgumm...@mesosphere.io> wrote:
>>>>> >> Are you suggesting to have shuffle service persist and fetch data with
>>>>> >> hdfs, or skip shuffle service altogether and just write to hdfs?
>>>>> >
>>>>> > Skip shuffle service altogether.  Write to HDFS.
>>>>> >
>>>>> > Mesos environments tend to be multi-tenant, and running the shuffle
>>>>> > service on all nodes could be extremely wasteful.  If you're running a
>>>>> > 10K node cluster, and you'd like to run a Spark job that consumes 100
>>>>> > nodes, you would have to run the shuffle service on all 10K nodes out
>>>>> > of band of Spark (e.g. via Marathon).  I'd like a solution for dynamic
>>>>> > allocation that doesn't require this overhead.
>>>>> >
>>>>> > I'll look at SPARK-1529.
>>>>> >
>>>>> > On Wed, Apr 27, 2016 at 10:24 AM, Steve Loughran <ste...@hortonworks.com>
>>>>> > wrote:
>>>>> >>
>>>>> >>
>>>>> >> > On 27 Apr 2016, at 04:59, Takeshi Yamamuro <linguin....@gmail.com>
>>>>> >> > wrote:
>>>>> >> >
>>>>> >> > Hi, all
>>>>> >> >
>>>>> >> > See SPARK-1529 for related discussion.
>>>>> >> >
>>>>> >> > // maropu
>>>>> >>
>>>>> >>
>>>>> >> I'd not seen that discussion.
>>>>> >>
>>>>> >> I'm actually curious about why there's a 15% difference in performance
>>>>> >> between the Java NIO and Hadoop FS APIs, and, if that is the case
>>>>> >> (Hadoop still uses the pre-NIO libraries), *has anyone thought of just
>>>>> >> fixing the Hadoop local FS codepath?*
>>>>> >>
>>>>> >> It's not as if nobody has filed JIRAs on that ... it's just that
>>>>> >> nothing has ever got to a state where it was considered ready to
>>>>> >> adopt, where "ready" means: passes all unit and load tests against
>>>>> >> Linux, Unix, and Windows filesystems.  There have been some attempts,
>>>>> >> but they never quite got much engagement or support, especially as NIO
>>>>> >> wasn't there properly until Java 7, and Hadoop was stuck on Java 6
>>>>> >> support until 2015.  That's no longer a constraint: someone could do
>>>>> >> the work, using the existing JIRAs as starting points.
>>>>> >>
>>>>> >>
>>>>> >> If someone did do this in RawLocalFS, it'd be nice if the patch also
>>>>> >> allowed you to turn off CRC creation and checking.
>>>>> >>
>>>>> >> That's not only part of the overhead, it also means that flush()
>>>>> >> doesn't actually flush until you reach the end of a CRC32 block ...
>>>>> >> so breaking what few durability guarantees POSIX offers.
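>>>>> >>
>>>>> >> For anyone poking at this, a minimal sketch of sidestepping the
>>>>> >> checksum layer from the client side, using the standard FileSystem
>>>>> >> API: LocalFileSystem is the CRC-wrapped flavour, getRawFileSystem()
>>>>> >> drops the wrapper entirely, and the set*Checksum calls just disable
>>>>> >> the .crc bookkeeping on the wrapped one (the path is a placeholder):
>>>>> >>
>>>>> >>   import org.apache.hadoop.conf.Configuration
>>>>> >>   import org.apache.hadoop.fs.{FileSystem, Path}
>>>>> >>
>>>>> >>   val conf = new Configuration()
>>>>> >>   val localFs = FileSystem.getLocal(conf)   // checksummed wrapper over RawLocalFileSystem
>>>>> >>   localFs.setWriteChecksum(false)           // don't create .crc files
>>>>> >>   localFs.setVerifyChecksum(false)          // don't verify them on read
>>>>> >>
>>>>> >>   // Or bypass the checksum layer altogether:
>>>>> >>   val rawFs = localFs.getRawFileSystem
>>>>> >>   val out = rawFs.create(new Path("/tmp/shuffle-test.bin"))
>>>>> >>   out.writeInt(42)
>>>>> >>   out.flush()   // no CRC32 block boundary to wait for on the raw stream
>>>>> >>   out.close()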
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Michael Gummelt
>>>>> > Software Engineer
>>>>> > Mesosphere
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Michael Gummelt
>>>> Software Engineer
>>>> Mesosphere
>>>>
>>>
>>>
>>
>>
>> --
>> Michael Gummelt
>> Software Engineer
>> Mesosphere
>>
>
>


-- 
Michael Gummelt
Software Engineer
Mesosphere
