Thanks Subru for initiating the thread about GPU support.
I think the path of taking 2.9 as a base for 2.10 and adding new resource
types into it is quite reasonable.
That way we can combine the stabilization effort on 2.9 with GPU support.

Arun, upgrading Java is probably a separate topic.
We should discuss it on a separate followup thread if we agree to add GPU
support into 2.10.

Andrew, we actually ran a small 3.0 cluster to experiment with TensorFlow
on YARN with GPU resources. It worked well, hence the interest.
However, given the breadth (and the quantity) of our use cases, it is
infeasible to jump directly to 3.0, as Jonathan explained.
A transitional stage such as 2.10 will be required, and probably the same
holds for many other big-cluster folks.
It would be great if people who run different Hadoop versions <= 2.8 could
converge on a 2.10 bridge release to help cross over to 3.
GPU support would be a serious catalyst for us to move forward, which I
have also heard from other organizations interested in ML.

Thanks,
--Konstantin

On Tue, Feb 27, 2018 at 1:28 PM, Andrew Wang <andrew.w...@cloudera.com>
wrote:

> Hi Arun/Subru,
>
> Bumping the minimum Java version is a major change, and incompatible for
> users who are unable to upgrade their JVM version. We're beyond the EOL for
> Java 7, but as we know from our experience with Java 6, there are plenty of
> users who stick on old Java versions. Bumping the Java version also makes
> backports more difficult, and we're still maintaining a number of older 2.x
> releases. I think this is too big for a minor release, particularly when we
> have 3.x as an option that fully supports Java 8.
>
> What's the rationale for bumping it here?
>
> I'm also curious if there are known issues with 3.x that we can fix to make
> 3.x upgrades smoother. I would prefer improving the upgrade experience to
> backporting major features to 2.x since 3.x is meant to be the delivery
> vehicle for new features beyond the ones named here.
>
> Best,
> Andrew
>
> On Tue, Feb 27, 2018 at 11:01 AM, Arun Suresh <asur...@apache.org> wrote:
>
> > Hello folks
> >
> > We also think this bridging release opens up an opportunity to bump the
> > java version in branch-2 to java 8.
> > Would really love to hear thoughts on that.
> >
> > Cheers
> > -Arun/Subru
> >
> >
> > On Mon, Feb 26, 2018 at 5:18 PM, Jonathan Hung <jyhung2...@gmail.com>
> > wrote:
> >
> > > Hi Subru,
> > >
> > > Thanks for starting the discussion.
> > >
> > > We (LinkedIn) have an immediate need for resource types and native GPU
> > > support. Given we are running 2.7 on our main clusters, we decided to
> > > avoid deploying hadoop 3.x on our machine learning clusters (and having to
> > > support two very different hadoop versions). Since for us there is
> > > considerable risk and work involved in upgrading to hadoop 3, I think
> > > having a branch-2.10 bridge release for porting important hadoop 3
> > > features to branch-2 is a good idea.
> > >
> > > Thanks,
> > >
> > >
> > > Jonathan Hung
> > >
> > > On Mon, Feb 26, 2018 at 2:37 PM, Subru Krishnan <su...@apache.org> wrote:
> > >
> > > > Folks,
> > > >
> > > > We (i.e. Microsoft) have started stabilization of 2.9 for our production
> > > > deployment. During planning, we realized that we need to backport 3.x
> > > > features to support GPUs (and more resource types like network IO)
> > > > natively as part of the upgrade. We'd like to share that work with the
> > > > community.
> > > >
> > > > Instead of stabilizing the base release and cherry-picking fixes back to
> > > > Apache, we want to work publicly and push fixes directly into
> > > > trunk/.../branch-2 for a stable 2.10.0 release. Our goal is to create a
> > > > bridge release for our production clusters to the 3.x series and to
> > > > address scalability problems in large clusters (N*10k nodes). As we find
> > > > issues, we will file JIRAs and track resolution of significant
> > > > regressions/faults in wiki. Moreover, LinkedIn also has committed plans
> > > > for a production deployment of the same branch. We welcome broad
> > > > participation, particularly since we'll be stabilizing relatively new
> > > > features.
> > > >
> > > > The exact list of features we would like to backport in YARN are:
> > > >
> > > >    - Support for Resource types [1][2]
> > > >    - Native support for GPUs[3]
> > > >    - Absolute Resource configuration in CapacityScheduler [4]
> > > >
> > > >
> > > > With regards to HDFS, we are currently looking at mainly fixes to Router
> > > > based Federation and Windows specific fixes which should anyways flow
> > > > normally.
> > > >
> > > > Thoughts?
> > > >
> > > > Thanks,
> > > > Subru/Arun
> > > >
> > > > [1] https://www.mail-archive.com/yarn-dev@hadoop.apache.org/msg27786.html
> > > > [2] https://www.mail-archive.com/yarn-dev@hadoop.apache.org/msg28281.html
> > > > [3] https://issues.apache.org/jira/browse/YARN-6223
> > > > [4] https://www.mail-archive.com/yarn-dev@hadoop.apache.org/msg28772.html
> > > >
> > >
> >
>
