Re: Utilize newer hadoop releases WAS: [VOTE] Release Apache Spark 1.0.2 (RC1)

Patrick Wendell Sun, 27 Jul 2014 12:20:14 -0700

Hey Ted,

We always intend Spark to work with the newer Hadoop versions and
encourage Spark users to use the newest Hadoop versions for best
performance.


We do try to be liberal in terms of supporting older versions as well.
This is because many people run older HDFS versions and we want Spark
to read and write data from them. So far we've been willing to do this
despite some maintenance cost.

For many users a wholesale HDFS upgrade is very expensive, whereas
trying out a new version of Spark is much easier. For instance, some
of the largest-scale Spark users run fairly old or forked HDFS
versions.
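
On the question in the quoted message below about passing the Hadoop
release version to individual classes, here is a rough, self-contained
sketch of what a version gate could look like. It is not how Spark
handles this today. The names HadoopVersionGate, atLeast and
hasConfigurationFix are made up for illustration, and the only fact it
encodes is the statement in this thread that HADOOP-10456 is fixed in
2.4.1. In a real build the version string would presumably come from
org.apache.hadoop.util.VersionInfo.getVersion(); here it is passed in
as a parameter so the example runs without Hadoop on the classpath.

```java
// Hypothetical helper, not part of Spark or Hadoop.
final class HadoopVersionGate {
    // Assumption taken from this thread: HADOOP-10456 is fixed in 2.4.1.
    static final int[] FIXED_IN = {2, 4, 1};

    /**
     * Parses the leading "major.minor.patch" numbers of a version string.
     * Anything after the first '-' (e.g. "-SNAPSHOT") is ignored, and
     * missing components default to 0. A non-numeric component makes
     * Integer.parseInt throw NumberFormatException.
     */
    static int[] parse(String version) {
        String core = version.split("-", 2)[0];
        String[] parts = core.split("\\.");
        int[] out = new int[3];
        for (int i = 0; i < 3 && i < parts.length; i++) {
            out[i] = Integer.parseInt(parts[i]);
        }
        return out;
    }

    /** Returns true if version >= min, comparing major, minor, patch in order. */
    static boolean atLeast(String version, int[] min) {
        int[] v = parse(version);
        for (int i = 0; i < 3; i++) {
            if (v[i] != min[i]) {
                return v[i] > min[i];
            }
        }
        return true;
    }

    /** True if the reported version is 2.4.1 or newer. */
    static boolean hasConfigurationFix(String hadoopVersion) {
        return atLeast(hadoopVersion, FIXED_IN);
    }

    public static void main(String[] args) {
        // Sample version strings, chosen only for the demonstration.
        for (String v : new String[] {"1.0.4", "2.4.0", "2.4.1", "2.5.0-SNAPSHOT"}) {
            System.out.println(v + " -> " + hasConfigurationFix(v));
        }
    }
}
```

A check like this is fragile for the forked HDFS versions mentioned
above, where the reported version string may not reflect which fixes
were actually backported, so keeping the lock unconditionally remains
the conservative choice.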

- Patrick

On Sun, Jul 27, 2014 at 12:01 PM, Ted Yu <[email protected]> wrote:
> Thanks for replying, Patrick.
>
> The intention of my first email was to utilize newer Hadoop releases for
> their bug fixes. I am still looking for a clean way of passing the Hadoop
> release version number to individual classes.
> Using newer Hadoop releases would encourage pushing bug fixes / new
> features upstream. Ultimately Spark code would become cleaner.
>
> Cheers
>
> On Sun, Jul 27, 2014 at 8:52 AM, Patrick Wendell <[email protected]> wrote:
>
>> Ted - technically I think you are correct, although I wouldn't
>> recommend disabling this lock. This lock is not expensive (acquired
>> once per task, as are many other locks already). Also, we've seen some
>> cases where Hadoop concurrency bugs ended up requiring multiple fixes
>> - concurrency of client access is not well tested in the Hadoop
>> codebase since most of the Hadoop tools do not use concurrent access.
>> So in general it's good to be conservative in what we expect of the
>> Hadoop client libraries.
>>
>> If you'd like to discuss this further, please fork a new thread, since
>> this is a vote thread. Thanks!
>>
>> On Fri, Jul 25, 2014 at 10:14 PM, Ted Yu <[email protected]> wrote:
>> > HADOOP-10456 is fixed in hadoop 2.4.1
>> >
>> > Does this mean that synchronization
>> > on HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK can be bypassed for hadoop
>> > 2.4.1?
>> >
>> > Cheers
>> >
>> >
>> > On Fri, Jul 25, 2014 at 6:00 PM, Patrick Wendell <[email protected]> wrote:
>> >
>> >> The most important issue in this release is actually an amendment to
>> >> an earlier fix. The original fix caused a deadlock which was a
>> >> regression from 1.0.0->1.0.1:
>> >>
>> >> Issue:
>> >> https://issues.apache.org/jira/browse/SPARK-1097
>> >>
>> >> 1.0.1 Fix:
>> >> https://github.com/apache/spark/pull/1273/files (had a deadlock)
>> >>
>> >> 1.0.2 Fix:
>> >> https://github.com/apache/spark/pull/1409/files
>> >>
>> >> I failed to correctly label this on JIRA, but I've updated it!
>> >>
>> >> On Fri, Jul 25, 2014 at 5:35 PM, Michael Armbrust
>> >> <[email protected]> wrote:
>> >> > That query is looking at "Fix Version" not "Target Version".  The fact
>> >> > that the first one is still open is only because the bug is not
>> >> > resolved in master.  It is fixed in 1.0.2.  The second one is
>> >> > partially fixed in 1.0.2, but is not worth blocking the release for.
>> >> >
>> >> >
>> >> > On Fri, Jul 25, 2014 at 4:23 PM, Nicholas Chammas <[email protected]> wrote:
>> >> >
>> >> >> TD, there are a couple of unresolved issues slated for 1.0.2
>> >> >> <https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%201.0.2%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC>.
>> >> >> Should they be edited somehow?
>> >> >>
>> >> >>
>> >> >> On Fri, Jul 25, 2014 at 7:08 PM, Tathagata Das <[email protected]> wrote:
>> >> >>
>> >> >> > Please vote on releasing the following candidate as Apache Spark
>> >> >> > version 1.0.2.
>> >> >> >
>> >> >> > This release fixes a number of bugs in Spark 1.0.1.
>> >> >> > Some of the notable ones are:
>> >> >> > - SPARK-2452: Known issue in Spark 1.0.1 caused by attempted fix
>> >> >> > for SPARK-1199. The fix was reverted for 1.0.2.
>> >> >> > - SPARK-2576: NoClassDefFoundError when executing Spark QL query on
>> >> >> > HDFS CSV file.
>> >> >> > The full list is at http://s.apache.org/9NJ
>> >> >> >
>> >> >> > The tag to be voted on is v1.0.2-rc1 (commit 8fb6f00e):
>> >> >> > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=8fb6f00e195fb258f3f70f04756e07c259a2351f
>> >> >> >
>> >> >> > The release files, including signatures, digests, etc. can be found
>> >> >> > at:
>> >> >> > http://people.apache.org/~tdas/spark-1.0.2-rc1/
>> >> >> >
>> >> >> > Release artifacts are signed with the following key:
>> >> >> > https://people.apache.org/keys/committer/tdas.asc
>> >> >> >
>> >> >> > The staging repository for this release can be found at:
>> >> >> > https://repository.apache.org/content/repositories/orgapachespark-1024/
>> >> >> >
>> >> >> > The documentation corresponding to this release can be found at:
>> >> >> > http://people.apache.org/~tdas/spark-1.0.2-rc1-docs/
>> >> >> >
>> >> >> > Please vote on releasing this package as Apache Spark 1.0.2!
>> >> >> >
>> >> >> > The vote is open until Tuesday, July 29, at 23:00 UTC and passes if
>> >> >> > a majority of at least 3 +1 PMC votes are cast.
>> >> >> > [ ] +1 Release this package as Apache Spark 1.0.2
>> >> >> > [ ] -1 Do not release this package because ...
>> >> >> >
>> >> >> > To learn more about Apache Spark, please see
>> >> >> > http://spark.apache.org/
>> >> >> >
>> >> >>
>> >>
>>
