Hey all,

On further testing, I came across a bug that breaks execution of
pyspark scripts on YARN.
https://issues.apache.org/jira/browse/SPARK-1900
This is a blocker and worth cutting a new RC.

We also found a fix for a known issue that prevents additional jar
files from being specified through spark-submit on YARN.
https://issues.apache.org/jira/browse/SPARK-1870
This has been fixed and will be in the next RC.

We are canceling this vote for now. We will post RC11 shortly. Thanks
everyone for testing!

TD

On Thu, May 22, 2014 at 1:25 PM, Kevin Markey <kevin.mar...@oracle.com> wrote:
> Thank you, all!  This is quite helpful.
>
> We have been arguing how to handle this issue across a growing application.
> Unfortunately the Hadoop FileSystem java doc should say all this but
> doesn't!
>
> Kevin
>
>
> On 05/22/2014 01:48 PM, Aaron Davidson wrote:
>>
>> In Spark 0.9.0 and 0.9.1 the FileSystem cache was not being used correctly,
>> and we just recently resumed using it correctly in 1.0 (and in 0.9.2) when
>> this issue was fixed: https://issues.apache.org/jira/browse/SPARK-1676
>>
>> Prior to this fix, each Spark task created and cached its own FileSystems
>> due to a bug in how the FS cache handles UGIs. The big problem that arose
>> was that these FileSystems were never closed, so they just kept piling up.
>> There were two solutions we considered, with the following effects: (1)
>> Share the FS cache among all tasks and (2) Each task effectively gets its
>> own FS cache, and closes all of its FSes after the task completes.
>>
>> We chose solution (1) for 3 reasons:
>>   - It does not rely on the behavior of a bug in HDFS.
>>   - It is the most performant option.
>>   - It is most consistent with the semantics of the (albeit broken) FS
>> cache.
>>
>> Since this behavior was changed in 1.0, it could be considered a
>> regression. We should consider the exact behavior we want out of the FS
>> cache. For Spark's purposes, it seems fine to cache FileSystems across
>> tasks, as Spark does not close FileSystems. The issue that comes up is
>> that
>> user code which uses FileSystem.get() but then closes the FileSystem can
>> screw up Spark processes which were using that FileSystem. The workaround
>> for users would be to use FileSystem.newInstance() if they want full
>> control over the lifecycle of their FileSystems.
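>>
>> Something along these lines would do it (rough Scala sketch, not taken
>> from any Spark code):
>>
>>   import org.apache.hadoop.conf.Configuration
>>   import org.apache.hadoop.fs.{FileSystem, Path}
>>
>>   val conf = new Configuration()
>>
>>   // Uncached instance: the caller owns its lifecycle, so closing it does
>>   // not affect the cached FileSystem that Spark itself is using.
>>   val fs = FileSystem.newInstance(conf)
>>   try {
>>     fs.exists(new Path("/tmp/example"))
>>   } finally {
>>     fs.close()
>>   }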
>>
>>
>> On Thu, May 22, 2014 at 12:06 PM, Colin McCabe
>> <cmcc...@alumni.cmu.edu> wrote:
>>
>>> The FileSystem cache is something that has caused a lot of pain over the
>>> years.  Unfortunately we (in Hadoop core) can't change the way it works
>>> now
>>> because there are too many users depending on the current behavior.
>>>
>>> Basically, the idea is that when you request a FileSystem with certain
>>> options with FileSystem#get, you might get a reference to an FS object
>>> that
>>> already exists, from our FS cache singleton.  Unfortunately, this
>>> also means that someone else can change the working directory on you or
>>> close the FS underneath you.  The FS is basically shared mutable state,
>>> and
>>> you don't know whom you're sharing with.
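>>>
>>> To make the sharing concrete (rough sketch):
>>>
>>>   import org.apache.hadoop.conf.Configuration
>>>   import org.apache.hadoop.fs.FileSystem
>>>
>>>   val conf = new Configuration()
>>>   val fs1 = FileSystem.get(conf)
>>>   val fs2 = FileSystem.get(conf)
>>>   assert(fs1 eq fs2)  // same object handed out by the FS cache
>>>
>>>   fs2.close()         // closes the shared instance; on HDFS, any later
>>>                       // use of fs1 typically fails with "Filesystem closed"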
>>>
>>> It might be better for Spark to call FileSystem#newInstance, which
>>> bypasses
>>> the FileSystem cache and always creates a new object.  If Spark can hang
>>> on
>>> to the FS for a while, it can get the benefits of caching without the
>>> downsides.  In HDFS, multiple FS instances can also share things like the
>>> socket cache between them.
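>>>
>>> For example, a long-lived holder along these lines (hypothetical sketch,
>>> names made up):
>>>
>>>   import org.apache.hadoop.conf.Configuration
>>>   import org.apache.hadoop.fs.FileSystem
>>>
>>>   object FsHolder {
>>>     // One uncached instance, created lazily, reused for the lifetime of
>>>     // the process, and closed exactly once at shutdown.
>>>     lazy val fs: FileSystem = FileSystem.newInstance(new Configuration())
>>>     def shutdown(): Unit = fs.close()
>>>   }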
>>>
>>> best,
>>> Colin
>>>
>>>
>>> On Thu, May 22, 2014 at 10:06 AM, Marcelo Vanzin <van...@cloudera.com>
>>> wrote:
>>>> Hi Kevin,
>>>>
>>>> On Thu, May 22, 2014 at 9:49 AM, Kevin Markey <kevin.mar...@oracle.com>
>>>> wrote:
>>>>> The FS closed exception only affects the cleanup of the staging
>>>>> directory, not the final success or failure.  I've not yet tested the
>>>>> effect of changing my application's initialization, use, or closing of
>>>>> FileSystem.
>>>>
>>>> Without going and reading more of the Spark code, if your app is
>>>> explicitly close()'ing the FileSystem instance, it may be causing the
>>>> exception. If Spark is caching the FileSystem instance, your app is
>>>> probably closing that same instance (which it got from the HDFS
>>>> library's internal cache).
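>>>>
>>>> The pattern I have in mind is roughly this (sketch, not taken from your
>>>> code):
>>>>
>>>>   import org.apache.hadoop.conf.Configuration
>>>>   import org.apache.hadoop.fs.FileSystem
>>>>
>>>>   val conf = new Configuration()   // or the SparkContext's Hadoop conf
>>>>   val fs = FileSystem.get(conf)    // returns the cached, shared instance
>>>>   // ... application code uses fs ...
>>>>   fs.close()                       // also closes the instance Spark holds,
>>>>                                    // so the later staging-dir cleanup can
>>>>                                    // fail with "Filesystem closed"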
>>>>
>>>> It would be nice if you could test that theory; it might be worth
>>>> knowing that's the case so that we can tell people not to do that.
>>>>
>>>> --
>>>> Marcelo
>>>>
>
