Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-22 Thread Kevin Markey
I've discovered that one of the anomalies I encountered was due to an 
(embarrassing? humorous?) user error.  See the user list thread "Failed 
RC-10 yarn-cluster job for FS closed error when cleaning up staging 
directory" for my discussion.  With the user error corrected, the FS 
closed exception only prevents deletion of the staging directory, but 
does not affect completion with SUCCESS. The FS closed exception still 
needs some investigation, at least by me.


I tried the patch reported by SPARK-1898, but it didn't fix the problem 
without fixing the user error.  I did not attempt to test my fix without 
the patch, so I can't pass judgment on the patch.


Although this is merely a pseudocluster-based test -- I can't 
reconfigure our cluster with RC-10 -- I'll now change my vote to...


+1.

Thanks all who helped.
Kevin



On 05/21/2014 09:18 PM, Tom Graves wrote:

I don't think Kevin's issue would be with an api change in YarnClientImpl since 
in both cases he says he is using hadoop 2.3.0.  I'll take a look at his post 
in the user list.

Tom




On Wednesday, May 21, 2014 7:01 PM, Colin McCabe cmcc...@alumni.cmu.edu wrote:
  



Hi Kevin,

Can you try https://issues.apache.org/jira/browse/SPARK-1898 to see if it
fixes your issue?

Running in YARN cluster mode, I had a similar issue where Spark was able to
create a Driver and an Executor via YARN, but then it stopped making any
progress.

Note: I was using a pre-release version of CDH5.1.0, not 2.3 like you were
using.

best,
Colin



On Wed, May 21, 2014 at 3:34 PM, Kevin Markey kevin.mar...@oracle.comwrote:


0

Abstaining because I'm not sure if my failures are due to Spark,
configuration, or other factors...

Compiled and deployed RC10 for YARN, Hadoop 2.3 per Spark 1.0.0 Yarn
documentation.  No problems.
Rebuilt applications against RC10 and Hadoop 2.3.0 (plain vanilla Apache
release).
Updated scripts for various applications.
Application had successfully compiled and run against Spark 0.9.1 and
Hadoop 2.3.0.
Ran in yarn-cluster mode.
Application ran to conclusion except that it ultimately failed because of
an exception when Spark tried to clean up the staging directory.  Also,
where before Yarn would report the running program as RUNNING, it only
reported this application as ACCEPTED.  It appeared to run two containers
when the first instance never reported that it was RUNNING.

I will post a separate note to the USER list about the specifics.

Thanks
Kevin Markey



On 05/21/2014 10:58 AM, Mark Hamstra wrote:


+1


On Tue, May 20, 2014 at 11:09 PM, Henry Saputra henry.sapu...@gmail.com
wrote:

   Signature and hash for source looks good

No external executable package with source - good
Compiled with git and maven - good
Ran examples and sample programs locally and standalone -good

+1

- Henry



On Tue, May 20, 2014 at 1:13 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:


Please vote on releasing the following candidate as Apache Spark version


1.0.0!


This has a few bug fixes on top of rc9:
SPARK-1875: https://github.com/apache/spark/pull/824
SPARK-1876: https://github.com/apache/spark/pull/819
SPARK-1878: https://github.com/apache/spark/pull/822
SPARK-1879: https://github.com/apache/spark/pull/823

The tag to be voted on is v1.0.0-rc10 (commit d8070234):

https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d807023479ce10aec28ef3c1ab646ddefc2e663c


The release files, including signatures, digests, etc. can be found at:

http://people.apache.org/~tdas/spark-1.0.0-rc10/

The release artifacts are signed with the following key:
https://people.apache.org/keys/committer/tdas.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1018/

The documentation corresponding to this release can be found at:

http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/

The full list of changes in this release can be found at:

https://git-wip-us.apache.org/repos/asf?p=spark.git;a=blob;f=CHANGES.txt;h=d21f0ace6326e099360975002797eb7cba9d5273;hb=d807023479ce10aec28ef3c1ab646ddefc2e663c


Please vote on releasing this package as Apache Spark 1.0.0!

The vote is open until Friday, May 23, at 20:00 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.0.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== API Changes ==
We welcome users to compile Spark applications against 1.0. There are
a few API changes in this release. Here are links to the associated
upgrade guides - user facing changes have been kept as small as
possible.

Changes to ML vector specification:
http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/mllib-guide.html#from-09-to-10

Changes to the Java API:
http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark


Changes 

Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-22 Thread Kevin Markey

I retested several different cases...

1. FS closed exception shows up ONLY in RC-10, not in Spark 0.9.1, with 
both Hadoop 2.2 and 2.3.

2. SPARK-1898 has no effect for my use cases.
3. The failure to report that the underlying application is RUNNING 
and that it has succeeded is due ONLY to my user error.


The FS closed exception only affects the cleanup of the staging 
directory, not the final success or failure.  I've not yet tested the 
effect of changing my application's initialization, use, or closing of 
FileSystem.


Thanks again.
Kevin

On 05/22/2014 01:32 AM, Kevin Markey wrote:
I've discovered that one of the anomalies I encountered was due to an 
(embarrassing? humorous?) user error.  See the user list thread 
Failed RC-10 yarn-cluster job for FS closed error when cleaning up 
staging directory for my discussion.  With the user error corrected, 
the FS closed exception only prevents deletion of the staging 
directory, but does not affect completion with SUCCESS. The FS 
closed exception still needs some investigation at least by me.


I tried the patch reported by SPARK-1898, but it didn't fix the 
problem without fixing the user error.  I did not attempt to test my 
fix without the patch, so I can't pass judgment on the patch.


Although this is merely a pseudocluster based test -- I can't 
reconfigure our cluster with RC-10 -- I'll now change my vote to...


+1.

Thanks all who helped.
Kevin



On 05/21/2014 09:18 PM, Tom Graves wrote:
I don't think Kevin's issue would be with an api change in 
YarnClientImpl since in both cases he says he is using hadoop 2.3.0.  
I'll take a look at his post in the user list.


Tom




On Wednesday, May 21, 2014 7:01 PM, Colin McCabe 
cmcc...@alumni.cmu.edu wrote:



Hi Kevin,

Can you try https://issues.apache.org/jira/browse/SPARK-1898 to see 
if it

fixes your issue?

Running in YARN cluster mode, I had a similar issue where Spark was 
able to

create a Driver and an Executor via YARN, but then it stopped making any
progress.

Note: I was using a pre-release version of CDH5.1.0, not 2.3 like you 
were

using.

best,
Colin



On Wed, May 21, 2014 at 3:34 PM, Kevin Markey 
kevin.mar...@oracle.comwrote:



0

Abstaining because I'm not sure if my failures are due to Spark,
configuration, or other factors...

Compiled and deployed RC10 for YARN, Hadoop 2.3 per Spark 1.0.0 Yarn
documentation.  No problems.
Rebuilt applications against RC10 and Hadoop 2.3.0 (plain vanilla Apache
release).
Updated scripts for various applications.
Application had successfully compiled and run against Spark 0.9.1 and
Hadoop 2.3.0.
Ran in yarn-cluster mode.
Application ran to conclusion except that it ultimately failed because of
an exception when Spark tried to clean up the staging directory.  Also,
where before Yarn would report the running program as RUNNING, it only
reported this application as ACCEPTED.  It appeared to run two containers
when the first instance never reported that it was RUNNING.

I will post a separate note to the USER list about the specifics.

Thanks
Kevin Markey



On 05/21/2014 10:58 AM, Mark Hamstra wrote:


+1


On Tue, May 20, 2014 at 11:09 PM, Henry Saputra 
henry.sapu...@gmail.com

wrote:

   Signature and hash for source looks good

No external executable package with source - good
Compiled with git and maven - good
Ran examples and sample programs locally and standalone -good

+1

- Henry



On Tue, May 20, 2014 at 1:13 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:

Please vote on releasing the following candidate as Apache Spark version
1.0.0!


This has a few bug fixes on top of rc9:
SPARK-1875: https://github.com/apache/spark/pull/824
SPARK-1876: https://github.com/apache/spark/pull/819
SPARK-1878: https://github.com/apache/spark/pull/822
SPARK-1879: https://github.com/apache/spark/pull/823

The tag to be voted on is v1.0.0-rc10 (commit d8070234):

https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d807023479ce10aec28ef3c1ab646ddefc2e663c


The release files, including signatures, digests, etc. can be found at:

http://people.apache.org/~tdas/spark-1.0.0-rc10/

The release artifacts are signed with the following key:
https://people.apache.org/keys/committer/tdas.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1018/ 



The documentation corresponding to this release can be found at:

http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/

The full list of changes in this release can be found at:

https://git-wip-us.apache.org/repos/asf?p=spark.git;a=blob;f=CHANGES.txt;h=d21f0ace6326e099360975002797eb7cba9d5273;hb=d807023479ce10aec28ef3c1ab646ddefc2e663c


Please vote on releasing this package as Apache Spark 1.0.0!

The vote is open until Friday, May 23, at 20:00 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.0.0
[ ] -1 Do not release this package 

Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-22 Thread Marcelo Vanzin
Hi Kevin,

On Thu, May 22, 2014 at 9:49 AM, Kevin Markey kevin.mar...@oracle.com wrote:
 The FS closed exception only affects the cleanup of the staging directory,
 not the final success or failure.  I've not yet tested the effect of
 changing my application's initialization, use, or closing of FileSystem.

Without going and reading more of the Spark code, if your app is
explicitly close()'ing the FileSystem instance, it may be causing the
exception. If Spark is caching the FileSystem instance, your app is
probably closing that same instance (which it got from the HDFS
library's internal cache).

It would be nice if you could test that theory; it might be worth
knowing that's the case so that we can tell people not to do that.

-- 
Marcelo


Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-22 Thread Colin McCabe
The FileSystem cache is something that has caused a lot of pain over the
years.  Unfortunately we (in Hadoop core) can't change the way it works now
because there are too many users depending on the current behavior.

Basically, the idea is that when you request a FileSystem with certain
options with FileSystem#get, you might get a reference to an FS object that
already exists, from our FS cache singleton.  Unfortunately, this
also means that someone else can change the working directory on you or
close the FS underneath you.  The FS is basically shared mutable state, and
you don't know whom you're sharing with.

It might be better for Spark to call FileSystem#newInstance, which bypasses
the FileSystem cache and always creates a new object.  If Spark can hang on
to the FS for a while, it can get the benefits of caching without the
downsides.  In HDFS, multiple FS instances can also share things like the
socket cache between them.

best,
Colin


On Thu, May 22, 2014 at 10:06 AM, Marcelo Vanzin van...@cloudera.comwrote:

 Hi Kevin,

 On Thu, May 22, 2014 at 9:49 AM, Kevin Markey kevin.mar...@oracle.com
 wrote:
  The FS closed exception only effects the cleanup of the staging
 directory,
  not the final success or failure.  I've not yet tested the effect of
  changing my application's initialization, use, or closing of FileSystem.

 Without going and reading more of the Spark code, if your app is
 explicitly close()'ing the FileSystem instance, it may be causing the
 exception. If Spark is caching the FileSystem instance, your app is
 probably closing that same instance (which it got from the HDFS
 library's internal cache).

 It would be nice if you could test that theory; it might be worth
 knowing that's the case so that we can tell people not to do that.

 --
 Marcelo



Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-22 Thread Aaron Davidson
In Spark 0.9.0 and 0.9.1, we stopped using the FileSystem cache correctly,
and we just recently resumed using it in 1.0 (and in 0.9.2) when this issue
was fixed: https://issues.apache.org/jira/browse/SPARK-1676

Prior to this fix, each Spark task created and cached its own FileSystems
due to a bug in how the FS cache handles UGIs. The big problem that arose
was that these FileSystems were never closed, so they just kept piling up.
There were two solutions we considered, with the following effects: (1)
Share the FS cache among all tasks and (2) Each task effectively gets its
own FS cache, and closes all of its FSes after the task completes.

We chose solution (1) for 3 reasons:
 - It does not rely on the behavior of a bug in HDFS.
 - It is the most performant option.
 - It is most consistent with the semantics of the (albeit broken) FS cache.

Since this behavior was changed in 1.0, it could be considered a
regression. We should consider the exact behavior we want out of the FS
cache. For Spark's purposes, it seems fine to cache FileSystems across
tasks, as Spark does not close FileSystems. The issue that comes up is that
user code which uses FileSystem.get() but then closes the FileSystem can
screw up Spark processes which were using that FileSystem. The workaround
for users would be to use FileSystem.newInstance() if they want full
control over the lifecycle of their FileSystems.


On Thu, May 22, 2014 at 12:06 PM, Colin McCabe cmcc...@alumni.cmu.eduwrote:

 The FileSystem cache is something that has caused a lot of pain over the
 years.  Unfortunately we (in Hadoop core) can't change the way it works now
 because there are too many users depending on the current behavior.

 Basically, the idea is that when you request a FileSystem with certain
 options with FileSystem#get, you might get a reference to an FS object that
 already exists, from our FS cache cache singleton.  Unfortunately, this
 also means that someone else can change the working directory on you or
 close the FS underneath you.  The FS is basically shared mutable state, and
 you don't know whom you're sharing with.

 It might be better for Spark to call FileSystem#newInstance, which bypasses
 the FileSystem cache and always creates a new object.  If Spark can hang on
 to the FS for a while, it can get the benefits of caching without the
 downsides.  In HDFS, multiple FS instances can also share things like the
 socket cache between them.

 best,
 Colin


 On Thu, May 22, 2014 at 10:06 AM, Marcelo Vanzin van...@cloudera.com
 wrote:

  Hi Kevin,
 
  On Thu, May 22, 2014 at 9:49 AM, Kevin Markey kevin.mar...@oracle.com
  wrote:
   The FS closed exception only effects the cleanup of the staging
  directory,
   not the final success or failure.  I've not yet tested the effect of
   changing my application's initialization, use, or closing of
 FileSystem.
 
  Without going and reading more of the Spark code, if your app is
  explicitly close()'ing the FileSystem instance, it may be causing the
  exception. If Spark is caching the FileSystem instance, your app is
  probably closing that same instance (which it got from the HDFS
  library's internal cache).
 
  It would be nice if you could test that theory; it might be worth
  knowing that's the case so that we can tell people not to do that.
 
  --
  Marcelo
 



Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-22 Thread Kevin Markey

Thank you, all!  This is quite helpful.

We have been arguing about how to handle this issue across a growing 
application.  Unfortunately, the Hadoop FileSystem javadoc should say 
all of this, but it doesn't!


Kevin

On 05/22/2014 01:48 PM, Aaron Davidson wrote:

In Spark 0.9.0 and 0.9.1, we stopped using the FileSystem cache correctly,
and we just recently resumed using it in 1.0 (and in 0.9.2) when this issue
was fixed: https://issues.apache.org/jira/browse/SPARK-1676

Prior to this fix, each Spark task created and cached its own FileSystems
due to a bug in how the FS cache handles UGIs. The big problem that arose
was that these FileSystems were never closed, so they just kept piling up.
There were two solutions we considered, with the following effects: (1)
Share the FS cache among all tasks and (2) Each task effectively gets its
own FS cache, and closes all of its FSes after the task completes.

We chose solution (1) for 3 reasons:
  - It does not rely on the behavior of a bug in HDFS.
  - It is the most performant option.
  - It is most consistent with the semantics of the (albeit broken) FS cache.

Since this behavior was changed in 1.0, it could be considered a
regression. We should consider the exact behavior we want out of the FS
cache. For Spark's purposes, it seems fine to cache FileSystems across
tasks, as Spark does not close FileSystems. The issue that comes up is that
user code which uses FileSystem.get() but then closes the FileSystem can
screw up Spark processes which were using that FileSystem. The workaround
for users would be to use FileSystem.newInstance() if they want full
control over the lifecycle of their FileSystems.


On Thu, May 22, 2014 at 12:06 PM, Colin McCabe cmcc...@alumni.cmu.eduwrote:


The FileSystem cache is something that has caused a lot of pain over the
years.  Unfortunately we (in Hadoop core) can't change the way it works now
because there are too many users depending on the current behavior.

Basically, the idea is that when you request a FileSystem with certain
options with FileSystem#get, you might get a reference to an FS object that
already exists, from our FS cache cache singleton.  Unfortunately, this
also means that someone else can change the working directory on you or
close the FS underneath you.  The FS is basically shared mutable state, and
you don't know whom you're sharing with.

It might be better for Spark to call FileSystem#newInstance, which bypasses
the FileSystem cache and always creates a new object.  If Spark can hang on
to the FS for a while, it can get the benefits of caching without the
downsides.  In HDFS, multiple FS instances can also share things like the
socket cache between them.

best,
Colin


On Thu, May 22, 2014 at 10:06 AM, Marcelo Vanzin van...@cloudera.com

wrote:
Hi Kevin,

On Thu, May 22, 2014 at 9:49 AM, Kevin Markey kevin.mar...@oracle.com
wrote:

The FS closed exception only effects the cleanup of the staging

directory,

not the final success or failure.  I've not yet tested the effect of
changing my application's initialization, use, or closing of

FileSystem.

Without going and reading more of the Spark code, if your app is
explicitly close()'ing the FileSystem instance, it may be causing the
exception. If Spark is caching the FileSystem instance, your app is
probably closing that same instance (which it got from the HDFS
library's internal cache).

It would be nice if you could test that theory; it might be worth
knowing that's the case so that we can tell people not to do that.

--
Marcelo





Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-22 Thread Tathagata Das
Hey all,

On further testing, I came across a bug that breaks execution of
pyspark scripts on YARN.
https://issues.apache.org/jira/browse/SPARK-1900
This is a blocker and worth cutting a new RC.

We also found a fix for a known issue that prevents additional jar
files from being specified through spark-submit on YARN.
https://issues.apache.org/jira/browse/SPARK-1870
This has been fixed and will be in the next RC.

We are canceling this vote for now. We will post RC11 shortly. Thanks
everyone for testing!

TD

On Thu, May 22, 2014 at 1:25 PM, Kevin Markey kevin.mar...@oracle.com wrote:
 Thank you, all!  This is quite helpful.

 We have been arguing how to handle this issue across a growing application.
 Unfortunately the Hadoop FileSystem java doc should say all this but
 doesn't!

 Kevin


 On 05/22/2014 01:48 PM, Aaron Davidson wrote:

 In Spark 0.9.0 and 0.9.1, we stopped using the FileSystem cache correctly,
 and we just recently resumed using it in 1.0 (and in 0.9.2) when this
 issue
 was fixed: https://issues.apache.org/jira/browse/SPARK-1676

 Prior to this fix, each Spark task created and cached its own FileSystems
 due to a bug in how the FS cache handles UGIs. The big problem that arose
 was that these FileSystems were never closed, so they just kept piling up.
 There were two solutions we considered, with the following effects: (1)
 Share the FS cache among all tasks and (2) Each task effectively gets its
 own FS cache, and closes all of its FSes after the task completes.

 We chose solution (1) for 3 reasons:
   - It does not rely on the behavior of a bug in HDFS.
   - It is the most performant option.
   - It is most consistent with the semantics of the (albeit broken) FS
 cache.

 Since this behavior was changed in 1.0, it could be considered a
 regression. We should consider the exact behavior we want out of the FS
 cache. For Spark's purposes, it seems fine to cache FileSystems across
 tasks, as Spark does not close FileSystems. The issue that comes up is
 that
 user code which uses FileSystem.get() but then closes the FileSystem can
 screw up Spark processes which were using that FileSystem. The workaround
 for users would be to use FileSystem.newInstance() if they want full
 control over the lifecycle of their FileSystems.


 On Thu, May 22, 2014 at 12:06 PM, Colin McCabe
 cmcc...@alumni.cmu.eduwrote:

 The FileSystem cache is something that has caused a lot of pain over the
 years.  Unfortunately we (in Hadoop core) can't change the way it works
 now
 because there are too many users depending on the current behavior.

 Basically, the idea is that when you request a FileSystem with certain
 options with FileSystem#get, you might get a reference to an FS object
 that
 already exists, from our FS cache cache singleton.  Unfortunately, this
 also means that someone else can change the working directory on you or
 close the FS underneath you.  The FS is basically shared mutable state,
 and
 you don't know whom you're sharing with.

 It might be better for Spark to call FileSystem#newInstance, which
 bypasses
 the FileSystem cache and always creates a new object.  If Spark can hang
 on
 to the FS for a while, it can get the benefits of caching without the
 downsides.  In HDFS, multiple FS instances can also share things like the
 socket cache between them.

 best,
 Colin


 On Thu, May 22, 2014 at 10:06 AM, Marcelo Vanzin van...@cloudera.com

 wrote:
 Hi Kevin,

 On Thu, May 22, 2014 at 9:49 AM, Kevin Markey kevin.mar...@oracle.com
 wrote:

 The FS closed exception only effects the cleanup of the staging

 directory,

 not the final success or failure.  I've not yet tested the effect of
 changing my application's initialization, use, or closing of

 FileSystem.

 Without going and reading more of the Spark code, if your app is
 explicitly close()'ing the FileSystem instance, it may be causing the
 exception. If Spark is caching the FileSystem instance, your app is
 probably closing that same instance (which it got from the HDFS
 library's internal cache).

 It would be nice if you could test that theory; it might be worth
 knowing that's the case so that we can tell people not to do that.

 --
 Marcelo




Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-22 Thread Colin McCabe
On Thu, May 22, 2014 at 12:48 PM, Aaron Davidson ilike...@gmail.com wrote:

 In Spark 0.9.0 and 0.9.1, we stopped using the FileSystem cache correctly,
 and we just recently resumed using it in 1.0 (and in 0.9.2) when this issue
 was fixed: https://issues.apache.org/jira/browse/SPARK-1676


Interesting...


 Prior to this fix, each Spark task created and cached its own FileSystems
 due to a bug in how the FS cache handles UGIs. The big problem that arose
 was that these FileSystems were never closed, so they just kept piling up.
 There were two solutions we considered, with the following effects: (1)
 Share the FS cache among all tasks and (2) Each task effectively gets its
 own FS cache, and closes all of its FSes after the task completes.


Since the FS cache is in hadoop-common-project, it's not so much a bug in
HDFS as a bug in Hadoop.  So even if you're using, say, Lustre, you'll
still get the same issues with org.apache.hadoop.fs.FileSystem and its
global cache.

We chose solution (1) for 3 reasons:
  - It does not rely on the behavior of a bug in HDFS.

 - It is the most performant option.
  - It is most consistent with the semantics of the (albeit broken) FS
 cache.

 Since this behavior was changed in 1.0, it could be considered a
 regression. We should consider the exact behavior we want out of the FS
 cache. For Spark's purposes, it seems fine to cache FileSystems across
 tasks, as Spark does not close FileSystems. The issue that comes up is that
 user code which uses FileSystem.get() but then closes the FileSystem can
 screw up Spark processes which were using that FileSystem. The workaround
 for users would be to use FileSystem.newInstance() if they want full
 control over the lifecycle of their FileSystems.


The current solution seems reasonable, as long as Spark processes:
1. don't change the current working directory (doing so isn't thread-safe
and will affect all other users of that FS object)
2. don't close the FileSystem object

Another solution would be to use newInstance and build your own FS cache,
essentially.  I don't think it would be that much code.  This might be
nicer because you could implement things like closing FileSystem objects
that haven't been used in a while.

cheers,
Colin



 On Thu, May 22, 2014 at 12:06 PM, Colin McCabe cmcc...@alumni.cmu.edu
 wrote:

  The FileSystem cache is something that has caused a lot of pain over the
  years.  Unfortunately we (in Hadoop core) can't change the way it works
 now
  because there are too many users depending on the current behavior.
 
  Basically, the idea is that when you request a FileSystem with certain
  options with FileSystem#get, you might get a reference to an FS object
 that
  already exists, from our FS cache cache singleton.  Unfortunately, this
  also means that someone else can change the working directory on you or
  close the FS underneath you.  The FS is basically shared mutable state,
 and
  you don't know whom you're sharing with.
 
  It might be better for Spark to call FileSystem#newInstance, which
 bypasses
  the FileSystem cache and always creates a new object.  If Spark can hang
 on
  to the FS for a while, it can get the benefits of caching without the
  downsides.  In HDFS, multiple FS instances can also share things like the
  socket cache between them.
 
  best,
  Colin
 
 
  On Thu, May 22, 2014 at 10:06 AM, Marcelo Vanzin van...@cloudera.com
  wrote:
 
   Hi Kevin,
  
   On Thu, May 22, 2014 at 9:49 AM, Kevin Markey kevin.mar...@oracle.com
 
   wrote:
The FS closed exception only effects the cleanup of the staging
   directory,
not the final success or failure.  I've not yet tested the effect of
changing my application's initialization, use, or closing of
  FileSystem.
  
   Without going and reading more of the Spark code, if your app is
   explicitly close()'ing the FileSystem instance, it may be causing the
   exception. If Spark is caching the FileSystem instance, your app is
   probably closing that same instance (which it got from the HDFS
   library's internal cache).
  
   It would be nice if you could test that theory; it might be worth
   knowing that's the case so that we can tell people not to do that.
  
   --
   Marcelo
  
 



Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-22 Thread Henry Saputra
Looks like SPARK-1900 is a blocker for YARN, and we might as well add
SPARK-1870 while we're at it.

TD or Patrick, could you kindly send [CANCEL] prefixed in the subject
email out for the RC10 Vote to help people follow the active VOTE
threads? The VOTE emails are getting a bit hard to follow.


- Henry


On Thu, May 22, 2014 at 2:05 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
 Hey all,

 On further testing, I came across a bug that breaks execution of
 pyspark scripts on YARN.
 https://issues.apache.org/jira/browse/SPARK-1900
 This is a blocker and worth cutting a new RC.

 We also found a fix for a known issue that prevents additional jar
 files from being specified through spark-submit on YARN.
 https://issues.apache.org/jira/browse/SPARK-1870
 This has been fixed and will be in the next RC.

 We are canceling this vote for now. We will post RC11 shortly. Thanks
 everyone for testing!

 TD



Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-22 Thread Tathagata Das
Right! Doing that.

TD

On Thu, May 22, 2014 at 3:07 PM, Henry Saputra henry.sapu...@gmail.com wrote:
 Looks like SPARK-1900 is a blocker for YARN and might as well add
 SPARK-1870 while at it.

 TD or Patrick, could you kindly send [CANCEL] prefixed in the subject
 email out for the RC10 Vote to help people follow the active VOTE
 threads? The VOTE emails are getting a bit hard to follow.


 - Henry


 On Thu, May 22, 2014 at 2:05 PM, Tathagata Das
 tathagata.das1...@gmail.com wrote:
 Hey all,

 On further testing, I came across a bug that breaks execution of
 pyspark scripts on YARN.
 https://issues.apache.org/jira/browse/SPARK-1900
 This is a blocker and worth cutting a new RC.

 We also found a fix for a known issue that prevents additional jar
 files from being specified through spark-submit on YARN.
 https://issues.apache.org/jira/browse/SPARK-1870
 This has been fixed and will be in the next RC.

 We are canceling this vote for now. We will post RC11 shortly. Thanks
 everyone for testing!

 TD



Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-21 Thread Mark Hamstra
+1


On Tue, May 20, 2014 at 11:09 PM, Henry Saputra henry.sapu...@gmail.comwrote:

 Signature and hash for source looks good
 No external executable package with source - good
 Compiled with git and maven - good
 Ran examples and sample programs locally and standalone -good

 +1

 - Henry



 On Tue, May 20, 2014 at 1:13 PM, Tathagata Das
 tathagata.das1...@gmail.com wrote:
  Please vote on releasing the following candidate as Apache Spark version
 1.0.0!
 
  This has a few bug fixes on top of rc9:
  SPARK-1875: https://github.com/apache/spark/pull/824
  SPARK-1876: https://github.com/apache/spark/pull/819
  SPARK-1878: https://github.com/apache/spark/pull/822
  SPARK-1879: https://github.com/apache/spark/pull/823
 
  The tag to be voted on is v1.0.0-rc10 (commit d8070234):
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d807023479ce10aec28ef3c1ab646ddefc2e663c
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~tdas/spark-1.0.0-rc10/
 
  The release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/tdas.asc
 
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1018/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/
 
  The full list of changes in this release can be found at:
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=blob;f=CHANGES.txt;h=d21f0ace6326e099360975002797eb7cba9d5273;hb=d807023479ce10aec28ef3c1ab646ddefc2e663c
 
  Please vote on releasing this package as Apache Spark 1.0.0!
 
  The vote is open until Friday, May 23, at 20:00 UTC and passes if
  a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.0.0
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  == API Changes ==
  We welcome users to compile Spark applications against 1.0. There are
  a few API changes in this release. Here are links to the associated
  upgrade guides - user facing changes have been kept as small as
  possible.
 
  Changes to ML vector specification:
 
 http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/mllib-guide.html#from-09-to-10
 
  Changes to the Java API:
 
 http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
 
  Changes to the streaming API:
 
 http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
 
  Changes to the GraphX API:
 
 http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
 
  Other changes:
  coGroup and related functions now return Iterable[T] instead of Seq[T]
  == Call toSeq on the result to restore the old behavior
 
  SparkContext.jarOfClass returns Option[String] instead of Seq[String]
  == Call toSeq on the result to restore old behavior



Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-21 Thread Kevin Markey

0

Abstaining because I'm not sure if my failures are due to Spark, 
configuration, or other factors...


Compiled and deployed RC10 for YARN, Hadoop 2.3 per Spark 1.0.0 Yarn 
documentation.  No problems.
Rebuilt applications against RC10 and Hadoop 2.3.0 (plain vanilla Apache 
release).

Updated scripts for various applications.
Application had successfully compiled and run against Spark 0.9.1 and 
Hadoop 2.3.0.

Ran in yarn-cluster mode.
Application ran to conclusion except that it ultimately failed because 
of an exception when Spark tried to clean up the staging directory.  
Also, where before Yarn would report the running program as RUNNING, 
it only reported this application as ACCEPTED.  It appeared to run two 
containers when the first instance never reported that it was RUNNING.


I will post a separate note to the USER list about the specifics.

Thanks
Kevin Markey


On 05/21/2014 10:58 AM, Mark Hamstra wrote:

+1


On Tue, May 20, 2014 at 11:09 PM, Henry Saputra henry.sapu...@gmail.comwrote:


Signature and hash for source looks good
No external executable package with source - good
Compiled with git and maven - good
Ran examples and sample programs locally and standalone -good

+1

- Henry



On Tue, May 20, 2014 at 1:13 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:

Please vote on releasing the following candidate as Apache Spark version

1.0.0!

This has a few bug fixes on top of rc9:
SPARK-1875: https://github.com/apache/spark/pull/824
SPARK-1876: https://github.com/apache/spark/pull/819
SPARK-1878: https://github.com/apache/spark/pull/822
SPARK-1879: https://github.com/apache/spark/pull/823

The tag to be voted on is v1.0.0-rc10 (commit d8070234):


https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d807023479ce10aec28ef3c1ab646ddefc2e663c

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~tdas/spark-1.0.0-rc10/

The release artifacts are signed with the following key:
https://people.apache.org/keys/committer/tdas.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1018/

The documentation corresponding to this release can be found at:
http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/

The full list of changes in this release can be found at:


https://git-wip-us.apache.org/repos/asf?p=spark.git;a=blob;f=CHANGES.txt;h=d21f0ace6326e099360975002797eb7cba9d5273;hb=d807023479ce10aec28ef3c1ab646ddefc2e663c

Please vote on releasing this package as Apache Spark 1.0.0!

The vote is open until Friday, May 23, at 20:00 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.0.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== API Changes ==
We welcome users to compile Spark applications against 1.0. There are
a few API changes in this release. Here are links to the associated
upgrade guides - user facing changes have been kept as small as
possible.

Changes to ML vector specification:


http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/mllib-guide.html#from-09-to-10

Changes to the Java API:


http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

Changes to the streaming API:


http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

Changes to the GraphX API:


http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

Other changes:
coGroup and related functions now return Iterable[T] instead of Seq[T]
== Call toSeq on the result to restore the old behavior

SparkContext.jarOfClass returns Option[String] instead of Seq[String]
== Call toSeq on the result to restore old behavior




Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-21 Thread Colin McCabe
Hi Kevin,

Can you try https://issues.apache.org/jira/browse/SPARK-1898 to see if it
fixes your issue?

Running in YARN cluster mode, I had a similar issue where Spark was able to
create a Driver and an Executor via YARN, but then it stopped making any
progress.

Note: I was using a pre-release version of CDH5.1.0, not 2.3 like you were
using.

best,
Colin


On Wed, May 21, 2014 at 3:34 PM, Kevin Markey kevin.mar...@oracle.comwrote:

 0

 Abstaining because I'm not sure if my failures are due to Spark,
 configuration, or other factors...

 Compiled and deployed RC10 for YARN, Hadoop 2.3 per Spark 1.0.0 Yarn
 documentation.  No problems.
 Rebuilt applications against RC10 and Hadoop 2.3.0 (plain vanilla Apache
 release).
 Updated scripts for various applications.
 Application had successfully compiled and run against Spark 0.9.1 and
 Hadoop 2.3.0.
 Ran in yarn-cluster mode.
 Application ran to conclusion except that it ultimately failed because of
 an exception when Spark tried to clean up the staging directory.  Also,
 where before Yarn would report the running program as RUNNING, it only
 reported this application as ACCEPTED.  It appeared to run two containers
 when the first instance never reported that it was RUNNING.

 I will post a separate note to the USER list about the specifics.

 Thanks
 Kevin Markey



 On 05/21/2014 10:58 AM, Mark Hamstra wrote:

 +1


 On Tue, May 20, 2014 at 11:09 PM, Henry Saputra henry.sapu...@gmail.com
 wrote:

  Signature and hash for source looks good
 No external executable package with source - good
 Compiled with git and maven - good
 Ran examples and sample programs locally and standalone -good

 +1

 - Henry



 On Tue, May 20, 2014 at 1:13 PM, Tathagata Das
 tathagata.das1...@gmail.com wrote:

 Please vote on releasing the following candidate as Apache Spark version

 1.0.0!

 This has a few bug fixes on top of rc9:
 SPARK-1875: https://github.com/apache/spark/pull/824
 SPARK-1876: https://github.com/apache/spark/pull/819
 SPARK-1878: https://github.com/apache/spark/pull/822
 SPARK-1879: https://github.com/apache/spark/pull/823

 The tag to be voted on is v1.0.0-rc10 (commit d8070234):

  https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
 d807023479ce10aec28ef3c1ab646ddefc2e663c

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~tdas/spark-1.0.0-rc10/

 The release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/tdas.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1018/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/

 The full list of changes in this release can be found at:

  https://git-wip-us.apache.org/repos/asf?p=spark.git;a=blob;
 f=CHANGES.txt;h=d21f0ace6326e099360975002797eb7cba9d5273;hb=
 d807023479ce10aec28ef3c1ab646ddefc2e663c

 Please vote on releasing this package as Apache Spark 1.0.0!

 The vote is open until Friday, May 23, at 20:00 UTC and passes if
 a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.0.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == API Changes ==
 We welcome users to compile Spark applications against 1.0. There are
 a few API changes in this release. Here are links to the associated
 upgrade guides - user facing changes have been kept as small as
 possible.

 Changes to ML vector specification:

  http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/
 mllib-guide.html#from-09-to-10

 Changes to the Java API:

  http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/
 java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

 Changes to the streaming API:

  http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/
 streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

 Changes to the GraphX API:

  http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/
 graphx-programming-guide.html#upgrade-guide-from-spark-091

 Other changes:
 coGroup and related functions now return Iterable[T] instead of Seq[T]
 == Call toSeq on the result to restore the old behavior

 SparkContext.jarOfClass returns Option[String] instead of Seq[String]
 == Call toSeq on the result to restore old behavior





Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-21 Thread Tom Graves
Has anyone tried pyspark on YARN and gotten it to work?  I was having issues 
when I built Spark on Red Hat, but when I built it on my Mac it worked; now 
when I build it on my Mac it also doesn't work.

Tom




On Tuesday, May 20, 2014 3:14 PM, Tathagata Das tathagata.das1...@gmail.com 
wrote:
 


Please vote on releasing the following candidate as Apache Spark version 1.0.0!

This has a few bug fixes on top of rc9:
SPARK-1875: https://github.com/apache/spark/pull/824
SPARK-1876: https://github.com/apache/spark/pull/819
SPARK-1878: https://github.com/apache/spark/pull/822
SPARK-1879: https://github.com/apache/spark/pull/823

The tag to be voted on is v1.0.0-rc10 (commit d8070234):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d807023479ce10aec28ef3c1ab646ddefc2e663c

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~tdas/spark-1.0.0-rc10/

The release artifacts are signed with the following key:
https://people.apache.org/keys/committer/tdas.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1018/

The documentation corresponding to this release can be found at:
http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/

The full list of changes in this release can be found at:
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=blob;f=CHANGES.txt;h=d21f0ace6326e099360975002797eb7cba9d5273;hb=d807023479ce10aec28ef3c1ab646ddefc2e663c

Please vote on releasing this package as Apache Spark 1.0.0!

The vote is open until Friday, May 23, at 20:00 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.0.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== API Changes ==
We welcome users to compile Spark applications against 1.0. There are
a few API changes in this release. Here are links to the associated
upgrade guides - user facing changes have been kept as small as
possible.

Changes to ML vector specification:
http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/mllib-guide.html#from-09-to-10

Changes to the Java API:
http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

Changes to the streaming API:
http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

Changes to the GraphX API:
http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

Other changes:
coGroup and related functions now return Iterable[T] instead of Seq[T]
== Call toSeq on the result to restore the old behavior

SparkContext.jarOfClass returns Option[String] instead of Seq[String]
== Call toSeq on the result to restore old behavior

Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-21 Thread Tom Graves
I don't think Kevin's issue would be with an api change in YarnClientImpl since 
in both cases he says he is using hadoop 2.3.0.  I'll take a look at his post 
in the user list.

Tom




On Wednesday, May 21, 2014 7:01 PM, Colin McCabe cmcc...@alumni.cmu.edu wrote:
 


Hi Kevin,

Can you try https://issues.apache.org/jira/browse/SPARK-1898 to see if it
fixes your issue?

Running in YARN cluster mode, I had a similar issue where Spark was able to
create a Driver and an Executor via YARN, but then it stopped making any
progress.

Note: I was using a pre-release version of CDH5.1.0, not 2.3 like you were
using.

best,
Colin



On Wed, May 21, 2014 at 3:34 PM, Kevin Markey kevin.mar...@oracle.comwrote:

 0

 Abstaining because I'm not sure if my failures are due to Spark,
 configuration, or other factors...

 Compiled and deployed RC10 for YARN, Hadoop 2.3 per Spark 1.0.0 Yarn
 documentation.  No problems.
 Rebuilt applications against RC10 and Hadoop 2.3.0 (plain vanilla Apache
 release).
 Updated scripts for various applications.
 Application had successfully compiled and run against Spark 0.9.1 and
 Hadoop 2.3.0.
 Ran in yarn-cluster mode.
 Application ran to conclusion except that it ultimately failed because of
 an exception when Spark tried to clean up the staging directory.  Also,
 where before Yarn would report the running program as RUNNING, it only
 reported this application as ACCEPTED.  It appeared to run two containers
 when the first instance never reported that it was RUNNING.

 I will post a separate note to the USER list about the specifics.

 Thanks
 Kevin Markey



 On 05/21/2014 10:58 AM, Mark Hamstra wrote:

 +1


 On Tue, May 20, 2014 at 11:09 PM, Henry Saputra henry.sapu...@gmail.com
 wrote:

  Signature and hash for source looks good
 No external executable package with source - good
 Compiled with git and maven - good
 Ran examples and sample programs locally and standalone -good

 +1

 - Henry



 On Tue, May 20, 2014 at 1:13 PM, Tathagata Das
 tathagata.das1...@gmail.com wrote:

 Please vote on releasing the following candidate as Apache Spark version

 1.0.0!

 This has a few bug fixes on top of rc9:
 SPARK-1875: https://github.com/apache/spark/pull/824
 SPARK-1876: https://github.com/apache/spark/pull/819
 SPARK-1878: https://github.com/apache/spark/pull/822
 SPARK-1879: https://github.com/apache/spark/pull/823

 The tag to be voted on is v1.0.0-rc10 (commit d8070234):

  https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
 d807023479ce10aec28ef3c1ab646ddefc2e663c

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~tdas/spark-1.0.0-rc10/

 The release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/tdas.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1018/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/

 The full list of changes in this release can be found at:

  https://git-wip-us.apache.org/repos/asf?p=spark.git;a=blob;
 f=CHANGES.txt;h=d21f0ace6326e099360975002797eb7cba9d5273;hb=
 d807023479ce10aec28ef3c1ab646ddefc2e663c

 Please vote on releasing this package as Apache Spark 1.0.0!

 The vote is open until Friday, May 23, at 20:00 UTC and passes if
 a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.0.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == API Changes ==
 We welcome users to compile Spark applications against 1.0. There are
 a few API changes in this release. Here are links to the associated
 upgrade guides - user facing changes have been kept as small as
 possible.

 Changes to ML vector specification:

  http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/
 mllib-guide.html#from-09-to-10

 Changes to the Java API:

  http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/
 java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

 Changes to the streaming API:

  http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/
 streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

 Changes to the GraphX API:

  http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/
 graphx-programming-guide.html#upgrade-guide-from-spark-091

 Other changes:
 coGroup and related functions now return Iterable[T] instead of Seq[T]
 == Call toSeq on the result to restore the old behavior

 SparkContext.jarOfClass returns Option[String] instead of Seq[String]
 == Call toSeq on the result to restore old behavior




[VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-20 Thread Tathagata Das
Please vote on releasing the following candidate as Apache Spark version 1.0.0!

This has a few bug fixes on top of rc9:
SPARK-1875: https://github.com/apache/spark/pull/824
SPARK-1876: https://github.com/apache/spark/pull/819
SPARK-1878: https://github.com/apache/spark/pull/822
SPARK-1879: https://github.com/apache/spark/pull/823

The tag to be voted on is v1.0.0-rc10 (commit d8070234):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d807023479ce10aec28ef3c1ab646ddefc2e663c

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~tdas/spark-1.0.0-rc10/

The release artifacts are signed with the following key:
https://people.apache.org/keys/committer/tdas.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1018/

The documentation corresponding to this release can be found at:
http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/

The full list of changes in this release can be found at:
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=blob;f=CHANGES.txt;h=d21f0ace6326e099360975002797eb7cba9d5273;hb=d807023479ce10aec28ef3c1ab646ddefc2e663c

Please vote on releasing this package as Apache Spark 1.0.0!

The vote is open until Friday, May 23, at 20:00 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.0.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== API Changes ==
We welcome users to compile Spark applications against 1.0. There are
a few API changes in this release. Here are links to the associated
upgrade guides - user facing changes have been kept as small as
possible.

Changes to ML vector specification:
http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/mllib-guide.html#from-09-to-10

Changes to the Java API:
http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

Changes to the streaming API:
http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

Changes to the GraphX API:
http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

Other changes:
coGroup and related functions now return Iterable[T] instead of Seq[T]
== Call toSeq on the result to restore the old behavior

SparkContext.jarOfClass returns Option[String] instead of Seq[String]
== Call toSeq on the result to restore old behavior


Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-20 Thread Sandy Ryza
+1


On Tue, May 20, 2014 at 5:26 PM, Andrew Or and...@databricks.com wrote:

 +1


 2014-05-20 13:13 GMT-07:00 Tathagata Das tathagata.das1...@gmail.com:

  Please vote on releasing the following candidate as Apache Spark version
  1.0.0!
 
  This has a few bug fixes on top of rc9:
  SPARK-1875: https://github.com/apache/spark/pull/824
  SPARK-1876: https://github.com/apache/spark/pull/819
  SPARK-1878: https://github.com/apache/spark/pull/822
  SPARK-1879: https://github.com/apache/spark/pull/823
 
  The tag to be voted on is v1.0.0-rc10 (commit d8070234):
 
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d807023479ce10aec28ef3c1ab646ddefc2e663c
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~tdas/spark-1.0.0-rc10/
 
  The release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/tdas.asc
 
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1018/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/
 
  The full list of changes in this release can be found at:
 
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=blob;f=CHANGES.txt;h=d21f0ace6326e099360975002797eb7cba9d5273;hb=d807023479ce10aec28ef3c1ab646ddefc2e663c
 
  Please vote on releasing this package as Apache Spark 1.0.0!
 
  The vote is open until Friday, May 23, at 20:00 UTC and passes if
  a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.0.0
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  == API Changes ==
  We welcome users to compile Spark applications against 1.0. There are
  a few API changes in this release. Here are links to the associated
  upgrade guides - user facing changes have been kept as small as
  possible.
 
  Changes to ML vector specification:
 
 
 http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/mllib-guide.html#from-09-to-10
 
  Changes to the Java API:
 
 
 http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
 
  Changes to the streaming API:
 
 
 http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
 
  Changes to the GraphX API:
 
 
 http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
 
  Other changes:
  coGroup and related functions now return Iterable[T] instead of Seq[T]
  == Call toSeq on the result to restore the old behavior
 
  SparkContext.jarOfClass returns Option[String] instead of Seq[String]
  == Call toSeq on the result to restore old behavior
 



Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-20 Thread Marcelo Vanzin
+1 (non-binding)

I have:
- checked signatures and checksums of the files
- built the code from the git repo using both sbt and mvn (against hadoop 2.3.0)
- ran a few simple jobs in local, yarn-client and yarn-cluster mode

Haven't explicitly tested any of the recent fixes, streaming, or SQL.


On Tue, May 20, 2014 at 1:13 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
 Please vote on releasing the following candidate as Apache Spark version 
 1.0.0!

 This has a few bug fixes on top of rc9:
 SPARK-1875: https://github.com/apache/spark/pull/824
 SPARK-1876: https://github.com/apache/spark/pull/819
 SPARK-1878: https://github.com/apache/spark/pull/822
 SPARK-1879: https://github.com/apache/spark/pull/823

 The tag to be voted on is v1.0.0-rc10 (commit d8070234):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d807023479ce10aec28ef3c1ab646ddefc2e663c

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~tdas/spark-1.0.0-rc10/

 The release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/tdas.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1018/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/

 The full list of changes in this release can be found at:
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=blob;f=CHANGES.txt;h=d21f0ace6326e099360975002797eb7cba9d5273;hb=d807023479ce10aec28ef3c1ab646ddefc2e663c

 Please vote on releasing this package as Apache Spark 1.0.0!

 The vote is open until Friday, May 23, at 20:00 UTC and passes if
 a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.0.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == API Changes ==
 We welcome users to compile Spark applications against 1.0. There are
 a few API changes in this release. Here are links to the associated
 upgrade guides - user facing changes have been kept as small as
 possible.

 Changes to ML vector specification:
 http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/mllib-guide.html#from-09-to-10

 Changes to the Java API:
 http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

 Changes to the streaming API:
 http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

 Changes to the GraphX API:
 http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

 Other changes:
 coGroup and related functions now return Iterable[T] instead of Seq[T]
 == Call toSeq on the result to restore the old behavior

 SparkContext.jarOfClass returns Option[String] instead of Seq[String]
 == Call toSeq on the result to restore old behavior



-- 
Marcelo


Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-20 Thread Matei Zaharia
+1

Tested it on both Windows and Mac OS X, with both Scala and Python. Confirmed 
that the issues in the previous RC were fixed.

Matei

On May 20, 2014, at 5:28 PM, Marcelo Vanzin van...@cloudera.com wrote:

 +1 (non-binding)
 
 I have:
 - checked signatures and checksums of the files
 - built the code from the git repo using both sbt and mvn (against hadoop 
 2.3.0)
 - ran a few simple jobs in local, yarn-client and yarn-cluster mode
 
 Haven't explicitly tested any of the recent fixes, streaming nor sql.
 
 
 On Tue, May 20, 2014 at 1:13 PM, Tathagata Das
 tathagata.das1...@gmail.com wrote:
 Please vote on releasing the following candidate as Apache Spark version 
 1.0.0!
 
 This has a few bug fixes on top of rc9:
 SPARK-1875: https://github.com/apache/spark/pull/824
 SPARK-1876: https://github.com/apache/spark/pull/819
 SPARK-1878: https://github.com/apache/spark/pull/822
 SPARK-1879: https://github.com/apache/spark/pull/823
 
 The tag to be voted on is v1.0.0-rc10 (commit d8070234):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d807023479ce10aec28ef3c1ab646ddefc2e663c
 
 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~tdas/spark-1.0.0-rc10/
 
 The release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/tdas.asc
 
 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1018/
 
 The documentation corresponding to this release can be found at:
 http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/
 
 The full list of changes in this release can be found at:
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=blob;f=CHANGES.txt;h=d21f0ace6326e099360975002797eb7cba9d5273;hb=d807023479ce10aec28ef3c1ab646ddefc2e663c
 
 Please vote on releasing this package as Apache Spark 1.0.0!
 
 The vote is open until Friday, May 23, at 20:00 UTC and passes if
 a majority of at least 3 +1 PMC votes are cast.
 
 [ ] +1 Release this package as Apache Spark 1.0.0
 [ ] -1 Do not release this package because ...
 
 To learn more about Apache Spark, please see
 http://spark.apache.org/
 
 == API Changes ==
 We welcome users to compile Spark applications against 1.0. There are
 a few API changes in this release. Here are links to the associated
 upgrade guides - user facing changes have been kept as small as
 possible.
 
 Changes to ML vector specification:
 http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/mllib-guide.html#from-09-to-10
 
 Changes to the Java API:
 http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
 
 Changes to the streaming API:
 http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
 
 Changes to the GraphX API:
 http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
 
 Other changes:
 coGroup and related functions now return Iterable[T] instead of Seq[T]
 == Call toSeq on the result to restore the old behavior
 
 SparkContext.jarOfClass returns Option[String] instead of Seq[String]
 == Call toSeq on the result to restore old behavior
 
 
 
 -- 
 Marcelo