Re: Announcing Spark 1.0.1

2014-07-11 Thread Henry Saputra
Congrats to the Spark community!

On Friday, July 11, 2014, Patrick Wendell  wrote:

> I am happy to announce the availability of Spark 1.0.1! This release
> includes contributions from 70 developers. Spark 1.0.1 includes fixes
> across several areas of Spark, including the core API, PySpark, and
> MLlib. It also includes new features in Spark's (alpha) SQL library,
> including support for JSON data, as well as performance and stability fixes.
>
> Visit the release notes[1] to read about this release or download[2]
> the release today.
>
> [1] http://spark.apache.org/releases/spark-release-1-0-1.html
> [2] http://spark.apache.org/downloads.html
>


Announcing Spark 1.0.1

2014-07-11 Thread Patrick Wendell
I am happy to announce the availability of Spark 1.0.1! This release
includes contributions from 70 developers. Spark 1.0.1 includes fixes
across several areas of Spark, including the core API, PySpark, and
MLlib. It also includes new features in Spark's (alpha) SQL library,
including support for JSON data, as well as performance and stability fixes.
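For example, with the new JSON support you can load a file of JSON records
directly from PySpark and query it with SQL. A rough, untested sketch (method
names are from memory of the 1.0.1 SQL guide and should be treated as
assumptions rather than a reference):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
# Load one-JSON-object-per-line records; the schema is inferred automatically.
people = sqlContext.jsonFile("examples/src/main/resources/people.json")
people.registerAsTable("people")
teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
print teenagers.collect()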

Visit the release notes[1] to read about this release or download[2]
the release today.

[1] http://spark.apache.org/releases/spark-release-1-0-1.html
[2] http://spark.apache.org/downloads.html


Re: Calling Scala/Java methods which operates on RDD

2014-07-11 Thread Kan Zhang
Hi Jai,

Your suspicion is correct. In general, Python RDDs are pickled into byte
arrays and stored in Java land as RDDs of byte arrays. union/zip operate
on the byte arrays directly without deserializing them. Currently, Python
byte arrays only get unpickled into Java objects in special cases, like
SQL functions or saving to Sequence Files (upcoming).
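
For the "pass RDD data files around" alternative you mentioned, a minimal
sketch could look like this (untested; the HDFS path is illustrative and
testFuncFromStrings is a hypothetical Scala helper that parses an RDD[String]
itself):

# Write the Python RDD out as plain text so the JVM side can read it natively.
data = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9])
data.saveAsTextFile("hdfs:///tmp/flukebox-ints")

# Re-read it as a JVM-side RDD[String] via the JavaSparkContext and hand it to
# the hypothetical Scala helper, which then does its own parsing; no pickled
# byte arrays are involved.
jvm_strings = sc._jsc.textFile("hdfs:///tmp/flukebox-ints")
result = sc._jvm.flukebox.test.TestClass.testFuncFromStrings(jvm_strings.rdd())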

Hope it helps.

Kan


On Fri, Jul 11, 2014 at 5:04 AM, Jai Kumar Singh 
wrote:

> Hi,
>   I want to write some common utility functions in Scala and call them from
> the Java/Python Spark APIs (maybe with some wrapper code around the Scala
> calls). Calling Scala functions from Java works fine. I was reading the
> pyspark RDD code and found that pyspark is able to call JavaRDD functions
> like union/zip for pyspark RDDs, deserialize the output, and everything
> works fine. But somehow I am not able to get a really simple example
> working. I think I am missing some serialization/deserialization step.
>
> Can someone confirm whether it is even possible to do so? Or would it be much
> easier to pass RDD data files around instead of passing RDDs directly (from
> pyspark to java/scala)?
>
> For example, the code below just adds 1 to each element of an RDD containing
> integers.
>
> package flukebox.test
>
> import org.apache.spark.rdd.RDD
>
> object TestClass {
>   // Adds 1 to every element of an RDD of Ints.
>   def testFunc(data: RDD[Int]) = {
>     data.map(x => x + 1)
>   }
> }
>
> Calling from python,
>
> from pyspark import RDD
>
> from py4j.java_gateway import java_import
>
> java_import(sc._gateway.jvm, "flukebox.test")
>
>
> data = sc.parallelize([1,2,3,4,5,6,7,8,9])
>
> sc._jvm.flukebox.test.TestClass.testFunc(data._jrdd.rdd())
>
>
> *This fails because testFunc gets an RDD of byte arrays.*
>
>
> Any help/pointer would be highly appreciated.
>
>
> Thanks & Regards,
>
> Jai K Singh
>


Re: How pySpark works?

2014-07-11 Thread Reynold Xin
Also take a look at this:
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals


On Fri, Jul 11, 2014 at 10:29 AM, Andrew Or  wrote:

> Hi Egor,
>
> Here are a few answers to your questions:
>
> 1) Python needs to be installed on all machines, but not pyspark. The way
> the executors get the pyspark code depends on which cluster manager you
> use. In standalone mode, your executors need to have the actual python
> files in their working directory. In yarn mode, python files are included
> in the assembly jar, which is then shipped to your executor containers
> through a distributed cache.
>
> 2) Pyspark is just a thin wrapper around Spark. When you write a closure in
> python, it is shipped to the executors within the task itself the same way
> scala closures are shipped. If you use a special library, then all of the
> nodes will need to have that library pre-installed.
>
> 3) Are you trying to run your C++ code inside the "map" function? If so,
> you need to make sure the compiled code is present in the working directory
> on all the executors beforehand for Python to "exec" it. I haven't done
> this before, but there may be a few gotchas in doing this.
>
> Maybe others can add more information?
>
> Andrew
>
>
> 2014-07-11 5:50 GMT-07:00 Egor Pahomov :
>
> > Hi, I want to use pySpark, but I can't understand how it works. The
> > documentation doesn't provide enough information.
> >
> > 1) How is Python shipped to the cluster? Should machines in the cluster
> > already have Python?
> > 2) What happens when I write some Python code in a "map" function - is it
> > shipped to the cluster and just executed there? How does it figure out all
> > the dependencies my code needs and ship them there? If I use Math in my
> > code in "map", does that mean I would ship the Math class, or would some
> > Python Math on the cluster be used?
> > 3) I have compiled C++ code. Can I ship this executable with "addPyFile"
> > and just use the "exec" function from Python? Would it work?
> >
> > --
> >
> >
> >
> > *Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*
> >
>


Re: How pySpark works?

2014-07-11 Thread Andrew Or
Hi Egor,

Here are a few answers to your questions:

1) Python needs to be installed on all machines, but not pyspark. The way
the executors get the pyspark code depends on which cluster manager you
use. In standalone mode, your executors need to have the actual python
files in their working directory. In yarn mode, python files are included
in the assembly jar, which is then shipped to your executor containers
through a distributed cache.

2) Pyspark is just a thin wrapper around Spark. When you write a closure in
python, it is shipped to the executors within the task itself the same way
scala closures are shipped. If you use a special library, then all of the
nodes will need to have that library pre-installed.

3) Are you trying to run your C++ code inside the "map" function? If so,
you need to make sure the compiled code is present in the working directory
on all the executors beforehand for Python to "exec" it. I haven't done
this before, but there may be a few gotchas in doing this.
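
A rough, untested sketch of (3), using addFile plus SparkFiles to ship the
binary and a subprocess call to run it (a shell-out is the usual route for a
native executable, since Python's "exec" runs Python source); the "my_tool"
name and its command line are made up:

import os
import stat
import subprocess
from pyspark import SparkFiles

sc.addFile("/local/path/to/my_tool")        # ship the executable to every node

def run_tool(x):
    path = SparkFiles.get("my_tool")        # per-executor copy of the binary
    # The executable bit may not survive shipping, so restore it defensively.
    os.chmod(path, os.stat(path).st_mode | stat.S_IEXEC)
    return subprocess.check_output([path, str(x)]).strip()

result = sc.parallelize(range(10)).map(run_tool).collect()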

Maybe others can add more information?

Andrew


2014-07-11 5:50 GMT-07:00 Egor Pahomov :

> Hi, I want to use pySpark, but I can't understand how it works. The
> documentation doesn't provide enough information.
>
> 1) How is Python shipped to the cluster? Should machines in the cluster
> already have Python?
> 2) What happens when I write some Python code in a "map" function - is it
> shipped to the cluster and just executed there? How does it figure out all
> the dependencies my code needs and ship them there? If I use Math in my
> code in "map", does that mean I would ship the Math class, or would some
> Python Math on the cluster be used?
> 3) I have compiled C++ code. Can I ship this executable with "addPyFile"
> and just use the "exec" function from Python? Would it work?
>
> --
>
>
>
> *Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*
>


[RESULT] [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-11 Thread Patrick Wendell
This vote has passed with 9 +1 votes (5 binding) and 1 -1 vote (0 binding).

+1:
Patrick Wendell*
Mark Hamstra*
DB Tsai
Krishna Sankar
Soren Macbeth
Andrew Or
Matei Zaharia*
Xiangrui Meng*
Tom Graves*

0:

-1:
Gary Malouf


Re: [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-11 Thread Patrick Wendell
Okay just FYI - I'm closing this vote since many people are waiting on
the release and I was hoping to package it today. If we find a
reproducible Mesos issue here, we can definitely spin the fix into a
subsequent release.



On Fri, Jul 11, 2014 at 9:37 AM, Patrick Wendell  wrote:
> Hey Gary,
>
> Why do you think the akka frame size changed? It didn't change - we
> added some fixes for cases where users were setting non-default
> values.
>
> On Fri, Jul 11, 2014 at 9:31 AM, Gary Malouf  wrote:
>> Hi Matei,
>>
>> We have not had time to re-deploy the rc today, but one thing that jumps
>> out is the shrinking of the default akka frame size from 10MB to around
>> 128KB.  That is my first suspicion for our issue - I could imagine that
>> biting others as well.
>>
>> I'll try to re-test that today - either way, I understand moving forward at
>> this point.
>>
>> Gary
>>
>>
>> On Fri, Jul 11, 2014 at 12:08 PM, Matei Zaharia 
>> wrote:
>>
>>> Unless you can diagnose the problem quickly, Gary, I think we need to go
>>> ahead with this release as is. This release didn't touch the Mesos support
>>> as far as I know, so the problem might be a nondeterministic issue with
>>> your application. But on the other hand the release does fix some critical
>>> bugs that affect all users. We can always do 1.0.2 later if we discover a
>>> problem.
>>>
>>> Matei
>>>
>>> On Jul 10, 2014, at 9:40 PM, Patrick Wendell  wrote:
>>>
>>> > Hey Gary,
>>> >
>>> > The vote technically doesn't close until I send the vote summary
>>> > e-mail, but I was planning to close and package this tonight. It's too
>>> > bad if there is a regression, it might be worth holding the release
>>> > but it really requires narrowing down the issue to get more
>>> > information about the scope and severity. Could you fork another
>>> > thread for this?
>>> >
>>> > - Patrick
>>> >
>>> > On Thu, Jul 10, 2014 at 6:28 PM, Gary Malouf 
>>> wrote:
>>> >> -1 I honestly do not know the voting rules for the Spark community, so
>>> >> please excuse me if I am out of line or if Mesos compatibility is not a
>>> >> concern at this point.
>>> >>
>>> >> We just tried to run this version built against 2.3.0-cdh5.0.2 on mesos
>>> >> 0.18.2.  All of our jobs with data above a few gigabytes hung
>>> indefinitely.
>>> >> Downgrading back to the 1.0.0 stable release of Spark built the same way
>>> >> worked for us.
>>> >>
>>> >>
>>> >> On Mon, Jul 7, 2014 at 5:17 PM, Tom Graves >> >
>>> >> wrote:
>>> >>
>>> >>> +1. Ran some Spark on yarn jobs on a hadoop 2.4 cluster with
>>> >>> authentication on.
>>> >>>
>>> >>> Tom
>>> >>>
>>> >>>
>>> >>> On Friday, July 4, 2014 2:39 PM, Patrick Wendell 
>>> >>> wrote:
>>> >>>
>>> >>>
>>> >>>
>>> >>> Please vote on releasing the following candidate as Apache Spark
>>> version
>>> >>> 1.0.1!
>>> >>>
>>> >>> The tag to be voted on is v1.0.1-rc1 (commit 7d1043c):
>>> >>>
>>> >>>
>>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7d1043c99303b87aef8ee19873629c2bfba4cc78
>>> >>>
>>> >>> The release files, including signatures, digests, etc. can be found at:
>>> >>> http://people.apache.org/~pwendell/spark-1.0.1-rc2/
>>> >>>
>>> >>> Release artifacts are signed with the following key:
>>> >>> https://people.apache.org/keys/committer/pwendell.asc
>>> >>>
>>> >>> The staging repository for this release can be found at:
>>> >>>
>>> https://repository.apache.org/content/repositories/orgapachespark-1021/
>>> >>>
>>> >>> The documentation corresponding to this release can be found at:
>>> >>> http://people.apache.org/~pwendell/spark-1.0.1-rc2-docs/
>>> >>>
>>> >>> Please vote on releasing this package as Apache Spark 1.0.1!
>>> >>>
>>> >>> The vote is open until Monday, July 07, at 20:45 UTC and passes if
>>> >>> a majority of at least 3 +1 PMC votes are cast.
>>> >>>
>>> >>> [ ] +1 Release this package as Apache Spark 1.0.1
>>> >>> [ ] -1 Do not release this package because ...
>>> >>>
>>> >>> To learn more about Apache Spark, please see
>>> >>> http://spark.apache.org/
>>> >>>
>>> >>> === Differences from RC1 ===
>>> >>> This release includes only one "blocking" patch from rc1:
>>> >>> https://github.com/apache/spark/pull/1255
>>> >>>
>>> >>> There are also smaller fixes which came in over the last week.
>>> >>>
>>> >>> === About this release ===
>>> >>> This release fixes a few high-priority bugs in 1.0 and has a variety
>>> >>> of smaller fixes. The full list is here: http://s.apache.org/b45. Some
>>> >>> of the more visible patches are:
>>> >>>
>>> >>> SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
>>> >>> SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame
>>> size.
>>> >>> SPARK-1790: Support r3 instance types on EC2.
>>> >>>
>>> >>> This is the first maintenance release on the 1.0 line. We plan to make
>>> >>> additional maintenance releases as new fixes come in.
>>> >>>
>>>
>>>


Re: [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-11 Thread Patrick Wendell
Hey Gary,

Why do you think the akka frame size changed? It didn't change - we
added some fixes for cases where users were setting non-default
values.
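
For reference, explicitly setting a non-default value looks roughly like this
from PySpark (the 64 MB figure is only an illustration; spark.akka.frameSize
is interpreted in MB):

from pyspark import SparkConf, SparkContext

# Raise the Akka frame size (MB) for jobs that send large control-plane messages.
conf = SparkConf().set("spark.akka.frameSize", "64")
sc = SparkContext(conf=conf)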

On Fri, Jul 11, 2014 at 9:31 AM, Gary Malouf  wrote:
> Hi Matei,
>
> We have not had time to re-deploy the rc today, but one thing that jumps
> out is the shrinking of the default akka frame size from 10MB to around
> 128KB.  That is my first suspicion for our issue - I could imagine that
> biting others as well.
>
> I'll try to re-test that today - either way, I understand moving forward at
> this point.
>
> Gary
>
>
> On Fri, Jul 11, 2014 at 12:08 PM, Matei Zaharia 
> wrote:
>
>> Unless you can diagnose the problem quickly, Gary, I think we need to go
>> ahead with this release as is. This release didn't touch the Mesos support
>> as far as I know, so the problem might be a nondeterministic issue with
>> your application. But on the other hand the release does fix some critical
>> bugs that affect all users. We can always do 1.0.2 later if we discover a
>> problem.
>>
>> Matei
>>
>> On Jul 10, 2014, at 9:40 PM, Patrick Wendell  wrote:
>>
>> > Hey Gary,
>> >
>> > The vote technically doesn't close until I send the vote summary
>> > e-mail, but I was planning to close and package this tonight. It's too
>> > bad if there is a regression, it might be worth holding the release
>> > but it really requires narrowing down the issue to get more
>> > information about the scope and severity. Could you fork another
>> > thread for this?
>> >
>> > - Patrick
>> >
>> > On Thu, Jul 10, 2014 at 6:28 PM, Gary Malouf 
>> wrote:
>> >> -1 I honestly do not know the voting rules for the Spark community, so
>> >> please excuse me if I am out of line or if Mesos compatibility is not a
>> >> concern at this point.
>> >>
>> >> We just tried to run this version built against 2.3.0-cdh5.0.2 on mesos
>> >> 0.18.2.  All of our jobs with data above a few gigabytes hung
>> indefinitely.
>> >> Downgrading back to the 1.0.0 stable release of Spark built the same way
>> >> worked for us.
>> >>
>> >>
>> >> On Mon, Jul 7, 2014 at 5:17 PM, Tom Graves > >
>> >> wrote:
>> >>
>> >>> +1. Ran some Spark on yarn jobs on a hadoop 2.4 cluster with
>> >>> authentication on.
>> >>>
>> >>> Tom
>> >>>
>> >>>
>> >>> On Friday, July 4, 2014 2:39 PM, Patrick Wendell 
>> >>> wrote:
>> >>>
>> >>>
>> >>>
>> >>> Please vote on releasing the following candidate as Apache Spark
>> version
>> >>> 1.0.1!
>> >>>
>> >>> The tag to be voted on is v1.0.1-rc1 (commit 7d1043c):
>> >>>
>> >>>
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7d1043c99303b87aef8ee19873629c2bfba4cc78
>> >>>
>> >>> The release files, including signatures, digests, etc. can be found at:
>> >>> http://people.apache.org/~pwendell/spark-1.0.1-rc2/
>> >>>
>> >>> Release artifacts are signed with the following key:
>> >>> https://people.apache.org/keys/committer/pwendell.asc
>> >>>
>> >>> The staging repository for this release can be found at:
>> >>>
>> https://repository.apache.org/content/repositories/orgapachespark-1021/
>> >>>
>> >>> The documentation corresponding to this release can be found at:
>> >>> http://people.apache.org/~pwendell/spark-1.0.1-rc2-docs/
>> >>>
>> >>> Please vote on releasing this package as Apache Spark 1.0.1!
>> >>>
>> >>> The vote is open until Monday, July 07, at 20:45 UTC and passes if
>> >>> a majority of at least 3 +1 PMC votes are cast.
>> >>>
>> >>> [ ] +1 Release this package as Apache Spark 1.0.1
>> >>> [ ] -1 Do not release this package because ...
>> >>>
>> >>> To learn more about Apache Spark, please see
>> >>> http://spark.apache.org/
>> >>>
>> >>> === Differences from RC1 ===
>> >>> This release includes only one "blocking" patch from rc1:
>> >>> https://github.com/apache/spark/pull/1255
>> >>>
>> >>> There are also smaller fixes which came in over the last week.
>> >>>
>> >>> === About this release ===
>> >>> This release fixes a few high-priority bugs in 1.0 and has a variety
>> >>> of smaller fixes. The full list is here: http://s.apache.org/b45. Some
>> >>> of the more visible patches are:
>> >>>
>> >>> SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
>> >>> SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame
>> size.
>> >>> SPARK-1790: Support r3 instance types on EC2.
>> >>>
>> >>> This is the first maintenance release on the 1.0 line. We plan to make
>> >>> additional maintenance releases as new fixes come in.
>> >>>
>>
>>


Re: [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-11 Thread Gary Malouf
Hi Matei,

We have not had time to re-deploy the rc today, but one thing that jumps
out is the shrinking of the default akka frame size from 10MB to around
128KB.  That is my first suspicion for our issue - I could imagine that
biting others as well.

I'll try to re-test that today - either way, I understand moving forward at
this point.

Gary


On Fri, Jul 11, 2014 at 12:08 PM, Matei Zaharia 
wrote:

> Unless you can diagnose the problem quickly, Gary, I think we need to go
> ahead with this release as is. This release didn't touch the Mesos support
> as far as I know, so the problem might be a nondeterministic issue with
> your application. But on the other hand the release does fix some critical
> bugs that affect all users. We can always do 1.0.2 later if we discover a
> problem.
>
> Matei
>
> On Jul 10, 2014, at 9:40 PM, Patrick Wendell  wrote:
>
> > Hey Gary,
> >
> > The vote technically doesn't close until I send the vote summary
> > e-mail, but I was planning to close and package this tonight. It's too
> > bad if there is a regression, it might be worth holding the release
> > but it really requires narrowing down the issue to get more
> > information about the scope and severity. Could you fork another
> > thread for this?
> >
> > - Patrick
> >
> > On Thu, Jul 10, 2014 at 6:28 PM, Gary Malouf 
> wrote:
> >> -1 I honestly do not know the voting rules for the Spark community, so
> >> please excuse me if I am out of line or if Mesos compatibility is not a
> >> concern at this point.
> >>
> >> We just tried to run this version built against 2.3.0-cdh5.0.2 on mesos
> >> 0.18.2.  All of our jobs with data above a few gigabytes hung
> indefinitely.
> >> Downgrading back to the 1.0.0 stable release of Spark built the same way
> >> worked for us.
> >>
> >>
> >> On Mon, Jul 7, 2014 at 5:17 PM, Tom Graves  >
> >> wrote:
> >>
> >>> +1. Ran some Spark on yarn jobs on a hadoop 2.4 cluster with
> >>> authentication on.
> >>>
> >>> Tom
> >>>
> >>>
> >>> On Friday, July 4, 2014 2:39 PM, Patrick Wendell 
> >>> wrote:
> >>>
> >>>
> >>>
> >>> Please vote on releasing the following candidate as Apache Spark
> version
> >>> 1.0.1!
> >>>
> >>> The tag to be voted on is v1.0.1-rc1 (commit 7d1043c):
> >>>
> >>>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7d1043c99303b87aef8ee19873629c2bfba4cc78
> >>>
> >>> The release files, including signatures, digests, etc. can be found at:
> >>> http://people.apache.org/~pwendell/spark-1.0.1-rc2/
> >>>
> >>> Release artifacts are signed with the following key:
> >>> https://people.apache.org/keys/committer/pwendell.asc
> >>>
> >>> The staging repository for this release can be found at:
> >>>
> https://repository.apache.org/content/repositories/orgapachespark-1021/
> >>>
> >>> The documentation corresponding to this release can be found at:
> >>> http://people.apache.org/~pwendell/spark-1.0.1-rc2-docs/
> >>>
> >>> Please vote on releasing this package as Apache Spark 1.0.1!
> >>>
> >>> The vote is open until Monday, July 07, at 20:45 UTC and passes if
> >>> a majority of at least 3 +1 PMC votes are cast.
> >>>
> >>> [ ] +1 Release this package as Apache Spark 1.0.1
> >>> [ ] -1 Do not release this package because ...
> >>>
> >>> To learn more about Apache Spark, please see
> >>> http://spark.apache.org/
> >>>
> >>> === Differences from RC1 ===
> >>> This release includes only one "blocking" patch from rc1:
> >>> https://github.com/apache/spark/pull/1255
> >>>
> >>> There are also smaller fixes which came in over the last week.
> >>>
> >>> === About this release ===
> >>> This release fixes a few high-priority bugs in 1.0 and has a variety
> >>> of smaller fixes. The full list is here: http://s.apache.org/b45. Some
> >>> of the more visible patches are:
> >>>
> >>> SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
> >>> SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame
> size.
> >>> SPARK-1790: Support r3 instance types on EC2.
> >>>
> >>> This is the first maintenance release on the 1.0 line. We plan to make
> >>> additional maintenance releases as new fixes come in.
> >>>
>
>


Re: [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-11 Thread Matei Zaharia
Unless you can diagnose the problem quickly, Gary, I think we need to go ahead 
with this release as is. This release didn't touch the Mesos support as far as 
I know, so the problem might be a nondeterministic issue with your application. 
But on the other hand the release does fix some critical bugs that affect all 
users. We can always do 1.0.2 later if we discover a problem.

Matei

On Jul 10, 2014, at 9:40 PM, Patrick Wendell  wrote:

> Hey Gary,
> 
> The vote technically doesn't close until I send the vote summary
> e-mail, but I was planning to close and package this tonight. It's too
> bad if there is a regression, it might be worth holding the release
> but it really requires narrowing down the issue to get more
> information about the scope and severity. Could you fork another
> thread for this?
> 
> - Patrick
> 
> On Thu, Jul 10, 2014 at 6:28 PM, Gary Malouf  wrote:
>> -1 I honestly do not know the voting rules for the Spark community, so
>> please excuse me if I am out of line or if Mesos compatibility is not a
>> concern at this point.
>> 
>> We just tried to run this version built against 2.3.0-cdh5.0.2 on mesos
>> 0.18.2.  All of our jobs with data above a few gigabytes hung indefinitely.
>> Downgrading back to the 1.0.0 stable release of Spark built the same way
>> worked for us.
>> 
>> 
>> On Mon, Jul 7, 2014 at 5:17 PM, Tom Graves 
>> wrote:
>> 
>>> +1. Ran some Spark on yarn jobs on a hadoop 2.4 cluster with
>>> authentication on.
>>> 
>>> Tom
>>> 
>>> 
>>> On Friday, July 4, 2014 2:39 PM, Patrick Wendell 
>>> wrote:
>>> 
>>> 
>>> 
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 1.0.1!
>>> 
>>> The tag to be voted on is v1.0.1-rc1 (commit 7d1043c):
>>> 
>>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7d1043c99303b87aef8ee19873629c2bfba4cc78
>>> 
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-1.0.1-rc2/
>>> 
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>> 
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1021/
>>> 
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-1.0.1-rc2-docs/
>>> 
>>> Please vote on releasing this package as Apache Spark 1.0.1!
>>> 
>>> The vote is open until Monday, July 07, at 20:45 UTC and passes if
>>> a majority of at least 3 +1 PMC votes are cast.
>>> 
>>> [ ] +1 Release this package as Apache Spark 1.0.1
>>> [ ] -1 Do not release this package because ...
>>> 
>>> To learn more about Apache Spark, please see
>>> http://spark.apache.org/
>>> 
>>> === Differences from RC1 ===
>>> This release includes only one "blocking" patch from rc1:
>>> https://github.com/apache/spark/pull/1255
>>> 
>>> There are also smaller fixes which came in over the last week.
>>> 
>>> === About this release ===
>>> This release fixes a few high-priority bugs in 1.0 and has a variety
>>> of smaller fixes. The full list is here: http://s.apache.org/b45. Some
>>> of the more visible patches are:
>>> 
>>> SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
>>> SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame size.
>>> SPARK-1790: Support r3 instance types on EC2.
>>> 
>>> This is the first maintenance release on the 1.0 line. We plan to make
>>> additional maintenance releases as new fixes come in.
>>> 



Calling Scala/Java methods which operates on RDD

2014-07-11 Thread Jai Kumar Singh
Hi,
  I want to write some common utility functions in Scala and call them from
the Java/Python Spark APIs (maybe with some wrapper code around the Scala
calls). Calling Scala functions from Java works fine. I was reading the
pyspark RDD code and found that pyspark is able to call JavaRDD functions
like union/zip for pyspark RDDs, deserialize the output, and everything
works fine. But somehow I am not able to get a really simple example
working. I think I am missing some serialization/deserialization step.

Can someone confirm whether it is even possible to do so? Or would it be much
easier to pass RDD data files around instead of passing RDDs directly (from
pyspark to java/scala)?

For example, the code below just adds 1 to each element of an RDD containing
integers.

package flukebox.test

import org.apache.spark.rdd.RDD

object TestClass {
  // Adds 1 to every element of an RDD of Ints.
  def testFunc(data: RDD[Int]) = {
    data.map(x => x + 1)
  }
}

Calling from python,

from pyspark import RDD

from py4j.java_gateway import java_import

java_import(sc._gateway.jvm, "flukebox.test")


data = sc.parallelize([1,2,3,4,5,6,7,8,9])

sc._jvm.flukebox.test.TestClass.testFunc(data._jrdd.rdd())


*This fails because testFunc gets an RDD of byte arrays.*


Any help/pointer would be highly appreciated.


Thanks & Regards,

Jai K Singh


Re: Random forest - is it under implementation?

2014-07-11 Thread Egor Pahomov
Great. Then one question left:
what would you recommend for implementation?



2014-07-11 17:43 GMT+04:00 Chester At Work :

> Sung Chung from Alpine Data Labs presented the random forest
> implementation at Spark Summit 2014. The work will be open-sourced and
> contributed back to MLlib.
>
> Stay tuned
>
>
>
> Sent from my iPad
>
> On Jul 11, 2014, at 6:02 AM, Egor Pahomov  wrote:
>
> > Hi, I have an intern who wants to implement some ML algorithm for Spark.
> > Which algorithm would be a good idea to implement (it should not be very
> > difficult)? I heard someone is already working on random forest, but I
> > couldn't find proof of that.
> >
> > I'm aware of the new policy that we should implement stable, good-quality,
> > popular ML algorithms or not do them at all.
> >
> > --
> >
> >
> >
> > *Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*
>



-- 



*Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*


Re: Random forest - is it under implementation?

2014-07-11 Thread Chester At Work
Sung Chung from Alpine Data Labs presented the random forest implementation at
Spark Summit 2014. The work will be open-sourced and contributed back to MLlib.

Stay tuned 



Sent from my iPad

On Jul 11, 2014, at 6:02 AM, Egor Pahomov  wrote:

> Hi, I have an intern who wants to implement some ML algorithm for Spark.
> Which algorithm would be a good idea to implement (it should not be very
> difficult)? I heard someone is already working on random forest, but I
> couldn't find proof of that.
> 
> I'm aware of the new policy that we should implement stable, good-quality,
> popular ML algorithms or not do them at all.
> 
> -- 
> 
> 
> 
> *Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*


Random forest - is it under implementation?

2014-07-11 Thread Egor Pahomov
Hi, I have an intern who wants to implement some ML algorithm for Spark.
Which algorithm would be a good idea to implement (it should not be very
difficult)? I heard someone is already working on random forest, but I
couldn't find proof of that.

I'm aware of the new policy that we should implement stable, good-quality,
popular ML algorithms or not do them at all.

-- 



*Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*


How pySpark works?

2014-07-11 Thread Egor Pahomov
Hi, I want to use pySpark, but I can't understand how it works. The
documentation doesn't provide enough information.

1) How is Python shipped to the cluster? Should machines in the cluster
already have Python?
2) What happens when I write some Python code in a "map" function - is it
shipped to the cluster and just executed there? How does it figure out all
the dependencies my code needs and ship them there? If I use Math in my
code in "map", does that mean I would ship the Math class, or would some
Python Math on the cluster be used?
3) I have compiled C++ code. Can I ship this executable with "addPyFile"
and just use the "exec" function from Python? Would it work?

-- 



*Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*