Re: what is the difference between org.spark-project.hive and org.apache.hadoop.hive

2014-07-11 Thread Patrick Wendell
There are two differences:

1. We publish hive with a shaded protobuf dependency to avoid
conflicts with some Hadoop versions.
2. We publish a proper hive-exec jar that only includes hive packages.
The upstream version of hive-exec bundles a bunch of other random
dependencies in it which makes it really hard for third-party projects
to use it.

On Thu, Jul 10, 2014 at 11:29 PM, kingfly wangf...@huawei.com wrote:

 --

 Best Regards
 Frank Wang | Software Engineer

 Mobile: +86 18505816792
 Phone: +86 571 63547
 Fax:
 Email: wangf...@huawei.com
 
 Huawei Technologies Co., Ltd.
 Hangzhou RD Center
 NO.410, JiangHong Road, Binjiang Area, Hangzhou, 310052, P. R. China




How pySpark works?

2014-07-11 Thread Egor Pahomov
Hi, I want to use pySpark, but I can't understand how it works. The
documentation doesn't provide enough information.

1) How is Python shipped to the cluster? Should the machines in the cluster
already have Python installed?
2) What happens when I write some Python code in a map function - is it
shipped to the cluster and just executed there? How does Spark work out all
the dependencies my code needs and ship them there? If I use Math in my code
inside map, does that mean the Math class gets shipped, or would some Python
Math already on the cluster be used?
3) I have compiled C++ code. Can I ship this executable with addPyFile
and just call it with exec from Python? Would it work?

-- 



Sincerely yours,
Egor Pakhomov
Scala Developer, Yandex


Random forest - is it under implementation?

2014-07-11 Thread Egor Pahomov
Hi, I have an intern who wants to implement some ML algorithm for Spark.
Which algorithm would be a good idea to implement (it should not be very
difficult)? I heard someone is already working on random forest, but I
couldn't find proof of that.

I'm aware of the new policy that we should implement stable, good-quality,
popular ML algorithms or not do it at all.

-- 



Sincerely yours,
Egor Pakhomov
Scala Developer, Yandex


Re: Random forest - is it under implementation?

2014-07-11 Thread Chester At Work
Sung Chung from Alpine Data Labs presented the random forest implementation at 
Spark Summit 2014. The work will be open-sourced and contributed back to MLlib.

Stay tuned 



Sent from my iPad

On Jul 11, 2014, at 6:02 AM, Egor Pahomov pahomov.e...@gmail.com wrote:

 Hi, I have intern, who wants to implement some ML algorithm for spark.
 Which algorithm would be good idea to implement(it should be not very
 difficult)? I heard someone already working on random forest, but couldn't
 find proof of that.
 
 I'm aware of new politics, where we should implement stable, good quality,
 popular ML or do not do it at all.
 
 -- 
 
 
 
 Sincerely yours,
 Egor Pakhomov
 Scala Developer, Yandex


Re: Random forest - is it under implementation?

2014-07-11 Thread Egor Pahomov
Great. Then one question remains:
which algorithm would you recommend implementing?



2014-07-11 17:43 GMT+04:00 Chester At Work ches...@alpinenow.com:

 Sung chung from alpine data labs presented the random Forrest
 implementation at Spark summit 2014. The work will be open sourced and
 contributed back to MLLib.

 Stay tuned



 Sent from my iPad

 On Jul 11, 2014, at 6:02 AM, Egor Pahomov pahomov.e...@gmail.com wrote:

  Hi, I have intern, who wants to implement some ML algorithm for spark.
  Which algorithm would be good idea to implement(it should be not very
  difficult)? I heard someone already working on random forest, but
 couldn't
  find proof of that.
 
  I'm aware of new politics, where we should implement stable, good
 quality,
  popular ML or do not do it at all.
 
  --
 
 
 
  Sincerely yours,
  Egor Pakhomov
  Scala Developer, Yandex




-- 



Sincerely yours,
Egor Pakhomov
Scala Developer, Yandex


Calling Scala/Java methods which operates on RDD

2014-07-11 Thread Jai Kumar Singh
Hi,
  I want to write some common utility functions in Scala and call them from
the Java/Python Spark APIs (maybe with some wrapper code around the Scala
calls). Calling the Scala functions from Java works fine. Reading the pyspark
RDD code, I found that pyspark is able to call JavaRDD functions like
union/zip for pyspark RDDs, deserialize the output, and everything works
fine. But somehow I am not able to get a really simple example to work. I
think I am missing some serialization/deserialization step.

Can someone confirm whether it is even possible to do this? Or would it be
much easier to pass RDD data files around instead of the RDDs directly (from
pyspark to Java/Scala)?

For example, the code below just adds 1 to each element of an RDD of
Integers.

package flukebox.test

import org.apache.spark.rdd.RDD

object TestClass {

  def testFunc(data: RDD[Int]) = {

    data.map(x => x + 1)

  }

}

Calling from python,

from pyspark import RDD

from py4j.java_gateway import java_import

java_import(sc._gateway.jvm, "flukebox.test")


data = sc.parallelize([1,2,3,4,5,6,7,8,9])

sc._jvm.flukebox.test.TestClass.testFunc(data._jrdd.rdd())


*This fails because testFunc receives an RDD of byte arrays rather than an RDD[Int].*


Any help/pointer would be highly appreciated.


Thanks & Regards,

Jai K Singh


Re: [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-11 Thread Matei Zaharia
Unless you can diagnose the problem quickly, Gary, I think we need to go ahead 
with this release as is. This release didn't touch the Mesos support as far as 
I know, so the problem might be a nondeterministic issue with your application. 
But on the other hand the release does fix some critical bugs that affect all 
users. We can always do 1.0.2 later if we discover a problem.

Matei

On Jul 10, 2014, at 9:40 PM, Patrick Wendell pwend...@gmail.com wrote:

 Hey Gary,
 
 The vote technically doesn't close until I send the vote summary
 e-mail, but I was planning to close and package this tonight. It's too
 bad if there is a regression, it might be worth holding the release
 but it really requires narrowing down the issue to get more
 information about the scope and severity. Could you fork another
 thread for this?
 
 - Patrick
 
 On Thu, Jul 10, 2014 at 6:28 PM, Gary Malouf malouf.g...@gmail.com wrote:
 -1 I honestly do not know the voting rules for the Spark community, so
 please excuse me if I am out of line or if Mesos compatibility is not a
 concern at this point.
 
 We just tried to run this version built against 2.3.0-cdh5.0.2 on mesos
 0.18.2.  All of our jobs with data above a few gigabytes hung indefinitely.
 Downgrading back to the 1.0.0 stable release of Spark built the same way
 worked for us.
 
 
 On Mon, Jul 7, 2014 at 5:17 PM, Tom Graves tgraves...@yahoo.com.invalid
 wrote:
 
 +1. Ran some Spark on yarn jobs on a hadoop 2.4 cluster with
 authentication on.
 
 Tom
 
 
 On Friday, July 4, 2014 2:39 PM, Patrick Wendell pwend...@gmail.com
 wrote:
 
 
 
 Please vote on releasing the following candidate as Apache Spark version
 1.0.1!
 
 The tag to be voted on is v1.0.1-rc1 (commit 7d1043c):
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7d1043c99303b87aef8ee19873629c2bfba4cc78
 
 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.0.1-rc2/
 
 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc
 
 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1021/
 
 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.0.1-rc2-docs/
 
 Please vote on releasing this package as Apache Spark 1.0.1!
 
 The vote is open until Monday, July 07, at 20:45 UTC and passes if
 a majority of at least 3 +1 PMC votes are cast.
 
 [ ] +1 Release this package as Apache Spark 1.0.1
 [ ] -1 Do not release this package because ...
 
 To learn more about Apache Spark, please see
 http://spark.apache.org/
 
 === Differences from RC1 ===
 This release includes only one blocking patch from rc1:
 https://github.com/apache/spark/pull/1255
 
 There are also smaller fixes which came in over the last week.
 
 === About this release ===
 This release fixes a few high-priority bugs in 1.0 and has a variety
 of smaller fixes. The full list is here: http://s.apache.org/b45. Some
 of the more visible patches are:
 
 SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
 SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame size.
 SPARK-1790: Support r3 instance types on EC2.
 
 This is the first maintenance release on the 1.0 line. We plan to make
 additional maintenance releases as new fixes come in.
 



Re: [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-11 Thread Gary Malouf
Hi Matei,

We have not had time to re-deploy the rc today, but one thing that jumps
out is the shrinking of the default akka frame size from 10MB to around
128KB by default.  That is my first suspicion for our issue - could imagine
that biting others as well.

I'll try to re-test that today - either way, understand moving forward at
this point.

Gary


On Fri, Jul 11, 2014 at 12:08 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 Unless you can diagnose the problem quickly, Gary, I think we need to go
 ahead with this release as is. This release didn't touch the Mesos support
 as far as I know, so the problem might be a nondeterministic issue with
 your application. But on the other hand the release does fix some critical
 bugs that affect all users. We can always do 1.0.2 later if we discover a
 problem.

 Matei

 On Jul 10, 2014, at 9:40 PM, Patrick Wendell pwend...@gmail.com wrote:

  Hey Gary,
 
  The vote technically doesn't close until I send the vote summary
  e-mail, but I was planning to close and package this tonight. It's too
  bad if there is a regression, it might be worth holding the release
  but it really requires narrowing down the issue to get more
  information about the scope and severity. Could you fork another
  thread for this?
 
  - Patrick
 
  On Thu, Jul 10, 2014 at 6:28 PM, Gary Malouf malouf.g...@gmail.com
 wrote:
  -1 I honestly do not know the voting rules for the Spark community, so
  please excuse me if I am out of line or if Mesos compatibility is not a
  concern at this point.
 
  We just tried to run this version built against 2.3.0-cdh5.0.2 on mesos
  0.18.2.  All of our jobs with data above a few gigabytes hung
 indefinitely.
  Downgrading back to the 1.0.0 stable release of Spark built the same way
  worked for us.
 
 
  On Mon, Jul 7, 2014 at 5:17 PM, Tom Graves tgraves...@yahoo.com.invalid
 
  wrote:
 
  +1. Ran some Spark on yarn jobs on a hadoop 2.4 cluster with
  authentication on.
 
  Tom
 
 
  On Friday, July 4, 2014 2:39 PM, Patrick Wendell pwend...@gmail.com
  wrote:
 
 
 
  Please vote on releasing the following candidate as Apache Spark
 version
  1.0.1!
 
  The tag to be voted on is v1.0.1-rc1 (commit 7d1043c):
 
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7d1043c99303b87aef8ee19873629c2bfba4cc78
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-1.0.1-rc2/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
 
 https://repository.apache.org/content/repositories/orgapachespark-1021/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-1.0.1-rc2-docs/
 
  Please vote on releasing this package as Apache Spark 1.0.1!
 
  The vote is open until Monday, July 07, at 20:45 UTC and passes if
  a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.0.1
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  === Differences from RC1 ===
  This release includes only one blocking patch from rc1:
  https://github.com/apache/spark/pull/1255
 
  There are also smaller fixes which came in over the last week.
 
  === About this release ===
  This release fixes a few high-priority bugs in 1.0 and has a variety
  of smaller fixes. The full list is here: http://s.apache.org/b45. Some
  of the more visible patches are:
 
  SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
  SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame
 size.
  SPARK-1790: Support r3 instance types on EC2.
 
  This is the first maintenance release on the 1.0 line. We plan to make
  additional maintenance releases as new fixes come in.
 




Re: [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-11 Thread Patrick Wendell
Hey Gary,

Why do you think the akka frame size changed? It didn't change - we
added some fixes for cases where users were setting non-default
values.
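
For anyone who wants to rule the frame size out on their own job: it is
controlled by spark.akka.frameSize (the value is in MB; 10 is the default
referenced earlier in this thread) and can be pinned explicitly. A rough
pyspark sketch, with the value chosen purely as an illustration:

from pyspark import SparkConf, SparkContext

# Pin the Akka frame size explicitly (in MB) so the default is not in play.
conf = (SparkConf()
        .setAppName("FrameSizeCheck")
        .set("spark.akka.frameSize", "10"))
sc = SparkContext(conf=conf)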

On Fri, Jul 11, 2014 at 9:31 AM, Gary Malouf malouf.g...@gmail.com wrote:
 Hi Matei,

 We have not had time to re-deploy the rc today, but one thing that jumps
 out is the shrinking of the default akka frame size from 10MB to around
 128KB by default.  That is my first suspicion for our issue - could imagine
 that biting others as well.

 I'll try to re-test that today - either way, understand moving forward at
 this point.

 Gary


 On Fri, Jul 11, 2014 at 12:08 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

 Unless you can diagnose the problem quickly, Gary, I think we need to go
 ahead with this release as is. This release didn't touch the Mesos support
 as far as I know, so the problem might be a nondeterministic issue with
 your application. But on the other hand the release does fix some critical
 bugs that affect all users. We can always do 1.0.2 later if we discover a
 problem.

 Matei

 On Jul 10, 2014, at 9:40 PM, Patrick Wendell pwend...@gmail.com wrote:

  Hey Gary,
 
  The vote technically doesn't close until I send the vote summary
  e-mail, but I was planning to close and package this tonight. It's too
  bad if there is a regression, it might be worth holding the release
  but it really requires narrowing down the issue to get more
  information about the scope and severity. Could you fork another
  thread for this?
 
  - Patrick
 
  On Thu, Jul 10, 2014 at 6:28 PM, Gary Malouf malouf.g...@gmail.com
 wrote:
  -1 I honestly do not know the voting rules for the Spark community, so
  please excuse me if I am out of line or if Mesos compatibility is not a
  concern at this point.
 
  We just tried to run this version built against 2.3.0-cdh5.0.2 on mesos
  0.18.2.  All of our jobs with data above a few gigabytes hung
 indefinitely.
  Downgrading back to the 1.0.0 stable release of Spark built the same way
  worked for us.
 
 
  On Mon, Jul 7, 2014 at 5:17 PM, Tom Graves tgraves...@yahoo.com.invalid
 
  wrote:
 
  +1. Ran some Spark on yarn jobs on a hadoop 2.4 cluster with
  authentication on.
 
  Tom
 
 
  On Friday, July 4, 2014 2:39 PM, Patrick Wendell pwend...@gmail.com
  wrote:
 
 
 
  Please vote on releasing the following candidate as Apache Spark
 version
  1.0.1!
 
  The tag to be voted on is v1.0.1-rc1 (commit 7d1043c):
 
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7d1043c99303b87aef8ee19873629c2bfba4cc78
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-1.0.1-rc2/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
 
 https://repository.apache.org/content/repositories/orgapachespark-1021/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-1.0.1-rc2-docs/
 
  Please vote on releasing this package as Apache Spark 1.0.1!
 
  The vote is open until Monday, July 07, at 20:45 UTC and passes if
  a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.0.1
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  === Differences from RC1 ===
  This release includes only one blocking patch from rc1:
  https://github.com/apache/spark/pull/1255
 
  There are also smaller fixes which came in over the last week.
 
  === About this release ===
  This release fixes a few high-priority bugs in 1.0 and has a variety
  of smaller fixes. The full list is here: http://s.apache.org/b45. Some
  of the more visible patches are:
 
  SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
  SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame
 size.
  SPARK-1790: Support r3 instance types on EC2.
 
  This is the first maintenance release on the 1.0 line. We plan to make
  additional maintenance releases as new fixes come in.
 




Re: [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-11 Thread Patrick Wendell
Okay just FYI - I'm closing this vote since many people are waiting on
the release and I was hoping to package it today. If we find a
reproducible Mesos issue here, we can definitely spin the fix into a
subsequent release.



On Fri, Jul 11, 2014 at 9:37 AM, Patrick Wendell pwend...@gmail.com wrote:
 Hey Gary,

 Why do you think the akka frame size changed? It didn't change - we
 added some fixes for cases where users were setting non-default
 values.

 On Fri, Jul 11, 2014 at 9:31 AM, Gary Malouf malouf.g...@gmail.com wrote:
 Hi Matei,

 We have not had time to re-deploy the rc today, but one thing that jumps
 out is the shrinking of the default akka frame size from 10MB to around
 128KB by default.  That is my first suspicion for our issue - could imagine
 that biting others as well.

 I'll try to re-test that today - either way, understand moving forward at
 this point.

 Gary


 On Fri, Jul 11, 2014 at 12:08 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

 Unless you can diagnose the problem quickly, Gary, I think we need to go
 ahead with this release as is. This release didn't touch the Mesos support
 as far as I know, so the problem might be a nondeterministic issue with
 your application. But on the other hand the release does fix some critical
 bugs that affect all users. We can always do 1.0.2 later if we discover a
 problem.

 Matei

 On Jul 10, 2014, at 9:40 PM, Patrick Wendell pwend...@gmail.com wrote:

  Hey Gary,
 
  The vote technically doesn't close until I send the vote summary
  e-mail, but I was planning to close and package this tonight. It's too
  bad if there is a regression, it might be worth holding the release
  but it really requires narrowing down the issue to get more
  information about the scope and severity. Could you fork another
  thread for this?
 
  - Patrick
 
  On Thu, Jul 10, 2014 at 6:28 PM, Gary Malouf malouf.g...@gmail.com
 wrote:
  -1 I honestly do not know the voting rules for the Spark community, so
  please excuse me if I am out of line or if Mesos compatibility is not a
  concern at this point.
 
  We just tried to run this version built against 2.3.0-cdh5.0.2 on mesos
  0.18.2.  All of our jobs with data above a few gigabytes hung
 indefinitely.
  Downgrading back to the 1.0.0 stable release of Spark built the same way
  worked for us.
 
 
  On Mon, Jul 7, 2014 at 5:17 PM, Tom Graves tgraves...@yahoo.com.invalid
 
  wrote:
 
  +1. Ran some Spark on yarn jobs on a hadoop 2.4 cluster with
  authentication on.
 
  Tom
 
 
  On Friday, July 4, 2014 2:39 PM, Patrick Wendell pwend...@gmail.com
  wrote:
 
 
 
  Please vote on releasing the following candidate as Apache Spark
 version
  1.0.1!
 
  The tag to be voted on is v1.0.1-rc1 (commit 7d1043c):
 
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7d1043c99303b87aef8ee19873629c2bfba4cc78
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-1.0.1-rc2/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
 
 https://repository.apache.org/content/repositories/orgapachespark-1021/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-1.0.1-rc2-docs/
 
  Please vote on releasing this package as Apache Spark 1.0.1!
 
  The vote is open until Monday, July 07, at 20:45 UTC and passes if
  a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.0.1
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  === Differences from RC1 ===
  This release includes only one blocking patch from rc1:
  https://github.com/apache/spark/pull/1255
 
  There are also smaller fixes which came in over the last week.
 
  === About this release ===
  This release fixes a few high-priority bugs in 1.0 and has a variety
  of smaller fixes. The full list is here: http://s.apache.org/b45. Some
  of the more visible patches are:
 
  SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
  SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame
 size.
  SPARK-1790: Support r3 instance types on EC2.
 
  This is the first maintenance release on the 1.0 line. We plan to make
  additional maintenance releases as new fixes come in.
 




[RESULT] [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-11 Thread Patrick Wendell
This vote has passed with 9 +1 votes (5 binding) and 1 -1 vote (0 binding).

+1:
Patrick Wendell*
Mark Hamstra*
DB Tsai
Krishna Sankar
Soren Macbeth
Andrew Or
Matei Zaharia*
Xiangrui Meng*
Tom Graves*

0:

-1:
Gary Malouf


Re: How pySpark works?

2014-07-11 Thread Andrew Or
Hi Egor,

Here are a few answers to your questions:

1) Python needs to be installed on all machines, but not pyspark. The way
the executors get the pyspark code depends on which cluster manager you
use. In standalone mode, your executors need to have the actual python
files in their working directory. In yarn mode, python files are included
in the assembly jar, which is then shipped to your executor containers
through a distributed cache.

2) Pyspark is just a thin wrapper around Spark. When you write a closure in
python, it is shipped to the executors within the task itself the same way
scala closures are shipped. If you use a special library, then all of the
nodes will need to have that library pre-installed.

3) Are you trying to run your C++ code inside the map function? If so,
you need to make sure the compiled code is present in the working directory
on all the executors beforehand for Python to exec it. I haven't done
this before, but there may be a few gotchas in doing this.
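
A minimal, untested sketch of one way to do 3), assuming a hypothetical
pre-built binary named "my_tool": ship it with SparkContext.addFile (addPyFile
is meant for .py files) and invoke it with subprocess inside the map closure.
The path and the binary name are placeholders, not anything Spark provides:

import os
import stat
import subprocess
from pyspark import SparkContext, SparkFiles

sc = SparkContext(appName="NativeBinaryExample")

# Ship the compiled binary to every executor; pure-Python helper modules
# could be shipped similarly with sc.addPyFile("helpers.py").
sc.addFile("/path/to/my_tool")

def run_native(x):
    # Resolve the local copy of the shipped file on this executor.
    binary = SparkFiles.get("my_tool")
    # The execute bit may not survive the transfer, so set it defensively.
    os.chmod(binary, os.stat(binary).st_mode | stat.S_IEXEC)
    return subprocess.check_output([binary, str(x)]).strip()

results = sc.parallelize([1, 2, 3, 4]).map(run_native).collect()

The binary of course has to be built for the executors' platform, and any
shared libraries it needs must already be installed on those machines.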

Maybe others can add more information?

Andrew


2014-07-11 5:50 GMT-07:00 Egor Pahomov pahomov.e...@gmail.com:

 Hi, I want to use pySpark, but can't understand how it works. Documentation
 doesn't provide enough information.

 1) How python shipped to cluster? Should machines in cluster already have
 python?
 2) What happens when I write some python code in map function - is it
 shipped to cluster and just executed on it? How it understand all
 dependencies, which my code need and ship it there? If I use Math in my
 code in map does it mean, that I would ship Math class or some python
 Math on cluster would be used?
 3) I have c++ compiled code. Can I ship this executable with addPyFile
 and just use exec function from python? Would it work?

 --



 Sincerely yours,
 Egor Pakhomov
 Scala Developer, Yandex



Re: How pySpark works?

2014-07-11 Thread Reynold Xin
Also take a look at this:
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals


On Fri, Jul 11, 2014 at 10:29 AM, Andrew Or and...@databricks.com wrote:

 Hi Egor,

 Here are a few answers to your questions:

 1) Python needs to be installed on all machines, but not pyspark. The way
 the executors get the pyspark code depends on which cluster manager you
 use. In standalone mode, your executors need to have the actual python
 files in their working directory. In yarn mode, python files are included
 in the assembly jar, which is then shipped to your executor containers
 through a distributed cache.

 2) Pyspark is just a thin wrapper around Spark. When you write a closure in
 python, it is shipped to the executors within the task itself the same way
 scala closures are shipped. If you use a special library, then all of the
 nodes will need to have that library pre-installed.

 3) Are you trying to run your c++ code inside the map function? If so,
 you need to make sure the compiled code is present in the working directory
 on all the executors before-hand for python to exec it. I haven't done
 this before, but maybe there are a few gotchas in doing this.

 Maybe others can add more information?

 Andrew


 2014-07-11 5:50 GMT-07:00 Egor Pahomov pahomov.e...@gmail.com:

  Hi, I want to use pySpark, but can't understand how it works.
 Documentation
  doesn't provide enough information.
 
  1) How python shipped to cluster? Should machines in cluster already have
  python?
  2) What happens when I write some python code in map function - is it
  shipped to cluster and just executed on it? How it understand all
  dependencies, which my code need and ship it there? If I use Math in my
  code in map does it mean, that I would ship Math class or some python
  Math on cluster would be used?
  3) I have c++ compiled code. Can I ship this executable with addPyFile
  and just use exec function from python? Would it work?
 
  --
 
 
 
  Sincerely yours,
  Egor Pakhomov
  Scala Developer, Yandex
 



Re: Calling Scala/Java methods which operates on RDD

2014-07-11 Thread Kan Zhang
Hi Jai,

Your suspicion is correct. In general, Python RDDs are pickled into byte
arrays and stored in Java land as RDDs of byte arrays. union/zip operate on
the byte arrays directly without deserializing them. Currently, Python byte
arrays only get unpickled into Java objects in special cases, like SQL
functions or saving to Sequence Files (upcoming).
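
To make that concrete: data._jrdd in pyspark is a JavaRDD whose elements are
pickled byte arrays, so a Scala method typed on RDD[Int] cannot consume it
directly. A rough sketch of the file-based handoff the question mentions
(the paths are hypothetical):

from pyspark import SparkContext

sc = SparkContext(appName="HandoffExample")
data = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9])

# On the JVM side this is an RDD of pickled byte arrays, not an RDD[Int]:
print(data._jrdd.rdd().toDebugString())

# Workaround: hand the data over through files instead of an RDD object.
data.saveAsTextFile("hdfs:///tmp/handoff/ints")

# A Scala utility can then do sc.textFile("hdfs:///tmp/handoff/ints").map(_.toInt)
# and apply testFunc to the result, never touching Python's pickle format.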

Hope it helps.

Kan


On Fri, Jul 11, 2014 at 5:04 AM, Jai Kumar Singh fluke...@flukebox.in
wrote:

 HI,
   I want to write some common utility function in Scala and want to call
 the same from Java/Python Spark API ( may be add some wrapper code around
 scala calls). Calling Scala functions from Java works fine. I was reading
 pyspark rdd code and find out that pyspark is able to call JavaRDD function
 like union/zip to get same for pyspark RDD and deserializing the output and
 everything works fine. But somehow I am
 not able to work out really simple example. I think I am missing some
 serialization/deserialization.

 Can someone confirm that is it even possible to do so? Or, would it be much
 easier to pass RDD data files around instead of RDD directly (from pyspark
 to java/scala)?

 For example, below code just add 1 to each element of RDD containing
 Integers.

 package flukebox.test;

 object TestClass{

 def testFunc(data:RDD[Int])={

   data.map(x => x + 1)

 }

 }

 Calling from python,

 from pyspark import RDD

 from py4j.java_gateway import java_import

 java_import(sc._gateway.jvm, "flukebox.test")


 data = sc.parallelize([1,2,3,4,5,6,7,8,9])

 sc._jvm.flukebox.test.TestClass.testFunc(data._jrdd.rdd())


 *This fails because testFunc get any RDD of type Byte Array.*


 Any help/pointer would be highly appreciated.


 Thanks & Regards,

 Jai K Singh



Announcing Spark 1.0.1

2014-07-11 Thread Patrick Wendell
I am happy to announce the availability of Spark 1.0.1! This release
includes contributions from 70 developers. Spark 1.0.1 includes fixes
across several areas of Spark, including the core API, PySpark, and
MLlib. It also includes new features in Spark's (alpha) SQL library,
including support for JSON data and performance and stability fixes.

Visit the release notes[1] to read about this release or download[2]
the release today.

[1] http://spark.apache.org/releases/spark-release-1-0-1.html
[2] http://spark.apache.org/downloads.html


Re: Announcing Spark 1.0.1

2014-07-11 Thread Henry Saputra
Congrats to the Spark community !

On Friday, July 11, 2014, Patrick Wendell pwend...@gmail.com wrote:

 I am happy to announce the availability of Spark 1.0.1! This release
 includes contributions from 70 developers. Spark 1.0.0 includes fixes
 across several areas of Spark, including the core API, PySpark, and
 MLlib. It also includes new features in Spark's (alpha) SQL library,
 including support for JSON data and performance and stability fixes.

 Visit the release notes[1] to read about this release or download[2]
 the release today.

 [1] http://spark.apache.org/releases/spark-release-1-0-1.html
 [2] http://spark.apache.org/downloads.html