Re: Announcing Spark 1.0.1
Congrats to the Spark community! On Friday, July 11, 2014, Patrick Wendell wrote: > I am happy to announce the availability of Spark 1.0.1! This release > includes contributions from 70 developers. Spark 1.0.1 includes fixes > across several areas of Spark, including the core API, PySpark, and > MLlib. It also includes new features in Spark's (alpha) SQL library, > including support for JSON data and performance and stability fixes. > > Visit the release notes[1] to read about this release or download[2] > the release today. > > [1] http://spark.apache.org/releases/spark-release-1-0-1.html > [2] http://spark.apache.org/downloads.html >
Announcing Spark 1.0.1
I am happy to announce the availability of Spark 1.0.1! This release includes contributions from 70 developers. Spark 1.0.1 includes fixes across several areas of Spark, including the core API, PySpark, and MLlib. It also includes new features in Spark's (alpha) SQL library, including support for JSON data and performance and stability fixes. Visit the release notes[1] to read about this release or download[2] the release today. [1] http://spark.apache.org/releases/spark-release-1-0-1.html [2] http://spark.apache.org/downloads.html
Re: Calling Scala/Java methods which operate on RDDs
Hi Jai, Your suspicion is correct. In general, Python RDDs are pickled into byte arrays and stored in Java land as RDDs of byte arrays. union/zip operates on byte arrays directly without deserializing. Currently, Python byte arrays only get unpickled into Java objects in special cases, like SQL functions or saving to Sequence Files (upcoming). Hope it helps. Kan On Fri, Jul 11, 2014 at 5:04 AM, Jai Kumar Singh wrote: > HI, > I want to write some common utility function in Scala and want to call > the same from Java/Python Spark API ( may be add some wrapper code around > scala calls). Calling Scala functions from Java works fine. I was reading > pyspark rdd code and find out that pyspark is able to call JavaRDD function > like union/zip to get same for pyspark RDD and deserializing the output and > everything works fine. But somehow I am > not able to work out really simple example. I think I am missing some > serialization/deserialization. > > Can someone confirm that is it even possible to do so? Or, would it be much > easier to pass RDD data files around instead of RDD directly (from pyspark > to java/scala)? > > For example, below code just add 1 to each element of RDD containing > Integers. > > package flukebox.test; > > object TestClass{ > > def testFunc(data:RDD[Int])={ > > data.map(x => x+1) > > } > > } > > Calling from python, > > from pyspark import RDD > > from py4j.java_gateway import java_import > > java_import(sc._gateway.jvm, "flukebox.test") > > > data = sc.parallelize([1,2,3,4,5,6,7,8,9]) > > sc._jvm.flukebox.test.TestClass.testFunc(data._jrdd.rdd()) > > > *This fails because testFunc get any RDD of type Byte Array.* > > > Any help/pointer would be highly appreciated. > > > Thanks & Regards, > > Jai K Singh >
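A minimal PySpark sketch of the file-based hand-off Jai mentions, which avoids the pickled-byte-array mismatch entirely (the output path and the Scala-side snippet in the comments are hypothetical):

from pyspark import SparkContext

sc = SparkContext(appName="interop-sketch")

# Built on the Python side; the backing Java RDD (data._jrdd) holds
# pickled byte arrays, so a Scala method typed RDD[Int] cannot consume it.
data = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Hand the data over as plain text instead (output path is hypothetical).
data.map(str).saveAsTextFile("hdfs:///tmp/interop/ints")

# On the Scala side, read it back and call the utility, e.g.:
#   val ints = sc.textFile("hdfs:///tmp/interop/ints").map(_.toInt)
#   flukebox.test.TestClass.testFunc(ints)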
Re: How does PySpark work?
Also take a look at this: https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals On Fri, Jul 11, 2014 at 10:29 AM, Andrew Or wrote: > Hi Egor, > > Here are a few answers to your questions: > > 1) Python needs to be installed on all machines, but not pyspark. The way > the executors get the pyspark code depends on which cluster manager you > use. In standalone mode, your executors need to have the actual python > files in their working directory. In yarn mode, python files are included > in the assembly jar, which is then shipped to your executor containers > through a distributed cache. > > 2) Pyspark is just a thin wrapper around Spark. When you write a closure in > python, it is shipped to the executors within the task itself the same way > scala closures are shipped. If you use a special library, then all of the > nodes will need to have that library pre-installed. > > 3) Are you trying to run your c++ code inside the "map" function? If so, > you need to make sure the compiled code is present in the working directory > on all the executors before-hand for python to "exec" it. I haven't done > this before, but maybe there are a few gotchas in doing this. > > Maybe others can add more information? > > Andrew > > > 2014-07-11 5:50 GMT-07:00 Egor Pahomov : > > > Hi, I want to use pySpark, but can't understand how it works. > Documentation > > doesn't provide enough information. > > > > 1) How python shipped to cluster? Should machines in cluster already have > > python? > > 2) What happens when I write some python code in "map" function - is it > > shipped to cluster and just executed on it? How it understand all > > dependencies, which my code need and ship it there? If I use Math in my > > code in "map" does it mean, that I would ship Math class or some python > > Math on cluster would be used? > > 3) I have c++ compiled code. Can I ship this executable with "addPyFile" > > and just use "exec" function from python? Would it work? > > > > -- > > > > > > > > *Sincerely yoursEgor PakhomovScala Developer, Yandex* > > >
Re: How does PySpark work?
Hi Egor, Here are a few answers to your questions: 1) Python needs to be installed on all machines, but not PySpark. The way the executors get the PySpark code depends on which cluster manager you use. In standalone mode, your executors need to have the actual Python files in their working directory. In YARN mode, Python files are included in the assembly jar, which is then shipped to your executor containers through a distributed cache. 2) PySpark is just a thin wrapper around Spark. When you write a closure in Python, it is shipped to the executors within the task itself, the same way Scala closures are shipped. If you use a special library, then all of the nodes will need to have that library pre-installed. 3) Are you trying to run your C++ code inside the "map" function? If so, you need to make sure the compiled code is present in the working directory on all the executors beforehand for Python to "exec" it. I haven't done this before, but maybe there are a few gotchas in doing this. Maybe others can add more information? Andrew 2014-07-11 5:50 GMT-07:00 Egor Pahomov : > Hi, I want to use pySpark, but can't understand how it works. Documentation > doesn't provide enough information. > > 1) How python shipped to cluster? Should machines in cluster already have > python? > 2) What happens when I write some python code in "map" function - is it > shipped to cluster and just executed on it? How it understand all > dependencies, which my code need and ship it there? If I use Math in my > code in "map" does it mean, that I would ship Math class or some python > Math on cluster would be used? > 3) I have c++ compiled code. Can I ship this executable with "addPyFile" > and just use "exec" function from python? Would it work? > > -- > > > > Sincerely yours, Egor Pakhomov, Scala Developer, Yandex >
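For question 3, a minimal sketch of one way to do it (the binary name and path are hypothetical): ship the compiled binary to every executor with addFile (addPyFile is meant for Python modules), locate the local copy with SparkFiles, and invoke it from the map function with subprocess, since Python's exec runs Python source rather than native executables:

import os
import stat
import subprocess
from pyspark import SparkContext, SparkFiles

sc = SparkContext(appName="native-binary-sketch")

# Ship the compiled C++ binary to every executor (path is hypothetical).
sc.addFile("/path/to/my_native_tool")

def run_tool(x):
    # Resolve the local copy of the shipped file on the executor.
    binary = SparkFiles.get("my_native_tool")
    # The execute bit may not survive shipping, so restore it (a possible gotcha).
    os.chmod(binary, os.stat(binary).st_mode | stat.S_IEXEC)
    # Run the native executable; Python's exec() only runs Python source.
    return subprocess.check_output([binary, str(x)]).strip()

results = sc.parallelize([1, 2, 3]).map(run_tool).collect()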
[RESULT] [VOTE] Release Apache Spark 1.0.1 (RC2)
This vote has passed with 9 +1 votes (5 binding) and 1 -1 vote (0 binding). +1: Patrick Wendell* Mark Hamstra* DB Tsai Krishna Sankar Soren Macbeth Andrew Or Matei Zaharia* Xiangrui Meng* Tom Graves* 0: -1: Gary Malouf
Re: [VOTE] Release Apache Spark 1.0.1 (RC2)
Okay just FYI - I'm closing this vote since many people are waiting on the release and I was hoping to package it today. If we find a reproducible Mesos issue here, we can definitely spin the fix into a subsequent release. On Fri, Jul 11, 2014 at 9:37 AM, Patrick Wendell wrote: > Hey Gary, > > Why do you think the akka frame size changed? It didn't change - we > added some fixes for cases where users were setting non-default > values. > > On Fri, Jul 11, 2014 at 9:31 AM, Gary Malouf wrote: >> Hi Matei, >> >> We have not had time to re-deploy the rc today, but one thing that jumps >> out is the shrinking of the default akka frame size from 10MB to around >> 128KB by default. That is my first suspicion for our issue - could imagine >> that biting others as well. >> >> I'll try to re-test that today - either way, understand moving forward at >> this point. >> >> Gary >> >> >> On Fri, Jul 11, 2014 at 12:08 PM, Matei Zaharia >> wrote: >> >>> Unless you can diagnose the problem quickly, Gary, I think we need to go >>> ahead with this release as is. This release didn't touch the Mesos support >>> as far as I know, so the problem might be a nondeterministic issue with >>> your application. But on the other hand the release does fix some critical >>> bugs that affect all users. We can always do 1.0.2 later if we discover a >>> problem. >>> >>> Matei >>> >>> On Jul 10, 2014, at 9:40 PM, Patrick Wendell wrote: >>> >>> > Hey Gary, >>> > >>> > The vote technically doesn't close until I send the vote summary >>> > e-mail, but I was planning to close and package this tonight. It's too >>> > bad if there is a regression, it might be worth holding the release >>> > but it really requires narrowing down the issue to get more >>> > information about the scope and severity. Could you fork another >>> > thread for this? >>> > >>> > - Patrick >>> > >>> > On Thu, Jul 10, 2014 at 6:28 PM, Gary Malouf >>> wrote: >>> >> -1 I honestly do not know the voting rules for the Spark community, so >>> >> please excuse me if I am out of line or if Mesos compatibility is not a >>> >> concern at this point. >>> >> >>> >> We just tried to run this version built against 2.3.0-cdh5.0.2 on mesos >>> >> 0.18.2. All of our jobs with data above a few gigabytes hung >>> indefinitely. >>> >> Downgrading back to the 1.0.0 stable release of Spark built the same way >>> >> worked for us. >>> >> >>> >> >>> >> On Mon, Jul 7, 2014 at 5:17 PM, Tom Graves >> > >>> >> wrote: >>> >> >>> >>> +1. Ran some Spark on yarn jobs on a hadoop 2.4 cluster with >>> >>> authentication on. >>> >>> >>> >>> Tom >>> >>> >>> >>> >>> >>> On Friday, July 4, 2014 2:39 PM, Patrick Wendell >>> >>> wrote: >>> >>> >>> >>> >>> >>> >>> >>> Please vote on releasing the following candidate as Apache Spark >>> version >>> >>> 1.0.1! >>> >>> >>> >>> The tag to be voted on is v1.0.1-rc1 (commit 7d1043c): >>> >>> >>> >>> >>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7d1043c99303b87aef8ee19873629c2bfba4cc78 >>> >>> >>> >>> The release files, including signatures, digests, etc. 
can be found at: >>> >>> http://people.apache.org/~pwendell/spark-1.0.1-rc2/ >>> >>> >>> >>> Release artifacts are signed with the following key: >>> >>> https://people.apache.org/keys/committer/pwendell.asc >>> >>> >>> >>> The staging repository for this release can be found at: >>> >>> >>> https://repository.apache.org/content/repositories/orgapachespark-1021/ >>> >>> >>> >>> The documentation corresponding to this release can be found at: >>> >>> http://people.apache.org/~pwendell/spark-1.0.1-rc2-docs/ >>> >>> >>> >>> Please vote on releasing this package as Apache Spark 1.0.1! >>> >>> >>> >>> The vote is open until Monday, July 07, at 20:45 UTC and passes if >>> >>> a majority of at least 3 +1 PMC votes are cast. >>> >>> >>> >>> [ ] +1 Release this package as Apache Spark 1.0.1 >>> >>> [ ] -1 Do not release this package because ... >>> >>> >>> >>> To learn more about Apache Spark, please see >>> >>> http://spark.apache.org/ >>> >>> >>> >>> === Differences from RC1 === >>> >>> This release includes only one "blocking" patch from rc1: >>> >>> https://github.com/apache/spark/pull/1255 >>> >>> >>> >>> There are also smaller fixes which came in over the last week. >>> >>> >>> >>> === About this release === >>> >>> This release fixes a few high-priority bugs in 1.0 and has a variety >>> >>> of smaller fixes. The full list is here: http://s.apache.org/b45. Some >>> >>> of the more visible patches are: >>> >>> >>> >>> SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys >>> >>> SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame >>> size. >>> >>> SPARK-1790: Support r3 instance types on EC2. >>> >>> >>> >>> This is the first maintenance release on the 1.0 line. We plan to make >>> >>> additional maintenance releases as new fixes come in. >>> >>> >>> >>>
Re: [VOTE] Release Apache Spark 1.0.1 (RC2)
Hey Gary, Why do you think the akka frame size changed? It didn't change - we added some fixes for cases where users were setting non-default values. On Fri, Jul 11, 2014 at 9:31 AM, Gary Malouf wrote: > Hi Matei, > > We have not had time to re-deploy the rc today, but one thing that jumps > out is the shrinking of the default akka frame size from 10MB to around > 128KB by default. That is my first suspicion for our issue - could imagine > that biting others as well. > > I'll try to re-test that today - either way, understand moving forward at > this point. > > Gary > > > On Fri, Jul 11, 2014 at 12:08 PM, Matei Zaharia > wrote: > >> Unless you can diagnose the problem quickly, Gary, I think we need to go >> ahead with this release as is. This release didn't touch the Mesos support >> as far as I know, so the problem might be a nondeterministic issue with >> your application. But on the other hand the release does fix some critical >> bugs that affect all users. We can always do 1.0.2 later if we discover a >> problem. >> >> Matei >> >> On Jul 10, 2014, at 9:40 PM, Patrick Wendell wrote: >> >> > Hey Gary, >> > >> > The vote technically doesn't close until I send the vote summary >> > e-mail, but I was planning to close and package this tonight. It's too >> > bad if there is a regression, it might be worth holding the release >> > but it really requires narrowing down the issue to get more >> > information about the scope and severity. Could you fork another >> > thread for this? >> > >> > - Patrick >> > >> > On Thu, Jul 10, 2014 at 6:28 PM, Gary Malouf >> wrote: >> >> -1 I honestly do not know the voting rules for the Spark community, so >> >> please excuse me if I am out of line or if Mesos compatibility is not a >> >> concern at this point. >> >> >> >> We just tried to run this version built against 2.3.0-cdh5.0.2 on mesos >> >> 0.18.2. All of our jobs with data above a few gigabytes hung >> indefinitely. >> >> Downgrading back to the 1.0.0 stable release of Spark built the same way >> >> worked for us. >> >> >> >> >> >> On Mon, Jul 7, 2014 at 5:17 PM, Tom Graves > > >> >> wrote: >> >> >> >>> +1. Ran some Spark on yarn jobs on a hadoop 2.4 cluster with >> >>> authentication on. >> >>> >> >>> Tom >> >>> >> >>> >> >>> On Friday, July 4, 2014 2:39 PM, Patrick Wendell >> >>> wrote: >> >>> >> >>> >> >>> >> >>> Please vote on releasing the following candidate as Apache Spark >> version >> >>> 1.0.1! >> >>> >> >>> The tag to be voted on is v1.0.1-rc1 (commit 7d1043c): >> >>> >> >>> >> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7d1043c99303b87aef8ee19873629c2bfba4cc78 >> >>> >> >>> The release files, including signatures, digests, etc. can be found at: >> >>> http://people.apache.org/~pwendell/spark-1.0.1-rc2/ >> >>> >> >>> Release artifacts are signed with the following key: >> >>> https://people.apache.org/keys/committer/pwendell.asc >> >>> >> >>> The staging repository for this release can be found at: >> >>> >> https://repository.apache.org/content/repositories/orgapachespark-1021/ >> >>> >> >>> The documentation corresponding to this release can be found at: >> >>> http://people.apache.org/~pwendell/spark-1.0.1-rc2-docs/ >> >>> >> >>> Please vote on releasing this package as Apache Spark 1.0.1! >> >>> >> >>> The vote is open until Monday, July 07, at 20:45 UTC and passes if >> >>> a majority of at least 3 +1 PMC votes are cast. >> >>> >> >>> [ ] +1 Release this package as Apache Spark 1.0.1 >> >>> [ ] -1 Do not release this package because ... 
>> >>> >> >>> To learn more about Apache Spark, please see >> >>> http://spark.apache.org/ >> >>> >> >>> === Differences from RC1 === >> >>> This release includes only one "blocking" patch from rc1: >> >>> https://github.com/apache/spark/pull/1255 >> >>> >> >>> There are also smaller fixes which came in over the last week. >> >>> >> >>> === About this release === >> >>> This release fixes a few high-priority bugs in 1.0 and has a variety >> >>> of smaller fixes. The full list is here: http://s.apache.org/b45. Some >> >>> of the more visible patches are: >> >>> >> >>> SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys >> >>> SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame >> size. >> >>> SPARK-1790: Support r3 instance types on EC2. >> >>> >> >>> This is the first maintenance release on the 1.0 line. We plan to make >> >>> additional maintenance releases as new fixes come in. >> >>> >> >>
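Users who suspect frame-size trouble in their own jobs can pin the setting explicitly instead of relying on the default. A small sketch; the property value is interpreted in megabytes, and 10 is the documented 1.0.x default:

from pyspark import SparkConf, SparkContext

# Pin the Akka frame size explicitly; the value is interpreted in MB.
conf = SparkConf().set("spark.akka.frameSize", "10")
sc = SparkContext(conf=conf)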
Re: [VOTE] Release Apache Spark 1.0.1 (RC2)
Hi Matei, We have not had time to re-deploy the rc today, but one thing that jumps out is the shrinking of the default akka frame size from 10MB to around 128KB by default. That is my first suspicion for our issue - could imagine that biting others as well. I'll try to re-test that today - either way, understand moving forward at this point. Gary On Fri, Jul 11, 2014 at 12:08 PM, Matei Zaharia wrote: > Unless you can diagnose the problem quickly, Gary, I think we need to go > ahead with this release as is. This release didn't touch the Mesos support > as far as I know, so the problem might be a nondeterministic issue with > your application. But on the other hand the release does fix some critical > bugs that affect all users. We can always do 1.0.2 later if we discover a > problem. > > Matei > > On Jul 10, 2014, at 9:40 PM, Patrick Wendell wrote: > > > Hey Gary, > > > > The vote technically doesn't close until I send the vote summary > > e-mail, but I was planning to close and package this tonight. It's too > > bad if there is a regression, it might be worth holding the release > > but it really requires narrowing down the issue to get more > > information about the scope and severity. Could you fork another > > thread for this? > > > > - Patrick > > > > On Thu, Jul 10, 2014 at 6:28 PM, Gary Malouf > wrote: > >> -1 I honestly do not know the voting rules for the Spark community, so > >> please excuse me if I am out of line or if Mesos compatibility is not a > >> concern at this point. > >> > >> We just tried to run this version built against 2.3.0-cdh5.0.2 on mesos > >> 0.18.2. All of our jobs with data above a few gigabytes hung > indefinitely. > >> Downgrading back to the 1.0.0 stable release of Spark built the same way > >> worked for us. > >> > >> > >> On Mon, Jul 7, 2014 at 5:17 PM, Tom Graves > > >> wrote: > >> > >>> +1. Ran some Spark on yarn jobs on a hadoop 2.4 cluster with > >>> authentication on. > >>> > >>> Tom > >>> > >>> > >>> On Friday, July 4, 2014 2:39 PM, Patrick Wendell > >>> wrote: > >>> > >>> > >>> > >>> Please vote on releasing the following candidate as Apache Spark > version > >>> 1.0.1! > >>> > >>> The tag to be voted on is v1.0.1-rc1 (commit 7d1043c): > >>> > >>> > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7d1043c99303b87aef8ee19873629c2bfba4cc78 > >>> > >>> The release files, including signatures, digests, etc. can be found at: > >>> http://people.apache.org/~pwendell/spark-1.0.1-rc2/ > >>> > >>> Release artifacts are signed with the following key: > >>> https://people.apache.org/keys/committer/pwendell.asc > >>> > >>> The staging repository for this release can be found at: > >>> > https://repository.apache.org/content/repositories/orgapachespark-1021/ > >>> > >>> The documentation corresponding to this release can be found at: > >>> http://people.apache.org/~pwendell/spark-1.0.1-rc2-docs/ > >>> > >>> Please vote on releasing this package as Apache Spark 1.0.1! > >>> > >>> The vote is open until Monday, July 07, at 20:45 UTC and passes if > >>> a majority of at least 3 +1 PMC votes are cast. > >>> > >>> [ ] +1 Release this package as Apache Spark 1.0.1 > >>> [ ] -1 Do not release this package because ... > >>> > >>> To learn more about Apache Spark, please see > >>> http://spark.apache.org/ > >>> > >>> === Differences from RC1 === > >>> This release includes only one "blocking" patch from rc1: > >>> https://github.com/apache/spark/pull/1255 > >>> > >>> There are also smaller fixes which came in over the last week. 
> >>> > >>> === About this release === > >>> This release fixes a few high-priority bugs in 1.0 and has a variety > >>> of smaller fixes. The full list is here: http://s.apache.org/b45. Some > >>> of the more visible patches are: > >>> > >>> SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys > >>> SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame > size. > >>> SPARK-1790: Support r3 instance types on EC2. > >>> > >>> This is the first maintenance release on the 1.0 line. We plan to make > >>> additional maintenance releases as new fixes come in. > >>> > >
Re: [VOTE] Release Apache Spark 1.0.1 (RC2)
Unless you can diagnose the problem quickly, Gary, I think we need to go ahead with this release as is. This release didn't touch the Mesos support as far as I know, so the problem might be a nondeterministic issue with your application. But on the other hand the release does fix some critical bugs that affect all users. We can always do 1.0.2 later if we discover a problem. Matei On Jul 10, 2014, at 9:40 PM, Patrick Wendell wrote: > Hey Gary, > > The vote technically doesn't close until I send the vote summary > e-mail, but I was planning to close and package this tonight. It's too > bad if there is a regression, it might be worth holding the release > but it really requires narrowing down the issue to get more > information about the scope and severity. Could you fork another > thread for this? > > - Patrick > > On Thu, Jul 10, 2014 at 6:28 PM, Gary Malouf wrote: >> -1 I honestly do not know the voting rules for the Spark community, so >> please excuse me if I am out of line or if Mesos compatibility is not a >> concern at this point. >> >> We just tried to run this version built against 2.3.0-cdh5.0.2 on mesos >> 0.18.2. All of our jobs with data above a few gigabytes hung indefinitely. >> Downgrading back to the 1.0.0 stable release of Spark built the same way >> worked for us. >> >> >> On Mon, Jul 7, 2014 at 5:17 PM, Tom Graves >> wrote: >> >>> +1. Ran some Spark on yarn jobs on a hadoop 2.4 cluster with >>> authentication on. >>> >>> Tom >>> >>> >>> On Friday, July 4, 2014 2:39 PM, Patrick Wendell >>> wrote: >>> >>> >>> >>> Please vote on releasing the following candidate as Apache Spark version >>> 1.0.1! >>> >>> The tag to be voted on is v1.0.1-rc1 (commit 7d1043c): >>> >>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7d1043c99303b87aef8ee19873629c2bfba4cc78 >>> >>> The release files, including signatures, digests, etc. can be found at: >>> http://people.apache.org/~pwendell/spark-1.0.1-rc2/ >>> >>> Release artifacts are signed with the following key: >>> https://people.apache.org/keys/committer/pwendell.asc >>> >>> The staging repository for this release can be found at: >>> https://repository.apache.org/content/repositories/orgapachespark-1021/ >>> >>> The documentation corresponding to this release can be found at: >>> http://people.apache.org/~pwendell/spark-1.0.1-rc2-docs/ >>> >>> Please vote on releasing this package as Apache Spark 1.0.1! >>> >>> The vote is open until Monday, July 07, at 20:45 UTC and passes if >>> a majority of at least 3 +1 PMC votes are cast. >>> >>> [ ] +1 Release this package as Apache Spark 1.0.1 >>> [ ] -1 Do not release this package because ... >>> >>> To learn more about Apache Spark, please see >>> http://spark.apache.org/ >>> >>> === Differences from RC1 === >>> This release includes only one "blocking" patch from rc1: >>> https://github.com/apache/spark/pull/1255 >>> >>> There are also smaller fixes which came in over the last week. >>> >>> === About this release === >>> This release fixes a few high-priority bugs in 1.0 and has a variety >>> of smaller fixes. The full list is here: http://s.apache.org/b45. Some >>> of the more visible patches are: >>> >>> SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys >>> SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame size. >>> SPARK-1790: Support r3 instance types on EC2. >>> >>> This is the first maintenance release on the 1.0 line. We plan to make >>> additional maintenance releases as new fixes come in. >>>
Calling Scala/Java methods which operate on RDDs
Hi, I want to write some common utility functions in Scala and call them from the Java/Python Spark APIs (maybe adding some wrapper code around the Scala calls). Calling Scala functions from Java works fine. I was reading the PySpark RDD code and found that PySpark is able to call JavaRDD functions like union/zip to implement the same operations for PySpark RDDs, deserializing the output, and everything works fine. But somehow I am not able to get a really simple example to work. I think I am missing some serialization/deserialization step. Can someone confirm whether this is even possible? Or would it be much easier to pass RDD data files around instead of RDDs directly (from PySpark to Java/Scala)? For example, the code below just adds 1 to each element of an RDD of integers.

package flukebox.test

import org.apache.spark.rdd.RDD

object TestClass {
  def testFunc(data: RDD[Int]) = {
    data.map(x => x + 1)
  }
}

Calling from Python:

from pyspark import RDD
from py4j.java_gateway import java_import

java_import(sc._gateway.jvm, "flukebox.test")

data = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9])
sc._jvm.flukebox.test.TestClass.testFunc(data._jrdd.rdd())

This fails because testFunc gets an RDD of byte arrays. Any help/pointers would be highly appreciated. Thanks & Regards, Jai K Singh
Re: Random forest - is it under implementation?
Great. Then one question left: what would you recommend for implementation? 2014-07-11 17:43 GMT+04:00 Chester At Work : > Sung Chung from Alpine Data Labs presented the random forest > implementation at Spark Summit 2014. The work will be open sourced and > contributed back to MLlib. > > Stay tuned > > > > Sent from my iPad > > On Jul 11, 2014, at 6:02 AM, Egor Pahomov wrote: > > > Hi, I have intern, who wants to implement some ML algorithm for spark. > > Which algorithm would be good idea to implement(it should be not very > > difficult)? I heard someone already working on random forest, but > couldn't > > find proof of that. > > > > I'm aware of new politics, where we should implement stable, good > quality, > > popular ML or do not do it at all. > > > > -- > > > > > > > > Sincerely yours, Egor Pakhomov, Scala Developer, Yandex > -- Sincerely yours, Egor Pakhomov, Scala Developer, Yandex
Re: Random forest - is it under implementation?
Sung Chung from Alpine Data Labs presented the random forest implementation at Spark Summit 2014. The work will be open sourced and contributed back to MLlib. Stay tuned. Sent from my iPad On Jul 11, 2014, at 6:02 AM, Egor Pahomov wrote: > Hi, I have intern, who wants to implement some ML algorithm for spark. > Which algorithm would be good idea to implement(it should be not very > difficult)? I heard someone already working on random forest, but couldn't > find proof of that. > > I'm aware of new politics, where we should implement stable, good quality, > popular ML or do not do it at all. > > -- > > > > Sincerely yours, Egor Pakhomov, Scala Developer, Yandex
Random forest - is it under implementation?
Hi, I have an intern who wants to implement an ML algorithm for Spark. Which algorithm would be a good idea to implement (it should not be very difficult)? I heard someone is already working on random forest, but I couldn't find proof of that. I'm aware of the new policy that we should implement stable, good-quality, popular ML algorithms or not do it at all. -- Sincerely yours, Egor Pakhomov, Scala Developer, Yandex
How does PySpark work?
Hi, I want to use PySpark, but I can't understand how it works. The documentation doesn't provide enough information. 1) How is Python shipped to the cluster? Do the machines in the cluster already need to have Python installed? 2) What happens when I write some Python code in a "map" function - is it shipped to the cluster and just executed there? How does it figure out all the dependencies my code needs and ship them there? If I use Math in my code in "map", does that mean I would ship the Math class, or would some Python Math on the cluster be used? 3) I have compiled C++ code. Can I ship this executable with "addPyFile" and just use the "exec" function from Python? Would it work? -- Sincerely yours, Egor Pakhomov, Scala Developer, Yandex