[MLlib] Extensibility of MLlib classes (Word2VecModel etc.)
Hey, I'm trying to implement doc2vec (http://cs.stanford.edu/~quocle/paragraph_vector.pdf), mainly for sport/research purposes given all its limitations, so I would probably not even try to PR it into MLlib itself. To do that, though, it would be very useful to have access to MLlib's Word2VecModel class, which is mostly private. Is there any reason for that (i.e. some Spark/MLlib guideline), or would it be OK to refactor the code and open a PR? I found a similar JIRA issue posted almost a year ago, but for some reason it was closed: https://issues.apache.org/jira/browse/SPARK-4101. Mateusz
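As a stop-gap that stays on the public MLlib API, one can approximate a document vector by averaging word vectors obtained from Word2VecModel.getVectors. This is only an illustrative sketch (simple averaging, not the paper's trained paragraph vectors), and the helper name is made up.

```scala
import org.apache.spark.mllib.feature.Word2VecModel

// Illustrative only: averages word vectors as a crude stand-in for a learned
// paragraph vector. Words missing from the vocabulary are skipped.
def averageDocVector(model: Word2VecModel, doc: Seq[String]): Option[Array[Float]] = {
  val vocab = model.getVectors                 // Map[String, Array[Float]], public API
  val vecs  = doc.flatMap(vocab.get)
  if (vecs.isEmpty) None
  else {
    val dim = vecs.head.length
    val sum = new Array[Float](dim)
    for (v <- vecs; i <- 0 until dim) sum(i) += v(i)
    Some(sum.map(_ / vecs.length))
  }
}
```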
Re: Did the 1.5 release complete?
Dev/user announcement was made just now. For Maven, I did publish it this afternoon (so it's been a few hours). If it is still not there tomorrow morning, I will look into it. On Wed, Sep 9, 2015 at 2:42 AM, Sean Owen wrote: > I saw the end of the RC3 vote: > > https://mail-archives.apache.org/mod_mbox/spark-dev/201509.mbox/%3CCAPh_B%3DbQWf_vVuPs_eRpvnNSj8fbULX4kULnbs6MCAA10ZQ9eQ%40mail.gmail.com%3E > > but there are no artifacts for it in Maven: > > http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22org.apache.spark%22%20AND%20a%3A%22spark-parent_2.10%22 > > and I don't see any announcement at dev@ > https://mail-archives.apache.org/mod_mbox/spark-dev/201509.mbox/browser > > But it was announced here just now: > https://databricks.com/blog/2015/09/09/announcing-spark-1-5.html > > Did I miss something?
Re: [ANNOUNCE] Announcing Spark 1.5.0
Great work, everyone! -- Yu Ishikawa
Re: looking for a technical reviewer to review a book on Spark
Hi Mohammad: I'm interested. Thanks, Guru Yeleswarapu From: Mohammed Guller To: "dev@spark.apache.org" Sent: Wednesday, September 9, 2015 8:36 AM Subject: looking for a technical reviewer to review a book on Spark Hi Spark developers, I am writing a book on Spark. The publisher of the book is looking for a technical reviewer. You will be compensated for your time. The publisher will pay a flat rate per page for the review. I spoke with Matei Zaharia about this and he suggested that I send an email to the dev mailing list. The book covers Spark core and the Spark libraries, including Spark SQL, Spark Streaming, MLlib, Spark ML, and GraphX. It also covers operational aspects such as deployment with different cluster managers and monitoring. Please let me know if you are interested and I will connect you with the publisher. Thanks, Mohammed Principal Architect, Glassbeam Inc, www.glassbeam.com, 5201 Great America Parkway, Suite 360, Santa Clara, CA 95054 p: +1.408.740.4610, m: +1.925.786.7521, f: +1.408.740.4601, skype: mguller
Re: looking for a technical reviewer to review a book on Spark
My Apologies for broadcast! That email was meant for Mohammad. From: Gurumurthy Yeleswarapu To: Mohammed Guller; "dev@spark.apache.org" Sent: Wednesday, September 9, 2015 8:50 AM Subject: Re: looking for a technical reviewer to review a book on Spark
RE: (Spark SQL) partition-scoped UDF
Follow-up: solved this problem by overriding the model's `transform` method and using `mapPartitions` to produce a new DataFrame, rather than using `udf`. Source code: https://github.com/deeplearning4j/deeplearning4j/blob/135d3b25b96c21349abf488a44f59bb37a2a5930/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/ml/classification/MultiLayerNetworkClassification.scala#L143 Thanks Reynold for your time. -Eron Date: Sat, 5 Sep 2015 13:55:34 -0700 Subject: Re: (Spark SQL) partition-scoped UDF From: ewri...@live.com To: r...@databricks.com CC: dev@spark.apache.org The transformer is a classification model produced by the NeuralNetClassification estimator of dl4j-spark-ml. Source code here. The neural net operates most efficiently when many examples are classified in batch. I imagine overriding `transform` rather than `predictRaw`. Does anyone know of a solution compatible with Spark 1.4 or 1.5? Thanks again! From: Reynold Xin Date: Friday, September 4, 2015 at 5:19 PM To: Eron Wright Cc: "dev@spark.apache.org" Subject: Re: (Spark SQL) partition-scoped UDF Can you say more about your transformer? This is a good idea, and indeed we are doing it for R already (the latest way to run UDFs in R is to pass the entire partition as a local R data frame for users to operate on). However, what works for R for simple data processing might not work for your high-performance transformer, etc. On Fri, Sep 4, 2015 at 7:08 AM, Eron Wright wrote: Transformers in Spark ML typically operate on a per-row basis, based on callUDF. For a new transformer that I'm developing, I have a need to transform an entire partition with a function, as opposed to transforming each row separately. The reason is that, in my case, rows must be transformed in batch to amortize some overhead. How may I accomplish this? One option appears to be to invoke DataFrame::mapPartitions, yielding an RDD that is then converted back to a DataFrame. Unsure about the viability or consequences of that. Thanks! Eron Wright
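For reference, here is a minimal sketch of the mapPartitions approach described above, written against the Spark 1.4/1.5 DataFrame API. The `scoreBatch` helper is a hypothetical stand-in for a batch classifier (e.g. the dl4j network's feed-forward), and the column names are placeholders; everything else uses standard DataFrame/RDD calls.

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Hypothetical batch scorer: classifies a whole batch of rows at once so the
// model's per-call overhead is amortized.
def scoreBatch(rows: Seq[Row], featuresCol: String): Seq[Double] = ???

// Partition-scoped "transform" sketch: treat each partition as one batch,
// append a prediction column, and rebuild a DataFrame with the extended schema.
def transformByPartition(df: DataFrame, featuresCol: String, outputCol: String): DataFrame = {
  val outSchema = StructType(df.schema.fields :+ StructField(outputCol, DoubleType, nullable = false))
  val scored = df.rdd.mapPartitions { it =>
    val rows  = it.toVector                      // materialize (or chunk) the partition
    val preds = scoreBatch(rows, featuresCol)    // one model call per batch, not per row
    rows.iterator.zip(preds.iterator).map { case (r, p) => Row.fromSeq(r.toSeq :+ p) }
  }
  df.sqlContext.createDataFrame(scored, outSchema)
}
```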
Re: Deserializing JSON into Scala objects in Java code
Marcelo and Christopher, Thanks for your help! The problem turned out to arise from a different part of the code (we have multiple ObjectMappers), but because I am not very familiar with Jackson I had thought there was a problem with the Scala module. Thank you again, Kevin From: Christopher Currie Date: Wednesday, September 9, 2015 at 10:17 AM To: Kevin Chen, "dev@spark.apache.org" Cc: Matt Cheah, Mingyu Kim Subject: Fwd: Deserializing JSON into Scala objects in Java code Kevin, I'm not a Spark dev, but I maintain the Scala module for Jackson. If you're continuing to have issues with parsing JSON using the Spark Scala datatypes, let me know or chime in on the Jackson mailing list (jackson-u...@googlegroups.com) and I'll see what I can do to help. Christopher Currie -- Forwarded message -- From: Paul Brown Date: Tue, Sep 8, 2015 at 8:58 PM Subject: Fwd: Deserializing JSON into Scala objects in Java code To: Christopher Currie Passing along. -- Forwarded message -- From: Kevin Chen Date: Tuesday, September 8, 2015 Subject: Deserializing JSON into Scala objects in Java code To: "dev@spark.apache.org" Cc: Matt Cheah, Mingyu Kim Hello Spark Devs, I am trying to use the new Spark API JSON endpoints at /api/v1/[path] (added in SPARK-3454). In order to minimize maintenance on our end, I would like to use Retrofit/Jackson to parse the JSON directly into the Scala classes in org/apache/spark/status/api/v1/api.scala (ApplicationInfo, ApplicationAttemptInfo, etc.). However, Jackson does not seem to know how to handle Scala Seqs, and throws an error when trying to parse the attempts: Seq[ApplicationAttemptInfo] field of ApplicationInfo. Our codebase is in Java. My questions are: 1. Do you have any recommendations on how to easily deserialize these Scala objects from JSON? For example, do you have any current usage examples of SPARK-3454 with Java? 2. Alternatively, are you committed to the JSON formats of /api/v1/[path]? I would guess so, because of the 'v1', but wanted to confirm. If so, I could deserialize the JSON into instances of my own Java classes instead, without worrying about changing the class structure later due to changes in the Spark API. Some further information: * The error I am getting with Jackson when trying to deserialize the JSON into ApplicationInfo is: Caused by: com.fasterxml.jackson.databind.JsonMappingException: Can not construct instance of scala.collection.Seq, problem: abstract types either need to be mapped to concrete types, have custom deserializer, or be instantiated with additional type information * I tried using Jackson's DefaultScalaModule, which seems to have support for Scala Seqs, but had no luck. * Deserialization works if the Scala class does not have any Seq fields, and works if the fields are Java Lists instead of Seqs. Thanks very much for your help! Kevin Chen -- (Sent from mobile. Pardon brevity.)
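For anyone else parsing these endpoints, a minimal sketch of registering jackson-module-scala on the mapper that actually performs the parsing; the resolution above suggests the module works once it is registered on the right ObjectMapper. The payload source is left abstract, and from Java the same registration would typically go through DefaultScalaModule$.MODULE$ (the Scala object's static instance).

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.spark.status.api.v1.ApplicationInfo

// Register the Scala module on the mapper that actually does the parsing;
// with several ObjectMappers in a codebase it is easy to register it on the wrong one.
val mapper = new ObjectMapper()
mapper.registerModule(DefaultScalaModule)

// e.g. the response body fetched from /api/v1/applications/<app-id>
val json: String = ???
val info = mapper.readValue(json, classOf[ApplicationInfo])
```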
Re: [ANNOUNCE] Announcing Spark 1.5.0
Hi Spark Developers, I'm eager to try it out! However, I ran into problems resolving dependencies: [warn] [NOT FOUND ] org.apache.spark#spark-core_2.10;1.5.0!spark-core_2.10.jar (0ms) [warn] jcenter: tried When will the package be available? Best Regards, Jerry On Wed, Sep 9, 2015 at 9:30 AM, Dimitris Kouzis - Loukas wrote: > Yeii! > > On Wed, Sep 9, 2015 at 2:25 PM, Yu Ishikawa wrote: >> Great work, everyone! >> -- Yu Ishikawa
Re: [ANNOUNCE] Announcing Spark 1.5.0
You can try it out really quickly by "building" a Spark Notebook from http://spark-notebook.io/. Just choose the master branch and 1.5.0, and a correct Hadoop version (it defaults to 2.2.0, though), and there you go :-) On Wed, Sep 9, 2015 at 6:39 PM Ted Yu wrote: > Jerry: > I just tried building hbase-spark module with 1.5.0 and I see: > [...] -- andy
Re: [ANNOUNCE] Announcing Spark 1.5.0
Jerry: I just tried building hbase-spark module with 1.5.0 and I see: ls -l ~/.m2/repository/org/apache/spark/spark-core_2.10/1.5.0 total 21712 -rw-r--r-- 1 tyu staff 196 Sep 9 09:37 _maven.repositories -rw-r--r-- 1 tyu staff 11081542 Sep 9 09:37 spark-core_2.10-1.5.0.jar -rw-r--r-- 1 tyu staff 41 Sep 9 09:37 spark-core_2.10-1.5.0.jar.sha1 -rw-r--r-- 1 tyu staff 19816 Sep 9 09:37 spark-core_2.10-1.5.0.pom -rw-r--r-- 1 tyu staff 41 Sep 9 09:37 spark-core_2.10-1.5.0.pom.sha1 FYI On Wed, Sep 9, 2015 at 9:35 AM, Jerry Lam wrote: > Hi Spark Developers, > > I'm eager to try it out! However, I ran into problems resolving dependencies: > [warn] [NOT FOUND ] > org.apache.spark#spark-core_2.10;1.5.0!spark-core_2.10.jar (0ms) > [warn] jcenter: tried > > When will the package be available? > > Best Regards, > > Jerry
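For anyone hitting the same [NOT FOUND] warnings, a minimal build.sbt sketch that resolves the release from Maven Central rather than jcenter, which can lag a fresh release by some hours; the resolver label and Scala version below are illustrative assumptions.

```scala
// build.sbt: resolve the freshly published Spark 1.5.0 artifacts from Maven Central
scalaVersion := "2.10.4"

resolvers += "central" at "https://repo1.maven.org/maven2/"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.0"
```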
Re: Code generation for GPU
I am already looking at the DataFrame APIs and the implementation. In fact, the columnar representation https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnType.scala is what gave me the idea for my talk proposal; it is ideally suited for computation on a GPU. But from what Reynold said, it appears that the columnar structure is not exploited for computation such as expression evaluation; it is used only for space-efficient in-memory storage, not for computation. Even the TungstenProject invokes the operations on a row-by-row basis. The UnsafeRow is optimized in the sense that it is only a logical row, as opposed to the InternalRow which holds physical copies of the values, but the computation is still per row rather than over batches of rows stored in a columnar structure. Thanks for the concrete suggestions on the presentation. I do have the core idea or theme of my talk ready in mind, but I will now present along the lines you suggest. I wasn't really thinking of a demo, but now I will do one. I was actually hoping to contribute to the Spark code and show results on those changes rather than offline changes. I will still try to do that by hooking into the columnar structure, but it may not be in a shape that can go into the Spark code; that's what I meant by severely limiting the scope of my talk. I have seen a performance improvement of 5-10x on expression evaluation even on "ordinary" laptop GPUs, so it will make a good demo along with some concrete proposals for vectorization. As you said, I will have to hook up to a columnar structure, perform the computation, let the existing Spark computation also proceed, and compare the performance. I will focus on the slides early (7th Oct is the deadline), and then continue the work for another 3 weeks until the summit. That still gives me enough time to do considerable work. Hope your fear does not come true.
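To make the row-versus-column contrast concrete, here is a small illustrative sketch in plain Scala (not Spark internals): the same arithmetic expression evaluated once per row object, and again as a tight loop over primitive column arrays, which is the batch shape a GPU kernel or SIMD loop can consume directly.

```scala
// Row-oriented evaluation: one small object per row, the expression applied row by row.
case class KV(key: Int, value: Double)
def evalRowByRow(rows: Iterator[KV]): Iterator[Double] =
  rows.map(r => r.value * 2.0 + r.key)

// Column-oriented evaluation: the same expression as a tight loop over primitive
// arrays, the kind of batch a GPU kernel (or vectorized CPU code) operates on.
def evalColumnar(keys: Array[Int], values: Array[Double]): Array[Double] = {
  val out = new Array[Double](values.length)
  var i = 0
  while (i < values.length) {
    out(i) = values(i) * 2.0 + keys(i)
    i += 1
  }
  out
}
```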
Spark 1.5: How to trigger expression execution through UnsafeRow/TungstenProject
The tungsten, codegen, etc. options are enabled by default, but I am not able to get execution to go through UnsafeRow/TungstenProject; it still executes using InternalRow/Project. I see this in SparkStrategies.scala: "If unsafe mode is enabled and we support these data types in Unsafe, use the tungsten project. Otherwise use the normal project." Can someone give example code that triggers this? I tried some of the primitive types but it did not work.
Re: Spark 1.5: How to trigger expression execution through UnsafeRow/TungstenProject
Here is the example from Reynold ( http://search-hadoop.com/m/q3RTtfvs1P1YDK8d) : scala> val data = sc.parallelize(1 to size, 5).map(x => (util.Random.nextInt(size / repetitions), util.Random.nextDouble)).toDF("key", "value") data: org.apache.spark.sql.DataFrame = [key: int, value: double] scala> data.explain == Physical Plan == TungstenProject [_1#0 AS key#2,_2#1 AS value#3] Scan PhysicalRDD[_1#0,_2#1] ... scala> val res = df.groupBy("key").agg(sum("value")) res: org.apache.spark.sql.DataFrame = [key: int, sum(value): double] scala> res.explain 15/09/09 14:17:26 INFO MemoryStore: ensureFreeSpace(88456) called with curMem=84037, maxMem=556038881 15/09/09 14:17:26 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 86.4 KB, free 530.1 MB) 15/09/09 14:17:26 INFO MemoryStore: ensureFreeSpace(19788) called with curMem=172493, maxMem=556038881 15/09/09 14:17:26 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 19.3 KB, free 530.1 MB) 15/09/09 14:17:26 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:42098 (size: 19.3 KB, free: 530.2 MB) 15/09/09 14:17:26 INFO SparkContext: Created broadcast 2 from explain at :27 == Physical Plan == TungstenAggregate(key=[key#19], functions=[(sum(value#20),mode=Final,isDistinct=false)], output=[key#19,sum(value)#21]) TungstenExchange hashpartitioning(key#19) TungstenAggregate(key=[key#19], functions=[(sum(value#20),mode=Partial,isDistinct=false)], output=[key#19,currentSum#25]) Scan ParquetRelation[file:/tmp/data][key#19,value#20] FYI On Wed, Sep 9, 2015 at 12:31 PM, lonikar wrote: > The tungsten, codegen, etc. options are enabled by default, but I am not able to get execution to go through UnsafeRow/TungstenProject. [...]