[MLlib] Extensibility of MLlib classes (Word2VecModel etc.)

2015-09-09 Thread Maandy
Hey,

I'm trying to implement doc2vec
(http://cs.stanford.edu/~quocle/paragraph_vector.pdf), mainly for
sport/research purposes given all its limitations, so I would probably not
even try to PR it into MLlib itself. To do that, though, it would be highly
useful to have access to MLlib's Word2VecModel class, which is mostly
private. Is there any reason (i.e. some Spark/MLlib guideline) for that, or
would it be OK to refactor the code and make a PR? I've found a similar JIRA
issue that was posted almost a year ago but for some reason got closed:
https://issues.apache.org/jira/browse/SPARK-4101.
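
One common workaround while the class stays package-private: code compiled
under Spark's own package prefix can see private[spark]/private[mllib] members
without patching Spark. A minimal sketch of that trick follows; the
Doc2VecSketch class and the naive vector averaging are purely illustrative
(they are not the paragraph-vector algorithm from the paper) and only use
Word2VecModel's public transform method, but anything scoped private[spark]
becomes reachable from this location too.

// Illustrative only: placed in Spark's namespace so package-private members
// of org.apache.spark.mllib.feature (e.g. anything marked private[spark])
// are visible without modifying Spark itself.
package org.apache.spark.mllib.feature

import org.apache.spark.mllib.linalg.{Vector, Vectors}

class Doc2VecSketch(word2Vec: Word2VecModel) extends Serializable {

  // Average the word vectors of a document; a simplified stand-in for
  // paragraph-vector training, just to show the wiring.
  def inferVector(doc: Seq[String], vectorSize: Int): Vector = {
    val sum = new Array[Double](vectorSize)
    var count = 0
    doc.foreach { word =>
      try {
        val v = word2Vec.transform(word).toArray  // public API; the word's vector
        var i = 0
        while (i < vectorSize) { sum(i) += v(i); i += 1 }
        count += 1
      } catch {
        case _: IllegalStateException => // word not in vocabulary; skip it
      }
    }
    if (count > 0) Vectors.dense(sum.map(_ / count)) else Vectors.dense(sum)
  }
}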

Mateusz



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Extensibility-of-MLlib-classes-Word2VecModel-etc-tp14011.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Did the 1.5 release complete?

2015-09-09 Thread Reynold Xin
Dev/user announcement was made just now.

For Maven, I did publish it this afternoon (so it's been a few hours). If
it is still not there tomorrow morning, I will look into it.



On Wed, Sep 9, 2015 at 2:42 AM, Sean Owen  wrote:

> I saw the end of the RC3 vote:
>
> https://mail-archives.apache.org/mod_mbox/spark-dev/201509.mbox/%3CCAPh_B%3DbQWf_vVuPs_eRpvnNSj8fbULX4kULnbs6MCAA10ZQ9eQ%40mail.gmail.com%3E
>
> but there are no artifacts for it in Maven?
>
> http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22org.apache.spark%22%20AND%20a%3A%22spark-parent_2.10%22
>
> and I don't see any announcement at dev@
> https://mail-archives.apache.org/mod_mbox/spark-dev/201509.mbox/browser
>
> But it was announced here just now:
> https://databricks.com/blog/2015/09/09/announcing-spark-1-5.html
>
> Did I miss something?
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [ANNOUNCE] Announcing Spark 1.5.0

2015-09-09 Thread Yu Ishikawa
Great work, everyone!



-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/ANNOUNCE-Announcing-Spark-1-5-0-tp14013p14015.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: looking for a technical reviewer to review a book on Spark

2015-09-09 Thread Gurumurthy Yeleswarapu
Hi Mohammed:
I'm interested.
Thanks,
Guru Yeleswarapu
  From: Mohammed Guller 
 To: "dev@spark.apache.org"  
 Sent: Wednesday, September 9, 2015 8:36 AM
 Subject: looking for a technical reviewer to review a book on Spark
   
Hi Spark developers,

I am writing a book on Spark. The publisher of the book is looking for a
technical reviewer. You will be compensated for your time. The publisher will
pay a flat rate per page for the review.

I spoke with Matei Zaharia about this and he suggested that I send an email to
the dev mailing list.

The book covers Spark core and the Spark libraries, including Spark SQL, Spark
Streaming, MLlib, Spark ML, and GraphX. It also covers operational aspects
such as deployment with different cluster managers and monitoring.

Please let me know if you are interested and I will connect you with the
publisher.

Thanks,
Mohammed
Principal Architect, Glassbeam Inc, www.glassbeam.com
5201 Great America Parkway, Suite 360, Santa Clara, CA 95054
p: +1.408.740.4610, m: +1.925.786.7521, f: +1.408.740.4601, skype: mguller

  

Re: looking for a technical reviewer to review a book on Spark

2015-09-09 Thread Gurumurthy Yeleswarapu
My apologies for the broadcast! That email was meant for Mohammed.
  From: Gurumurthy Yeleswarapu 
 To: Mohammed Guller ; "dev@spark.apache.org" 
 
 Sent: Wednesday, September 9, 2015 8:50 AM
 Subject: Re: looking for a technical reviewer to review a book on Spark
   
Hi Mohammed:
I'm interested.
Thanks,
Guru Yeleswarapu
 

 From: Mohammed Guller 
 To: "dev@spark.apache.org"  
 Sent: Wednesday, September 9, 2015 8:36 AM
 Subject: looking for a technical reviewer to review a book on Spark
   
Hi Spark developers,

I am writing a book on Spark. The publisher of the book is looking for a
technical reviewer. You will be compensated for your time. The publisher will
pay a flat rate per page for the review.

I spoke with Matei Zaharia about this and he suggested that I send an email to
the dev mailing list.

The book covers Spark core and the Spark libraries, including Spark SQL, Spark
Streaming, MLlib, Spark ML, and GraphX. It also covers operational aspects
such as deployment with different cluster managers and monitoring.

Please let me know if you are interested and I will connect you with the
publisher.

Thanks,
Mohammed
Principal Architect, Glassbeam Inc, www.glassbeam.com
5201 Great America Parkway, Suite 360, Santa Clara, CA 95054
p: +1.408.740.4610, m: +1.925.786.7521, f: +1.408.740.4601, skype: mguller

   

  

RE: (Spark SQL) partition-scoped UDF

2015-09-09 Thread Eron Wright
Follow-up: I solved this problem by overriding the model's `transform` method
and using `mapPartitions` to produce a new DataFrame rather than using `udf`.

Source code:
https://github.com/deeplearning4j/deeplearning4j/blob/135d3b25b96c21349abf488a44f59bb37a2a5930/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/ml/classification/MultiLayerNetworkClassification.scala#L143

Thanks, Reynold, for your time.
-Eron
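
Reduced to a minimal sketch, the pattern looks like the following (class and
column names here are made up, not the dl4j code; it assumes Spark 1.4/1.5,
where DataFrame.mapPartitions yields an RDD that is then turned back into a
DataFrame):

import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Batch-oriented transform: hand each partition to the model as a whole so
// many rows are scored at once, then rebuild a DataFrame from the result.
def batchTransform(sqlContext: SQLContext, dataset: DataFrame): DataFrame = {
  val outputSchema = StructType(dataset.schema.fields :+
    StructField("prediction", DoubleType, nullable = false))

  val transformed = dataset.mapPartitions { rows =>
    val batch = rows.toArray                 // materialize the whole partition
    val scores = batch.map(_ => 0.0)         // placeholder for a real batch scorer
    batch.iterator.zip(scores.iterator).map {
      case (row, score) => Row.fromSeq(row.toSeq :+ score)
    }
  }
  sqlContext.createDataFrame(transformed, outputSchema)
}

In an actual ML Transformer this would be the body of an overridden
transform(dataset: DataFrame), as in the linked dl4j code.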
Date: Sat, 5 Sep 2015 13:55:34 -0700
Subject: Re: (Spark SQL) partition-scoped UDF
From: ewri...@live.com
To: r...@databricks.com
CC: dev@spark.apache.org

The transformer is a classification model produced by the 
NeuralNetClassification estimator of dl4j-spark-ml.  Source code here.  The 
neural net operates most efficiently when many examples are classified in 
batch.  I imagine overriding `transform` rather than `predictRaw`.   Does 
anyone know of a solution compatible with Spark 1.4 or 1.5?
Thanks again!
From:  Reynold Xin
Date:  Friday, September 4, 2015 at 5:19 PM
To:  Eron Wright
Cc:  "dev@spark.apache.org"
Subject:  Re: (Spark SQL) partition-scoped UDF

Can you say more about your transformer?
This is a good idea, and indeed we are doing it for R already (the latest way 
to run UDFs in R is to pass the entire partition as a local R dataframe for 
users to run on). However, what works for R for simple data processing might 
not work for your high performance transformer, etc.

On Fri, Sep 4, 2015 at 7:08 AM, Eron Wright  wrote:
Transformers in Spark ML typically operate on a per-row basis, based on
callUDF. For a new transformer that I'm developing, I need to transform an
entire partition with a function, as opposed to transforming each row
separately. The reason is that, in my case, rows must be transformed in batch
to amortize some overhead. How may I accomplish this?

One option appears to be to invoke DataFrame::mapPartitions, yielding an RDD
that is then converted back to a DataFrame. I'm unsure about the viability or
consequences of that.

Thanks!
Eron Wright
  

Re: Deserializing JSON into Scala objects in Java code

2015-09-09 Thread Kevin Chen
Marcelo and Christopher,

 Thanks for your help! The problem turned out to arise from a different part
of the code (we have multiple ObjectMappers), but because I am not very
familiar with Jackson I had thought there was a problem with the Scala
module.

Thank you again,
Kevin
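
For anyone who hits the same JsonMappingException later: the part that matters
is that DefaultScalaModule is registered on the very ObjectMapper instance
that does the parsing, which is easy to miss when several mappers exist. A
minimal sketch, in Scala for brevity (the same registerModule call applies
from Java), with hypothetical case classes standing in for the v1 API types:

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Hypothetical stand-ins for the org.apache.spark.status.api.v1 classes.
case class AttemptInfo(attemptId: Option[String], completed: Boolean)
case class AppInfo(id: String, name: String, attempts: Seq[AttemptInfo])

val mapper = new ObjectMapper()
mapper.registerModule(DefaultScalaModule)  // without this, Seq/Option fields fail to bind

val json = """{"id":"app-1","name":"demo","attempts":[{"attemptId":"1","completed":true}]}"""
val app = mapper.readValue(json, classOf[AppInfo])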

From:  Christopher Currie 
Date:  Wednesday, September 9, 2015 at 10:17 AM
To:  Kevin Chen , "dev@spark.apache.org"

Cc:  Matt Cheah , Mingyu Kim 
Subject:  Fwd: Deserializing JSON into Scala objects in Java code

Kevin,

I'm not a Spark dev, but I maintain the Scala module for Jackson. If you're
continuing to have issues with parsing JSON using the Spark Scala datatypes,
let me know or chime in on the jackson mailing list
(jackson-u...@googlegroups.com) and I'll see what I can do to help.

Christopher Currie

-- Forwarded message --
From: Paul Brown 
Date: Tue, Sep 8, 2015 at 8:58 PM
Subject: Fwd: Deserializing JSON into Scala objects in Java code
To: Christopher Currie 


Passing along. 

-- Forwarded message --
From: Kevin Chen 
Date: Tuesday, September 8, 2015
Subject: Deserializing JSON into Scala objects in Java code
To: "dev@spark.apache.org" 
Cc: Matt Cheah , Mingyu Kim 


Hello Spark Devs,

 I am trying to use the new Spark API json endpoints at /api/v1/[path]
(added in SPARK-3454).

 In order to minimize maintenance on our end, I would like to use
Retrofit/Jackson to parse the json directly into the Scala classes in
org/apache/spark/status/api/v1/api.scala (ApplicationInfo,
ApplicationAttemptInfo, etc…). However, Jackson does not seem to know how to
handle Scala Seqs, and will throw an error when trying to parse the
attempts: Seq[ApplicationAttemptInfo] field of ApplicationInfo. Our codebase
is in Java.

 My questions are:
1. Do you have any recommendations on how to easily deserialize Scala
objects from json? For example, do you have any current usage examples of
SPARK-3454 with Java?
2. Alternatively, are you committed to the json formats of /api/v1/path? I
would guess so, because of the ‘v1’, but wanted to confirm. If so, I could
deserialize the json into instances of my own Java classes instead, without
worrying about changing the class structure later due to changes in the
Spark API.
Some further information:
* The error I am getting with Jackson when trying to deserialize the json
into ApplicationInfo is Caused by:
com.fasterxml.jackson.databind.JsonMappingException: Can not construct
instance of scala.collection.Seq, problem: abstract types either need to be
mapped to concrete types, have custom deserializer, or be instantiated with
additional type information
* I tried using Jackson’s DefaultScalaModule, which seems to have support
for Scala Seqs, but had no luck.
* Deserialization works if the Scala class does not have any Seq fields, and
works if the fields are Java Lists instead of Seqs.
Thanks very much for your help!
Kevin Chen




-- 
(Sent from mobile. Pardon brevity.)







Re: [ANNOUNCE] Announcing Spark 1.5.0

2015-09-09 Thread Jerry Lam
Hi Spark Developers,

I'm eager to try it out! However, I ran into problems resolving dependencies:
[warn] [NOT FOUND  ]
org.apache.spark#spark-core_2.10;1.5.0!spark-core_2.10.jar (0ms)
[warn]  jcenter: tried

When will the package be available?

Best Regards,

Jerry


On Wed, Sep 9, 2015 at 9:30 AM, Dimitris Kouzis - Loukas 
wrote:

> Yeii!
>
> On Wed, Sep 9, 2015 at 2:25 PM, Yu Ishikawa 
> wrote:
>
>> Great work, everyone!
>>
>>
>>
>> -
>> -- Yu Ishikawa
>> --
>> View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/ANNOUNCE-Announcing-Spark-1-5-0-tp14013p14015.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


Re: [ANNOUNCE] Announcing Spark 1.5.0

2015-09-09 Thread andy petrella
You can try it out really quickly by "building" a Spark Notebook from
http://spark-notebook.io/.

Just choose the master branch and 1.5.0 and a correct Hadoop version (it
defaults to 2.2.0 though), and there you go :-)


On Wed, Sep 9, 2015 at 6:39 PM Ted Yu  wrote:

> Jerry:
> I just tried building hbase-spark module with 1.5.0 and I see:
>
> ls -l ~/.m2/repository/org/apache/spark/spark-core_2.10/1.5.0
> total 21712
> -rw-r--r--  1 tyu  staff       196 Sep  9 09:37 _maven.repositories
> -rw-r--r--  1 tyu  staff  11081542 Sep  9 09:37 spark-core_2.10-1.5.0.jar
> -rw-r--r--  1 tyu  staff        41 Sep  9 09:37 spark-core_2.10-1.5.0.jar.sha1
> -rw-r--r--  1 tyu  staff     19816 Sep  9 09:37 spark-core_2.10-1.5.0.pom
> -rw-r--r--  1 tyu  staff        41 Sep  9 09:37 spark-core_2.10-1.5.0.pom.sha1
>
> FYI
>
> On Wed, Sep 9, 2015 at 9:35 AM, Jerry Lam  wrote:
>
>> Hi Spark Developers,
>>
>> I'm eager to try it out! However, I got problems in resolving
>> dependencies:
>> [warn] [NOT FOUND  ]
>> org.apache.spark#spark-core_2.10;1.5.0!spark-core_2.10.jar (0ms)
>> [warn]  jcenter: tried
>>
>> When the package will be available?
>>
>> Best Regards,
>>
>> Jerry
>>
>>
>> On Wed, Sep 9, 2015 at 9:30 AM, Dimitris Kouzis - Loukas <
>> look...@gmail.com> wrote:
>>
>>> Yeii!
>>>
>>> On Wed, Sep 9, 2015 at 2:25 PM, Yu Ishikawa <
>>> yuu.ishikawa+sp...@gmail.com> wrote:
>>>
 Great work, everyone!



 -
 -- Yu Ishikawa
 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/ANNOUNCE-Announcing-Spark-1-5-0-tp14013p14015.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org


>>>
>>
> --
andy


Re: [ANNOUNCE] Announcing Spark 1.5.0

2015-09-09 Thread Ted Yu
Jerry:
I just tried building hbase-spark module with 1.5.0 and I see:

ls -l ~/.m2/repository/org/apache/spark/spark-core_2.10/1.5.0
total 21712
-rw-r--r--  1 tyu  staff       196 Sep  9 09:37 _maven.repositories
-rw-r--r--  1 tyu  staff  11081542 Sep  9 09:37 spark-core_2.10-1.5.0.jar
-rw-r--r--  1 tyu  staff        41 Sep  9 09:37 spark-core_2.10-1.5.0.jar.sha1
-rw-r--r--  1 tyu  staff     19816 Sep  9 09:37 spark-core_2.10-1.5.0.pom
-rw-r--r--  1 tyu  staff        41 Sep  9 09:37 spark-core_2.10-1.5.0.pom.sha1

FYI
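
For sbt users, a minimal sketch of depending on the release: the warning above
mentions jcenter, which can lag behind Maven Central shortly after a release,
so keep Central (sbt's default resolver) in play or retry once the sync
completes.

// build.sbt (sketch)
scalaVersion := "2.10.4"

// Maven Central is sbt's default resolver; listed here only to make the point explicit.
resolvers += "Maven Central" at "https://repo1.maven.org/maven2/"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.0" % "provided"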

On Wed, Sep 9, 2015 at 9:35 AM, Jerry Lam  wrote:

> Hi Spark Developers,
>
> I'm eager to try it out! However, I got problems in resolving dependencies:
> [warn] [NOT FOUND  ]
> org.apache.spark#spark-core_2.10;1.5.0!spark-core_2.10.jar (0ms)
> [warn]  jcenter: tried
>
> When the package will be available?
>
> Best Regards,
>
> Jerry
>
>
> On Wed, Sep 9, 2015 at 9:30 AM, Dimitris Kouzis - Loukas <
> look...@gmail.com> wrote:
>
>> Yeii!
>>
>> On Wed, Sep 9, 2015 at 2:25 PM, Yu Ishikawa > > wrote:
>>
>>> Great work, everyone!
>>>
>>>
>>>
>>> -
>>> -- Yu Ishikawa
>>> --
>>> View this message in context:
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/ANNOUNCE-Announcing-Spark-1-5-0-tp14013p14015.html
>>> Sent from the Apache Spark Developers List mailing list archive at
>>> Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>
>>
>


Re: Code generation for GPU

2015-09-09 Thread lonikar
I am already looking at the DataFrame APIs and the implementation. In fact,
the columnar representation
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnType.scala
is what gave me the idea for my talk proposal. It is ideally suited for
computation on a GPU. But from what Reynold said, it appears that the columnar
structure is not exploited for computations such as expression evaluation: it
is used only for space-efficient in-memory storage, not for computation. Even
TungstenProject invokes operations row by row. UnsafeRow is optimized in the
sense that it is only a logical row, as opposed to InternalRow, which holds
physical copies of the values. But the computation is still on a per-row basis
rather than on batches of rows stored in a columnar structure.

Thanks for the concrete suggestions on the presentation. I do have the core
idea or theme of my talk in mind, but I will now present along the lines you
suggest. I wasn't really thinking of a demo, but now I will do that. I was
actually hoping to be able to contribute to the Spark code and show results on
those changes rather than on offline changes. I will still try to do that by
hooking into the columnar structure, but it may not be in a shape that can go
into the Spark code. That's what I meant by severely limiting the scope of my
talk.

I have seen a performance improvement of 5-10x on expression evaluation even
on "ordinary" laptop GPUs. Thus, it will be a good demo along with some
concrete proposals for vectorization. As you said, I will have to hook into a
columnar structure, perform the computation, let the existing Spark
computation also proceed, and compare the performance.

I will focus on the slides early (the deadline is 7 Oct), and then continue
the work for another 3 weeks until the summit. That still gives me enough time
to do considerable work. I hope your fear does not come true.






--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Re-Code-generation-for-GPU-tp13954p14025.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Spark 1.5: How to trigger expression execution through UnsafeRow/TungstenProject

2015-09-09 Thread lonikar
The tungsten, codegen, etc. options are enabled by default, but I am not able
to get execution to go through UnsafeRow/TungstenProject; it still executes
using InternalRow/Project.

I see this comment in SparkStrategies.scala: "If unsafe mode is enabled and we
support these data types in Unsafe, use the tungsten project. Otherwise use
the normal project."

Can someone give example code that triggers this? I tried some of the
primitive types but it did not work.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-1-5-How-to-trigger-expression-execution-through-UnsafeRow-TungstenProject-tp14026.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark 1.5: How to trigger expression execution through UnsafeRow/TungstenProject

2015-09-09 Thread Ted Yu
Here is the example from Reynold (
http://search-hadoop.com/m/q3RTtfvs1P1YDK8d) :

scala> val data = sc.parallelize(1 to size, 5).map(x =>
(util.Random.nextInt(size /
repetitions),util.Random.nextDouble)).toDF("key", "value")
data: org.apache.spark.sql.DataFrame = [key: int, value: double]

scala> data.explain
== Physical Plan ==
TungstenProject [_1#0 AS key#2,_2#1 AS value#3]
 Scan PhysicalRDD[_1#0,_2#1]

...
scala> val res = df.groupBy("key").agg(sum("value"))
res: org.apache.spark.sql.DataFrame = [key: int, sum(value): double]

scala> res.explain
15/09/09 14:17:26 INFO MemoryStore: ensureFreeSpace(88456) called with
curMem=84037, maxMem=556038881
15/09/09 14:17:26 INFO MemoryStore: Block broadcast_2 stored as values in
memory (estimated size 86.4 KB, free 530.1 MB)
15/09/09 14:17:26 INFO MemoryStore: ensureFreeSpace(19788) called with
curMem=172493, maxMem=556038881
15/09/09 14:17:26 INFO MemoryStore: Block broadcast_2_piece0 stored as
bytes in memory (estimated size 19.3 KB, free 530.1 MB)
15/09/09 14:17:26 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory
on localhost:42098 (size: 19.3 KB, free: 530.2 MB)
15/09/09 14:17:26 INFO SparkContext: Created broadcast 2 from explain at
:27
== Physical Plan ==
TungstenAggregate(key=[key#19],
functions=[(sum(value#20),mode=Final,isDistinct=false)],
output=[key#19,sum(value)#21])
 TungstenExchange hashpartitioning(key#19)
  TungstenAggregate(key=[key#19],
functions=[(sum(value#20),mode=Partial,isDistinct=false)],
output=[key#19,currentSum#25])
   Scan ParquetRelation[file:/tmp/data][key#19,value#20]

FYI
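
A quick way to check on your own build; a sketch assuming a 1.5.0 spark-shell
with default settings, and relying on the rule quoted from SparkStrategies
above (unsafe projection is chosen only when every output type is supported
by UnsafeRow):

// Project only Unsafe-supported types (longs here) and inspect the plan.
val df = sqlContext.range(0, 1000).selectExpr("id", "id * 2 AS doubled")
df.explain()  // expect TungstenProject above the scan

// If a plain Project still shows up, confirm the flag was not turned off:
sqlContext.getConf("spark.sql.tungsten.enabled")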

On Wed, Sep 9, 2015 at 12:31 PM, lonikar  wrote:

> The tungsten, cogegen etc options are enabled by default. But I am not able
> to get the execution through the UnsafeRow/TungstenProject. It still
> executes using InternalRow/Project.
>
> I see this in the SparkStrategies.scala: If unsafe mode is enabled and we
> support these data types in Unsafe, use the tungsten project. Otherwise use
> the normal project.
>
> Can someone give an example code on what can trigger this? I tried some of
> the primitive types but did not work.
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-1-5-How-to-trigger-expression-execution-through-UnsafeRow-TungstenProject-tp14026.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>