[jira] [Comment Edited] (SPARK-14174) Accelerate KMeans via Mini-Batch EM

2017-02-21 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15876202#comment-15876202
 ] 

RJ Nowling edited comment on SPARK-14174 at 2/21/17 4:08 PM:
-

I did the initial implementation for SPARK-2308.  Re: the random sampling: 
with Spark's approach, a Bernoulli trial is performed for each data point in 
the RDD.  It's not as efficient as the case where random-access indexing is 
available.  That said, if your vectors are quite long, then you save 
computational time on evaluating distances and such.  Thus, when evaluating 
the performance, don't just look at the case of a large number of vectors -- 
also look at the case of vectors with many elements.
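
For illustration, a minimal sketch of that sampling behavior (assumes an 
existing SparkContext {{sc}}; the data here are made up):

{code}
// Bernoulli-style sampling: sample() without replacement flips a coin per
// element, so the whole RDD is scanned -- there is no random-access indexing.
val rdd = sc.parallelize(1 to 1000000)
val miniBatch = rdd.sample(withReplacement = false, fraction = 0.01, seed = 123L)
println(miniBatch.count())  // roughly 10,000 elements, varying with the seed
{code}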


was (Author: rnowling):
I did the initial implementation for SPARK-2308.  Re: the random sampling: 
with Spark's approach, a Bernoulli trial is performed for each data point in 
the RDD.  It's not as efficient as the case where random-access indexing is 
available.  That said, if your vectors are quite long, then you save 
computational time on evaluating distances and such.  Thus, when evaluating 
the performance, don't just look at the case of a large number of vectors -- 
also look at the case of vectors with many elements.

> Accelerate KMeans via Mini-Batch EM
> ---
>
> Key: SPARK-14174
> URL: https://issues.apache.org/jira/browse/SPARK-14174
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>
> The MiniBatchKMeans is a variant of the KMeans algorithm which uses 
> mini-batches to reduce the computation time, while still attempting to 
> optimise the same objective function. Mini-batches are subsets of the input 
> data, randomly sampled in each training iteration. These mini-batches 
> drastically reduce the amount of computation required to converge to a local 
> solution. In contrast to other algorithms that reduce the convergence time of 
> k-means, mini-batch k-means produces results that are generally only slightly 
> worse than the standard algorithm.
> I have implemented mini-batch k-means in MLlib, and the acceleration is really 
> significant.
> The MiniBatch KMeans is named XMeans in the following lines.
> {code}
> val path = "/tmp/mnist8m.scale"
> val data = MLUtils.loadLibSVMFile(sc, path)
> val vecs = data.map(_.features).persist()
> val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
> initializationMode="k-means||", seed=123l)
> km.computeCost(vecs)
> res0: Double = 3.317029898599564E8
> val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
> initializationMode="k-means||", miniBatchFraction=0.1, seed=123l)
> xm.computeCost(vecs)
> res1: Double = 3.3169865959604424E8
> val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
> initializationMode="k-means||", miniBatchFraction=0.01, seed=123l)
> xm2.computeCost(vecs)
> res2: Double = 3.317195831216454E8
> {code}
> All three training runs above reached the maximum of 10 iterations.
> We can see that the WSSSEs are almost the same, while the running times 
> differ significantly:
> {code}
> KMeans                             2876 sec
> MiniBatch KMeans (fraction=0.1)     263 sec
> MiniBatch KMeans (fraction=0.01)     90 sec
> {code}
> With an appropriate fraction, the larger the dataset, the greater the speedup.
> The data used above have 8,100,000 samples and 784 features. It can be 
> downloaded here 
> (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2)
> Comparison of the K-Means and MiniBatchKMeans on sklearn : 
> http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py
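
As a point of reference, a minimal single-machine sketch of the mini-batch 
update rule described above (per-center learning rates in the style of 
Sculley's web-scale k-means; illustrative only, not the proposed XMeans API):

{code}
// One mini-batch step: assign each sampled point to its nearest center,
// then nudge that center toward the point with a decaying learning rate.
def miniBatchStep(centers: Array[Array[Double]],
                  batch: Seq[Array[Double]],
                  counts: Array[Long]): Unit = {
  for (x <- batch) {
    // nearest center by squared Euclidean distance
    val j = centers.indices.minBy { i =>
      centers(i).zip(x).map { case (c, v) => (c - v) * (c - v) }.sum
    }
    counts(j) += 1
    val eta = 1.0 / counts(j)  // per-center learning rate
    for (d <- centers(j).indices)
      centers(j)(d) = (1 - eta) * centers(j)(d) + eta * x(d)
  }
}
{code}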






[jira] [Commented] (SPARK-14174) Accelerate KMeans via Mini-Batch EM

2017-02-21 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15876202#comment-15876202
 ] 

RJ Nowling commented on SPARK-14174:


I did the initial implementation for SPARK-2308.  Re: the random sampling: 
with Spark's approach, a Bernoulli trial is performed for each data point in 
the RDD.  It's not as efficient as the case where random-access indexing is 
available.  That said, if your vectors are quite long, then you save 
computational time on evaluating distances and such.  Thus, when evaluating 
the performance, don't just look at the case of a large number of vectors -- 
also look at the case of vectors with many elements.

> Accelerate KMeans via Mini-Batch EM
> ---
>
> Key: SPARK-14174
> URL: https://issues.apache.org/jira/browse/SPARK-14174
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>
> The MiniBatchKMeans is a variant of the KMeans algorithm which uses 
> mini-batches to reduce the computation time, while still attempting to 
> optimise the same objective function. Mini-batches are subsets of the input 
> data, randomly sampled in each training iteration. These mini-batches 
> drastically reduce the amount of computation required to converge to a local 
> solution. In contrast to other algorithms that reduce the convergence time of 
> k-means, mini-batch k-means produces results that are generally only slightly 
> worse than the standard algorithm.
> I have implemented mini-batch k-means in MLlib, and the acceleration is really 
> significant.
> The MiniBatch KMeans is named XMeans in the following lines.
> {code}
> val path = "/tmp/mnist8m.scale"
> val data = MLUtils.loadLibSVMFile(sc, path)
> val vecs = data.map(_.features).persist()
> val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
> initializationMode="k-means||", seed=123l)
> km.computeCost(vecs)
> res0: Double = 3.317029898599564E8
> val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
> initializationMode="k-means||", miniBatchFraction=0.1, seed=123l)
> xm.computeCost(vecs)
> res1: Double = 3.3169865959604424E8
> val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
> initializationMode="k-means||", miniBatchFraction=0.01, seed=123l)
> xm2.computeCost(vecs)
> res2: Double = 3.317195831216454E8
> {code}
> All three training runs above reached the maximum of 10 iterations.
> We can see that the WSSSEs are almost the same, while the running times 
> differ significantly:
> {code}
> KMeans                             2876 sec
> MiniBatch KMeans (fraction=0.1)     263 sec
> MiniBatch KMeans (fraction=0.01)     90 sec
> {code}
> With an appropriate fraction, the larger the dataset, the greater the speedup.
> The data used above have 8,100,000 samples and 784 features. It can be 
> downloaded here 
> (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2)
> Comparison of the K-Means and MiniBatchKMeans on sklearn : 
> http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py






[jira] [Comment Edited] (SPARK-16365) Ideas for moving "mllib-local" forward

2016-07-13 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375928#comment-15375928
 ] 

RJ Nowling edited comment on SPARK-16365 at 7/13/16 10:40 PM:
--

I'm really looking forward to this feature. Spark is great where model training 
is expensive and involves large data sets, but I want to be able to deploy those 
models as part of mobile or other applications without a dependency on Spark. 
It would be especially nice if there were implementations not only for the JVM 
but also for Python, Go, and other languages.

[~MechCoder], applying models often requires less computation than training 
them. The way I interpret the feature: use Spark to train, then have a local, 
non-distributed library for embedding models in other applications.
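
A hypothetical sketch of that split -- train with Spark, apply locally -- where 
{{LocalLinearModel}} is illustrative, not an existing API:

{code}
// Hypothetical: keep only what's needed to apply the model; no Spark
// dependency is required at prediction time.
case class LocalLinearModel(weights: Array[Double], intercept: Double) {
  def predict(features: Array[Double]): Double =
    features.zip(weights).map { case (f, w) => f * w }.sum + intercept
}

// The weights would come from a model fitted with Spark and shipped to the
// embedded application (e.g. as JSON); here they are made up.
val model = LocalLinearModel(Array(0.5, -1.2), 0.3)
println(model.predict(Array(1.0, 2.0)))  // 0.5 - 2.4 + 0.3 = -1.6
{code}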


was (Author: rnowling):
I'm really looking forward to this feature. Spark is great where model training 
is expensive and involves large data sets, but I want to be able to deploy those 
models as part of mobile or other applications without a dependency on Spark. 
It would be especially nice if there were implementations not only for the JVM 
but also for Python, Go, and other languages.

[~MechCoder], applying models often requires less computation than training 
them. The way I interpret the feature: use Spark to train, then have a local, 
non-distributed library for embedding models in other applications.

> Ideas for moving "mllib-local" forward
> --
>
> Key: SPARK-16365
> URL: https://issues.apache.org/jira/browse/SPARK-16365
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Nick Pentreath
>
> Since SPARK-13944 is all done, we should all think about what the "next 
> steps" might be for {{mllib-local}}. E.g., it could be "improve Spark's 
> linear algebra", or "investigate how we will implement local models/pipelines 
> in Spark", etc.
> This ticket is for comments, ideas, brainstormings and PoCs. The separation 
> of linalg into a standalone project turned out to be significantly more 
> complex than originally expected. So I vote we devote sufficient discussion 
> and time to planning out the next move :)






[jira] [Commented] (SPARK-16365) Ideas for moving "mllib-local" forward

2016-07-13 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375928#comment-15375928
 ] 

RJ Nowling commented on SPARK-16365:


I'm really looking forward to this feature. Spark is great where model training 
is expensive and involves large data sets, but I want to be able to deploy those 
models as part of mobile or other applications without a dependency on Spark. 
It would be especially nice if there were implementations not only for the JVM 
but also for Python, Go, and other languages.

[~MechCoder], applying models often requires less computation than training 
them. The way I interpret the feature: use Spark to train, then have a local, 
non-distributed library for embedding models in other applications.

> Ideas for moving "mllib-local" forward
> --
>
> Key: SPARK-16365
> URL: https://issues.apache.org/jira/browse/SPARK-16365
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Nick Pentreath
>
> Since SPARK-13944 is all done, we should all think about what the "next 
> steps" might be for {{mllib-local}}. E.g., it could be "improve Spark's 
> linear algebra", or "investigate how we will implement local models/pipelines 
> in Spark", etc.
> This ticket is for comments, ideas, brainstormings and PoCs. The separation 
> of linalg into a standalone project turned out to be significantly more 
> complex than originally expected. So I vote we devote sufficient discussion 
> and time to planning out the next move :)






[jira] [Commented] (SPARK-14174) Accelerate KMeans via Mini-Batch EM

2016-04-01 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15221806#comment-15221806
 ] 

RJ Nowling commented on SPARK-14174:


This is a dupe of [SPARK-2308], but that issue needs someone to take it over.

> Accelerate KMeans via Mini-Batch EM
> ---
>
> Key: SPARK-14174
> URL: https://issues.apache.org/jira/browse/SPARK-14174
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: zhengruifeng
>Priority: Minor
>
> The MiniBatchKMeans is a variant of the KMeans algorithm which uses 
> mini-batches to reduce the computation time, while still attempting to 
> optimise the same objective function. Mini-batches are subsets of the input 
> data, randomly sampled in each training iteration. These mini-batches 
> drastically reduce the amount of computation required to converge to a local 
> solution. In contrast to other algorithms that reduce the convergence time of 
> k-means, mini-batch k-means produces results that are generally only slightly 
> worse than the standard algorithm.
> I have implemented mini-batch k-means in MLlib, and the acceleration is really 
> significant.
> The MiniBatch KMeans is named XMeans in the following lines.
> val path = "/tmp/mnist8m.scale"
> val data = MLUtils.loadLibSVMFile(sc, path)
> val vecs = data.map(_.features).persist()
> val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
> initializationMode="k-means||", seed=123l)
> km.computeCost(vecs)
> res0: Double = 3.317029898599564E8
> val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
> initializationMode="k-means||", miniBatchFraction=0.1, seed=123l)
> xm.computeCost(vecs)
> res1: Double = 3.3169865959604424E8
> val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
> initializationMode="k-means||", miniBatchFraction=0.01, seed=123l)
> xm2.computeCost(vecs)
> res2: Double = 3.317195831216454E8
> All three training runs above reached the maximum of 10 iterations.
> We can see that the WSSSEs are almost the same, while the running times 
> differ significantly:
> KMeans                             2876 sec
> MiniBatch KMeans (fraction=0.1)     263 sec
> MiniBatch KMeans (fraction=0.01)     90 sec
> With an appropriate fraction, the larger the dataset, the greater the speedup.
> The data used above have 8,100,000 samples and 784 features. It can be 
> downloaded here 
> (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2)






[jira] [Commented] (SPARK-12450) Un-persist broadcasted variables in KMeans

2015-12-21 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15066629#comment-15066629
 ] 

RJ Nowling commented on SPARK-12450:


Filed a PR here: [https://github.com/apache/spark/pull/10415]

> Un-persist broadcasted variables in KMeans
> --
>
> Key: SPARK-12450
> URL: https://issues.apache.org/jira/browse/SPARK-12450
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.4.1, 1.5.2
>Reporter: RJ Nowling
>Priority: Minor
>
> The broadcasted centers in KMeans are never un-persisted.  As a result, 
> memory usage accumulates over time, causing a memory leak.






[jira] [Created] (SPARK-12450) Un-persist broadcasted variables in KMeans

2015-12-21 Thread RJ Nowling (JIRA)
RJ Nowling created SPARK-12450:
--

 Summary: Un-persist broadcasted variables in KMeans
 Key: SPARK-12450
 URL: https://issues.apache.org/jira/browse/SPARK-12450
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.5.2, 1.4.1
Reporter: RJ Nowling
Priority: Minor


The broadcasted centers in KMeans are never un-persisted.  As a result, memory 
usage accumulates over time, causing a memory leak.
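
A minimal sketch of the un-persist pattern such a fix would apply (assumes an 
existing SparkContext {{sc}}; toy data):

{code}
// Broadcast per-iteration state and release it when the iteration is done,
// instead of leaking one broadcast per iteration.
val data = sc.parallelize(Seq(1.0, 2.0, 3.0))
var center = 0.0
for (iter <- 0 until 10) {
  val bcCenter = sc.broadcast(center)
  center = data.map(x => x - bcCenter.value).mean()  // uses the broadcast
  bcCenter.unpersist(blocking = false)               // release executor copies
}
{code}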






[jira] [Commented] (SPARK-4816) Maven profile netlib-lgpl does not work

2015-12-14 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056675#comment-15056675
 ] 

RJ Nowling commented on SPARK-4816:
---

Agreed.  Thanks!

> Maven profile netlib-lgpl does not work
> ---
>
> Key: SPARK-4816
> URL: https://issues.apache.org/jira/browse/SPARK-4816
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.1.0
> Environment: maven 3.0.5 / Ubuntu
>Reporter: Guillaume Pitel
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.1.1, 1.4.2
>
>
> When doing what the documentation recommends to recompile Spark with the 
> netlib native system binding (i.e. to bind with OpenBLAS or, in my case, MKL), 
> mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests 
> clean package
> the resulting assembly jar still lacked the netlib-system class. (I checked 
> the content of spark-assembly...jar.)
> When forcing the netlib-lgpl profile in the MLlib package to be active, the 
> jar is correctly built.
> So I guess it's a problem with the way Maven passes profile activations to 
> child modules.
> Also, despite the documentation claiming that if the job's jar contains 
> netlib with the necessary bindings it should work, it does not. The 
> classloader must be unhappy with two occurrences of netlib?






[jira] [Commented] (SPARK-4816) Maven profile netlib-lgpl does not work

2015-12-14 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056630#comment-15056630
 ] 

RJ Nowling commented on SPARK-4816:
---

Tried with Maven 3.3.9.  I see no issues with the newer version of Maven:

{code}
$ mvn -version
Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 
2015-11-10T16:41:47+00:00)
Maven home: /root/apache-maven-3.3.9
Java version: 1.7.0_85, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.85-2.6.1.2.el7_1.x86_64/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.10.0-229.1.2.el7.x86_64", arch: "amd64", family: 
"unix"
$ zipinfo -1 assembly/target/scala-2.10/spark-assembly-1.4.1-hadoop2.4.0.jar | 
grep netlib-native
netlib-native_ref-osx-x86_64.jnilib
netlib-native_ref-osx-x86_64.jnilib.asc
netlib-native_ref-osx-x86_64.pom
netlib-native_ref-osx-x86_64.pom.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/pom.properties
netlib-native_ref-linux-x86_64.so
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/pom.properties
netlib-native_ref-linux-i686.so
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/pom.properties
netlib-native_ref-win-x86_64.dll
netlib-native_ref-win-x86_64.dll.asc
netlib-native_ref-win-x86_64.pom
netlib-native_ref-win-x86_64.pom.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/pom.properties
netlib-native_ref-win-i686.dll
netlib-native_ref-win-i686.dll.asc
netlib-native_ref-win-i686.pom
netlib-native_ref-win-i686.pom.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/pom.properties
netlib-native_ref-linux-armhf.so
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/pom.properties
netlib-native_system-osx-x86_64.jnilib
netlib-native_system-osx-x86_64.jnilib.asc
netlib-native_system-osx-x86_64.pom
netlib-native_system-osx-x86_64.pom.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/pom.properties
netlib-native_system-linux-x86_64.pom.asc
netlib-native_system-linux-x86_64.pom
netlib-native_system-linux-x86_64.so
netlib-native_system-linux-x86_64.so.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/pom.properties
netlib-native_system-linux-i686.pom
netlib-native_system-linux-i686.so.asc
netlib-native_system-linux-i686.pom.asc
netlib-native_system-linux-i686.so
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/pom.properties
netlib-native_system-linux-armhf.pom
netlib-native_system-linux-armhf.so.asc
netlib-native_system-linux-armhf.pom.asc
netlib-native_system-linux-armhf.so
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-armhf/
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-armhf/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-armhf/pom.properties
netlib-native_system-win-x86_64.dll
netlib-native_system-win-x86_64.dll.asc
netlib-native_system-win-x86_64.pom
netlib-native_system-win-x86_64.pom.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_system-win-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_system-win-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_system-win-x86_64/pom.properties
netlib-native_system-win-i686.dll
netlib-native_system-win-i686.dll.asc
netlib-native_system-win-i686.pom
netlib-native_system-win-i686.pom.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_system-win-i686/
META-INF/maven/com.github.fommil.netlib/netlib-native_system-wi

[jira] [Commented] (SPARK-4816) Maven profile netlib-lgpl does not work

2015-12-14 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056414#comment-15056414
 ] 

RJ Nowling commented on SPARK-4816:
---

Happy to try Maven 3.3.x and report back.  That would certainly confirm 
whether it's a Maven bug or a regression in behavior.

> Maven profile netlib-lgpl does not work
> ---
>
> Key: SPARK-4816
> URL: https://issues.apache.org/jira/browse/SPARK-4816
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.1.0
> Environment: maven 3.0.5 / Ubuntu
>Reporter: Guillaume Pitel
>Priority: Minor
> Fix For: 1.1.1
>
>
> When doing what the documentation recommends to recompile Spark with the 
> netlib native system binding (i.e. to bind with OpenBLAS or, in my case, MKL), 
> mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests 
> clean package
> the resulting assembly jar still lacked the netlib-system class. (I checked 
> the content of spark-assembly...jar.)
> When forcing the netlib-lgpl profile in the MLlib package to be active, the 
> jar is correctly built.
> So I guess it's a problem with the way Maven passes profile activations to 
> child modules.
> Also, despite the documentation claiming that if the job's jar contains 
> netlib with the necessary bindings it should work, it does not. The 
> classloader must be unhappy with two occurrences of netlib?






[jira] [Commented] (SPARK-4816) Maven profile netlib-lgpl does not work

2015-12-14 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056375#comment-15056375
 ] 

RJ Nowling commented on SPARK-4816:
---

Also, what version of Maven are you running?

> Maven profile netlib-lgpl does not work
> ---
>
> Key: SPARK-4816
> URL: https://issues.apache.org/jira/browse/SPARK-4816
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.1.0
> Environment: maven 3.0.5 / Ubuntu
>Reporter: Guillaume Pitel
>Priority: Minor
> Fix For: 1.1.1
>
>
> When doing what the documentation recommends to recompile Spark with the 
> netlib native system binding (i.e. to bind with OpenBLAS or, in my case, MKL), 
> mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests 
> clean package
> the resulting assembly jar still lacked the netlib-system class. (I checked 
> the content of spark-assembly...jar.)
> When forcing the netlib-lgpl profile in the MLlib package to be active, the 
> jar is correctly built.
> So I guess it's a problem with the way Maven passes profile activations to 
> child modules.
> Also, despite the documentation claiming that if the job's jar contains 
> netlib with the necessary bindings it should work, it does not. The 
> classloader must be unhappy with two occurrences of netlib?






[jira] [Commented] (SPARK-4816) Maven profile netlib-lgpl does not work

2015-12-14 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056368#comment-15056368
 ] 

RJ Nowling commented on SPARK-4816:
---

I want to push for two things: (a) some sort of documentation for users (e.g., 
release notes in the next releases) and (b) making sure it's fixed in the 
latest releases.  I want users to be able to find documentation (like this 
JIRA) so they don't have to spend time tracking the problem down like I did.

Spark 1.4.2 hasn't been released yet, and git has moved to a 1.4.3 SNAPSHOT.  
You mention adding the commit to the 1.5.x branch -- has this been done?

Until 1.4.3 and a 1.5.x release are out with your change, this could still hit 
certain users, even if it's rare because it's tied to a specific Maven version 
or such.

> Maven profile netlib-lgpl does not work
> ---
>
> Key: SPARK-4816
> URL: https://issues.apache.org/jira/browse/SPARK-4816
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.1.0
> Environment: maven 3.0.5 / Ubuntu
>Reporter: Guillaume Pitel
>Priority: Minor
> Fix For: 1.1.1
>
>
> When doing what the documentation recommends to recompile Spark with the 
> netlib native system binding (i.e. to bind with OpenBLAS or, in my case, MKL), 
> mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests 
> clean package
> the resulting assembly jar still lacked the netlib-system class. (I checked 
> the content of spark-assembly...jar.)
> When forcing the netlib-lgpl profile in the MLlib package to be active, the 
> jar is correctly built.
> So I guess it's a problem with the way Maven passes profile activations to 
> child modules.
> Also, despite the documentation claiming that if the job's jar contains 
> netlib with the necessary bindings it should work, it does not. The 
> classloader must be unhappy with two occurrences of netlib?






[jira] [Comment Edited] (SPARK-4816) Maven profile netlib-lgpl does not work

2015-12-14 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056168#comment-15056168
 ] 

RJ Nowling edited comment on SPARK-4816 at 12/14/15 4:16 PM:
-

I think [SPARK-9507] fixed the issue. I checked out git commit 
{{5ad9f950c4bd0042d79cdccb5277c10f8412be85}} (the commit before 
[https://github.com/apache/spark/commit/b53ca247d4a965002a9f31758ea2b28fe117d45f])
 and found that the {{netlib-native}} libraries were missing:

{code}
$ git checkout 5ad9f950c4bd0042d79cdccb5277c10f8412be85
$ mvn -Pnetlib-lgpl -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean 
package
$ zipinfo -1 
assembly/target/scala-2.10/spark-assembly-1.4.2-SNAPSHOT-hadoop2.4.0.jar | grep 
netlib-native

(No output)
{code}

I then checked out {{b53ca247d4a965002a9f31758ea2b28fe117d45f}} and built it to 
test:

{code}
zipinfo -1 
assembly/target/scala-2.10/spark-assembly-1.4.2-SNAPSHOT-hadoop2.4.0.jar | grep 
netlib-native
netlib-native_ref-osx-x86_64.jnilib
netlib-native_ref-osx-x86_64.jnilib.asc
netlib-native_ref-osx-x86_64.pom
netlib-native_ref-osx-x86_64.pom.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/pom.properties
netlib-native_ref-linux-x86_64.so
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/pom.properties
netlib-native_ref-linux-i686.so
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/pom.properties
netlib-native_ref-win-x86_64.dll
netlib-native_ref-win-x86_64.dll.asc
netlib-native_ref-win-x86_64.pom
netlib-native_ref-win-x86_64.pom.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/pom.properties
netlib-native_ref-win-i686.dll
netlib-native_ref-win-i686.dll.asc
netlib-native_ref-win-i686.pom
netlib-native_ref-win-i686.pom.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/pom.properties
netlib-native_ref-linux-armhf.so
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/pom.properties
netlib-native_system-osx-x86_64.jnilib
netlib-native_system-osx-x86_64.jnilib.asc
netlib-native_system-osx-x86_64.pom
netlib-native_system-osx-x86_64.pom.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/pom.properties
netlib-native_system-linux-x86_64.pom.asc
netlib-native_system-linux-x86_64.pom
netlib-native_system-linux-x86_64.so
netlib-native_system-linux-x86_64.so.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/pom.properties
netlib-native_system-linux-i686.pom
netlib-native_system-linux-i686.so.asc
netlib-native_system-linux-i686.pom.asc
netlib-native_system-linux-i686.so
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/pom.properties
netlib-native_system-linux-armhf.pom
netlib-native_system-linux-armhf.so.asc
netlib-native_system-linux-armhf.pom.asc
netlib-native_system-linux-armhf.so
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-armhf/
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-armhf/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-armhf/pom.properties
netlib-native_system-win-x86_64.dll
netlib-native_system-win-x86_64.dll.asc
netlib-native_system-win-x86_64.pom
netlib-native_system-win-x86_64.pom.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_system-win-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_system-win-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_system-win-x86_64/pom.properties
netlib-native_system-win-i686.dll
netlib-native_system-win-i

[jira] [Comment Edited] (SPARK-4816) Maven profile netlib-lgpl does not work

2015-12-14 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056168#comment-15056168
 ] 

RJ Nowling edited comment on SPARK-4816 at 12/14/15 4:19 PM:
-

I think [SPARK-9507] fixed the issue. I checked out git commit 
{{5ad9f950c4bd0042d79cdccb5277c10f8412be85}} (the commit before 
[https://github.com/apache/spark/commit/b53ca247d4a965002a9f31758ea2b28fe117d45f])
 and found that the {{netlib-native}} libraries were missing:

{code}
$ git checkout 5ad9f950c4bd0042d79cdccb5277c10f8412be85
$ mvn -Pnetlib-lgpl -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean 
package
$ zipinfo -1 
assembly/target/scala-2.10/spark-assembly-1.4.2-SNAPSHOT-hadoop2.4.0.jar | grep 
netlib-native

(No output)
{code}

I then checked out {{b53ca247d4a965002a9f31758ea2b28fe117d45f}} and built it to 
test:

{code}
$ zipinfo -1 
assembly/target/scala-2.10/spark-assembly-1.4.2-SNAPSHOT-hadoop2.4.0.jar | grep 
netlib-native
netlib-native_ref-osx-x86_64.jnilib
netlib-native_ref-osx-x86_64.jnilib.asc
netlib-native_ref-osx-x86_64.pom
netlib-native_ref-osx-x86_64.pom.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/pom.properties
netlib-native_ref-linux-x86_64.so
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/pom.properties
netlib-native_ref-linux-i686.so
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/pom.properties
netlib-native_ref-win-x86_64.dll
netlib-native_ref-win-x86_64.dll.asc
netlib-native_ref-win-x86_64.pom
netlib-native_ref-win-x86_64.pom.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/pom.properties
netlib-native_ref-win-i686.dll
netlib-native_ref-win-i686.dll.asc
netlib-native_ref-win-i686.pom
netlib-native_ref-win-i686.pom.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/pom.properties
netlib-native_ref-linux-armhf.so
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/pom.properties
netlib-native_system-osx-x86_64.jnilib
netlib-native_system-osx-x86_64.jnilib.asc
netlib-native_system-osx-x86_64.pom
netlib-native_system-osx-x86_64.pom.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/pom.properties
netlib-native_system-linux-x86_64.pom.asc
netlib-native_system-linux-x86_64.pom
netlib-native_system-linux-x86_64.so
netlib-native_system-linux-x86_64.so.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/pom.properties
netlib-native_system-linux-i686.pom
netlib-native_system-linux-i686.so.asc
netlib-native_system-linux-i686.pom.asc
netlib-native_system-linux-i686.so
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/pom.properties
netlib-native_system-linux-armhf.pom
netlib-native_system-linux-armhf.so.asc
netlib-native_system-linux-armhf.pom.asc
netlib-native_system-linux-armhf.so
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-armhf/
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-armhf/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-armhf/pom.properties
netlib-native_system-win-x86_64.dll
netlib-native_system-win-x86_64.dll.asc
netlib-native_system-win-x86_64.pom
netlib-native_system-win-x86_64.pom.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_system-win-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_system-win-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_system-win-x86_64/pom.properties
netlib-native_system-win-i686.dll
netlib-native_system-win

[jira] [Comment Edited] (SPARK-4816) Maven profile netlib-lgpl does not work

2015-12-14 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056168#comment-15056168
 ] 

RJ Nowling edited comment on SPARK-4816 at 12/14/15 3:57 PM:
-

I think [SPARK-9507] fixed the issue. I checked out git commit 
5ad9f950c4bd0042d79cdccb5277c10f8412be85 (the commit before 
[https://github.com/apache/spark/commit/b53ca247d4a965002a9f31758ea2b28fe117d45f])
 and found that the {{netlib-native}} libraries were missing:

{code}
$ git checkout 5ad9f950c4bd0042d79cdccb5277c10f8412be85
$ mvn -Pnetlib-lgpl -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean 
package
$ zipinfo -1 
assembly/target/scala-2.10/spark-assembly-1.4.2-SNAPSHOT-hadoop2.4.0.jar | grep 
netlib-native

(No output)
{code}

As such, the changes in [SPARK-8819] might have been the original cause.


was (Author: rnowling):
I think [SPARK-9507] fixed the issue. I checked out git commit 
5ad9f950c4bd0042d79cdccb5277c10f8412be85 (the commit before 
[https://github.com/apache/spark/commit/b53ca247d4a965002a9f31758ea2b28fe117d45f])
 and found that the {{netlib-native}} libraries were missing:

{code}
$ git checkout 5ad9f950c4bd0042d79cdccb5277c10f8412be85
$ mvn -Pnetlib-lgpl -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean 
package
$ zipinfo -1 
assembly/target/scala-2.10/spark-assembly-1.4.2-SNAPSHOT-hadoop2.4.0.jar | grep 
netlib-native

(No output)
{code}

As such, the changes in [SPARK-8819] might have been the original cause.

> Maven profile netlib-lgpl does not work
> ---
>
> Key: SPARK-4816
> URL: https://issues.apache.org/jira/browse/SPARK-4816
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.1.0
> Environment: maven 3.0.5 / Ubuntu
>Reporter: Guillaume Pitel
>Priority: Minor
> Fix For: 1.1.1
>
>
> When doing what the documentation recommends to recompile Spark with the 
> netlib native system binding (i.e. to bind with OpenBLAS or, in my case, MKL), 
> mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests 
> clean package
> the resulting assembly jar still lacked the netlib-system class. (I checked 
> the content of spark-assembly...jar.)
> When forcing the netlib-lgpl profile in the MLlib package to be active, the 
> jar is correctly built.
> So I guess it's a problem with the way Maven passes profile activations to 
> child modules.
> Also, despite the documentation claiming that if the job's jar contains 
> netlib with the necessary bindings it should work, it does not. The 
> classloader must be unhappy with two occurrences of netlib?






[jira] [Comment Edited] (SPARK-4816) Maven profile netlib-lgpl does not work

2015-12-14 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056168#comment-15056168
 ] 

RJ Nowling edited comment on SPARK-4816 at 12/14/15 3:58 PM:
-

I think [SPARK-9507] fixed the issue. I checked out git commit 
{{5ad9f950c4bd0042d79cdccb5277c10f8412be85}} (the commit before 
[https://github.com/apache/spark/commit/b53ca247d4a965002a9f31758ea2b28fe117d45f])
 and found that the {{netlib-native}} libraries were missing:

{code}
$ git checkout 5ad9f950c4bd0042d79cdccb5277c10f8412be85
$ mvn -Pnetlib-lgpl -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean 
package
$ zipinfo -1 
assembly/target/scala-2.10/spark-assembly-1.4.2-SNAPSHOT-hadoop2.4.0.jar | grep 
netlib-native

(No output)
{code}

As such, the changes in [SPARK-8819] might have been the original cause.


was (Author: rnowling):
I think [SPARK-9507] fixed the issue. I checked out git commit 
5ad9f950c4bd0042d79cdccb5277c10f8412be85 (the commit before 
[https://github.com/apache/spark/commit/b53ca247d4a965002a9f31758ea2b28fe117d45f])
 and found that the {{netlib-native}} libraries were missing:

{code}
$ git checkout 5ad9f950c4bd0042d79cdccb5277c10f8412be85
$ mvn -Pnetlib-lgpl -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean 
package
$ zipinfo -1 
assembly/target/scala-2.10/spark-assembly-1.4.2-SNAPSHOT-hadoop2.4.0.jar | grep 
netlib-native

(No output)
{code}

As such, the changes in [SPARK-8819] might have been the original cause.

> Maven profile netlib-lgpl does not work
> ---
>
> Key: SPARK-4816
> URL: https://issues.apache.org/jira/browse/SPARK-4816
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.1.0
> Environment: maven 3.0.5 / Ubuntu
>Reporter: Guillaume Pitel
>Priority: Minor
> Fix For: 1.1.1
>
>
> When doing what the documentation recommends to recompile Spark with the 
> netlib native system binding (i.e. to bind with OpenBLAS or, in my case, MKL), 
> mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests 
> clean package
> the resulting assembly jar still lacked the netlib-system class. (I checked 
> the content of spark-assembly...jar.)
> When forcing the netlib-lgpl profile in the MLlib package to be active, the 
> jar is correctly built.
> So I guess it's a problem with the way Maven passes profile activations to 
> child modules.
> Also, despite the documentation claiming that if the job's jar contains 
> netlib with the necessary bindings it should work, it does not. The 
> classloader must be unhappy with two occurrences of netlib?






[jira] [Commented] (SPARK-4816) Maven profile netlib-lgpl does not work

2015-12-14 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056168#comment-15056168
 ] 

RJ Nowling commented on SPARK-4816:
---

I think [SPARK-9507] fixed the issue. I checked out git commit 
5ad9f950c4bd0042d79cdccb5277c10f8412be85 (the commit before 
[https://github.com/apache/spark/commit/b53ca247d4a965002a9f31758ea2b28fe117d45f])
 and found that the {{netlib-native}} libraries were missing:

{code}
$ git checkout 5ad9f950c4bd0042d79cdccb5277c10f8412be85
$ mvn -Pnetlib-lgpl -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean 
package
$ zipinfo -1 
assembly/target/scala-2.10/spark-assembly-1.4.2-SNAPSHOT-hadoop2.4.0.jar | grep 
netlib-native

(No output)
{code}

As such, the changes in [SPARK-8819] might have been the original cause.

> Maven profile netlib-lgpl does not work
> ---
>
> Key: SPARK-4816
> URL: https://issues.apache.org/jira/browse/SPARK-4816
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.1.0
> Environment: maven 3.0.5 / Ubuntu
>Reporter: Guillaume Pitel
>Priority: Minor
> Fix For: 1.1.1
>
>
> When doing what the documentation recommends to recompile Spark with the 
> netlib native system binding (i.e. to bind with OpenBLAS or, in my case, MKL), 
> mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests 
> clean package
> the resulting assembly jar still lacked the netlib-system class. (I checked 
> the content of spark-assembly...jar.)
> When forcing the netlib-lgpl profile in the MLlib package to be active, the 
> jar is correctly built.
> So I guess it's a problem with the way Maven passes profile activations to 
> child modules.
> Also, despite the documentation claiming that if the job's jar contains 
> netlib with the necessary bindings it should work, it does not. The 
> classloader must be unhappy with two occurrences of netlib?






[jira] [Comment Edited] (SPARK-4816) Maven profile netlib-lgpl does not work

2015-12-14 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056115#comment-15056115
 ] 

RJ Nowling edited comment on SPARK-4816 at 12/14/15 3:42 PM:
-

I tested it again to make sure and ran into the same issue:

{code}
$ mvn -version
Apache Maven 3.2.5 (12a6b3acb947671f09b81f49094c53f426d8cea1; 
2014-12-14T17:29:23+00:00)
Maven home: /usr/share/apache-maven
Java version: 1.7.0_85, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.85-2.6.1.2.el7_1.x86_64/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.10.0-229.1.2.el7.x86_64", arch: "amd64", family: 
"unix"

$ export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.4.1.tgz
$ tar -xzvf spark-1.4.1.tgz
$ cd spark-1.4.1
$ mvn -Pnetlib-lgpl -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean 
package
$ zipinfo -1 assembly/target/scala-2.10/spark-assembly-1.4.1-hadoop2.4.0.jar | 
grep netlib-native

(No output)
{code}

If I build the head from git {{branch-1.4}} and run {{zipinfo}}:

{code}
$ git clone https://github.com/apache/spark.git spark-1.4-netlib
$ cd spark-1.4-netlib
$ git checkout origin/branch-1.4
$ git log | head
commit c7c99857d47e4ca8373ee9ac59e108a9c443dd05
Author: Sean Owen 
Date:   Tue Dec 8 14:34:47 2015 +

[SPARK-11652][CORE] Remote code execution with InvokerTransformer

Fix commons-collection group ID to commons-collections for version 3.x

Patches earlier PR at https://github.com/apache/spark/pull/9731

$ zipinfo -1 
assembly/target/scala-2.10/spark-assembly-1.4.3-SNAPSHOT-hadoop2.4.0.jar | grep 
netlib-native
netlib-native_ref-osx-x86_64.jnilib
netlib-native_ref-osx-x86_64.jnilib.asc
netlib-native_ref-osx-x86_64.pom
netlib-native_ref-osx-x86_64.pom.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/pom.properties
netlib-native_ref-linux-x86_64.so
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/pom.properties
netlib-native_ref-linux-i686.so
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/pom.properties
netlib-native_ref-win-x86_64.dll
netlib-native_ref-win-x86_64.dll.asc
netlib-native_ref-win-x86_64.pom
netlib-native_ref-win-x86_64.pom.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/pom.properties
netlib-native_ref-win-i686.dll
netlib-native_ref-win-i686.dll.asc
netlib-native_ref-win-i686.pom
netlib-native_ref-win-i686.pom.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/pom.properties
netlib-native_ref-linux-armhf.so
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/pom.properties
netlib-native_system-osx-x86_64.jnilib
netlib-native_system-osx-x86_64.jnilib.asc
netlib-native_system-osx-x86_64.pom
netlib-native_system-osx-x86_64.pom.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/pom.properties
netlib-native_system-linux-x86_64.pom.asc
netlib-native_system-linux-x86_64.pom
netlib-native_system-linux-x86_64.so
netlib-native_system-linux-x86_64.so.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/pom.properties
netlib-native_system-linux-i686.pom
netlib-native_system-linux-i686.so.asc
netlib-native_system-linux-i686.pom.asc
netlib-native_system-linux-i686.so
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/pom.properties
netlib-native_system-linux-armhf.pom
netlib-native_system-linux-armhf.so.asc
netlib-native_system-l

[jira] [Reopened] (SPARK-4816) Maven profile netlib-lgpl does not work

2015-12-14 Thread RJ Nowling (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

RJ Nowling reopened SPARK-4816:
---

> Maven profile netlib-lgpl does not work
> ---
>
> Key: SPARK-4816
> URL: https://issues.apache.org/jira/browse/SPARK-4816
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.1.0
> Environment: maven 3.0.5 / Ubuntu
>Reporter: Guillaume Pitel
>Priority: Minor
> Fix For: 1.1.1
>
>
> When doing what the documentation recommends to recompile Spark with the 
> netlib native system binding (i.e. to bind with OpenBLAS or, in my case, MKL), 
> mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests 
> clean package
> the resulting assembly jar still lacked the netlib-system class. (I checked 
> the content of spark-assembly...jar.)
> When forcing the netlib-lgpl profile in the MLlib package to be active, the 
> jar is correctly built.
> So I guess it's a problem with the way Maven passes profile activations to 
> child modules.
> Also, despite the documentation claiming that if the job's jar contains 
> netlib with the necessary bindings it should work, it does not. The 
> classloader must be unhappy with two occurrences of netlib?






[jira] [Commented] (SPARK-4816) Maven profile netlib-lgpl does not work

2015-12-14 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056115#comment-15056115
 ] 

RJ Nowling commented on SPARK-4816:
---

I tested it again to make sure and ran into the same issue:

{code}
$ mvn -version
Apache Maven 3.2.5 (12a6b3acb947671f09b81f49094c53f426d8cea1; 
2014-12-14T17:29:23+00:00)
Maven home: /usr/share/apache-maven
Java version: 1.7.0_85, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.85-2.6.1.2.el7_1.x86_64/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.10.0-229.1.2.el7.x86_64", arch: "amd64", family: 
"unix"

$ export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.4.1.tgz
$ tar -xzvf spark-1.4.1.tgz
$ cd spark-1.4.1
$ mvn -Pnetlib-lgpl -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean 
package
$ zipinfo -1 assembly/target/scala-2.10/spark-assembly-1.4.1-hadoop2.4.0.jar | 
grep netlib-native

(No output)
{code}

If I build the head from git {{branch-1.4}} and run {{zipinfo}}:

{code}
$ git log | head
commit c7c99857d47e4ca8373ee9ac59e108a9c443dd05
Author: Sean Owen 
Date:   Tue Dec 8 14:34:47 2015 +

[SPARK-11652][CORE] Remote code execution with InvokerTransformer

Fix commons-collection group ID to commons-collections for version 3.x

Patches earlier PR at https://github.com/apache/spark/pull/9731

$ zipinfo -1 
assembly/target/scala-2.10/spark-assembly-1.4.3-SNAPSHOT-hadoop2.4.0.jar | grep 
netlib-native
netlib-native_ref-osx-x86_64.jnilib
netlib-native_ref-osx-x86_64.jnilib.asc
netlib-native_ref-osx-x86_64.pom
netlib-native_ref-osx-x86_64.pom.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/pom.properties
netlib-native_ref-linux-x86_64.so
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/pom.properties
netlib-native_ref-linux-i686.so
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/pom.properties
netlib-native_ref-win-x86_64.dll
netlib-native_ref-win-x86_64.dll.asc
netlib-native_ref-win-x86_64.pom
netlib-native_ref-win-x86_64.pom.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/pom.properties
netlib-native_ref-win-i686.dll
netlib-native_ref-win-i686.dll.asc
netlib-native_ref-win-i686.pom
netlib-native_ref-win-i686.pom.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/pom.properties
netlib-native_ref-linux-armhf.so
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/pom.properties
netlib-native_system-osx-x86_64.jnilib
netlib-native_system-osx-x86_64.jnilib.asc
netlib-native_system-osx-x86_64.pom
netlib-native_system-osx-x86_64.pom.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/pom.properties
netlib-native_system-linux-x86_64.pom.asc
netlib-native_system-linux-x86_64.pom
netlib-native_system-linux-x86_64.so
netlib-native_system-linux-x86_64.so.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/pom.properties
netlib-native_system-linux-i686.pom
netlib-native_system-linux-i686.so.asc
netlib-native_system-linux-i686.pom.asc
netlib-native_system-linux-i686.so
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/pom.properties
netlib-native_system-linux-armhf.pom
netlib-native_system-linux-armhf.so.asc
netlib-native_system-linux-armhf.pom.asc
netlib-native_system-linux-armhf.so
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-armhf/
META-INF/maven/com.github.fommil.netlib/ne

[jira] [Commented] (SPARK-4816) Maven profile netlib-lgpl does not work

2015-12-10 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15051056#comment-15051056
 ] 

RJ Nowling commented on SPARK-4816:
---

Hi [~srowen],

I haven't tried master yet, but that wouldn't address the problem I'm seeing.  
As I said, I downloaded the source tarball from the spark.apache.org web site 
rather than checking out branch-1.4.  I think it has something to do with the 
release process (though I say this without knowing what's involved).  I ran the 
same build command with both the source tarball (which reported excluding the 
native libs in the shading) and the branch-1.4 head from git (which reported 
including the native libs in the shading).

The .m2 repo shouldn't be an issue.  Normally, Spark pulls in the {{core}} 
artifact ID, which excludes the native libraries.  When the {{netlib-lgpl}} 
profile is enabled, the Spark MLLib pom.xml adds the {{all}} artifact ID, which 
pulls in the native libs.  ({{all}} is really just a pom.xml file that pulls in 
{{core}} + the native libs.)

I get that this is weird.  I also get that my knowledge of the release process 
is basically zero.  But I shouldn't get different results from git vs. the 
released source tarball.  Maybe it's not the release process -- maybe something 
has changed in the meantime.  I'll search through the commits on branch-1.4 for 
anything related to shading.

> Maven profile netlib-lgpl does not work
> ---
>
> Key: SPARK-4816
> URL: https://issues.apache.org/jira/browse/SPARK-4816
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.1.0
> Environment: maven 3.0.5 / Ubuntu
>Reporter: Guillaume Pitel
>Priority: Minor
> Fix For: 1.1.1
>
>
> When doing what the documentation recommends to recompile Spark with Netlib 
> Native system binding (i.e. to bind with openblas or, in my case, MKL), 
> mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests 
> clean package
> The resulting assembly jar still lacked the netlib-system class. (I checked 
> the content of spark-assembly...jar)
> When forcing the netlib-lgpl profile in MLLib package to be active, the jar 
> is correctly built.
> So I guess it's a problem with the way maven passes profile activations to 
> child modules.
> Also, despite the documentation claiming that if the job's jar contains 
> netlib with the necessary bindings, it should work, it does not. The 
> classloader must be unhappy with two occurrences of netlib?






[jira] [Comment Edited] (SPARK-4816) Maven profile netlib-lgpl does not work

2015-12-10 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15051056#comment-15051056
 ] 

RJ Nowling edited comment on SPARK-4816 at 12/10/15 2:52 PM:
-

Hi [~srowen],

I haven't tried master yet, but that wouldn't address the problem I'm seeing.  
As I said, I downloaded the source tarball from the spark.apache.org web site 
rather than checking out branch-1.4.  I think it has something to do with the 
release process (though I say this without knowing what's involved).  I ran the 
same build command with both the source tarball (which reported excluding the 
native libs in the shading) and the branch-1.4 head from git (which reported 
including the native libs in the shading).

The .m2 repo shouldn't be an issue.  Normally, Spark pulls in the {{core}} 
artifact ID, which excludes the native libraries.  When the {{netlib-lgpl}} 
profile is enabled, the Spark MLLib pom.xml adds the {{all}} artifact ID, which 
pulls in the native libs.  ({{all}} is really just a pom.xml file that pulls in 
{{core}} + the native libs.)

I get that this is weird.  I also get that my knowledge of the release process 
is basically zero.  But I shouldn't get different results from git vs. the 
released source tarball.  Maybe it's not the release process -- maybe something 
has changed in the meantime.  I'll search through the commits on branch-1.4 for 
anything related to shading.


was (Author: rnowling):
Hi [~srowen],

I haven't tried master yet, but that wouldn't address the problem I'm seeing.  
As I said, I downloaded the source tarball from the spark.apache.org web site 
rather than checking out branch-1.4.  I think it has something to do with the 
release process (though I say this without knowing what's involved).  I ran the 
same build command with both the source tarball (which reported excluding the 
native libs in the shading) and the branch-1.4 head from git (which reported 
including the native libs in the shading).

The .m2 repo shouldn't be an issue.  Normally, Spark pulls in the {{core}} 
artifact ID, which excludes the native libraries.  When the {{netlib-lgpl} 
profile is enabled, the Spark MLLib pom.xml adds the {{all}} artifact ID, which 
pulls in the native libs.  ({{all}} is really just a pom.xml file that pulls in 
{{core}} + the native libs.)

I get that this is weird.  I also get that my knowledge of the release process 
is basically zero.  But I shouldn't get different results from git vs. the 
released source tarball.  Maybe it's not the release process -- maybe something 
has changed in the meantime.  I'll search through the commits on branch-1.4 for 
anything related to shading.

> Maven profile netlib-lgpl does not work
> ---
>
> Key: SPARK-4816
> URL: https://issues.apache.org/jira/browse/SPARK-4816
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.1.0
> Environment: maven 3.0.5 / Ubuntu
>Reporter: Guillaume Pitel
>Priority: Minor
> Fix For: 1.1.1
>
>
> When doing what the documentation recommends to recompile Spark with Netlib 
> Native system binding (i.e. to bind with openblas or, in my case, MKL), 
> mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests 
> clean package
> The resulting assembly jar still lacked the netlib-system class. (I checked 
> the content of spark-assembly...jar)
> When forcing the netlib-lgpl profile in MLLib package to be active, the jar 
> is correctly built.
> So I guess it's a problem with the way maven passes profile activations to 
> child modules.
> Also, despite the documentation claiming that if the job's jar contains 
> netlib with the necessary bindings, it should work, it does not. The 
> classloader must be unhappy with two occurrences of netlib?






[jira] [Reopened] (SPARK-4816) Maven profile netlib-lgpl does not work

2015-12-09 Thread RJ Nowling (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

RJ Nowling reopened SPARK-4816:
---

I ran into the same issue with Spark 1.4.  If I download the tarball from 
{{spark.apache.org}} and build with {{-Pnetlib-lgpl}}, the native libraries are 
excluded from the jar by the shader.  However, if I check out branch-1.4 from 
GitHub and build with that, the appropriate libraries are included.

I don't know much about the source release process, but is it possible that 
something in that process results in different maven builds?

> Maven profile netlib-lgpl does not work
> ---
>
> Key: SPARK-4816
> URL: https://issues.apache.org/jira/browse/SPARK-4816
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.1.0
> Environment: maven 3.0.5 / Ubuntu
>Reporter: Guillaume Pitel
>Priority: Minor
> Fix For: 1.1.1
>
>
> When doing what the documentation recommends to recompile Spark with Netlib 
> Native system binding (i.e. to bind with openblas or, in my case, MKL), 
> mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests 
> clean package
> The resulting assembly jar still lacked the netlib-system class. (I checked 
> the content of spark-assembly...jar)
> When forcing the netlib-lgpl profile in MLLib package to be active, the jar 
> is correctly built.
> So I guess it's a problem with the way maven passes profile activations to 
> child modules.
> Also, despite the documentation claiming that if the job's jar contains 
> netlib with the necessary bindings, it should work, it does not. The 
> classloader must be unhappy with two occurrences of netlib?






[jira] [Commented] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)

2015-07-10 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622342#comment-14622342
 ] 

RJ Nowling commented on SPARK-3644:
---

[~joshrosen] Thanks for pointing to the new JIRA! :)

> REST API for Spark application info (jobs / stages / tasks / storage info)
> --
>
> Key: SPARK-3644
> URL: https://issues.apache.org/jira/browse/SPARK-3644
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Web UI
>Reporter: Josh Rosen
>Assignee: Imran Rashid
> Fix For: 1.4.0
>
>
> This JIRA is a forum to draft a design proposal for a REST interface for 
> accessing information about Spark applications, such as job / stage / task / 
> storage status.
> There have been a number of proposals to serve JSON representations of the 
> information displayed in Spark's web UI.  Given that we might redesign the 
> pages of the web UI (and possibly re-implement the UI as a client of a REST 
> API), the API endpoints and their responses should be independent of what we 
> choose to display on particular web UI pages / layouts.
> Let's start a discussion of what a good REST API would look like from 
> first-principles.  We can discuss what urls / endpoints expose access to 
> data, how our JSON responses will be formatted, how fields will be named, how 
> the API will be documented and tested, etc.
> Some links for inspiration:
> https://developer.github.com/v3/
> http://developer.netflix.com/docs/REST_API_Reference
> https://helloreverb.com/developers/swagger






[jira] [Commented] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)

2015-07-08 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14619239#comment-14619239
 ] 

RJ Nowling commented on SPARK-3644:
---

[~joshrosen] Several users commented above about using a REST API for 
submitting and killing jobs to support integration with Sahara and web-based 
front-ends.  Adding support for killing jobs shouldn't be too hard.  Submitting 
jobs is probably harder to add at the moment since the Spark master doesn't 
exist until the application is launched.  But I think we should acknowledge the 
needs of these users instead of just closing this JIRA.

> REST API for Spark application info (jobs / stages / tasks / storage info)
> --
>
> Key: SPARK-3644
> URL: https://issues.apache.org/jira/browse/SPARK-3644
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Web UI
>Reporter: Josh Rosen
>Assignee: Imran Rashid
> Fix For: 1.4.0
>
>
> This JIRA is a forum to draft a design proposal for a REST interface for 
> accessing information about Spark applications, such as job / stage / task / 
> storage status.
> There have been a number of proposals to serve JSON representations of the 
> information displayed in Spark's web UI.  Given that we might redesign the 
> pages of the web UI (and possibly re-implement the UI as a client of a REST 
> API), the API endpoints and their responses should be independent of what we 
> choose to display on particular web UI pages / layouts.
> Let's start a discussion of what a good REST API would look like from 
> first-principles.  We can discuss what urls / endpoints expose access to 
> data, how our JSON responses will be formatted, how fields will be named, how 
> the API will be documented and tested, etc.
> Some links for inspiration:
> https://developer.github.com/v3/
> http://developer.netflix.com/docs/REST_API_Reference
> https://helloreverb.com/developers/swagger






[jira] [Commented] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)

2015-07-08 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14619144#comment-14619144
 ] 

RJ Nowling commented on SPARK-3644:
---

[~joshrosen] The issue and corresponding PR you reference only seem to provide 
read-only access.  Is that correct?  If so, then are there open issues to 
address the needs of the users above?  Thanks!

> REST API for Spark application info (jobs / stages / tasks / storage info)
> --
>
> Key: SPARK-3644
> URL: https://issues.apache.org/jira/browse/SPARK-3644
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Web UI
>Reporter: Josh Rosen
>Assignee: Imran Rashid
> Fix For: 1.4.0
>
>
> This JIRA is a forum to draft a design proposal for a REST interface for 
> accessing information about Spark applications, such as job / stage / task / 
> storage status.
> There have been a number of proposals to serve JSON representations of the 
> information displayed in Spark's web UI.  Given that we might redesign the 
> pages of the web UI (and possibly re-implement the UI as a client of a REST 
> API), the API endpoints and their responses should be independent of what we 
> choose to display on particular web UI pages / layouts.
> Let's start a discussion of what a good REST API would look like from 
> first-principles.  We can discuss what urls / endpoints expose access to 
> data, how our JSON responses will be formatted, how fields will be named, how 
> the API will be documented and tested, etc.
> Some links for inspiration:
> https://developer.github.com/v3/
> http://developer.netflix.com/docs/REST_API_Reference
> https://helloreverb.com/developers/swagger






[jira] [Commented] (SPARK-4729) Add time series subsampling to MLlib

2015-07-06 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14615016#comment-14615016
 ] 

RJ Nowling commented on SPARK-4729:
---

Hi [~yalamart],  I haven't looked at this in quite a while, but feel free to 
take it over.  Since this would depend on a time index, it doesn't really make 
sense without an agreement on how time series should be recognized.  See the 
comments on [SPARK-4727] for work on time series packages -- maybe you would 
want to look into contributing to those efforts first?  (A sketch of the 
subsampling operation itself is below.)
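
For reference, the operation itself is small once a time index exists.  A 
minimal sketch, assuming the RDD is already in time order ({{subsample}} is a 
hypothetical helper, not an existing MLlib API):

{code}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Keep every n-th element of a time-ordered RDD.
def subsample[T: ClassTag](rdd: RDD[T], n: Int): RDD[T] =
  rdd.zipWithIndex().filter { case (_, i) => i % n == 0 }.map(_._1)
{code}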

> Add time series subsampling to MLlib
> 
>
> Key: SPARK-4729
> URL: https://issues.apache.org/jira/browse/SPARK-4729
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: RJ Nowling
>Priority: Minor
>
> MLlib supports several time series functions.  The ability to subsample a 
> time series (take every n data points) is missing. 
> I'd like to add it, so please assign this to me.






[jira] [Created] (SPARK-6522) Standardize Random Number Generation

2015-03-24 Thread RJ Nowling (JIRA)
RJ Nowling created SPARK-6522:
-

 Summary: Standardize Random Number Generation
 Key: SPARK-6522
 URL: https://issues.apache.org/jira/browse/SPARK-6522
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: RJ Nowling
Priority: Minor


Generation of random numbers in Spark has to be handled carefully since 
references to RNGs copy the state to the workers.  As such, a separate RNG 
needs to be seeded for each partition.  Each time random numbers are used in 
Spark's libraries, the RNG seeding is re-implemented, leaving open the 
possibility of mistakes.

It would be useful if RNG seeding was standardized through utility functions or 
random number generation functions that can be called in Spark pipelines.
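
A minimal sketch of the pattern such a utility would standardize (the 
{{withPartitionRNG}} helper is hypothetical, not existing Spark API):

{code}
import scala.reflect.ClassTag
import scala.util.Random
import org.apache.spark.rdd.RDD

// Seed one RNG per partition so that workers don't all share the state of a
// single driver-side RNG captured in a closure.
def withPartitionRNG[T, U: ClassTag](rdd: RDD[T], seed: Long)
    (f: (Random, Iterator[T]) => Iterator[U]): RDD[U] =
  rdd.mapPartitionsWithIndex { (idx, iter) =>
    val rng = new Random(seed + idx)  // distinct, reproducible seed per partition
    f(rng, iter)
  }
{code}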






[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-11 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357410#comment-14357410
 ] 

RJ Nowling commented on SPARK-2429:
---

I'm familiar with the community interest but I'm not terribly familiar with the 
implementations (old or new).  [~freeman-lab] may be the appropriate person to 
ask for help -- the original implementation was based on his gist.

> Hierarchical Implementation of KMeans
> -
>
> Key: SPARK-2429
> URL: https://issues.apache.org/jira/browse/SPARK-2429
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Assignee: Yu Ishikawa
>Priority: Minor
>  Labels: clustering
> Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
> Result of Benchmarking a Hierarchical Clustering.pdf, 
> benchmark-result.2014-10-29.html, benchmark2.html
>
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib.  Clustering algorithms are useful for determining 
> relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean 
> such as negative dot or cosine are necessary.






[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-11 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357389#comment-14357389
 ] 

RJ Nowling commented on SPARK-2429:
---

[~josephkb]  I think it would be great to get the new implementation into 
Spark, but we need a champion for it.  [~yuu.ishik...@gmail.com] did some great 
work, and I've been trying to shepherd the work, but we need a committer who 
wants to bring it in.  If you want to do that, then I can step back and let you 
and [~yuu.ishik...@gmail.com] bring this across the finish line.

> Hierarchical Implementation of KMeans
> -
>
> Key: SPARK-2429
> URL: https://issues.apache.org/jira/browse/SPARK-2429
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Assignee: Yu Ishikawa
>Priority: Minor
>  Labels: clustering
> Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
> Result of Benchmarking a Hierarchical Clustering.pdf, 
> benchmark-result.2014-10-29.html, benchmark2.html
>
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib.  Clustering algorithms are useful for determining 
> relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean 
> such as negative dot or cosine are necessary.






[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-11 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357367#comment-14357367
 ] 

RJ Nowling commented on SPARK-2429:
---

Hi [~yuu.ishik...@gmail.com]

I think the new implementation is great.  Did you change the algorithm?

I've spoken with [~srowen].  The hierarchical clustering would be valuable to 
the community -- I actually had a couple of people reach out to me about it. 
However, Spark is currently undergoing the transition to the new ML API, and as 
such, there is concern about accepting code into the older MLlib library.  With 
the announcement of Spark packages, there is also a move to encourage external 
libraries instead of large commits into Spark itself.

Would you be interested in publishing your hierarchical clustering 
implementation as an external library like [~derrickburns] did for the [KMeans 
Mini Batch 
implementation|https://github.com/derrickburns/generalized-kmeans-clustering]?  
 It could be listed in the [Spark packages index|http://spark-packages.org/] 
along with two other clustering packages so users can find it.

> Hierarchical Implementation of KMeans
> -
>
> Key: SPARK-2429
> URL: https://issues.apache.org/jira/browse/SPARK-2429
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Assignee: Yu Ishikawa
>Priority: Minor
>  Labels: clustering
> Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
> Result of Benchmarking a Hierarchical Clustering.pdf, 
> benchmark-result.2014-10-29.html, benchmark2.html
>
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib.  Clustering algorithms are useful for determining 
> relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean 
> such as negative dot or cosine are necessary.






[jira] [Commented] (SPARK-6167) Previous Commit Broke BroadcastTest

2015-03-04 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14347773#comment-14347773
 ] 

RJ Nowling commented on SPARK-6167:
---

Great! Thanks!





> Previous Commit Broke BroadcastTest
> ---
>
> Key: SPARK-6167
> URL: https://issues.apache.org/jira/browse/SPARK-6167
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 1.2.1
>Reporter: RJ Nowling
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.1.2, 1.2.2
>
>
> The commit associated with SPARK-1010 spells a class name incorrectly 
> (BroaddcastFactory instead of BroadcastFactory).  As a result, the 
> BroadcastTest doesn't work.






[jira] [Commented] (SPARK-6167) Previous Commit Broke BroadcastTest

2015-03-04 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14347607#comment-14347607
 ] 

RJ Nowling commented on SPARK-6167:
---

This PR fixes the issue in master and the 1.3 branch:

https://github.com/apache/spark/pull/4724

Needs to be merged into 1.2 branch as well.

> Previous Commit Broke BroadcastTest
> ---
>
> Key: SPARK-6167
> URL: https://issues.apache.org/jira/browse/SPARK-6167
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 1.2.1
>Reporter: RJ Nowling
>Priority: Minor
>
> The commit associated with SPARK-1010 spells a class name incorrectly 
> (BroaddcastFactory instead of BroadcastFactory).  As a result, the 
> BroadcastTest doesn't work.






[jira] [Created] (SPARK-6167) Previous Commit Broke BroadcastTest

2015-03-04 Thread RJ Nowling (JIRA)
RJ Nowling created SPARK-6167:
-

 Summary: Previous Commit Broke BroadcastTest
 Key: SPARK-6167
 URL: https://issues.apache.org/jira/browse/SPARK-6167
 Project: Spark
  Issue Type: Bug
  Components: Examples
Affects Versions: 1.2.1
Reporter: RJ Nowling
Priority: Minor


The commit associated with SPARK-1010 spells a class name incorrectly 
(BroaddcastFactory instead of BroadcastFactory).  As a result, the 
BroadcastTest doesn't work.






[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib

2015-03-02 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343829#comment-14343829
 ] 

RJ Nowling commented on SPARK-2308:
---

OK, we should mark this JIRA as Won't Fix (cc [~mengxr] and [~srowen]).

Thanks for the excellent implementation on your GitHub repo!  It will be very 
beneficial to the community!

> Add KMeans MiniBatch clustering algorithm to MLlib
> --
>
> Key: SPARK-2308
> URL: https://issues.apache.org/jira/browse/SPARK-2308
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Assignee: RJ Nowling
>Priority: Minor
>  Labels: clustering
> Attachments: many_small_centers.pdf, uneven_centers.pdf
>
>
> Mini-batch is a version of KMeans that uses a randomly-sampled subset of the 
> data points in each iteration instead of the full set of data points, 
> improving performance (and in some cases, accuracy).  The mini-batch version 
> is compatible with the KMeans|| initialization algorithm currently 
> implemented in MLlib.
> I suggest adding KMeans Mini-batch as an alternative.
> I'd like this to be assigned to me.






[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-02 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343806#comment-14343806
 ] 

RJ Nowling commented on SPARK-2429:
---

[~yuu.ishik...@gmail.com] are you still working on this?  Thanks!

> Hierarchical Implementation of KMeans
> -
>
> Key: SPARK-2429
> URL: https://issues.apache.org/jira/browse/SPARK-2429
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Assignee: Yu Ishikawa
>Priority: Minor
>  Labels: clustering
> Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
> Result of Benchmarking a Hierarchical Clustering.pdf, 
> benchmark-result.2014-10-29.html, benchmark2.html
>
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib.  Clustering algorithms are useful for determining 
> relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean 
> such as negative dot or cosine are necessary.






[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib

2015-03-02 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343804#comment-14343804
 ] 

RJ Nowling commented on SPARK-2308:
---

[~derrickburns] and [~mengxr] Is work still being done on this JIRA and 
Derrick's PR? Thanks!

> Add KMeans MiniBatch clustering algorithm to MLlib
> --
>
> Key: SPARK-2308
> URL: https://issues.apache.org/jira/browse/SPARK-2308
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Assignee: RJ Nowling
>Priority: Minor
>  Labels: clustering
> Attachments: many_small_centers.pdf, uneven_centers.pdf
>
>
> Mini-batch is a version of KMeans that uses a randomly-sampled subset of the 
> data points in each iteration instead of the full set of data points, 
> improving performance (and in some cases, accuracy).  The mini-batch version 
> is compatible with the KMeans|| initialization algorithm currently 
> implemented in MLlib.
> I suggest adding KMeans Mini-batch as an alternative.
> I'd like this to be assigned to me.






[jira] [Commented] (SPARK-2430) Standarized Clustering Algorithm API and Framework

2015-03-01 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342382#comment-14342382
 ] 

RJ Nowling commented on SPARK-2430:
---

I think we can close this JIRA.  It's been superseded by the new Pipeline API 
as you mentioned.

> Standarized Clustering Algorithm API and Framework
> --
>
> Key: SPARK-2430
> URL: https://issues.apache.org/jira/browse/SPARK-2430
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Priority: Minor
>
> Recently, there has been a chorus of voices on the mailing lists about adding 
> new clustering algorithms to MLlib.  To support these additions, we should 
> develop a common framework and API to reduce code duplication and keep the 
> APIs consistent.
> At the same time, we can also expand the current API to incorporate requested 
> features such as arbitrary distance metrics or pre-computed distance matrices.






[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes

2015-01-21 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14285980#comment-14285980
 ] 

RJ Nowling commented on SPARK-4894:
---

[~mengxr] Since [~lmcguire] has submitted the patch, can we assign the JIRA to 
her so she gets credit for it?  Thanks!

> Add Bernoulli-variant of Naive Bayes
> 
>
> Key: SPARK-4894
> URL: https://issues.apache.org/jira/browse/SPARK-4894
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: RJ Nowling
>Assignee: RJ Nowling
>
> MLlib only supports the multinomial-variant of Naive Bayes.  The Bernoulli 
> version of Naive Bayes is more useful for situations where the features are 
> binary values.






[jira] [Commented] (SPARK-5328) Update PySpark MLlib NaiveBayes API to take model type parameter for Bernoulli fit

2015-01-20 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14283887#comment-14283887
 ] 

RJ Nowling commented on SPARK-5328:
---

The Python API for Naive Bayes is located in 
python/pyspark/mllib/classification.py.  The Python implementation calls the 
Scala implementation for training through the interface in 
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala.

The classes in classification.py will need to be updated (with additional pydoc 
tests), a new method will need to be added to PythonMLLibAPI.scala, and the 
Python portion of docs/mllib-naive-bayes.md will need to be updated.  A rough 
sketch of the new entry point is below.
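
The sketch assumes the method name and plumbing; the exact signature would 
follow the existing trainers in PythonMLLibAPI.scala:

{code}
// Hypothetical addition to PythonMLLibAPI.scala: forward the model-type
// parameter from Python to the Scala trainer added by SPARK-4894.
// (JavaRDD, LabeledPoint, NaiveBayes come from the surrounding file's imports.)
def trainNaiveBayesModel(
    data: JavaRDD[LabeledPoint],
    lambda: Double,
    modelType: String): NaiveBayesModel = {
  NaiveBayes.train(data.rdd, lambda, modelType)
}
{code}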



> Update PySpark MLlib NaiveBayes API to take model type parameter for 
> Bernoulli fit
> --
>
> Key: SPARK-5328
> URL: https://issues.apache.org/jira/browse/SPARK-5328
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Leah McGuire
>Priority: Minor
>  Labels: mllib
>
> [SPARK-4894] Adds Bernoulli-variant of Naive Bayes adds Bernoulli fitting to 
> NaiveBayes.scala need to update python API to accept model type parameter.






[jira] [Commented] (SPARK-5272) Refactor NaiveBayes to support discrete and continuous labels,features

2015-01-15 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279258#comment-14279258
 ] 

RJ Nowling commented on SPARK-5272:
---

Hi [~josephkb], 

I can see benefits to your suggestions of feature types (e.g., categorical, 
discrete counts, continuous, binary, etc.).  If we created corresponding 
FeatureLikelihood types (e.g., Bernoulli, Multinomial, Gaussian, etc.), it 
would promote composition, which would be easier to test, debug, and maintain 
than multiple NB subclasses like sklearn's.  Additionally, if the user can 
define a type for each feature, then users can mix and match likelihood types 
as well.  Most NB implementations treat all features the same -- what if we had 
a model that allowed heterogeneous features?  If it works well in NB, it could 
be extended to other parts of MLlib.  (There is likely some overlap with 
decision trees since they support multiple feature types, so we might want to 
see if there is anything there we can reuse.)  At the API level, we could 
provide a basic API which takes {noformat}RDD[Vector[Double]]{noformat} like 
the current API, so that simplicity isn't compromised, and a more advanced API 
for power users.  A rough sketch of the composition idea is below.

Does this sound like I'm understanding you correctly?
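
To make the composition concrete, a rough sketch (every name here is 
hypothetical, not existing MLlib code):

{code}
// One likelihood object per feature type; a model composes one per feature.
trait FeatureLikelihood extends Serializable {
  def logLikelihood(value: Double, params: Array[Double]): Double
}

object BernoulliLikelihood extends FeatureLikelihood {
  // params(0) = log P(x = 1 | class)
  def logLikelihood(value: Double, params: Array[Double]): Double =
    if (value > 0.0) params(0) else math.log1p(-math.exp(params(0)))
}

object GaussianLikelihood extends FeatureLikelihood {
  // params = Array(mean, variance)
  def logLikelihood(value: Double, params: Array[Double]): Double = {
    val (mu, sigma2) = (params(0), params(1))
    -0.5 * (math.log(2.0 * math.Pi * sigma2) + (value - mu) * (value - mu) / sigma2)
  }
}

// Heterogeneous features = one FeatureLikelihood per column:
// val perFeature: Array[FeatureLikelihood] = Array(BernoulliLikelihood, GaussianLikelihood)
{code}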

Re: Decision trees.  Decision tree models generally support different types of 
features (categorical, binary, discrete, continuous).  Does Spark's decision 
tree implementation support those different types?  How are they handled?  Do 
they abstract the feature type?  I feel there could be common ground here.


> Refactor NaiveBayes to support discrete and continuous labels,features
> --
>
> Key: SPARK-5272
> URL: https://issues.apache.org/jira/browse/SPARK-5272
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> This JIRA is to discuss refactoring NaiveBayes in order to support both 
> discrete and continuous labels and features.
> Currently, NaiveBayes supports only discrete labels and features.
> Proposal: Generalize it to support continuous values as well.
> Some items to discuss are:
> * How commonly are continuous labels/features used in practice?  (Is this 
> necessary?)
> * What should the API look like?
> ** E.g., should NB have multiple classes for each type of label/feature, or 
> should it take a general Factor type parameter?






[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes

2015-01-15 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279250#comment-14279250
 ] 

RJ Nowling commented on SPARK-4894:
---

Thanks, [~josephkb]!  I'd be happy to help with the NB refactoring too :) 

> Add Bernoulli-variant of Naive Bayes
> 
>
> Key: SPARK-4894
> URL: https://issues.apache.org/jira/browse/SPARK-4894
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: RJ Nowling
>Assignee: RJ Nowling
>
> MLlib only supports the multinomial-variant of Naive Bayes.  The Bernoulli 
> version of Naive Bayes is more useful for situations where the features are 
> binary values.






[jira] [Comment Edited] (SPARK-4894) Add Bernoulli-variant of Naive Bayes

2015-01-14 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278228#comment-14278228
 ] 

RJ Nowling edited comment on SPARK-4894 at 1/15/15 4:21 AM:


[~josephkb], after some thought, I've come around and think your idea of 1 NB 
class with a Factor type parameter may be the more maintainable choice as well 
as offering some novel functionality.  But, there seems to be a lot to figure 
out (we should be checking the decision tree implementation for example) and I 
don't want to hold up what should be a relatively simple change to support 
Bernoulli NB.  Can we create a new JIRA to discuss the NB refactoring?

Comments about refactoring:
(1) how often is NB used with continuous values?  I see that sklearn supports 
Gaussian NB but is this used in practice?  My understanding is that NB is 
generally used for text classification with counts or binary values, possibly 
weighted by TF-IDF.   We should probably email the users and dev lists to get 
user feedback.  If no one is asking for it, we should shelve it and focus on 
other things.

(2) after some more reflection, I can see a few more benefits to your 
suggestions of feature types (e.g., categorical, discrete counts, continuous, 
binary, etc.).  If we created corresponding FeatureLikelihood types (e.g., 
Bernoulli, Multinomial, Gaussian, etc.), it would promote composition, which 
would be easier to test, debug, and maintain than multiple NB subclasses like 
sklearn's.  Additionally, if the user can define a type for each feature, then 
users can mix and match likelihood types as well.  Most NB implementations 
treat all features the same -- what if we had a model that allowed 
heterogeneous features?  If it works well in NB, it could be extended to other 
parts of MLlib.  (There is likely some overlap with decision trees since they 
support multiple feature types, so we might want to see if there is anything 
there we can reuse.)  At the API level, we could provide a basic API which 
takes {noformat}RDD[Vector[Double]]{noformat} like the current API, so that 
simplicity isn't compromised, and a more advanced API for power users.



was (Author: rnowling):
[~josephkb], after some thought, I've come around and think your idea of 1 NB 
class with a Factor type parameter may be the more maintainable choice as well 
as offering some novel functionality.  But, there seems to be a lot to figure 
out (we should be checking the decision tree implementation for example) and I 
don't want to hold up what should be a relatively simple change to support 
Bernoulli NB.  What do you think?

Comments about refactoring:
(1) how often is NB used with continuous values?  I see that sklearn supports 
Gaussian NB but is this used in practice?  My understanding is that NB is 
generally used for text classification with counts or binary values, possibly 
weighted by TF-IDF.   We should probably email the users and dev lists to get 
user feedback.  If no one is asking for it, we should shelve it and focus on 
other things.

(2) after some more reflection, I can see a few more benefits to your 
suggestions of feature types (e.g., categorical, discrete counts, continuous, 
binary, etc.).  If we created corresponding FeatureLikelihood types (e.g., 
Bernoulli, Multinomial, Gaussian, etc.), it would promote composition, which 
would be easier to test, debug, and maintain than multiple NB subclasses like 
sklearn's.  Additionally, if the user can define a type for each feature, then 
users can mix and match likelihood types as well.  Most NB implementations 
treat all features the same -- what if we had a model that allowed 
heterogeneous features?  If it works well in NB, it could be extended to other 
parts of MLlib.  (There is likely some overlap with decision trees since they 
support multiple feature types, so we might want to see if there is anything 
there we can reuse.)  At the API level, we could provide a basic API which 
takes {noformat}RDD[Vector[Double]]{noformat} like the current API, so that 
simplicity isn't compromised, and a more advanced API for power users.


> Add Bernoulli-variant of Naive Bayes
> 
>
> Key: SPARK-4894
> URL: https://issues.apache.org/jira/browse/SPARK-4894
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: RJ Nowling
>Assignee: RJ Nowling
>
> MLlib only supports the multinomial-variant of Naive Bayes.  The Bernoulli 
> version of Naive Bayes is more useful for situations where the features are 
> binary values.




[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes

2015-01-14 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278228#comment-14278228
 ] 

RJ Nowling commented on SPARK-4894:
---

[~josephkb], after some thought, I've come around and think your idea of 1 NB 
class with a Factor type parameter may be the more maintainable choice as well 
as offering some novel functionality.  But, there seems to be a lot to figure 
out (we should be checking the decision tree implementation for example) and I 
don't want to hold up what should be a relatively simple change to support 
Bernoulli NB.  What do you think?

Comments about refactoring:
(1) how often is NB used with continuous values?  I see that sklearn supports 
Gaussian NB but is this used in practice?  My understanding is that NB is 
generally used for text classification with counts or binary values, possibly 
weighted by TF-IDF.   We should probably email the users and dev lists to get 
user feedback.  If no one is asking for it, we should shelve it and focus on 
other things.

(2) after some more reflection, I can see a few more benefits to your 
suggestions of feature types (e.g., categorical, discrete counts, continuous, 
binary, etc.).  If we created corresponding FeatureLikelihood types (e.g., 
Bernoulli, Multinomial, Gaussian, etc.), it would promote composition, which 
would be easier to test, debug, and maintain than multiple NB subclasses like 
sklearn's.  Additionally, if the user can define a type for each feature, then 
users can mix and match likelihood types as well.  Most NB implementations 
treat all features the same -- what if we had a model that allowed 
heterogeneous features?  If it works well in NB, it could be extended to other 
parts of MLlib.  (There is likely some overlap with decision trees since they 
support multiple feature types, so we might want to see if there is anything 
there we can reuse.)  At the API level, we could provide a basic API which 
takes {noformat}RDD[Vector[Double]]{noformat} like the current API, so that 
simplicity isn't compromised, and a more advanced API for power users.


> Add Bernoulli-variant of Naive Bayes
> 
>
> Key: SPARK-4894
> URL: https://issues.apache.org/jira/browse/SPARK-4894
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: RJ Nowling
>Assignee: RJ Nowling
>
> MLlib only supports the multinomial-variant of Naive Bayes.  The Bernoulli 
> version of Naive Bayes is more useful for situations where the features are 
> binary values.






[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes

2015-01-14 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1420#comment-1420
 ] 

RJ Nowling commented on SPARK-4894:
---

Hi [~josephkb], lots to think about!

In general, I'm a big fan of multiple small changes over time rather than one 
big change.  They're easier to verify and review.  Since MLLib is going through 
an interface refactoring to become ML anyway, we can focus on the Bernoulli NB 
change now and worry about a redesign of the API later.

What do you have in mind for other feature and label types?  I briefly reviewed 
Factorie -- their concept of Factors may be overcomplicated for Naive Bayes, 
but I want to learn more about your ideas.  Do you have a few concrete examples 
of how Factors could be used with NB?  And for continuous labels, are you 
thinking of something like the Gaussian NB in sklearn?

From bioinformatics, I know that folks tend to encode categorical variables 
incorrectly.  E.g., for a DNA sequence consisting of A, T, C, G, and possibly 
gaps, each position in a sequence should be encoded as four (five) features, 
one for each nucleotide.  When folks try to represent each position as one 
feature with the bases as numbers (A=1, T=2, etc.), this results in incorrect 
distance metrics.  E.g., ATT will differ from TTT by 1 but ATT will differ from 
CTT by 2.  By using one feature for each of the four (five) possibilities, you 
get correct distances and can even weight mutations and deletions using BLOSUM 
matrices and such.  For this type of case, I think the solution is education 
and documentation, not complicated type systems.
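
A tiny illustration of the one-hot encoding (plain Scala, nothing 
Spark-specific):

{code}
// Encode base 'C' over the alphabet A, T, C, G as four binary features.
val alphabet = Seq('A', 'T', 'C', 'G')
def oneHot(base: Char): Seq[Double] = alphabet.map(b => if (b == base) 1.0 else 0.0)

oneHot('C')  // Seq(0.0, 0.0, 1.0, 0.0)
{code}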




> Add Bernoulli-variant of Naive Bayes
> 
>
> Key: SPARK-4894
> URL: https://issues.apache.org/jira/browse/SPARK-4894
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: RJ Nowling
>Assignee: RJ Nowling
>
> MLlib only supports the multinomial-variant of Naive Bayes.  The Bernoulli 
> version of Naive Bayes is more useful for situations where the features are 
> binary values.






[jira] [Comment Edited] (SPARK-4894) Add Bernoulli-variant of Naive Bayes

2015-01-14 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277631#comment-14277631
 ] 

RJ Nowling edited comment on SPARK-4894 at 1/14/15 8:50 PM:


Thanks [~lmcguire]!  I'll wait until next week in case you have time to put a 
patch together.

In the meantime, here are my thoughts for the changes:
1. Add an optional `model` variable to the `NaiveBayes` object and class, and 
to `NaiveBayesModel`.  It would be a string with a default value of 
`Multinomial`; for Bernoulli, we can use `Bernoulli`.

2.  In `NaiveBayesModel.predict`, we should compute and store `brzPi + brzTheta 
* testData.toBreeze`. If `testData(i)` is 0, then `brzTheta * 
testData.toBreeze` will be 0. If Bernoulli is enabled, we add `log(1 - 
exp(brzTheta)) * (1 - testData.toBreeze)` to account for the probabilities for 
the 0-valued features.   (Breeze may not allow adding/subtracting scalars and 
vectors/matrices.)

In the current model, no term is added for rows of `testData` that have 0 
entries.  In the Bernoulli model, we would be adding a separate term for 
0-valued features.

Here is the sklearn source for comparison: 
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/naive_bayes.py 
(Look at `_joint_log_likelihood` in the `MultinomialNB` and `BernoulliNB` 
classes.)

Note that sklearn adds the neg prob to all features and subtracts it from 
features with 1-values.
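
To make the arithmetic concrete, a minimal Breeze sketch (variable names mirror 
the comment; the {{bernoulliScores}} helper itself is hypothetical):

{code}
import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}
import breeze.numerics.{exp, log1p}

// Class scores for one test vector x with 0/1 features:
// pi + theta * x covers the 1-valued features; log(1 - exp(theta)) * (1 - x)
// adds the term the multinomial model drops for 0-valued features.
def bernoulliScores(brzPi: BDV[Double], brzTheta: BDM[Double], x: BDV[Double]): BDV[Double] = {
  val negTheta: BDM[Double] = log1p(-exp(brzTheta))
  brzPi + brzTheta * x + negTheta * (BDV.ones[Double](x.length) - x)
}
{code}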

[~mengxr], [~lmcguire], [~josephkb] Any thoughts or comments on any of the 
above?


was (Author: rnowling):
Thanks [~lmcguire]!  I'll wait until next week in case you have time to put a 
patch together.

In the meantime, here are my thoughts for the changes:
1. Add an optional `model` variable to the `NaiveBayes` object and class, and 
to `NaiveBayesModel`.  It would be a string with a default value of 
`Multinomial`; for Bernoulli, we can use `Bernoulli`.

2.  In `NaiveBayesModel.predict`, we should compute and store `brzPi + brzTheta 
* testData.toBreeze`. If `testData(i)` is 0, then `brzTheta * 
testData.toBreeze` will be 0. If Bernoulli is enabled, we add `log(1 - 
exp(brzTheta)) * (1 - testData.toBreeze)` to account for the probabilities for 
the 0-valued features.   (Breeze may not allow adding/subtracting scalars and 
vectors/matrices.)

In the current model, no term is added for rows of `testData` that have 0 
entries.  In the Bernoulli model, we would be adding a separate term for 
0-valued features.

Here is the sklearn source for comparison: 
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/naive_bayes.py

Note that sklearn adds the neg prob to all features and subtracts it from 
features with 1-values.

[~mengxr], [~josephkb] Any thoughts or comments?

> Add Bernoulli-variant of Naive Bayes
> 
>
> Key: SPARK-4894
> URL: https://issues.apache.org/jira/browse/SPARK-4894
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: RJ Nowling
>Assignee: RJ Nowling
>
> MLlib only supports the multinomial-variant of Naive Bayes.  The Bernoulli 
> version of Naive Bayes is more useful for situations where the features are 
> binary values.






[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes

2015-01-14 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277631#comment-14277631
 ] 

RJ Nowling commented on SPARK-4894:
---

Thanks [~lmcguire]!  I'll wait until next week in case you have time to put a 
patch together.

In the meantime, here are my thoughts for the changes:
1. Add an optional `model` variable to the `NaiveBayes` object and class, and 
to `NaiveBayesModel`.  It would be a string with a default value of 
`Multinomial`; for Bernoulli, we can use `Bernoulli`.

2.  In `NaiveBayesModel.predict`, we should compute and store `brzPi + brzTheta 
* testData.toBreeze`. If `testData(i)` is 0, then `brzTheta * 
testData.toBreeze` will be 0. If Bernoulli is enabled, we add `log(1 - 
exp(brzTheta)) * (1 - testData.toBreeze)` to account for the probabilities for 
the 0-valued features.   (Breeze may not allow adding/subtracting scalars and 
vectors/matrices.)

In the current model, no term is added for rows of `testData` that have 0 
entries.  In the Bernoulli model, we would be adding a separate term for 
0-valued features.

Here is the sklearn source for comparison: 
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/naive_bayes.py

Note that sklearn adds the neg prob to all features and subtracts it from 
features with 1-values.

[~mengxr], [~josephkb] Any thoughts or comments?

> Add Bernoulli-variant of Naive Bayes
> 
>
> Key: SPARK-4894
> URL: https://issues.apache.org/jira/browse/SPARK-4894
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: RJ Nowling
>Assignee: RJ Nowling
>
> MLlib only supports the multinomial-variant of Naive Bayes.  The Bernoulli 
> version of Naive Bayes is more useful for situations where the features are 
> binary values.






[jira] [Comment Edited] (SPARK-4894) Add Bernoulli-variant of Naive Bayes

2015-01-13 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276380#comment-14276380
 ] 

RJ Nowling edited comment on SPARK-4894 at 1/14/15 2:06 AM:


Hi [~lmcguire]

Always happy to have more help! :)

I started looking through the Spark NB functions but I haven't started writing 
code yet.  The docs for NB mention that using binary features will cause the 
multinomial NB to act like Bernoulli NB.  I don't believe the documentation is 
correct, at least when smoothing is used, since P(0) != 1 - P(1): the 
multinomial model simply drops 0-valued features from the likelihood, while a 
true Bernoulli model contributes log(1 - P(1)) for them.  I was planning on 
comparing the sklearn implementation with the Spark implementation and showing 
that the docs were wrong.  Once verified, I think the changes will be very 
small to add a Bernoulli mode controlled by a flag in the constructor.  (A 
small numeric illustration is below.)

I won't get to this until next week, though.  If you have time now and want to 
tackle this, I'd be happy to hand it over to you and review any patches.  (I'm 
not a committer, though -- [~mengxr] would have to sign off.)  Otherwise, if 
you want to wait until I have a patch and test it, that could work, too.  What 
do you think?
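
A tiny numeric check of that claim (illustrative numbers only, not taken from 
either implementation):

{code}
// Smoothed P(feature = 1 | class), chosen arbitrarily for illustration:
val p = 0.8
// Multinomial NB: a 0-count feature contributes count * log(p) = 0 -- it drops out.
val multinomialTerm = 0.0 * math.log(p)   // 0.0
// Bernoulli NB: the same feature contributes log(1 - p).
val bernoulliTerm = math.log(1.0 - p)     // log(0.2) ~= -1.609
// The terms differ, so binary features do not make multinomial NB behave
// like Bernoulli NB once smoothing is applied.
{code}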


was (Author: rnowling):
Hi @lmcguire,

Always happy to have more help! :)

I started looking through the Spark NB functions but I haven't started writing 
code yet.  The docs for NB mention that using binary features will cause the 
multinomial NB to act like Bernoulli NB.  I don't believe the documentation is 
correct, at least when smoothing is used, since P(0) != 1 - P(1).  I was 
planning on comparing the sklearn implementation with the Spark implementation 
and showing that the docs are wrong.  Once that is verified, I think the 
changes needed to add a Bernoulli mode controlled by a constructor flag will be 
very small.

I won't get to this until next week, though.  If you have time now and want to 
tackle this, I'd be happy to hand it over to you and review any patches.  (I'm 
not a committer, though -- [~mengxr] would have to sign off.)  Otherwise, if 
you want to wait until I have a patch and test it, that could work, too.  What 
do you think?

> Add Bernoulli-variant of Naive Bayes
> 
>
> Key: SPARK-4894
> URL: https://issues.apache.org/jira/browse/SPARK-4894
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.1.1
>Reporter: RJ Nowling
>
> MLlib only supports the multinomial-variant of Naive Bayes.  The Bernoulli 
> version of Naive Bayes is more useful for situations where the features are 
> binary values.






[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes

2015-01-13 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276380#comment-14276380
 ] 

RJ Nowling commented on SPARK-4894:
---

Hi @lmcguire,

Always happy to have more help! :)

I started looking through the Spark NB functions but I haven't started writing 
code yet.  The docs for NB mention that using binary features will cause the 
multinomial NB to act like Bernoulli NB.  I don't believe the documentation is 
correct, at least when smoothing is used, since P(0) != 1 - P(1).  I was 
planning on comparing the sklearn implementation with the Spark implementation 
and showing that the docs are wrong.  Once that is verified, I think the 
changes needed to add a Bernoulli mode controlled by a constructor flag will be 
very small.

I won't get to this until next week, though.  If you have time now and want to 
tackle this, I'd be happy to hand it over to you and review any patches.  (I'm 
not a committer, though -- [~mengxr] would have to sign off.)  Otherwise, if 
you want to wait until I have a patch and test it, that could work, too.  What 
do you think?
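
As a quick numerical illustration of the P(0) != 1 - P(1) point (toy counts I 
made up -- not the actual estimator code in either library):

{code}
// One class, 10 documents, two binary features: feature 1 appears in 3
// docs, feature 2 in 5.  Laplace smoothing with lambda = 1.
val lambda = 1.0
// Multinomial NB normalizes over the vocabulary (total count N = 3 + 5 = 8):
val multTheta1 = (3.0 + lambda) / (8.0 + 2.0 * lambda)    // = 0.4
// Bernoulli NB models each feature as its own coin over the 10 documents:
val bernTheta1 = (3.0 + lambda) / (10.0 + 2.0 * lambda)   // ~= 0.333
// For a 0-valued feature, the multinomial likelihood contributes nothing
// (an implicit probability of 1), while Bernoulli contributes
// log(1 - bernTheta1) -- so the smoothed models genuinely differ.
{code}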

> Add Bernoulli-variant of Naive Bayes
> 
>
> Key: SPARK-4894
> URL: https://issues.apache.org/jira/browse/SPARK-4894
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.1.1
>Reporter: RJ Nowling
>
> MLlib only supports the multinomial-variant of Naive Bayes.  The Bernoulli 
> version of Naive Bayes is more useful for situations where the features are 
> binary values.






[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes

2014-12-18 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252860#comment-14252860
 ] 

RJ Nowling commented on SPARK-4894:
---

[~mengxr] Could you assign this to me? Thanks!


> Add Bernoulli-variant of Naive Bayes
> 
>
> Key: SPARK-4894
> URL: https://issues.apache.org/jira/browse/SPARK-4894
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.1.1
>Reporter: RJ Nowling
>
> MLlib only supports the multinomial-variant of Naive Bayes.  The Bernoulli 
> version of Naive Bayes is more useful for situations where the features are 
> binary values.






[jira] [Created] (SPARK-4894) Add Bernoulli-variant of Naive Bayes

2014-12-18 Thread RJ Nowling (JIRA)
RJ Nowling created SPARK-4894:
-

 Summary: Add Bernoulli-variant of Naive Bayes
 Key: SPARK-4894
 URL: https://issues.apache.org/jira/browse/SPARK-4894
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.1.1
Reporter: RJ Nowling


MLlib only supports the multinomial-variant of Naive Bayes.  The Bernoulli 
version of Naive Bayes is more useful for situations where the features are 
binary values.






[jira] [Commented] (SPARK-4891) Add exponential, log normal, and gamma distributions to data generator to PySpark's MLlib

2014-12-18 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252534#comment-14252534
 ] 

RJ Nowling commented on SPARK-4891:
---

[~mengxr] Could you assign this to me?  Thanks! :)

> Add exponential, log normal, and gamma distributions to data generator to 
> PySpark's MLlib
> -
>
> Key: SPARK-4891
> URL: https://issues.apache.org/jira/browse/SPARK-4891
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 1.1.1
>Reporter: RJ Nowling
>Priority: Minor
>
> [SPARK-4728] adds sampling from exponential, gamma, and log normal 
> distributions to the Scala/Java MLlib APIs.  We need to add these functions 
> to the PySpark MLlib API for parity.






[jira] [Created] (SPARK-4891) Add exponential, log normal, and gamma distributions to data generator to PySpark's MLlib

2014-12-18 Thread RJ Nowling (JIRA)
RJ Nowling created SPARK-4891:
-

 Summary: Add exponential, log normal, and gamma distributions to 
data generator to PySpark's MLlib
 Key: SPARK-4891
 URL: https://issues.apache.org/jira/browse/SPARK-4891
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 1.1.1
Reporter: RJ Nowling
Priority: Minor


[SPARK-4728] adds sampling from exponential, gamma, and log normal 
distributions to the Scala/Java MLlib APIs.  We need to add these functions to 
the PySpark MLlib API for parity.






[jira] [Commented] (SPARK-4728) Add exponential, log normal, and gamma distributions to data generator to MLlib

2014-12-18 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252224#comment-14252224
 ] 

RJ Nowling commented on SPARK-4728:
---

[~mengxr] can you assign this JIRA to me since I've created a PR?  Thanks!

> Add exponential, log normal, and gamma distributions to data generator to 
> MLlib
> ---
>
> Key: SPARK-4728
> URL: https://issues.apache.org/jira/browse/SPARK-4728
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: RJ Nowling
>Priority: Minor
>
> MLlib supports sampling from normal, uniform, and Poisson distributions.  
> I'd like to add support for sampling from exponential, gamma, and log normal 
> distributions, using the features of math3 like the other generators.
> Please assign this to me.






[jira] [Issue Comment Deleted] (SPARK-4728) Add exponential, log normal, and gamma distributions to data generator to MLlib

2014-12-11 Thread RJ Nowling (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

RJ Nowling updated SPARK-4728:
--
Comment: was deleted

(was: I posted a PR for this issue:
https://github.com/apache/spark/pull/3680)

> Add exponential, log normal, and gamma distributions to data generator to 
> MLlib
> ---
>
> Key: SPARK-4728
> URL: https://issues.apache.org/jira/browse/SPARK-4728
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: RJ Nowling
>Priority: Minor
>
> MLlib supports sampling from normal, uniform, and Poisson distributions.  
> I'd like to add support for sampling from exponential, gamma, and log normal 
> distributions, using the features of math3 like the other generators.
> Please assign this to me.






[jira] [Commented] (SPARK-4728) Add exponential, log normal, and gamma distributions to data generator to MLlib

2014-12-11 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242924#comment-14242924
 ] 

RJ Nowling commented on SPARK-4728:
---

I posted a PR for this issue:
https://github.com/apache/spark/pull/3680

> Add exponential, log normal, and gamma distributions to data generator to 
> MLlib
> ---
>
> Key: SPARK-4728
> URL: https://issues.apache.org/jira/browse/SPARK-4728
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: RJ Nowling
>Priority: Minor
>
> MLlib supports sampling from normal, uniform, and Poisson distributions.  
> I'd like to add support for sampling from exponential, gamma, and log normal 
> distributions, using the features of math3 like the other generators.
> Please assign this to me.






[jira] [Commented] (SPARK-4727) Add "dimensional" RDDs (time series, spatial)

2014-12-04 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234399#comment-14234399
 ] 

RJ Nowling commented on SPARK-4727:
---

Thanks, Jeremy!

Your work may cover my needs, and if not, it seems like a great place to 
contribute to!

Was there some talk about encouraging people to build Spark libraries and 
putting together a community list?  I'd love to see this sort of work 
advertised more.

> Add "dimensional" RDDs (time series, spatial)
> -
>
> Key: SPARK-4727
> URL: https://issues.apache.org/jira/browse/SPARK-4727
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: RJ Nowling
>
> Certain types of data (time series, spatial) can benefit from specialized 
> RDDs.  I'd like to open a discussion about this.
> For example, time series data should be ordered by time and would benefit 
> from operations like:
> * Subsampling (taking every n data points)
> * Signal processing (correlations, FFTs, filtering)
> * Windowing functions
> Spatial data benefits from ordering and partitioning along a 2D or 3D grid.  
> For example, path finding algorithms can be optimized by only comparing points 
> within a set distance, which can be computed more efficiently by partitioning 
> data into a grid.
> Although the operations on time series and spatial data may be different, 
> there is some commonality in the sense of the data having ordered dimensions 
> and the implementations may overlap.






[jira] [Created] (SPARK-4729) Add time series subsampling to MLlib

2014-12-03 Thread RJ Nowling (JIRA)
RJ Nowling created SPARK-4729:
-

 Summary: Add time series subsampling to MLlib
 Key: SPARK-4729
 URL: https://issues.apache.org/jira/browse/SPARK-4729
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.1.0
Reporter: RJ Nowling
Priority: Minor


MLlib supports several time series functions.  The ability to subsample a time 
series (take every n data points) is missing. 

I'd like to add it, so please assign this to me.
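
Sketch of the idea on a plain RDD (the names `series` and `n` are made up for 
illustration; this is not the proposed API):

{code}
// Keep every n-th element, assuming the elements are already ordered by
// time.  zipWithIndex attaches a stable Long index to each element.
val subsampled = series.zipWithIndex.collect {
  case (x, i) if i % n == 0 => x
}
{code}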






[jira] [Created] (SPARK-4728) Add exponential, log normal, and gamma distributions to data generator to MLlib

2014-12-03 Thread RJ Nowling (JIRA)
RJ Nowling created SPARK-4728:
-

 Summary: Add exponential, log normal, and gamma distributions to 
data generator to MLlib
 Key: SPARK-4728
 URL: https://issues.apache.org/jira/browse/SPARK-4728
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.1.0
Reporter: RJ Nowling
Priority: Minor


MLlib supports sampling from normal, uniform, and Poisson distributions.  

I'd like to add support for sampling from exponential, gamma, and log normal 
distributions, using the features of math3 like the other generators.

Please assign this to me.






[jira] [Created] (SPARK-4727) Add "dimensional" RDDs (time series, spatial)

2014-12-03 Thread RJ Nowling (JIRA)
RJ Nowling created SPARK-4727:
-

 Summary: Add "dimensional" RDDs (time series, spatial)
 Key: SPARK-4727
 URL: https://issues.apache.org/jira/browse/SPARK-4727
 Project: Spark
  Issue Type: Brainstorming
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: RJ Nowling


Certain types of data (time series, spatial) can benefit from specialized 
RDDs.  I'd like to open a discussion about this.

For example, time series data should be ordered by time and would benefit from 
operations like:
* Subsampling (taking every n data points)
* Signal processing (correlations, FFTs, filtering)
* Windowing functions

Spatial data benefits from ordering and partitioning along a 2D or 3D grid.  
For example, path finding algorithms can be optimized by only comparing points 
within a set distance, which can be computed more efficiently by partitioning 
data into a grid.

Although the operations on time series and spatial data may be different, there 
is some commonality in the sense of the data having ordered dimensions and the 
implementations may overlap.






[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-11-16 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213945#comment-14213945
 ] 

RJ Nowling commented on SPARK-2429:
---

Hi Yu,

I'm having trouble finding the function to cut a dendrogram -- I see the tests 
but not the implementation.

I feel that you should be able to assign values in O(log N) time with the 
hierarchical method vs O(N) with standard kmeans, where N is the number of 
clusters.  So, say you train a model (this may be slower than kmeans) and then 
assign additional points to clusters after training.  If clusters at the same 
level in the hierarchy do not overlap, you should be able to choose the closest 
cluster at each level until you find a leaf.  I'm assuming that the children of 
a given cluster are contained within that cluster (spatially) -- can you show 
this or find a reference for it?  If so, then assignment should be faster for a 
larger number of clusters, as Jun was saying above.

Do you agree with this?  Or is there something I am misunderstanding?

Thanks!

> Hierarchical Implementation of KMeans
> -
>
> Key: SPARK-2429
> URL: https://issues.apache.org/jira/browse/SPARK-2429
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Assignee: Yu Ishikawa
>Priority: Minor
> Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
> Result of Benchmarking a Hierarchical Clustering.pdf, 
> benchmark-result.2014-10-29.html, benchmark2.html
>
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib.  Clustering algorithms are useful for determining 
> relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean 
> such as negative dot or cosine are necessary.






[jira] [Commented] (SPARK-4158) Spark throws exception when Mesos resources are missing

2014-10-31 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192448#comment-14192448
 ] 

RJ Nowling commented on SPARK-4158:
---

I verified that the associated patch fixes this issue on our local cluster 
running Spark 1.1.0 and Mesos 0.21.

> Spark throws exception when Mesos resources are missing
> ---
>
> Key: SPARK-4158
> URL: https://issues.apache.org/jira/browse/SPARK-4158
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.1.0
>Reporter: Brenden Matthews
>
> Spark throws an exception when trying to check resources which haven't been 
> offered by Mesos.  This is an error in Spark, and should be corrected as 
> such.  Here's a sample:
> {code}
> val data Exception in thread "Thread-41" java.lang.IllegalArgumentException: 
> No resource called cpus in [name: "mem"
> type: SCALAR
> scalar {
>   value: 2067.0
> }
> role: "*"
> , name: "disk"
> type: SCALAR
> scalar {
>   value: 900.0
> }
> role: "*"
> , name: "ports"
> type: RANGES
> ranges {
>   range {
> begin: 31000
> end: 32000
>   }
> }
> role: "*"
> ]
> at 
> org.apache.spark.scheduler.cluster.mesos.CoarseMesosSchedulerBackend.org$apache$spark$scheduler$cluster$mesos$CoarseMesosSchedulerBackend$$getResource(CoarseMesosSchedulerBackend.scala:236)
> at 
> org.apache.spark.scheduler.cluster.mesos.CoarseMesosSchedulerBackend$$anonfun$resourceOffers$1.apply(CoarseMesosSchedulerBackend.scala:200)
> at 
> org.apache.spark.scheduler.cluster.mesos.CoarseMesosSchedulerBackend$$anonfun$resourceOffers$1.apply(CoarseMesosSchedulerBackend.scala:197)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> org.apache.spark.scheduler.cluster.mesos.CoarseMesosSchedulerBackend.resourceOffers(CoarseMesosSchedulerBackend.scala:197)
> {code}






[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-10-29 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188325#comment-14188325
 ] 

RJ Nowling commented on SPARK-2429:
---

The sparsity tests look good.  Have you compared training and assignment time 
to KMeans yet?  An improvement in assignment time will be important.  Also, I 
don't see a breakdown of the total time by splitting clusters, assignments, 
etc.  It doesn't need to cover every combination of parameters -- just one or 
two.  That would be very helpful.  Thanks!

> Hierarchical Implementation of KMeans
> -
>
> Key: SPARK-2429
> URL: https://issues.apache.org/jira/browse/SPARK-2429
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Assignee: Yu Ishikawa
>Priority: Minor
> Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
> Result of Benchmarking a Hierarchical Clustering.pdf, 
> benchmark-result.2014-10-29.html, benchmark2.html
>
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib.  Clustering algorithms are useful for determining 
> relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean 
> such as negative dot or cosine are necessary.






[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-10-23 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181183#comment-14181183
 ] 

RJ Nowling commented on SPARK-2429:
---

I added a couple comments to the PR.  

I would say stick with Euclidean distance for now.

For assignment, you should be able to do a binary search.  E.g., if a center 
has children, check which of the two children the point is closer to and choose 
that child.  Repeat until you hit a leaf (a cluster with no children).
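
A minimal sketch of that descent (the `ClusterNode` type is made up for 
illustration; it is not the class in the PR):

{code}
import breeze.linalg.{DenseVector => BDV, squaredDistance}

// Hypothetical tree node: a center plus its (possibly empty) children.
case class ClusterNode(center: BDV[Double], children: Seq[ClusterNode])

// Pick the closer child at each level and recurse: O(log k) distance
// comparisons for k leaves, vs O(k) for flat KMeans assignment.
def assign(node: ClusterNode, point: BDV[Double]): ClusterNode =
  if (node.children.isEmpty) node
  else assign(node.children.minBy(c => squaredDistance(c.center, point)), point)
{code}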

I saw that you added logging for timing but can you update your report with the 
timing breakdown for each stage?

Thanks!

> Hierarchical Implementation of KMeans
> -
>
> Key: SPARK-2429
> URL: https://issues.apache.org/jira/browse/SPARK-2429
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Assignee: Yu Ishikawa
>Priority: Minor
> Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
> Result of Benchmarking a Hierarchical Clustering.pdf, benchmark2.html
>
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib.  Clustering algorithms are useful for determining 
> relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean 
> such as negative dot or cosine are necessary.






[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-10-22 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179890#comment-14179890
 ] 

RJ Nowling commented on SPARK-2429:
---

A 6x performance improvement is a great result!

Can you add a breakdown of the timings for each part of the algorithm?  (e.g., 
like you did to find out which parts were slowest?)  You don't need to do a 
sweep over multiple data sizes or numbers of data points -- just pick a 
representative number of data points and rows.

Have you compared the performance of the hierarchical KMeans vs the KMeans 
implemented in MLlib?  I expect that the hierarchical version will be slower to 
cluster but that assignment should be faster (O(log k) vs O(k)).  This 
improvement in assignment speed is the motivation for including the 
hierarchical KMeans in Spark.

Thanks!

> Hierarchical Implementation of KMeans
> -
>
> Key: SPARK-2429
> URL: https://issues.apache.org/jira/browse/SPARK-2429
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Assignee: Yu Ishikawa
>Priority: Minor
> Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
> Result of Benchmarking a Hierarchical Clustering.pdf, benchmark2.html
>
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib.  Clustering algorithms are useful for determining 
> relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean 
> such as negative dot or cosine are necessary.






[jira] [Commented] (SPARK-4040) calling count() on RDD's emitted from a DStream blocks forEachRDD progress.

2014-10-22 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179882#comment-14179882
 ] 

RJ Nowling commented on SPARK-4040:
---

I don't think you can access an RDD from within an operation performed on 
another RDD.  Your code example may even be trying to serialize the RDD along 
with the operation, which may not be possible.

You would want to call {{count()}} outside the operation and pass the result in 
through the closure or a broadcast variable.
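
For example, a sketch of the safe pattern (the `data` RDD name is made up for 
illustration):

{code}
// Compute the count on the driver, outside any RDD operation; the closure
// below then captures a plain Long rather than an RDD reference.
val n = data.count()
val normalized = data.map(x => x / n)
// For large shared values, sc.broadcast(n) and reading bc.value inside
// the closure is the equivalent broadcast-variable form.
{code}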

> calling count() on RDD's emitted from a DStream blocks forEachRDD progress.
> ---
>
> Key: SPARK-4040
> URL: https://issues.apache.org/jira/browse/SPARK-4040
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: jay vyas
>
> Please note that I'm somewhat new to Spark Streaming's API, and am not a Spark 
> expert - so I've done my best to write up and reproduce this "bug".  If it's 
> not a bug I hope an expert will help to explain why and promptly close it.  
> However, it appears it could be a bug after discussing with [~rnowling], who 
> is a Spark contributor.
> CC [~rnowling] [~willbenton] 
>  
> It appears that in a DStream context, a call to   {{MappedRDD.count()}} 
> blocks progress and prevents emission of RDDs from a stream.
> {noformat}
> tweetStream.foreachRDD((rdd,lent)=> {
>   tweetStream.repartition(1)
>   //val count = rdd.count()  DONT DO THIS !
>   checks += 1;
>   if (checks > 20) {
> ssc.stop()
>   }
>}
> {noformat} 
> The above code block should inevitably halt, after 20 intervals of RDDs... 
> However, if we *uncomment the call* to {{rdd.count()}}, it turns out that we 
> get an *infinite stream which emits no RDDs*, and thus our program *runs 
> forever* (ssc.stop is unreachable), because *forEach doesn't receive any more 
> entries*.  
> I suspect this is actually because the foreach block never completes, because 
> {{count()}} winds up calling {{compute}}, which ultimately just reads from 
> the stream.
> I haven't put together a minimal reproducer or unit test yet, but I can work 
> on doing so if more info is needed.
> I guess this could be seen as an application bug - but I think Spark might be 
> made smarter to throw its hands up when people execute blocking code in a 
> stream processor. 






[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-10-15 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172486#comment-14172486
 ] 

RJ Nowling commented on SPARK-2429:
---

Great to know! I'm glad that isn't a bottleneck.

Have you been able to benchmark each of the major steps?  Which steps are
most expensive?



> Hierarchical Implementation of KMeans
> -
>
> Key: SPARK-2429
> URL: https://issues.apache.org/jira/browse/SPARK-2429
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Assignee: Yu Ishikawa
>Priority: Minor
> Attachments: The Result of Benchmarking a Hierarchical Clustering.pdf
>
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib.  Clustering algorithms are useful for determining 
> relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean 
> such as negative dot or cosine are necessary.






[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-10-09 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165480#comment-14165480
 ] 

RJ Nowling commented on SPARK-2429:
---

Great work, Yu!

Ok, first off, let me make sure I understand what you're doing.  You start with 
2 centers.  You assign all the points.  You then apply KMeans recursively to 
each cluster, splitting each center into 2 centers.  Each instance of KMeans 
stops when the error is below a certain value or a fixed number of iterations 
have been run.

I think your analysis of the overall run time is good and probably what we 
expect.  Can you break down the timing to see which parts are the most 
expensive?  Maybe we can figure out where to optimize it.

A few thoughts on optimization:
1. It might be good to convert everything to Breeze vectors before you do any 
operations -- otherwise you end up converting the same vectors over and over 
again.  KMeans converts them at the beginning and converts the vectors for the 
centers back at the end.

2. Instead of passing the centers as part of the EuclideanClosestCenterFinder, 
look into using a broadcast variable.  See the latest KMeans implementation. 
This could improve performance by 10%+.

3. You may want to look into using reduceByKey or similar RDD operations -- 
they enable parallel reductions, which will be faster than a loop on the 
master.  (See the sketch below.)
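
Here is a rough sketch of points 2 and 3 together, assuming Breeze vectors and 
made-up names (`sc`, `centers`, and `points: RDD[BDV[Double]]`) -- illustrative 
only, not the PR:

{code}
import breeze.linalg.{DenseVector => BDV, squaredDistance}

// Point 2: broadcast the centers once per iteration instead of shipping
// them inside the closure of every task.
val bcCenters = sc.broadcast(centers)        // centers: Array[BDV[Double]]

// Point 3: one KMeans-style update step as a parallel reduction.
val newCenters = points
  .map { p =>
    val cs = bcCenters.value
    val closest = cs.indices.minBy(i => squaredDistance(cs(i), p))
    (closest, (p, 1L))
  }
  .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
  .mapValues { case (sum, n) => sum * (1.0 / n) }  // mean point per cluster
  .collectAsMap()
{code}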

If you look at the JIRAs and PRs, there is some recent work to speed up KMeans 
-- maybe some of that is applicable?

I'll probably have more questions -- it's a good way of helping me understand 
what you're doing :)

> Hierarchical Implementation of KMeans
> -
>
> Key: SPARK-2429
> URL: https://issues.apache.org/jira/browse/SPARK-2429
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Assignee: Yu Ishikawa
>Priority: Minor
> Attachments: The Result of Benchmarking a Hierarchical Clustering.pdf
>
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib.  Clustering algorithms are useful for determining 
> relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean 
> such as negative dot or cosine are necessary.






[jira] [Commented] (SPARK-3785) Support off-loading computations to a GPU

2014-10-08 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14163712#comment-14163712
 ] 

RJ Nowling commented on SPARK-3785:
---

Part of my graduate work involved implementing physics simulations on GPUs and 
managing multi-user GPU clusters.  

From a performance perspective, we saw 100x+ speed-ups on a single machine 
with a GPU vs multiple cores, using specialized GPU implementations such as 
OpenMM or Gromacs.  But this was using hand-optimized GPU implementations that 
were pipelined to prevent unnecessary host/GPU copies and do as much work as 
possible on the GPU.

For clusters, we'd get a 2-5x speed-up due to communication overhead between 
the host/GPU and other nodes.  In these cases, you could only run a few iterations 
on the GPU before you had to communicate with other nodes.

Thus, GPUs are great if you're doing computation that will run using 
hand-optimized GPU implementations for long periods of time before 
communicating outside the GPU.  But I think you won't get much of a performance 
improvement using simple operations (like RDD operations) without explicit (and 
challenging) pipeline optimization work.

I think the most practical case for Spark/GPU integration is jobs involving 
large chunks of image processing, rendering, linear algebra, etc. work that can 
be done independently in each task.  For example, Naive Bayes where the number 
of features is large enough to fit on the GPUs in a single node but there are 
many, many samples to classify.  In this case, you may be able to use a GPU 
linear algebra library to do the GPU operations and move data asynchronously 
and in large chunks to reduce performance issues.

Further, GPU scheduling is immature.  There is very little isolation, GPUs 
often get into bad states that require machine reboots, and there is no OS 
support, so scheduling is mostly handled by each application.  It's like MacOS 
9 -- you have to hope each process is a responsible citizen.  I think that 
would end up being a huge distraction for Spark's developers.

I think [~srowen]'s point about calling GPU libraries from your Spark driver is 
probably the most practical solution.


> Support off-loading computations to a GPU
> -
>
> Key: SPARK-3785
> URL: https://issues.apache.org/jira/browse/SPARK-3785
> Project: Spark
>  Issue Type: Brainstorming
>  Components: MLlib
>Reporter: Thomas Darimont
>Priority: Minor
>
> Are there any plans to adding support for off-loading computations to the 
> GPU, e.g. via an open-cl binding? 
> http://www.jocl.org/
> https://code.google.com/p/javacl/
> http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL






[jira] [Commented] (SPARK-3614) Filter on minimum occurrences of a term in IDF

2014-09-22 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143898#comment-14143898
 ] 

RJ Nowling commented on SPARK-3614:
---

It could lead to over-fitting and thus mis-predictions.  In such cases, it may 
be valuable to exclude overly-specific terms.

> Filter on minimum occurrences of a term in IDF 
> ---
>
> Key: SPARK-3614
> URL: https://issues.apache.org/jira/browse/SPARK-3614
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Jatinpreet Singh
>Assignee: RJ Nowling
>Priority: Minor
>  Labels: TFIDF
>
> The IDF class in MLlib does not provide the capability of defining a minimum 
> number of documents a term should appear in the corpus. The idea is to have a 
> cutoff variable which defines this minimum occurrence value, and the terms 
> which have lower frequency are ignored.
> Mathematically,
> IDF(t,D)=log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >=minimumOccurance
> where, 
> D is the total number of documents in the corpus
> DF(t,D) is the number of documents that contain the term t
> minimumOccurance is the minimum number of documents the term appears in the 
> document corpus
> This would have an impact on accuracy as terms that appear in less than a 
> certain limit of documents, have low or no importance in TFIDF vectors.






[jira] [Comment Edited] (SPARK-3614) Filter on minimum occurrences of a term in IDF

2014-09-22 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142860#comment-14142860
 ] 

RJ Nowling edited comment on SPARK-3614 at 9/22/14 5:52 PM:


Thanks, Andrew! I'll do that.


was (Author: rnowling):
Thanks, Andrew! I'll do that.






> Filter on minimum occurrences of a term in IDF 
> ---
>
> Key: SPARK-3614
> URL: https://issues.apache.org/jira/browse/SPARK-3614
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Jatinpreet Singh
>Assignee: RJ Nowling
>Priority: Minor
>  Labels: TFIDF
>
> The IDF class in MLlib does not provide the capability of defining a minimum 
> number of documents a term should appear in the corpus. The idea is to have a 
> cutoff variable which defines this minimum occurrence value, and the terms 
> which have lower frequency are ignored.
> Mathematically,
> IDF(t,D)=log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >=minimumOccurance
> where, 
> D is the total number of documents in the corpus
> DF(t,D) is the number of documents that contain the term t
> minimumOccurance is the minimum number of documents the term appears in the 
> document corpus
> This would have an impact on accuracy as terms that appear in less than a 
> certain limit of documents, have low or no importance in TFIDF vectors.






[jira] [Commented] (SPARK-3614) Filter on minimum occurrences of a term in IDF

2014-09-21 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142860#comment-14142860
 ] 

RJ Nowling commented on SPARK-3614:
---

Thanks, Andrew! I'll do that.






> Filter on minimum occurrences of a term in IDF 
> ---
>
> Key: SPARK-3614
> URL: https://issues.apache.org/jira/browse/SPARK-3614
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Jatinpreet Singh
>Assignee: RJ Nowling
>Priority: Minor
>  Labels: TFIDF
>
> The IDF class in MLlib does not provide the capability of defining a minimum 
> number of documents a term should appear in the corpus. The idea is to have a 
> cutoff variable which defines this minimum occurrence value, and the terms 
> which have lower frequency are ignored.
> Mathematically,
> IDF(t,D)=log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >=minimumOccurance
> where, 
> D is the total number of documents in the corpus
> DF(t,D) is the number of documents that contain the term t
> minimumOccurance is the minimum number of documents the term appears in the 
> document corpus
> This would have an impact on accuracy as terms that appear in less than a 
> certain limit of documents, have low or no importance in TFIDF vectors.






[jira] [Commented] (SPARK-3614) Filter on minimum occurrences of a term in IDF

2014-09-21 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142631#comment-14142631
 ] 

RJ Nowling commented on SPARK-3614:
---

I would like to work on this.

> Filter on minimum occurrences of a term in IDF 
> ---
>
> Key: SPARK-3614
> URL: https://issues.apache.org/jira/browse/SPARK-3614
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Jatinpreet Singh
>Priority: Minor
>  Labels: TFIDF
>
> The IDF class in MLlib does not provide the capability of defining a minimum 
> number of documents a term should appear in the corpus. The idea is to have a 
> cutoff variable which defines this minimum occurrence value, and the terms 
> which have lower frequency are ignored.
> Mathematically,
> IDF(t,D)=log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >=minimumOccurance
> where, 
> D is the total number of documents in the corpus
> DF(t,D) is the number of documents that contain the term t
> minimumOccurance is the minimum number of documents the term appears in the 
> document corpus
> This would have an impact on accuracy as terms that appear in less than a 
> certain limit of documents, have low or no importance in TFIDF vectors.






[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-09-16 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136671#comment-14136671
 ] 

RJ Nowling commented on SPARK-2429:
---

Great!  I look forward to seeing your implementation.  :)







> Hierarchical Implementation of KMeans
> -
>
> Key: SPARK-2429
> URL: https://issues.apache.org/jira/browse/SPARK-2429
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Priority: Minor
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib.  Clustering algorithms are useful for determining 
> relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean 
> such as negative dot or cosine are necessary.






[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib

2014-09-15 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134301#comment-14134301
 ] 

RJ Nowling commented on SPARK-2308:
---

I'm not a committer but [~mengxr] is.  That said, I'm very happy to help in any 
way I can.

The issue of different distance metrics has come up on the mailing list -- a 
much-requested feature.  If you provide it as a PR, maybe others who are more 
familiar with the work to add additional distance metrics can comment, and we, 
as a community, can move forward to get it included.

> Add KMeans MiniBatch clustering algorithm to MLlib
> --
>
> Key: SPARK-2308
> URL: https://issues.apache.org/jira/browse/SPARK-2308
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Assignee: RJ Nowling
>Priority: Minor
> Attachments: many_small_centers.pdf, uneven_centers.pdf
>
>
> Mini-batch is a version of KMeans that uses a randomly-sampled subset of the 
> data points in each iteration instead of the full set of data points, 
> improving performance (and in some cases, accuracy).  The mini-batch version 
> is compatible with the KMeans|| initialization algorithm currently 
> implemented in MLlib.
> I suggest adding KMeans Mini-batch as an alternative.
> I'd like this to be assigned to me.






[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib

2014-09-15 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134264#comment-14134264
 ] 

RJ Nowling commented on SPARK-2308:
---

It is true that we will save on the distance calculations for high-dimensional 
data sets.  There is also work under way to improve sampling in Spark, so this 
will benefit further from that as well.

Are you planning on creating a PR for your implementation?  It would be 
valuable for the community.  I closed mine due to the sampling issues.  But I'd 
be happy to review and test yours.

> Add KMeans MiniBatch clustering algorithm to MLlib
> --
>
> Key: SPARK-2308
> URL: https://issues.apache.org/jira/browse/SPARK-2308
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Assignee: RJ Nowling
>Priority: Minor
> Attachments: many_small_centers.pdf, uneven_centers.pdf
>
>
> Mini-batch is a version of KMeans that uses a randomly-sampled subset of the 
> data points in each iteration instead of the full set of data points, 
> improving performance (and in some cases, accuracy).  The mini-batch version 
> is compatible with the KMeans|| initialization algorithm currently 
> implemented in MLlib.
> I suggest adding KMeans Mini-batch as an alternative.
> I'd like this to be assigned to me.






[jira] [Commented] (SPARK-3250) More Efficient Sampling

2014-09-11 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130880#comment-14130880
 ] 

RJ Nowling commented on SPARK-3250:
---

Great work!

If these performance improvements hold up when implemented in Spark, this could 
offer minibatch methods a fighting chance.  In particular, we would only need 
to count the elements once, and then we'd have faster sampling for the 
subsequent iterations, especially if the underlying data structures can be 
coerced into Arrays when we do the copy.
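
To sketch the O(k) idea once a partition is backed by an Array (an illustrative 
helper, sampling with replacement for simplicity):

{code}
import scala.util.Random

// Draw k points by random index instead of scanning all n elements --
// this is what makes repeated mini-batch rounds cheap.
def sampleK[T](data: Array[T], k: Int, rng: Random): IndexedSeq[T] =
  IndexedSeq.fill(k)(data(rng.nextInt(data.length)))
{code}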

> More Efficient Sampling
> ---
>
> Key: SPARK-3250
> URL: https://issues.apache.org/jira/browse/SPARK-3250
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: RJ Nowling
>
> Sampling, as currently implemented in Spark, is an O\(n\) operation.  A 
> number of stochastic algorithms achieve speed ups by exploiting O\(k\) 
> sampling, where k is the number of data points to sample.  Examples of such 
> algorithms include KMeans MiniBatch (SPARK-2308) and Stochastic Gradient 
> Descent with mini batching.
> More efficient sampling may be achievable by packing partitions with an 
> ArrayBuffer or other data structure supporting random access.  Since many of 
> these stochastic algorithms perform repeated rounds of sampling, it may be 
> feasible to perform a transformation to change the backing data structure 
> followed by multiple rounds of sampling.






[jira] [Commented] (SPARK-2966) Add an approximation algorithm for hierarchical clustering to MLlib

2014-09-05 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14123077#comment-14123077
 ] 

RJ Nowling commented on SPARK-2966:
---

Wonderful!

If I can help or when you're ready for reviews, let me know!

> Add an approximation algorithm for hierarchical clustering to MLlib
> ---
>
> Key: SPARK-2966
> URL: https://issues.apache.org/jira/browse/SPARK-2966
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Yu Ishikawa
>Priority: Minor
>
> A hierarchical clustering algorithm is a useful unsupervised learning method.
> Koga et al. proposed a highly scalable hierarchical clustering algorithm in 
> (1).
> I would like to implement this method.
> I suggest adding an approximate hierarchical clustering algorithm to MLlib.
> I'd like this to be assigned to me.
> h3. Reference
> # Fast agglomerative hierarchical clustering algorithm using 
> Locality-Sensitive Hashing
> http://dl.acm.org/citation.cfm?id=1266811






[jira] [Commented] (SPARK-2430) Standarized Clustering Algorithm API and Framework

2014-09-04 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122273#comment-14122273
 ] 

RJ Nowling commented on SPARK-2430:
---

Hi Yu,

The community has suggested looking into scikit-learn's API, so that is a good 
idea.

I am hesitant to make backwards-incompatible API changes, however, until we 
know the new API will be stable for a long time.  I think it would be best to 
implement a few more clustering algorithms to get a clear idea of what is 
similar vs different before making a new API.  May I suggest you work on 
SPARK-2966 / SPARK-2429 first?

RJ

> Standarized Clustering Algorithm API and Framework
> --
>
> Key: SPARK-2430
> URL: https://issues.apache.org/jira/browse/SPARK-2430
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Priority: Minor
>
> Recently, there has been a chorus of voices on the mailing lists about adding 
> new clustering algorithms to MLlib.  To support these additions, we should 
> develop a common framework and API to reduce code duplication and keep the 
> APIs consistent.
> At the same time, we can also expand the current API to incorporate requested 
> features such as arbitrary distance metrics or pre-computed distance matrices.






[jira] [Commented] (SPARK-2966) Add an approximation algorithm for hierarchical clustering to MLlib

2014-09-04 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122266#comment-14122266
 ] 

RJ Nowling commented on SPARK-2966:
---

No worries.

Based on my reading of the Spark contribution guidelines ( 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark ), I 
think that the Spark community would prefer to have one good implementation of 
an algorithm instead of multiple similar algorithms.

Since the community has stated a clear preference for divisive hierarchical 
clustering, I think that is a better aim.  You seem very motivated and have 
made some good contributions -- would you like to take the lead on the 
hierarchical clustering?  I can review your code to help you improve it.

That said, I suggest you look at the comment I added to SPARK-2429 and see what 
you think of that approach.  If you like the example code and papers, why don't 
you work on implementing it efficiently in Spark?

> Add an approximation algorithm for hierarchical clustering to MLlib
> ---
>
> Key: SPARK-2966
> URL: https://issues.apache.org/jira/browse/SPARK-2966
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Yu Ishikawa
>Priority: Minor
>
> A hierarchical clustering algorithm is a useful unsupervised learning method.
> Koga et al. proposed a highly scalable hierarchical clustering algorithm in 
> (1).
> I would like to implement this method.
> I suggest adding an approximate hierarchical clustering algorithm to MLlib.
> I'd like this to be assigned to me.
> h3. Reference
> # Fast agglomerative hierarchical clustering algorithm using 
> Locality-Sensitive Hashing
> http://dl.acm.org/citation.cfm?id=1266811






[jira] [Commented] (SPARK-3384) Potential thread unsafe Breeze vector addition in KMeans

2014-09-03 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14120294#comment-14120294
 ] 

RJ Nowling commented on SPARK-3384:
---

Xiangrui Meng

I'll try to get a code example together in the next couple days.

Even if Spark itself is thread safe, I would reiterate that it is easy to make 
the mistake of using += in the wrong place.  I suggest that we frown upon that 
behavior, document it when we do use it, and maybe even add checks for the 
presence of += with Breeze vectors in the tests so we can flag it.
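
To spell out the distinction (a toy illustration, not the KMeans code itself):

{code}
import breeze.linalg.DenseVector

val acc = DenseVector.zeros[Double](3)
acc += DenseVector(1.0, 2.0, 3.0)            // mutates acc in place --
                                             // racy if acc is shared
val safe = acc + DenseVector(1.0, 2.0, 3.0)  // allocates a fresh vector,
                                             // leaving acc untouched
{code}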

> Potential thread unsafe Breeze vector addition in KMeans
> 
>
> Key: SPARK-3384
> URL: https://issues.apache.org/jira/browse/SPARK-3384
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: RJ Nowling
>
> In the KMeans clustering implementation, the Breeze vectors are accumulated 
> using +=.  For example,
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L162
>  This is potentially a thread unsafe operation.  (This is what I observed in 
> local testing.)  I suggest changing the += to + -- a new object will be 
> allocated but it will be thread safe since it won't write to an old location 
> accessed by multiple threads.
> Further testing is required to reproduce and verify.






[jira] [Updated] (SPARK-3384) Potential thread unsafe Breeze vector addition in KMeans

2014-09-03 Thread RJ Nowling (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

RJ Nowling updated SPARK-3384:
--
Description: 
In the KMeans clustering implementation, the Breeze vectors are accumulated 
using +=.  For example,

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L162

 This is potentially a thread unsafe operation.  (This is what I observed in 
local testing.)  I suggest changing the += to + -- a new object will be 
allocated but it will be thread safe since it won't write to an old location 
accessed by multiple threads.

Further testing is required to reproduce and verify.

  was:
In the KMeans clustering implementation, the Breeze vectors are accumulated 
using +=: 

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L162

 This is potentially a thread unsafe operation.  (This is what I observed in 
local testing.)  I suggest changing the += to + -- a new object will be 
allocated but it will be thread safe since it won't write to an old location 
accessed by multiple threads.

Further testing is required to reproduce and verify.


> Potential thread unsafe Breeze vector addition in KMeans
> 
>
> Key: SPARK-3384
> URL: https://issues.apache.org/jira/browse/SPARK-3384
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: RJ Nowling
>
> In the KMeans clustering implementation, the Breeze vectors are accumulated 
> using +=.  For example,
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L162
>  This is potentially a thread-unsafe operation.  (This is what I observed in 
> local testing.)  I suggest changing += to + -- a new object will be allocated, 
> but the operation will be thread-safe since it won't write to an old location 
> accessed by multiple threads.
> Further testing is required to reproduce and verify.






[jira] [Created] (SPARK-3384) Potential thread unsafe Breeze vector addition in KMeans

2014-09-03 Thread RJ Nowling (JIRA)
RJ Nowling created SPARK-3384:
-

 Summary: Potential thread unsafe Breeze vector addition in KMeans
 Key: SPARK-3384
 URL: https://issues.apache.org/jira/browse/SPARK-3384
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: RJ Nowling


In the KMeans clustering implementation, the Breeze vectors are accumulated 
using +=: 

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L162

 This is potentially a thread-unsafe operation.  (This is what I observed in 
local testing.)  I suggest changing += to + -- a new object will be allocated, 
but the operation will be thread-safe since it won't write to an old location 
accessed by multiple threads.

Further testing is required to reproduce and verify.






[jira] [Commented] (SPARK-3250) More Efficient Sampling

2014-08-29 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14115213#comment-14115213
 ] 

RJ Nowling commented on SPARK-3250:
---

Very clever!  Once it's verified to sample correctly, it would make a nice 
incremental improvement to the current sampling in Spark.

Can you try different data sizes?  It shouldn't change the O(n) scaling but it 
would be good to verify.
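
As a rough sketch of the packing idea described in this ticket (see the 
description below) -- illustrative names only, sampling with replacement, and 
not the patch under discussion:

{code}
import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag
import scala.util.Random
import org.apache.spark.rdd.RDD

// Repack each partition into a random-access buffer once, so that each
// subsequent round can draw k points by index in O(k) instead of
// scanning all n elements.
def packForSampling[T: ClassTag](data: RDD[T]): RDD[ArrayBuffer[T]] =
  data.mapPartitions { it =>
    val buf = new ArrayBuffer[T]()
    buf ++= it
    Iterator.single(buf)
  }.cache()

// Draw kPerPartition points (with replacement) from every partition.
def sampleRound[T: ClassTag](packed: RDD[ArrayBuffer[T]],
                             kPerPartition: Int,
                             seed: Long): RDD[T] =
  packed.mapPartitionsWithIndex { (pid, it) =>
    val rng = new Random(seed ^ pid)
    it.flatMap { buf =>
      if (buf.isEmpty) Iterator.empty
      else Iterator.fill(kPerPartition)(buf(rng.nextInt(buf.length)))
    }
  }
{code}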

> More Efficient Sampling
> ---
>
> Key: SPARK-3250
> URL: https://issues.apache.org/jira/browse/SPARK-3250
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: RJ Nowling
>
> Sampling, as currently implemented in Spark, is an O\(n\) operation.  A 
> number of stochastic algorithms achieve speedups by exploiting O\(k\) 
> sampling, where k is the number of data points to sample.  Examples of such 
> algorithms include KMeans MiniBatch (SPARK-2308) and Stochastic Gradient 
> Descent with mini batching.
> More efficient sampling may be achievable by packing partitions with an 
> ArrayBuffer or other data structure supporting random access.  Since many of 
> these stochastic algorithms perform repeated rounds of sampling, it may be 
> feasible to perform a transformation to change the backing data structure 
> followed by multiple rounds of sampling.






[jira] [Created] (SPARK-3263) PR #720 broke GraphGenerator.logNormal

2014-08-27 Thread RJ Nowling (JIRA)
RJ Nowling created SPARK-3263:
-

 Summary: PR #720 broke GraphGenerator.logNormal
 Key: SPARK-3263
 URL: https://issues.apache.org/jira/browse/SPARK-3263
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: RJ Nowling


PR #720 made multiple changes to GraphGenerator.logNormalGraph, including:

* Replacing the calls to the functions that generate random vertices and edges 
with in-line implementations that use different equations
* Hard-coding the RNG seeds, so the method now generates the same graph for a 
given number of vertices, edges, mu, and sigma -- the user cannot override the 
seed or request a randomly generated one
* Making a backwards-incompatible change to the logNormalGraph signature by 
introducing a new required parameter
* Failing to update the Scala docs and programming guide for the API changes

I also see that PR #720 added a Synthetic Benchmark in the examples.

Based on reading the Pregel paper, I believe the in-line functions are 
incorrect.  I propose:

* Removing the in-line implementations and restoring the original function calls
* Adding a seed for deterministic behavior (when desired)
* Keeping the number-of-partitions parameter
* Updating the synthetic benchmark example






[jira] [Commented] (SPARK-2966) Add an approximation algorithm for hierarchical clustering to MLlib

2014-08-27 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14112223#comment-14112223
 ] 

RJ Nowling commented on SPARK-2966:
---

This is a duplicate of SPARK-2429.  Please see the comments on that JIRA and 
the Spark dev list archives for the community discussion on preferred 
approaches (divisive rather than agglomerative clustering).



> Add an approximation algorithm for hierarchical clustering to MLlib
> ---
>
> Key: SPARK-2966
> URL: https://issues.apache.org/jira/browse/SPARK-2966
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Yu Ishikawa
>Priority: Minor
>
> A hierarchical clustering algorithm is a useful unsupervised learning method.
> Koga et al. proposed a highly scalable hierarchical clustering algorithm in 
> (1).
> I would like to implement this method.
> I suggest adding an approximate hierarchical clustering algorithm to MLlib.
> I'd like this to be assigned to me.
> h3. Reference
> # Fast agglomerative hierarchical clustering algorithm using 
> Locality-Sensitive Hashing
> http://dl.acm.org/citation.cfm?id=1266811






[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-08-27 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1411#comment-1411
 ] 

RJ Nowling commented on SPARK-2429:
---

Discussion on the dev list mentioned a community preference for implementing 
KMeans recursively (a divisive approach).  Jeremy Freeman provided an example 
here:

https://gist.github.com/freeman-lab/5947e7c53b368fe90371

The example needs to be optimized but provides a good starting point.  For 
example, every time KMeans is called, the data is converted to Breeze vectors.

Here are two papers on divisive KMeans:

A combined K-means and hierarchical clustering method for improving the 
clustering efficiency of microarray (2005) by Chen, et al.

Divisive Hierarchical K-Means (2006) by Lamrous, et al.
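
As a rough illustration of the divisive approach -- a hypothetical sketch 
built on the existing MLlib KMeans, not Jeremy's gist or a final design -- 
recursive bisection could look like this:

{code}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Recursively split the data into two clusters until the requested
// depth is reached. Filtering by predicted cluster re-scans the data
// on every split; a real implementation would cache or tag points
// instead of filtering repeatedly.
def bisect(data: RDD[Vector], depth: Int): Seq[RDD[Vector]] = {
  if (depth == 0 || data.count() < 2) {
    Seq(data)
  } else {
    val model = KMeans.train(data, k = 2, maxIterations = 20)
    (0 until 2).flatMap { c =>
      bisect(data.filter(p => model.predict(p) == c), depth - 1)
    }
  }
}
{code}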


> Hierarchical Implementation of KMeans
> -
>
> Key: SPARK-2429
> URL: https://issues.apache.org/jira/browse/SPARK-2429
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Priority: Minor
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib.  Hierarchical clustering is useful for determining 
> relationships between clusters, as well as offering faster point assignment. 
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean, 
> such as negative dot product or cosine distance, is necessary.






[jira] [Updated] (SPARK-3250) More Efficient Sampling

2014-08-27 Thread RJ Nowling (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

RJ Nowling updated SPARK-3250:
--

Description: 
Sampling, as currently implemented in Spark, is an O\(n\) operation.  A number 
of stochastic algorithms achieve speedups by exploiting O\(k\) sampling, where 
k is the number of data points to sample.  Examples of such algorithms include 
KMeans MiniBatch (SPARK-2308) and Stochastic Gradient Descent with mini 
batching.

More efficient sampling may be achievable by packing partitions with an 
ArrayBuffer or other data structure supporting random access.  Since many of 
these stochastic algorithms perform repeated rounds of sampling, it may be 
feasible to perform a transformation to change the backing data structure 
followed by multiple rounds of sampling.

  was:
Sampling, as currently implemented in Spark, is an O(n) operation.  A number of 
stochastic algorithms achieve speedups by exploiting O(k) sampling, where k is 
the number of data points to sample.  Examples of such algorithms include 
KMeans MiniBatch (SPARK-2308) and Stochastic Gradient Descent with mini 
batching.

More efficient sampling may be achievable by packing partitions with an 
ArrayBuffer or other data structure supporting random access.  Since many of 
these stochastic algorithms perform repeated rounds of sampling, it may be 
feasible to perform a transformation to change the backing data structure 
followed by multiple rounds of sampling.


> More Efficient Sampling
> ---
>
> Key: SPARK-3250
> URL: https://issues.apache.org/jira/browse/SPARK-3250
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: RJ Nowling
>
> Sampling, as currently implemented in Spark, is an O\(n\) operation.  A 
> number of stochastic algorithms achieve speedups by exploiting O\(k\) 
> sampling, where k is the number of data points to sample.  Examples of such 
> algorithms include KMeans MiniBatch (SPARK-2308) and Stochastic Gradient 
> Descent with mini batching.
> More efficient sampling may be achievable by packing partitions with an 
> ArrayBuffer or other data structure supporting random access.  Since many of 
> these stochastic algorithms perform repeated rounds of sampling, it may be 
> feasible to perform a transformation to change the backing data structure 
> followed by multiple rounds of sampling.






[jira] [Updated] (SPARK-3250) More Efficient Sampling

2014-08-27 Thread RJ Nowling (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

RJ Nowling updated SPARK-3250:
--

Description: 
Sampling, as currently implemented in Spark, is an O(n) operation.  A number of 
stochastic algorithms achieve speedups by exploiting O(k) sampling, where k is 
the number of data points to sample.  Examples of such algorithms include 
KMeans MiniBatch (SPARK-2308) and Stochastic Gradient Descent with mini 
batching.

More efficient sampling may be achievable by packing partitions with an 
ArrayBuffer or other data structure supporting random access.  Since many of 
these stochastic algorithms perform repeated rounds of sampling, it may be 
feasible to perform a transformation to change the backing data structure 
followed by multiple rounds of sampling.

  was:
Sampling, as currently implemented in Spark, is an O(n) operation.  A number of 
stochastic algorithms achieve speedups by exploiting O(k) sampling, where k is 
the number of data points to sample.  Examples of such algorithms include 
KMeans MiniBatch and Stochastic Gradient Descent with mini batching.

More efficient sampling may be achievable by packing partitions with an 
ArrayBuffer or other data structure supporting random access.  Since many of 
these stochastic algorithms perform repeated rounds of sampling, it may be 
feasible to perform a transformation to change the backing data structure 
followed by multiple rounds of sampling.


> More Efficient Sampling
> ---
>
> Key: SPARK-3250
> URL: https://issues.apache.org/jira/browse/SPARK-3250
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: RJ Nowling
>
> Sampling, as currently implemented in Spark, is an O(n) operation.  A number 
> of stochastic algorithms achieve speedups by exploiting O(k) sampling, where 
> k is the number of data points to sample.  Examples of such algorithms 
> include KMeans MiniBatch (SPARK-2308) and Stochastic Gradient Descent with 
> mini batching.
> More efficient sampling may be achievable by packing partitions with an 
> ArrayBuffer or other data structure supporting random access.  Since many of 
> these stochastic algorithms perform repeated rounds of sampling, it may be 
> feasible to perform a transformation to change the backing data structure 
> followed by multiple rounds of sampling.






[jira] [Created] (SPARK-3250) More Efficient Sampling

2014-08-27 Thread RJ Nowling (JIRA)
RJ Nowling created SPARK-3250:
-

 Summary: More Efficient Sampling
 Key: SPARK-3250
 URL: https://issues.apache.org/jira/browse/SPARK-3250
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: RJ Nowling


Sampling, as currently implemented in Spark, is an O(n) operation.  A number of 
stochastic algorithms achieve speedups by exploiting O(k) sampling, where k is 
the number of data points to sample.  Examples of such algorithms include 
KMeans MiniBatch and Stochastic Gradient Descent with mini batching.

More efficient sampling may be achievable by packing partitions with an 
ArrayBuffer or other data structure supporting random access.  Since many of 
these stochastic algorithms perform repeated rounds of sampling, it may be 
feasible to perform a transformation to change the backing data structure 
followed by multiple rounds of sampling.






[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib

2014-08-26 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14111224#comment-14111224
 ] 

RJ Nowling commented on SPARK-2308:
---

Xiangrui,

I realized that sampling in Spark is O(n), where n is the number of elements in 
the data set. To get a performance advantage from MiniBatch KMeans, we need a 
sampling method that provides O(k) time, where k is the number of points to 
sample.

I don't see any obvious way to implement a more efficient sampling method.  If 
you concur, I'll create a separate JIRA to document the need and close this one.

Thanks.

> Add KMeans MiniBatch clustering algorithm to MLlib
> --
>
> Key: SPARK-2308
> URL: https://issues.apache.org/jira/browse/SPARK-2308
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Assignee: RJ Nowling
>Priority: Minor
> Attachments: many_small_centers.pdf, uneven_centers.pdf
>
>
> Mini-batch is a version of KMeans that uses a randomly-sampled subset of the 
> data points in each iteration instead of the full set of data points, 
> improving performance (and in some cases, accuracy).  The mini-batch version 
> is compatible with the KMeans|| initialization algorithm currently 
> implemented in MLlib.
> I suggest adding KMeans Mini-batch as an alternative.
> I'd like this to be assigned to me.






[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib

2014-07-30 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14079260#comment-14079260
 ] 

RJ Nowling commented on SPARK-2308:
---

Thanks for the clarification. :)  I'll run the additional tests to try to 
answer those questions.

I'll also try implementing MiniBatch KMeans as a flag on the current KMeans 
implementation -- that would be a nicer API.

> Add KMeans MiniBatch clustering algorithm to MLlib
> --
>
> Key: SPARK-2308
> URL: https://issues.apache.org/jira/browse/SPARK-2308
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Assignee: RJ Nowling
>Priority: Minor
> Attachments: many_small_centers.pdf, uneven_centers.pdf
>
>
> Mini-batch is a version of KMeans that uses a randomly-sampled subset of the 
> data points in each iteration instead of the full set of data points, 
> improving performance (and in some cases, accuracy).  The mini-batch version 
> is compatible with the KMeans|| initialization algorithm currently 
> implemented in MLlib.
> I suggest adding KMeans Mini-batch as an alternative.
> I'd like this to be assigned to me.





[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib

2014-07-29 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078078#comment-14078078
 ] 

RJ Nowling commented on SPARK-2308:
---

I did all of my tests with scikit-learn, per your suggestion.  Scikit-learn 
uses k-means++, not k-means||.  I should have made that clear.

I'm not clear on what you're looking for.

I have a few observations at this point:
1. KMeans seems to be very sensitive to initialization -- cluster positions 
don't seem to change significantly after initialization.
2. Initialization seems to matter more than whether you use KMeans or KMeans 
MiniBatch -- given the same initialization, they tend to do equally well.
3. Random and k-means++ / k-means|| initialization methods seem sensitive to 
variations in cluster sizes.

I'm happy to run more tests if you think they will be useful, but at this 
point I feel the behavior we're seeing is expected.  Hierarchical KMeans 
or methods such as KCenters, which guarantee that the space is partitioned 
equally (regardless of cluster density), may be useful for cases where KMeans 
doesn't perform as desired.


> Add KMeans MiniBatch clustering algorithm to MLlib
> --
>
> Key: SPARK-2308
> URL: https://issues.apache.org/jira/browse/SPARK-2308
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Priority: Minor
> Attachments: many_small_centers.pdf, uneven_centers.pdf
>
>
> Mini-batch is a version of KMeans that uses a randomly-sampled subset of the 
> data points in each iteration instead of the full set of data points, 
> improving performance (and in some cases, accuracy).  The mini-batch version 
> is compatible with the KMeans|| initialization algorithm currently 
> implemented in MLlib.
> I suggest adding KMeans Mini-batch as an alternative.
> I'd like this to be assigned to me.





[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib

2014-07-16 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063601#comment-14063601
 ] 

RJ Nowling commented on SPARK-2308:
---

I tested KMeans vs. MiniBatch KMeans under two scenarios:

* 4 centers of 1000, 100, 10, and 1 data points.
* 100 centers with 10 points each

The proposed centers were generated along a grid.  The data points were 
generated by adding samples from N(0, 1.0) in each dimension to the centers. I 
found the expected centers by averaging the points generated from each proposed 
center.

I ran KMeans and MiniBatch KMeans for each set of data points with 30 
iterations and k-means++ initialization.

I plotted the expected centers (blue), KMeans centers (red), and MiniBatch 
centers (green).  The two methods showed similar results.  They both struggled 
with the small clusters and ended up finding two centers for the large cluster 
while ignoring the single data point.  For the 100 even clusters, both methods 
got most of the centers reasonably correct and, in a few cases, placed two 
centers where there should be one.

I've attached the plots (many_small_centers.pdf, uneven_centers.pdf).

In reviewing the scikit-learn implementation, I saw that they handle small 
clusters as a special case: one of the points in the cluster is randomly chosen 
as the center instead of computing the center as a running average of the 
sampled points.
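
For reference, a minimal sketch of the grid-plus-noise setup described above 
(the grid layout, spacing, and counts here are illustrative):

{code}
import scala.util.Random

val rng = new Random(42)

// 100 proposed centers on a 10x10 grid, spaced widely relative to the
// N(0, 1) noise.
val centers = for (i <- 0 until 10; j <- 0 until 10)
  yield Array(i * 10.0, j * 10.0)

// 10 points per center: add N(0, 1) noise in each dimension.
val points = centers.flatMap { c =>
  Seq.fill(10)(c.map(_ + rng.nextGaussian()))
}

// The expected center of each cluster is the mean of its points.
val expected = centers.indices.map { idx =>
  val cluster = points.slice(idx * 10, idx * 10 + 10)
  cluster.transpose.map(dim => dim.sum / dim.length)
}
{code}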


> Add KMeans MiniBatch clustering algorithm to MLlib
> --
>
> Key: SPARK-2308
> URL: https://issues.apache.org/jira/browse/SPARK-2308
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Priority: Minor
> Attachments: many_small_centers.pdf, uneven_centers.pdf
>
>
> Mini-batch is a version of KMeans that uses a randomly-sampled subset of the 
> data points in each iteration instead of the full set of data points, 
> improving performance (and in some cases, accuracy).  The mini-batch version 
> is compatible with the KMeans|| initialization algorithm currently 
> implemented in MLlib.
> I suggest adding KMeans Mini-batch as an alternative.
> I'd like this to be assigned to me.





[jira] [Updated] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib

2014-07-16 Thread RJ Nowling (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

RJ Nowling updated SPARK-2308:
--

Attachment: uneven_centers.pdf
many_small_centers.pdf

> Add KMeans MiniBatch clustering algorithm to MLlib
> --
>
> Key: SPARK-2308
> URL: https://issues.apache.org/jira/browse/SPARK-2308
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Priority: Minor
> Attachments: many_small_centers.pdf, uneven_centers.pdf
>
>
> Mini-batch is a version of KMeans that uses a randomly-sampled subset of the 
> data points in each iteration instead of the full set of data points, 
> improving performance (and in some cases, accuracy).  The mini-batch version 
> is compatible with the KMeans|| initialization algorithm currently 
> implemented in MLlib.
> I suggest adding KMeans Mini-batch as an alternative.
> I'd like this to be assigned to me.





[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib

2014-07-10 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057513#comment-14057513
 ] 

RJ Nowling commented on SPARK-2308:
---

That sounds like a good idea for a test.  I'll report back.

> Add KMeans MiniBatch clustering algorithm to MLlib
> --
>
> Key: SPARK-2308
> URL: https://issues.apache.org/jira/browse/SPARK-2308
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Priority: Minor
>
> Mini-batch is a version of KMeans that uses a randomly-sampled subset of the 
> data points in each iteration instead of the full set of data points, 
> improving performance (and in some cases, accuracy).  The mini-batch version 
> is compatible with the KMeans|| initialization algorithm currently 
> implemented in MLlib.
> I suggest adding KMeans Mini-batch as an alternative.
> I'd like this to be assigned to me.





[jira] [Created] (SPARK-2430) Standardized Clustering Algorithm API and Framework

2014-07-10 Thread RJ Nowling (JIRA)
RJ Nowling created SPARK-2430:
-

 Summary: Standardized Clustering Algorithm API and Framework
 Key: SPARK-2430
 URL: https://issues.apache.org/jira/browse/SPARK-2430
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Priority: Minor


Recently, there has been a chorus of voices on the mailing lists about adding 
new clustering algorithms to MLlib.  To support these additions, we should 
develop a common framework and API to reduce code duplication and keep the APIs 
consistent.

At the same time, we can also expand the current API to incorporate requested 
features such as arbitrary distance metrics or pre-computed distance matrices.
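
A purely hypothetical sketch of what such a shared API could look like (all 
names are illustrative, not a proposed design):

{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// A pluggable distance metric instead of hard-coded Euclidean distance.
trait DistanceMetric extends Serializable {
  def apply(a: Vector, b: Vector): Double
}

// A common interface that concrete algorithms (KMeans, MiniBatch
// KMeans, hierarchical variants) could implement.
trait ClusteringAlgorithm[Model] {
  def metric: DistanceMetric
  def run(data: RDD[Vector]): Model
}
{code}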





[jira] [Created] (SPARK-2429) Hierarchical Implementation of KMeans

2014-07-10 Thread RJ Nowling (JIRA)
RJ Nowling created SPARK-2429:
-

 Summary: Hierarchical Implementation of KMeans
 Key: SPARK-2429
 URL: https://issues.apache.org/jira/browse/SPARK-2429
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Priority: Minor


Hierarchical clustering algorithms are widely used and would make a nice 
addition to MLlib.  Hierarchical clustering is useful for determining 
relationships between clusters, as well as offering faster point assignment. 
Discussion on the dev list suggested the following possible approaches:

* Top down, recursive application of KMeans
* Reuse DecisionTree implementation with different objective function
* Hierarchical SVD

It was also suggested that support for distance metrics other than Euclidean, 
such as negative dot product or cosine distance, is necessary.





[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib

2014-07-04 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052558#comment-14052558
 ] 

RJ Nowling commented on SPARK-2308:
---

Hi Xiangrui,

Here's the paper:
http://www.ra.ethz.ch/CDstore/www2010/www/p1177.pdf

This discussion in the scikit-learn documentation could also be useful:
http://scikit-learn.org/stable/modules/clustering.html

I agree that smaller clusters will be at a disadvantage with uniform sampling.  
I imagine one could weight the points inversely by cluster size or the like.  
However, the challenge would be to do it in a way that doesn't require touching 
all of the data points.  The MiniBatch approach samples only batchSize data 
points in each iteration, and only those points are used to update their 
respective centers.  You would have to reassign all the data points to the 
updated cluster centers in each iteration to keep the weights from quickly 
becoming inaccurate, which would defeat one of the main optimizations of the 
method.
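
For reference, here is a minimal single-machine sketch of the per-iteration 
update (after the Sculley paper linked above); the names and structure are 
illustrative, not proposed MLlib code:

{code}
import breeze.linalg.DenseVector
import scala.util.Random

// One mini-batch round: each sampled point pulls its nearest center
// toward it with per-center learning rate 1 / counts(c), so centers
// that have seen more points move less. Only the sampled points are
// touched; nothing is reassigned globally.
def miniBatchStep(data: IndexedSeq[DenseVector[Double]],
                  centers: Array[DenseVector[Double]],
                  counts: Array[Long],
                  batchSize: Int,
                  rng: Random): Unit = {
  require(data.nonEmpty && centers.nonEmpty)
  def nearest(x: DenseVector[Double]): Int =
    centers.indices.minBy { c => val d = x - centers(c); d dot d }
  // Assign the whole batch first, so assignments use the centers as of
  // the start of the round.
  val assigned = IndexedSeq.fill(batchSize) {
    val x = data(rng.nextInt(data.length))
    (x, nearest(x))
  }
  assigned.foreach { case (x, c) =>
    counts(c) += 1
    val eta = 1.0 / counts(c)
    centers(c) := centers(c) * (1.0 - eta) + x * eta
  }
}
{code}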

Do you have any suggestions on how to achieve the weighting in a way that would 
maintain the properties necessary for convergence and keep the efficiency 
advantages?

Thanks!

> Add KMeans MiniBatch clustering algorithm to MLlib
> --
>
> Key: SPARK-2308
> URL: https://issues.apache.org/jira/browse/SPARK-2308
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Priority: Minor
>
> Mini-batch is a version of KMeans that uses a randomly-sampled subset of the 
> data points in each iteration instead of the full set of data points, 
> improving performance (and in some cases, accuracy).  The mini-batch version 
> is compatible with the KMeans|| initialization algorithm currently 
> implemented in MLlib.
> I suggest adding KMeans Mini-batch as an alternative.
> I'd like this to be assigned to me.





[jira] [Created] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib

2014-06-27 Thread RJ Nowling (JIRA)
RJ Nowling created SPARK-2308:
-

 Summary: Add KMeans MiniBatch clustering algorithm to MLlib
 Key: SPARK-2308
 URL: https://issues.apache.org/jira/browse/SPARK-2308
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Priority: Minor


Mini-batch is a version of KMeans that uses a randomly-sampled subset of the 
data points in each iteration instead of the full set of data points, improving 
performance (and in some cases, accuracy).  The mini-batch version is 
compatible with the KMeans|| initialization algorithm currently implemented in 
MLlib.

I suggest adding KMeans Mini-batch as an alternative.

I'd like this to be assigned to me.




