[jira] [Comment Edited] (SPARK-14174) Accelerate KMeans via Mini-Batch EM
[ https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15876202#comment-15876202 ] RJ Nowling edited comment on SPARK-14174 at 2/21/17 4:08 PM:
-
I did the initial implementation for SPARK-2308.

re: the random sampling. With Spark's approach to random sampling, a Bernoulli trial is performed for each data point in the RDD. It's not as efficient as the case where random-access indexing is available. That said, if your vectors are quite long, then you save computational time on evaluating distances and such. Thus, when evaluating the performance, don't just look at the case of a large number of vectors -- look at the case of vectors with many elements.

> Accelerate KMeans via Mini-Batch EM
> ---
>
> Key: SPARK-14174
> URL: https://issues.apache.org/jira/browse/SPARK-14174
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: zhengruifeng
>
> MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function. Mini-batches are subsets of the input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of computation required to converge to a local solution. In contrast to other algorithms that reduce the convergence time of k-means, mini-batch k-means produces results that are generally only slightly worse than the standard algorithm.
> I have implemented mini-batch k-means in MLlib, and the acceleration is really significant.
> The MiniBatch KMeans is named XMeans in the following lines.
> {code}
> val path = "/tmp/mnist8m.scale"
> val data = MLUtils.loadLibSVMFile(sc, path)
> val vecs = data.map(_.features).persist()
> val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", seed=123L)
> km.computeCost(vecs)
> res0: Double = 3.317029898599564E8
> val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.1, seed=123L)
> xm.computeCost(vecs)
> res1: Double = 3.3169865959604424E8
> val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.01, seed=123L)
> xm2.computeCost(vecs)
> res2: Double = 3.317195831216454E8
> {code}
> All three training runs above reached the maximum of 10 iterations. We can see that the WSSSEs are almost the same, while their speeds differ significantly:
> {code}
> KMeans                             2876 sec
> MiniBatch KMeans (fraction=0.1)     263 sec
> MiniBatch KMeans (fraction=0.01)     90 sec
> {code}
> With an appropriate fraction, the larger the dataset, the greater the speedup. The data used above has 8,100,000 samples and 784 features. It can be downloaded here (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2)
> Comparison of K-Means and MiniBatchKMeans in sklearn: http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py
--
This message was sent by Atlassian JIRA (v6.3.15#6346)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
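The sampling-and-update scheme described in the ticket can be sketched outside Spark. Below is a minimal plain-Python stand-in (an assumption for illustration, not the proposed XMeans/MLlib implementation, which is Scala): each iteration draws a mini-batch via a Bernoulli trial per point, mirroring how Spark samples an RDD without random-access indexing, and then recomputes each center as the mean of the sampled points assigned to it, the mini-batch analogue of a Lloyd/EM step.

```python
import random

def mini_batch_kmeans(points, k, fraction, iters, seed=123):
    """Toy mini-batch k-means on 1-D points (illustrative only)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize from k distinct points
    for _ in range(iters):
        # Bernoulli trial per point, as Spark's RDD sampling does.
        batch = [p for p in points if rng.random() < fraction]
        sums, counts = [0.0] * k, [0] * k
        for p in batch:
            # Assign the point to its nearest center.
            j = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            sums[j] += p
            counts[j] += 1
        # Move each center to the mean of its sampled points; a center
        # with an empty assignment keeps its old position.
        centers = [sums[j] / counts[j] if counts[j] else centers[j]
                   for j in range(k)]
    return centers
```

Only a `fraction` of the distance evaluations of a full Lloyd iteration are performed per pass, which is where the reported speedups (263 s at fraction=0.1, 90 s at fraction=0.01, vs. 2876 s) come from.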
[jira] [Comment Edited] (SPARK-16365) Ideas for moving "mllib-local" forward
[ https://issues.apache.org/jira/browse/SPARK-16365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375928#comment-15375928 ] RJ Nowling edited comment on SPARK-16365 at 7/13/16 10:40 PM:
--
I'm really looking forward to this feature. Spark is great where model training is expensive and involves large data sets, but I want to be able to deploy those models as part of mobile or other applications without a dependency on Spark. It would be especially nice if there were implementations not only for the JVM but also for Python, Go, and other languages.

[~MechCoder], applying models often requires less computation than training them. So: use Spark to train, then have a local, non-distributed library for embedding models in other applications -- that's how I interpret the feature.

> Ideas for moving "mllib-local" forward
> --
>
> Key: SPARK-16365
> URL: https://issues.apache.org/jira/browse/SPARK-16365
> Project: Spark
> Issue Type: Brainstorming
> Components: ML
> Reporter: Nick Pentreath
>
> Since SPARK-13944 is all done, we should all think about what the "next steps" might be for {{mllib-local}}. E.g., it could be "improve Spark's linear algebra", or "investigate how we will implement local models/pipelines in Spark", etc.
> This ticket is for comments, ideas, brainstorming, and PoCs. The separation of linalg into a standalone project turned out to be significantly more complex than originally expected. So I vote we devote sufficient discussion and time to planning out the next move :)
[jira] [Commented] (SPARK-14174) Accelerate KMeans via Mini-Batch EM
[ https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15221806#comment-15221806 ] RJ Nowling commented on SPARK-14174:

This is a dupe of [SPARK-2308], but that one needs someone to take it over.
[jira] [Commented] (SPARK-12450) Un-persist broadcasted variables in KMeans
[ https://issues.apache.org/jira/browse/SPARK-12450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15066629#comment-15066629 ] RJ Nowling commented on SPARK-12450:

Filed a PR here: [https://github.com/apache/spark/pull/10415]

> Un-persist broadcasted variables in KMeans
> --
>
> Key: SPARK-12450
> URL: https://issues.apache.org/jira/browse/SPARK-12450
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.4.1, 1.5.2
> Reporter: RJ Nowling
> Priority: Minor
>
> The broadcasted centers in KMeans are never un-persisted. As a result, memory usage accumulates with use, causing a memory leak.
[jira] [Created] (SPARK-12450) Un-persist broadcasted variables in KMeans
RJ Nowling created SPARK-12450:
--
Summary: Un-persist broadcasted variables in KMeans
Key: SPARK-12450
URL: https://issues.apache.org/jira/browse/SPARK-12450
Project: Spark
Issue Type: Improvement
Components: MLlib
Affects Versions: 1.5.2, 1.4.1
Reporter: RJ Nowling
Priority: Minor

The broadcasted centers in KMeans are never un-persisted. As a result, memory usage accumulates with use, causing a memory leak.
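The fix pattern behind this ticket is to release each broadcast once the iteration that used it finishes, rather than letting stale per-iteration copies accumulate on the executors. Spark's real `Broadcast` objects expose an `unpersist()` method for this; the sketch below uses a mock stand-in (an assumption for illustration, not Spark's API) so the lifecycle is visible.

```python
class MockBroadcast:
    """Stands in for a Spark Broadcast: holds a value until unpersisted."""
    live = 0  # how many broadcasts are still holding memory

    def __init__(self, value):
        self.value = value
        MockBroadcast.live += 1

    def unpersist(self):
        self.value = None
        MockBroadcast.live -= 1

def run_iterations(centers, n_iters):
    """Iterative loop in the style of KMeans: broadcast the current
    centers, use them, then release them (the fix this ticket asks for)."""
    for _ in range(n_iters):
        bc = MockBroadcast(list(centers))    # broadcast this iteration's centers
        centers = [c + 1 for c in bc.value]  # ... workers read bc.value ...
        bc.unpersist()                       # the fix: free it after use
    return centers

final = run_iterations([0.0, 10.0], n_iters=5)
# All five broadcasts were released, so MockBroadcast.live is back to 0;
# without the unpersist() call it would grow by one per iteration.
```

Dropping the `unpersist()` line reproduces the leak described above: one retained copy of the centers per iteration.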
[jira] [Commented] (SPARK-4816) Maven profile netlib-lgpl does not work
[ https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056675#comment-15056675 ] RJ Nowling commented on SPARK-4816:
---
Agreed. Thanks!

> Maven profile netlib-lgpl does not work
> ---
>
> Key: SPARK-4816
> URL: https://issues.apache.org/jira/browse/SPARK-4816
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 1.1.0
> Environment: maven 3.0.5 / Ubuntu
> Reporter: Guillaume Pitel
> Assignee: Sean Owen
> Priority: Minor
> Fix For: 1.1.1, 1.4.2
>
> When doing what the documentation recommends to recompile Spark with the Netlib native system binding (i.e. to bind with OpenBLAS or, in my case, MKL):
> mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package
> The resulting assembly jar still lacked the netlib-system class. (I checked the content of spark-assembly...jar)
> When forcing the netlib-lgpl profile in the MLlib package to be active, the jar is correctly built.
> So I guess it's a problem with the way Maven passes profile activations to child modules.
> Also, despite the documentation claiming that if the job's jar contains netlib with the necessary bindings, it should work, it does not. The classloader must be unhappy with two occurrences of netlib?
[jira] [Commented] (SPARK-4816) Maven profile netlib-lgpl does not work
[ https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056630#comment-15056630 ] RJ Nowling commented on SPARK-4816: --- Tried with Maven 3.3.9. I see no issues with the newer version of Maven: {code} $ mvn -version Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-10T16:41:47+00:00) Maven home: /root/apache-maven-3.3.9 Java version: 1.7.0_85, vendor: Oracle Corporation Java home: /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.85-2.6.1.2.el7_1.x86_64/jre Default locale: en_US, platform encoding: UTF-8 OS name: "linux", version: "3.10.0-229.1.2.el7.x86_64", arch: "amd64", family: "unix" $ zipinfo -1 assembly/target/scala-2.10/spark-assembly-1.4.1-hadoop2.4.0.jar | grep netlib-native netlib-native_ref-osx-x86_64.jnilib netlib-native_ref-osx-x86_64.jnilib.asc netlib-native_ref-osx-x86_64.pom netlib-native_ref-osx-x86_64.pom.asc META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/ META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/pom.properties netlib-native_ref-linux-x86_64.so META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/ META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/pom.properties netlib-native_ref-linux-i686.so META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/ META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/pom.properties netlib-native_ref-win-x86_64.dll netlib-native_ref-win-x86_64.dll.asc netlib-native_ref-win-x86_64.pom netlib-native_ref-win-x86_64.pom.asc META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/ META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/pom.xml 
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/pom.properties netlib-native_ref-win-i686.dll netlib-native_ref-win-i686.dll.asc netlib-native_ref-win-i686.pom netlib-native_ref-win-i686.pom.asc META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/ META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/pom.properties netlib-native_ref-linux-armhf.so META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/ META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/pom.properties netlib-native_system-osx-x86_64.jnilib netlib-native_system-osx-x86_64.jnilib.asc netlib-native_system-osx-x86_64.pom netlib-native_system-osx-x86_64.pom.asc META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/ META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/pom.properties netlib-native_system-linux-x86_64.pom.asc netlib-native_system-linux-x86_64.pom netlib-native_system-linux-x86_64.so netlib-native_system-linux-x86_64.so.asc META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/ META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/pom.properties netlib-native_system-linux-i686.pom netlib-native_system-linux-i686.so.asc netlib-native_system-linux-i686.pom.asc netlib-native_system-linux-i686.so META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/ META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/pom.properties netlib-native_system-linux-armhf.pom netlib-native_system-linux-armhf.so.asc 
netlib-native_system-linux-armhf.pom.asc netlib-native_system-linux-armhf.so META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-armhf/ META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-armhf/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-armhf/pom.properties netlib-native_system-win-x86_64.dll netlib-native_system-win-x86_64.dll.asc netlib-native_system-win-x86_64.pom netlib-native_system-win-x86_64.pom.asc META-INF/maven/com.github.fommil.netlib/netlib-native_system-win-x86_64/ META-INF/maven/com.github.fommil.netlib/netlib-native_system-win-x86_64/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_system-win-x86_64/pom.properties netlib-native_system-win-i686.dll netlib-native_system-win-i686.dll.asc netlib-native_system-win-i686.pom netlib-native_system-win-i686.pom.asc META-INF/maven/com.github.fommil.netlib/netlib-native_system-win-i686/ META-INF/maven/com.github.fommil.netlib/netlib-native_system-wi
[jira] [Commented] (SPARK-4816) Maven profile netlib-lgpl does not work
[ https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056414#comment-15056414 ] RJ Nowling commented on SPARK-4816:
---
Happy to try Maven 3.3.x and report back. That would certainly confirm whether it's a Maven bug or a regression in behavior.
[jira] [Commented] (SPARK-4816) Maven profile netlib-lgpl does not work
[ https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056375#comment-15056375 ] RJ Nowling commented on SPARK-4816:
---
Also, what version of Maven are you running?
[jira] [Commented] (SPARK-4816) Maven profile netlib-lgpl does not work
[ https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056368#comment-15056368 ] RJ Nowling commented on SPARK-4816:
---
I want to push for two things: (a) some sort of documentation for users (e.g., release notes in the next releases) and (b) making sure it's fixed in the latest releases. I want users to be able to find documentation (like this JIRA) so they don't have to spend time tracking it down like I did.

Spark 1.4.2 hasn't been released yet and git has moved to a 1.4.3 SNAPSHOT. You mentioned adding the commit to the 1.5.x branch -- has this been done? Until 1.4.3 and a 1.5.x release are out with your change, this could still hit certain users, even if it's rare because it's tied to a specific Maven version or such.
[jira] [Comment Edited] (SPARK-4816) Maven profile netlib-lgpl does not work
[ https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056168#comment-15056168 ] RJ Nowling edited comment on SPARK-4816 at 12/14/15 4:16 PM: - I think [SPARK-9507] fixed the issue. I checked out git commit {{5ad9f950c4bd0042d79cdccb5277c10f8412be85}} (the commit before [https://github.com/apache/spark/commit/b53ca247d4a965002a9f31758ea2b28fe117d45f]) and found that the {{netlib-native}} libraries were missing: {code} $ git checkout 5ad9f950c4bd0042d79cdccb5277c10f8412be85 $ mvn -Pnetlib-lgpl -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package $ zipinfo -1 assembly/target/scala-2.10/spark-assembly-1.4.2-SNAPSHOT-hadoop2.4.0.jar | grep netlib-native (No output) {code} I then checked out {{b53ca247d4a965002a9f31758ea2b28fe117d45f}} and built it to test: {code} zipinfo -1 assembly/target/scala-2.10/spark-assembly-1.4.2-SNAPSHOT-hadoop2.4.0.jar | grep netlib-native netlib-native_ref-osx-x86_64.jnilib netlib-native_ref-osx-x86_64.jnilib.asc netlib-native_ref-osx-x86_64.pom netlib-native_ref-osx-x86_64.pom.asc META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/ META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/pom.properties netlib-native_ref-linux-x86_64.so META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/ META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/pom.properties netlib-native_ref-linux-i686.so META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/ META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/pom.properties netlib-native_ref-win-x86_64.dll netlib-native_ref-win-x86_64.dll.asc netlib-native_ref-win-x86_64.pom 
netlib-native_ref-win-x86_64.pom.asc META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/ META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/pom.properties netlib-native_ref-win-i686.dll netlib-native_ref-win-i686.dll.asc netlib-native_ref-win-i686.pom netlib-native_ref-win-i686.pom.asc META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/ META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/pom.properties netlib-native_ref-linux-armhf.so META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/ META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/pom.properties netlib-native_system-osx-x86_64.jnilib netlib-native_system-osx-x86_64.jnilib.asc netlib-native_system-osx-x86_64.pom netlib-native_system-osx-x86_64.pom.asc META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/ META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/pom.properties netlib-native_system-linux-x86_64.pom.asc netlib-native_system-linux-x86_64.pom netlib-native_system-linux-x86_64.so netlib-native_system-linux-x86_64.so.asc META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/ META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/pom.properties netlib-native_system-linux-i686.pom netlib-native_system-linux-i686.so.asc netlib-native_system-linux-i686.pom.asc netlib-native_system-linux-i686.so META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/ META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/pom.xml 
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/pom.properties netlib-native_system-linux-armhf.pom netlib-native_system-linux-armhf.so.asc netlib-native_system-linux-armhf.pom.asc netlib-native_system-linux-armhf.so META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-armhf/ META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-armhf/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-armhf/pom.properties netlib-native_system-win-x86_64.dll netlib-native_system-win-x86_64.dll.asc netlib-native_system-win-x86_64.pom netlib-native_system-win-x86_64.pom.asc META-INF/maven/com.github.fommil.netlib/netlib-native_system-win-x86_64/ META-INF/maven/com.github.fommil.netlib/netlib-native_system-win-x86_64/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_system-win-x86_64/pom.properties netlib-native_system-win-i686.dll netlib-native_system-win-i
[jira] [Comment Edited] (SPARK-4816) Maven profile netlib-lgpl does not work
[ https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056168#comment-15056168 ] RJ Nowling edited comment on SPARK-4816 at 12/14/15 4:19 PM: - I think [SPARK-9507] fixed the issue. I checked out git commit {{5ad9f950c4bd0042d79cdccb5277c10f8412be85}} (the commit before [https://github.com/apache/spark/commit/b53ca247d4a965002a9f31758ea2b28fe117d45f]) and found that the {{netlib-native}} libraries were missing: {code} $ git checkout 5ad9f950c4bd0042d79cdccb5277c10f8412be85 $ mvn -Pnetlib-lgpl -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package $ zipinfo -1 assembly/target/scala-2.10/spark-assembly-1.4.2-SNAPSHOT-hadoop2.4.0.jar | grep netlib-native (No output) {code} I then checked out {{b53ca247d4a965002a9f31758ea2b28fe117d45f}} and built it to test: {code} $ zipinfo -1 assembly/target/scala-2.10/spark-assembly-1.4.2-SNAPSHOT-hadoop2.4.0.jar | grep netlib-native netlib-native_ref-osx-x86_64.jnilib netlib-native_ref-osx-x86_64.jnilib.asc netlib-native_ref-osx-x86_64.pom netlib-native_ref-osx-x86_64.pom.asc META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/ META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/pom.properties netlib-native_ref-linux-x86_64.so META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/ META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/pom.properties netlib-native_ref-linux-i686.so META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/ META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/pom.properties netlib-native_ref-win-x86_64.dll netlib-native_ref-win-x86_64.dll.asc netlib-native_ref-win-x86_64.pom 
netlib-native_ref-win-x86_64.pom.asc META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/ META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/pom.properties netlib-native_ref-win-i686.dll netlib-native_ref-win-i686.dll.asc netlib-native_ref-win-i686.pom netlib-native_ref-win-i686.pom.asc META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/ META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/pom.properties netlib-native_ref-linux-armhf.so META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/ META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/pom.properties netlib-native_system-osx-x86_64.jnilib netlib-native_system-osx-x86_64.jnilib.asc netlib-native_system-osx-x86_64.pom netlib-native_system-osx-x86_64.pom.asc META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/ META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/pom.properties netlib-native_system-linux-x86_64.pom.asc netlib-native_system-linux-x86_64.pom netlib-native_system-linux-x86_64.so netlib-native_system-linux-x86_64.so.asc META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/ META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/pom.properties netlib-native_system-linux-i686.pom netlib-native_system-linux-i686.so.asc netlib-native_system-linux-i686.pom.asc netlib-native_system-linux-i686.so META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/ META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/pom.xml 
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/pom.properties netlib-native_system-linux-armhf.pom netlib-native_system-linux-armhf.so.asc netlib-native_system-linux-armhf.pom.asc netlib-native_system-linux-armhf.so META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-armhf/ META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-armhf/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-armhf/pom.properties netlib-native_system-win-x86_64.dll netlib-native_system-win-x86_64.dll.asc netlib-native_system-win-x86_64.pom netlib-native_system-win-x86_64.pom.asc META-INF/maven/com.github.fommil.netlib/netlib-native_system-win-x86_64/ META-INF/maven/com.github.fommil.netlib/netlib-native_system-win-x86_64/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_system-win-x86_64/pom.properties netlib-native_system-win-i686.dll netlib-native_system-win
[jira] [Comment Edited] (SPARK-4816) Maven profile netlib-lgpl does not work
[ https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056168#comment-15056168 ] RJ Nowling edited comment on SPARK-4816 at 12/14/15 3:57 PM: - I think [SPARK-9507] fixed the issue. I checked out git commit 5ad9f950c4bd0042d79cdccb5277c10f8412be85 (the commit before [https://github.com/apache/spark/commit/b53ca247d4a965002a9f31758ea2b28fe117d45f]) and found that the {{netlib-native}} libraries were missing: {code} $ git checkout 5ad9f950c4bd0042d79cdccb5277c10f8412be85 $ mvn -Pnetlib-lgpl -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package $ zipinfo -1 assembly/target/scala-2.10/spark-assembly-1.4.2-SNAPSHOT-hadoop2.4.0.jar | grep netlib-native (No output) {code} As such, the changes in [SPARK-8819] might have been the original cause. was (Author: rnowling): I think [SPARK-9507] fixed the issue. I checked out git commit 5ad9f950c4bd0042d79cdccb5277c10f8412be85 (the commit before [https://github.com/apache/spark/commit/b53ca247d4a965002a9f31758ea2b28fe117d45f]) and found that the {{netlib-native}} libraries were missing: {code} $ git checkout 5ad9f950c4bd0042d79cdccb5277c10f8412be85 $ mvn -Pnetlib-lgpl -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package $ zipinfo -1 assembly/target/scala-2.10/spark-assembly-1.4.2-SNAPSHOT-hadoop2.4.0.jar | grep netlib-native (No output) {code} As such, the changes in [SPARK-8819] might have been the original cause. > Maven profile netlib-lgpl does not work > --- > > Key: SPARK-4816 > URL: https://issues.apache.org/jira/browse/SPARK-4816 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.1.0 > Environment: maven 3.0.5 / Ubuntu >Reporter: Guillaume Pitel >Priority: Minor > Fix For: 1.1.1 > > > When doing what the documentation recommends to recompile Spark with Netlib > Native system binding (i.e. 
to bind with openblas or, in my case, MKL), > mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests > clean package > The resulting assembly jar still lacked the netlib-system class. (I checked > the content of spark-assembly...jar) > When forcing the netlib-lgpl profile in the MLLib package to be active, the jar > is correctly built. > So I guess it's a problem with the way maven passes profile activations to > child modules. > Also, despite the documentation claiming that if the job's jar contains > netlib with the necessary bindings, it should work, it does not. The classloader > must be unhappy with two occurrences of netlib? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4816) Maven profile netlib-lgpl does not work
[ https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056168#comment-15056168 ] RJ Nowling edited comment on SPARK-4816 at 12/14/15 3:58 PM: - I think [SPARK-9507] fixed the issue. I checked out git commit {{5ad9f950c4bd0042d79cdccb5277c10f8412be85}} (the commit before [https://github.com/apache/spark/commit/b53ca247d4a965002a9f31758ea2b28fe117d45f]) and found that the {{netlib-native}} libraries were missing: {code} $ git checkout 5ad9f950c4bd0042d79cdccb5277c10f8412be85 $ mvn -Pnetlib-lgpl -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package $ zipinfo -1 assembly/target/scala-2.10/spark-assembly-1.4.2-SNAPSHOT-hadoop2.4.0.jar | grep netlib-native (No output) {code} As such, the changes in [SPARK-8819] might have been the original cause. was (Author: rnowling): I think [SPARK-9507] fixed the issue. I checked out git commit 5ad9f950c4bd0042d79cdccb5277c10f8412be85 (the commit before [https://github.com/apache/spark/commit/b53ca247d4a965002a9f31758ea2b28fe117d45f]) and found that the {{netlib-native}} libraries were missing: {code} $ git checkout 5ad9f950c4bd0042d79cdccb5277c10f8412be85 $ mvn -Pnetlib-lgpl -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package $ zipinfo -1 assembly/target/scala-2.10/spark-assembly-1.4.2-SNAPSHOT-hadoop2.4.0.jar | grep netlib-native (No output) {code} As such, the changes in [SPARK-8819] might have been the original cause. > Maven profile netlib-lgpl does not work > --- > > Key: SPARK-4816 > URL: https://issues.apache.org/jira/browse/SPARK-4816 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.1.0 > Environment: maven 3.0.5 / Ubuntu >Reporter: Guillaume Pitel >Priority: Minor > Fix For: 1.1.1 > > > When doing what the documentation recommends to recompile Spark with Netlib > Native system binding (i.e. 
to bind with openblas or, in my case, MKL), > mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests > clean package > The resulting assembly jar still lacked the netlib-system class. (I checked > the content of spark-assembly...jar) > When forcing the netlib-lgpl profile in the MLLib package to be active, the jar > is correctly built. > So I guess it's a problem with the way maven passes profile activations to > child modules. > Also, despite the documentation claiming that if the job's jar contains > netlib with the necessary bindings, it should work, it does not. The classloader > must be unhappy with two occurrences of netlib?
[jira] [Commented] (SPARK-4816) Maven profile netlib-lgpl does not work
[ https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056168#comment-15056168 ] RJ Nowling commented on SPARK-4816: --- I think [SPARK-9507] fixed the issue. I checked out git commit 5ad9f950c4bd0042d79cdccb5277c10f8412be85 (the commit before [https://github.com/apache/spark/commit/b53ca247d4a965002a9f31758ea2b28fe117d45f]) and found that the {{netlib-native}} libraries were missing: {code} $ git checkout 5ad9f950c4bd0042d79cdccb5277c10f8412be85 $ mvn -Pnetlib-lgpl -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package $ zipinfo -1 assembly/target/scala-2.10/spark-assembly-1.4.2-SNAPSHOT-hadoop2.4.0.jar | grep netlib-native (No output) {code} As such, the changes in [SPARK-8819] might have been the original cause. > Maven profile netlib-lgpl does not work > --- > > Key: SPARK-4816 > URL: https://issues.apache.org/jira/browse/SPARK-4816 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.1.0 > Environment: maven 3.0.5 / Ubuntu >Reporter: Guillaume Pitel >Priority: Minor > Fix For: 1.1.1 > > > When doing what the documentation recommends to recompile Spark with Netlib > Native system binding (i.e. to bind with openblas or, in my case, MKL), > mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests > clean package > The resulting assembly jar still lacked the netlib-system class. (I checked > the content of spark-assembly...jar) > When forcing the netlib-lgpl profile in the MLLib package to be active, the jar > is correctly built. > So I guess it's a problem with the way maven passes profile activations to > child modules. > Also, despite the documentation claiming that if the job's jar contains > netlib with the necessary bindings, it should work, it does not. The classloader > must be unhappy with two occurrences of netlib? 
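The {{zipinfo -1 <jar> | grep netlib-native}} check used throughout this thread can also be scripted. Below is a minimal sketch using Python's stdlib {{zipfile}} in place of {{zipinfo}}; the demo jar name is made up for illustration, and the real path would be something like {{assembly/target/scala-2.10/spark-assembly-*.jar}}:

```python
import zipfile

def netlib_native_entries(jar_path):
    """List jar entries whose names mention netlib-native.

    Equivalent in spirit to `zipinfo -1 <jar> | grep netlib-native`:
    an empty result means the native BLAS/LAPACK bindings were not bundled.
    """
    with zipfile.ZipFile(jar_path) as jar:
        return [name for name in jar.namelist() if "netlib-native" in name]

# Demo with a stand-in jar (a jar is just a zip archive with extra metadata).
with zipfile.ZipFile("demo-assembly.jar", "w") as demo:
    demo.writestr("netlib-native_ref-linux-x86_64.so", b"")
    demo.writestr("org/apache/spark/SparkContext.class", b"")

# A correctly built -Pnetlib-lgpl assembly should yield a non-empty list.
print(netlib_native_entries("demo-assembly.jar"))
```

This avoids depending on {{zipinfo}} being installed, which is not guaranteed on every build host.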
[jira] [Comment Edited] (SPARK-4816) Maven profile netlib-lgpl does not work
[ https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056115#comment-15056115 ] RJ Nowling edited comment on SPARK-4816 at 12/14/15 3:42 PM: - I tested it again to make sure and ran into the same issue: {code} $ mvn -version Apache Maven 3.2.5 (12a6b3acb947671f09b81f49094c53f426d8cea1; 2014-12-14T17:29:23+00:00) Maven home: /usr/share/apache-maven Java version: 1.7.0_85, vendor: Oracle Corporation Java home: /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.85-2.6.1.2.el7_1.x86_64/jre Default locale: en_US, platform encoding: UTF-8 OS name: "linux", version: "3.10.0-229.1.2.el7.x86_64", arch: "amd64", family: "unix" $ export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m" $ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.4.1.tgz $ tar -xzvf spark-1.4.1.tgz $ cd spark-1.4.1 $ mvn -Pnetlib-lgpl -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package $ zipinfo -1 assembly/target/scala-2.10/spark-assembly-1.4.1-hadoop2.4.0.jar | grep netlib-native (No output) {code} If I build the head from git {{branch-1.4}} and run {{zipinfo}}: {code} $ git clone https://github.com/apache/spark.git spark-1.4-netlib $ cd spark-1.4-netlib $ git checkout origin/branch-1.4 $ git log | head commit c7c99857d47e4ca8373ee9ac59e108a9c443dd05 Author: Sean Owen Date: Tue Dec 8 14:34:47 2015 + [SPARK-11652][CORE] Remote code execution with InvokerTransformer Fix commons-collection group ID to commons-collections for version 3.x Patches earlier PR at https://github.com/apache/spark/pull/9731 $ zipinfo -1 assembly/target/scala-2.10/spark-assembly-1.4.3-SNAPSHOT-hadoop2.4.0.jar | grep netlib-native netlib-native_ref-osx-x86_64.jnilib netlib-native_ref-osx-x86_64.jnilib.asc netlib-native_ref-osx-x86_64.pom netlib-native_ref-osx-x86_64.pom.asc META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/ META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/pom.xml 
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/pom.properties netlib-native_ref-linux-x86_64.so META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/ META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/pom.properties netlib-native_ref-linux-i686.so META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/ META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/pom.properties netlib-native_ref-win-x86_64.dll netlib-native_ref-win-x86_64.dll.asc netlib-native_ref-win-x86_64.pom netlib-native_ref-win-x86_64.pom.asc META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/ META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/pom.properties netlib-native_ref-win-i686.dll netlib-native_ref-win-i686.dll.asc netlib-native_ref-win-i686.pom netlib-native_ref-win-i686.pom.asc META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/ META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/pom.properties netlib-native_ref-linux-armhf.so META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/ META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/pom.properties netlib-native_system-osx-x86_64.jnilib netlib-native_system-osx-x86_64.jnilib.asc netlib-native_system-osx-x86_64.pom netlib-native_system-osx-x86_64.pom.asc META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/ META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/pom.xml 
META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/pom.properties netlib-native_system-linux-x86_64.pom.asc netlib-native_system-linux-x86_64.pom netlib-native_system-linux-x86_64.so netlib-native_system-linux-x86_64.so.asc META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/ META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/pom.properties netlib-native_system-linux-i686.pom netlib-native_system-linux-i686.so.asc netlib-native_system-linux-i686.pom.asc netlib-native_system-linux-i686.so META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/ META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/pom.properties netlib-native_system-linux-armhf.pom netlib-native_system-linux-armhf.so.asc netlib-native_system-l
[jira] [Reopened] (SPARK-4816) Maven profile netlib-lgpl does not work
[ https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] RJ Nowling reopened SPARK-4816: --- > Maven profile netlib-lgpl does not work > --- > > Key: SPARK-4816 > URL: https://issues.apache.org/jira/browse/SPARK-4816 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.1.0 > Environment: maven 3.0.5 / Ubuntu >Reporter: Guillaume Pitel >Priority: Minor > Fix For: 1.1.1 > > > When doing what the documentation recommends to recompile Spark with Netlib > Native system binding (i.e. to bind with openblas or, in my case, MKL), > mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests > clean package > The resulting assembly jar still lacked the netlib-system class. (I checked > the content of spark-assembly...jar) > When forcing the netlib-lgpl profile in the MLLib package to be active, the jar > is correctly built. > So I guess it's a problem with the way maven passes profile activations to > child modules. > Also, despite the documentation claiming that if the job's jar contains > netlib with the necessary bindings, it should work, it does not. The classloader > must be unhappy with two occurrences of netlib?
[jira] [Commented] (SPARK-4816) Maven profile netlib-lgpl does not work
[ https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056115#comment-15056115 ] RJ Nowling commented on SPARK-4816: --- I tested it again to make sure and ran into the same issue: {code} $ mvn -version Apache Maven 3.2.5 (12a6b3acb947671f09b81f49094c53f426d8cea1; 2014-12-14T17:29:23+00:00) Maven home: /usr/share/apache-maven Java version: 1.7.0_85, vendor: Oracle Corporation Java home: /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.85-2.6.1.2.el7_1.x86_64/jre Default locale: en_US, platform encoding: UTF-8 OS name: "linux", version: "3.10.0-229.1.2.el7.x86_64", arch: "amd64", family: "unix" $ export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m" $ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.4.1.tgz $ tar -xzvf spark-1.4.1.tgz $ cd spark-1.4.1 $ mvn -Pnetlib-lgpl -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package $ zipinfo -1 assembly/target/scala-2.10/spark-assembly-1.4.1-hadoop2.4.0.jar | grep netlib-native (No output) {code} If I build the head from git {{branch-1.4}} and run {{zipinfo}}: {code} $ git log | head commit c7c99857d47e4ca8373ee9ac59e108a9c443dd05 Author: Sean Owen Date: Tue Dec 8 14:34:47 2015 + [SPARK-11652][CORE] Remote code execution with InvokerTransformer Fix commons-collection group ID to commons-collections for version 3.x Patches earlier PR at https://github.com/apache/spark/pull/9731 $ zipinfo -1 assembly/target/scala-2.10/spark-assembly-1.4.3-SNAPSHOT-hadoop2.4.0.jar | grep netlib-native netlib-native_ref-osx-x86_64.jnilib netlib-native_ref-osx-x86_64.jnilib.asc netlib-native_ref-osx-x86_64.pom netlib-native_ref-osx-x86_64.pom.asc META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/ META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/pom.properties netlib-native_ref-linux-x86_64.so 
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/ META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/pom.properties netlib-native_ref-linux-i686.so META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/ META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/pom.properties netlib-native_ref-win-x86_64.dll netlib-native_ref-win-x86_64.dll.asc netlib-native_ref-win-x86_64.pom netlib-native_ref-win-x86_64.pom.asc META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/ META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/pom.properties netlib-native_ref-win-i686.dll netlib-native_ref-win-i686.dll.asc netlib-native_ref-win-i686.pom netlib-native_ref-win-i686.pom.asc META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/ META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/pom.properties netlib-native_ref-linux-armhf.so META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/ META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/pom.properties netlib-native_system-osx-x86_64.jnilib netlib-native_system-osx-x86_64.jnilib.asc netlib-native_system-osx-x86_64.pom netlib-native_system-osx-x86_64.pom.asc META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/ META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/pom.properties netlib-native_system-linux-x86_64.pom.asc netlib-native_system-linux-x86_64.pom 
netlib-native_system-linux-x86_64.so netlib-native_system-linux-x86_64.so.asc META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/ META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/pom.properties netlib-native_system-linux-i686.pom netlib-native_system-linux-i686.so.asc netlib-native_system-linux-i686.pom.asc netlib-native_system-linux-i686.so META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/ META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/pom.xml META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/pom.properties netlib-native_system-linux-armhf.pom netlib-native_system-linux-armhf.so.asc netlib-native_system-linux-armhf.pom.asc netlib-native_system-linux-armhf.so META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-armhf/ META-INF/maven/com.github.fommil.netlib/ne
[jira] [Commented] (SPARK-4816) Maven profile netlib-lgpl does not work
[ https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15051056#comment-15051056 ] RJ Nowling commented on SPARK-4816: --- Hi [~srowen], I haven't tried master yet but that wouldn't address the problem I'm seeing. As I said, I downloaded the source tarball from the spark.apache.org web site vs. checking out branch-1.4. I think it has something to do with the release process (but saying this with ignorance of what is involved). I ran the same build command with both the source tarball (which reported excluding the native libs in the shading) and the branch-1.4 head from git (which reported including the native libs in the shading). The .m2 repo shouldn't be an issue. Normally, Spark pulls in the {{core}} artifact ID, which excludes the native libraries. When the {{netlib-lgpl}} profile is enabled, the Spark MLLib pom.xml adds the {{all}} artifact ID which pulls in the native libs. ({{all}} is really just a pom.xml file that pulls in {{core}} + native libs). I get that this is weird. I also get that my knowledge of the release process is basically zero. But I shouldn't have different results from git vs the released source tarball. Maybe it's not the release process -- maybe something has changed in the meantime. I'll search through the commits on branch-1.4 for something related to shading. > Maven profile netlib-lgpl does not work > --- > > Key: SPARK-4816 > URL: https://issues.apache.org/jira/browse/SPARK-4816 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.1.0 > Environment: maven 3.0.5 / Ubuntu >Reporter: Guillaume Pitel >Priority: Minor > Fix For: 1.1.1 > > > When doing what the documentation recommends to recompile Spark with Netlib > Native system binding (i.e. 
to bind with openblas or, in my case, MKL), > mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests > clean package > The resulting assembly jar still lacked the netlib-system class. (I checked > the content of spark-assembly...jar) > When forcing the netlib-lgpl profile in the MLLib package to be active, the jar > is correctly built. > So I guess it's a problem with the way maven passes profile activations to > child modules. > Also, despite the documentation claiming that if the job's jar contains > netlib with the necessary bindings, it should work, it does not. The classloader > must be unhappy with two occurrences of netlib?
[jira] [Comment Edited] (SPARK-4816) Maven profile netlib-lgpl does not work
[ https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15051056#comment-15051056 ] RJ Nowling edited comment on SPARK-4816 at 12/10/15 2:52 PM: - Hi [~srowen], I haven't tried master yet but that wouldn't address the problem I'm seeing. As I said, I downloaded the source tarball from the spark.apache.org web site vs. checking out branch-1.4. I think it has something to do with the release process (but saying this with ignorance of what is involved). I ran the same build command with both the source tarball (which reported excluding the native libs in the shading) and the branch-1.4 head from git (which reported including the native libs in the shading). The .m2 repo shouldn't be an issue. Normally, Spark pulls in the {{core}} artifact ID, which excludes the native libraries. When the {{netlib-lgpl}} profile is enabled, the Spark MLLib pom.xml adds the {{all}} artifact ID which pulls in the native libs. ({{all}} is really just a pom.xml file that pulls in {{core}} + native libs). I get that this is weird. I also get that my knowledge of the release process is basically zero. But I shouldn't have different results from git vs the released source tarball. Maybe it's not the release process -- maybe something has changed in the meantime. I'll search through the commits on branch-1.4 for something related to shading. was (Author: rnowling): Hi [~srowen], I haven't tried master yet but that wouldn't address the problem I'm seeing. As I said, I downloaded the source tarball from the spark.apache.org web site vs. checking out branch-1.4. I think it has something to do with the release process (but saying this with ignorance of what is involved). I ran the same build command with both the source tarball (which reported excluding the native libs in the shading) and the branch-1.4 head from git (which reported including the native libs in the shading). 
The .m2 repo shouldn't be an issue. Normally, Spark pulls in the {{core}} artifact ID, which excludes the native libraries. When the {{netlib-lgpl}} profile is enabled, the Spark MLLib pom.xml adds the {{all}} artifact ID which pulls in the native libs. ({{all}} is really just a pom.xml file that pulls in {{core}} + native libs). I get that this is weird. I also get that my knowledge of the release process is basically zero. But I shouldn't have different results from git vs the released source tarball. Maybe it's not the release process -- maybe something has changed in the meantime. I'll search through the commits on branch-1.4 for something related to shading. > Maven profile netlib-lgpl does not work > --- > > Key: SPARK-4816 > URL: https://issues.apache.org/jira/browse/SPARK-4816 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.1.0 > Environment: maven 3.0.5 / Ubuntu >Reporter: Guillaume Pitel >Priority: Minor > Fix For: 1.1.1 > > > When doing what the documentation recommends to recompile Spark with Netlib > Native system binding (i.e. to bind with openblas or, in my case, MKL), > mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests > clean package > The resulting assembly jar still lacked the netlib-system class. (I checked > the content of spark-assembly...jar) > When forcing the netlib-lgpl profile in the MLLib package to be active, the jar > is correctly built. > So I guess it's a problem with the way maven passes profile activations to > child modules. > Also, despite the documentation claiming that if the job's jar contains > netlib with the necessary bindings, it should work, it does not. The classloader > must be unhappy with two occurrences of netlib?
[jira] [Reopened] (SPARK-4816) Maven profile netlib-lgpl does not work
[ https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] RJ Nowling reopened SPARK-4816: --- I ran into the same issue with Spark 1.4. If I download the tarball from {{spark.apache.org}} and build with {{-Pnetlib-lgpl}}, the native libraries are excluded from the jar by the shader. However, if I check out branch-1.4 from github and build with that, the appropriate libraries are included. I don't know much about the source release process, but is it possible that something in that process results in different Maven builds? > Maven profile netlib-lgpl does not work > --- > > Key: SPARK-4816 > URL: https://issues.apache.org/jira/browse/SPARK-4816 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.1.0 > Environment: maven 3.0.5 / Ubuntu >Reporter: Guillaume Pitel >Priority: Minor > Fix For: 1.1.1 > > > When doing what the documentation recommends to recompile Spark with Netlib > Native system binding (i.e. to bind with openblas or, in my case, MKL), > mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests > clean package > The resulting assembly jar still lacked the netlib-system class. (I checked > the content of spark-assembly...jar) > When forcing the netlib-lgpl profile in the MLLib package to be active, the jar > is correctly built. > So I guess it's a problem with the way maven passes profile activations to > child modules. > Also, despite the documentation claiming that if the job's jar contains > netlib with the necessary bindings, it should work, it does not. The classloader > must be unhappy with two occurrences of netlib?
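One way to pin down the tarball-vs-git discrepancy described above is to diff the two assembly jars directly. A minimal sketch with Python's stdlib {{zipfile}}; the jar names below are placeholders, and the demo builds two tiny stand-in jars rather than real Spark assemblies:

```python
import zipfile

def netlib_entry_diff(jar_a, jar_b):
    """Return the netlib-native entries unique to each of two assembly jars."""
    def entries(path):
        with zipfile.ZipFile(path) as jar:
            return {name for name in jar.namelist() if "netlib-native" in name}
    a, b = entries(jar_a), entries(jar_b)
    return sorted(a - b), sorted(b - a)

# Stand-in jars: one built like the git branch-1.4 head (natives bundled)
# and one like the released source tarball (natives missing).
with zipfile.ZipFile("git-build.jar", "w") as g:
    g.writestr("netlib-native_system-linux-x86_64.so", b"")
with zipfile.ZipFile("tarball-build.jar", "w") as t:
    t.writestr("org/apache/spark/SparkContext.class", b"")

only_git, only_tarball = netlib_entry_diff("git-build.jar", "tarball-build.jar")
print(only_git)      # entries the tarball build is missing
print(only_tarball)  # entries only the tarball build has
```

If the two builds really do diverge only in the shading step, the first list should contain exactly the native libraries and the second should be empty.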
[jira] [Commented] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)
[ https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622342#comment-14622342 ] RJ Nowling commented on SPARK-3644: --- [~joshrosen] Thanks for pointing to the new JIRA! :) > REST API for Spark application info (jobs / stages / tasks / storage info) > -- > > Key: SPARK-3644 > URL: https://issues.apache.org/jira/browse/SPARK-3644 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Web UI >Reporter: Josh Rosen >Assignee: Imran Rashid > Fix For: 1.4.0 > > > This JIRA is a forum to draft a design proposal for a REST interface for > accessing information about Spark applications, such as job / stage / task / > storage status. > There have been a number of proposals to serve JSON representations of the > information displayed in Spark's web UI. Given that we might redesign the > pages of the web UI (and possibly re-implement the UI as a client of a REST > API), the API endpoints and their responses should be independent of what we > choose to display on particular web UI pages / layouts. > Let's start a discussion of what a good REST API would look like from > first-principles. We can discuss what urls / endpoints expose access to > data, how our JSON responses will be formatted, how fields will be named, how > the API will be documented and tested, etc. > Some links for inspiration: > https://developer.github.com/v3/ > http://developer.netflix.com/docs/REST_API_Reference > https://helloreverb.com/developers/swagger -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)
[ https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14619239#comment-14619239 ] RJ Nowling commented on SPARK-3644: --- [~joshrosen] Several users commented above about using a REST API for submitting and killing jobs to support integration with Sahara and web-based front-ends. Adding support for killing jobs shouldn't be too hard. Submitting jobs is probably harder to add at the moment since the Spark master doesn't exist until the application is launched. But I think we should acknowledge the needs of these users instead of just closing this JIRA. > REST API for Spark application info (jobs / stages / tasks / storage info) > -- > > Key: SPARK-3644 > URL: https://issues.apache.org/jira/browse/SPARK-3644 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Web UI >Reporter: Josh Rosen >Assignee: Imran Rashid > Fix For: 1.4.0 > > > This JIRA is a forum to draft a design proposal for a REST interface for > accessing information about Spark applications, such as job / stage / task / > storage status. > There have been a number of proposals to serve JSON representations of the > information displayed in Spark's web UI. Given that we might redesign the > pages of the web UI (and possibly re-implement the UI as a client of a REST > API), the API endpoints and their responses should be independent of what we > choose to display on particular web UI pages / layouts. > Let's start a discussion of what a good REST API would look like from > first-principles. We can discuss what urls / endpoints expose access to > data, how our JSON responses will be formatted, how fields will be named, how > the API will be documented and tested, etc. 
> Some links for inspiration: > https://developer.github.com/v3/ > http://developer.netflix.com/docs/REST_API_Reference > https://helloreverb.com/developers/swagger -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)
[ https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14619144#comment-14619144 ] RJ Nowling commented on SPARK-3644: --- [~joshrosen] The issue and corresponding PR you reference only seem to provide read-only access. Is that correct? If so, then are there open issues to address the needs of the users above? Thanks! > REST API for Spark application info (jobs / stages / tasks / storage info) > -- > > Key: SPARK-3644 > URL: https://issues.apache.org/jira/browse/SPARK-3644 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Web UI >Reporter: Josh Rosen >Assignee: Imran Rashid > Fix For: 1.4.0 > > > This JIRA is a forum to draft a design proposal for a REST interface for > accessing information about Spark applications, such as job / stage / task / > storage status. > There have been a number of proposals to serve JSON representations of the > information displayed in Spark's web UI. Given that we might redesign the > pages of the web UI (and possibly re-implement the UI as a client of a REST > API), the API endpoints and their responses should be independent of what we > choose to display on particular web UI pages / layouts. > Let's start a discussion of what a good REST API would look like from > first-principles. We can discuss what urls / endpoints expose access to > data, how our JSON responses will be formatted, how fields will be named, how > the API will be documented and tested, etc. > Some links for inspiration: > https://developer.github.com/v3/ > http://developer.netflix.com/docs/REST_API_Reference > https://helloreverb.com/developers/swagger -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4729) Add time series subsampling to MLlib
[ https://issues.apache.org/jira/browse/SPARK-4729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14615016#comment-14615016 ] RJ Nowling commented on SPARK-4729: --- Hi [~yalamart], I haven't looked at this in quite a while but feel free to take it over. Since this would depend on a time index, it doesn't really make sense without an agreement on how time series should be recognized. See the comments of [SPARK-4727] for work on time series packages -- maybe you would want to look into contributing to those efforts first? > Add time series subsampling to MLlib > > > Key: SPARK-4729 > URL: https://issues.apache.org/jira/browse/SPARK-4729 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.1.0 >Reporter: RJ Nowling >Priority: Minor > > MLlib supports several time series functions. The ability to subsample a > time series (take every n data points) is missing. > I'd like to add it, so please assign this to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6522) Standardize Random Number Generation
RJ Nowling created SPARK-6522: - Summary: Standardize Random Number Generation Key: SPARK-6522 URL: https://issues.apache.org/jira/browse/SPARK-6522 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: RJ Nowling Priority: Minor Generation of random numbers in Spark has to be handled carefully since references to RNGs copy the state to the workers. As such, a separate RNG needs to be seeded for each partition. Each time random numbers are used in Spark's libraries, the RNG seeding is re-implemented, leaving open the possibility of mistakes. It would be useful if RNG seeding was standardized through utility functions or random number generation functions that can be called in Spark pipelines. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
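The per-partition seeding pattern the issue asks to standardize can be sketched outside Spark: derive a distinct, deterministic seed for each partition from a global seed plus the partition index, so no two partitions share RNG state and runs are reproducible. The helper names below are illustrative, not Spark API:

```python
import random

def per_partition_rngs(global_seed, num_partitions):
    # One RNG per partition, deterministically derived from the global
    # seed and the partition index, so partitions never share RNG state.
    return [random.Random(global_seed + i) for i in range(num_partitions)]

def sample_partition(rng, partition, fraction):
    # Bernoulli sampling: one independent trial per element.
    return [x for x in partition if rng.random() < fraction]

partitions = [list(range(p * 100, (p + 1) * 100)) for p in range(4)]
rngs = per_partition_rngs(global_seed=42, num_partitions=4)
samples = [sample_partition(rng, part, 0.1) for rng, part in zip(rngs, partitions)]
```

Because each partition's seed is a pure function of the global seed and the partition index, re-running the whole pipeline with the same global seed reproduces the same samples, which is exactly what ad hoc per-call seeding tends to get wrong.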
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357410#comment-14357410 ] RJ Nowling commented on SPARK-2429: --- I'm familiar with the community interest but I'm not terribly familiar with the implementations (old or new). [~freeman-lab] may be the appropriate person to ask for help -- the original implementation was based on his gist. > Hierarchical Implementation of KMeans > - > > Key: SPARK-2429 > URL: https://issues.apache.org/jira/browse/SPARK-2429 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Assignee: Yu Ishikawa >Priority: Minor > Labels: clustering > Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The > Result of Benchmarking a Hierarchical Clustering.pdf, > benchmark-result.2014-10-29.html, benchmark2.html > > > Hierarchical clustering algorithms are widely used and would make a nice > addition to MLlib. Clustering algorithms are useful for determining > relationships between clusters as well as offering faster assignment. > Discussion on the dev list suggested the following possible approaches: > * Top down, recursive application of KMeans > * Reuse DecisionTree implementation with different objective function > * Hierarchical SVD > It was also suggested that support for distance metrics other than Euclidean > such as negative dot or cosine are necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357389#comment-14357389 ] RJ Nowling commented on SPARK-2429: --- [~josephkb] I think it would be great to get the new implementation into Spark, but we need a champion for it. [~yuu.ishik...@gmail.com] did some great work, and I've been trying to shepherd the work, but we need a committer who wants to bring it in. If you want to do that, then I can step back and let you and [~yuu.ishik...@gmail.com] bring this across the finish line. > Hierarchical Implementation of KMeans > - > > Key: SPARK-2429 > URL: https://issues.apache.org/jira/browse/SPARK-2429 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Assignee: Yu Ishikawa >Priority: Minor > Labels: clustering > Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The > Result of Benchmarking a Hierarchical Clustering.pdf, > benchmark-result.2014-10-29.html, benchmark2.html > > > Hierarchical clustering algorithms are widely used and would make a nice > addition to MLlib. Clustering algorithms are useful for determining > relationships between clusters as well as offering faster assignment. > Discussion on the dev list suggested the following possible approaches: > * Top down, recursive application of KMeans > * Reuse DecisionTree implementation with different objective function > * Hierarchical SVD > It was also suggested that support for distance metrics other than Euclidean > such as negative dot or cosine are necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357367#comment-14357367 ] RJ Nowling commented on SPARK-2429: --- Hi [~yuu.ishik...@gmail.com] I think the new implementation is great. Did you change the algorithm? I've spoken with [~srowen]. The hierarchical clustering would be valuable to the community -- I actually had a couple people reach out to me about it. However, Spark is currently undergoing the transition to the new ML API and as such, there is concern about accepting code into the older MLlib library. With the announcement of Spark packages, there is also a move to encourage external libraries instead of large commits into Spark itself. Would you be interested in publishing your hierarchical clustering implementation as an external library like [~derrickburns] did for the [KMeans Mini Batch implementation|https://github.com/derrickburns/generalized-kmeans-clustering]? It could be listed in the [Spark packages index|http://spark-packages.org/] along with two other clustering packages so users can find it. > Hierarchical Implementation of KMeans > - > > Key: SPARK-2429 > URL: https://issues.apache.org/jira/browse/SPARK-2429 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Assignee: Yu Ishikawa >Priority: Minor > Labels: clustering > Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The > Result of Benchmarking a Hierarchical Clustering.pdf, > benchmark-result.2014-10-29.html, benchmark2.html > > > Hierarchical clustering algorithms are widely used and would make a nice > addition to MLlib. Clustering algorithms are useful for determining > relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches: > * Top down, recursive application of KMeans > * Reuse DecisionTree implementation with different objective function > * Hierarchical SVD > It was also suggested that support for distance metrics other than Euclidean > such as negative dot or cosine are necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6167) Previous Commit Broke BroadcastTest
[ https://issues.apache.org/jira/browse/SPARK-6167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14347773#comment-14347773 ] RJ Nowling commented on SPARK-6167: --- Great! Thanks! > Previous Commit Broke BroadcastTest > --- > > Key: SPARK-6167 > URL: https://issues.apache.org/jira/browse/SPARK-6167 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 1.2.1 >Reporter: RJ Nowling >Assignee: Sean Owen >Priority: Minor > Fix For: 1.1.2, 1.2.2 > > > Commit associated with SPARK-1010 spell class names incorrectly > (BroaddcastFactory instead of BroadcastFactory). As a result, the > BroadcastTest doesn't work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6167) Previous Commit Broke BroadcastTest
[ https://issues.apache.org/jira/browse/SPARK-6167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14347607#comment-14347607 ] RJ Nowling commented on SPARK-6167: --- This PR fixes the issue in master and the 1.3 branch: https://github.com/apache/spark/pull/4724 Needs to be merged into 1.2 branch as well. > Previous Commit Broke BroadcastTest > --- > > Key: SPARK-6167 > URL: https://issues.apache.org/jira/browse/SPARK-6167 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 1.2.1 >Reporter: RJ Nowling >Priority: Minor > > Commit associated with SPARK-1010 spell class names incorrectly > (BroaddcastFactory instead of BroadcastFactory). As a result, the > BroadcastTest doesn't work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6167) Previous Commit Broke BroadcastTest
RJ Nowling created SPARK-6167: - Summary: Previous Commit Broke BroadcastTest Key: SPARK-6167 URL: https://issues.apache.org/jira/browse/SPARK-6167 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 1.2.1 Reporter: RJ Nowling Priority: Minor The commit associated with SPARK-1010 spells the class name incorrectly (BroaddcastFactory instead of BroadcastFactory). As a result, BroadcastTest doesn't work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343829#comment-14343829 ] RJ Nowling commented on SPARK-2308: --- Ok, we should mark the status of the JIRA as won't fix. (cc [~mengxr] and [~srowen]) Thanks for the excellent implementation on your GitHub repo! Will be very beneficial to the community! > Add KMeans MiniBatch clustering algorithm to MLlib > -- > > Key: SPARK-2308 > URL: https://issues.apache.org/jira/browse/SPARK-2308 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Assignee: RJ Nowling >Priority: Minor > Labels: clustering > Attachments: many_small_centers.pdf, uneven_centers.pdf > > > Mini-batch is a version of KMeans that uses a randomly-sampled subset of the > data points in each iteration instead of the full set of data points, > improving performance (and in some cases, accuracy). The mini-batch version > is compatible with the KMeans|| initialization algorithm currently > implemented in MLlib. > I suggest adding KMeans Mini-batch as an alternative. > I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
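The mini-batch update referenced in this thread is small enough to sketch in plain Python. This follows the common per-center learning-rate scheme (assign each sampled point to its nearest center, then move that center by a step of 1/count); it is illustrative only, not the implementation from the linked GitHub repo:

```python
import random

def mini_batch_kmeans(points, k, iterations, batch_size, seed=0):
    """Mini-batch k-means on 1-D points (illustrative sketch)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)   # initialize from random data points
    counts = [0] * k                  # samples each center has absorbed
    for _ in range(iterations):
        batch = [rng.choice(points) for _ in range(batch_size)]
        for x in batch:
            # assign the sampled point to its nearest center
            j = min(range(k), key=lambda c: (centers[c] - x) ** 2)
            counts[j] += 1
            eta = 1.0 / counts[j]     # per-center learning rate decays over time
            centers[j] = (1.0 - eta) * centers[j] + eta * x
    return centers

# two well-separated 1-D clusters
rnd = random.Random(1)
data = ([rnd.gauss(0.0, 0.5) for _ in range(200)]
        + [rnd.gauss(10.0, 0.5) for _ in range(200)])
centers = mini_batch_kmeans(data, k=2, iterations=100, batch_size=20, seed=7)
```

With eta = 1/count each center ends up at the running mean of the samples assigned to it, which is why the result lands close to full-batch k-means on well-separated data while touching only a fraction of the points per iteration.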
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343806#comment-14343806 ] RJ Nowling commented on SPARK-2429: --- [~yuu.ishik...@gmail.com] are you still working on this? Thanks! > Hierarchical Implementation of KMeans > - > > Key: SPARK-2429 > URL: https://issues.apache.org/jira/browse/SPARK-2429 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Assignee: Yu Ishikawa >Priority: Minor > Labels: clustering > Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The > Result of Benchmarking a Hierarchical Clustering.pdf, > benchmark-result.2014-10-29.html, benchmark2.html > > > Hierarchical clustering algorithms are widely used and would make a nice > addition to MLlib. Clustering algorithms are useful for determining > relationships between clusters as well as offering faster assignment. > Discussion on the dev list suggested the following possible approaches: > * Top down, recursive application of KMeans > * Reuse DecisionTree implementation with different objective function > * Hierarchical SVD > It was also suggested that support for distance metrics other than Euclidean > such as negative dot or cosine are necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343804#comment-14343804 ] RJ Nowling commented on SPARK-2308: --- [~derrickburns] and [~mengxr] Is work still being done on this JIRA and Derrick's PR? Thanks! > Add KMeans MiniBatch clustering algorithm to MLlib > -- > > Key: SPARK-2308 > URL: https://issues.apache.org/jira/browse/SPARK-2308 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Assignee: RJ Nowling >Priority: Minor > Labels: clustering > Attachments: many_small_centers.pdf, uneven_centers.pdf > > > Mini-batch is a version of KMeans that uses a randomly-sampled subset of the > data points in each iteration instead of the full set of data points, > improving performance (and in some cases, accuracy). The mini-batch version > is compatible with the KMeans|| initialization algorithm currently > implemented in MLlib. > I suggest adding KMeans Mini-batch as an alternative. > I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2430) Standarized Clustering Algorithm API and Framework
[ https://issues.apache.org/jira/browse/SPARK-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342382#comment-14342382 ] RJ Nowling commented on SPARK-2430: --- I think we can close this JIRA. It's been superseded by the new Pipeline API as you mentioned. > Standarized Clustering Algorithm API and Framework > -- > > Key: SPARK-2430 > URL: https://issues.apache.org/jira/browse/SPARK-2430 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Priority: Minor > > Recently, there has been a chorus of voices on the mailing lists about adding > new clustering algorithms to MLlib. To support these additions, we should > develop a common framework and API to reduce code duplication and keep the > APIs consistent. > At the same time, we can also expand the current API to incorporate requested > features such as arbitrary distance metrics or pre-computed distance matrices. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14285980#comment-14285980 ] RJ Nowling commented on SPARK-4894: --- [~mengxr] Since [~lmcguire] has submitted the patch, can we assign the JIRA to her so she gets credit for it? Thanks! > Add Bernoulli-variant of Naive Bayes > > > Key: SPARK-4894 > URL: https://issues.apache.org/jira/browse/SPARK-4894 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.2.0 >Reporter: RJ Nowling >Assignee: RJ Nowling > > MLlib only supports the multinomial-variant of Naive Bayes. The Bernoulli > version of Naive Bayes is more useful for situations where the features are > binary values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5328) Update PySpark MLlib NaiveBayes API to take model type parameter for Bernoulli fit
[ https://issues.apache.org/jira/browse/SPARK-5328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14283887#comment-14283887 ] RJ Nowling commented on SPARK-5328: --- The Python API for Naive Bayes is located in python/pyspark/mllib/classification.py . The Python implementation calls the Scala implementation for training through the interface in mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala . The classes in classification.py will need to be updated (with additional pydoc tests), a new method will need to be added to PythonMLLibAPI.scala, and the Python portion of docs/mllib-naive-bayes.md will need to be updated. > Update PySpark MLlib NaiveBayes API to take model type parameter for > Bernoulli fit > -- > > Key: SPARK-5328 > URL: https://issues.apache.org/jira/browse/SPARK-5328 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Leah McGuire >Priority: Minor > Labels: mllib > > [SPARK-4894] Adds Bernoulli-variant of Naive Bayes adds Bernoulli fitting to > NaiveBayes.scala need to update python API to accept model type parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
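The behavioral difference a model-type parameter would select between can be sketched outside Spark. The function names below are hypothetical, not the PySpark API; the point is that the Bernoulli variant also charges for *absent* binary features, while the multinomial variant only scores features that occur:

```python
import math

def multinomial_log_likelihood(x, log_theta):
    # Multinomial NB: only features that occur (xi > 0) contribute,
    # weighted by their counts.
    return sum(xi * lt for xi, lt in zip(x, log_theta))

def bernoulli_log_likelihood(x, theta):
    # Bernoulli NB: absent features (xi == 0) also contribute,
    # through the log(1 - theta) term.
    return sum(math.log(t) if xi else math.log(1.0 - t)
               for xi, t in zip(x, theta))

theta = [0.8, 0.1, 0.6]                  # P(feature present | class)
log_theta = [math.log(t) for t in theta]
x = [1, 0, 1]                            # binary document vector
multi = multinomial_log_likelihood(x, log_theta)
bern = bernoulli_log_likelihood(x, theta)
```

On the same binary vector the two scores differ by exactly the log(1 - theta) terms of the absent features, which is why the two model types can rank classes differently.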
[jira] [Commented] (SPARK-5272) Refactor NaiveBayes to support discrete and continuous labels,features
[ https://issues.apache.org/jira/browse/SPARK-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279258#comment-14279258 ] RJ Nowling commented on SPARK-5272: --- Hi [~josephkb], I can see benefits to your suggestions of feature types (e.g., categorical, discrete counts, continuous, binary, etc.). If we created corresponding FeatureLikelihood types (e.g., Bernoulli, Multinomial, Gaussian, etc.), it would promote composition, which would be easier to test, debug, and maintain versus multiple NB subclasses like sklearn. Additionally, if the user can define a type for each feature, then users can mix and match likelihood types as well. Most NB implementations treat all features the same -- what if we had a model that allowed heterogeneous features? If it works well in NB, it could be extended to other parts of MLlib. (There is likely some overlap with decision trees since they support multiple feature types, so we might want to see if there is anything there we can reuse.) At the API level, we could provide a basic API which takes {noformat}RDD[Vector[Double]]{noformat} like the current API so that simplicity isn't compromised and provide a more advanced API for power users. Does this sound like I'm understanding you correctly? Re: Decision trees. Decision tree models generally support different types of features (categorical, binary, discrete, continuous). Does Spark's decision tree implementation support those different types? How are they handled? Do they abstract the feature type? I feel there could be common ground here. > Refactor NaiveBayes to support discrete and continuous labels,features > -- > > Key: SPARK-5272 > URL: https://issues.apache.org/jira/browse/SPARK-5272 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > > This JIRA is to discuss refactoring NaiveBayes in order to support both > discrete and continuous labels and features. 
> Currently, NaiveBayes supports only discrete labels and features. > Proposal: Generalize it to support continuous values as well. > Some items to discuss are: > * How commonly are continuous labels/features used in practice? (Is this > necessary?) > * What should the API look like? > ** E.g., should NB have multiple classes for each type of label/feature, or > should it take a general Factor type parameter? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
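The composition idea floated in this thread (one likelihood object per feature, mixed freely within a single Naive Bayes class model) can be sketched as follows. The class names are illustrative, not a proposed MLlib API:

```python
import math

class BernoulliFeature:
    """Likelihood for a binary feature."""
    def __init__(self, p):
        self.p = p
    def log_likelihood(self, x):
        return math.log(self.p if x else 1.0 - self.p)

class GaussianFeature:
    """Likelihood for a continuous feature."""
    def __init__(self, mu, sigma):
        self.mu, self.sigma = mu, sigma
    def log_likelihood(self, x):
        z = (x - self.mu) / self.sigma
        return -0.5 * z * z - math.log(self.sigma * math.sqrt(2.0 * math.pi))

class ClassModel:
    """One class: a log-prior plus one likelihood object per feature."""
    def __init__(self, log_prior, features):
        self.log_prior = log_prior
        self.features = features
    def score(self, x):
        # Naive Bayes assumption: features are independent given the class.
        return self.log_prior + sum(f.log_likelihood(xi)
                                    for f, xi in zip(self.features, x))

# one binary feature and one continuous feature, mixed in a single model
spam = ClassModel(math.log(0.4), [BernoulliFeature(0.9), GaussianFeature(120.0, 30.0)])
ham = ClassModel(math.log(0.6), [BernoulliFeature(0.1), GaussianFeature(60.0, 20.0)])
x = [1, 115.0]
prediction = "spam" if spam.score(x) > ham.score(x) else "ham"
```

Each likelihood type can be unit-tested in isolation, and adding a new feature type (e.g., multinomial counts) means adding one small class rather than a new NB subclass, which is the maintainability argument made above.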
[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279250#comment-14279250 ] RJ Nowling commented on SPARK-4894: --- Thanks, [~josephkb]! I'd be happy to help with the NB refactoring too :) > Add Bernoulli-variant of Naive Bayes > > > Key: SPARK-4894 > URL: https://issues.apache.org/jira/browse/SPARK-4894 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.2.0 >Reporter: RJ Nowling >Assignee: RJ Nowling > > MLlib only supports the multinomial-variant of Naive Bayes. The Bernoulli > version of Naive Bayes is more useful for situations where the features are > binary values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278228#comment-14278228 ] RJ Nowling edited comment on SPARK-4894 at 1/15/15 4:21 AM: [~josephkb], after some thought, I've come around and think your idea of 1 NB class with a Factor type parameter may be the more maintainable choice as well as offering some novel functionality. But, there seems to be a lot to figure out (we should be checking the decision tree implementation for example) and I don't want to hold up what should be a relatively simple change to support Bernoulli NB. Can we create a new JIRA to discuss the NB refactoring? Comments about refactoring: (1) how often is NB used with continuous values? I see that sklearn supports Gaussian NB but is this used in practice? My understanding is that NB is generally used for text classification with counts or binary values, possibly weighted by TF-IDF. We should probably email the users and dev lists to get user feedback. If no one is asking for it, we should shelve it and focus on other things. (2) after some more reflection, I can see a few more benefits to your suggestions of feature types (e.g., categorical, discrete counts, continuous, binary, etc.). If we created corresponding FeatureLikelihood types (e.g., Bernoulli, Multinomial, Gaussian, etc.), it would promote composition which would be easier to test, debug, and maintain versus multiple NB subclasses like sklearn. Additionally, if the user can define a type for each feature, then users can mix and match likelihood types as well. Most NB implementations treat all features the same -- what if we had a model that allowed heterogeneous features? If it works well in NB, it could be extended to other parts of MLlib. (There is likely some overlap with decision trees since they support multiple feature types, so we might want to see if there is anything there we can reuse.) 
At the API level, we could provide a basic API which takes {noformat}RDD[Vector[Double]]{noformat} like the current API so that simplicity isn't compromised and provide a more advanced API for power users. was (Author: rnowling): [~josephkb], after some thought, I've come around and think your idea of 1 NB class with a Factor type parameter may be the more maintainable choice as well as offering some novel functionality. But, there seems to be a lot to figure out (we should be checking the decision tree implementation for example) and I don't want to hold up what should be a relatively simple change to support Bernoulli NB. What do you think? Comments about refactoring: (1) how often is NB used with continuous values? I see that sklearn supports Gaussian NB but is this used in practice? My understanding is that NB is generally used for text classification with counts or binary values, possibly weighted by TF-IDF. We should probably email the users and dev lists to get user feedback. If no one is asking for it, we should shelve it and focus on other things. (2) after some more reflection, I can see a few more benefits to your suggestions of feature types (e.g., categorical, discrete counts, continuous, binary, etc.). If we created corresponding FeatureLikelihood types (e.g., Bernoulli, Multinomial, Gaussian, etc.), it would promote composition which would be easier to test, debug, and maintain versus multiple NB subclasses like sklearn. Additionally, if the user can define a type for each feature, then users can mix and match likelihood types as well. Most NB implementations treat all features the same -- what if we had a model that allowed heterogeneous features? If it works well in NB, it could be extended to other parts of MLlib. (There is likely some overlap with decision trees since they support multiple feature types, so we might want to see if there is anything there we can reuse.) 
At the API level, we could provide a basic API which takes {noformat}RDD[Vector[Double]]{noformat} like the current API so that simplicity isn't compromised and provide a more advanced API for power users. > Add Bernoulli-variant of Naive Bayes > > > Key: SPARK-4894 > URL: https://issues.apache.org/jira/browse/SPARK-4894 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.2.0 >Reporter: RJ Nowling >Assignee: RJ Nowling > > MLlib only supports the multinomial-variant of Naive Bayes. The Bernoulli > version of Naive Bayes is more useful for situations where the features are > binary values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-
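The composition idea above can be sketched quickly. Below is a minimal, illustrative Python sketch (the class and function names are hypothetical -- they are not Spark or sklearn APIs) of per-feature FeatureLikelihood types that mix a Bernoulli feature and a Gaussian feature in one naive Bayes score:

```python
import math

# Hypothetical sketch of the "FeatureLikelihood" composition idea:
# each feature owns its own likelihood type, and naive Bayes just sums
# per-feature log-likelihoods. Nothing here is a Spark API.

class BernoulliLikelihood:
    def __init__(self, p):               # p = P(feature = 1 | class)
        self.p = p
    def log_likelihood(self, x):
        return math.log(self.p) if x == 1 else math.log(1.0 - self.p)

class GaussianLikelihood:
    def __init__(self, mu, sigma):
        self.mu, self.sigma = mu, sigma
    def log_likelihood(self, x):
        z = (x - self.mu) / self.sigma
        return -0.5 * z * z - math.log(self.sigma * math.sqrt(2 * math.pi))

def joint_log_likelihood(log_prior, likelihoods, x):
    # Naive Bayes: log P(c) + sum_i log P(x_i | c), one term per feature.
    return log_prior + sum(lik.log_likelihood(xi)
                           for lik, xi in zip(likelihoods, x))

# Heterogeneous features: one binary flag and one continuous measurement.
lls = [BernoulliLikelihood(0.8), GaussianLikelihood(0.0, 1.0)]
score = joint_log_likelihood(math.log(0.5), lls, [1, 0.3])
```

Because each feature owns its likelihood, supporting a new feature type means adding one small class rather than a whole new NB subclass, which is the maintainability argument made above.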
[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278228#comment-14278228 ] RJ Nowling commented on SPARK-4894: --- [~josephkb], after some thought, I've come around and think your idea of 1 NB class with a Factor type parameter may be the more maintainable choice as well as offering some novel functionality. But there seems to be a lot to figure out (we should be checking the decision tree implementation, for example) and I don't want to hold up what should be a relatively simple change to support Bernoulli NB. What do you think? Comments about refactoring: (1) How often is NB used with continuous values? I see that sklearn supports Gaussian NB, but is this used in practice? My understanding is that NB is generally used for text classification with counts or binary values, possibly weighted by TF-IDF. We should probably email the user and dev lists to get feedback. If no one is asking for it, we should shelve it and focus on other things. (2) After some more reflection, I can see a few more benefits to your suggestion of feature types (e.g., categorical, discrete counts, continuous, binary, etc.). If we created corresponding FeatureLikelihood types (e.g., Bernoulli, Multinomial, Gaussian, etc.), it would promote composition, which would be easier to test, debug, and maintain than multiple NB subclasses like sklearn's. Additionally, if the user can define a type for each feature, then users can mix and match likelihood types as well. Most NB implementations treat all features the same -- what if we had a model that allowed heterogeneous features? If it works well in NB, it could be extended to other parts of MLlib. (There is likely some overlap with decision trees since they support multiple feature types, so we might want to see if there is anything there we can reuse.) 
At the API level, we could provide a basic API which takes {noformat}RDD[Vector[Double]]{noformat} like the current API so that simplicity isn't compromised and provide a more advanced API for power users. > Add Bernoulli-variant of Naive Bayes > > > Key: SPARK-4894 > URL: https://issues.apache.org/jira/browse/SPARK-4894 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.2.0 >Reporter: RJ Nowling >Assignee: RJ Nowling > > MLlib only supports the multinomial-variant of Naive Bayes. The Bernoulli > version of Naive Bayes is more useful for situations where the features are > binary values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1420#comment-1420 ] RJ Nowling commented on SPARK-4894: --- Hi [~josephkb], lots to think about! In general, I'm a big fan of multiple small changes over time rather than one big change. They're easier to verify and review. Since MLlib is going through an interface refactoring to become ML anyway, we can focus on the Bernoulli NB change now and worry about a redesign of the API later. What do you have in mind for other feature and label types? I briefly reviewed Factorie -- their concept of Factors may be overcomplicated for Naive Bayes, but I want to learn more about your ideas. Do you have a few concrete examples of how Factors could be used with NB? And for continuous labels, are you thinking of something like the Gaussian NB in sklearn? From bioinformatics, I know that folks tend to encode categorical variables incorrectly. E.g., for a DNA sequence consisting of A, T, C, G, and possibly gaps, each position in a sequence should be encoded as four (five) features, one for each nucleotide. When folks try to represent each position as one feature with the bases as numbers (A=1, T=2, etc.), this results in incorrect distance metrics. E.g., ATT will differ from TTT by 1 but ATT will differ from CTT by 2. By using one feature for each of the four (five) possibilities, you get correct distances and can even weight mutations and deletions using BLOSUM matrices and such. For this type of case, I think the solution is education and documentation, not complicated type systems. > Add Bernoulli-variant of Naive Bayes > > > Key: SPARK-4894 > URL: https://issues.apache.org/jira/browse/SPARK-4894 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.2.0 >Reporter: RJ Nowling >Assignee: RJ Nowling > > MLlib only supports the multinomial-variant of Naive Bayes. 
The Bernoulli > version of Naive Bayes is more useful for situations where the features are > binary values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
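The DNA-encoding pitfall described above is easy to demonstrate. A small illustrative Python sketch (plain Python, not tied to any Spark API) comparing integer codes against one-hot encoding:

```python
# Illustrative sketch of the encoding pitfall described above: integer
# codes for nucleotides distort distances, while one-hot encoding does not.
CODES = {"A": 1, "T": 2, "C": 3, "G": 4}

def integer_encode(seq):
    # One feature per position, nucleotide as a number (the wrong way).
    return [CODES[b] for b in seq]

def one_hot_encode(seq):
    # Four features per position, one per nucleotide (the right way).
    vec = [0] * (4 * len(seq))
    for i, b in enumerate(seq):
        vec[4 * i + CODES[b] - 1] = 1
    return vec

def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

# Integer codes: ATT looks twice as far from CTT as from TTT, even though
# both pairs differ at exactly one position.
assert l1(integer_encode("ATT"), integer_encode("TTT")) == 1
assert l1(integer_encode("ATT"), integer_encode("CTT")) == 2
# One-hot codes: every single-base substitution is equidistant.
assert l1(one_hot_encode("ATT"), one_hot_encode("TTT")) == 2
assert l1(one_hot_encode("ATT"), one_hot_encode("CTT")) == 2
```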
[jira] [Comment Edited] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277631#comment-14277631 ] RJ Nowling edited comment on SPARK-4894 at 1/14/15 8:50 PM: Thanks [~lmcguire]! I'll wait until next week in case you have time to put a patch together. In the meantime, here are my thoughts for changes: 1. Add an optional `model` variable to the `NaiveBayes` object and class and `NaiveBayesModel`. It would be a string with a default value of `Multinomial`. For Bernoulli, we can use `Bernoulli`. 2. In `NaiveBayesModel.predict`, we should compute and store `brzPi + brzTheta * testData.toBreeze`. If `testData(i)` is 0, then the corresponding term of `brzTheta * testData.toBreeze` will be 0. If Bernoulli is enabled, we add `log(1 - exp(brzTheta)) * (1 - testData.toBreeze)` to account for the probabilities of the 0-valued features. (Breeze may not allow adding/subtracting scalars and vectors/matrices.) In the current model, no term is added for rows of `testData` that have 0 entries. In the Bernoulli model, we would be adding a separate term for 0-valued features. Here is the sklearn source for comparison: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/naive_bayes.py (Look at `_joint_log_likelihood` in the `MultinomialNB` and `BernoulliNB` classes.) Note that sklearn adds the neg prob to all features and subtracts it from features with 1-values. [~mengxr], [~lmcguire], [~josephkb] Any thoughts or comments on any of the above? > Add Bernoulli-variant of Naive Bayes > > > Key: SPARK-4894 > URL: https://issues.apache.org/jira/browse/SPARK-4894 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.2.0 >Reporter: RJ Nowling >Assignee: RJ Nowling > > MLlib only supports the multinomial-variant of Naive Bayes. The Bernoulli > version of Naive Bayes is more useful for situations where the features are > binary values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
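The prediction rule sketched in point 2 can be checked numerically. The following is an illustrative numpy sketch (Spark would use Breeze; the variable names only mirror `brzPi`/`brzTheta`, and nothing here is Spark code) showing that the direct Bernoulli form agrees with sklearn's rearrangement, which adds the negative probabilities for all features and then corrects the 1-valued ones:

```python
import numpy as np

# Numeric sketch of the Bernoulli prediction rule described above.
# theta holds per-class log P(feature = 1 | class); x is a binary vector.
log_prior = np.log(np.array([0.6, 0.4]))            # analogue of brzPi
theta = np.log(np.array([[0.7, 0.2, 0.9],           # analogue of brzTheta
                         [0.1, 0.8, 0.5]]))
x = np.array([1.0, 0.0, 1.0])

# Direct form: log P(x_i = 1) for 1-features, log(1 - P(x_i = 1)) for 0s.
neg_prob = np.log(1.0 - np.exp(theta))
direct = log_prior + theta @ x + neg_prob @ (1.0 - x)

# sklearn-style rearrangement: add the negative probabilities for *all*
# features, then correct the 1-valued features.
rearranged = log_prior + neg_prob.sum(axis=1) + (theta - neg_prob) @ x

assert np.allclose(direct, rearranged)
```

The two forms are algebraically identical, since `neg_prob @ (1 - x)` equals `neg_prob.sum(axis=1) - neg_prob @ x`; the rearranged form is just cheaper on sparse data because only the 1-valued features need per-row work.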
[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277631#comment-14277631 ] RJ Nowling commented on SPARK-4894: --- Thanks [~lmcguire]! I'll wait until next week in case you have time to put a patch together. In the mean time, here were my thoughts for changes: 1. Add an optional `model` variable to the `NaiveBayes` object and class and `NaiveBayesModel`. It would be a string with a default value of `Multinomial`. For Bernoulli, we can use `Bernoulli`. 2. In `NaiveBayesModel.predict`, we should compute and store `brzPi + brzTheta * testData.toBreeze`. If `testData(i)` is 0, then `brzTheta * testData.toBreeze` will be 0. If Bernoulli is enabled, we add `log(1 - exp(brzTheta)) * (1 - testData.toBreeze)` to account for the probabilities for the 0-valued features. (Breeze may not allow adding/subtracting scalars and vectors/matrices.) In the current model, no term is added for rows of `testData` that have 0 entries. In the Bernoulli model, we would be adding a separate term for 0-valued features. Here is the sklearn source for comparison: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/naive_bayes.py Note that sklearn adds the neg prob to all features and subtracts it from features with 1-values. [~mengxr], [~josephkb] Any thoughts or comments? > Add Bernoulli-variant of Naive Bayes > > > Key: SPARK-4894 > URL: https://issues.apache.org/jira/browse/SPARK-4894 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.2.0 >Reporter: RJ Nowling >Assignee: RJ Nowling > > MLlib only supports the multinomial-variant of Naive Bayes. The Bernoulli > version of Naive Bayes is more useful for situations where the features are > binary values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276380#comment-14276380 ] RJ Nowling edited comment on SPARK-4894 at 1/14/15 2:06 AM: Hi [~lmcguire], Always happy to have more help! :) I started looking through the Spark NB functions but I haven't started writing code yet. The docs for NB mention that using binary features will cause the multinomial NB to act like Bernoulli NB. I don't believe the documentation is correct, at least when smoothing is used, since P(0) != 1 - P(1). I was planning on comparing the sklearn implementation with the Spark implementation and showing that the docs were wrong. Once verified, I think the changes will be very small to add a Bernoulli mode controlled by a flag in the constructor. I won't get to this until next week, though. If you have time now and want to tackle this, I'd be happy to hand it over to you and review any patches. (I'm not a committer, though -- [~mengxr] would have to sign off.) Otherwise, if you want to wait until I have a patch and test it, that could work, too. What do you think? > Add Bernoulli-variant of Naive Bayes > > > Key: SPARK-4894 > URL: https://issues.apache.org/jira/browse/SPARK-4894 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.1.1 >Reporter: RJ Nowling > > MLlib only supports the multinomial-variant of Naive Bayes. The Bernoulli > version of Naive Bayes is more useful for situations where the features are > binary values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
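The "multinomial NB on binary features is not Bernoulli NB" point can be illustrated with a toy numeric example (plain numpy with made-up counts; this is not Spark or sklearn code). With Laplace smoothing, the multinomial score of a binary vector uses only its 1-valued features, while the Bernoulli score also charges log(1 - P(x_i = 1)) for the 0-valued ones, so the class scores differ:

```python
import numpy as np

# Hypothetical training summary: per-class count of 1s for two binary
# features, over 4 documents per class.
counts = np.array([[3.0, 1.0],
                   [1.0, 3.0]])
n_docs = np.array([4.0, 4.0])
alpha = 1.0                       # Laplace smoothing

# Multinomial parameters: P(word i | c) over the total count mass.
mult_theta = (counts + alpha) / (counts.sum(axis=1, keepdims=True) + 2 * alpha)
# Bernoulli parameters: P(feature i = 1 | c) per document.
bern_theta = (counts + alpha) / (n_docs[:, None] + 2 * alpha)

x = np.array([1.0, 0.0])
# Multinomial: 0-valued features contribute nothing at all.
mult_score = np.log(mult_theta) @ x
# Bernoulli: 0-valued features contribute log(1 - P(x_i = 1)).
bern_score = np.log(bern_theta) @ x + np.log(1.0 - bern_theta) @ (1.0 - x)
# The scores differ: the multinomial form never models P(x_i = 0), so it
# cannot equal the Bernoulli form that does.
```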
[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276380#comment-14276380 ] RJ Nowling commented on SPARK-4894: --- Hi @lmcguire, Always happy to have more help! :) I started looking through the Spark NB functions but I haven't started writing code yet. The docs for NB mention that using binary features will cause the multinomial NB to act like Bernoulli NB. I don't believe the documentation is correct, at least when smoothing is used, since P(0) != 1 - P(1). I was planning on comparing the sklearn implementation with the Spark implementation and showing that the docs were wrong. Once verified, I think the changes will be very small to add a Bernoulli mode controlled by a flag in the constructor. I won't get to this until next week, though. If you have time now and want to tackle this, I'd be happy to hand it over to you and review any patches. (I'm not a committer, though -- [~mengxr] would have to sign off.) Otherwise, if you want to wait until I have a patch and test it, that could work, too. What do you think? > Add Bernoulli-variant of Naive Bayes > > > Key: SPARK-4894 > URL: https://issues.apache.org/jira/browse/SPARK-4894 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.1.1 >Reporter: RJ Nowling > > MLlib only supports the multinomial-variant of Naive Bayes. The Bernoulli > version of Naive Bayes is more useful for situations where the features are > binary values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252860#comment-14252860 ] RJ Nowling commented on SPARK-4894: --- [~mengxr] Could you assign this to me? Thanks! > Add Bernoulli-variant of Naive Bayes > > > Key: SPARK-4894 > URL: https://issues.apache.org/jira/browse/SPARK-4894 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.1.1 >Reporter: RJ Nowling > > MLlib only supports the multinomial-variant of Naive Bayes. The Bernoulli > version of Naive Bayes is more useful for situations where the features are > binary values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
RJ Nowling created SPARK-4894: - Summary: Add Bernoulli-variant of Naive Bayes Key: SPARK-4894 URL: https://issues.apache.org/jira/browse/SPARK-4894 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.1.1 Reporter: RJ Nowling MLlib only supports the multinomial-variant of Naive Bayes. The Bernoulli version of Naive Bayes is more useful for situations where the features are binary values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4891) Add exponential, log normal, and gamma distributions to data generator to PySpark's MLlib
[ https://issues.apache.org/jira/browse/SPARK-4891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252534#comment-14252534 ] RJ Nowling commented on SPARK-4891: --- [~mengxr] Could you assign this to me? Thanks! :) > Add exponential, log normal, and gamma distributions to data generator to > PySpark's MLlib > - > > Key: SPARK-4891 > URL: https://issues.apache.org/jira/browse/SPARK-4891 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 1.1.1 >Reporter: RJ Nowling >Priority: Minor > > [SPARK-4728] adds sampling from exponential, gamma, and log normal > distributions to the Scala/Java MLlib APIs. We need to add these functions > to the PySpark MLlib API for parity. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4891) Add exponential, log normal, and gamma distributions to data generator to PySpark's MLlib
RJ Nowling created SPARK-4891: - Summary: Add exponential, log normal, and gamma distributions to data generator to PySpark's MLlib Key: SPARK-4891 URL: https://issues.apache.org/jira/browse/SPARK-4891 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 1.1.1 Reporter: RJ Nowling Priority: Minor [SPARK-4728] adds sampling from exponential, gamma, and log normal distributions to the Scala/Java MLlib APIs. We need to add these functions to the PySpark MLlib API for parity. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4728) Add exponential, log normal, and gamma distributions to data generator to MLlib
[ https://issues.apache.org/jira/browse/SPARK-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252224#comment-14252224 ] RJ Nowling commented on SPARK-4728: --- [~mengxr] can you assign this JIRA to me since I've created a PR? Thanks! > Add exponential, log normal, and gamma distributions to data generator to > MLlib > --- > > Key: SPARK-4728 > URL: https://issues.apache.org/jira/browse/SPARK-4728 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.1.0 >Reporter: RJ Nowling >Priority: Minor > > MLlib supports sampling from normal, uniform, and Poisson distributions. > I'd like to add support for sampling from exponential, gamma, and log normal > distributions, using the features of math3 like the other generators. > Please assign this to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
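For reference, the three distributions being added (exponential, gamma, log normal) can be sanity-checked against numpy's generators. Spark's implementation uses commons-math3 on the JVM; this numpy sketch is only an illustration of the expected distribution shapes and means, not Spark code:

```python
import numpy as np

# Illustrative numpy counterparts of the three generators the JIRA adds.
rng = np.random.default_rng(123)

exp_samples = rng.exponential(scale=2.0, size=10_000)         # mean = scale
gamma_samples = rng.gamma(shape=3.0, scale=2.0, size=10_000)  # mean = shape * scale
lognorm_samples = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # mean = exp(sigma^2 / 2)

# All three distributions are supported on (0, inf).
assert (exp_samples > 0).all()
assert (gamma_samples > 0).all()
assert (lognorm_samples > 0).all()
```

Checks like these (sample mean against the closed-form mean) are also a reasonable basis for the PySpark parity tests.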
[jira] [Issue Comment Deleted] (SPARK-4728) Add exponential, log normal, and gamma distributions to data generator to MLlib
[ https://issues.apache.org/jira/browse/SPARK-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] RJ Nowling updated SPARK-4728: -- Comment: was deleted (was: I posted a PR for this issue: https://github.com/apache/spark/pull/3680) > Add exponential, log normal, and gamma distributions to data generator to > MLlib > --- > > Key: SPARK-4728 > URL: https://issues.apache.org/jira/browse/SPARK-4728 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.1.0 >Reporter: RJ Nowling >Priority: Minor > > MLlib supports sampling from normal, uniform, and Poisson distributions. > I'd like to add support for sampling from exponential, gamma, and log normal > distributions, using the features of math3 like the other generators. > Please assign this to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4728) Add exponential, log normal, and gamma distributions to data generator to MLlib
[ https://issues.apache.org/jira/browse/SPARK-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242924#comment-14242924 ] RJ Nowling commented on SPARK-4728: --- I posted a PR for this issue: https://github.com/apache/spark/pull/3680 > Add exponential, log normal, and gamma distributions to data generator to > MLlib > --- > > Key: SPARK-4728 > URL: https://issues.apache.org/jira/browse/SPARK-4728 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.1.0 >Reporter: RJ Nowling >Priority: Minor > > MLlib supports sampling from normal, uniform, and Poisson distributions. > I'd like to add support for sampling from exponential, gamma, and log normal > distributions, using the features of math3 like the other generators. > Please assign this to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4727) Add "dimensional" RDDs (time series, spatial)
[ https://issues.apache.org/jira/browse/SPARK-4727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234399#comment-14234399 ] RJ Nowling commented on SPARK-4727: --- Thanks, Jeremy! Your work may cover my needs, and if not, it seems like a great place to contribute to! Was there some talk about encouraging people to build Spark libraries and putting together a community list? I'd love to see this sort of work advertised more. > Add "dimensional" RDDs (time series, spatial) > - > > Key: SPARK-4727 > URL: https://issues.apache.org/jira/browse/SPARK-4727 > Project: Spark > Issue Type: Brainstorming > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: RJ Nowling > > Certain types of data (time series, spatial) can benefit from specialized > RDDs. I'd like to open a discussion about this. > For example, time series data should be ordered by time and would benefit > from operations like: > * Subsampling (taking every n data points) > * Signal processing (correlations, FFTs, filtering) > * Windowing functions > Spatial data benefits from ordering and partitioning along a 2D or 3D grid. > For example, path finding algorithms can be optimized by only comparing points > within a set distance, which can be computed more efficiently by partitioning > data into a grid. > Although the operations on time series and spatial data may be different, > there is some commonality in the sense of the data having ordered dimensions > and the implementations may overlap. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4729) Add time series subsampling to MLlib
RJ Nowling created SPARK-4729: - Summary: Add time series subsampling to MLlib Key: SPARK-4729 URL: https://issues.apache.org/jira/browse/SPARK-4729 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.1.0 Reporter: RJ Nowling Priority: Minor MLlib supports several time series functions. The ability to subsample a time series (take every n data points) is missing. I'd like to add it, so please assign this to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
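The subsampling operation itself is tiny. A sketch in plain Python of the "every n-th point" semantics (on an RDD this could be expressed with zipWithIndex plus a filter; the helper name here is made up for illustration):

```python
# Minimal sketch of "take every n-th point", shown on a plain list.
def subsample(points, n):
    # RDD analogue: rdd.zipWithIndex().filter(lambda kv: kv[1] % n == 0).keys()
    return [p for i, p in enumerate(points) if i % n == 0]

series = list(range(10))
assert subsample(series, 3) == [0, 3, 6, 9]
```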
[jira] [Created] (SPARK-4728) Add exponential, log normal, and gamma distributions to data generator to MLlib
RJ Nowling created SPARK-4728: - Summary: Add exponential, log normal, and gamma distributions to data generator to MLlib Key: SPARK-4728 URL: https://issues.apache.org/jira/browse/SPARK-4728 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.1.0 Reporter: RJ Nowling Priority: Minor MLlib supports sampling from normal, uniform, and Poisson distributions. I'd like to add support for sampling from exponential, gamma, and log normal distributions, using the features of math3 like the other generators. Please assign this to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4727) Add "dimensional" RDDs (time series, spatial)
RJ Nowling created SPARK-4727: - Summary: Add "dimensional" RDDs (time series, spatial) Key: SPARK-4727 URL: https://issues.apache.org/jira/browse/SPARK-4727 Project: Spark Issue Type: Brainstorming Components: Spark Core Affects Versions: 1.1.0 Reporter: RJ Nowling Certain types of data (time series, spatial) can benefit from specialized RDDs. I'd like to open a discussion about this. For example, time series data should be ordered by time and would benefit from operations like: * Subsampling (taking every n data points) * Signal processing (correlations, FFTs, filtering) * Windowing functions Spatial data benefits from ordering and partitioning along a 2D or 3D grid. For example, path finding algorithms can be optimized by only comparing points within a set distance, which can be computed more efficiently by partitioning data into a grid. Although the operations on time series and spatial data may be different, there is some commonality in the sense of the data having ordered dimensions and the implementations may overlap. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
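The grid-partitioning idea for spatial data can be made concrete. An illustrative local Python sketch (not an RDD; all names hypothetical) that buckets 2D points into cells of side r and compares each point only against its 3x3 cell neighborhood rather than against every other point:

```python
from collections import defaultdict
import math

# Sketch of grid partitioning for neighbor search: points within distance
# r of each other must lie in the same or an adjacent grid cell.
def grid_neighbors(points, r):
    cells = defaultdict(list)
    for p in points:
        cells[(int(p[0] // r), int(p[1] // r))].append(p)
    pairs = []
    for (cx, cy), members in cells.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for q in cells.get((cx + dx, cy + dy), []):
                    for p in members:
                        # p < q dedupes each pair across cell visits.
                        if p < q and math.dist(p, q) <= r:
                            pairs.append((p, q))
    return pairs

pts = [(0.0, 0.0), (0.5, 0.5), (5.0, 5.0)]
near = grid_neighbors(pts, 1.0)   # only the two nearby points pair up
```

In a distributed setting the same idea becomes the partitioner: hash points by cell so that candidate neighbors land in the same (or an adjacent) partition.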
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213945#comment-14213945 ] RJ Nowling commented on SPARK-2429: --- Hi Yu, I'm having trouble finding the function to cut a dendrogram -- I see the tests but not the implementation. I feel that you should be able to assign values in O(log N) time with the hierarchical method vs O(N) with the standard kmeans. So, say you train a model (this may be slower than kmeans), then assign additional points to clusters after training. If clusters at the same levels in the hierarchy do not overlap, you should be able to choose the closest cluster at each level until you find a leaf. I'm assuming that the children of a given cluster are contained within that cluster (spatially) -- can you show this or find a reference for this? If so, then assignment should be faster for a larger number of clusters, as Jun was saying above. Do you agree with this? Or is there something I am misunderstanding? Thanks! > Hierarchical Implementation of KMeans > - > > Key: SPARK-2429 > URL: https://issues.apache.org/jira/browse/SPARK-2429 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Assignee: Yu Ishikawa >Priority: Minor > Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The > Result of Benchmarking a Hierarchical Clustering.pdf, > benchmark-result.2014-10-29.html, benchmark2.html > > > Hierarchical clustering algorithms are widely used and would make a nice > addition to MLlib. Clustering algorithms are useful for determining > relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches: > * Top down, recursive application of KMeans > * Reuse DecisionTree implementation with different objective function > * Hierarchical SVD > It was also suggested that support for distance metrics other than Euclidean > such as negative dot or cosine are necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4158) Spark throws exception when Mesos resources are missing
[ https://issues.apache.org/jira/browse/SPARK-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192448#comment-14192448 ] RJ Nowling commented on SPARK-4158: --- I verified that the associated patch fixes this issue on our local cluster running Spark 1.1.0 and Mesos 0.21. > Spark throws exception when Mesos resources are missing > --- > > Key: SPARK-4158 > URL: https://issues.apache.org/jira/browse/SPARK-4158 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.1.0 >Reporter: Brenden Matthews > > Spark throws an exception when trying to check resources which haven't been > offered by Mesos. This is an error in Spark, and should be corrected as > such. Here's a sample: > {code} > val data Exception in thread "Thread-41" java.lang.IllegalArgumentException: > No resource called cpus in [name: "mem" > type: SCALAR > scalar { > value: 2067.0 > } > role: "*" > , name: "disk" > type: SCALAR > scalar { > value: 900.0 > } > role: "*" > , name: "ports" > type: RANGES > ranges { > range { > begin: 31000 > end: 32000 > } > } > role: "*" > ] > at > org.apache.spark.scheduler.cluster.mesos.CoarseMesosSchedulerBackend.org$apache$spark$scheduler$cluster$mesos$CoarseMesosSchedulerBackend$$getResource(CoarseMesosSchedulerBackend.scala:236) > at > org.apache.spark.scheduler.cluster.mesos.CoarseMesosSchedulerBackend$$anonfun$resourceOffers$1.apply(CoarseMesosSchedulerBackend.scala:200) > at > org.apache.spark.scheduler.cluster.mesos.CoarseMesosSchedulerBackend$$anonfun$resourceOffers$1.apply(CoarseMesosSchedulerBackend.scala:197) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > 
org.apache.spark.scheduler.cluster.mesos.CoarseMesosSchedulerBackend.resourceOffers(CoarseMesosSchedulerBackend.scala:197) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
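For illustration, the fix amounts to treating a resource name that is absent from a Mesos offer as zero instead of throwing. A Python analogue of the Scala `getResource` lookup (the dict structure below is a made-up stand-in for the Mesos protobuf resource list in the stack trace above):

```python
# Sketch of the defensive lookup: a resource missing from an offer means
# "nothing offered", not an error. Not Spark code; an illustrative analogue.
def get_resource(offer_resources, name):
    for res in offer_resources:
        if res["name"] == name:
            return res["scalar"]
    return 0.0  # e.g. the offer above has mem and disk but no cpus

offer = [{"name": "mem", "scalar": 2067.0}, {"name": "disk", "scalar": 900.0}]
assert get_resource(offer, "cpus") == 0.0   # no IllegalArgumentException analogue
assert get_resource(offer, "mem") == 2067.0
```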
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188325#comment-14188325 ] RJ Nowling commented on SPARK-2429: --- The sparsity tests look good. Have you compared training and assignment time to KMeans yet? An improvement in the assignment time will be important. Also, I don't see a breakdown of the total time by splitting clusters, assignments, etc. It doesn't need to be for every combination of parameters, just one or two. That would be very helpful. Thanks! > Hierarchical Implementation of KMeans > - > > Key: SPARK-2429 > URL: https://issues.apache.org/jira/browse/SPARK-2429 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Assignee: Yu Ishikawa >Priority: Minor > Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The > Result of Benchmarking a Hierarchical Clustering.pdf, > benchmark-result.2014-10-29.html, benchmark2.html > > > Hierarchical clustering algorithms are widely used and would make a nice > addition to MLlib. Clustering algorithms are useful for determining > relationships between clusters as well as offering faster assignment. > Discussion on the dev list suggested the following possible approaches: > * Top down, recursive application of KMeans > * Reuse DecisionTree implementation with different objective function > * Hierarchical SVD > It was also suggested that support for distance metrics other than Euclidean > such as negative dot or cosine are necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181183#comment-14181183 ] RJ Nowling commented on SPARK-2429: --- I added a couple comments to the PR. I would say stick with Euclidean distance for now. For assignment, you should be able to do a binary search. E.g., if a center has children, which of the two children is it closer to? Choose that center. Repeat until you hit a leaf (cluster with no children). I saw that you added logging for timing but can you update your report with the timing breakdown for each stage? Thanks! > Hierarchical Implementation of KMeans > - > > Key: SPARK-2429 > URL: https://issues.apache.org/jira/browse/SPARK-2429 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Assignee: Yu Ishikawa >Priority: Minor > Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The > Result of Benchmarking a Hierarchical Clustering.pdf, benchmark2.html > > > Hierarchical clustering algorithms are widely used and would make a nice > addition to MLlib. Clustering algorithms are useful for determining > relationships between clusters as well as offering faster assignment. > Discussion on the dev list suggested the following possible approaches: > * Top down, recursive application of KMeans > * Reuse DecisionTree implementation with different objective function > * Hierarchical SVD > It was also suggested that support for distance metrics other than Euclidean > such as negative dot or cosine are necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
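The binary-search assignment described above can be sketched in a few lines. This is an illustrative Python sketch, not the MLlib implementation; the `Node` structure (a center plus up to two children) is a hypothetical stand-in for whatever tree representation the PR uses:

```python
class Node:
    """A node in the cluster tree: a center plus up to two children.
    (Hypothetical structure for illustration, not MLlib's.)"""
    def __init__(self, center, children=()):
        self.center = center
        self.children = list(children)

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(point, root):
    """Descend the tree: at each node, move to the closer child.
    Stops at a leaf, so assignment costs O(log k) comparisons
    instead of the O(k) scan over all centers in flat KMeans."""
    node = root
    while node.children:
        node = min(node.children, key=lambda c: squared_distance(point, c.center))
    return node

# Example: 4 leaf clusters under two intermediate nodes.
leaves = [Node((0.0,)), Node((1.0,)), Node((10.0,)), Node((11.0,))]
root = Node((5.5,), [Node((0.5,), leaves[:2]), Node((10.5,), leaves[2:])])
assert assign((0.9,), root).center == (1.0,)
assert assign((10.2,), root).center == (10.0,)
```

With a balanced binary tree over k leaves, each assignment touches two candidate centers per level, which is where the O(log k) vs O(k) advantage discussed in this thread comes from.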
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179890#comment-14179890 ] RJ Nowling commented on SPARK-2429: --- A 6x performance improvement is a great improvement! Can you add a breakdown of the timings for each part of the algorithm? (e.g., like you did to find out which parts were slowest?) You don't need to do a sweep over multiple data sizes or number of data points -- just pick a representative number of data points and rows. Have you compared the performance of the hierarchical KMeans vs the KMeans implemented in MLlib? I expect that the hierarchical version will be slower to cluster but the assignment should be faster (O(log k) vs O(k)). This improvement in assignment speed is the motivation for including the hierarchical KMeans in Spark. Thanks! > Hierarchical Implementation of KMeans > - > > Key: SPARK-2429 > URL: https://issues.apache.org/jira/browse/SPARK-2429 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Assignee: Yu Ishikawa >Priority: Minor > Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The > Result of Benchmarking a Hierarchical Clustering.pdf, benchmark2.html > > > Hierarchical clustering algorithms are widely used and would make a nice > addition to MLlib. Clustering algorithms are useful for determining > relationships between clusters as well as offering faster assignment. > Discussion on the dev list suggested the following possible approaches: > * Top down, recursive application of KMeans > * Reuse DecisionTree implementation with different objective function > * Hierarchical SVD > It was also suggested that support for distance metrics other than Euclidean > such as negative dot or cosine are necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4040) calling count() on RDD's emitted from a DStream blocks forEachRDD progress.
[ https://issues.apache.org/jira/browse/SPARK-4040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179882#comment-14179882 ] RJ Nowling commented on SPARK-4040: --- I don't think you can access an RDD from within an operation performed on an RDD. Your code example may even be trying to serialize the RDD along with the operation, which may not be possible. You would want to call {{count()}} outside the operation and pass it in through the closure or a broadcast. > calling count() on RDD's emitted from a DStream blocks forEachRDD progress. > --- > > Key: SPARK-4040 > URL: https://issues.apache.org/jira/browse/SPARK-4040 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: jay vyas > > Please note that I'm somewhat new to Spark Streaming's API, and am not a Spark > expert - so I've done the best to write up and reproduce this "bug". If it's > not a bug, I hope an expert will help to explain why and promptly close it. > However, it appears it could be a bug after discussing with [~rnowling], who > is a Spark contributor. > CC [~rnowling] [~willbenton] > > It appears that in a DStream context, a call to {{MappedRDD.count()}} > blocks progress and prevents emission of RDDs from a stream. > {noformat} > tweetStream.foreachRDD((rdd,lent)=> { > tweetStream.repartition(1) > //val count = rdd.count() DONT DO THIS ! > checks += 1; > if (checks > 20) { > ssc.stop() > } >} > {noformat} > The above code block should inevitably halt, after 20 intervals of RDDs... > However, if we *uncomment the call* to {{rdd.count()}}, it turns out that we > get an *infinite stream which emits no RDDs*, and thus our program *runs > forever* (ssc.stop is unreachable), because *forEach doesn't receive any more > entries*. > I suspect this is actually because the foreach block never completes, because > {{count()}} winds up calling {{compute}}, which ultimately just reads from > the stream. 
> I haven't put together a minimal reproducer or unit test yet, but I can work > on doing so if more info is needed. > I guess this could be seen as an application bug - but I think Spark might be > made smarter to throw its hands up when people execute blocking code in a > stream processor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172486#comment-14172486 ] RJ Nowling commented on SPARK-2429: --- Great to know! I'm glad that isn't a bottleneck. Have you been able to benchmark each of the major steps? Which steps are most expensive? > Hierarchical Implementation of KMeans > - > > Key: SPARK-2429 > URL: https://issues.apache.org/jira/browse/SPARK-2429 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Assignee: Yu Ishikawa >Priority: Minor > Attachments: The Result of Benchmarking a Hierarchical Clustering.pdf > > > Hierarchical clustering algorithms are widely used and would make a nice > addition to MLlib. Clustering algorithms are useful for determining > relationships between clusters as well as offering faster assignment. > Discussion on the dev list suggested the following possible approaches: > * Top down, recursive application of KMeans > * Reuse DecisionTree implementation with different objective function > * Hierarchical SVD > It was also suggested that support for distance metrics other than Euclidean > such as negative dot or cosine are necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165480#comment-14165480 ] RJ Nowling commented on SPARK-2429: --- Great work, Yu! Ok, first off, let me make sure I understand what you're doing. You start with 2 centers. You assign all the points. You then apply KMeans recursively to each cluster, splitting each center into 2 centers. Each instance of KMeans stops when the error is below a certain value or a fixed number of iterations have been run. I think your analysis of the overall run time is good and probably what we expect. Can you break down the timing to see which parts are the most expensive? Maybe we can figure out where to optimize it. A few thoughts on optimization: 1. It might be good to convert everything to Breeze vectors before you do any operations -- otherwise you convert the same vectors over and over again. KMeans converts them at the beginning and converts the vectors for the centers back at the end. 2. Instead of passing the centers as part of the EuclideanClosestCenterFinder, look into using a broadcast variable. See the latest KMeans implementation. This could improve performance by 10%+. 3. You may want to look into using reduceByKey or similar RDD operations -- they will enable parallel reductions which will be faster than a loop on the master. If you look at the JIRAs and PRs, there is some recent work to speed up KMeans -- maybe some of that is applicable? 
I'll probably have more questions -- it's a good way of helping me understand what you're doing :) > Hierarchical Implementation of KMeans > - > > Key: SPARK-2429 > URL: https://issues.apache.org/jira/browse/SPARK-2429 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Assignee: Yu Ishikawa >Priority: Minor > Attachments: The Result of Benchmarking a Hierarchical Clustering.pdf > > > Hierarchical clustering algorithms are widely used and would make a nice > addition to MLlib. Clustering algorithms are useful for determining > relationships between clusters as well as offering faster assignment. > Discussion on the dev list suggested the following possible approaches: > * Top down, recursive application of KMeans > * Reuse DecisionTree implementation with different objective function > * Hierarchical SVD > It was also suggested that support for distance metrics other than Euclidean > such as negative dot or cosine are necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
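The reduceByKey suggestion above (point 3) amounts to building per-cluster partial aggregates -- a vector sum and a count per cluster id -- and merging them with an associative combine, rather than looping over points on the driver. A rough single-machine sketch of the idea, with plain Python standing in for the RDD operation (names are illustrative, not Spark's):

```python
def reduce_by_key(pairs, combine):
    """Mimic RDD.reduceByKey: merge all values that share a key.
    Because `combine` is associative, Spark can apply it in parallel
    within and across partitions instead of serially on the master."""
    acc = {}
    for key, value in pairs:
        acc[key] = combine(acc[key], value) if key in acc else value
    return acc

# Each point contributes (cluster_id, (vector_sum, count)).
assignments = [(0, ((1.0, 2.0), 1)), (0, ((3.0, 4.0), 1)), (1, ((5.0, 0.0), 1))]
merged = reduce_by_key(
    assignments,
    lambda a, b: (tuple(x + y for x, y in zip(a[0], b[0])), a[1] + b[1]),
)
# New centers are the per-cluster means.
centers = {k: tuple(x / n for x in s) for k, (s, n) in merged.items()}
assert centers == {0: (2.0, 3.0), 1: (5.0, 0.0)}
```

The key property is that the combine function only ever sees two partial aggregates, so the reduction tree can be evaluated on the executors and only k small results reach the driver.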
[jira] [Commented] (SPARK-3785) Support off-loading computations to a GPU
[ https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14163712#comment-14163712 ] RJ Nowling commented on SPARK-3785: --- Part of my graduate work involved implementing physics simulations on GPUs and managing multi-user GPU clusters. From a performance perspective, we saw 100x+ speed ups on a single machine with a GPU vs multiple cores using specialized GPU implementations such as OpenMM or Gromacs. But this was using hand-optimized GPU implementations that were pipelined to prevent unnecessary host/GPU copies and do as much work as possible on the GPU. For clusters, we'd get a 2-5x speed up due to communication overhead between host/GPU and other nodes. In these cases, you could only run a few iterations on the GPU before you had to communicate with other nodes. Thus, GPUs are great if you're doing computation that will run using hand-optimized GPU implementations for long periods of time before communicating outside the GPU. But I think you won't get much of a performance improvement using simple operations (like RDD operations) without explicit (and challenging) pipeline optimization work. I think the most practical case for Spark/GPU integration is jobs involving large chunks of image processing, rendering, linear algebra, etc. work that can be done independently in each task. For example, Naive Bayes where the number of features is large enough to fit on the GPUs in a single node but there are many, many samples to classify. In this case, you may be able to use a GPU linear algebra library to do the GPU operations and move data asynchronously and in large chunks to reduce performance issues. Further, GPU scheduling is immature. Very little isolation, GPUs often get into bad states that require machine reboots, and no OS support, so scheduling is mostly done by each application. It's like MacOS 9 -- you have to hope each process is a responsible citizen. 
I think that would end up being a huge distraction for Spark's developers. I think [~srowen]'s point about calling GPU libraries from your Spark driver is probably the most practical solution. > Support off-loading computations to a GPU > - > > Key: SPARK-3785 > URL: https://issues.apache.org/jira/browse/SPARK-3785 > Project: Spark > Issue Type: Brainstorming > Components: MLlib >Reporter: Thomas Darimont >Priority: Minor > > Are there any plans to adding support for off-loading computations to the > GPU, e.g. via an open-cl binding? > http://www.jocl.org/ > https://code.google.com/p/javacl/ > http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3614) Filter on minimum occurrences of a term in IDF
[ https://issues.apache.org/jira/browse/SPARK-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143898#comment-14143898 ] RJ Nowling commented on SPARK-3614: --- It could lead to over-fitting and thus mis-predictions. In such cases, it may be valuable to exclude overly-specific terms. > Filter on minimum occurrences of a term in IDF > --- > > Key: SPARK-3614 > URL: https://issues.apache.org/jira/browse/SPARK-3614 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Jatinpreet Singh >Assignee: RJ Nowling >Priority: Minor > Labels: TFIDF > > The IDF class in MLlib does not provide the capability of defining a minimum > number of documents a term should appear in the corpus. The idea is to have a > cutoff variable which defines this minimum occurrence value, and the terms > which have lower frequency are ignored. > Mathematically, > IDF(t,D)=log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >=minimumOccurance > where, > D is the total number of documents in the corpus > DF(t,D) is the number of documents that contain the term t > minimumOccurance is the minimum number of documents the term appears in the > document corpus > This would have an impact on accuracy as terms that appear in less than a > certain limit of documents, have low or no importance in TFIDF vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
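The cutoff quoted in the issue description is simple to express directly. Here is a small Python sketch of IDF(t,D)=log( (|D|+1)/(DF(t,D)+1) ) with the minimumOccurance filter; terms below the cutoff get an IDF of 0, which zeroes them out of the TFIDF vectors (function and parameter names are mine, not MLlib's):

```python
import math

def idf(num_docs, doc_freq, minimum_occurrence=1):
    """IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)),
    zeroed out when the term appears in fewer than
    minimum_occurrence documents (the proposed cutoff)."""
    if doc_freq < minimum_occurrence:
        return 0.0
    return math.log((num_docs + 1) / (doc_freq + 1))

# A term appearing in only 1 of 100 documents is dropped with a cutoff of 2,
# which is exactly the over-fitting guard discussed in this comment:
assert idf(100, 1, minimum_occurrence=2) == 0.0
# A frequent term keeps its (smaller) weight:
assert idf(100, 50) == math.log(101 / 51)
```

Setting the cutoff to 1 recovers the existing behavior, so the feature is backwards compatible by default.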
[jira] [Comment Edited] (SPARK-3614) Filter on minimum occurrences of a term in IDF
[ https://issues.apache.org/jira/browse/SPARK-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142860#comment-14142860 ] RJ Nowling edited comment on SPARK-3614 at 9/22/14 5:52 PM: Thanks, Andrew! I'll do that. was (Author: rnowling): Thanks, Andrew! I'll do that. -- em rnowl...@gmail.com c 954.496.2314 > Filter on minimum occurrences of a term in IDF > --- > > Key: SPARK-3614 > URL: https://issues.apache.org/jira/browse/SPARK-3614 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Jatinpreet Singh >Assignee: RJ Nowling >Priority: Minor > Labels: TFIDF > > The IDF class in MLlib does not provide the capability of defining a minimum > number of documents a term should appear in the corpus. The idea is to have a > cutoff variable which defines this minimum occurrence value, and the terms > which have lower frequency are ignored. > Mathematically, > IDF(t,D)=log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >=minimumOccurance > where, > D is the total number of documents in the corpus > DF(t,D) is the number of documents that contain the term t > minimumOccurance is the minimum number of documents the term appears in the > document corpus > This would have an impact on accuracy as terms that appear in less than a > certain limit of documents, have low or no importance in TFIDF vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3614) Filter on minimum occurrences of a term in IDF
[ https://issues.apache.org/jira/browse/SPARK-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142860#comment-14142860 ] RJ Nowling commented on SPARK-3614: --- Thanks, Andrew! I'll do that. > Filter on minimum occurrences of a term in IDF > --- > > Key: SPARK-3614 > URL: https://issues.apache.org/jira/browse/SPARK-3614 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Jatinpreet Singh >Assignee: RJ Nowling >Priority: Minor > Labels: TFIDF > > The IDF class in MLlib does not provide the capability of defining a minimum > number of documents a term should appear in the corpus. The idea is to have a > cutoff variable which defines this minimum occurrence value, and the terms > which have lower frequency are ignored. > Mathematically, > IDF(t,D)=log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >=minimumOccurance > where, > D is the total number of documents in the corpus > DF(t,D) is the number of documents that contain the term t > minimumOccurance is the minimum number of documents the term appears in the > document corpus > This would have an impact on accuracy as terms that appear in less than a > certain limit of documents, have low or no importance in TFIDF vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3614) Filter on minimum occurrences of a term in IDF
[ https://issues.apache.org/jira/browse/SPARK-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142631#comment-14142631 ] RJ Nowling commented on SPARK-3614: --- I would like to work on this. > Filter on minimum occurrences of a term in IDF > --- > > Key: SPARK-3614 > URL: https://issues.apache.org/jira/browse/SPARK-3614 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Jatinpreet Singh >Priority: Minor > Labels: TFIDF > > The IDF class in MLlib does not provide the capability of defining a minimum > number of documents a term should appear in the corpus. The idea is to have a > cutoff variable which defines this minimum occurrence value, and the terms > which have lower frequency are ignored. > Mathematically, > IDF(t,D)=log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >=minimumOccurance > where, > D is the total number of documents in the corpus > DF(t,D) is the number of documents that contain the term t > minimumOccurance is the minimum number of documents the term appears in the > document corpus > This would have an impact on accuracy as terms that appear in less than a > certain limit of documents, have low or no importance in TFIDF vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136671#comment-14136671 ] RJ Nowling commented on SPARK-2429: --- Great! I look forward to seeing your implementation. :) > Hierarchical Implementation of KMeans > - > > Key: SPARK-2429 > URL: https://issues.apache.org/jira/browse/SPARK-2429 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Priority: Minor > > Hierarchical clustering algorithms are widely used and would make a nice > addition to MLlib. Clustering algorithms are useful for determining > relationships between clusters as well as offering faster assignment. > Discussion on the dev list suggested the following possible approaches: > * Top down, recursive application of KMeans > * Reuse DecisionTree implementation with different objective function > * Hierarchical SVD > It was also suggested that support for distance metrics other than Euclidean > such as negative dot or cosine are necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134301#comment-14134301 ] RJ Nowling commented on SPARK-2308: --- I'm not a committer but [~mengxr] is. That said, I'm very happy to help in any way I can. The issue of different distance metrics has come up on the mailing list -- a much-requested feature. If you provide it as a PR, maybe others who are more familiar with the work to add additional distance metrics can comment and we, as a community, can move forward to get it included. > Add KMeans MiniBatch clustering algorithm to MLlib > -- > > Key: SPARK-2308 > URL: https://issues.apache.org/jira/browse/SPARK-2308 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Assignee: RJ Nowling >Priority: Minor > Attachments: many_small_centers.pdf, uneven_centers.pdf > > > Mini-batch is a version of KMeans that uses a randomly-sampled subset of the > data points in each iteration instead of the full set of data points, > improving performance (and in some cases, accuracy). The mini-batch version > is compatible with the KMeans|| initialization algorithm currently > implemented in MLlib. > I suggest adding KMeans Mini-batch as an alternative. > I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134264#comment-14134264 ] RJ Nowling commented on SPARK-2308: --- It is true that we will save on the distance calculations for high dimensional data sets. There is also work under way to improve sampling in Spark, so this will also benefit further from that. Are you planning on creating a PR for your implementation? It would be valuable for the community. I closed mine due to the sampling issues. But I'd be happy to review and test yours. > Add KMeans MiniBatch clustering algorithm to MLlib > -- > > Key: SPARK-2308 > URL: https://issues.apache.org/jira/browse/SPARK-2308 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Assignee: RJ Nowling >Priority: Minor > Attachments: many_small_centers.pdf, uneven_centers.pdf > > > Mini-batch is a version of KMeans that uses a randomly-sampled subset of the > data points in each iteration instead of the full set of data points, > improving performance (and in some cases, accuracy). The mini-batch version > is compatible with the KMeans|| initialization algorithm currently > implemented in MLlib. > I suggest adding KMeans Mini-batch as an alternative. > I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3250) More Efficient Sampling
[ https://issues.apache.org/jira/browse/SPARK-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130880#comment-14130880 ] RJ Nowling commented on SPARK-3250: --- Great work! If these performance improvements hold up when implemented in Spark, this could offer minibatch methods a fighting chance. In particular, we would only need to count the elements once, and then we'd have faster sampling for the subsequent iterations, especially if the underlying data structures can be coerced into Arrays when we do the copy. > More Efficient Sampling > --- > > Key: SPARK-3250 > URL: https://issues.apache.org/jira/browse/SPARK-3250 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: RJ Nowling > > Sampling, as currently implemented in Spark, is an O\(n\) operation. A > number of stochastic algorithms achieve speed ups by exploiting O\(k\) > sampling, where k is the number of data points to sample. Examples of such > algorithms include KMeans MiniBatch (SPARK-2308) and Stochastic Gradient > Descent with mini batching. > More efficient sampling may be achievable by packing partitions with an > ArrayBuffer or other data structure supporting random access. Since many of > these stochastic algorithms perform repeated rounds of sampling, it may be > feasible to perform a transformation to change the backing data structure > followed by multiple rounds of sampling. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
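The O(n) vs O(k) distinction in the issue can be made concrete: Bernoulli sampling must flip a coin for every element, while index-based sampling over a random-access structure touches only the k elements it draws. A minimal Python illustration of the two strategies (function names are mine; `index_sample` draws with replacement, as mini-batch methods typically tolerate):

```python
import random

def bernoulli_sample(data, fraction, rng):
    """O(n): one coin flip per element, analogous to RDD.sample."""
    return [x for x in data if rng.random() < fraction]

def index_sample(data, k, rng):
    """O(k): draw k random indices directly; requires the backing
    structure (e.g. an Array) to support random access."""
    return [data[rng.randrange(len(data))] for _ in range(k)]

rng = random.Random(123)
data = list(range(1_000_000))
mini_batch = index_sample(data, 1000, rng)   # visits 1,000 elements, not 1,000,000
assert len(mini_batch) == 1000
assert all(0 <= x < len(data) for x in mini_batch)
```

This is why repeated rounds of sampling amortize the one-time cost of copying a partition into an Array: every subsequent mini-batch draw becomes O(k).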
[jira] [Commented] (SPARK-2966) Add an approximation algorithm for hierarchical clustering to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14123077#comment-14123077 ] RJ Nowling commented on SPARK-2966: --- Wonderful! If I can help or when you're ready for reviews, let me know! > Add an approximation algorithm for hierarchical clustering to MLlib > --- > > Key: SPARK-2966 > URL: https://issues.apache.org/jira/browse/SPARK-2966 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Yu Ishikawa >Priority: Minor > > A hierarchical clustering algorithm is a useful unsupervised learning method. > Koga et al. proposed a highly scalable hierarchical clustering algorithm in > (1). > I would like to implement this method. > I suggest adding an approximate hierarchical clustering algorithm to MLlib. > I'd like this to be assigned to me. > h3. Reference > # Fast agglomerative hierarchical clustering algorithm using > Locality-Sensitive Hashing > http://dl.acm.org/citation.cfm?id=1266811 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2430) Standarized Clustering Algorithm API and Framework
[ https://issues.apache.org/jira/browse/SPARK-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122273#comment-14122273 ] RJ Nowling commented on SPARK-2430: --- Hi Yu, The community had suggested looking into scikit-learn's API so that is a good idea. I am hesitant to make backwards-incompatible API changes, however, until we know the new API will be stable for a long time. I think it would be best to implement a few more clustering algorithms to get a clear idea of what is similar vs different before making a new API. May I suggest you work on SPARK-2966 / SPARK-2429 first? RJ > Standarized Clustering Algorithm API and Framework > -- > > Key: SPARK-2430 > URL: https://issues.apache.org/jira/browse/SPARK-2430 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Priority: Minor > > Recently, there has been a chorus of voices on the mailing lists about adding > new clustering algorithms to MLlib. To support these additions, we should > develop a common framework and API to reduce code duplication and keep the > APIs consistent. > At the same time, we can also expand the current API to incorporate requested > features such as arbitrary distance metrics or pre-computed distance matrices. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2966) Add an approximation algorithm for hierarchical clustering to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122266#comment-14122266 ] RJ Nowling commented on SPARK-2966: --- No worries. Based on my reading of the Spark contribution guidelines ( https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark ), I think that the Spark community would prefer to have one good implementation of an algorithm instead of multiple similar algorithms. Since the community has stated a clear preference for divisive hierarchical clustering, I think that is a better aim. You seem very motivated and have made some good contributions -- would you like to take the lead on the hierarchical clustering? I can review your code to help you improve it. That said, I suggest you look at the comment I added to SPARK-2429 and see what you think of that approach. If you like the example code and papers, why don't you work on implementing it efficiently in Spark? > Add an approximation algorithm for hierarchical clustering to MLlib > --- > > Key: SPARK-2966 > URL: https://issues.apache.org/jira/browse/SPARK-2966 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Yu Ishikawa >Priority: Minor > > A hierarchical clustering algorithm is a useful unsupervised learning method. > Koga et al. proposed a highly scalable hierarchical clustering algorithm in > (1). > I would like to implement this method. > I suggest adding an approximate hierarchical clustering algorithm to MLlib. > I'd like this to be assigned to me. > h3. Reference > # Fast agglomerative hierarchical clustering algorithm using > Locality-Sensitive Hashing > http://dl.acm.org/citation.cfm?id=1266811 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3384) Potential thread unsafe Breeze vector addition in KMeans
[ https://issues.apache.org/jira/browse/SPARK-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14120294#comment-14120294 ] RJ Nowling commented on SPARK-3384: --- Xiangrui Meng I'll try to get a code example together in the next couple days. Even if Spark itself is thread safe, I would re-iterate that it is easy to make the mistake of using += in the wrong place. I suggest that we should frown upon that behavior, document it when we use it, and maybe even add checks for the presence of += with Breeze vectors in the tests so we can flag it. > Potential thread unsafe Breeze vector addition in KMeans > > > Key: SPARK-3384 > URL: https://issues.apache.org/jira/browse/SPARK-3384 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: RJ Nowling > > In the KMeans clustering implementation, the Breeze vectors are accumulated > using +=. For example, > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L162 > This is potentially a thread unsafe operation. (This is what I observed in > local testing.) I suggest changing the += to + -- a new object will be > allocated but it will be thread safe since it won't write to an old location > accessed by multiple threads. > Further testing is required to reproduce and verify. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3384) Potential thread unsafe Breeze vector addition in KMeans
[ https://issues.apache.org/jira/browse/SPARK-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] RJ Nowling updated SPARK-3384: -- Description: In the KMeans clustering implementation, the Breeze vectors are accumulated using +=. For example, https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L162 This is potentially a thread unsafe operation. (This is what I observed in local testing.) I suggest changing the += to + -- a new object will be allocated but it will be thread safe since it won't write to an old location accessed by multiple threads. Further testing is required to reproduce and verify. was: In the KMeans clustering implementation, the Breeze vectors are accumulated using +=: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L162 This is potentially a thread unsafe operation. (This is what I observed in local testing.) I suggest changing the += to + -- a new object will be allocated but it will be thread safe since it won't write to an old location accessed by multiple threads. Further testing is required to reproduce and verify. > Potential thread unsafe Breeze vector addition in KMeans > > > Key: SPARK-3384 > URL: https://issues.apache.org/jira/browse/SPARK-3384 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: RJ Nowling > > In the KMeans clustering implementation, the Breeze vectors are accumulated > using +=. For example, > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L162 > This is potentially a thread unsafe operation. (This is what I observed in > local testing.) I suggest changing the += to + -- a new object will be > allocated but it will be thread safe since it won't write to an old location > accessed by multiple threads. > Further testing is required to reproduce and verify. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3384) Potential thread unsafe Breeze vector addition in KMeans
RJ Nowling created SPARK-3384: - Summary: Potential thread unsafe Breeze vector addition in KMeans Key: SPARK-3384 URL: https://issues.apache.org/jira/browse/SPARK-3384 Project: Spark Issue Type: Bug Components: MLlib Reporter: RJ Nowling In the KMeans clustering implementation, the Breeze vectors are accumulated using +=: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L162 This is potentially a thread unsafe operation. (This is what I observed in local testing.) I suggest changing the += to + -- a new object will be allocated but it will be thread safe since it won't write to an old location accessed by multiple threads. Further testing is required to reproduce and verify. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3250) More Efficient Sampling
[ https://issues.apache.org/jira/browse/SPARK-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14115213#comment-14115213 ] RJ Nowling commented on SPARK-3250: --- Very clever! Once it's verified to sample correctly, it would make a nice incremental improvement to the current sampling in Spark. Can you try different data sizes? It shouldn't change the O(n) scaling but it would be good to verify. > More Efficient Sampling > --- > > Key: SPARK-3250 > URL: https://issues.apache.org/jira/browse/SPARK-3250 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: RJ Nowling > > Sampling, as currently implemented in Spark, is an O\(n\) operation. A > number of stochastic algorithms achieve speed ups by exploiting O\(k\) > sampling, where k is the number of data points to sample. Examples of such > algorithms include KMeans MiniBatch (SPARK-2308) and Stochastic Gradient > Descent with mini batching. > More efficient sampling may be achievable by packing partitions with an > ArrayBuffer or other data structure supporting random access. Since many of > these stochastic algorithms perform repeated rounds of sampling, it may be > feasible to perform a transformation to change the backing data structure > followed by multiple rounds of sampling. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3263) PR #720 broke GraphGenerator.logNormal
RJ Nowling created SPARK-3263: - Summary: PR #720 broke GraphGenerator.logNormal Key: SPARK-3263 URL: https://issues.apache.org/jira/browse/SPARK-3263 Project: Spark Issue Type: Bug Components: GraphX Reporter: RJ Nowling PR #720 made multiple changes to GraphGenerator.logNormalGraph, including: * Replacing the calls to the functions for generating random vertices and edges with in-line implementations that use different equations * Hard-coding the RNG seeds, so that the method now generates the same graph for a given number of vertices, edges, mu, and sigma -- the user is not able to override the seed or specify that the seed should be randomly generated * A backwards-incompatible change to the logNormalGraph signature with the introduction of a new required parameter * Failing to update the Scala docs and programming guide for the API changes I also see that PR #720 added a Synthetic Benchmark in the examples. Based on my reading of the Pregel paper, I believe the in-line functions are incorrect. I propose: * Removing the in-line calls * Adding a seed for deterministic behavior (when desired) * Keeping the number of partitions parameter * Updating the synthetic benchmark example -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2966) Add an approximation algorithm for hierarchical clustering to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14112223#comment-14112223 ] RJ Nowling commented on SPARK-2966: --- This is a duplicate of SPARK-2429. Please see the comments on that JIRA and the Spark dev list archives for community discussion on the preferred approaches (divisive, not agglomerative clustering). > Add an approximation algorithm for hierarchical clustering to MLlib > --- > > Key: SPARK-2966 > URL: https://issues.apache.org/jira/browse/SPARK-2966 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Yu Ishikawa >Priority: Minor > > A hierarchical clustering algorithm is a useful unsupervised learning method. > Koga et al. proposed a highly scalable hierarchical clustering algorithm in > (1). > I would like to implement this method. > I suggest adding an approximate hierarchical clustering algorithm to MLlib. > I'd like this to be assigned to me. > h3. Reference > # Fast agglomerative hierarchical clustering algorithm using > Locality-Sensitive Hashing > http://dl.acm.org/citation.cfm?id=1266811 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1411#comment-1411 ] RJ Nowling commented on SPARK-2429: --- Discussion on the dev list mentioned a community preference for implementing KMeans recursively (a divisive approach). Jeremy Freeman provided an example here: https://gist.github.com/freeman-lab/5947e7c53b368fe90371 The example needs to be optimized but provides a good starting point; for example, every time KMeans is called, the data is converted to Breeze vectors. Here are two papers on divisive KMeans: A combined K-means and hierarchical clustering method for improving the clustering efficiency of microarray (2005) by Chen, et al. Divisive Hierarchical K-Means (2006) by Lamrous, et al. > Hierarchical Implementation of KMeans > - > > Key: SPARK-2429 > URL: https://issues.apache.org/jira/browse/SPARK-2429 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Priority: Minor > > Hierarchical clustering algorithms are widely used and would make a nice > addition to MLlib. Clustering algorithms are useful for determining > relationships between clusters as well as offering faster assignment. > Discussion on the dev list suggested the following possible approaches: > * Top down, recursive application of KMeans > * Reuse DecisionTree implementation with different objective function > * Hierarchical SVD > It was also suggested that support for distance metrics other than Euclidean > such as negative dot or cosine are necessary. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
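The divisive (top-down, recursive) approach preferred in this discussion can be sketched minimally. This is an illustrative sketch on 1-D floats with plain Python lists rather than RDDs of vectors, and it assumes each cluster selected for splitting contains at least two distinct values; none of the names are MLlib API:

```python
# Divisive hierarchical KMeans sketch: repeatedly bisect the largest
# cluster with 2-means until k clusters remain.

def two_means(pts, iters=10):
    """A few Lloyd iterations of 2-means on 1-D points."""
    c1, c2 = min(pts), max(pts)
    for _ in range(iters):
        a = [p for p in pts if abs(p - c1) <= abs(p - c2)]
        b = [p for p in pts if abs(p - c1) > abs(p - c2)]
        if a:
            c1 = sum(a) / len(a)
        if b:
            c2 = sum(b) / len(b)
    return c1, c2

def divisive_cluster(pts, k):
    """Recursively bisect the largest cluster until there are k clusters."""
    clusters = [list(pts)]
    while len(clusters) < k:
        largest = max(clusters, key=len)
        clusters.remove(largest)
        c1, c2 = two_means(largest)
        a = [p for p in largest if abs(p - c1) <= abs(p - c2)]
        b = [p for p in largest if abs(p - c1) > abs(p - c2)]
        clusters += [a, b]
    return clusters
```

A real implementation would also avoid the repeated data conversion noted above by converting to the working vector type once, before the recursion begins.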
[jira] [Updated] (SPARK-3250) More Efficient Sampling
[ https://issues.apache.org/jira/browse/SPARK-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] RJ Nowling updated SPARK-3250: -- Description: Sampling, as currently implemented in Spark, is an O\(n\) operation. A number of stochastic algorithms achieve speed ups by exploiting O\(k\) sampling, where k is the number of data points to sample. Examples of such algorithms include KMeans MiniBatch (SPARK-2308) and Stochastic Gradient Descent with mini batching. More efficient sampling may be achievable by packing partitions with an ArrayBuffer or other data structure supporting random access. Since many of these stochastic algorithms perform repeated rounds of sampling, it may be feasible to perform a transformation to change the backing data structure followed by multiple rounds of sampling. was: Sampling, as currently implemented in Spark, is an O(n) operation. A number of stochastic algorithms achieve speed ups by exploiting O(k) sampling, where k is the number of data points to sample. Examples of such algorithms include KMeans MiniBatch (SPARK-2308) and Stochastic Gradient Descent with mini batching. More efficient sampling may be achievable by packing partitions with an ArrayBuffer or other data structure supporting random access. Since many of these stochastic algorithms perform repeated rounds of sampling, it may be feasible to perform a transformation to change the backing data structure followed by multiple rounds of sampling. > More Efficient Sampling > --- > > Key: SPARK-3250 > URL: https://issues.apache.org/jira/browse/SPARK-3250 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: RJ Nowling > > Sampling, as currently implemented in Spark, is an O\(n\) operation. A > number of stochastic algorithms achieve speed ups by exploiting O\(k\) > sampling, where k is the number of data points to sample. 
Examples of such > algorithms include KMeans MiniBatch (SPARK-2308) and Stochastic Gradient > Descent with mini batching. > More efficient sampling may be achievable by packing partitions with an > ArrayBuffer or other data structure supporting random access. Since many of > these stochastic algorithms perform repeated rounds of sampling, it may be > feasible to perform a transformation to change the backing data structure > followed by multiple rounds of sampling. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3250) More Efficient Sampling
[ https://issues.apache.org/jira/browse/SPARK-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] RJ Nowling updated SPARK-3250: -- Description: Sampling, as currently implemented in Spark, is an O(n) operation. A number of stochastic algorithms achieve speed ups by exploiting O(k) sampling, where k is the number of data points to sample. Examples of such algorithms include KMeans MiniBatch (SPARK-2308) and Stochastic Gradient Descent with mini batching. More efficient sampling may be achievable by packing partitions with an ArrayBuffer or other data structure supporting random access. Since many of these stochastic algorithms perform repeated rounds of sampling, it may be feasible to perform a transformation to change the backing data structure followed by multiple rounds of sampling. was: Sampling, as currently implemented in Spark, is an O(n) operation. A number of stochastic algorithms achieve speed ups by exploiting O(k) sampling, where k is the number of data points to sample. Examples of such algorithms include KMeans MiniBatch and Stochastic Gradient Descent with mini batching. More efficient sampling may be achievable by packing partitions with an ArrayBuffer or other data structure supporting random access. Since many of these stochastic algorithms perform repeated rounds of sampling, it may be feasible to perform a transformation to change the backing data structure followed by multiple rounds of sampling. > More Efficient Sampling > --- > > Key: SPARK-3250 > URL: https://issues.apache.org/jira/browse/SPARK-3250 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: RJ Nowling > > Sampling, as currently implemented in Spark, is an O(n) operation. A number > of stochastic algorithms achieve speed ups by exploiting O(k) sampling, where > k is the number of data points to sample. Examples of such algorithms > include KMeans MiniBatch (SPARK-2308) and Stochastic Gradient Descent with > mini batching. 
> More efficient sampling may be achievable by packing partitions with an > ArrayBuffer or other data structure supporting random access. Since many of > these stochastic algorithms perform repeated rounds of sampling, it may be > feasible to perform a transformation to change the backing data structure > followed by multiple rounds of sampling. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3250) More Efficient Sampling
RJ Nowling created SPARK-3250: - Summary: More Efficient Sampling Key: SPARK-3250 URL: https://issues.apache.org/jira/browse/SPARK-3250 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: RJ Nowling Sampling, as currently implemented in Spark, is an O(n) operation. A number of stochastic algorithms achieve speed ups by exploiting O(k) sampling, where k is the number of data points to sample. Examples of such algorithms include KMeans MiniBatch and Stochastic Gradient Descent with mini batching. More efficient sampling may be achievable by packing partitions with an ArrayBuffer or other data structure supporting random access. Since many of these stochastic algorithms perform repeated rounds of sampling, it may be feasible to perform a transformation to change the backing data structure followed by multiple rounds of sampling. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
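The O(n) versus O(k) distinction behind this issue can be sketched as follows. This is illustrative Python, not Spark internals; the function names are assumptions:

```python
# Bernoulli sampling performs one coin flip per element, so it is O(n);
# index-based sampling touches only k elements, which is what a
# random-access backing structure (the ArrayBuffer idea above) enables.
import random

def bernoulli_sample(data, fraction, rng):
    """O(n): visit every element, keep each with probability `fraction`."""
    return [x for x in data if rng.random() < fraction]

def index_sample(data, k, rng):
    """O(k): draw k random indices (with replacement) into a random-access list."""
    return [data[rng.randrange(len(data))] for _ in range(k)]
```

For repeated rounds of sampling, as in mini-batch algorithms, paying a one-time O(n) transformation into a random-access structure and then doing many O(k) draws is the amortization argument made in the issue description.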
[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14111224#comment-14111224 ] RJ Nowling commented on SPARK-2308: --- Xiangrui, I realized that sampling in Spark is O(n), where n is the number of elements in the data set. To get a performance advantage from MiniBatch KMeans, we need a sampling method that provides O(k) time, where k is the number of points to sample. I don't see any obvious way to implement a more efficient sampling method. If you concur, I'll create a separate JIRA to document the need for a more efficient sampling method and close this JIRA. Thanks. > Add KMeans MiniBatch clustering algorithm to MLlib > -- > > Key: SPARK-2308 > URL: https://issues.apache.org/jira/browse/SPARK-2308 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Assignee: RJ Nowling >Priority: Minor > Attachments: many_small_centers.pdf, uneven_centers.pdf > > > Mini-batch is a version of KMeans that uses a randomly-sampled subset of the > data points in each iteration instead of the full set of data points, > improving performance (and in some cases, accuracy). The mini-batch version > is compatible with the KMeans|| initialization algorithm currently > implemented in MLlib. > I suggest adding KMeans Mini-batch as an alternative. > I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14079260#comment-14079260 ] RJ Nowling commented on SPARK-2308: --- Thanks for the clarification. :) I'll run the additional tests to try to answer those questions. I'll also work on trying to implement MiniBatch KMeans as a flag for the current KMeans implementation -- that would be a nicer API. > Add KMeans MiniBatch clustering algorithm to MLlib > -- > > Key: SPARK-2308 > URL: https://issues.apache.org/jira/browse/SPARK-2308 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Assignee: RJ Nowling >Priority: Minor > Attachments: many_small_centers.pdf, uneven_centers.pdf > > > Mini-batch is a version of KMeans that uses a randomly-sampled subset of the > data points in each iteration instead of the full set of data points, > improving performance (and in some cases, accuracy). The mini-batch version > is compatible with the KMeans|| initialization algorithm currently > implemented in MLlib. > I suggest adding KMeans Mini-batch as an alternative. > I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078078#comment-14078078 ] RJ Nowling commented on SPARK-2308: --- I did all of my tests with scikit-learn, given your suggestion. Scikit-learn uses k-means++, not k-means||. I should have made that clear. I'm not clear on what you're looking for. I have a few observations at this point: 1. KMeans seems to be very sensitive to initialization -- cluster positions don't seem to change significantly after initialization 2. Initialization seems to be more important than whether you use KMeans or KMeans MiniBatch -- given the same initialization, they tend to do equally well 3. Random and kmeans++ / kmeans|| initialization methods seem sensitive to variations in cluster sizes. However, I'm happy to run more tests if you think they will be useful, but at this point, I feel the behavior we're seeing is expected. Hierarchical KMeans or methods such as KCenters, which guarantee that the space is partitioned equally (regardless of cluster density), may be useful for cases where KMeans doesn't perform as desired. > Add KMeans MiniBatch clustering algorithm to MLlib > -- > > Key: SPARK-2308 > URL: https://issues.apache.org/jira/browse/SPARK-2308 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Priority: Minor > Attachments: many_small_centers.pdf, uneven_centers.pdf > > > Mini-batch is a version of KMeans that uses a randomly-sampled subset of the > data points in each iteration instead of the full set of data points, > improving performance (and in some cases, accuracy). The mini-batch version > is compatible with the KMeans|| initialization algorithm currently > implemented in MLlib. > I suggest adding KMeans Mini-batch as an alternative. > I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063601#comment-14063601 ] RJ Nowling commented on SPARK-2308: --- I tested KMeans vs. MiniBatch KMeans under 2 scenarios: * 4 centers of 1000, 100, 10, and 1 data points * 100 centers with 10 points each The proposed centers were generated along a grid. The data points were generated by adding samples from N(0, 1.0) in each dimension to the centers. I found the expected centers by averaging the points generated from each proposed center. I ran KMeans and MiniBatch KMeans for each set of data points with 30 iterations and k-means++ initialization. I plotted the expected centers (blue), KMeans centers (red), and MiniBatch centers (green). The two methods showed similar results. They both struggled with the small clusters and ended up finding two centers for the large cluster, ignoring the single data point. For the 100 even clusters, both methods got most of the centers reasonably correct and, in a few cases, had 2 centers where there should be 1. I've attached the plots (many_small_centers.pdf, uneven_centers.pdf). In reviewing the scikit-learn implementation, I saw that they handle small clusters as special cases. In the case of small clusters, one of the points in the cluster is randomly chosen as the center instead of finding the center as a running average of the sampled points. > Add KMeans MiniBatch clustering algorithm to MLlib > -- > > Key: SPARK-2308 > URL: https://issues.apache.org/jira/browse/SPARK-2308 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Priority: Minor > Attachments: many_small_centers.pdf, uneven_centers.pdf > > > Mini-batch is a version of KMeans that uses a randomly-sampled subset of the > data points in each iteration instead of the full set of data points, > improving performance (and in some cases, accuracy). 
The mini-batch version > is compatible with the KMeans|| initialization algorithm currently > implemented in MLlib. > I suggest adding KMeans Mini-batch as an alternative. > I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.2#6252)
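The synthetic data described in the test above (grid-aligned centers plus N(0, 1) noise per dimension) can be sketched as follows. This is an illustrative reconstruction; the exact grid spacing and per-center counts used in the reported tests may differ:

```python
# Generate clustered 2-D test data: centers on a grid, points drawn by
# adding N(0, 1) noise to each dimension of a center.
import random

def generate_clusters(grid_side, spacing, points_per_center, rng):
    """Return (center, point) pairs for a grid_side x grid_side grid of centers."""
    pairs = []
    for i in range(grid_side):
        for j in range(grid_side):
            center = (i * spacing, j * spacing)
            for _ in range(points_per_center):
                point = tuple(c + rng.gauss(0.0, 1.0) for c in center)
                pairs.append((center, point))
    return pairs
```

Keeping the (center, point) pairing makes it easy to compute the expected centers by averaging the points generated from each proposed center, as done in the test above.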
[jira] [Updated] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] RJ Nowling updated SPARK-2308: -- Attachment: uneven_centers.pdf many_small_centers.pdf > Add KMeans MiniBatch clustering algorithm to MLlib > -- > > Key: SPARK-2308 > URL: https://issues.apache.org/jira/browse/SPARK-2308 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Priority: Minor > Attachments: many_small_centers.pdf, uneven_centers.pdf > > > Mini-batch is a version of KMeans that uses a randomly-sampled subset of the > data points in each iteration instead of the full set of data points, > improving performance (and in some cases, accuracy). The mini-batch version > is compatible with the KMeans|| initialization algorithm currently > implemented in MLlib. > I suggest adding KMeans Mini-batch as an alternative. > I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057513#comment-14057513 ] RJ Nowling commented on SPARK-2308: --- That sounds like a good idea for a test. I'll report back. > Add KMeans MiniBatch clustering algorithm to MLlib > -- > > Key: SPARK-2308 > URL: https://issues.apache.org/jira/browse/SPARK-2308 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Priority: Minor > > Mini-batch is a version of KMeans that uses a randomly-sampled subset of the > data points in each iteration instead of the full set of data points, > improving performance (and in some cases, accuracy). The mini-batch version > is compatible with the KMeans|| initialization algorithm currently > implemented in MLlib. > I suggest adding KMeans Mini-batch as an alternative. > I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2430) Standardized Clustering Algorithm API and Framework
RJ Nowling created SPARK-2430: - Summary: Standardized Clustering Algorithm API and Framework Key: SPARK-2430 URL: https://issues.apache.org/jira/browse/SPARK-2430 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Priority: Minor Recently, there has been a chorus of voices on the mailing lists about adding new clustering algorithms to MLlib. To support these additions, we should develop a common framework and API to reduce code duplication and keep the APIs consistent. At the same time, we can also expand the current API to incorporate requested features such as arbitrary distance metrics or pre-computed distance matrices. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2429) Hierarchical Implementation of KMeans
RJ Nowling created SPARK-2429: - Summary: Hierarchical Implementation of KMeans Key: SPARK-2429 URL: https://issues.apache.org/jira/browse/SPARK-2429 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Priority: Minor Hierarchical clustering algorithms are widely used and would make a nice addition to MLlib. Clustering algorithms are useful for determining relationships between clusters as well as offering faster assignment. Discussion on the dev list suggested the following possible approaches: * Top down, recursive application of KMeans * Reuse DecisionTree implementation with different objective function * Hierarchical SVD It was also suggested that support for distance metrics other than Euclidean such as negative dot or cosine are necessary. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052558#comment-14052558 ] RJ Nowling commented on SPARK-2308: --- Hi Xiangrui, Here's the paper: http://www.ra.ethz.ch/CDstore/www2010/www/p1177.pdf This discussion in the scikit-learn documentation could also be useful: http://scikit-learn.org/stable/modules/clustering.html I agree that smaller clusters will be at a disadvantage with uniform sampling. I imagine one could weight the points inversely by cluster size or the like. However, the challenge would be to do it in a way that doesn't require touching all of the data points. The MiniBatch approach only samples batchSize data points in each iteration. Those data points are used to update their respective centers. You would have to reassign all the data points to the updated cluster centers in each iteration to prevent the weights from quickly becoming inaccurate. This would defeat one of the main optimizations of the method. Do you have any suggestions on how to achieve the weighting in a way that would maintain the properties necessary for convergence and keep the efficiency advantages? Thanks! > Add KMeans MiniBatch clustering algorithm to MLlib > -- > > Key: SPARK-2308 > URL: https://issues.apache.org/jira/browse/SPARK-2308 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Priority: Minor > > Mini-batch is a version of KMeans that uses a randomly-sampled subset of the > data points in each iteration instead of the full set of data points, > improving performance (and in some cases, accuracy). The mini-batch version > is compatible with the KMeans|| initialization algorithm currently > implemented in MLlib. > I suggest adding KMeans Mini-batch as an alternative. > I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.2#6252)
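The per-center update that makes mini-batch KMeans efficient (from the Web-scale k-means paper linked above) can be sketched minimally. This is an illustrative 1-D sketch with hypothetical names, not MLlib API:

```python
# One mini-batch step: each sampled point is assigned to its nearest
# center, and that center is moved toward the point with a per-center
# learning rate of 1/count. Only the sampled points are touched, which is
# why reassigning *all* points each iteration would defeat the method.

def mini_batch_step(centers, counts, batch):
    """Update `centers` and per-center `counts` in place from one mini-batch."""
    for x in batch:
        # Assign to the nearest center.
        j = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
        counts[j] += 1
        eta = 1.0 / counts[j]  # per-center learning rate decays over time
        centers[j] = (1.0 - eta) * centers[j] + eta * x
```

Because the learning rate is 1/count, each center converges to the running average of the points ever assigned to it, using only the sampled points in each batch.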
[jira] [Created] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib
RJ Nowling created SPARK-2308: - Summary: Add KMeans MiniBatch clustering algorithm to MLlib Key: SPARK-2308 URL: https://issues.apache.org/jira/browse/SPARK-2308 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Priority: Minor Mini-batch is a version of KMeans that uses a randomly-sampled subset of the data points in each iteration instead of the full set of data points, improving performance (and in some cases, accuracy). The mini-batch version is compatible with the KMeans|| initialization algorithm currently implemented in MLlib. I suggest adding KMeans Mini-batch as an alternative. I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.2#6252)