[jira] [Commented] (SPARK-10791) Optimize MLlib LDA topic distribution query performance
[ https://issues.apache.org/jira/browse/SPARK-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14955408#comment-14955408 ] Mohamed Baddar commented on SPARK-10791:

[~aspa] Would you please point out the specific thread at https://mail-archives.apache.org/mod_mbox/spark-user/201509.mbox/browser that discusses this performance issue? I am working on [SPARK-10808].

> Optimize MLlib LDA topic distribution query performance
> --------------------------------------------------------
>
> Key: SPARK-10791
> URL: https://issues.apache.org/jira/browse/SPARK-10791
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.5.0
> Environment: Ubuntu 13.10, Oracle Java 8
> Reporter: Marko Asplund
>
> I've been testing MLlib LDA training with 100 topics, a 105 K vocabulary and ~3.4 M documents, using EMLDAOptimizer.
> Training the model took ~2.5 hours with MLlib, whereas Vowpal Wabbit trained on the same data and the same system in ~5 minutes.
> Loading the persisted model from disk (~2 minutes) and querying the LDA model's topic distributions (~4 seconds for one document) are also quite slow operations.
> Our application queries the LDA model's topic distribution (for one document at a time) as part of an end-user operation execution flow, so a ~4 second execution time is very problematic.
> The log includes the following message, which, AFAIK, should mean that netlib-java is using a machine-optimised native implementation:
> "com.github.fommil.jni.JniLoader - successfully loaded /tmp/jniloader4682745056459314976netlib-native_system-linux-x86_64.so"
> My test code can be found here:
> https://github.com/marko-asplund/tech-protos/blob/08e9819a2108bf6bd4d878253c4aa32510a0a9ce/mllib-lda/src/main/scala/fi/markoa/proto/mllib/LDADemo.scala#L56-L57
> I also tried using the OnlineLDAOptimizer, but there wasn't a noticeable change in training performance. Model loading time was reduced from ~2 minutes to ~5 seconds (the model is now persisted as a LocalLDAModel), but query/prediction time was unchanged.
> Unfortunately, that is the critical performance characteristic in our case.
> I did some profiling of my LDA prototype code that requests topic distributions from a model. According to Java Mission Control, more than 80% of execution time during the sample interval is spent in the following methods:
> - org.apache.commons.math3.util.FastMath.log(double); count: 337; 47.07%
> - org.apache.commons.math3.special.Gamma.digamma(double); count: 164; 22.91%
> - org.apache.commons.math3.util.FastMath.log(double, double[]); count: 50; 6.98%
> - java.lang.Double.valueOf(double); count: 31; 4.33%
> Is there any way of using the API more optimally?
> Are there any opportunities for optimising the "topicDistributions" code path in MLlib?
> My query test code essentially looks like this:
>
> // executed once
> val model = LocalLDAModel.load(ctx, ModelFileName)
>
> // executed four times
> val samples = Transformers.toSparseVectors(vocabularySize, ctx.parallelize(Seq(input))) // fast
> model.topicDistributions(samples.zipWithIndex.map(_.swap)) // <== this seems to take about 4 seconds to execute
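For context, here is a minimal, self-contained sketch of the query path described above, written in spark-shell style (sc is the shell's SparkContext). The model path, vocabulary size and term counts are placeholders, and the direct sparse-vector construction stands in for the prototype's Transformers.toSparseVectors helper, which is not shown in this thread:

    import org.apache.spark.mllib.clustering.LocalLDAModel
    import org.apache.spark.mllib.linalg.Vectors

    val ModelFileName = "/path/to/lda-model"  // placeholder for the persisted model location
    val vocabularySize = 105000               // ~105 K terms, per the description above

    // Executed once: load the persisted LocalLDAModel.
    val model = LocalLDAModel.load(sc, ModelFileName)

    // Executed per query: one document as a sparse vector of term counts
    // (stand-in for the prototype's Transformers.toSparseVectors helper).
    val doc = Vectors.sparse(vocabularySize, Array(12, 4521, 87001), Array(3.0, 1.0, 2.0))

    // topicDistributions takes an RDD[(docId, termCounts)]; the reporter observes
    // roughly 4 seconds per call even when the RDD holds a single document.
    val samples = sc.parallelize(Seq(doc)).zipWithIndex.map(_.swap)
    val topicDist = model.topicDistributions(samples).collect()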
[jira] [Commented] (SPARK-10791) Optimize MLlib LDA topic distribution query performance
[ https://issues.apache.org/jira/browse/SPARK-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14955444#comment-14955444 ] Marko Asplund commented on SPARK-10791:

Please see: Sep 2015 / Thread view / page 8, thread title: "How to speed up MLlib LDA?"
https://mail-archives.apache.org/mod_mbox/spark-user/201509.mbox/%3CCANoUZR-xcmvj%3DYgUc1JEHu54vWfyP0n-%3DHfz2dxiWFRuk8zRpQ%40mail.gmail.com%3E
[jira] [Commented] (SPARK-10791) Optimize MLlib LDA topic distribution query performance
[ https://issues.apache.org/jira/browse/SPARK-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908448#comment-14908448 ] Joseph K. Bradley commented on SPARK-10791:

Oh, OK, I'll comment there as needed. Thanks.
[jira] [Commented] (SPARK-10791) Optimize MLlib LDA topic distribution query performance
[ https://issues.apache.org/jira/browse/SPARK-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907975#comment-14907975 ] Marko Asplund commented on SPARK-10791:

This performance issue was actually discussed on the Spark user mailing list. Please see the full discussion here:
https://mail-archives.apache.org/mod_mbox/spark-user/201509.mbox/browser

My tests were performed on a single node.
[jira] [Commented] (SPARK-10791) Optimize MLlib LDA topic distribution query performance
[ https://issues.apache.org/jira/browse/SPARK-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14906776#comment-14906776 ] Joseph K. Bradley commented on SPARK-10791:

This sounds like a question for the user list, not JIRA, but here are some thoughts:

Was this run on a single machine or in parallel? MLlib is of course optimized to scale with parallelism, rather than to run on a single machine. I suspect you could speed up training somewhat; check out [SPARK-10808] for some thoughts.

The topicDistributions method could be improved for your use case if your "input" is a small set of documents. I just made [SPARK-10809] to track that. If you are using a big batch of documents, then parallelization should help.

I'll close this for now since I think the JIRAs I just made should cover the issues.
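To illustrate the batching suggestion above: a minimal sketch, assuming the same spark-shell setup, model and sparse term-count vectors as in the reporter's snippet. scoreBatch is a hypothetical helper, not an MLlib API; the point is simply that scoring many documents in one topicDistributions call pays the per-job overhead once rather than once per document:

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.clustering.LocalLDAModel
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Hypothetical helper: score a whole batch of documents at once instead of
    // wrapping each document in its own RDD and submitting a separate job for it.
    def scoreBatch(ctx: SparkContext, model: LocalLDAModel,
                   docs: Seq[Vector]): Array[(Long, Vector)] = {
      val batch: RDD[(Long, Vector)] = ctx.parallelize(docs).zipWithIndex.map(_.swap)
      model.topicDistributions(batch).collect()  // one job for the whole batch
    }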