[jira] [Commented] (SPARK-1548) Add Partial Random Forest algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15972474#comment-15972474 ]

Mohamed Baddar commented on SPARK-1548:
---------------------------------------

[~srowen] [~josephkb] Any updates on the possibility of proceeding with this issue?

> Add Partial Random Forest algorithm to MLlib
> --------------------------------------------
>
>                 Key: SPARK-1548
>                 URL: https://issues.apache.org/jira/browse/SPARK-1548
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>    Affects Versions: 1.0.0
>            Reporter: Manish Amde
>
> This task involves creating an alternate, approximate random forest
> implementation where each tree is constructed per partition.
> The tasks involve:
> - Justifying with theory and experimental results why this algorithm is a
>   good choice
> - Comparing the various tradeoffs and finalizing the algorithm before
>   implementation
> - Code implementation
> - Unit tests
> - Functional tests
> - Performance tests
> - Documentation

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1548) Add Partial Random Forest algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15940130#comment-15940130 ]

Mohamed Baddar commented on SPARK-1548:
---------------------------------------

[~manishamde] [~sowen] [~josephkb] I have some experience contributing starter tasks in Spark, and I find this issue interesting. While investigating the partial implementation of RF, I found these resources:
https://mahout.apache.org/users/classification/partial-implementation.html
https://github.com/apache/mahout/blob/b5fe4aab22e7867ae057a6cdb1610cfa17555311/mr/src/main/java/org/apache/mahout/classifier/df/mapreduce/partial/package-info.java
I think analyzing the Mahout implementation provides a good basis for studying the partial RF approach, both in theory and in practice. If this issue is still important to Spark, it would be great if I could start on it. I would begin by writing an analysis document for the current Mahout implementation to assess its performance.

> Add Partial Random Forest algorithm to MLlib
> --------------------------------------------
>
>                 Key: SPARK-1548
>                 URL: https://issues.apache.org/jira/browse/SPARK-1548
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>    Affects Versions: 1.0.0
>            Reporter: Manish Amde
>
> This task involves creating an alternate, approximate random forest
> implementation where each tree is constructed per partition.
> The tasks involve:
> - Justifying with theory and experimental results why this algorithm is a
>   good choice
> - Comparing the various tradeoffs and finalizing the algorithm before
>   implementation
> - Code implementation
> - Unit tests
> - Functional tests
> - Performance tests
> - Documentation

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
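The per-partition idea discussed above (train one tree per data partition, then combine the trees into an ensemble, as Mahout's partial implementation does) can be sketched in miniature. This is a hedged illustration only: plain Python, decision stumps standing in for full trees, and round-robin slicing standing in for Spark partitions; none of the names here are Spark or Mahout API.

```python
from collections import Counter

def train_stump(rows):
    """Train a depth-1 'tree' (a decision stump) on one partition's rows.
    Each row is (features, label); features is a list of floats."""
    labels = [y for _, y in rows]
    default = Counter(labels).most_common(1)[0][0]
    best = None
    for j in range(len(rows[0][0])):
        thr = sum(x[j] for x, _ in rows) / len(rows)  # split at the mean
        left = [y for x, y in rows if x[j] <= thr]
        right = [y for x, y in rows if x[j] > thr]
        if not left or not right:
            continue  # degenerate split on this feature
        lmaj = Counter(left).most_common(1)[0][0]
        rmaj = Counter(right).most_common(1)[0][0]
        acc = sum((lmaj if x[j] <= thr else rmaj) == y
                  for x, y in rows) / len(rows)
        if best is None or acc > best[0]:
            best = (acc, j, thr, lmaj, rmaj)
    if best is None:
        return lambda x: default
    _, j, thr, lmaj, rmaj = best
    return lambda x: lmaj if x[j] <= thr else rmaj

def partial_forest(data, num_partitions=4):
    """Split the rows into partitions (round-robin stands in for Spark's
    partitioning), train one 'tree' per partition in a single local pass,
    and predict by majority vote over the resulting forest."""
    parts = [data[i::num_partitions] for i in range(num_partitions)]
    trees = [train_stump(p) for p in parts if p]
    def predict(x):
        return Counter(t(x) for t in trees).most_common(1)[0][0]
    return predict

# Toy data: label is 1 exactly when the first feature exceeds 0.5
data = [([i / 20.0, 0.3], int(i / 20.0 > 0.5)) for i in range(21)]
predict = partial_forest(data)
print(predict([0.9, 0.3]), predict([0.1, 0.3]))  # -> 1 0
```

The appeal of the scheme, as in Mahout, is that each tree is trained entirely locally (one pass per partition, no cross-partition shuffles); the approximation cost is that each tree sees only its partition's slice of the data.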
[jira] [Commented] (SPARK-7129) Add generic boosting algorithm to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15934864#comment-15934864 ]

Mohamed Baddar commented on SPARK-7129:
---------------------------------------

[~josephkb] [~sethah] [~meihuawu] [~mlnick] If no one is working on this, can I start on it? I have some experience contributing starter tasks in Spark. I would begin by reading the design docs mentioned in the comments and then discuss next steps.

> Add generic boosting algorithm to spark.ml
> ------------------------------------------
>
>                 Key: SPARK-7129
>                 URL: https://issues.apache.org/jira/browse/SPARK-7129
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Joseph K. Bradley
>
> The Pipelines API will make it easier to create a generic Boosting algorithm
> which can work with any Classifier or Regressor. Creating this feature will
> require researching the possible variants and extensions of boosting which we
> may want to support now and/or in the future, and planning an API which will
> be properly extensible.
> In particular, it will be important to think about supporting:
> * multiple loss functions (for AdaBoost, LogitBoost, gradient boosting, etc.)
> * multiclass variants
> * multilabel variants (which will probably be in a separate class and JIRA)
> * For more esoteric variants, we should consider them but not design too much
> around them: totally corrective boosting, cascaded models
> Note: This may interact some with the existing tree ensemble methods, but it
> should be largely separate since the tree ensemble APIs and implementations
> are specialized for trees.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3246) Support weighted SVMWithSGD for classification of unbalanced dataset
[ https://issues.apache.org/jira/browse/SPARK-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391082#comment-15391082 ]

Mohamed Baddar commented on SPARK-3246:
---------------------------------------

[~sheridanrawlins] Working on it soon, most probably starting on the 1st of August.

> Support weighted SVMWithSGD for classification of unbalanced dataset
> --------------------------------------------------------------------
>
>                 Key: SPARK-3246
>                 URL: https://issues.apache.org/jira/browse/SPARK-3246
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 0.9.0, 1.0.2
>            Reporter: mahesh bhole
>
> Please support a weighted SVMWithSGD for binary classification of unbalanced
> datasets. Though other options like undersampling or oversampling can be
> used, it would be good to have a way to assign weights to the minority class.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
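The feature requested above — class weights in the hinge loss so the minority class counts more — can be sketched independently of Spark. A minimal illustration in plain Python (the function `weighted_svm_sgd` and its parameters are invented for this sketch, not MLlib API): each example's hinge subgradient is scaled by its class weight before the SGD step.

```python
def weighted_svm_sgd(data, class_weight, lr=0.1, reg=0.01, epochs=200):
    """Linear SVM trained with SGD on a class-weighted hinge loss.
    data: list of (features, label) with label in {-1, +1};
    class_weight: dict mapping label -> weight, so the minority class
    can be made to count more in the loss."""
    dim = len(data[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            cw = class_weight[y]
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            for j in range(dim):
                # regularizer always applies; the (weighted) hinge term
                # contributes only when the margin constraint is violated
                g = reg * w[j]
                if margin < 1:
                    g -= cw * y * x[j]
                w[j] -= lr * g
            if margin < 1:
                b += lr * cw * y  # bias left unregularized, as is common
    return w, b

# Unbalanced toy set: 2 positive points vs. 8 negatives; upweight positives
data = [([1.0], 1), ([1.5], 1)] + [([-1.0], -1)] * 8
w, b = weighted_svm_sgd(data, class_weight={1: 4.0, -1: 1.0})
```

The alternative mentioned in the issue (oversampling the minority class) is equivalent in expectation to this reweighting, but reweighting avoids inflating the dataset.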
[jira] [Commented] (SPARK-13073) creating R like summary for logistic Regression in Spark - Scala
[ https://issues.apache.org/jira/browse/SPARK-13073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15268400#comment-15268400 ]

Mohamed Baddar commented on SPARK-13073:
----------------------------------------

[~samsudhin] I will work on it soon.

> creating R like summary for logistic Regression in Spark - Scala
> ----------------------------------------------------------------
>
>                 Key: SPARK-13073
>                 URL: https://issues.apache.org/jira/browse/SPARK-13073
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, MLlib
>            Reporter: Samsudhin
>            Priority: Minor
>
> Currently, Spark ML provides coefficients for logistic regression. To
> evaluate the trained model, tests like the Wald test and chi-square tests
> should be run and their results summarized and displayed like R's GLM
> summary.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14077) Support weighted instances in naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-14077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15256247#comment-15256247 ]

Mohamed Baddar commented on SPARK-14077:
----------------------------------------

I have suspended work on this for the time being.

> Support weighted instances in naive Bayes
> -----------------------------------------
>
>                 Key: SPARK-14077
>                 URL: https://issues.apache.org/jira/browse/SPARK-14077
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Xiangrui Meng
>              Labels: naive-bayes
>
> In naive Bayes, we expect inputs to be individual observations. In practice,
> people may have a frequency table instead. It would be useful to support
> instance weights to handle this case.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14077) Support weighted instances in naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-14077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15213481#comment-15213481 ]

Mohamed Baddar commented on SPARK-14077:
----------------------------------------

[~mengxr] [~josephkb] scikit-learn implements the same feature by scaling the target variable after binarization. Here is the source code link:
https://github.com/scikit-learn/scikit-learn/blob/51a765a/sklearn/naive_bayes.py#L507
I think we can follow the scikit-learn implementation as a guideline, which will also help with the unit tests. Any thoughts?

> Support weighted instances in naive Bayes
> -----------------------------------------
>
>                 Key: SPARK-14077
>                 URL: https://issues.apache.org/jira/browse/SPARK-14077
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Xiangrui Meng
>              Labels: naive-bayes
>
> In naive Bayes, we expect inputs to be individual observations. In practice,
> people may have a frequency table instead. It would be useful to support
> instance weights to handle this case.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
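The scaling trick referenced in the comment amounts to accumulating every class and feature count with the instance's weight instead of with 1. A small sketch of that idea for multinomial naive Bayes in plain Python (all names here are illustrative, not MLlib or scikit-learn API); the defining property, useful for the unit test mentioned above, is that a weight of 2 behaves exactly like duplicating the row:

```python
import math

def fit_weighted_mnb(X, y, sample_weight, alpha=1.0):
    """Multinomial naive Bayes where every count is accumulated with the
    instance's weight — equivalent to scaling the binarized targets by
    sample_weight, as in the scikit-learn code linked above.
    X: list of count vectors; y: class labels.
    Returns per-class log-priors and log feature probabilities."""
    classes = sorted(set(y))
    n_feat = len(X[0])
    class_w = {c: 0.0 for c in classes}
    feat_w = {c: [0.0] * n_feat for c in classes}
    for x, c, w in zip(X, y, sample_weight):
        class_w[c] += w                      # weighted class count
        for j, xj in enumerate(x):
            feat_w[c][j] += w * xj           # weighted feature count
    total = sum(class_w.values())
    log_prior = {c: math.log(class_w[c] / total) for c in classes}
    log_prob = {}
    for c in classes:
        denom = sum(feat_w[c]) + alpha * n_feat  # Laplace smoothing
        log_prob[c] = [math.log((feat_w[c][j] + alpha) / denom)
                       for j in range(n_feat)]
    return log_prior, log_prob

# A row with weight 2.0 should give the same model as duplicating it
lp1, lf1 = fit_weighted_mnb([[1, 0], [0, 1]], [0, 1], [2.0, 1.0])
lp2, lf2 = fit_weighted_mnb([[1, 0], [1, 0], [0, 1]], [0, 0, 1],
                            [1.0, 1.0, 1.0])
```

This "weight = replication count" equivalence is exactly what makes the feature useful for the frequency-table use case in the issue description.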
[jira] [Commented] (SPARK-13073) creating R like summary for logistic Regression in Spark - Scala
[ https://issues.apache.org/jira/browse/SPARK-13073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210259#comment-15210259 ]

Mohamed Baddar commented on SPARK-13073:
----------------------------------------

Thanks [~samsudhin], I noticed the difference in params. Do you have any other comments on my notes?

> creating R like summary for logistic Regression in Spark - Scala
> ----------------------------------------------------------------
>
>                 Key: SPARK-13073
>                 URL: https://issues.apache.org/jira/browse/SPARK-13073
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, MLlib
>            Reporter: Samsudhin
>            Priority: Minor
>
> Currently, Spark ML provides coefficients for logistic regression. To
> evaluate the trained model, tests like the Wald test and chi-square tests
> should be run and their results summarized and displayed like R's GLM
> summary.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14077) Support weighted instances in naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-14077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210069#comment-15210069 ]

Mohamed Baddar commented on SPARK-14077:
----------------------------------------

[~mengxr] If nobody is working on this task, can I work on it?

> Support weighted instances in naive Bayes
> -----------------------------------------
>
>                 Key: SPARK-14077
>                 URL: https://issues.apache.org/jira/browse/SPARK-14077
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Xiangrui Meng
>              Labels: naive-bayes
>
> In naive Bayes, we expect inputs to be individual observations. In practice,
> people may have a frequency table instead. It would be useful to support
> instance weights to handle this case.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13073) creating R like summary for logistic Regression in Spark - Scala
[ https://issues.apache.org/jira/browse/SPARK-13073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210054#comment-15210054 ]

Mohamed Baddar commented on SPARK-13073:
----------------------------------------

[~josephkb] Could one of the admins verify this PR?

> creating R like summary for logistic Regression in Spark - Scala
> ----------------------------------------------------------------
>
>                 Key: SPARK-13073
>                 URL: https://issues.apache.org/jira/browse/SPARK-13073
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, MLlib
>            Reporter: Samsudhin
>            Priority: Minor
>
> Currently, Spark ML provides coefficients for logistic regression. To
> evaluate the trained model, tests like the Wald test and chi-square tests
> should be run and their results summarized and displayed like R's GLM
> summary.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1359) SGD implementation is not efficient
[ https://issues.apache.org/jira/browse/SPARK-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15197062#comment-15197062 ]

Mohamed Baddar commented on SPARK-1359:
---------------------------------------

[~mengxr] If this issue is still of interest and nobody is working on it, I can start the implementation.

> SGD implementation is not efficient
> -----------------------------------
>
>                 Key: SPARK-1359
>                 URL: https://issues.apache.org/jira/browse/SPARK-1359
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 0.9.0, 1.0.0
>            Reporter: Xiangrui Meng
>
> The SGD implementation samples a mini-batch to compute the stochastic
> gradient. This is not efficient because examples are provided via an
> iterator interface: we have to scan all of them to obtain a sample.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
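The inefficiency described in the issue is that materializing a mini-batch sample forces a full scan of the iterator before any gradient work begins. One standard remedy is to fold Bernoulli sampling into the same pass that accumulates the gradient, so sampled elements contribute immediately and nothing is materialized or rescanned. A sketch of that idea in plain Python (this mirrors the concept, not Spark's actual `GradientDescent` code; all names are invented):

```python
import random

def minibatch_gradient(examples, grad_fn, fraction, seed=42):
    """One-pass mini-batch stochastic gradient over an iterator.
    Each example is kept with probability `fraction` (Bernoulli
    sampling) and its gradient is accumulated immediately, so the
    sample is never materialized and the data is scanned only once."""
    rng = random.Random(seed)
    total, count = None, 0
    for x in examples:
        if rng.random() < fraction:
            g = grad_fn(x)
            total = g if total is None else [a + b for a, b in zip(total, g)]
            count += 1
    if count == 0:
        return None  # empty sample: caller should skip this step
    return [gi / count for gi in total]

# Toy use: "gradient" of example x is just [x]; fraction 1.0 keeps all
full = minibatch_gradient((float(i) for i in range(1000)),
                          lambda x: [x], fraction=1.0)
```

Note the sampled batch size is now a random variable with mean `fraction * n` rather than exact, which is the usual trade-off of Bernoulli sampling against a second scan.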
[jira] [Commented] (SPARK-3246) Support weighted SVMWithSGD for classification of unbalanced dataset
[ https://issues.apache.org/jira/browse/SPARK-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195242#comment-15195242 ]

Mohamed Baddar commented on SPARK-3246:
---------------------------------------

[~josephkb] If nobody is working on this issue and it is still of interest, I can work on it.

> Support weighted SVMWithSGD for classification of unbalanced dataset
> --------------------------------------------------------------------
>
>                 Key: SPARK-3246
>                 URL: https://issues.apache.org/jira/browse/SPARK-3246
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 0.9.0, 1.0.2
>            Reporter: mahesh bhole
>
> Please support a weighted SVMWithSGD for binary classification of unbalanced
> datasets. Though other options like undersampling or oversampling can be
> used, it would be good to have a way to assign weights to the minority class.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-9134) LDA Asymmetric topic-word prior
[ https://issues.apache.org/jira/browse/SPARK-9134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mohamed Baddar updated SPARK-9134:
----------------------------------
    Comment: was deleted

(was: [~josephkb] [~fliang] If nobody is working on this and there is still interest in the issue, can I start working on it?)

> LDA Asymmetric topic-word prior
> -------------------------------
>
>                 Key: SPARK-9134
>                 URL: https://issues.apache.org/jira/browse/SPARK-9134
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Feynman Liang
>
> SPARK-8536 generalizes LDA to asymmetric document-topic priors, which
> [Wallach et al.|http://dirichlet.net/pdf/wallach09rethinking.pdf] argue
> offers greater utility than symmetric priors.
> However, [Stanford NLP|http://nlp.stanford.edu/software/tmt/tmt-0.2/scaladocs/scaladocs/edu/stanford/nlp/tmt/lda/LDA.html]
> also permits asymmetric priors on the topic-word prior. We should not
> support manually specifying the entire matrix (which has numTopics *
> vocabSize entries); rather, we should follow Stanford NLP and take a single
> vector of length vocabSize as a prior over words, assuming that all topics
> share this prior (e.g. replicate it numTopics times to get the topic-word
> prior matrix).
> We are leaving this as a todo; any users who need this feature should
> discuss it on this JIRA.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9134) LDA Asymmetric topic-word prior
[ https://issues.apache.org/jira/browse/SPARK-9134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195217#comment-15195217 ]

Mohamed Baddar commented on SPARK-9134:
---------------------------------------

[~josephkb] [~fliang] If nobody is working on this and there is still interest in the issue, can I start working on it?

> LDA Asymmetric topic-word prior
> -------------------------------
>
>                 Key: SPARK-9134
>                 URL: https://issues.apache.org/jira/browse/SPARK-9134
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Feynman Liang
>
> SPARK-8536 generalizes LDA to asymmetric document-topic priors, which
> [Wallach et al.|http://dirichlet.net/pdf/wallach09rethinking.pdf] argue
> offers greater utility than symmetric priors.
> However, [Stanford NLP|http://nlp.stanford.edu/software/tmt/tmt-0.2/scaladocs/scaladocs/edu/stanford/nlp/tmt/lda/LDA.html]
> also permits asymmetric priors on the topic-word prior. We should not
> support manually specifying the entire matrix (which has numTopics *
> vocabSize entries); rather, we should follow Stanford NLP and take a single
> vector of length vocabSize as a prior over words, assuming that all topics
> share this prior (e.g. replicate it numTopics times to get the topic-word
> prior matrix).
> We are leaving this as a todo; any users who need this feature should
> discuss it on this JIRA.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
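The interface the issue description proposes — accept one vocabSize-length vector and replicate it across topics rather than taking the full numTopics × vocabSize matrix — is small enough to sketch directly. A plain-Python illustration (the function name and validation are invented for this sketch, not a Spark API):

```python
def expand_topic_word_prior(word_prior, num_topics):
    """Expand a single vocabSize-length Dirichlet prior over words into
    the full numTopics x vocabSize topic-word prior matrix by
    replication, so all topics share one asymmetric prior over words
    (the Stanford-NLP-style interface from the issue description)."""
    if any(v <= 0 for v in word_prior):
        raise ValueError("Dirichlet concentration parameters must be > 0")
    # independent row copies, so per-topic mutation can't alias rows
    return [list(word_prior) for _ in range(num_topics)]

matrix = expand_topic_word_prior([0.2, 0.5, 0.3], num_topics=4)
```

The API benefit is exactly the one the description cites: users supply vocabSize values instead of numTopics * vocabSize, while the optimizer internally still sees a full per-topic prior matrix.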
[jira] [Comment Edited] (SPARK-13073) creating R like summary for logistic Regression in Spark - Scala
[ https://issues.apache.org/jira/browse/SPARK-13073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192290#comment-15192290 ]

Mohamed Baddar edited comment on SPARK-13073 at 3/13/16 11:03 AM:
------------------------------------------------------------------

[~josephkb] After further investigation of the code, and to keep the changes minimal, my previous suggestion may not be suitable. I think we can instead implement a toString version for BinaryLogisticRegressionSummary that gives different information than the R summary. It would create a string representation of the following members:
precision
recall
fMeasure
Are there any comments before I start the PR?

was (Author: mbaddar1):
[~josephkb] After further investigation of the code, and to keep the changes minimal, my previous suggestion may not be suitable. I think we can instead implement a toString version for BinaryLogisticRegressionSummary that gives different information than the R summary. It would create a string representation of the following members:
precision
recall
fMeasure
[~josephkb] Are there any comments before I start the PR?

> creating R like summary for logistic Regression in Spark - Scala
> ----------------------------------------------------------------
>
>                 Key: SPARK-13073
>                 URL: https://issues.apache.org/jira/browse/SPARK-13073
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, MLlib
>            Reporter: Samsudhin
>            Priority: Minor
>
> Currently, Spark ML provides coefficients for logistic regression. To
> evaluate the trained model, tests like the Wald test and chi-square tests
> should be run and their results summarized and displayed like R's GLM
> summary.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13073) creating R like summary for logistic Regression in Spark - Scala
[ https://issues.apache.org/jira/browse/SPARK-13073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192290#comment-15192290 ]

Mohamed Baddar commented on SPARK-13073:
----------------------------------------

[~josephkb] After further investigation of the code, and to keep the changes minimal, my previous suggestion may not be suitable. I think we can instead implement a toString version for BinaryLogisticRegressionSummary that gives different information than the R summary. It would create a string representation of the following members:
precision
recall
fMeasure
[~josephkb] Are there any comments before I start the PR?

> creating R like summary for logistic Regression in Spark - Scala
> ----------------------------------------------------------------
>
>                 Key: SPARK-13073
>                 URL: https://issues.apache.org/jira/browse/SPARK-13073
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, MLlib
>            Reporter: Samsudhin
>            Priority: Minor
>
> Currently, Spark ML provides coefficients for logistic regression. To
> evaluate the trained model, tests like the Wald test and chi-square tests
> should be run and their results summarized and displayed like R's GLM
> summary.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13073) creating R like summary for logistic Regression in Spark - Scala
[ https://issues.apache.org/jira/browse/SPARK-13073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189283#comment-15189283 ]

Mohamed Baddar commented on SPARK-13073:
----------------------------------------

[~josephkb] After looking at the source code of org.apache.spark.ml.classification.LogisticRegressionSummary and org.apache.spark.ml.classification.LogisticRegressionTrainingSummary, and after running a sample GLM in R, which has the following output:

{code}
Call:
glm(formula = mpg ~ wt + hp + gear, family = gaussian(), data = mtcars)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.3712  -1.9017  -0.3444   0.9883   6.0655

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 32.013657   4.632264   6.911 1.64e-07 ***
wt          -3.197811   0.846546  -3.777 0.000761 ***
hp          -0.036786   0.009891  -3.719 0.000888 ***
gear         1.019981   0.851408   1.198 0.240963
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 6.626347)

    Null deviance: 1126.05  on 31  degrees of freedom
Residual deviance:  185.54  on 28  degrees of freedom
AIC: 157.05

Number of Fisher Scoring iterations: 2
{code}

I have the following comments:
1. I think we should add the following members to LogisticRegressionSummary: coefficients and residuals.
2. toString should be overridden in the following classes: org.apache.spark.ml.classification.BinaryLogisticRegressionSummary and org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary.
Any other suggestions? Please correct me if I have missed something.

> creating R like summary for logistic Regression in Spark - Scala
> ----------------------------------------------------------------
>
>                 Key: SPARK-13073
>                 URL: https://issues.apache.org/jira/browse/SPARK-13073
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, MLlib
>            Reporter: Samsudhin
>            Priority: Minor
>
> Currently, Spark ML provides coefficients for logistic regression. To
> evaluate the trained model, tests like the Wald test and chi-square tests
> should be run and their results summarized and displayed like R's GLM
> summary.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
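As a concrete (and entirely hypothetical) illustration of the toString idea discussed in this thread, here is roughly the kind of block such an override might render. The field set (precision, recall, F-measure) follows the later comments in this thread; the layout itself is invented for this sketch, in plain Python rather than Scala:

```python
def summary_string(precision, recall, f_measure, threshold=None):
    """Render binary-classification summary metrics as a compact,
    R-flavoured text block. Purely illustrative: the class name in the
    header and the layout are not Spark's actual output."""
    lines = ["BinaryLogisticRegressionSummary"]
    if threshold is not None:
        lines.append("Threshold: %.3f" % threshold)
    lines += ["Precision: %.4f" % precision,
              "Recall:    %.4f" % recall,
              "F-measure: %.4f" % f_measure]
    return "\n".join(lines)

print(summary_string(0.9, 0.8, 0.8421, threshold=0.5))
```

The contrast with R's GLM summary above is deliberate: a binary classifier summary naturally reports threshold-based metrics, whereas R's table reports per-coefficient estimates and tests.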
[jira] [Commented] (SPARK-13073) creating R like summary for logistic Regression in Spark - Scala
[ https://issues.apache.org/jira/browse/SPARK-13073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187935#comment-15187935 ]

Mohamed Baddar commented on SPARK-13073:
----------------------------------------

[~josephkb] Can you assign this to me as a starter task?

> creating R like summary for logistic Regression in Spark - Scala
> ----------------------------------------------------------------
>
>                 Key: SPARK-13073
>                 URL: https://issues.apache.org/jira/browse/SPARK-13073
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, MLlib
>            Reporter: Samsudhin
>            Priority: Minor
>
> Currently, Spark ML provides coefficients for logistic regression. To
> evaluate the trained model, tests like the Wald test and chi-square tests
> should be run and their results summarized and displayed like R's GLM
> summary.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-13073) creating R like summary for logistic Regression in Spark - Scala
[ https://issues.apache.org/jira/browse/SPARK-13073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mohamed Baddar updated SPARK-13073:
-----------------------------------
    Comment: was deleted

(was: [~josephkb] If nobody is working on it, can I start on this issue as a starter task?)

> creating R like summary for logistic Regression in Spark - Scala
> ----------------------------------------------------------------
>
>                 Key: SPARK-13073
>                 URL: https://issues.apache.org/jira/browse/SPARK-13073
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, MLlib
>            Reporter: Samsudhin
>            Priority: Minor
>
> Currently, Spark ML provides coefficients for logistic regression. To
> evaluate the trained model, tests like the Wald test and chi-square tests
> should be run and their results summarized and displayed like R's GLM
> summary.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13073) creating R like summary for logistic Regression in Spark - Scala
[ https://issues.apache.org/jira/browse/SPARK-13073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15183575#comment-15183575 ]

Mohamed Baddar commented on SPARK-13073:
----------------------------------------

[~josephkb] If nobody is working on it, can I start on this issue as a starter task?

> creating R like summary for logistic Regression in Spark - Scala
> ----------------------------------------------------------------
>
>                 Key: SPARK-13073
>                 URL: https://issues.apache.org/jira/browse/SPARK-13073
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, MLlib
>            Reporter: Samsudhin
>            Priority: Minor
>
> Currently, Spark ML provides coefficients for logistic regression. To
> evaluate the trained model, tests like the Wald test and chi-square tests
> should be run and their results summarized and displayed like R's GLM
> summary.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10791) Optimize MLlib LDA topic distribution query performance
[ https://issues.apache.org/jira/browse/SPARK-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955408#comment-14955408 ]

Mohamed Baddar commented on SPARK-10791:
----------------------------------------

[~aspa] Would you please point to the specific thread at https://mail-archives.apache.org/mod_mbox/spark-user/201509.mbox/browser that discusses this performance issue? I am working on [SPARK-10808].

> Optimize MLlib LDA topic distribution query performance
> -------------------------------------------------------
>
>                 Key: SPARK-10791
>                 URL: https://issues.apache.org/jira/browse/SPARK-10791
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.5.0
>         Environment: Ubuntu 13.10, Oracle Java 8
>            Reporter: Marko Asplund
>
> I've been testing MLlib LDA training with 100 topics, 105 K vocabulary size
> and ~3.4 M documents using EMLDAOptimizer.
> Training the model took ~2.5 hours with MLlib, whereas training with the
> same data on the same system took ~5 minutes with Vowpal Wabbit.
> Loading the persisted model from disk (~2 minutes), as well as querying LDA
> model topic distributions (~4 seconds for one document), are also quite slow
> operations.
> Our application queries the LDA model topic distribution (for one doc at a
> time) as part of an end-user operation execution flow, so a ~4 second
> execution time is very problematic.
> The log includes the following message, which, AFAIK, should mean that
> netlib-java is using a machine-optimised native implementation:
> "com.github.fommil.jni.JniLoader - successfully loaded
> /tmp/jniloader4682745056459314976netlib-native_system-linux-x86_64.so"
> My test code can be found here:
> https://github.com/marko-asplund/tech-protos/blob/08e9819a2108bf6bd4d878253c4aa32510a0a9ce/mllib-lda/src/main/scala/fi/markoa/proto/mllib/LDADemo.scala#L56-L57
> I also tried using the OnlineLDAOptimizer, but there wasn't a noticeable
> change in training performance. Model loading time was reduced to ~5 seconds
> from ~2 minutes (now persisted as LocalLDAModel). However, query/prediction
> time was unchanged, and unfortunately this is the critical performance
> characteristic in our case.
> I did some profiling of my LDA prototype code that requests topic
> distributions from a model. According to Java Mission Control, more than 80%
> of execution time during the sample interval is spent in the following
> methods:
> - org.apache.commons.math3.util.FastMath.log(double); count: 337; 47.07%
> - org.apache.commons.math3.special.Gamma.digamma(double); count: 164; 22.91%
> - org.apache.commons.math3.util.FastMath.log(double, double[]); count: 50; 6.98%
> - java.lang.Double.valueOf(double); count: 31; 4.33%
> Is there any way of using the API more optimally?
> Are there any opportunities for optimising the "topicDistributions" code
> path in MLlib?
> My query test code looks like this essentially:
> {code}
> // executed once
> val model = LocalLDAModel.load(ctx, ModelFileName)
> // executed four times
> val samples = Transformers.toSparseVectors(vocabularySize,
>   ctx.parallelize(Seq(input))) // fast
> model.topicDistributions(samples.zipWithIndex.map(_.swap)) // <== this
>   seems to take about 4 seconds to execute
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
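Since `Gamma.digamma` and `FastMath.log` dominate the profile above, it helps to see what each digamma call actually costs: a shift recurrence followed by an asymptotic-series evaluation, performed once per topic per inference step. A plain-Python sketch of that standard scheme (the same general structure commons-math uses, though its exact thresholds and constants may differ):

```python
import math

def digamma(x):
    """Digamma (psi) for x > 0 via the standard shift-then-asymptotic
    scheme: use psi(x) = psi(x + 1) - 1/x to push the argument above 6,
    then apply the large-x asymptotic expansion
    psi(x) ~ ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x   # recurrence step
        x += 1.0
    inv = 1.0 / (x * x)
    result += math.log(x) - 0.5 / x \
        - inv * (1.0 / 12 - inv * (1.0 / 120 - inv / 252))
    return result

# Known values: psi(1) = -gamma, psi(0.5) = -gamma - 2 ln 2
print(digamma(1.0), digamma(0.5))
```

A small argument (common in variational LDA, where digamma is applied to Dirichlet parameters) triggers several recurrence iterations plus a log per call, which is consistent with the log/digamma-heavy profile reported above; batching documents per `topicDistributions` call at least amortizes the surrounding job overhead, though not this per-token arithmetic.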
[jira] [Commented] (SPARK-10808) LDA user guide: discuss running time of LDA
[ https://issues.apache.org/jira/browse/SPARK-10808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908198#comment-14908198 ]

Mohamed Baddar commented on SPARK-10808:
----------------------------------------

Thanks [~josephkb], working on it.

> LDA user guide: discuss running time of LDA
> -------------------------------------------
>
>                 Key: SPARK-10808
>                 URL: https://issues.apache.org/jira/browse/SPARK-10808
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation, MLlib
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>
> Based on feedback like [SPARK-10791], we should discuss the computational and
> communication complexity of LDA and its optimizers in the MLlib Programming
> Guide. E.g.:
> * Online LDA can be faster than EM.
> * To make online LDA run faster, you can use a smaller miniBatchFraction.
> * Communication
> ** For EM, communication on each iteration is on the order of
>    # topics * (vocabSize + # docs).
> ** For online LDA, communication on each iteration is on the order of
>    # topics * vocabSize.
> * Decreasing vocabSize and # topics can speed things up. It's often fine to
> eliminate uncommon words, unless you are trying to create a very large number
> of topics.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
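The communication bullets in the issue description can be turned into a quick back-of-the-envelope calculator. Plugging in the SPARK-10791 workload (100 topics, 105 K vocabulary, ~3.4 M documents) shows why EM ships far more state per iteration than online LDA; the function name is invented for this sketch, and the counts are order-of-magnitude values, not bytes:

```python
def lda_comm_per_iter(num_topics, vocab_size, num_docs, optimizer="em"):
    """Back-of-the-envelope per-iteration communication (in values
    shipped), following the guidance above: EM scales with
    k * (vocabSize + numDocs); online LDA with k * vocabSize."""
    if optimizer == "em":
        return num_topics * (vocab_size + num_docs)
    if optimizer == "online":
        return num_topics * vocab_size
    raise ValueError("unknown optimizer: %s" % optimizer)

# The SPARK-10791 workload: 100 topics, 105K vocabulary, ~3.4M docs
em = lda_comm_per_iter(100, 105_000, 3_400_000, "em")
online = lda_comm_per_iter(100, 105_000, 3_400_000, "online")
print(em, online)  # EM ships ~33x more values per iteration here
```

This also makes the guide's last bullet quantitative: halving vocabSize halves online LDA's per-iteration communication, while for EM with many more documents than words, the # docs term dominates.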
[jira] [Commented] (SPARK-10808) LDA user guide: discuss running time of LDA
[ https://issues.apache.org/jira/browse/SPARK-10808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906975#comment-14906975 ]

Mohamed Baddar commented on SPARK-10808:
----------------------------------------

Hello [~josephkb], can I take this task? Thanks.

> LDA user guide: discuss running time of LDA
> -------------------------------------------
>
>                 Key: SPARK-10808
>                 URL: https://issues.apache.org/jira/browse/SPARK-10808
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation, MLlib
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>
> Based on feedback like [SPARK-10791], we should discuss the computational and
> communication complexity of LDA and its optimizers in the MLlib Programming
> Guide. E.g.:
> * Online LDA can be faster than EM.
> * To make online LDA run faster, you can use a smaller miniBatchFraction.
> * Communication
> ** For EM, communication on each iteration is on the order of
>    # topics * (vocabSize + # docs).
> ** For online LDA, communication on each iteration is on the order of
>    # topics * vocabSize.
> * Decreasing vocabSize and # topics can speed things up. It's often fine to
> eliminate uncommon words, unless you are trying to create a very large number
> of topics.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9836) Provide R-like summary statistics for ordinary least squares via normal equation solver
[ https://issues.apache.org/jira/browse/SPARK-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904471#comment-14904471 ]

Mohamed Baddar edited comment on SPARK-9836 at 9/23/15 8:39 PM:
----------------------------------------------------------------

Thanks a lot [~mengxr], I will try one of the starter tasks, but it seems they are all taken. If so, what should I do next?

was (Author: mbaddar):
Thanks a lot, I will try one of the starter tasks, but it seems they are all taken. If so, what should I do next?

> Provide R-like summary statistics for ordinary least squares via normal
> equation solver
> -----------------------------------------------------------------------
>
>                 Key: SPARK-9836
>                 URL: https://issues.apache.org/jira/browse/SPARK-9836
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Xiangrui Meng
>
> In R, model fitting comes with summary statistics. We can provide most of
> those via the normal equation solver (SPARK-9834). If some statistics require
> additional passes over the dataset, we can expose an option to let users
> select desired statistics before model fitting.
> {code}
> > summary(model)
>
> Call:
> glm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
>
> Deviance Residuals:
>      Min        1Q    Median        3Q       Max
> -1.30711  -0.25713  -0.05325   0.19542   1.41253
>
> Coefficients:
>                   Estimate Std. Error t value Pr(>|t|)
> (Intercept)         2.2514     0.3698   6.089 9.57e-09 ***
> Sepal.Width         0.8036     0.1063   7.557 4.19e-12 ***
> Speciesversicolor   1.4587     0.1121  13.012  < 2e-16 ***
> Speciesvirginica    1.9468     0.1000  19.465  < 2e-16 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> (Dispersion parameter for gaussian family taken to be 0.1918059)
>
>     Null deviance: 102.168  on 149  degrees of freedom
> Residual deviance:  28.004  on 146  degrees of freedom
> AIC: 183.94
>
> Number of Fisher Scoring iterations: 2
> {code}
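The Estimate / Std. Error / t value columns of the R summary above all fall out of a normal-equation fit. As a minimal sketch of where those numbers come from, here is a pure-Python simple (one-predictor) regression with the textbook closed forms; this is an illustration of the statistics, not Spark's implementation:

```python
import math

def ols_summary(x, y):
    """Fit y ~ b0 + b1*x via the normal equations and return R-style
    (estimate, std. error, t value) rows for the intercept and slope."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                       # slope from the normal equations
    b0 = ybar - b1 * xbar                # intercept
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / (n - 2)               # dispersion (residual variance)
    se1 = math.sqrt(sigma2 / sxx)
    se0 = math.sqrt(sigma2 * (1.0 / n + xbar ** 2 / sxx))
    return {
        "(Intercept)": (b0, se0, b0 / se0 if se0 else float("nan")),
        "x":           (b1, se1, b1 / se1 if se1 else float("nan")),
    }
```

The p-value column (`Pr(>|t|)`) would additionally need the t-distribution CDF with n - 2 degrees of freedom; everything shown here needs only sums already accumulated by a normal-equation solver, which is why these statistics come essentially for free after the fit.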
[jira] [Commented] (SPARK-9836) Provide R-like summary statistics for ordinary least squares via normal equation solver
[ https://issues.apache.org/jira/browse/SPARK-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904471#comment-14904471 ]

Mohamed Baddar commented on SPARK-9836:
---------------------------------------

Thanks a lot, I will try one of the starter tasks, but it seems they are all taken. If so, what should I do next?

> Provide R-like summary statistics for ordinary least squares via normal
> equation solver
> -----------------------------------------------------------------------
[jira] [Commented] (SPARK-9798) CrossValidatorModel Documentation Improvements
[ https://issues.apache.org/jira/browse/SPARK-9798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904467#comment-14904467 ]

Mohamed Baddar commented on SPARK-9798:
---------------------------------------

Hello rerngvit, I am also new to contributing. Can we work together, or split this task into subtasks, so that we can both get involved? Thanks.

> CrossValidatorModel Documentation Improvements
> ----------------------------------------------
>
>                 Key: SPARK-9798
>                 URL: https://issues.apache.org/jira/browse/SPARK-9798
>             Project: Spark
>          Issue Type: Documentation
>          Components: ML
>            Reporter: Feynman Liang
>            Priority: Minor
>              Labels: starter
>
> CrossValidatorModel's avgMetrics and bestModel need documentation.
[jira] [Commented] (SPARK-9835) Iteratively reweighted least squares solver for GLMs
[ https://issues.apache.org/jira/browse/SPARK-9835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900694#comment-14900694 ]

Mohamed Baddar commented on SPARK-9835:
---------------------------------------

Can I work on this issue? Thanks.

> Iteratively reweighted least squares solver for GLMs
> ----------------------------------------------------
>
>                 Key: SPARK-9835
>                 URL: https://issues.apache.org/jira/browse/SPARK-9835
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>
> After SPARK-9834, we can implement an iteratively reweighted least squares
> (IRLS) solver for GLMs with other families and link functions. It could
> provide R-like summary statistics after training, but the number of features
> cannot be very large, e.g. more than 4096.
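To illustrate the IRLS idea this issue proposes, here is a minimal pure-Python sketch for a one-coefficient logistic GLM with no intercept. In that special case the weighted least-squares "matrix" is a scalar, so each IRLS step is a scalar Newton update; a real solver would solve a weighted normal-equation system per iteration. This is an illustration of the algorithm, not Spark's implementation:

```python
import math

def irls_logistic(x, y, iters=25):
    """Fit a one-coefficient logistic GLM (no intercept) by IRLS.
    Each step reweights by the variance function mu*(1-mu) and takes
    a Newton step: beta += (X'W X)^-1 X'(y - mu), scalar here."""
    beta = 0.0
    for _ in range(iters):
        mu = [1.0 / (1.0 + math.exp(-beta * xi)) for xi in x]
        grad = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        hess = sum(xi * xi * mi * (1.0 - mi) for xi, mi in zip(x, mu))
        if hess == 0.0:
            break
        beta += grad / hess            # IRLS / Newton update
    return beta
```

Note that the issue's caveat about feature count (e.g. at most ~4096) comes from the full version of this update: each iteration solves a dense d x d weighted normal-equation system, which scales cubically in the number of features.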
[jira] [Commented] (SPARK-9836) Provide R-like summary statistics for ordinary least squares via normal equation solver
[ https://issues.apache.org/jira/browse/SPARK-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900678#comment-14900678 ]

Mohamed Baddar commented on SPARK-9836:
---------------------------------------

Hello, can I be assigned to this task? Thanks.

> Provide R-like summary statistics for ordinary least squares via normal
> equation solver
> -----------------------------------------------------------------------