[jira] [Commented] (SPARK-24579) SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
[ https://issues.apache.org/jira/browse/SPARK-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530633#comment-16530633 ] Seth Hendrickson commented on SPARK-24579:
------------------------------------------

Hmm... Am I the only one who cannot see comments on the doc?

> SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
> ----------------------------------------------------------------------------
>
> Key: SPARK-24579
> URL: https://issues.apache.org/jira/browse/SPARK-24579
> Project: Spark
> Issue Type: Epic
> Components: ML, PySpark, SQL
> Affects Versions: 3.0.0
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
> Priority: Major
> Labels: Hydrogen
> Attachments: [SPARK-24579] SPIP_ Standardize Optimized Data Exchange between Apache Spark and DL%2FAI Frameworks .pdf
>
> (see attached SPIP pdf for more details)
>
> At the crossroads of big data and AI, we see both the success of Apache Spark as a unified analytics engine and the rise of AI frameworks like TensorFlow and Apache MXNet (incubating). Both big data and AI are indispensable components of business innovation, and there have been multiple attempts from both communities to bring them together.
>
> We have seen efforts from the AI community to implement data solutions for AI frameworks, such as tf.data and tf.Transform. However, with 50+ data sources and built-in SQL, DataFrames, and Streaming features, Spark remains the community's choice for big data. This is why there have been many efforts to integrate DL/AI frameworks with Spark to leverage its power - for example, the TFRecords data source for Spark, TensorFlowOnSpark, TensorFrames, etc. As part of Project Hydrogen, this SPIP takes a different angle on Spark + AI unification.
>
> None of these integrations is possible without exchanging data between Spark and external DL/AI frameworks, and performance matters. However, there is no standard way to exchange data, so implementation and performance-optimization efforts are fragmented. For example, TensorFlowOnSpark uses Hadoop InputFormat/OutputFormat for TensorFlow's TFRecords to load and save data and passes the RDD records to TensorFlow in Python, while TensorFrames converts Spark DataFrame Rows to/from TensorFlow Tensors using TensorFlow's Java API. How can we reduce this complexity?
>
> The proposal here is to standardize the data exchange interface (or format) between Spark and DL/AI frameworks and to optimize data conversion from/to this interface. DL/AI frameworks could then leverage Spark to load data from virtually anywhere without spending extra effort building complex data solutions, such as reading features from a production data warehouse or streaming model inference. Spark users could use DL/AI frameworks without learning their framework-specific data APIs. And developers on both sides could work on performance optimizations independently, given that the interface itself does not introduce significant overhead.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23704) PySpark access of individual trees in random forest is slow
[ https://issues.apache.org/jira/browse/SPARK-23704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520832#comment-16520832 ] Seth Hendrickson commented on SPARK-23704:
------------------------------------------

Instead of

{code:java}
model.trees[0].transform(test_feat).select('rowNum', 'probability')
{code}

can you try

{code:java}
trees = model.trees
trees[0].transform(test_feat).select('rowNum', 'probability')
{code}

and time only the second line? The first line actually calls into the JVM and creates new trees in Python.

> PySpark access of individual trees in random forest is slow
> -----------------------------------------------------------
>
> Key: SPARK-23704
> URL: https://issues.apache.org/jira/browse/SPARK-23704
> Project: Spark
> Issue Type: Improvement
> Components: ML, PySpark
> Affects Versions: 2.2.1
> Environment: PySpark 2.2.1 / Windows 10
> Reporter: Julian King
> Priority: Minor
>
> Making predictions from a RandomForestClassifier in PySpark is much faster than making predictions from an individual tree contained in the {{.trees}} attribute. In fact, the {{model.transform}} call without an action is more than 10x slower for an individual tree than the {{model.transform}} call for the random forest model.
>
> See https://stackoverflow.com/questions/49297470/slow-individual-tree-access-for-random-forest-in-pyspark for an example with timings.
>
> Ideally:
> * Getting a prediction from a single tree should be comparable to or faster than getting predictions from the whole forest.
> * Getting predictions from all the individual trees should be comparable in speed to getting the predictions from the random forest.
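The suggestion above isolates a general pattern: `model.trees` is a property that round-trips to the JVM and builds fresh Python wrapper objects on every access, so repeated attribute access repeats that cost. A minimal pure-Python sketch of the pattern (the `RandomForestModelSketch` class is a hypothetical stand-in, not Spark's implementation):

```python
class TreeWrapper:
    """Stand-in for a per-tree Python wrapper built from a JVM handle."""
    def __init__(self, idx):
        self.idx = idx

class RandomForestModelSketch:
    """Hypothetical model whose `trees` property rebuilds its wrappers on
    every access, mimicking the JVM round-trip in PySpark."""
    def __init__(self, n_trees):
        self._n_trees = n_trees

    @property
    def trees(self):
        # Rebuilt on every access -- this is the hidden cost.
        return [TreeWrapper(i) for i in range(self._n_trees)]

model = RandomForestModelSketch(100)

# Each property access constructs brand-new wrapper objects:
assert model.trees[0] is not model.trees[0]

# Binding the list once pays the construction cost a single time:
trees = model.trees
assert trees[0] is trees[0]
```

Timing only `trees[0].transform(...)` after binding `trees = model.trees` therefore measures the per-tree prediction cost without the wrapper-construction cost.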
[jira] [Resolved] (SPARK-3159) Check for reducible DecisionTree
[ https://issues.apache.org/jira/browse/SPARK-3159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seth Hendrickson resolved SPARK-3159.
-------------------------------------

Resolution: Fixed
Fix Version/s: 2.4.0

> Check for reducible DecisionTree
> --------------------------------
>
> Key: SPARK-3159
> URL: https://issues.apache.org/jira/browse/SPARK-3159
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Joseph K. Bradley
> Priority: Minor
> Fix For: 2.4.0
>
> Improvement: test-time computation
>
> Currently, pairs of leaf nodes with the same parent can both output the same prediction. This happens because the splitting criterion (e.g., Gini) is not the same as prediction accuracy/MSE; the splitting criterion can sometimes be improved even when both children would still output the same prediction (e.g., based on the majority label for classification). We could check the tree and reduce it if possible after training.
>
> Note: This happens with scikit-learn as well.
[jira] [Commented] (SPARK-23437) [ML] Distributed Gaussian Process Regression for MLlib
[ https://issues.apache.org/jira/browse/SPARK-23437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368109#comment-16368109 ] Seth Hendrickson commented on SPARK-23437:
------------------------------------------

TBH, this seems like a pretty reasonable request. While I agree we do seem to tell people that the "standard" practice is to implement as a third-party package and then integrate later, I don't see this happening in practice. I don't know that we've even validated that the "implement as a third-party package, then in Spark later on" approach really works.

Perhaps an even stronger reason for resisting new algorithms is simply the lack of reviewer/developer support on Spark ML. It's hard to predict whether anyone will review the PR within a reasonable amount of time, even if the code is well designed. AFAIK, we haven't added any major algorithms since GeneralizedLinearRegression, which must have been a couple of years ago.

That said, I think this is something to at least consider. We can start by discussing which algorithms exist and why we'd choose a particular one. Strong arguments for why we need GPs in Spark ML would also help. The point that there is no non-parametric regression algorithm in Spark other than trees has some merit, but we don't write new algorithms just for the sake of filling in gaps - there needs to be user demand (which, unfortunately, is often hard to prove). It also helps to point to a package that already implements the algorithm you're proposing, but, for example, I don't believe scikit-learn implements the linear-time version, so we can't really leverage their experience. Providing more information on any or all of these categories will help make a stronger case, and I do think GPs could be a useful addition. Thanks for leading the discussion!

> [ML] Distributed Gaussian Process Regression for MLlib
> ------------------------------------------------------
>
> Key: SPARK-23437
> URL: https://issues.apache.org/jira/browse/SPARK-23437
> Project: Spark
> Issue Type: New Feature
> Components: ML, MLlib
> Affects Versions: 2.2.1
> Reporter: Valeriy Avanesov
> Priority: Major
>
> Gaussian Process Regression (GP) is a well-known black-box non-linear regression approach [1]. For years the approach remained inapplicable to large samples due to its cubic computational complexity, but more recent techniques (Sparse GP) require only linear complexity. The field continues to attract the interest of researchers - several papers devoted to GPs were presented at NIPS 2017.
>
> Unfortunately, the non-parametric regression techniques shipped with MLlib are restricted to tree-based approaches.
>
> I propose to create and include an implementation (which I am going to work on) of the so-called robust Bayesian Committee Machine proposed and investigated in [2].
>
> [1] Carl Edward Rasmussen and Christopher K. I. Williams. 2005. _Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning)_. The MIT Press.
> [2] Marc Peter Deisenroth and Jun Wei Ng. 2015. Distributed Gaussian processes. In _Proceedings of the 32nd International Conference on Machine Learning - Volume 37_ (ICML'15), Francis Bach and David Blei (Eds.), Vol. 37. JMLR.org, 1481-1490.
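For context on the cubic complexity mentioned above: exact GP prediction requires solving a linear system in the n-by-n kernel matrix, which costs O(n^3) in the number of training points. A minimal NumPy sketch of the exact (non-sparse) posterior mean with an RBF kernel - the kernel hyperparameters and data here are illustrative assumptions, not anything from the proposal:

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.7):
    """Squared-exponential kernel k(x, x') = exp(-(x - x')^2 / (2 l^2))."""
    sq = (a[:, None] - b[None, :]) ** 2
    return np.exp(-sq / (2.0 * length_scale ** 2))

def gp_predict(x_train, y_train, x_test, noise=1e-6):
    """Exact GP posterior mean k(x*, X) (K + sigma^2 I)^{-1} y."""
    k = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_star = rbf_kernel(x_test, x_train)
    alpha = np.linalg.solve(k, y_train)   # the O(n^3) step
    return k_star @ alpha

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.sin(x)
mu = gp_predict(x, y, np.array([1.0]))
# With near-zero noise, the posterior mean interpolates the training targets.
assert abs(mu[0] - np.sin(1.0)) < 1e-4
```

Sparse GP and committee methods such as the rBCM in [2] avoid materializing and inverting the full n-by-n matrix, which is what makes a distributed implementation feasible.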
[jira] [Commented] (SPARK-17139) Add model summary for MultinomialLogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341213#comment-16341213 ] Seth Hendrickson commented on SPARK-17139:
------------------------------------------

Good catch. Apart from redesigning this patch, I'm not sure I see a way to avoid it either.

> Add model summary for MultinomialLogisticRegression
> ---------------------------------------------------
>
> Key: SPARK-17139
> URL: https://issues.apache.org/jira/browse/SPARK-17139
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Reporter: Seth Hendrickson
> Assignee: Weichen Xu
> Priority: Major
> Fix For: 2.3.0
>
> Add a model summary to multinomial logistic regression using the same interface as in other ML models.
[jira] [Commented] (SPARK-23138) Add user guide example for multiclass logistic regression summary
[ https://issues.apache.org/jira/browse/SPARK-23138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329783#comment-16329783 ] Seth Hendrickson commented on SPARK-23138:
------------------------------------------

I can submit a PR for this soon.

> Add user guide example for multiclass logistic regression summary
> -----------------------------------------------------------------
>
> Key: SPARK-23138
> URL: https://issues.apache.org/jira/browse/SPARK-23138
> Project: Spark
> Issue Type: Documentation
> Components: ML
> Affects Versions: 2.3.0
> Reporter: Seth Hendrickson
> Priority: Minor
>
> We haven't updated the user guide to reflect the multiclass logistic regression summary added in SPARK-17139.
[jira] [Created] (SPARK-23138) Add user guide example for multiclass logistic regression summary
Seth Hendrickson created SPARK-23138:
------------------------------------

Summary: Add user guide example for multiclass logistic regression summary
Key: SPARK-23138
URL: https://issues.apache.org/jira/browse/SPARK-23138
Project: Spark
Issue Type: Documentation
Components: ML
Affects Versions: 2.3.0
Reporter: Seth Hendrickson

We haven't updated the user guide to reflect the multiclass logistic regression summary added in SPARK-17139.
[jira] [Created] (SPARK-22993) checkpointInterval param doc should be clearer
Seth Hendrickson created SPARK-22993:
------------------------------------

Summary: checkpointInterval param doc should be clearer
Key: SPARK-22993
URL: https://issues.apache.org/jira/browse/SPARK-22993
Project: Spark
Issue Type: Documentation
Components: ML
Affects Versions: 2.3.0
Reporter: Seth Hendrickson
Priority: Trivial

Several algorithms use the shared parameter {{HasCheckpointInterval}} (ALS, LDA, GBT), and each of them silently ignores the parameter when the checkpoint directory is not set on the Spark context. This should be documented in the param doc.
[jira] [Created] (SPARK-22461) Move Spark ML model summaries into a dedicated package
Seth Hendrickson created SPARK-22461:
------------------------------------

Summary: Move Spark ML model summaries into a dedicated package
Key: SPARK-22461
URL: https://issues.apache.org/jira/browse/SPARK-22461
Project: Spark
Issue Type: Improvement
Components: ML
Affects Versions: 2.3.0
Reporter: Seth Hendrickson
Priority: Minor

Summaries in ML right now do not adhere to a common abstraction, and they are usually placed in the same file as the algorithm, which makes those files unwieldy. We can and should unify them under one hierarchy, perhaps in a new {{summary}} module.
[jira] [Commented] (SPARK-22433) Linear regression R^2 train/test terminology related
[ https://issues.apache.org/jira/browse/SPARK-22433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16238137#comment-16238137 ] Seth Hendrickson commented on SPARK-22433:
------------------------------------------

The main problem I see is that we put "r2" in the {{RegressionEvaluator}} class, which can be used for all types of regression - e.g. DecisionTreeRegressor, for which it is nonsensical. Removing it would break compatibility and is probably not worth it, since the end user is responsible for using the tools appropriately anyway. I'm not sure there is much to do here.

AFAIK, using r2 on regularized models is a fuzzy area, but I don't think it does much harm to leave it, and I don't think we'd be concerned about our test cases. Certainly, unit tests don't imply an endorsement of the methodology anyway.

> Linear regression R^2 train/test terminology related
> ----------------------------------------------------
>
> Key: SPARK-22433
> URL: https://issues.apache.org/jira/browse/SPARK-22433
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.2.0
> Reporter: Teng Peng
> Priority: Minor
>
> Traditional statistics is traditional statistics. Its goals, framework, and terminology are not the same as ML's. However, in the linear regression related components this distinction is not clear, which is reflected in the following:
> 1. RegressionMetrics + RegressionEvaluator:
> * R2 shouldn't be there.
> * A better name would be "regressionPredictionMetric".
> 2. LinearRegressionSuite:
> * Shouldn't test R2 and residuals on test data.
> * There is no train set and test set in this setting.
> 3. Terminology: there is no "linear regression with L1 regularization". Linear regression is linear; once a penalty term is added, it is no longer plain linear regression. Just call it "LASSO" or "ElasticNet".
> There are more. I am working on correcting them. They are not breaking anything, but it does not feel good to see the basic distinction blurred.
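For reference, the metric under discussion is the coefficient of determination,

```latex
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
```

The familiar guarantees for R^2 - that it lies in [0, 1] and equals the squared correlation between fitted and observed values - hold for ordinary least squares with an intercept, evaluated on the training data. For regularized or tree-based models, or on held-out data, R^2 can even be negative, which is the sense in which applying it there is "fuzzy".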
[jira] [Commented] (SPARK-17136) Design optimizer interface for ML algorithms
[ https://issues.apache.org/jira/browse/SPARK-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16179311#comment-16179311 ] Seth Hendrickson commented on SPARK-17136:
------------------------------------------

Ping [~yanboliang]. This is relevant since we have recently been making attempts to provide added features/optimizers/algorithms around linear/logistic regression. This would be a good step toward building interfaces that can be extended in Spark ML.

Could you elaborate on mimicking Spark SQL? One concern I have is that, under the current proposal, we'd have a parameter {{setMinimizer}} that takes a generic Scala class which can't easily be serialized to Python, etc. - it wouldn't be compatible. Maybe we could use reflection like Spark SQL does, but you'd still have to implement custom optimizers in Scala. Anyway, I think this, and work related to it, would be really beneficial to Spark ML.

> Design optimizer interface for ML algorithms
> --------------------------------------------
>
> Key: SPARK-17136
> URL: https://issues.apache.org/jira/browse/SPARK-17136
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Reporter: Seth Hendrickson
>
> We should consider designing an interface that allows users to use their own optimizers in some of the ML algorithms, similar to MLlib.
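To make the interface question concrete, here is a hedged Python sketch of what a pluggable minimizer contract might look like. The names (`Minimizer`, `GradientDescent`, `value_and_grad`) are hypothetical illustrations, not the proposed Spark API; the sketch also shows why serialization is awkward - the injected object is arbitrary user code:

```python
from abc import ABC, abstractmethod

class Minimizer(ABC):
    """Hypothetical contract: minimize f given a value-and-gradient callback."""
    @abstractmethod
    def minimize(self, value_and_grad, init):
        ...

class GradientDescent(Minimizer):
    """One possible user-supplied implementation."""
    def __init__(self, lr=0.1, iters=200):
        self.lr, self.iters = lr, iters

    def minimize(self, value_and_grad, init):
        x = list(init)
        for _ in range(self.iters):
            _, grad = value_and_grad(x)
            x = [xi - self.lr * gi for xi, gi in zip(x, grad)]
        return x

# f(x) = (x0 - 3)^2 + (x1 + 1)^2, minimized at (3, -1).
def quad(x):
    val = (x[0] - 3) ** 2 + (x[1] + 1) ** 2
    grad = [2 * (x[0] - 3), 2 * (x[1] + 1)]
    return val, grad

# The algorithm would receive the minimizer via setMinimizer-style injection:
solution = GradientDescent(lr=0.1, iters=200).minimize(quad, [0.0, 0.0])
assert abs(solution[0] - 3) < 1e-6 and abs(solution[1] + 1) < 1e-6
```

Because a `GradientDescent` instance like this is an arbitrary object rather than a name resolvable by reflection, shipping it from Python to a JVM-side algorithm (or vice versa) has no obvious path, which is the compatibility concern raised above.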
[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16163106#comment-16163106 ] Seth Hendrickson commented on SPARK-19634:
------------------------------------------

Is there a plan for moving the linear algorithms that use the summarizer to this new implementation?

> Feature parity for descriptive statistics in MLlib
> --------------------------------------------------
>
> Key: SPARK-19634
> URL: https://issues.apache.org/jira/browse/SPARK-19634
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Affects Versions: 2.1.0
> Reporter: Timothy Hunter
> Assignee: Timothy Hunter
> Fix For: 2.3.0
>
> This ticket tracks porting the functionality of spark.mllib.MultivariateOnlineSummarizer over to spark.ml.
>
> A design has been discussed in SPARK-19208. Here is a design doc: https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#
[jira] [Updated] (SPARK-21245) Resolve code duplication for classification/regression summarizers
[ https://issues.apache.org/jira/browse/SPARK-21245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seth Hendrickson updated SPARK-21245:
-------------------------------------

Labels: starter (was: )
Priority: Minor (was: Major)

> Resolve code duplication for classification/regression summarizers
> ------------------------------------------------------------------
>
> Key: SPARK-21245
> URL: https://issues.apache.org/jira/browse/SPARK-21245
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Affects Versions: 2.2.1
> Reporter: Seth Hendrickson
> Priority: Minor
> Labels: starter
>
> In several places in Spark ML (LogReg, LinReg, SVC), we collect summary information about training data using {{MultivariateOnlineSummarizer}} and {{MulticlassSummarizer}}. The same code appears in several places (including test suites). We can eliminate this by creating a common implementation somewhere.
[jira] [Commented] (SPARK-21405) Add LBFGS solver for GeneralizedLinearRegression
[ https://issues.apache.org/jira/browse/SPARK-21405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16087418#comment-16087418 ] Seth Hendrickson commented on SPARK-21405:
------------------------------------------

Good point, Nick. Conveniently, though, the machinery to deal with this is already in place: https://github.com/apache/spark/pull/15930

> Add LBFGS solver for GeneralizedLinearRegression
> ------------------------------------------------
>
> Key: SPARK-21405
> URL: https://issues.apache.org/jira/browse/SPARK-21405
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.3.0
> Reporter: Seth Hendrickson
>
> GeneralizedLinearRegression in Spark ML currently only allows 4096 features because it uses IRLS, and hence WLS, as an optimizer, which relies on collecting the covariance matrix to the driver. GLMs can also be fit by simple gradient-based methods like LBFGS.
>
> The new API from [SPARK-19762|https://issues.apache.org/jira/browse/SPARK-19762] makes this easy to add. I've already prototyped it, and it works pretty well. This change would allow an arbitrary number of features (up to what can fit on a single node), as in Linear/Logistic regression.
>
> For reference, other GLM packages also support this - e.g. statsmodels, H2O.
[jira] [Created] (SPARK-21406) Add logLikelihood to GLR families
Seth Hendrickson created SPARK-21406:
------------------------------------

Summary: Add logLikelihood to GLR families
Key: SPARK-21406
URL: https://issues.apache.org/jira/browse/SPARK-21406
Project: Spark
Issue Type: Sub-task
Components: ML
Affects Versions: 2.3.0
Reporter: Seth Hendrickson
Priority: Minor

To be able to implement the typical gradient-based aggregator for GLR, we'd need to add a {{logLikelihood(y: Double, mu: Double, weight: Double)}} method to the GLR {{Family}} class.

One possible hiccup: the Tweedie family log likelihood is not computationally feasible ([link|http://support.sas.com/documentation/cdl/en/stathpug/67524/HTML/default/viewer.htm#stathpug_hpgenselect_details16.htm]). H2O works around this by using the deviance instead. We could leave it unimplemented initially.
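As a hedged sketch of what such a method could compute for two of the easier families (pure Python; the class layout is illustrative, not GLR's actual `Family` internals):

```python
import math

class Gaussian:
    @staticmethod
    def log_likelihood(y, mu, weight=1.0, variance=1.0):
        # Weighted log-density of N(mu, variance) at y.
        return weight * (-0.5 * math.log(2 * math.pi * variance)
                         - (y - mu) ** 2 / (2 * variance))

class Poisson:
    @staticmethod
    def log_likelihood(y, mu, weight=1.0):
        # Weighted log-pmf of Poisson(mu) at y; lgamma(y + 1) = log(y!).
        return weight * (y * math.log(mu) - mu - math.lgamma(y + 1))

# Sanity check: for fixed y, the log likelihood peaks at mu = y.
assert Poisson.log_likelihood(3, 3.0) > Poisson.log_likelihood(3, 4.0)
assert Gaussian.log_likelihood(1.0, 1.0) > Gaussian.log_likelihood(1.0, 2.0)
```

The Tweedie case is the hard one precisely because its normalizing constant has no closed form, which is why substituting the deviance (as H2O does) or deferring the implementation are the options mentioned above.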
[jira] [Commented] (SPARK-21405) Add LBFGS solver for GeneralizedLinearRegression
[ https://issues.apache.org/jira/browse/SPARK-21405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16086071#comment-16086071 ] Seth Hendrickson commented on SPARK-21405:
------------------------------------------

cc [~yanboliang] [~actuaryzhang] I'm happy to work on it, but I wanted to get your opinions here. Thoughts?

> Add LBFGS solver for GeneralizedLinearRegression
> ------------------------------------------------
>
> Key: SPARK-21405
> URL: https://issues.apache.org/jira/browse/SPARK-21405
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.3.0
> Reporter: Seth Hendrickson
>
> GeneralizedLinearRegression in Spark ML currently only allows 4096 features because it uses IRLS, and hence WLS, as an optimizer, which relies on collecting the covariance matrix to the driver. GLMs can also be fit by simple gradient-based methods like LBFGS.
>
> The new API from [SPARK-19762|https://issues.apache.org/jira/browse/SPARK-19762] makes this easy to add. I've already prototyped it, and it works pretty well. This change would allow an arbitrary number of features (up to what can fit on a single node), as in Linear/Logistic regression.
>
> For reference, other GLM packages also support this - e.g. statsmodels, H2O.
[jira] [Created] (SPARK-21405) Add LBFGS solver for GeneralizedLinearRegression
Seth Hendrickson created SPARK-21405:
------------------------------------

Summary: Add LBFGS solver for GeneralizedLinearRegression
Key: SPARK-21405
URL: https://issues.apache.org/jira/browse/SPARK-21405
Project: Spark
Issue Type: Improvement
Components: ML
Affects Versions: 2.3.0
Reporter: Seth Hendrickson

GeneralizedLinearRegression in Spark ML currently only allows 4096 features because it uses IRLS, and hence WLS, as an optimizer, which relies on collecting the covariance matrix to the driver. GLMs can also be fit by simple gradient-based methods like LBFGS.

The new API from [SPARK-19762|https://issues.apache.org/jira/browse/SPARK-19762] makes this easy to add. I've already prototyped it, and it works pretty well. This change would allow an arbitrary number of features (up to what can fit on a single node), as in Linear/Logistic regression.

For reference, other GLM packages also support this - e.g. statsmodels, H2O.
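The reason a gradient-based solver removes the feature-count cap is that the GLM gradient only needs per-row terms; no d-by-d covariance matrix is ever collected. A toy sketch fitting a one-coefficient Poisson GLM (log link) by plain gradient descent as a stand-in for LBFGS - the data and step size are illustrative assumptions:

```python
import math

# Data generated exactly from mu = exp(0.5 * x), so the optimum is b = 0.5.
xs = [-1.0, 0.0, 1.0, 2.0]
ys = [math.exp(0.5 * x) for x in xs]

b = 0.0
for _ in range(500):
    # Poisson negative log-likelihood gradient: sum_i (mu_i - y_i) * x_i,
    # accumulated row by row -- no d x d matrix is materialized.
    grad = sum((math.exp(b * x) - y) * x for x, y in zip(xs, ys))
    b -= 0.05 * grad

assert abs(b - 0.5) < 1e-6
```

IRLS instead solves a weighted least-squares problem each iteration, which is where the covariance-matrix collection (and hence the 4096-feature limit) comes from.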
[jira] [Created] (SPARK-21245) Resolve code duplication for classification/regression summarizers
Seth Hendrickson created SPARK-21245:
------------------------------------

Summary: Resolve code duplication for classification/regression summarizers
Key: SPARK-21245
URL: https://issues.apache.org/jira/browse/SPARK-21245
Project: Spark
Issue Type: Sub-task
Components: ML
Affects Versions: 2.2.1
Reporter: Seth Hendrickson

In several places in Spark ML (LogReg, LinReg, SVC), we collect summary information about training data using {{MultivariateOnlineSummarizer}} and {{MulticlassSummarizer}}. The same code appears in several places (including test suites). We can eliminate this by creating a common implementation somewhere.
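The common implementation would essentially be a mergeable online summarizer: each partition accumulates running statistics, and partition summaries are merged. A hedged pure-Python sketch of the core (Welford updates plus Chan's parallel merge formula, per feature; the class name is hypothetical and this computes population rather than sample variance):

```python
class OnlineSummarizer:
    """Mergeable running mean/variance per feature (Welford + Chan merge)."""
    def __init__(self, dim):
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim   # sum of squared deviations from the mean

    def add(self, row):
        self.n += 1
        for j, x in enumerate(row):
            delta = x - self.mean[j]
            self.mean[j] += delta / self.n
            self.m2[j] += delta * (x - self.mean[j])

    def merge(self, other):
        """Combine two partition summaries (Chan et al. parallel formula)."""
        for j in range(len(self.mean)):
            delta = other.mean[j] - self.mean[j]
            total = self.n + other.n
            self.mean[j] += delta * other.n / total
            self.m2[j] += other.m2[j] + delta * delta * self.n * other.n / total
        self.n += other.n
        return self

    def variance(self):
        """Population variance per feature."""
        return [m / self.n for m in self.m2]

# Two "partitions" merged must match a single pass over all rows.
rows = [[1.0, 2.0], [3.0, 5.0], [5.0, 8.0], [7.0, 11.0]]
a, b, whole = OnlineSummarizer(2), OnlineSummarizer(2), OnlineSummarizer(2)
for r in rows[:2]: a.add(r)
for r in rows[2:]: b.add(r)
for r in rows: whole.add(r)
merged = a.merge(b)
assert all(abs(u - v) < 1e-12 for u, v in zip(merged.mean, whole.mean))
assert all(abs(u - v) < 1e-12 for u, v in zip(merged.variance(), whole.variance()))
```

The merge-equals-single-pass property is exactly what makes such a summarizer usable inside a {{treeAggregate}} over partitions.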
[jira] [Commented] (SPARK-21152) Use level 3 BLAS operations in LogisticAggregator
[ https://issues.apache.org/jira/browse/SPARK-21152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16061158#comment-16061158 ] Seth Hendrickson commented on SPARK-21152:
------------------------------------------

[~yanboliang] I can do performance testing and post the results, for sure. Still, do you have any thoughts on the caching issue? I wanted to see whether it was a deal-breaker before going as far as conducting exhaustive performance tests.

> Use level 3 BLAS operations in LogisticAggregator
> -------------------------------------------------
>
> Key: SPARK-21152
> URL: https://issues.apache.org/jira/browse/SPARK-21152
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Affects Versions: 2.1.1
> Reporter: Seth Hendrickson
>
> In the logistic regression gradient update, we currently compute the gradient contribution one row at a time. If we block the rows together, we can do a blocked gradient update that leverages the BLAS GEMM operation.
> On high-dimensional dense datasets, I've observed ~10x speedups. The problem, though, is that this likely won't improve the sparse case, so we need to keep both implementations around, and the blocked algorithm will require caching a new dataset of type:
> {code}
> BlockInstance(label: Vector, weight: Vector, features: Matrix)
> {code}
> We have avoided caching anything besides the original dataset passed to train in the past, because doing so adds memory overhead if the user has cached the original dataset for other reasons. Here, I'd like to discuss whether we think this patch would be worth the investment, given that it only improves a subset of the use cases.
[jira] [Created] (SPARK-21152) Use level 3 BLAS operations in LogisticAggregator
Seth Hendrickson created SPARK-21152:
------------------------------------

Summary: Use level 3 BLAS operations in LogisticAggregator
Key: SPARK-21152
URL: https://issues.apache.org/jira/browse/SPARK-21152
Project: Spark
Issue Type: Sub-task
Components: ML
Affects Versions: 2.1.1
Reporter: Seth Hendrickson

In the logistic regression gradient update, we currently compute the gradient contribution one row at a time. If we block the rows together, we can do a blocked gradient update that leverages the BLAS GEMM operation.

On high-dimensional dense datasets, I've observed ~10x speedups. The problem, though, is that this likely won't improve the sparse case, so we need to keep both implementations around, and the blocked algorithm will require caching a new dataset of type:

{code}
BlockInstance(label: Vector, weight: Vector, features: Matrix)
{code}

We have avoided caching anything besides the original dataset passed to train in the past, because doing so adds memory overhead if the user has cached the original dataset for other reasons. Here, I'd like to discuss whether we think this patch would be worth the investment, given that it only improves a subset of the use cases.
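The equivalence behind the blocked update is easy to state: accumulating (sigmoid(x_i . w) - y_i) * x_i row by row gives the same gradient as the single matrix product X^T (sigmoid(X w) - y) over a block of rows, and the latter maps onto BLAS. A NumPy sketch of the two paths (block size and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 64, 8
X = rng.normal(size=(n, d))                     # one block of instances
y = rng.integers(0, 2, size=n).astype(float)    # binary labels
w = rng.normal(size=d)                          # current coefficients

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Row-at-a-time accumulation, as in the current LogisticAggregator:
grad_rows = np.zeros(d)
for i in range(n):
    grad_rows += (sigmoid(X[i] @ w) - y[i]) * X[i]

# Blocked update: one BLAS-backed matrix product over the whole block.
grad_block = X.T @ (sigmoid(X @ w) - y)

assert np.allclose(grad_rows, grad_block)
```

The speedup for dense data comes from the BLAS routine's cache blocking and vectorization; for sparse rows the block product gains little, which is why both code paths would need to stay.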
[jira] [Commented] (SPARK-21152) Use level 3 BLAS operations in LogisticAggregator
[ https://issues.apache.org/jira/browse/SPARK-21152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16056025#comment-16056025 ] Seth Hendrickson commented on SPARK-21152:
------------------------------------------

cc [~dbtsai] [~mlnick] [~srowen] BTW, I've been working on this. DB, you and I discussed the caching issue in the past. Here's a comment from DB for reference:

"In the old mllib implementation, I just decided to have a copy of entire standardized dataset and had it cached for simplicity. After talking to couple people for their use cases, many times, they're training models on the same cached dataset for different regularizations, and then the old mllib will cache them again and again which will result pressure on GC and waste some memory space."

> Use level 3 BLAS operations in LogisticAggregator
> -------------------------------------------------
>
> Key: SPARK-21152
> URL: https://issues.apache.org/jira/browse/SPARK-21152
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Affects Versions: 2.1.1
> Reporter: Seth Hendrickson
>
> In the logistic regression gradient update, we currently compute the gradient contribution one row at a time. If we block the rows together, we can do a blocked gradient update that leverages the BLAS GEMM operation.
> On high-dimensional dense datasets, I've observed ~10x speedups. The problem, though, is that this likely won't improve the sparse case, so we need to keep both implementations around, and the blocked algorithm will require caching a new dataset of type:
> {code}
> BlockInstance(label: Vector, weight: Vector, features: Matrix)
> {code}
> We have avoided caching anything besides the original dataset passed to train in the past, because doing so adds memory overhead if the user has cached the original dataset for other reasons. Here, I'd like to discuss whether we think this patch would be worth the investment, given that it only improves a subset of the use cases.
[jira] [Commented] (SPARK-20988) Convert logistic regression to new aggregator framework
[ https://issues.apache.org/jira/browse/SPARK-20988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048489#comment-16048489 ] Seth Hendrickson commented on SPARK-20988:
------------------------------------------

I've already started on it a bit. Would you mind doing the same thing for LinearSVC instead? It should be mostly orthogonal, though I think some of the unit tests will need to share code.

> Convert logistic regression to new aggregator framework
> -------------------------------------------------------
>
> Key: SPARK-20988
> URL: https://issues.apache.org/jira/browse/SPARK-20988
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Affects Versions: 2.3.0
> Reporter: Seth Hendrickson
> Priority: Minor
>
> Use the hierarchy from SPARK-19762 for logistic regression optimization.
[jira] [Created] (SPARK-20988) Convert logistic regression to new aggregator framework
Seth Hendrickson created SPARK-20988: Summary: Convert logistic regression to new aggregator framework Key: SPARK-20988 URL: https://issues.apache.org/jira/browse/SPARK-20988 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 2.3.0 Reporter: Seth Hendrickson Priority: Minor Use the hierarchy from SPARK-19762 for logistic regression optimization -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15944030#comment-15944030 ] Seth Hendrickson edited comment on SPARK-19634 at 3/27/17 10:23 PM: I'm coming to this a bit late, but I'm finding things a bit hard to follow. Reading the design doc, it seems that the original plan was to implement two interfaces - an RDD one that provides the same performance as the current {{MultivariateOnlineSummarizer}} and a DataFrame interface using UDAF. From the design doc: "...In the meantime, there will be a (possibly faster) RDD interface and a (more flexible) Dataframe interface." Now, the PR for this uses {{TypedImperativeAggregate}}. I understand that it was pivoted away from UDAF, but the design doc does not reflect that. Also, if there is to be an RDD interface, what is the JIRA for it and what will it look like? Also, there are several concerns raised in the design doc about this Catalyst aggregate approach being less efficient, and the consensus seemed to be: provide an initial API with a "slow" implementation that will be improved upon in the future. Is that correct? I'm not that familiar with the Catalyst optimizer, but are we sure there is a good way to implement the tree-reduce type aggregation, and if so could we document that? I'd prefer to get the details hashed out further rather than rushing to provide an API and an initially slow implementation; that way we can make sure that we get this correct in the long term. I would really appreciate some clarification, and my apologies if I have missed any of the details/discussion. was (Author: sethah): I'm coming to this a bit late, but I'm finding things a bit hard to follow. Reading the design doc, it seems that the original plan was to implement two interfaces - an RDD one that provides the same performance as current {{MultivariateOnlineSummarizer}} and a data frame interface using UDAF. 
from design doc: "...In the meantime, there will be a (possibly faster) RDD interface and a (more flexible) Dataframe interface." Now, the PR for this uses {{TypedImperativeAggregate}}. I understand that the it was pivoted away from UDAF, but the design doc does not reflect that. Also, if there is to be an RDD interface, what is the JIRA for it and what will it look like? Also, there are several concerns raised in the design doc about this Catalyst aggregate approach being less efficient, and the consensus seemed to be: provide an initial API with a "slow" implementation that will be improved upon in the future. Is that correct? I'm not that familiar with the Catalyst optimizer, but are we sure there is a good way to implement the tree-reduce type aggregation, and if so could we document that? If this is still targeted at 2.2, why? I'd prefer to get the details hashed out further rather than rushing to provide an API and initial slow implementation, that way we can make sure that we get this correct in the long-term. I really appreciate some clarification and my apologies if I have missed any of the details/discussion. > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Timothy Hunter > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20083) Change matrix toArray to not create a new array when matrix is already column major
[ https://issues.apache.org/jira/browse/SPARK-20083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15943873#comment-15943873 ] Seth Hendrickson commented on SPARK-20083: -- Yes, that would be the intention. When we implement this change, we will have to take care to update any existing code that relies on {{toArray}} returning a fresh array. > Change matrix toArray to not create a new array when matrix is already column > major > --- > > Key: SPARK-20083 > URL: https://issues.apache.org/jira/browse/SPARK-20083 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Seth Hendrickson >Priority: Minor > > {{toArray}} always creates a new array in column major format, even when the > resulting array is the same as the backing values. We should change this to > just return a reference to the values array when it is already column major.
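A minimal sketch of the proposed {{toArray}} behavior (a standalone class for illustration, not the actual {{DenseMatrix}} code): return the backing array when the layout is already column major, and copy only in the transposed case.

{code}
// Illustrative stand-in for a dense matrix with column-major backing storage.
class DenseMatrixSketch(val numRows: Int, val numCols: Int,
                        val values: Array[Double], val isTransposed: Boolean) {

  def toArray: Array[Double] = {
    if (!isTransposed) {
      // Already column major: hand back the backing array with no copy.
      // Callers must treat the result as read-only -- this is the "take care
      // to update existing code" caveat from the comment above.
      values
    } else {
      // Row-major backing storage: materialize a column-major copy.
      val out = new Array[Double](numRows * numCols)
      var i = 0
      while (i < numRows) {
        var j = 0
        while (j < numCols) {
          out(j * numRows + i) = values(i * numCols + j)
          j += 1
        }
        i += 1
      }
      out
    }
  }
}
{code}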
[jira] [Commented] (SPARK-17137) Add compressed support for multinomial logistic regression coefficients
[ https://issues.apache.org/jira/browse/SPARK-17137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15941114#comment-15941114 ] Seth Hendrickson commented on SPARK-17137: -- I can make a PR for using this inside the MLOR code, but I probably won't have time to do performance tests within the next couple of days (since the code freeze has already passed). [~dbtsai] Do you think we need to do performance tests before this patch goes in? > Add compressed support for multinomial logistic regression coefficients > --- > > Key: SPARK-17137 > URL: https://issues.apache.org/jira/browse/SPARK-17137 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson >Priority: Minor > > For sparse coefficients in MLOR, such as under high L1 regularization, it may > be more efficient to store the coefficients in compressed format. We can add this > option to MLOR and perhaps do some performance tests to verify > improvements.
[jira] [Created] (SPARK-20083) Change matrix toArray to not create a new array when matrix is already column major
Seth Hendrickson created SPARK-20083: Summary: Change matrix toArray to not create a new array when matrix is already column major Key: SPARK-20083 URL: https://issues.apache.org/jira/browse/SPARK-20083 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.2.0 Reporter: Seth Hendrickson Priority: Minor {{toArray}} always creates a new array in column major format, even when the resulting array is the same as the backing values. We should change this to just return a reference to the values array when it is already column major.
[jira] [Commented] (SPARK-17136) Design optimizer interface for ML algorithms
[ https://issues.apache.org/jira/browse/SPARK-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15935112#comment-15935112 ] Seth Hendrickson commented on SPARK-17136: -- The reason to support setting them in both places would be backwards compatibility mainly. If we still allow users to set {{maxIter}} on the estimator then we won't break code that previously did this. Specifying the optimizer, either one built into Spark or a custom one, would be optional and something mostly advanced users would do. About grid-based CV, this would be a point that we need to carefully consider and make sure that we get it right. We'd still allow users to search over grids of {{maxIter}}, {{tol}} etc... since those params are still there, but additionally users could search over different optimizers and optimizers with different parameters themselves. I think that could be a bit clunky, but it's open for design discussion. e.g. {code} val paramGrid = new ParamGridBuilder() .addGrid(lr.minimizer, Array(new LBFGS(), new OWLQN(), new LBFGSB(lb, ub))) .build() {code} Yes, there are cases where users could supply conflicting grids, but AFAICT this problem already exists, e.g. {code} val paramGrid = new ParamGridBuilder() .addGrid(lr.solver, Array("normal", "l-bfgs")) .addGrid(lr.maxIter, Array(10, 20)) // maxIter is ignored when solver is normal .build() {code} About your suggestion of mimicking Spark SQL - would you mind elaborating here or on the design doc? I'm not as familiar with it, so if you have some design in mind it would be great to hear that. > Design optimizer interface for ML algorithms > > > Key: SPARK-17136 > URL: https://issues.apache.org/jira/browse/SPARK-17136 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > We should consider designing an interface that allows users to use their own > optimizers in some of the ML algorithms, similar to MLlib. 
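To make the discussion above concrete, the pluggable-optimizer API could look roughly like the following. This is a hypothetical sketch: the {{Minimizer}} trait name, method signature, and Breeze-based types are assumptions for illustration, not the proposed design.

{code}
import breeze.linalg.{DenseVector => BDV}
import breeze.optimize.{DiffFunction, LBFGS}

// Hypothetical abstraction: estimators call `minimize` instead of
// constructing a breeze optimizer directly, so users can plug in their own.
trait Minimizer extends Serializable {
  def minimize(loss: DiffFunction[BDV[Double]], init: BDV[Double]): BDV[Double]
}

// A built-in default backed by breeze's L-BFGS.
class BreezeLBFGS(maxIter: Int = 100, m: Int = 10, tol: Double = 1e-6) extends Minimizer {
  override def minimize(loss: DiffFunction[BDV[Double]], init: BDV[Double]): BDV[Double] =
    new LBFGS[BDV[Double]](maxIter, m, tol).minimize(loss, init)
}
{code}

An estimator would then expose a param holding a {{Minimizer}}, which is what would make grids over whole optimizer instances (as in the {{ParamGridBuilder}} example in the comment above) expressible.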
[jira] [Commented] (SPARK-7129) Add generic boosting algorithm to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15934960#comment-15934960 ] Seth Hendrickson commented on SPARK-7129: - I don't think anyone is working on it. Though I'm afraid it is probably not a good use of time to spend on this task, for a couple of reasons. We still don't have weight support in trees and there is extremely limited bandwidth of reviewers/committers in Spark ML at the moment. Further, there are many more important tasks that need to be done in ML so I would rate this as low priority, which also means it is less likely to be reviewed or see much progress. Finally, given the recent success of things like xgboost/lightGBM, we may want to rethink/rewrite the existing boosting framework to see if we can get similar performance. If anything, I think we need to think about how we'd like to proceed improving the boosting libraries in Spark from an overall point of view, but that is a large task that is likely a few releases away. I'd be curious to hear others' thoughts of course, but this is the state of things AFAIK. I guess I don't see this as a priority, but it could become one given enough community interest. > Add generic boosting algorithm to spark.ml > -- > > Key: SPARK-7129 > URL: https://issues.apache.org/jira/browse/SPARK-7129 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley > > The Pipelines API will make it easier to create a generic Boosting algorithm > which can work with any Classifier or Regressor. Creating this feature will > require researching the possible variants and extensions of boosting which we > may want to support now and/or in the future, and planning an API which will > be properly extensible. > In particular, it will be important to think about supporting: > * multiple loss functions (for AdaBoost, LogitBoost, gradient boosting, etc.) 
> * multiclass variants > * multilabel variants (which will probably be in a separate class and JIRA) > * For more esoteric variants, we should consider them but not design too much > around them: totally corrective boosting, cascaded models > Note: This may interact some with the existing tree ensemble methods, but it > should be largely separate since the tree ensemble APIs and implementations > are specialized for trees. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19762) Implement aggregator/loss function hierarchy and apply to linear regression
Seth Hendrickson created SPARK-19762: Summary: Implement aggregator/loss function hierarchy and apply to linear regression Key: SPARK-19762 URL: https://issues.apache.org/jira/browse/SPARK-19762 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 2.2.0 Reporter: Seth Hendrickson Priority: Minor Creating this subtask as a first step for consolidating ML aggregators. We can start by just applying this change to linear regression, to keep the PR more manageable in scope.
[jira] [Commented] (SPARK-19747) Consolidate code in ML aggregators
[ https://issues.apache.org/jira/browse/SPARK-19747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15885185#comment-15885185 ] Seth Hendrickson commented on SPARK-19747: -- BTW, I have a rough prototype which at least indicates this is do-able. Still some kinks to work out though. I would like to work on this task if that's alright. > Consolidate code in ML aggregators > -- > > Key: SPARK-19747 > URL: https://issues.apache.org/jira/browse/SPARK-19747 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Seth Hendrickson >Priority: Minor > > Many algorithms in Spark ML are posed as optimization of a differentiable > loss function over a parameter vector. We implement these by having a loss > function accumulate the gradient using an Aggregator class which has methods > that amount to a {{seqOp}} and {{combOp}}. So, pretty much every algorithm > that obeys this form implements a cost function class and an aggregator > class, which are completely separate from one another but share probably 80% > of the same code. > I think it is important to clean things like this up, and if we can do it > properly it will make the code much more maintainable, readable, and bug > free. It will also help reduce the overhead of future implementations. > The design is of course open for discussion, but I think we should aim to: > 1. Have all aggregators share parent classes, so that they only need to > implement the {{add}} function. This is really the only difference in the > current aggregators. > 2. Have a single, generic cost function that is parameterized by the > aggregator type. This reduces the many places we implement cost functions and > greatly reduces the amount of duplicated code. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19747) Consolidate code in ML aggregators
Seth Hendrickson created SPARK-19747: Summary: Consolidate code in ML aggregators Key: SPARK-19747 URL: https://issues.apache.org/jira/browse/SPARK-19747 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.2.0 Reporter: Seth Hendrickson Priority: Minor Many algorithms in Spark ML are posed as optimization of a differentiable loss function over a parameter vector. We implement these by having a loss function accumulate the gradient using an Aggregator class which has methods that amount to a {{seqOp}} and {{combOp}}. So, pretty much every algorithm that obeys this form implements a cost function class and an aggregator class, which are completely separate from one another but share probably 80% of the same code. I think it is important to clean things like this up, and if we can do it properly it will make the code much more maintainable, readable, and bug free. It will also help reduce the overhead of future implementations. The design is of course open for discussion, but I think we should aim to: 1. Have all aggregators share parent classes, so that they only need to implement the {{add}} function. This is really the only difference in the current aggregators. 2. Have a single, generic cost function that is parameterized by the aggregator type. This reduces the many places we implement cost functions and greatly reduces the amount of duplicated code. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
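The design aims above (1: aggregators share a parent class and implement only {{add}}; 2: a single generic cost function parameterized by the aggregator type) could be sketched as follows. The class and member names are illustrative, not the final design.

{code}
import org.apache.spark.ml.linalg.Vector

// Hypothetical row type for illustration.
case class Instance(label: Double, weight: Double, features: Vector)

abstract class DifferentiableLossAggregator[Agg <: DifferentiableLossAggregator[Agg]] {
  self: Agg =>

  protected var weightSum: Double = 0.0
  protected var lossSum: Double = 0.0
  protected def gradientSumArray: Array[Double]

  // The only per-algorithm piece (aim 1): fold one instance into the sums.
  def add(instance: Instance): Agg

  // Shared combOp: merge partial sums computed on another partition.
  def merge(other: Agg): Agg = {
    weightSum += other.weightSum
    lossSum += other.lossSum
    val g = gradientSumArray
    val og = other.gradientSumArray
    var i = 0
    while (i < g.length) { g(i) += og(i); i += 1 }
    this
  }

  def loss: Double = lossSum / weightSum
}
{code}

A generic cost function (aim 2) would then be written once against {{Agg}}: broadcast the coefficients, run {{treeAggregate}} with the shared {{add}}/{{merge}}, and return the loss and gradient, with no per-algorithm cost-function classes.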
[jira] [Created] (SPARK-19746) LogisticAggregator is inefficient in indexing
Seth Hendrickson created SPARK-19746: Summary: LogisticAggregator is inefficient in indexing Key: SPARK-19746 URL: https://issues.apache.org/jira/browse/SPARK-19746 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.1.0 Reporter: Seth Hendrickson The following code occurs in the {{LogisticAggregator.add}} method, which is a performance-critical path. {code} val localCoefficients = bcCoefficients.value features.foreachActive { (index, value) => val stdValue = value / localFeaturesStd(index) var j = 0 while (j < numClasses) { margins(j) += localCoefficients(index * numClasses + j) * stdValue j += 1 } } {code} {{localCoefficients(index * numClasses + j)}} calls the {{apply}} method on {{Vector}}, which dispatches to {{asBreeze(index * numClasses + j)}}, which creates a new Breeze vector and then indexes it. This is very inefficient, creates a lot of unnecessary garbage, and we can avoid it by indexing the underlying array directly.
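The fix suggested in the description could look like this: materialize the underlying array once outside the loop, so the hot path indexes a plain {{Array[Double]}}. This is a sketch against the snippet above; surrounding definitions such as {{margins}}, {{localFeaturesStd}}, and {{numClasses}} are assumed from that context.

{code}
// Grab the backing array once, outside the loop, instead of going through
// Vector.apply -> asBreeze on every element.
val localCoefficients: Array[Double] = bcCoefficients.value.toArray
features.foreachActive { (index, value) =>
  val stdValue = value / localFeaturesStd(index)
  var j = 0
  while (j < numClasses) {
    // Plain array indexing: no per-access Breeze vector allocation.
    margins(j) += localCoefficients(index * numClasses + j) * stdValue
    j += 1
  }
}
{code}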
[jira] [Created] (SPARK-19745) SVCAggregator serializes coefficients
Seth Hendrickson created SPARK-19745: Summary: SVCAggregator serializes coefficients Key: SPARK-19745 URL: https://issues.apache.org/jira/browse/SPARK-19745 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.2.0 Reporter: Seth Hendrickson Similar to [SPARK-16008|https://issues.apache.org/jira/browse/SPARK-16008], the SVC aggregator captures the coefficients in the class closure, and therefore ships them around during optimization. We can prevent this with a bit of reorganization of the aggregator class.
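One possible reorganization, mirroring the SPARK-16008 fix: keep only the broadcast handle in the serialized aggregator and rebuild the local coefficient array lazily on each executor. The class name and body below are illustrative, not the actual {{LinearSVC}} code.

{code}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.ml.linalg.Vector

class HingeAggregatorSketch(bcCoefficients: Broadcast[Vector]) extends Serializable {

  // Rehydrated lazily per executor; @transient keeps the materialized array
  // out of the serialized closure and out of the result shipped back.
  @transient private lazy val localCoefficients: Array[Double] =
    bcCoefficients.value.toArray

  private var lossSum = 0.0
  private var weightSum = 0.0

  def add(label: Double, weight: Double, features: Vector): this.type = {
    var margin = 0.0
    features.foreachActive { (i, v) => margin += localCoefficients(i) * v }
    // Hinge loss, with labels in {0, 1} mapped to {-1, 1}.
    lossSum += weight * math.max(0.0, 1.0 - (2 * label - 1) * margin)
    weightSum += weight
    this
  }
}
{code}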
[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups
[ https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15867049#comment-15867049 ] Seth Hendrickson commented on SPARK-18392: -- I would pretty strongly prefer to focus on adding AND-amplification before adding anything else to LSH. That is more of a missing part of the functionality, whereas other things are enhancements. Curious to hear others' thoughts on this. > LSH API, algorithm, and documentation follow-ups > > > Key: SPARK-18392 > URL: https://issues.apache.org/jira/browse/SPARK-18392 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > This JIRA summarizes discussions from the initial LSH PR > [https://github.com/apache/spark/pull/15148] as well as the follow-up for > hash distance [https://github.com/apache/spark/pull/15800]. This will be > broken into subtasks: > * API changes (targeted for 2.1) > * algorithmic fixes (targeted for 2.1) > * documentation improvements (ideally 2.1, but could slip) > The major issues we have mentioned are as follows: > * OR vs AND amplification > ** Need to make API flexible enough to support both types of amplification in > the future > ** Need to clarify which we support, including in each model function > (transform, similarity join, neighbors) > * Need to clarify which algorithms we have implemented, improve docs and > references, and fix the algorithms if needed. > These major issues are broken down into detailed issues below. > h3. LSH abstraction > * Rename {{outputDim}} to something indicative of OR-amplification. > ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used > in the future for AND amplification (Thanks [~mlnick]!) > * transform > ** Update output schema to {{Array of Vector}} instead of {{Vector}}. This > is the "raw" output of all hash functions, i.e., with no aggregation for > amplification. 
> ** Clarify meaning of output in terms of multiple hash functions and > amplification. > ** Note: We will _not_ worry about users using this output for dimensionality > reduction; if anything, that use case can be explained in the User Guide. > * Documentation > ** Clarify terminology used everywhere > *** hash function {{h_i}}: basic hash function without amplification > *** hash value {{h_i(key)}}: output of a hash function > *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with > AND-amplification using K base hash functions > *** compound hash function value {{g(key)}}: vector-valued output > *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with > OR-amplification using L compound hash functions > *** hash table value {{H(key)}}: output of array of vectors > *** This terminology is largely pulled from Wang et al.'s survey and the > multi-probe LSH paper. > ** Link clearly to documentation (Wikipedia or papers) which matches our > terminology and what we implemented > h3. RandomProjection (or P-Stable Distributions) > * Rename {{RandomProjection}} > ** Options include: {{ScalarRandomProjectionLSH}}, > {{BucketedRandomProjectionLSH}}, {{PStableLSH}} > * API privacy > ** Make randUnitVectors private > * hashFunction > ** Currently, this uses OR-amplification for single probing, as we intended. > ** It does *not* do multiple probing, at least not in the sense of the > original MPLSH paper. We should fix that or at least document its behavior. > * Documentation > ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia > ** Also link to the multi-probe LSH paper since that explains this method > very clearly. > ** Clarify hash function and distance metric > h3. 
MinHash > * Rename {{MinHash}} -> {{MinHashLSH}} > * API privacy > ** Make randCoefficients, numEntries private > * hashDistance (used in approxNearestNeighbors) > ** Update to use average of indicators of hash collisions [SPARK-18334] > ** See [Wikipedia | > https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a > reference > h3. All references > I'm just listing references I looked at. > Papers > * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf] > * [https://people.csail.mit.edu/indyk/p117-andoni.pdf] > * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf] > * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe > LSH paper > Wikipedia > * > [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search] > * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issue
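The {{hashDistance}} update referenced above for SPARK-18334 (an average of indicators of hash collisions, per the MinHash variant with many hash functions) might be sketched as follows; the exact signature is an assumption.

{code}
import org.apache.spark.ml.linalg.Vector

// Distance = 1 - (fraction of hash tables in which the two keys collide).
// Assumes x and y are the per-table hash outputs for two keys, aligned by index.
def hashDistance(x: Seq[Vector], y: Seq[Vector]): Double = {
  val collisions = x.zip(y).count { case (hx, hy) => hx == hy }
  1.0 - collisions.toDouble / x.length
}
{code}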
[jira] [Commented] (SPARK-9478) Add sample weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15865079#comment-15865079 ] Seth Hendrickson commented on SPARK-9478: - [~josephkb] Done. Thanks for your feedback on sampling! > Add sample weights to Random Forest > --- > > Key: SPARK-9478 > URL: https://issues.apache.org/jira/browse/SPARK-9478 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.1 >Reporter: Patrick Crenshaw > > Currently, this implementation of random forest does not support class > weights. Class weights are important when there is imbalanced training data > or the evaluation metric of a classifier is imbalanced (e.g. true positive > rate at some false positive threshold).
[jira] [Created] (SPARK-19591) Add sample weights to decision trees
Seth Hendrickson created SPARK-19591: Summary: Add sample weights to decision trees Key: SPARK-19591 URL: https://issues.apache.org/jira/browse/SPARK-19591 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 2.1.0 Reporter: Seth Hendrickson Add sample weights to decision trees
[jira] [Commented] (SPARK-17139) Add model summary for MultinomialLogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858438#comment-15858438 ] Seth Hendrickson commented on SPARK-17139: -- Seems like a reasonable way to solve a messy problem - so I think we should go ahead with it. > Add model summary for MultinomialLogisticRegression > --- > > Key: SPARK-17139 > URL: https://issues.apache.org/jira/browse/SPARK-17139 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > Add model summary to multinomial logistic regression using same interface as > in other ML models. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17139) Add model summary for MultinomialLogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15857230#comment-15857230 ] Seth Hendrickson commented on SPARK-17139: -- [~josephkb] Is [this more or less what you had in mind|https://gist.github.com/sethah/83c57fd77385979579cb44f3d5730e67]? > Add model summary for MultinomialLogisticRegression > --- > > Key: SPARK-17139 > URL: https://issues.apache.org/jira/browse/SPARK-17139 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > Add model summary to multinomial logistic regression using same interface as > in other ML models. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19313) GaussianMixture throws cryptic error when number of features is too high
Seth Hendrickson created SPARK-19313: Summary: GaussianMixture throws cryptic error when number of features is too high Key: SPARK-19313 URL: https://issues.apache.org/jira/browse/SPARK-19313 Project: Spark Issue Type: Bug Components: ML, MLlib Reporter: Seth Hendrickson Priority: Minor The following fails {code} val df = Seq( Vectors.sparse(46400, Array(0, 4), Array(3.0, 8.0)), Vectors.sparse(46400, Array(1, 5), Array(4.0, 9.0))) .map(Tuple1.apply).toDF("features") val gm = new GaussianMixture() gm.fit(df) {code} It fails because GMMs allocate an array of size {{numFeatures * numFeatures}} and in this case we'll get integer overflow. We should limit the number of features appropriately. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
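A guard of the kind suggested might look like this; the object name, message wording, and where the check is invoked are illustrative. The limit itself follows from the description: a {{numFeatures * numFeatures}} array overflows {{Int}} once {{numFeatures}} exceeds sqrt(Int.MaxValue), i.e. 46340, which is why the 46400-feature example fails.

{code}
object GaussianMixtureLimits {
  // Covariance matrices are materialized as numFeatures * numFeatures arrays,
  // so numFeatures must stay at or below sqrt(Int.MaxValue) = 46340 for the
  // allocation size to fit in an Int.
  val MaxNumFeatures: Int = math.sqrt(Int.MaxValue).toInt

  // Fail fast with a clear message instead of a cryptic overflow error.
  def validateNumFeatures(numFeatures: Int): Unit = {
    require(numFeatures <= MaxNumFeatures,
      s"GaussianMixture cannot handle more than $MaxNumFeatures features " +
      s"because it allocates a numFeatures x numFeatures covariance matrix; " +
      s"got $numFeatures features.")
  }
}
{code}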
[jira] [Commented] (SPARK-17136) Design optimizer interface for ML algorithms
[ https://issues.apache.org/jira/browse/SPARK-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15819062#comment-15819062 ] Seth Hendrickson commented on SPARK-17136: -- I'm interested in working on this task including both driving the discussion and submitting an initial PR when it is time. I have the beginnings of a design document constructed [here|https://docs.google.com/document/d/1ynyTwlNw4b6DovG6m8okd3fD2PVZKCEq5rFfsg5Ba1k/edit?usp=sharing], and I'd like to open it up for community feedback and input. We do see requests from time to time for users to use their own optimizers in Spark ML algorithms and we have not supported it in Spark ML. With fairly minimal added code, we can make Spark ML optimizers pluggable which provides a tangible benefit to users. Potentially, we can design an API that has benefits beyond just that, and I'm interested to hear some of the other needs/wants people have. cc [~dbtsai] [~yanboliang] [~WeichenXu123] [~josephkb] [~srowen] > Design optimizer interface for ML algorithms > > > Key: SPARK-17136 > URL: https://issues.apache.org/jira/browse/SPARK-17136 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > We should consider designing an interface that allows users to use their own > optimizers in some of the ML algorithms, similar to MLlib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1581#comment-1581 ] Seth Hendrickson commented on SPARK-10078: -- As a part of [SPARK-17136|https://issues.apache.org/jira/browse/SPARK-17136] I am working on a generic optimization interface for Spark, which would allow users to easily plug in their own optimizers in place of built-in ones. Because of this, I have also been looking into how we can create an interface that allows optimization with both local and distributed vector types in a single interface. I have a branch that I have been doing some prototyping on [here|https://github.com/sethah/spark/tree/spark-vlbfgs]. Actually, I was able to get Yanbo's VLogisticRegression class working (on a very small dataset) using the VLBFGS optimizer in my branch, which also works with local vector types. Maybe you can let me know if this lines up at all with what you were thinking? Thinking about this interface without adding VL-BFGS, we can avoid any code duplication with Breeze to start because we can simply plug in the Breeze code to our abstraction (in my branch, that is what is done for LBFGS and OWLQN). Adding VL-BFGS is a bit trickier. The problems I see are that we need an abstraction that will allow us to persist and unpersist the parameter vectors during optimization as needed. Adding "persist" and "unpersist" methods to a vector space, for example, seems a leaky abstraction. It might make sense to add this to Breeze itself if we can avoid leaking RDD details into the interface. However, one benefit of SPARK-17136 is that we could potentially eliminate our dependence on Breeze in the future. I think it might make sense to implement our own VL-BFGS interface, even if there is some duplication. Actually, I think this is part of an important discussion that will happen as part of the optimization interface design. 
I hope to post a detailed design document for that JIRA sometime in the next few days. Finally, can you provide more detail on your proposed changes to DiffFunction? DiffFunction in Breeze is already abstract in its parameter type... > Vector-free L-BFGS > -- > > Key: SPARK-10078 > URL: https://issues.apache.org/jira/browse/SPARK-10078 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > This is to implement a scalable version of vector-free L-BFGS > (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf). > Design document: > https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
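For readers following the optimizer discussion above: the heart of (V)L-BFGS is the two-loop recursion, which builds the search direction from recent (s, y) curvature pairs using only dot products and elementwise axpy-style updates — exactly the operations that would have to go through whatever vector abstraction is chosen, local or distributed. A hedged local-vector sketch in plain Python (not Spark's or Breeze's API):

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def lbfgs_direction(grad, s_list, y_list):
    """Classic L-BFGS two-loop recursion: approximates H^{-1} * grad from the
    stored pairs s_k = x_{k+1} - x_k, y_k = g_{k+1} - g_k.
    The descent step is then minus the returned direction."""
    q = list(grad)
    rhos = [1.0 / dot(y, s) for s, y in zip(s_list, y_list)]
    alphas = []
    # first loop: newest pair to oldest
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):
        a = rho * dot(s, q)
        alphas.append(a)
        q = [qi - a * yi for qi, yi in zip(q, y)]
    # scale by gamma = (s.y)/(y.y) of the newest pair as the initial Hessian
    if s_list:
        gamma = dot(s_list[-1], y_list[-1]) / dot(y_list[-1], y_list[-1])
    else:
        gamma = 1.0
    r = [gamma * qi for qi in q]
    # second loop: oldest pair to newest
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * dot(y, r)
        r = [ri + (a - b) * si for ri, si in zip(r, s)]
    return r
```

Every operation here is a dot product or an elementwise combination, so swapping the list arithmetic for distributed-vector primitives (with the persist/unpersist concerns noted above) is essentially the difference between L-BFGS and VL-BFGS.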
[jira] [Commented] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15805366#comment-15805366 ] Seth Hendrickson commented on SPARK-10078: -- As a part of [SPARK-17136|https://issues.apache.org/jira/browse/SPARK-17136], I am looking into a design for a generic optimizer interface for Spark ML. Ideally, this should be abstracted such that, as Yanbo mentioned, users can switch between optimizers easily. I don't think adding this to Breeze is important since we hope to add our own interface directly into Spark. > Vector-free L-BFGS > -- > > Key: SPARK-10078 > URL: https://issues.apache.org/jira/browse/SPARK-10078 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > This is to implement a scalable version of vector-free L-BFGS > (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf). > Design document: > https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15793488#comment-15793488 ] Seth Hendrickson commented on SPARK-10078: -- [~yanboliang] I was a bit confused by the following comment under new requirements for VL-BFGS: "API consistency with Breeze L-BFGS so we can migrate existing code smoothly." What existing code are we migrating, and to where/what? Are we planning to replace the use of the Breeze LBFGS solvers with this VL-BFGS implementation? If so, what about the numerous use cases that do not need to partition by features? Thanks! > Vector-free L-BFGS > -- > > Key: SPARK-10078 > URL: https://issues.apache.org/jira/browse/SPARK-10078 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > This is to implement a scalable version of vector-free L-BFGS > (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf). > Design document: > https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18705) Docs for one-pass solver for linear regression with L1 and elastic-net penalties
[ https://issues.apache.org/jira/browse/SPARK-18705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15720892#comment-15720892 ] Seth Hendrickson commented on SPARK-18705: -- Yeah, I'll do it today :) > Docs for one-pass solver for linear regression with L1 and elastic-net > penalties > > > Key: SPARK-18705 > URL: https://issues.apache.org/jira/browse/SPARK-18705 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Reporter: Yanbo Liang >Priority: Minor > > Add document for SPARK-17748 at [{{Normal equation solver for weighted least > squares}}|http://spark.apache.org/docs/latest/ml-advanced.html#normal-equation-solver-for-weighted-least-squares] > session. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17772) Add helper testing methods for instance weighting
[ https://issues.apache.org/jira/browse/SPARK-17772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15687500#comment-15687500 ] Seth Hendrickson commented on SPARK-17772: -- Please do, thanks! > Add helper testing methods for instance weighting > - > > Key: SPARK-17772 > URL: https://issues.apache.org/jira/browse/SPARK-17772 > Project: Spark > Issue Type: Test > Components: ML >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson >Priority: Minor > > More and more ML algos are accepting instance weights. We keep replicating > code to test instance weighting in every test suite, which will get out of > hand rather quickly. We can and should implement some generic instance weight > test helper methods so that we can reduce duplicated code and standardize > these tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9478) Add sample weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675770#comment-15675770 ] Seth Hendrickson commented on SPARK-9478: - I'm going to work on submitting a PR for adding sample weights for 2.2. That PR is for adding class weights, which I think we decided against. > Add sample weights to Random Forest > --- > > Key: SPARK-9478 > URL: https://issues.apache.org/jira/browse/SPARK-9478 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.1 >Reporter: Patrick Crenshaw > > Currently, this implementation of random forest does not support class > weights. Class weights are important when there is imbalanced training data > or the evaluation metric of a classifier is imbalanced (e.g. true positive > rate at some false positive threshold). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
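The core change sample weights require in tree learning is that impurity statistics sum weights rather than count rows. A minimal, hypothetical sketch of weighted Gini impurity (illustrative only, not Spark's actual implementation):

```python
def weighted_gini(labels, weights):
    """Gini impurity where each row contributes its weight, not a count of 1."""
    total = float(sum(weights))
    weight_by_class = {}
    for label, w in zip(labels, weights):
        weight_by_class[label] = weight_by_class.get(label, 0.0) + w
    # 1 - sum of squared class-weight fractions
    return 1.0 - sum((w / total) ** 2 for w in weight_by_class.values())
```

With unit weights this reduces to the unweighted formula, and giving a row weight k matches replicating it k times — a handy invariant for the weighting tests discussed in SPARK-17772.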
[jira] [Updated] (SPARK-9478) Add sample weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seth Hendrickson updated SPARK-9478: Summary: Add sample weights to Random Forest (was: Add class weights to Random Forest) > Add sample weights to Random Forest > --- > > Key: SPARK-9478 > URL: https://issues.apache.org/jira/browse/SPARK-9478 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.1 >Reporter: Patrick Crenshaw > > Currently, this implementation of random forest does not support class > weights. Class weights are important when there is imbalanced training data > or the evaluation metric of a classifier is imbalanced (e.g. true positive > rate at some false positive threshold). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18456) Use matrix abstraction for LogisticRegression coefficients during training
Seth Hendrickson created SPARK-18456: Summary: Use matrix abstraction for LogisticRegression coefficients during training Key: SPARK-18456 URL: https://issues.apache.org/jira/browse/SPARK-18456 Project: Spark Issue Type: Improvement Components: ML Reporter: Seth Hendrickson Priority: Minor This is a follow-up from [SPARK-18060|https://issues.apache.org/jira/browse/SPARK-18060]. The current code for logistic regression relies on manually indexing flat arrays of column-major coefficients, which can be messy and is hard to maintain. We can use a matrix abstraction instead of a flat array to simplify things. This will make the code easier to read and maintain. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups
[ https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15665233#comment-15665233 ] Seth Hendrickson commented on SPARK-18392: -- Thank you for clarifying, I see it now. > LSH API, algorithm, and documentation follow-ups > > > Key: SPARK-18392 > URL: https://issues.apache.org/jira/browse/SPARK-18392 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > This JIRA summarizes discussions from the initial LSH PR > [https://github.com/apache/spark/pull/15148] as well as the follow-up for > hash distance [https://github.com/apache/spark/pull/15800]. This will be > broken into subtasks: > * API changes (targeted for 2.1) > * algorithmic fixes (targeted for 2.1) > * documentation improvements (ideally 2.1, but could slip) > The major issues we have mentioned are as follows: > * OR vs AND amplification > ** Need to make API flexible enough to support both types of amplification in > the future > ** Need to clarify which we support, including in each model function > (transform, similarity join, neighbors) > * Need to clarify which algorithms we have implemented, improve docs and > references, and fix the algorithms if needed. > These major issues are broken down into detailed issues below. > h3. LSH abstraction > * Rename {{outputDim}} to something indicative of OR-amplification. > ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used > in the future for AND amplification (Thanks [~mlnick]!) > * transform > ** Update output schema to {{Array of Vector}} instead of {{Vector}}. This > is the "raw" output of all hash functions, i.e., with no aggregation for > amplification. > ** Clarify meaning of output in terms of multiple hash functions and > amplification. > ** Note: We will _not_ worry about users using this output for dimensionality > reduction; if anything, that use case can be explained in the User Guide. 
> * Documentation > ** Clarify terminology used everywhere > *** hash function {{h_i}}: basic hash function without amplification > *** hash value {{h_i(key)}}: output of a hash function > *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with > AND-amplification using K base hash functions > *** compound hash function value {{g(key)}}: vector-valued output > *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with > OR-amplification using L compound hash functions > *** hash table value {{H(key)}}: output of array of vectors > *** This terminology is largely pulled from Wang et al.'s survey and the > multi-probe LSH paper. > ** Link clearly to documentation (Wikipedia or papers) which matches our > terminology and what we implemented > h3. RandomProjection (or P-Stable Distributions) > * Rename {{RandomProjection}} > ** Options include: {{ScalarRandomProjectionLSH}}, > {{BucketedRandomProjectionLSH}}, {{PStableLSH}} > * API privacy > ** Make randUnitVectors private > * hashFunction > ** Currently, this uses OR-amplification for single probing, as we intended. > ** It does *not* do multiple probing, at least not in the sense of the > original MPLSH paper. We should fix that or at least document its behavior. > * Documentation > ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia > ** Also link to the multi-probe LSH paper since that explains this method > very clearly. > ** Clarify hash function and distance metric > h3. MinHash > * Rename {{MinHash}} -> {{MinHashLSH}} > * API privacy > ** Make randCoefficients, numEntries private > * hashDistance (used in approxNearestNeighbors) > ** Update to use average of indicators of hash collisions [SPARK-18334] > ** See [Wikipedia | > https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a > reference > h3. All references > I'm just listing references I looked at. 
> Papers > * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf] > * [https://people.csail.mit.edu/indyk/p117-andoni.pdf] > * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf] > * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe > LSH paper > Wikipedia > * > [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search] > * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
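On the hashDistance item above (SPARK-18334): the proposal is to measure the distance between two MinHash signatures as one minus the average of collision indicators, which estimates 1 − Jaccard similarity of the underlying sets. A hedged sketch (the hash coefficients and prime below are illustrative, not necessarily Spark's constants):

```python
def minhash_signature(items, coefficients, prime=2038074743):
    """One hash function per (a, b) pair: h(x) = (a*x + b) % prime;
    each signature entry is the minimum hash value over the set."""
    return [min((a * x + b) % prime for x in items) for a, b in coefficients]

def hash_distance(sig1, sig2):
    """1 - (average of collision indicators) across signature entries."""
    collisions = sum(1 for h1, h2 in zip(sig1, sig2) if h1 == h2)
    return 1.0 - collisions / len(sig1)
```

Identical sets get distance 0, and as the number of hash functions grows the value concentrates around 1 − Jaccard(A, B), which is what makes it usable in approxNearestNeighbors.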
[jira] [Commented] (SPARK-18321) ML 2.1 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-18321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15658288#comment-15658288 ] Seth Hendrickson commented on SPARK-18321: -- So I generated the API docs between 2.0 and 2.1, and looked at everything that had changed. I didn't find anything in the way of type signatures not checking out. Again, the biggest items are LSH and new clustering summaries, the other things are mostly params added or edited. If anyone has other suggestions of what to do here, please let me know. I am reasonably sure there are no major Java incompatibilities based on the evidence above. > ML 2.1 QA: API: Java compatibility, docs > > > Key: SPARK-18321 > URL: https://issues.apache.org/jira/browse/SPARK-18321 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > > Check Java compatibility for this release: > * APIs in {{spark.ml}} > * New APIs in {{spark.mllib}} (There should be few, if any.) > Checking compatibility means: > * Checking for differences in how Scala and Java handle types. Some items to > look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. These may not > be understood in Java, or they may be accessible only via the weirdly named > Java types (with "$" or "#") which are generated by the Scala compiler. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. (In {{spark.ml}}, we have largely tried to avoid > using enumerations, and have instead favored plain strings.) > * Check for differences in generated Scala vs Java docs. E.g., one past > issue was that Javadocs did not respect Scala's package private modifier. 
> If you find issues, please comment here, or for larger items, create separate > JIRAs and link here as "requires". > * Remember that we should not break APIs from previous releases. If you find > a problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). > * If needed for complex issues, create small Java unit tests which execute > each method. (Algorithmic correctness can be checked in Scala.) > Recommendations for how to complete this task: > * There are not great tools. In the past, this task has been done by: > ** Generating API docs > ** Building JAR and outputting the Java class signatures for MLlib > ** Manually inspecting and searching the docs and class signatures for issues > * If you do have ideas for better tooling, please say so we can make this > task easier in the future! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups
[ https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15658063#comment-15658063 ] Seth Hendrickson commented on SPARK-18392: -- [~josephkb] I wasn't sure where to ask this, but I saw you suggested adding a self-type reference to the LSH class:
{code}
private[ml] abstract class LSH[T <: LSHModel[T]]
  extends Estimator[T] with LSHParams with DefaultParamsWritable {
  self: Estimator[T] =>
{code}
And I'm not sure I can see why it's needed. What was the intent? > LSH API, algorithm, and documentation follow-ups > > > Key: SPARK-18392 > URL: https://issues.apache.org/jira/browse/SPARK-18392 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > This JIRA summarizes discussions from the initial LSH PR > [https://github.com/apache/spark/pull/15148] as well as the follow-up for > hash distance [https://github.com/apache/spark/pull/15800]. This will be > broken into subtasks: > * API changes (targeted for 2.1) > * algorithmic fixes (targeted for 2.1) > * documentation improvements (ideally 2.1, but could slip) > The major issues we have mentioned are as follows: > * OR vs AND amplification > ** Need to make API flexible enough to support both types of amplification in > the future > ** Need to clarify which we support, including in each model function > (transform, similarity join, neighbors) > * Need to clarify which algorithms we have implemented, improve docs and > references, and fix the algorithms if needed. > These major issues are broken down into detailed issues below. > h3. LSH abstraction > * Rename {{outputDim}} to something indicative of OR-amplification. > ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used > in the future for AND amplification (Thanks [~mlnick]!) > * transform > ** Update output schema to {{Array of Vector}} instead of {{Vector}}. 
This > is the "raw" output of all hash functions, i.e., with no aggregation for > amplification. > ** Clarify meaning of output in terms of multiple hash functions and > amplification. > ** Note: We will _not_ worry about users using this output for dimensionality > reduction; if anything, that use case can be explained in the User Guide. > * Documentation > ** Clarify terminology used everywhere > *** hash function {{h_i}}: basic hash function without amplification > *** hash value {{h_i(key)}}: output of a hash function > *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with > AND-amplification using K base hash functions > *** compound hash function value {{g(key)}}: vector-valued output > *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with > OR-amplification using L compound hash functions > *** hash table value {{H(key)}}: output of array of vectors > *** This terminology is largely pulled from Wang et al.'s survey and the > multi-probe LSH paper. > ** Link clearly to documentation (Wikipedia or papers) which matches our > terminology and what we implemented > h3. RandomProjection (or P-Stable Distributions) > * Rename {{RandomProjection}} > ** Options include: {{ScalarRandomProjectionLSH}}, > {{BucketedRandomProjectionLSH}}, {{PStableLSH}} > * API privacy > ** Make randUnitVectors private > * hashFunction > ** Currently, this uses OR-amplification for single probing, as we intended. > ** It does *not* do multiple probing, at least not in the sense of the > original MPLSH paper. We should fix that or at least document its behavior. > * Documentation > ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia > ** Also link to the multi-probe LSH paper since that explains this method > very clearly. > ** Clarify hash function and distance metric > h3. 
MinHash > * Rename {{MinHash}} -> {{MinHashLSH}} > * API privacy > ** Make randCoefficients, numEntries private > * hashDistance (used in approxNearestNeighbors) > ** Update to use average of indicators of hash collisions [SPARK-18334] > ** See [Wikipedia | > https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a > reference > h3. All references > I'm just listing references I looked at. > Papers > * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf] > * [https://people.csail.mit.edu/indyk/p117-andoni.pdf] > * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf] > * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe > LSH paper > Wikipedia > * > [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search] > * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18321) ML 2.1 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-18321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15654524#comment-15654524 ] Seth Hendrickson commented on SPARK-18321: -- In the current Spark Java docs here: http://spark.apache.org/docs/latest/api/java/, I see some classes showing up that are private in Scala, e.g. LogisticAggregator and LogisticCostFun. I checked older releases and this problem is not new... > ML 2.1 QA: API: Java compatibility, docs > > > Key: SPARK-18321 > URL: https://issues.apache.org/jira/browse/SPARK-18321 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > > Check Java compatibility for this release: > * APIs in {{spark.ml}} > * New APIs in {{spark.mllib}} (There should be few, if any.) > Checking compatibility means: > * Checking for differences in how Scala and Java handle types. Some items to > look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. These may not > be understood in Java, or they may be accessible only via the weirdly named > Java types (with "$" or "#") which are generated by the Scala compiler. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. (In {{spark.ml}}, we have largely tried to avoid > using enumerations, and have instead favored plain strings.) > * Check for differences in generated Scala vs Java docs. E.g., one past > issue was that Javadocs did not respect Scala's package private modifier. > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here as "requires". > * Remember that we should not break APIs from previous releases. 
If you find > a problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). > * If needed for complex issues, create small Java unit tests which execute > each method. (Algorithmic correctness can be checked in Scala.) > Recommendations for how to complete this task: > * There are not great tools. In the past, this task has been done by: > ** Generating API docs > ** Building JAR and outputting the Java class signatures for MLlib > ** Manually inspecting and searching the docs and class signatures for issues > * If you do have ideas for better tooling, please say so we can make this > task easier in the future! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18369) Deprecate runs in Pyspark mllib KMeans
[ https://issues.apache.org/jira/browse/SPARK-18369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15654450#comment-15654450 ] Seth Hendrickson commented on SPARK-18369: -- There is a deprecation note for Python docs, but I realize now that we cannot deprecate the method since we can't overload methods in Python. Let's close this as no issue. > Deprecate runs in Pyspark mllib KMeans > -- > > Key: SPARK-18369 > URL: https://issues.apache.org/jira/browse/SPARK-18369 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Reporter: Seth Hendrickson >Priority: Minor > > We should deprecate runs in pyspark mllib kmeans algo as we have done in > Scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18369) Deprecate runs in Pyspark mllib KMeans
[ https://issues.apache.org/jira/browse/SPARK-18369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seth Hendrickson resolved SPARK-18369. -- Resolution: Not A Problem > Deprecate runs in Pyspark mllib KMeans > -- > > Key: SPARK-18369 > URL: https://issues.apache.org/jira/browse/SPARK-18369 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Reporter: Seth Hendrickson >Priority: Minor > > We should deprecate runs in pyspark mllib kmeans algo as we have done in > Scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18320) ML 2.1 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-18320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15649171#comment-15649171 ] Seth Hendrickson edited comment on SPARK-18320 at 11/8/16 11:23 PM: I scanned through the {{@Since("2.1.0")}} tags in ml/mllib. The major things that were added were LSH and clustering summaries, which are linked and have JIRAs. I made JIRAs for a couple other minor things as well and linked them. was (Author: sethah): I scanned through the {{@Since("2.1.0") tags in ml/mllib}}. The major things that were added were LSH and clustering summaries, which are linked and have JIRAs. I made JIRAs for a couple other minor things as well and linked them. > ML 2.1 QA: API: Python API coverage > --- > > Key: SPARK-18320 > URL: https://issues.apache.org/jira/browse/SPARK-18320 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Reporter: Joseph K. Bradley >Priority: Blocker > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. 
> *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.* -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18320) ML 2.1 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-18320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15649171#comment-15649171 ] Seth Hendrickson commented on SPARK-18320: -- I scanned through the {{@Since("2.1.0") tags in ml/mllib}}. The major things that were added were LSH and clustering summaries, which are linked and have JIRAs. I made JIRAs for a couple other minor things as well and linked them. > ML 2.1 QA: API: Python API coverage > --- > > Key: SPARK-18320 > URL: https://issues.apache.org/jira/browse/SPARK-18320 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Reporter: Joseph K. Bradley >Priority: Blocker > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.* -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18369) Deprecate runs in Pyspark mllib KMeans
Seth Hendrickson created SPARK-18369: Summary: Deprecate runs in Pyspark mllib KMeans Key: SPARK-18369 URL: https://issues.apache.org/jira/browse/SPARK-18369 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Seth Hendrickson Priority: Minor We should deprecate {{runs}} in the PySpark MLlib KMeans algorithm, as has already been done in Scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
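The usual Python pattern for such a deprecation is to keep accepting the argument but emit a warning. A minimal sketch in plain Python (the `kmeans_train` helper is hypothetical, not the actual PySpark API):

```python
import warnings

def kmeans_train(data, k, runs=1):
    """Illustrative trainer: `runs` is still accepted but deprecated and ignored."""
    if runs != 1:
        warnings.warn(
            "Support for runs is deprecated; the value will be ignored.",
            DeprecationWarning,
        )
    # ... actual training would proceed here using a single run ...
    return {"k": k, "runs_used": 1}
```

Callers passing `runs` keep working, but see a `DeprecationWarning` until the parameter is removed.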
[jira] [Updated] (SPARK-18366) Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer
[ https://issues.apache.org/jira/browse/SPARK-18366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seth Hendrickson updated SPARK-18366: - Component/s: PySpark ML > Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer > --- > > Key: SPARK-18366 > URL: https://issues.apache.org/jira/browse/SPARK-18366 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Reporter: Seth Hendrickson >Priority: Minor > > We should add the new {{handleInvalid}} param for these transformers to > Python to maintain API parity. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18366) Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer
Seth Hendrickson created SPARK-18366: Summary: Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer Key: SPARK-18366 URL: https://issues.apache.org/jira/browse/SPARK-18366 Project: Spark Issue Type: New Feature Reporter: Seth Hendrickson Priority: Minor We should add the new {{handleInvalid}} param for these transformers to Python to maintain API parity. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
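The semantics being ported to Python can be sketched in plain terms: {{handleInvalid}} chooses what happens to NaN or out-of-range values. The `bucketize` helper below is hypothetical and only illustrates the three modes (`error`, `skip`, `keep`), not Spark's implementation:

```python
import math

def bucketize(values, splits, handle_invalid="error"):
    """Assign each value to a bucket defined by `splits` (illustrative sketch)."""
    n_buckets = len(splits) - 1
    out = []
    for v in values:
        invalid = math.isnan(v) or v < splits[0] or v > splits[-1]
        if invalid:
            if handle_invalid == "error":
                raise ValueError(f"invalid value: {v}")
            elif handle_invalid == "skip":
                continue  # drop the row entirely
            elif handle_invalid == "keep":
                out.append(n_buckets)  # extra bucket reserved for invalid values
                continue
        # Find the bucket [lo, hi) containing v; the last bucket is closed on the right.
        for i in range(n_buckets):
            if splits[i] <= v < splits[i + 1] or (i == n_buckets - 1 and v == splits[-1]):
                out.append(i)
                break
    return out
```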
[jira] [Commented] (SPARK-18321) ML 2.1 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-18321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15648529#comment-15648529 ] Seth Hendrickson commented on SPARK-18321: -- I've taken a look at the new LSH additions as well as the clustering summaries which were both added since 2.0. They seem ok. I'd appreciate some guidance: is the main item here to comb through API docs for Java and see that type signatures check out, as well as matching Java and Scala APIs? What other tools are there? > ML 2.1 QA: API: Java compatibility, docs > > > Key: SPARK-18321 > URL: https://issues.apache.org/jira/browse/SPARK-18321 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > > Check Java compatibility for this release: > * APIs in {{spark.ml}} > * New APIs in {{spark.mllib}} (There should be few, if any.) > Checking compatibility means: > * comparing with the Scala doc > * verifying that Java docs are not messed up by Scala type incompatibilities. > Some items to look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. > * If needed for complex issues, create small Java unit tests which execute > each method. (The correctness can be checked in Scala.) > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here as "requires". > * Remember that we should not break APIs from previous releases. 
If you find > a problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a Java-friendly > version of the API). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18316) Spark MLlib, GraphX 2.1 QA umbrella
[ https://issues.apache.org/jira/browse/SPARK-18316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15645454#comment-15645454 ] Seth Hendrickson commented on SPARK-18316: -- Much appreciated [~josephkb]! > Spark MLlib, GraphX 2.1 QA umbrella > --- > > Key: SPARK-18316 > URL: https://issues.apache.org/jira/browse/SPARK-18316 > Project: Spark > Issue Type: Umbrella > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > > This JIRA lists tasks for the next Spark release's QA period for MLlib and > GraphX. *SparkR is separate: [SPARK-18329].* > The list below gives an overview of what is involved, and the corresponding > JIRA issues are linked below that. > h2. API > * Check binary API compatibility for Scala/Java > * Audit new public APIs (from the generated html doc) > ** Scala > ** Java compatibility > ** Python coverage > * Check Experimental, DeveloperApi tags > h2. Algorithms and performance > * Performance tests > * Major new algorithms: MinHash, RandomProjection > h2. Documentation and example code > * For new algorithms, create JIRAs for updating the user guide sections & > examples > * Update Programming Guide > * Update website -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18282) Add model summaries for Python GMM and BisectingKMeans
Seth Hendrickson created SPARK-18282: Summary: Add model summaries for Python GMM and BisectingKMeans Key: SPARK-18282 URL: https://issues.apache.org/jira/browse/SPARK-18282 Project: Spark Issue Type: New Feature Components: ML, PySpark Reporter: Seth Hendrickson Priority: Minor GaussianMixtureModel and BisectingKMeansModel in Python do not have model summaries, although they are implemented in Scala. We should add them for API parity before the 2.1 release. After the QA JIRAs are created, this can be linked as a subtask. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18276) Some ML training summaries are not copied when {{copy()}} is called.
Seth Hendrickson created SPARK-18276: Summary: Some ML training summaries are not copied when {{copy()}} is called. Key: SPARK-18276 URL: https://issues.apache.org/jira/browse/SPARK-18276 Project: Spark Issue Type: Improvement Components: ML Reporter: Seth Hendrickson Priority: Minor GaussianMixture, KMeans, BisectingKMeans, and GeneralizedLinearRegression models do not copy their training summaries inside the {{copy}} method. In contrast, Linear/Logistic regression models do. They should all be consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
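The requested consistency can be illustrated with a toy model class (hypothetical, not Spark code): the fix is simply that `copy()` propagates the training summary to the new instance instead of dropping it.

```python
class IllustrativeModel:
    """Sketch of a fitted model whose copy() carries the training summary along."""

    def __init__(self, coefficients):
        self.coefficients = coefficients
        self._summary = None

    def set_summary(self, summary):
        self._summary = summary
        return self

    @property
    def has_summary(self):
        return self._summary is not None

    def copy(self):
        new_model = IllustrativeModel(list(self.coefficients))
        # The fix being asked for: propagate the summary rather than drop it.
        new_model._summary = self._summary
        return new_model
```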
[jira] [Commented] (SPARK-18081) Locality Sensitive Hashing (LSH) User Guide
[ https://issues.apache.org/jira/browse/SPARK-18081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15637424#comment-15637424 ] Seth Hendrickson commented on SPARK-18081: -- No worries, just wanted to check in to see if you had bandwidth to do it. You can get a preview of the user guide by building the docs with jekyll {{SKIP_API=1 jekyll build}} inside the docs directory. For more detail, please see [the readme|https://github.com/apache/spark/tree/master/docs] > Locality Sensitive Hashing (LSH) User Guide > --- > > Key: SPARK-18081 > URL: https://issues.apache.org/jira/browse/SPARK-18081 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley >Assignee: Yun Ni > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18081) Locality Sensitive Hashing (LSH) User Guide
[ https://issues.apache.org/jira/browse/SPARK-18081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15636757#comment-15636757 ] Seth Hendrickson commented on SPARK-18081: -- [~yunn] Do you have a status update on this? It would be great to have this for 2.1 > Locality Sensitive Hashing (LSH) User Guide > --- > > Key: SPARK-18081 > URL: https://issues.apache.org/jira/browse/SPARK-18081 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley >Assignee: Yun Ni > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15634354#comment-15634354 ] Seth Hendrickson commented on SPARK-15581: -- I think the points you mention are very important to get right moving forward. We can certainly debate about what should go on the roadmap, but regardless I think it would be helpful to maintain a specific subset of JIRAs that we expect to get done for the next release cycle. Particularly: - We should maintain a list of items that we WILL get done for the next release, and we should deliver on nearly every one, barring unforeseen circumstances. If we don't get some of the items done, we should understand why and adjust accordingly until we can reach a list of items that we can consistently deliver on. - The list of items should be small and targeted, and should take into account things like committer/reviewer bandwidth. MLlib does not have a ton of active committers right now, like SQL might have, and the roadmap should reflect that. We need to be realistic. - We should make every effort to be as specific as possible. Linking to umbrella JIRAs hurts us IMO, and we'd be better off listing specific JIRAs. Some of the umbrella tickets contain items that are longer term or have little interest (nice-to-haves), but realistically won't get implemented (in a timely manner). For example, I looked at the tree umbrellas and I see some items that are high priority and can be done in one release cycle, but also other items that have been around for a long time and seem to have little interest. The list should contain only the items that we expect to get done. - As you say, every item should have a committer linked to it that is capable of merging it. They do not have to be the primary reviewer, but they should have sufficient expertise such that they feel comfortable merging it after it has been appropriately reviewed. 
One interesting example to be wary of is that there seem to be a LOT of tree-related items on the roadmap, but Joseph has traditionally been the only (at least the main) committer involved in tree-related JIRAs. I don't think it's realistic to target all of these tree improvements when we have limited committers available to review/merge them. We can trim them down to a realistic subset. I propose a revised roadmap that contains two classifications of items: 1. JIRAs that will be done by the next release 2. JIRAs that will be done at some point before the next major release (e.g. 3.0) JIRAs that are still up for debate (e.g. adding a factorization machine) should not be on the roadmap. That does not mean they will not get done, but they are not necessarily "planned" for any particular timeframe. IMO this revised roadmap can/will provide a lot more transparency, and appropriately set review expectations. If it's on the list of "will do by next minor release," then contributors should expect it to be reviewed. What does everyone else think? Also, I took a bit of time to aggregate lists of specific JIRAs that I think fit into the two categories I listed above [here|https://docs.google.com/spreadsheets/d/1nNvbGoarRvhsMkYaFiU6midyHrndPBYQTcKKNOF5xcs/edit?usp=sharing] (note: does not contain SparkR items). I am not (necessarily) proposing to move the list to this Google doc, and I understand this is still undergoing discussion. I just wanted to provide an example of what the above might look like. > MLlib 2.1 Roadmap > - > > Key: SPARK-15581 > URL: https://issues.apache.org/jira/browse/SPARK-15581 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > Labels: roadmap > Fix For: 2.1.0 > > > This is a master list for MLlib improvements we are working on for the next > release. Please view this as a wish list rather than a definite plan, for we > don't have an accurate estimate of available resources. 
Due to limited review > bandwidth, features appearing on this list will get higher priority during > code review. But feel free to suggest new items to the list in comments. We > are experimenting with this process. Your feedback would be greatly > appreciated. > h1. Instructions > h2. For contributors: > * Please read > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > carefully. Code style, documentation, and unit tests are important. > * If you are a first-time Spark contributor, please always start with a > [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather > than a medium/big feature. Based on our experience, mixing the development > process with a big feature usually causes long delay in code review. > * Never work silently. Let everyone know on the corresponding JIRA
[jira] [Comment Edited] (SPARK-15581) MLlib 2.1 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15634354#comment-15634354 ] Seth Hendrickson edited comment on SPARK-15581 at 11/3/16 9:28 PM: --- I think the points you mention are very important to get right moving forward. We can certainly debate about what should go on the roadmap, but regardless I think it would be helpful to maintain a specific subset of JIRAs that we expect to get done for the next release cycle. Particularly: - We should maintain a list of items that we WILL get done for the next release, and we should deliver on nearly every one, barring unforeseen circumstances. If we don't get some of the items done, we should understand why and adjust accordingly until we can reach a list of items that we can consistently deliver on. - The list of items should be small and targeted, and should take into account things like committer/reviewer bandwidth. MLlib does not have a ton of active committers right now, like SQL might have, and the roadmap should reflect that. We need to be realistic. - We should make every effort to be as specific as possible. Linking to umbrella JIRAs hurts us IMO, and we'd be better off listing specific JIRAs. Some of the umbrella tickets contain items that are longer term or have little interest (nice-to-haves), but realistically won't get implemented (in a timely manner). For example, I looked at the tree umbrellas and I see some items that are high priority and can be done in one release cycle, but also other items that have been around for a long time and seem to have little interest. The list should contain only the items that we expect to get done. - As you say, every item should have a committer linked to it that is capable of merging it. They do not have to be the primary reviewer, but they should have sufficient expertise such that they feel comfortable merging it after it has been appropriately reviewed. 
One interesting example to be wary of is that there seem to be a LOT of tree related items on the roadmap, but Joseph has traditionally been the only (at least the main) committer involved in tree-related JIRAs. I don't think it's realistic to target all of these tree improvements when we have limited committers available to review/merge them. We can trim them down to a realistic subset. I propose a revised roadmap that contains two classifications of items: 1. JIRAs that will be done by the next release 2. JIRAs that will be done at some point before the next major release (e.g. 3.0) JIRAs that are still up for debate (e.g. adding a factorization machine) should not be on the roadmap. That does not mean they will not get done, but they are not necessarily "planned" for any particular timeframe. IMO this revised roadmap can/will provide a lot more transparency, and appropriately set review expectations. If it's on the list of "will do by next minor release," then contributors should expect it to be reviewed. What does everyone else think? Also, I took a bit of time to aggregate lists of specific JIRAs that I think fit into the two categories I listed above [here|https://docs.google.com/spreadsheets/d/1nNvbGoarRvhsMkYaFiU6midyHrndPBYQTcKKNOF5xcs/edit?usp=sharing] (note: does not contain SparkR items). I am not (necessarily) proposing to move the list to this google doc, and I understand this is still undergoing discussion. I just wanted to provide an example of what the above might look like. was (Author: sethah): I think the points you mention are very important to get right moving forward. We can certainly debate about what should go on the roadmap, but regardless I think it would be helpful to maintain a specific subset of JIRAs that we expect to get done for the next release cycle. Particularly: - We should maintain a list of items that we WILL get done for the next release, and we should deliver on nearly every one, barring unforeseen circumstances. 
If we don't get some of the items done, we should understand why and adjust accordingly until we can reach a list of items that we can consistently deliver on. - The list of items should be small and targeted, and should take into account things like committer/reviewer bandwidth. MLlib does not have a ton of active committers right now, like SQL might have, and the roadmap should reflect that. We need to be realistic. - We should make every effort to be as specific as possible. Linking to umbrella JIRAs hurts us IMO, and we'd be better off listing specific JIRAs. Some of the umbrella tickets contain items that are longer term or have little interest (nice-to-haves), but realistically won't get implemented (in a timely manner). For example, I looked at the tree umbrellas and I see some items that are high priority and can be done in one release cycle, but also other items that have been
[jira] [Commented] (SPARK-17138) Python API for multinomial logistic regression
[ https://issues.apache.org/jira/browse/SPARK-17138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15633627#comment-15633627 ] Seth Hendrickson commented on SPARK-17138: -- [~yanboliang] Can you mark this as resolved? > Python API for multinomial logistic regression > -- > > Key: SPARK-17138 > URL: https://issues.apache.org/jira/browse/SPARK-17138 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > Once [SPARK-7159|https://issues.apache.org/jira/browse/SPARK-7159] is merged, > we should make a Python API for it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18253) ML Instrumentation logging requires too much manual implementation
Seth Hendrickson created SPARK-18253: Summary: ML Instrumentation logging requires too much manual implementation Key: SPARK-18253 URL: https://issues.apache.org/jira/browse/SPARK-18253 Project: Spark Issue Type: Improvement Components: ML Reporter: Seth Hendrickson Priority: Minor [SPARK-14567|https://issues.apache.org/jira/browse/SPARK-14567] introduced an {{Instrumentation}} class for standardized logging of ML training sessions. Right now, we manually log individual params for each algorithm, partly because we don't want to log all params: some can be huge, and logging them would flood the logs. The current approach is not sustainable; we should find a more generic way of logging params in ML algorithms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
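One possible direction, sketched in plain Python with a hypothetical `log_params` helper (not the actual {{Instrumentation}} API): log every param generically, but replace oversized values with a placeholder so the logs cannot be flooded.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("Instrumentation")

def log_params(params, max_len=256):
    """Log all params whose serialized form is small; omit huge ones by size guard."""
    logged = {}
    for name, value in params.items():
        encoded = json.dumps(value, default=str)
        if len(encoded) <= max_len:
            logged[name] = value
        else:
            logged[name] = f"<omitted: {len(encoded)} bytes>"
    logger.info("training params: %s", json.dumps(logged, default=str))
    return logged
```

A size guard like this would let every algorithm share one code path instead of hand-picking params per estimator.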
[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15623661#comment-15623661 ] Seth Hendrickson commented on SPARK-15784: -- This seems like it fits the framework of a feature transformer. We could generate a real-valued feature column using the PIC algorithm, where the values are just the components of the pseudo-eigenvector. Alternatively, we could pipeline a KMeans clustering on the end, though I think it makes more sense to let users do that themselves; that's up for debate. > Add Power Iteration Clustering to spark.ml > -- > > Key: SPARK-15784 > URL: https://issues.apache.org/jira/browse/SPARK-15784 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xinh Huynh > > Adding this algorithm is required as part of SPARK-4591: Algorithm/model > parity for spark.ml. The review JIRA for clustering is SPARK-14380. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
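For reference, the core of PIC is a truncated power iteration on the row-normalized similarity matrix; the intermediate vector is the real-valued "feature" discussed above. A plain-Python sketch, illustrative only (not Spark's distributed implementation; the iteration is stopped early because full convergence yields the uninformative uniform vector):

```python
def power_iteration(similarity, num_iters=10, v0=None):
    """Truncated power iteration: v <- W v with L1 normalization, where
    W = D^-1 * A is the row-normalized similarity matrix."""
    n = len(similarity)
    row_sums = [sum(row) for row in similarity]
    w = [[similarity[i][j] / row_sums[i] for j in range(n)] for i in range(n)]
    v = list(v0) if v0 is not None else [1.0 / n] * n
    for _ in range(num_iters):
        v = [sum(w[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(abs(x) for x in v)
        v = [x / norm for x in v]
    return v  # one real value per node; nodes in the same cluster get close values
```

On a two-block similarity matrix, entries within a block converge toward each other much faster than the blocks converge to one another, which is what makes the vector usable as a feature column.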
[jira] [Created] (SPARK-18060) Avoid unnecessary standardization in multinomial logistic regression training
Seth Hendrickson created SPARK-18060: Summary: Avoid unnecessary standardization in multinomial logistic regression training Key: SPARK-18060 URL: https://issues.apache.org/jira/browse/SPARK-18060 Project: Spark Issue Type: Sub-task Components: ML Reporter: Seth Hendrickson The MLOR implementation in spark.ml trains the model in the standardized feature space by dividing the feature values by the column standard deviation in each iteration. We perform this computation many times more than necessary in order to achieve a sequential memory access pattern when computing the gradients. We can have both - sequential access patterns and reduced computation - if we use a column-major layout for the coefficients. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
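The algebraic identity that makes this saving possible: dividing each feature by its column standard deviation at every iteration produces the same margins as folding the 1/sigma factors into the coefficients once. A toy demonstration in plain Python (illustrative numbers, not Spark code):

```python
def dot(xs, ys):
    return sum(x * y for x, y in zip(xs, ys))

# A toy feature row, per-column standard deviations, and coefficients.
x = [2.0, 4.0, 8.0]
sigma = [1.0, 2.0, 4.0]
beta = [0.5, -1.0, 0.25]

# Option A (repeated standardization): scale the features on every access.
margin_a = dot([xi / si for xi, si in zip(x, sigma)], beta)

# Option B (the proposed saving): fold 1/sigma into the coefficients once,
# then use the raw features with plain sequential access.
beta_scaled = [bi / si for bi, si in zip(beta, sigma)]
margin_b = dot(x, beta_scaled)
```

Because dot(x/sigma, beta) == dot(x, beta/sigma) termwise, the per-iteration division can be eliminated entirely.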
[jira] [Created] (SPARK-18036) Decision Trees do not handle edge cases
Seth Hendrickson created SPARK-18036: Summary: Decision Trees do not handle edge cases Key: SPARK-18036 URL: https://issues.apache.org/jira/browse/SPARK-18036 Project: Spark Issue Type: Bug Components: ML, MLlib Reporter: Seth Hendrickson Priority: Minor Decision trees/GBT/RF do not handle edge cases such as constant features or empty features. For example: {code} val dt = new DecisionTreeRegressor() val data = Seq(LabeledPoint(1.0, Vectors.dense(Array.empty[Double]))).toDF() dt.fit(data) java.lang.UnsupportedOperationException: empty.max at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:229) at scala.collection.mutable.ArrayOps$ofInt.max(ArrayOps.scala:234) at org.apache.spark.ml.tree.impl.DecisionTreeMetadata$.buildMetadata(DecisionTreeMetadata.scala:207) at org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:105) at org.apache.spark.ml.regression.DecisionTreeRegressor.train(DecisionTreeRegressor.scala:93) at org.apache.spark.ml.regression.DecisionTreeRegressor.train(DecisionTreeRegressor.scala:46) at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) ... 52 elided {code} as well as {code} val dt = new DecisionTreeRegressor() val data = Seq(LabeledPoint(1.0, Vectors.dense(0.0, 0.0, 0.0))).toDF() dt.fit(data) java.lang.UnsupportedOperationException: empty.maxBy at scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:236) at scala.collection.SeqViewLike$AbstractTransformed.maxBy(SeqViewLike.scala:37) at org.apache.spark.ml.tree.impl.RandomForest$.binsToBestSplit(RandomForest.scala:846) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
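A trainer can reject these inputs up front with a clear message instead of failing deep inside with {{empty.max}} / {{empty.maxBy}}. A hypothetical validation sketch in plain Python (not the actual Spark fix):

```python
def check_tree_input(features_per_row):
    """Validate tree-trainer input: non-empty feature vectors and at least
    one non-constant feature (the two edge cases in the report)."""
    if not features_per_row or len(features_per_row[0]) == 0:
        raise ValueError("DecisionTree requires at least one feature column")
    num_features = len(features_per_row[0])
    # A feature is constant if every row holds the same value; no split on it
    # can reduce impurity.
    constant = [
        all(row[j] == features_per_row[0][j] for row in features_per_row)
        for j in range(num_features)
    ]
    if all(constant):
        raise ValueError("all features are constant; no split can improve impurity")
    return constant
```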
[jira] [Created] (SPARK-18019) Log instrumentation in GBTs
Seth Hendrickson created SPARK-18019: Summary: Log instrumentation in GBTs Key: SPARK-18019 URL: https://issues.apache.org/jira/browse/SPARK-18019 Project: Spark Issue Type: Sub-task Reporter: Seth Hendrickson Sub-task for adding instrumentation to GBTs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17941) Logistic regression test suites should use weights when comparing to glmnet
Seth Hendrickson created SPARK-17941: Summary: Logistic regression test suites should use weights when comparing to glmnet Key: SPARK-17941 URL: https://issues.apache.org/jira/browse/SPARK-17941 Project: Spark Issue Type: Test Components: ML Reporter: Seth Hendrickson Priority: Minor Logistic regression suite currently has many test cases comparing to R's glmnet. Both libraries support weights, and to make the testing of weights in Spark LOR more robust, we should add weights to all the test cases. The current weight testing is quite minimal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
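The property such weight tests typically assert is that an integer weight behaves exactly like replicating the instance. Shown here for a weighted mean standing in for a full model fit (illustrative only, not the test-suite code):

```python
def weighted_mean(values, weights):
    """Weighted estimator standing in for a weighted model fit."""
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Giving instance 3.0 a weight of 2 must match duplicating it with weight 1.
direct = weighted_mean([1.0, 3.0], [1.0, 2.0])
replicated = weighted_mean([1.0, 3.0, 3.0], [1.0, 1.0, 1.0])
```

The same equivalence check, run against glmnet's weighted fits, is what makes the comparisons robust.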
[jira] [Commented] (SPARK-17906) MulticlassClassificationEvaluator support target label
[ https://issues.apache.org/jira/browse/SPARK-17906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572154#comment-15572154 ] Seth Hendrickson commented on SPARK-17906: -- We are adding model summaries that would expose some of this behavior. For example, see [https://github.com/apache/spark/pull/15435]. That PR will likely expose some of the functionality being requested here. > MulticlassClassificationEvaluator support target label > -- > > Key: SPARK-17906 > URL: https://issues.apache.org/jira/browse/SPARK-17906 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: zhengruifeng >Priority: Minor > > In practice, I sometime only focus on metric of one special label. > For example, in CTR prediction, I usually only mind F1 of positive class. > In sklearn, this is supported: > {code} > >>> from sklearn.metrics import classification_report > >>> y_true = [0, 1, 2, 2, 2] > >>> y_pred = [0, 0, 2, 2, 1] > >>> target_names = ['class 0', 'class 1', 'class 2'] > >>> print(classification_report(y_true, y_pred, target_names=target_names)) > precisionrecall f1-score support > class 0 0.50 1.00 0.67 1 > class 1 0.00 0.00 0.00 1 > class 2 1.00 0.67 0.80 3 > avg / total 0.70 0.60 0.61 5 > {code} > Now, ml only support `weightedXXX`. So I think there may be a point to > improve. > The API may be designed like this: > {code} > val dataset = ... > val evaluator = new MulticlassClassificationEvaluator > evaluator.setMetricName("f1") > evaluator.evaluate(dataset) // weightedF1 of all classes > evaluator.setTarget(0.0).setMetricName("f1") > evaluator.evaluate(dataset) // F1 of class "0" > {code} > what's your opinion? [~yanboliang][~josephkb][~sethah][~srowen] > If this is useful and acceptable, I'm happy to work on this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
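The per-class metrics requested above are straightforward to compute. A plain-Python sketch (the `per_class_f1` helper is hypothetical and mirrors the sklearn example's numbers, not any Spark API):

```python
def per_class_f1(y_true, y_pred, target):
    """Precision, recall, and F1 for a single target label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == target and p == target)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != target and p == target)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

On the example data from the description, class 0 yields precision 0.50, recall 1.00 and class 2 yields F1 0.80, matching the classification_report output quoted above.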
[jira] [Commented] (SPARK-17772) Add helper testing methods for instance weighting
[ https://issues.apache.org/jira/browse/SPARK-17772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15564174#comment-15564174 ] Seth Hendrickson commented on SPARK-17772: -- I'm working on this. > Add helper testing methods for instance weighting > - > > Key: SPARK-17772 > URL: https://issues.apache.org/jira/browse/SPARK-17772 > Project: Spark > Issue Type: Test > Components: ML >Reporter: Seth Hendrickson >Priority: Minor > > More and more ML algos are accepting instance weights. We keep replicating > code to test instance weighting in every test suite, which will get out of > hand rather quickly. We can and should implement some generic instance weight > test helper methods so that we can reduce duplicated code and standardize > these tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9478) Add class weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15563919#comment-15563919 ] Seth Hendrickson commented on SPARK-9478: - I'm going to revive this, and hopefully submit a PR soon. > Add class weights to Random Forest > -- > > Key: SPARK-9478 > URL: https://issues.apache.org/jira/browse/SPARK-9478 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.4.1 >Reporter: Patrick Crenshaw > > Currently, this implementation of random forest does not support class > weights. Class weights are important when there is imbalanced training data > or the evaluation metric of a classifier is imbalanced (e.g. true positive > rate at some false positive threshold). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17139) Add model summary for MultinomialLogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15563729#comment-15563729 ] Seth Hendrickson commented on SPARK-17139: -- [~WeichenXu123] Status? > Add model summary for MultinomialLogisticRegression > --- > > Key: SPARK-17139 > URL: https://issues.apache.org/jira/browse/SPARK-17139 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > Add model summary to multinomial logistic regression using same interface as > in other ML models. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17140) Add initial model to MultinomialLogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-17140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seth Hendrickson resolved SPARK-17140. -- Resolution: Invalid MultinomialLogisticRegression was eliminated in [SPARK-17163|https://issues.apache.org/jira/browse/SPARK-17163] > Add initial model to MultinomialLogisticRegression > -- > > Key: SPARK-17140 > URL: https://issues.apache.org/jira/browse/SPARK-17140 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson > > We should add initial model support to Multinomial logistic regression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17824) QR solver for WeightedLeastSquares
[ https://issues.apache.org/jira/browse/SPARK-17824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15556237#comment-15556237 ] Seth Hendrickson commented on SPARK-17824: -- Thank you for clarifying. > QR solver for WeightedLeastSquares > -- > > Key: SPARK-17824 > URL: https://issues.apache.org/jira/browse/SPARK-17824 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > Cholesky decomposition is unstable for near-singular and rank-deficient > matrices and only works on positive definite matrices, which cannot be > guaranteed in all cases; it is often preferred when matrix A is very large and > sparse because it is faster to compute. QR decomposition has better numerical > properties than Cholesky and can work on matrices that are not positive > definite. Spark MLlib {{WeightedLeastSquares}} currently uses Cholesky decomposition to > solve the normal equations; we should also support, or move to, a QR solver > for better stability. I'm preparing to send a PR. > cc [~dbtsai] [~sethah]
[jira] [Comment Edited] (SPARK-17824) QR solver for WeightedLeastSquares
[ https://issues.apache.org/jira/browse/SPARK-17824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1385#comment-1385 ] Seth Hendrickson edited comment on SPARK-17824 at 10/7/16 3:42 PM: --- [~yanboliang] Can you please post your design plans? This is almost certainly going to conflict with the PR I'm about to send for [SPARK-17748|https://issues.apache.org/jira/browse/SPARK-17748]. In that PR, I have implemented a pluggable solver for the normal equations, I posted a bit of detail on the JIRA. In fact, if it gets merged we will be able to deal with singular matrices by running L-BFGS on the normal equations on the driver (one-pass). It may not be the most elegant solution, but it is a byproduct of implementing the OWL-QN solver. I'd like to hear more about your patch to understand how the two fit together, what conflicts there are, and how we need to coordinate. In fact, I may have already written some of the test cases you will need to write, so maybe we can share them :) Thanks! was (Author: sethah): [~yanboliang] Can you please post your design plans? This is almost certainly going to conflict with the PR I'm about to send for [SPARK-17748|https://issues.apache.org/jira/browse/SPARK-17748]. In that PR, I have implemented a pluggable solver for the normal equations, I posted a bit of detail on the JIRA. In fact, if it gets merged we will be able to deal with singular matrices by running L-BFGS on the normal equations on the driver (one-pass). It may not be the most elegant solution, but it is a byproduct of implementing the OWL-QN solver. I'd like to hear more about your patch to understand how the two fit together, what conflicts there are, and how we need to coordinate. Thanks! 
> QR solver for WeightedLeastSquares > -- > > Key: SPARK-17824 > URL: https://issues.apache.org/jira/browse/SPARK-17824 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > Cholesky decomposition is unstable for near-singular and rank-deficient > matrices and only works on positive definite matrices, which cannot be > guaranteed in all cases; it is often preferred when matrix A is very large and > sparse because it is faster to compute. QR decomposition has better numerical > properties than Cholesky and can work on matrices that are not positive > definite. Spark MLlib {{WeightedLeastSquares}} currently uses Cholesky decomposition to > solve the normal equations; we should also support, or move to, a QR solver > for better stability. I'm preparing to send a PR. > cc [~dbtsai] [~sethah]
[jira] [Commented] (SPARK-17824) QR solver for WeightedLeastSquares
[ https://issues.apache.org/jira/browse/SPARK-17824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1385#comment-1385 ] Seth Hendrickson commented on SPARK-17824: -- [~yanboliang] Can you please post your design plans? This is almost certainly going to conflict with the PR I'm about to send for [SPARK-17748|https://issues.apache.org/jira/browse/SPARK-17748]. In that PR, I have implemented a pluggable solver for the normal equations; I posted a bit of detail on the JIRA. In fact, if it gets merged we will be able to deal with singular matrices by running L-BFGS on the normal equations on the driver (one-pass). It may not be the most elegant solution, but it is a byproduct of implementing the OWL-QN solver. I'd like to hear more about your patch to understand how the two fit together, what conflicts there are, and how we need to coordinate. Thanks! > QR solver for WeightedLeastSquares > -- > > Key: SPARK-17824 > URL: https://issues.apache.org/jira/browse/SPARK-17824 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > Cholesky decomposition is unstable for near-singular and rank-deficient > matrices and only works on positive definite matrices, which cannot be > guaranteed in all cases; it is often preferred when matrix A is very large and > sparse because it is faster to compute. QR decomposition has better numerical > properties than Cholesky and can work on matrices that are not positive > definite. Spark MLlib {{WeightedLeastSquares}} currently uses Cholesky decomposition to > solve the normal equations; we should also support, or move to, a QR solver > for better stability. I'm preparing to send a PR. > cc [~dbtsai] [~sethah]
[jira] [Commented] (SPARK-17789) Don't force users to set k for KMeans if initial model is set
[ https://issues.apache.org/jira/browse/SPARK-17789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15550601#comment-15550601 ] Seth Hendrickson commented on SPARK-17789: -- When the model is fit, the initial model may have some number of centers (say, 5), but k defaults to 1, so the check in the fit method will throw an exception. > Don't force users to set k for KMeans if initial model is set > - > > Key: SPARK-17789 > URL: https://issues.apache.org/jira/browse/SPARK-17789 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Seth Hendrickson >Priority: Minor > > In the initial implementation of initalModel, we allow users to set the > initial model with a KMeansModel that has a different {{k}} than the current > model. We throw an error at train time if the two are mismatched. This means > that the following code throws a runtime exception: > {code} > val kmeansModel = new KMeans().setInitialModel(model).fit(df) > {code} > We should discuss this behavior, and decide if we should enforce users to set > both the initial model and k, or if we should alter k when the initial model > is set, or if we should keep the current behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17789) Don't force users to set k for KMeans if initial model is set
[ https://issues.apache.org/jira/browse/SPARK-17789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seth Hendrickson updated SPARK-17789: - Description: In the initial implementation of initalModel, we allow users to set the initial model with a KMeansModel that has a different {{k}} than the current model. We throw an error at train time if the two are mismatched. This means that the following code throws a runtime exception: {code} val kmeansModel = new KMeans().setInitialModel(model).fit(df) {code} We should discuss this behavior, and decide if we should enforce users to set both the initial model and k, or if we should alter k when the initial model is set, or if we should keep the current behavior. was: In the initial implementation of initalModel, we allow users to set the initial model with a KMeansModel that has a different {{k}} than the current model. We throw an error at train time if the two are mismatched. This means that the following code throws a runtime exception: {{code}} val kmeansModel = new KMeans().setInitialModel(model).fit(df) {{code}} We should discuss this behavior, and decide if we should enforce users to set both the initial model and k, or if we should alter k when the initial model is set, or if we should keep the current behavior. > Don't force users to set k for KMeans if initial model is set > - > > Key: SPARK-17789 > URL: https://issues.apache.org/jira/browse/SPARK-17789 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Seth Hendrickson >Priority: Minor > > In the initial implementation of initalModel, we allow users to set the > initial model with a KMeansModel that has a different {{k}} than the current > model. We throw an error at train time if the two are mismatched. 
This means > that the following code throws a runtime exception: > {code} > val kmeansModel = new KMeans().setInitialModel(model).fit(df) > {code} > We should discuss this behavior and decide whether we should require users to set > both the initial model and k, alter k when the initial model is set, or keep > the current behavior.
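One of the options under discussion, inheriting {{k}} from the initial model when the user did not set it explicitly, can be sketched as follows (a hypothetical helper for illustration, not Spark's KMeans API):

```python
# Sketch of a fit-time consistency rule for k vs. the initial model.
# Hypothetical helper -- not Spark's KMeans implementation.
def resolve_k(k_param, k_was_set, initial_model_k):
    """If k was not set explicitly, inherit it from the initial model;
    if both were set, they must agree."""
    if initial_model_k is None:
        return k_param
    if not k_was_set:
        return initial_model_k  # the "alter k to match the model" option
    if k_param != initial_model_k:
        raise ValueError(
            "k=%d does not match initial model with %d centers"
            % (k_param, initial_model_k))
    return k_param

print(resolve_k(2, False, 5))  # 5: k inherited from the initial model
```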
[jira] [Created] (SPARK-17792) L-BFGS solver for linear regression does not accept general numeric label column types
Seth Hendrickson created SPARK-17792: Summary: L-BFGS solver for linear regression does not accept general numeric label column types Key: SPARK-17792 URL: https://issues.apache.org/jira/browse/SPARK-17792 Project: Spark Issue Type: Bug Components: ML Reporter: Seth Hendrickson Priority: Minor There's a bug in accepting numeric types for linear regression. We cast the label to {{DoubleType}} in one spot where we use the normal solver, but not for the l-bfgs solver. The following can reproduce the problem: {code} import org.apache.spark.ml.feature.LabeledPoint import org.apache.spark.ml.linalg.Vectors import org.apache.spark.ml.regression.LinearRegression import org.apache.spark.sql.functions.lit import org.apache.spark.sql.types._ val df = Seq(LabeledPoint(1.0, Vectors.dense(1.0))).toDF().withColumn("weight", lit(1.0).cast(LongType)) val lr = new LinearRegression().setSolver("l-bfgs").setWeightCol("weight") lr.fit(df) {code}
[jira] [Commented] (SPARK-17792) L-BFGS solver for linear regression does not accept general numeric label column types
[ https://issues.apache.org/jira/browse/SPARK-17792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15550249#comment-15550249 ] Seth Hendrickson commented on SPARK-17792: -- I'll have a PR shortly. > L-BFGS solver for linear regression does not accept general numeric label > column types > -- > > Key: SPARK-17792 > URL: https://issues.apache.org/jira/browse/SPARK-17792 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Seth Hendrickson >Priority: Minor > > There's a bug in accepting numeric types for linear regression. We cast the > label to {{DoubleType}} in one spot where we use the normal solver, but not for > the l-bfgs solver. The following can reproduce the problem: > {code} > import org.apache.spark.ml.feature.LabeledPoint > import org.apache.spark.ml.linalg.Vectors > import org.apache.spark.ml.regression.LinearRegression > import org.apache.spark.sql.functions.lit > import org.apache.spark.sql.types._ > val df = Seq(LabeledPoint(1.0, > Vectors.dense(1.0))).toDF().withColumn("weight", lit(1.0).cast(LongType)) > val lr = new LinearRegression().setSolver("l-bfgs").setWeightCol("weight") > lr.fit(df) > {code}
[jira] [Created] (SPARK-17789) Don't force users to set k for KMeans if initial model is set
Seth Hendrickson created SPARK-17789: Summary: Don't force users to set k for KMeans if initial model is set Key: SPARK-17789 URL: https://issues.apache.org/jira/browse/SPARK-17789 Project: Spark Issue Type: Improvement Components: ML Reporter: Seth Hendrickson Priority: Minor In the initial implementation of initialModel, we allow users to set the initial model with a KMeansModel that has a different {{k}} than the current model. We throw an error at train time if the two are mismatched. This means that the following code throws a runtime exception: {code} val kmeansModel = new KMeans().setInitialModel(model).fit(df) {code} We should discuss this behavior and decide whether we should require users to set both the initial model and k, alter k when the initial model is set, or keep the current behavior.
[jira] [Created] (SPARK-17772) Add helper testing methods for instance weighting
Seth Hendrickson created SPARK-17772: Summary: Add helper testing methods for instance weighting Key: SPARK-17772 URL: https://issues.apache.org/jira/browse/SPARK-17772 Project: Spark Issue Type: Test Components: ML Reporter: Seth Hendrickson Priority: Minor More and more ML algorithms accept instance weights. We keep replicating code to test instance weighting in every test suite, which will get out of hand rather quickly. We can and should implement some generic instance-weight test helper methods so that we can reduce duplicated code and standardize these tests.
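The core equivalence such helpers usually check is that integer instance weights behave like row replication: fitting with weight w should match fitting on data where each row appears w times. A tiny Python sketch with a weighted mean standing in for an estimator (all names hypothetical, not the proposed Spark helpers):

```python
# Sketch of a generic instance-weight test: integer weights should be
# equivalent to replicating rows. Hypothetical helper names throughout.
def fit_weighted_mean(rows):
    """Toy 'estimator': rows are (value, weight) pairs."""
    sw = sum(w for _, w in rows)
    return sum(v * w for v, w in rows) / sw

def check_weight_equivalence(rows_with_int_weights):
    """Assert that weighting matches training on replicated rows."""
    weighted = fit_weighted_mean(rows_with_int_weights)
    replicated = [(v, 1.0) for v, w in rows_with_int_weights
                  for _ in range(int(w))]
    assert abs(weighted - fit_weighted_mean(replicated)) < 1e-12

check_weight_equivalence([(1.0, 3), (5.0, 1)])
print("ok")  # no assertion error: the two fits agree
```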
[jira] [Comment Edited] (SPARK-17748) One-pass algorithm for linear regression with L1 and elastic-net penalties
[ https://issues.apache.org/jira/browse/SPARK-17748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15536781#comment-15536781 ] Seth Hendrickson edited comment on SPARK-17748 at 9/30/16 7:16 PM: --- I am working on this currently. The basic plan is to refactor WLS so that it has a pluggable solver for the normal equations. We can implement a new interface like {code:java} trait NormalEquationSolver { def solve( bBar: Double, bbBar: Double, abBar: DenseVector, aaBar: DenseVector, aBar: DenseVector): NormalEquationSolution } class CholeskySolver extends NormalEquationsSolver class QuasiNewtonSolver extends NormalEquationSolver {code} If others have thoughts on the design please comment, otherwise I will continue working on this and submit a PR reasonably soon. cc [~srowen] [~yanboliang] [~dbtsai] was (Author: sethah): I am working on this currently. The basic plan is to refactor WLS so that it has a pluggable solver for the normal equations. We can implement a new interface like {code:java} trait NormalEquationSolver { def solve( bBar: Double, bbBar: Double, abBar: DenseVector, aaBar: DenseVector, aBar: DenseVector): NormalEquationSolution } class CholeskySolver extends NormalEquationsSolver class QuasiNewtonSolver extends NormalEquationSolver {code} If others have thoughts on the design please comment, otherwise I will continue working on this and submit a PR reasonably soon. cc [~srowen] [~yanboliang] > One-pass algorithm for linear regression with L1 and elastic-net penalties > -- > > Key: SPARK-17748 > URL: https://issues.apache.org/jira/browse/SPARK-17748 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Seth Hendrickson > > Currently linear regression uses weighted least squares to solve the normal > equations locally on the driver when the dimensionality is small (<4096). > Weighted least squares uses a Cholesky decomposition to solve the problem > with L2 regularization (which has a closed-form solution). 
We can support > L1/elasticnet penalties by solving the equations locally using OWL-QN solver. > Also note that Cholesky does not handle singular covariance matrices, but > L-BFGS and OWL-QN are capable of providing reasonable solutions. This patch > can also add support for solving singular covariance matrices by also adding > L-BFGS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17748) One-pass algorithm for linear regression with L1 and elastic-net penalties
[ https://issues.apache.org/jira/browse/SPARK-17748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15536781#comment-15536781 ] Seth Hendrickson edited comment on SPARK-17748 at 9/30/16 7:16 PM: --- I am working on this currently. The basic plan is to refactor WLS so that it has a pluggable solver for the normal equations. We can implement a new interface like {code:java} trait NormalEquationSolver { def solve( bBar: Double, bbBar: Double, abBar: DenseVector, aaBar: DenseVector, aBar: DenseVector): NormalEquationSolution } class CholeskySolver extends NormalEquationsSolver class QuasiNewtonSolver extends NormalEquationSolver {code} If others have thoughts on the design please comment, otherwise I will continue working on this and submit a PR reasonably soon. cc [~srowen] [~yanboliang] was (Author: sethah): I am working on this currently. The basic plan is to refactor WLS so that it has a pluggable solver for the normal equations. We can implement a new interface like {code:java} trait NormalEquationSolver { def solve( bBar: Double, bbBar: Double, abBar: DenseVector, aaBar: DenseVector, aBar: DenseVector): NormalEquationSolution } class CholeskySolver extends NormalEquationsSolver class QuasiNewtonSolver extends NormalEquationSolver {code} If others have thoughts on the design please comment, otherwise I will continue working on this and submit a PR reasonably soon. > One-pass algorithm for linear regression with L1 and elastic-net penalties > -- > > Key: SPARK-17748 > URL: https://issues.apache.org/jira/browse/SPARK-17748 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Seth Hendrickson > > Currently linear regression uses weighted least squares to solve the normal > equations locally on the driver when the dimensionality is small (<4096). > Weighted least squares uses a Cholesky decomposition to solve the problem > with L2 regularization (which has a closed-form solution). 
We can support > L1/elasticnet penalties by solving the equations locally using OWL-QN solver. > Also note that Cholesky does not handle singular covariance matrices, but > L-BFGS and OWL-QN are capable of providing reasonable solutions. This patch > can also add support for solving singular covariance matrices by also adding > L-BFGS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17748) One-pass algorithm for linear regression with L1 and elastic-net penalties
[ https://issues.apache.org/jira/browse/SPARK-17748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15536781#comment-15536781 ] Seth Hendrickson commented on SPARK-17748: -- I am working on this currently. The basic plan is to refactor WLS so that it has a pluggable solver for the normal equations. We can implement a new interface like {code:java} trait NormalEquationSolver { def solve( bBar: Double, bbBar: Double, abBar: DenseVector, aaBar: DenseVector, aBar: DenseVector): NormalEquationSolution } class CholeskySolver extends NormalEquationSolver class QuasiNewtonSolver extends NormalEquationSolver {code} If others have thoughts on the design please comment; otherwise I will continue working on this and submit a PR reasonably soon. > One-pass algorithm for linear regression with L1 and elastic-net penalties > -- > > Key: SPARK-17748 > URL: https://issues.apache.org/jira/browse/SPARK-17748 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Seth Hendrickson > > Currently linear regression uses weighted least squares to solve the normal > equations locally on the driver when the dimensionality is small (<4096). > Weighted least squares uses a Cholesky decomposition to solve the problem > with L2 regularization (which has a closed-form solution). We can support > L1/elastic-net penalties by solving the equations locally using the OWL-QN solver. > Also note that Cholesky does not handle singular covariance matrices, but > L-BFGS and OWL-QN are capable of providing reasonable solutions. This patch > can also add support for singular covariance matrices by adding > L-BFGS.
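The pluggable-solver idea can be illustrated outside Spark with a Python analogue of the trait above: an iterative solver minimizes the quadratic objective 0.5 x^T (A^T A) x - (A^T b)^T x, which is why penalties such as L1 can be folded into the same local one-pass setup. Plain gradient descent stands in for OWL-QN/L-BFGS here; all names are hypothetical, not Spark's API:

```python
# Pluggable normal-equation solver pattern, sketched in Python.
# Illustration only -- names mirror the Scala trait above, not Spark code.
class NormalEquationSolver:
    def solve(self, aa, ab):
        raise NotImplementedError

class GradientSolver(NormalEquationSolver):
    """Stands in for the quasi-Newton path: iteratively minimize
    0.5 * x^T (A^T A) x - (A^T b)^T x, so extra penalties (e.g. L1)
    could be added to the objective."""
    def __init__(self, step=0.1, iters=5000):
        self.step, self.iters = step, iters

    def solve(self, aa, ab):
        n = len(ab)
        x = [0.0] * n
        for _ in range(self.iters):
            # gradient of the quadratic objective: (A^T A) x - A^T b
            g = [sum(aa[i][k] * x[k] for k in range(n)) - ab[i]
                 for i in range(n)]
            x = [x[i] - self.step * g[i] for i in range(n)]
        return x

x = GradientSolver().solve([[4.0, 2.0], [2.0, 3.0]], [10.0, 8.0])
print([round(v, 6) for v in x])  # [1.75, 1.5]
```

A Cholesky-based implementation of the same interface would return the closed-form solution in one shot; the point of the pattern is that callers do not care which concrete solver runs.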
[jira] [Created] (SPARK-17748) One-pass algorithm for linear regression with L1 and elastic-net penalties
Seth Hendrickson created SPARK-17748: Summary: One-pass algorithm for linear regression with L1 and elastic-net penalties Key: SPARK-17748 URL: https://issues.apache.org/jira/browse/SPARK-17748 Project: Spark Issue Type: Bug Components: ML Reporter: Seth Hendrickson Currently linear regression uses weighted least squares to solve the normal equations locally on the driver when the dimensionality is small (<4096). Weighted least squares uses a Cholesky decomposition to solve the problem with L2 regularization (which has a closed-form solution). We can support L1/elasticnet penalties by solving the equations locally using OWL-QN solver. Also note that Cholesky does not handle singular covariance matrices, but L-BFGS and OWL-QN are capable of providing reasonable solutions. This patch can also add support for solving singular covariance matrices by also adding L-BFGS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator
[ https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15515529#comment-15515529 ] Seth Hendrickson edited comment on SPARK-17134 at 9/23/16 6:09 AM: --- This makes sense. In my initial testing I found that having to standardize the features in every iteration takes a non-trivial amount of time. Still, you mentioned the desire to not cache the standardized dataset since it can create unnecessary memory overhead. One solution is to allow the users to specify that their data has already been standardized, and then we don't have to perform the extra divisions in the update method. Alternatively, we could do as you suggest above, but store the coefficients in column major order in order to still maximize cache hits. We'll need some testing for both cases to truly understand this. was (Author: sethah): This makes sense. In my initial testing I found that having to standardize the features in every iteration takes a non-trivial amount of time. Still, you mentioned the desire to not cache the standardized dataset since it can create unnecessary memory overhead. One solution is to allow the users to specify that there data has already been standardized, and then we don't have to perform the extra divisions in the update method. Alternatively, we could do as you suggest above, but store the coefficients in column major order in order to still maximize cache hits. We'll need some testing for both cases to truly understand this. > Use level 2 BLAS operations in LogisticAggregator > - > > Key: SPARK-17134 > URL: https://issues.apache.org/jira/browse/SPARK-17134 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > Multinomial logistic regression uses LogisticAggregator class for gradient > updates. We should look into refactoring MLOR to use level 2 BLAS operations > for the updates. Performance testing should be done to show improvements. 
[jira] [Commented] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator
[ https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15515529#comment-15515529 ] Seth Hendrickson commented on SPARK-17134: -- This makes sense. In my initial testing I found that having to standardize the features in every iteration takes a non-trivial amount of time. Still, you mentioned the desire to not cache the standardized dataset since it can create unnecessary memory overhead. One solution is to allow the users to specify that their data has already been standardized, and then we don't have to perform the extra divisions in the update method. Alternatively, we could do as you suggest above, but store the coefficients in column-major order in order to still maximize cache hits. We'll need some testing for both cases to truly understand this. > Use level 2 BLAS operations in LogisticAggregator > - > > Key: SPARK-17134 > URL: https://issues.apache.org/jira/browse/SPARK-17134 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > Multinomial logistic regression uses the LogisticAggregator class for gradient > updates. We should look into refactoring MLOR to use level 2 BLAS operations > for the updates. Performance testing should be done to show improvements.
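The column-major point is concrete: for multinomial logistic regression, the K class margins of one sample form a single matrix-vector product (a level-2 BLAS gemv), and with coefficients stored column-major the inner loop updates all K partial sums from contiguous memory. An illustrative Python sketch of the access pattern (not Spark's LogisticAggregator code):

```python
# Margins for one sample in multinomial LR: m = B x, where B is the
# K x D coefficient matrix. With B stored column-major, each feature's
# K coefficients are contiguous, so the inner loop walks memory
# sequentially -- the locality argument for a level-2 formulation.
# Illustrative sketch only, not Spark code.

def margins_gemv_colmajor(coef_col_major, k, d, x):
    """coef_col_major[j * k + i] holds B[i][j] (column-major layout)."""
    m = [0.0] * k
    for j in range(d):          # one pass per feature...
        xj = x[j]
        base = j * k
        for i in range(k):      # ...updating all K margins contiguously
            m[i] += coef_col_major[base + i] * xj
    return m

# B = [[1, 2], [3, 4]] stored column-major as [1, 3, 2, 4]
print(margins_gemv_colmajor([1.0, 3.0, 2.0, 4.0], 2, 2, [1.0, 1.0]))  # [3.0, 7.0]
```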
[jira] [Commented] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator
[ https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15510198#comment-15510198 ] Seth Hendrickson commented on SPARK-17134: -- Hmm, it would be nice to see this vs the old mlor in rdd API, just as a sanity check. I conducted performance testing against mllib initially, though, so there shouldn't be any regressions. > Use level 2 BLAS operations in LogisticAggregator > - > > Key: SPARK-17134 > URL: https://issues.apache.org/jira/browse/SPARK-17134 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > Multinomial logistic regression uses LogisticAggregator class for gradient > updates. We should look into refactoring MLOR to use level 2 BLAS operations > for the updates. Performance testing should be done to show improvements. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17471) Add compressed method for Matrix class
[ https://issues.apache.org/jira/browse/SPARK-17471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15484775#comment-15484775 ] Seth Hendrickson commented on SPARK-17471: -- [~yanboliang] Do you have any updates on this? We need to make implementing the {{compressed}} method for matrices a high priority. I can look into implementing it, but I don't want to overlap work. Thanks! > Add compressed method for Matrix class > -- > > Key: SPARK-17471 > URL: https://issues.apache.org/jira/browse/SPARK-17471 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Seth Hendrickson > > Vectors in Spark have a {{compressed}} method which selects either a sparse or > dense representation by minimizing storage requirements. Matrices should also > have this method, which is now explicitly needed in {{LogisticRegression}} > since we have implemented multiclass regression. > The compressed method should also give the option to store row major or > column major, and if nothing is specified should select the lower-storage > representation (for sparse).
[jira] [Created] (SPARK-17476) Proper handling for unseen labels in logistic regression training.
Seth Hendrickson created SPARK-17476: Summary: Proper handling for unseen labels in logistic regression training. Key: SPARK-17476 URL: https://issues.apache.org/jira/browse/SPARK-17476 Project: Spark Issue Type: New Feature Components: ML Reporter: Seth Hendrickson Now that logistic regression supports multiclass, it is possible to train on data that has {{K}} classes, but one or more of the classes does not appear in training. For example, {code} (0.0, x1) (2.0, x2) ... {code} Currently, logistic regression assumes that the outcome classes in the above dataset have three levels: {{0, 1, 2}}. Since label 1 never appears, it should never be predicted. In theory, the coefficients should be zero and the intercept should be negative infinity. This can cause problems since we center the intercepts after training. We should discuss whether or not the intercepts actually tend to -infinity in practice, and whether or not we should even include them in training. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
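The limiting behavior described in the issue is easy to see numerically: as the unseen class's intercept is driven toward negative infinity, its softmax probability goes to zero, which is why the post-training intercept centering becomes delicate. A small illustration (not Spark code):

```python
import math

# As the intercept b1 of the never-observed class 1 tends to -inf,
# its softmax probability tends to 0. Illustration only, not Spark code.
def softmax(z):
    mx = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - mx) for v in z]
    s = sum(e)
    return [v / s for v in e]

for b1 in (0.0, -10.0, -100.0):
    p = softmax([1.0, b1, 2.0])  # margins for classes 0, 1, 2
    print(b1, round(p[1], 8))    # probability of class 1 shrinks toward 0
```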
[jira] [Commented] (SPARK-17471) Add compressed method for Matrix class
[ https://issues.apache.org/jira/browse/SPARK-17471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15477841#comment-15477841 ] Seth Hendrickson commented on SPARK-17471: -- [~yanboliang] I guess it can be seen as a duplicate, but really there are two separate tasks: 1) add a {{compressed}} method to the matrix library in Spark, which is non-trivial, and 2) add a mechanism inside of MLOR to use the compressed method, and decide how to deal with flattening the sparse matrix into a sparse vector when the binomial family is used. We can keep the JIRAs separate, or do them both together. I see them as separate tasks. > Add compressed method for Matrix class > -- > > Key: SPARK-17471 > URL: https://issues.apache.org/jira/browse/SPARK-17471 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Seth Hendrickson > > Vectors in Spark have a {{compressed}} method which selects either sparse or > dense representation by minimizing storage requirements. Matrices should also > have this method, which is now explicitly needed in {{LogisticRegression}} > since we have implemented multiclass regression. > The compressed method should also give the option to store row major or > column major, and if nothing is specified should select the lower storage > representation (for sparse).
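For reference, the size-based rule behind a {{compressed}} method can be sketched as below. The per-entry byte costs are illustrative approximations (8-byte values, 4-byte indices in a CSC-style layout), not Spark's exact accounting:

```python
# Storage-based choice between dense and sparse (CSC-like) layouts,
# the kind of rule a Matrix.compressed method would apply.
# Byte costs are illustrative assumptions, not Spark's exact formula.
def compressed_is_sparse(num_rows, num_cols, nnz):
    dense_size = 8.0 * num_rows * num_cols           # one double per entry
    sparse_size = 12.0 * nnz + 4.0 * (num_cols + 1)  # values + row indices + colPtrs
    return sparse_size < dense_size

print(compressed_is_sparse(1000, 1000, 10))  # True: very sparse, CSC wins
print(compressed_is_sparse(10, 10, 90))      # False: mostly dense, keep dense
```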