[jira] [Commented] (SPARK-17692) Document ML/MLlib behavior changes in Spark 2.1
[ https://issues.apache.org/jira/browse/SPARK-17692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15526510#comment-15526510 ] Yanbo Liang commented on SPARK-17692: - cc [~mengxr] [~josephkb] [~dbtsai] [~mlnick] [~srowen] > Document ML/MLlib behavior changes in Spark 2.1 > --- > > Key: SPARK-17692 > URL: https://issues.apache.org/jira/browse/SPARK-17692 > Project: Spark > Issue Type: Documentation > Components: ML, MLlib >Reporter: Yanbo Liang >Assignee: Yanbo Liang > Labels: 2.1.0 > > This JIRA records behavior changes of ML/MLlib between 2.0 and 2.1, so we can > note those changes (if any) in the user guide's Migration Guide section. If > you found one, please comment below and link the corresponding JIRA here. > * SPARK-17389: Reduce KMeans default k-means|| init steps to 2 from 5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17692) Document ML/MLlib behavior changes in Spark 2.1
[ https://issues.apache.org/jira/browse/SPARK-17692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-17692: Description: This JIRA records behavior changes of ML/MLlib between 2.0 and 2.1, so we can note those changes (if any) in the user guide's Migration Guide section. If you found one, please comment below and link the corresponding JIRA here. * SPARK-17389: Reduce KMeans default k-means|| init steps to 2 from 5. was: This JIRA records behavior changes of ML/MLlib between 2.0 and 2.1, so we can note those changes (if any) in the user guide's Migration Guide section. If you found one, please comment below and link the corresponding JIRA here. * SPARK-17389 Reduce KMeans default k-means|| init steps to 2 from 5. > Document ML/MLlib behavior changes in Spark 2.1 > --- > > Key: SPARK-17692 > URL: https://issues.apache.org/jira/browse/SPARK-17692 > Project: Spark > Issue Type: Documentation > Components: ML, MLlib >Reporter: Yanbo Liang >Assignee: Yanbo Liang > Labels: 2.1.0 > > This JIRA records behavior changes of ML/MLlib between 2.0 and 2.1, so we can > note those changes (if any) in the user guide's Migration Guide section. If > you found one, please comment below and link the corresponding JIRA here. > * SPARK-17389: Reduce KMeans default k-means|| init steps to 2 from 5.
[jira] [Updated] (SPARK-17692) Document ML/MLlib behavior changes in Spark 2.1
[ https://issues.apache.org/jira/browse/SPARK-17692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-17692: Description: This JIRA records behavior changes of ML/MLlib between 2.0 and 2.1, so we can note those changes (if any) in the user guide's Migration Guide section. If you found one, please comment below and link the corresponding JIRA here. * SPARK-17389 Reduce KMeans default k-means|| init steps to 2 from 5. was: This JIRA records behavior changes of ML/MLlib between 2.0 and 2.1, so we can note those changes (if any) in the user guide's Migration Guide section. If you found one, please comment below and link the corresponding JIRA here. * > Document ML/MLlib behavior changes in Spark 2.1 > --- > > Key: SPARK-17692 > URL: https://issues.apache.org/jira/browse/SPARK-17692 > Project: Spark > Issue Type: Documentation > Components: ML, MLlib >Reporter: Yanbo Liang >Assignee: Yanbo Liang > Labels: 2.1.0 > > This JIRA records behavior changes of ML/MLlib between 2.0 and 2.1, so we can > note those changes (if any) in the user guide's Migration Guide section. If > you found one, please comment below and link the corresponding JIRA here. > * SPARK-17389 Reduce KMeans default k-means|| init steps to 2 from 5.
[jira] [Updated] (SPARK-17692) Document ML/MLlib behavior changes in Spark 2.1
[ https://issues.apache.org/jira/browse/SPARK-17692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-17692: Description: This JIRA records behavior changes of ML/MLlib between 2.0 and 2.1, so we can note those changes (if any) in the user guide's Migration Guide section. If you found one, please comment below and link the corresponding JIRA here. * was:This JIRA keeps a list of MLlib behavior changes in Spark 2.1. So we can remember to add them to the migration guide / release notes. > Document ML/MLlib behavior changes in Spark 2.1 > --- > > Key: SPARK-17692 > URL: https://issues.apache.org/jira/browse/SPARK-17692 > Project: Spark > Issue Type: Documentation > Components: ML, MLlib >Reporter: Yanbo Liang >Assignee: Yanbo Liang > Labels: 2.1.0 > > This JIRA records behavior changes of ML/MLlib between 2.0 and 2.1, so we can > note those changes (if any) in the user guide's Migration Guide section. If > you found one, please comment below and link the corresponding JIRA here. > *
[jira] [Created] (SPARK-17692) Document ML/MLlib behavior changes in Spark 2.1
Yanbo Liang created SPARK-17692: --- Summary: Document ML/MLlib behavior changes in Spark 2.1 Key: SPARK-17692 URL: https://issues.apache.org/jira/browse/SPARK-17692 Project: Spark Issue Type: Documentation Components: ML, MLlib Reporter: Yanbo Liang Assignee: Yanbo Liang This JIRA keeps a list of MLlib behavior changes in Spark 2.0. So we can remember to add them to the migration guide / release notes.
[jira] [Updated] (SPARK-17692) Document ML/MLlib behavior changes in Spark 2.1
[ https://issues.apache.org/jira/browse/SPARK-17692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-17692: Description: This JIRA keeps a list of MLlib behavior changes in Spark 2.1. So we can remember to add them to the migration guide / release notes. (was: This JIRA keeps a list of MLlib behavior changes in Spark 2.0. So we can remember to add them to the migration guide / release notes.) > Document ML/MLlib behavior changes in Spark 2.1 > --- > > Key: SPARK-17692 > URL: https://issues.apache.org/jira/browse/SPARK-17692 > Project: Spark > Issue Type: Documentation > Components: ML, MLlib >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > This JIRA keeps a list of MLlib behavior changes in Spark 2.1. So we can > remember to add them to the migration guide / release notes.
[jira] [Resolved] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-17428. - Resolution: Done Assignee: Yanbo Liang > SparkR executors/workers support virtualenv > --- > > Key: SPARK-17428 > URL: https://issues.apache.org/jira/browse/SPARK-17428 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > Many users need to use third-party R packages in > executors/workers, but SparkR cannot satisfy this requirement elegantly. > For example, you have to ask the IT/administrators of the cluster to > deploy these R packages on each executor/worker node, which is very > inflexible. > I think we should support third-party R packages for SparkR users as we > do for jar packages in the following two scenarios: > 1. Users can install R packages from CRAN or a custom CRAN-like repository on > each executor. > 2. Users can load their local R packages and install them on each executor. > To achieve this goal, the first thing is to make SparkR executors support > virtualenv-like isolation, as Python does with conda. I have investigated and found that > packrat (http://rstudio.github.io/packrat/) is one of the candidates to > support virtualenv for R. Packrat is a dependency management system for R that > can isolate the dependent R packages in its own private package space. > SparkR users can then install third-party packages in the application > scope (destroyed after the application exits) and don't need to bother > IT/administrators to install these packages manually. > I would like to know whether this makes sense.
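Independent of R and packrat, the application-scoped private package space the proposal describes can be sketched in Python: packages are "installed" into a per-application directory that takes precedence on the search path and is discarded when the application exits. This is a minimal analogy, not SparkR's or packrat's implementation; the directory prefix and the `mypkg` module are purely illustrative.

```python
import os
import sys
import tempfile

# A private, application-scoped package directory (discarded when the
# application exits), analogous to packrat's private library for R.
app_lib = tempfile.mkdtemp(prefix="app-scoped-lib-")

# "Install" a package into the private space. Here we write a trivial
# module by hand; a real system would run an installer targeting this
# directory instead of asking administrators to install system-wide.
with open(os.path.join(app_lib, "mypkg.py"), "w") as f:
    f.write("VERSION = '0.1'\n")

# Make the private space take precedence over system packages.
sys.path.insert(0, app_lib)

import mypkg
print(mypkg.VERSION)  # the application-scoped copy is found first
```

The point of the isolation is the last step: because the private directory is first on the search path, the application resolves its own dependency versions without touching the shared installation.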
[jira] [Resolved] (SPARK-16356) Add testImplicits for ML unit tests and promote toDF()
[ https://issues.apache.org/jira/browse/SPARK-16356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-16356. - Resolution: Fixed Fix Version/s: 2.1.0 > Add testImplicits for ML unit tests and promote toDF() > -- > > Key: SPARK-16356 > URL: https://issues.apache.org/jira/browse/SPARK-16356 > Project: Spark > Issue Type: Test > Components: ML >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.1.0 > > > This was suggested in > https://github.com/apache/spark/commit/101663f1ae222a919fc40510aa4f2bad22d1be6f#commitcomment-17114968 > Currently, implicits such as {{toDF()}} are not available in > {{MLlibTestSparkContext}}. > It might be great if this class has this and {{toDF()}} can be used.
[jira] [Resolved] (SPARK-17281) Add treeAggregateDepth parameter for AFTSurvivalRegression
[ https://issues.apache.org/jira/browse/SPARK-17281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-17281. - Resolution: Fixed Fix Version/s: 2.1.0 > Add treeAggregateDepth parameter for AFTSurvivalRegression > -- > > Key: SPARK-17281 > URL: https://issues.apache.org/jira/browse/SPARK-17281 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Minor > Fix For: 2.1.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Add treeAggregateDepth parameter for AFTSurvivalRegression.
[jira] [Updated] (SPARK-17281) Add treeAggregateDepth parameter for AFTSurvivalRegression
[ https://issues.apache.org/jira/browse/SPARK-17281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-17281: Assignee: Weichen Xu > Add treeAggregateDepth parameter for AFTSurvivalRegression > -- > > Key: SPARK-17281 > URL: https://issues.apache.org/jira/browse/SPARK-17281 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Minor > Fix For: 2.1.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Add treeAggregateDepth parameter for AFTSurvivalRegression.
[jira] [Updated] (SPARK-16356) Add testImplicits for ML unit tests and promote toDF()
[ https://issues.apache.org/jira/browse/SPARK-16356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-16356: Shepherd: Yanbo Liang > Add testImplicits for ML unit tests and promote toDF() > -- > > Key: SPARK-16356 > URL: https://issues.apache.org/jira/browse/SPARK-16356 > Project: Spark > Issue Type: Test > Components: ML >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > > This was suggested in > https://github.com/apache/spark/commit/101663f1ae222a919fc40510aa4f2bad22d1be6f#commitcomment-17114968 > Currently, implicits such as {{toDF()}} are not available in > {{MLlibTestSparkContext}}. > It might be great if this class has this and {{toDF()}} can be used.
[jira] [Updated] (SPARK-16356) Add testImplicits for ML unit tests and promote toDF()
[ https://issues.apache.org/jira/browse/SPARK-16356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-16356: Assignee: Hyukjin Kwon > Add testImplicits for ML unit tests and promote toDF() > -- > > Key: SPARK-16356 > URL: https://issues.apache.org/jira/browse/SPARK-16356 > Project: Spark > Issue Type: Test > Components: ML >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > > This was suggested in > https://github.com/apache/spark/commit/101663f1ae222a919fc40510aa4f2bad22d1be6f#commitcomment-17114968 > Currently, implicits such as {{toDF()}} are not available in > {{MLlibTestSparkContext}}. > It might be great if this class has this and {{toDF()}} can be used.
[jira] [Comment Edited] (SPARK-14709) spark.ml API for linear SVM
[ https://issues.apache.org/jira/browse/SPARK-14709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15512085#comment-15512085 ] Yanbo Liang edited comment on SPARK-14709 at 9/22/16 4:27 AM: -- [~yuhaoyan] Any update about this? I think providing DataFrame-based SVM algorithm is very important to users, so it's better we can get it in ASAP. I'd like to get in the implementation with OWLQN and Hinge loss firstly, and to discuss SMO version later. Like [~mlnick] said, it's better to get more performance number and user case of SMO impl. And it's not very hard to add a new internal implementation after we have the basic SVM API. I saw you have an implementation with OWLQN and Hinge loss already, could you send the PR? If you are busy with other things, I can help and you are still the primary author of this PR. Thanks! was (Author: yanboliang): [~yuhaoyan] Any update about this? I think providing DataFrame-based SVM algorithm is very important to users, so it's better we can get it in ASAP. I'd like to get in the implementation with OWLQN and Hinge loss firstly, and to discuss SMO version later. Like [~mlnick] said, it's better to get more performance number and user case of SMO impl. And it's not very hard to add a new internal implementation after we have the basic SVM API. I saw you have a implementation with OWLQN and Hinge loss already, could you send the PR? If you are busy with other things, I can help and you are still the primary author of this PR. Thanks! > spark.ml API for linear SVM > --- > > Key: SPARK-14709 > URL: https://issues.apache.org/jira/browse/SPARK-14709 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley > > Provide API for SVM algorithm for DataFrames. I would recommend using > OWL-QN, rather than wrapping spark.mllib's SGD-based implementation. > The API should mimic existing spark.ml.classification APIs. 
[jira] [Commented] (SPARK-14709) spark.ml API for linear SVM
[ https://issues.apache.org/jira/browse/SPARK-14709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15512085#comment-15512085 ] Yanbo Liang commented on SPARK-14709: - [~yuhaoyan] Any update on this? I think providing a DataFrame-based SVM algorithm is very important to users, so it's better to get it in ASAP. I'd like to get the implementation with OWLQN and hinge loss in first, and discuss the SMO version later. As [~mlnick] said, it's better to get more performance numbers and use cases for the SMO impl. And it's not very hard to add a new internal implementation after we have the basic SVM API. I saw you already have an implementation with OWLQN and hinge loss; could you send the PR? If you are busy with other things, I can help, and you will still be the primary author of this PR. Thanks! > spark.ml API for linear SVM > --- > > Key: SPARK-14709 > URL: https://issues.apache.org/jira/browse/SPARK-14709 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley > > Provide API for SVM algorithm for DataFrames. I would recommend using > OWL-QN, rather than wrapping spark.mllib's SGD-based implementation. > The API should mimic existing spark.ml.classification APIs.
[jira] [Resolved] (SPARK-17577) SparkR support add files to Spark job and get by executors
[ https://issues.apache.org/jira/browse/SPARK-17577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-17577. - Resolution: Fixed Assignee: Yanbo Liang Fix Version/s: 2.1.0 > SparkR support add files to Spark job and get by executors > -- > > Key: SPARK-17577 > URL: https://issues.apache.org/jira/browse/SPARK-17577 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Yanbo Liang >Assignee: Yanbo Liang > Fix For: 2.1.0 > > > Scala/Python users can add files to Spark job by submit options {{--files}} > or {{SparkContext.addFile()}}. Meanwhile, users can get the added file by > {{SparkFiles.get(filename)}}. > We should also support this function for SparkR users, since they also have > the requirements for some shared dependency files. For example, SparkR users > can download third party R packages to driver firstly, add these files to the > Spark job as dependency by this API and then each executor can install these > packages by {{install.packages}}.
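The add-then-get pattern described in this issue ({{--files}} / {{SparkContext.addFile()}} on the driver side, {{SparkFiles.get(filename)}} on the executor side) can be sketched without Spark: the driver copies dependency files into a staging directory, and workers resolve bare filenames against it. The `FileRegistry` class below is a hypothetical stand-in for illustration, not Spark's implementation.

```python
import os
import shutil
import tempfile

class FileRegistry:
    """Hypothetical stand-in for SparkContext.addFile / SparkFiles.get:
    the driver stages files in a shared directory; workers look them
    up by bare filename."""

    def __init__(self):
        self.root = tempfile.mkdtemp(prefix="staged-files-")

    def add_file(self, path):
        # Driver side: copy the dependency into the staging area.
        shutil.copy(path, os.path.join(self.root, os.path.basename(path)))

    def get(self, filename):
        # Worker side: resolve the bare filename to a local path.
        local = os.path.join(self.root, filename)
        if not os.path.exists(local):
            raise FileNotFoundError(filename)
        return local

# The driver adds a dependency file...
registry = FileRegistry()
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("shared dependency")
    src = f.name
registry.add_file(src)

# ...and a worker later retrieves it by name only.
local_path = registry.get(os.path.basename(src))
print(open(local_path).read())  # -> shared dependency
```

The key property this mirrors is that executors never need the driver's original path: the bare filename is the contract, which is what makes `SparkFiles.get(filename)` usable from any worker.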
[jira] [Assigned] (SPARK-17585) PySpark SparkContext.addFile supports adding files recursively
[ https://issues.apache.org/jira/browse/SPARK-17585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang reassigned SPARK-17585: --- Assignee: Yanbo Liang > PySpark SparkContext.addFile supports adding files recursively > -- > > Key: SPARK-17585 > URL: https://issues.apache.org/jira/browse/SPARK-17585 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Minor > Fix For: 2.1.0 > > > Users would like to add a directory as dependency in some cases, they can use > {{SparkContext.addFile}} with argument {{recursive=true}} to recursively add > all files under the directory by using Scala. But Python users can only add > file not directory, we should also make it supported.
[jira] [Resolved] (SPARK-17585) PySpark SparkContext.addFile supports adding files recursively
[ https://issues.apache.org/jira/browse/SPARK-17585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-17585. - Resolution: Fixed Fix Version/s: 2.1.0 > PySpark SparkContext.addFile supports adding files recursively > -- > > Key: SPARK-17585 > URL: https://issues.apache.org/jira/browse/SPARK-17585 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Minor > Fix For: 2.1.0 > > > Users would like to add a directory as dependency in some cases, they can use > {{SparkContext.addFile}} with argument {{recursive=true}} to recursively add > all files under the directory by using Scala. But Python users can only add > file not directory, we should also make it supported.
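What {{recursive=true}} buys can be sketched independently of Spark: instead of a single file, every file under a directory tree is enumerated and distributed. The walk below is a plain-Python illustration of that enumeration, not PySpark's code; `collect_files` is a hypothetical helper.

```python
import os
import tempfile

def collect_files(path, recursive=False):
    """Return the files an addFile-style call would distribute: a single
    file, or (with recursive=True) every file under a directory tree.
    Illustrative only, not PySpark's implementation."""
    if os.path.isfile(path):
        return [path]
    if not recursive:
        # Mirrors the pre-2.1 PySpark limitation: directories rejected.
        raise ValueError("directories require recursive=True")
    found = []
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return sorted(found)

# Build a small directory tree and enumerate it.
root = tempfile.mkdtemp(prefix="deps-")
os.makedirs(os.path.join(root, "sub"))
for rel in ("a.txt", os.path.join("sub", "b.txt")):
    with open(os.path.join(root, rel), "w") as f:
        f.write("x")

print(len(collect_files(root, recursive=True)))  # -> 2
```

Without the flag, the directory is rejected outright, which matches the behavior the issue describes for Python users before this change.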
[jira] [Commented] (SPARK-17588) java.lang.AssertionError: assertion failed: lapack.dppsv returned 105. when running glm using gaussian link function.
[ https://issues.apache.org/jira/browse/SPARK-17588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509001#comment-15509001 ] Yanbo Liang commented on SPARK-17588: - [~sowen] See my comments at SPARK-11918. Thanks. > java.lang.AssertionError: assertion failed: lapack.dppsv returned 105. when > running glm using gaussian link function. > - > > Key: SPARK-17588 > URL: https://issues.apache.org/jira/browse/SPARK-17588 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Affects Versions: 2.0.0 >Reporter: sai pavan kumar chitti >Assignee: Sean Owen >Priority: Minor > > Hi, > I am getting a java.lang.AssertionError when running glm, using the gaussian > link function, on a dataset with 109 columns and 81318461 rows. > Below is the call trace. Can someone please tell me what the issue is > related to and how to go about resolving it? Is it because native > acceleration is not working, as I am also seeing the following warning > messages? > WARN netlib.BLAS: Failed to load implementation from: > com.github.fommil.netlib.NativeRefBLAS > WARN netlib.LAPACK: Failed to load implementation from: > com.github.fommil.netlib.NativeSystemLAPACK > WARN netlib.LAPACK: Failed to load implementation from: > com.github.fommil.netlib.NativeRefLAPACK > 16/09/17 13:08:13 ERROR r.RBackendHandler: fit on > org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper failed > Error in invokeJava(isStatic = TRUE, className, methodName, ...) : > java.lang.AssertionError: assertion failed: lapack.dppsv returned 105.
> at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:40) > at > org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:140) > at > org.apache.spark.ml.regression.GeneralizedLinearRegression.train(GeneralizedLinearRegression.scala:265) > at > org.apache.spark.ml.regression.GeneralizedLinearRegression.train(GeneralizedLinearRegression.scala:139) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > at > org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149) > at > org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:145) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.sc > thanks, > pavan.
[jira] [Commented] (SPARK-11918) WLS can not resolve some kinds of equation
[ https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15508996#comment-15508996 ] Yanbo Liang commented on SPARK-11918: - Cholesky decomposition is unstable (for near-singular and rank-deficient matrices), but it is often used when the matrix A is very large and sparse because of its faster calculation. QR decomposition is more stable than Cholesky; I think we should switch to it in the future. I will take a look at this issue. As a temporary fix, I think throwing a better exception to let users know the failure cause is OK. Thanks. > WLS can not resolve some kinds of equation > -- > > Key: SPARK-11918 > URL: https://issues.apache.org/jira/browse/SPARK-11918 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang >Priority: Minor > Labels: starter > Attachments: R_GLM_output > > > Weighted Least Squares (WLS) is one of the optimization methods for solving > Linear Regression (when #features < 4096). But if the dataset is very > ill-conditioned (such as a 0-1 label used for classification where the system > is underdetermined), WLS fails (while "l-bfgs" can still train and produce the > model). The failure is caused by the underlying LAPACK library returning an > error value from the Cholesky decomposition. > This issue is easy to reproduce: train a LinearRegressionModel with the > "normal" solver on the example > dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt). > The following is the exception: > {code} > assertion failed: lapack.dpotrs returned 1. > java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
> at scala.Predef$.assert(Predef.scala:179) > at > org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42) > at > org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117) > at > org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180) > at > org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > {code}
[jira] [Updated] (SPARK-17585) PySpark SparkContext.addFile supports adding files recursively
[ https://issues.apache.org/jira/browse/SPARK-17585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-17585: Description: Users would like to add a directory as dependency in some cases, they can use {{SparkContext.addFile}} with argument {{recursive=true}} to recursively add all files under the directory by using Scala. But Python users can only add file not directory, we should also make it supported. (was: PySpark {{SparkContext.addFile}} should support adding files recursively under a directory similar with Scala. Users would like to add a directory as dependency in some cases, they can use {{SparkContext.addFile}} with argument {{recursive=true}} to recursively add all files under the directory by using Scala. But Python users can only add file not directory, we should also make it supported.) > PySpark SparkContext.addFile supports adding files recursively > -- > > Key: SPARK-17585 > URL: https://issues.apache.org/jira/browse/SPARK-17585 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Reporter: Yanbo Liang >Priority: Minor > > Users would like to add a directory as dependency in some cases, they can use > {{SparkContext.addFile}} with argument {{recursive=true}} to recursively add > all files under the directory by using Scala. But Python users can only add > file not directory, we should also make it supported.
[jira] [Updated] (SPARK-17585) PySpark SparkContext.addFile supports adding files recursively
[ https://issues.apache.org/jira/browse/SPARK-17585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-17585: Description: PySpark {{SparkContext.addFile}} should support adding files recursively under a directory similar with Scala. Users would like to add a directory as dependency in some cases, they can use {{SparkContext.addFile}} with argument {{recursive=true}} to recursively add all files under the directory by using Scala. But Python users can only add file not directory, we should also make it supported. was:PySpark {{SparkContext.addFile}} should support adding files recursively under a directory similar with Scala. > PySpark SparkContext.addFile supports adding files recursively > -- > > Key: SPARK-17585 > URL: https://issues.apache.org/jira/browse/SPARK-17585 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Reporter: Yanbo Liang >Priority: Minor > > PySpark {{SparkContext.addFile}} should support adding files recursively > under a directory similar with Scala. > Users would like to add a directory as dependency in some cases, they can use > {{SparkContext.addFile}} with argument {{recursive=true}} to recursively add > all files under the directory by using Scala. But Python users can only add > file not directory, we should also make it supported.
[jira] [Updated] (SPARK-17577) SparkR support add files to Spark job and get by executors
[ https://issues.apache.org/jira/browse/SPARK-17577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-17577: Description: Scala/Python users can add files to Spark job by submit options {{--files}} or {{SparkContext.addFile()}}. Meanwhile, users can get the added file by {{SparkFiles.get(filename)}}. We should also support this function for SparkR users, since they also have the requirements for some shared dependency files. For example, SparkR users can download third party R packages to driver firstly, add these files to the Spark job as dependency by this API and then each executor can install these packages by {{install.packages}}. was: Scala/Python users can add files to Spark job by submit options {{--files}} or {{SparkContext.addFile()}}. Meanwhile, users can get the added file by {{SparkFiles.get(filename)}}. We should also support this function for SparkR users, since SparkR users should can use shared files for each executors. For examples, SparkR users can download third party R packages to driver firstly, add these files to the Spark job by this API and then each executor can install these packages by {{install.packages}}. > SparkR support add files to Spark job and get by executors > -- > > Key: SPARK-17577 > URL: https://issues.apache.org/jira/browse/SPARK-17577 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Yanbo Liang > > Scala/Python users can add files to Spark job by submit options {{--files}} > or {{SparkContext.addFile()}}. Meanwhile, users can get the added file by > {{SparkFiles.get(filename)}}. > We should also support this function for SparkR users, since they also have > the requirements for some shared dependency files. For example, SparkR users > can download third party R packages to driver firstly, add these files to the > Spark job as dependency by this API and then each executor can install these > packages by {{install.packages}}. 
[jira] [Updated] (SPARK-17577) SparkR support add files to Spark job and get by executors
[ https://issues.apache.org/jira/browse/SPARK-17577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-17577: Description: Scala/Python users can add files to Spark job by submit options {{--files}} or {{SparkContext.addFile()}}. Meanwhile, users can get the added file by {{SparkFiles.get(filename)}}. We should also support this function for SparkR users, since SparkR users should can use shared files for each executors. For examples, SparkR users can download third party R packages to driver firstly, add these files to the Spark job by this API and then each executor can install these packages by {{install.packages}}. was: Scala/Python users can add files to Spark job by submit options {{--files}} or {{SparkContext.addFile()}}. Meanwhile, users can get the added file by {{SparkFiles.get(filename)}}. We should also support this function for SparkR users, since SparkR users may install third party R packages on each executors. For examples, SparkR users can download third party R packages to driver firstly, add these files to the Spark job by this API and each executor can install these packages by {{install.packages}}. > SparkR support add files to Spark job and get by executors > -- > > Key: SPARK-17577 > URL: https://issues.apache.org/jira/browse/SPARK-17577 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Yanbo Liang > > Scala/Python users can add files to Spark job by submit options {{--files}} > or {{SparkContext.addFile()}}. Meanwhile, users can get the added file by > {{SparkFiles.get(filename)}}. > We should also support this function for SparkR users, since SparkR users > should can use shared files for each executors. For examples, SparkR users > can download third party R packages to driver firstly, add these files to the > Spark job by this API and then each executor can install these packages by > {{install.packages}}. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
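The addFile/SparkFiles.get workflow the issue refers to can be sketched without a cluster. The snippet below is a hypothetical, pure-Python stand-in (the `FileRegistry` class is invented for illustration and is not part of Spark): the driver registers a file once, and each simulated executor resolves it by name to a local path before use.

```python
import os
import shutil
import tempfile

class FileRegistry:
    """Hypothetical stand-in for SparkContext.addFile / SparkFiles.get.

    The driver calls add_file() once; each simulated executor calls
    get() to resolve a file name to its local copy.
    """

    def __init__(self):
        # Directory that plays the role of the per-executor staging area.
        self._staging = tempfile.mkdtemp(prefix="sparkfiles-")
        self._files = {}

    def add_file(self, path):
        # Ship the file to the staging area, keyed by its base name.
        name = os.path.basename(path)
        self._files[name] = shutil.copy(path, self._staging)

    def get(self, name):
        # Executors look files up by name, as SparkFiles.get(filename) does.
        return self._files[name]

# Driver side: create and register a file (e.g. an R package archive).
src = tempfile.NamedTemporaryFile(suffix=".tar.gz", delete=False)
src.write(b"fake package payload")
src.close()

registry = FileRegistry()
registry.add_file(src.name)

# Executor side: resolve the shipped file and read it locally.
local_path = registry.get(os.path.basename(src.name))
with open(local_path, "rb") as f:
    payload = f.read()
```

The point of the pattern is that executors never need the driver's path, only the file name; the framework resolves it to whatever local copy was staged on that node.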
[jira] [Updated] (SPARK-17585) PySpark SparkContext.addFile supports adding files recursively
[ https://issues.apache.org/jira/browse/SPARK-17585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-17585: Component/s: Spark Core > PySpark SparkContext.addFile supports adding files recursively > -- > > Key: SPARK-17585 > URL: https://issues.apache.org/jira/browse/SPARK-17585 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core > Reporter: Yanbo Liang > Priority: Minor > > PySpark {{SparkContext.addFile}} should support adding files recursively under a directory, as the Scala API already does.
[jira] [Created] (SPARK-17585) PySpark SparkContext.addFile supports adding files recursively
Yanbo Liang created SPARK-17585: --- Summary: PySpark SparkContext.addFile supports adding files recursively Key: SPARK-17585 URL: https://issues.apache.org/jira/browse/SPARK-17585 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Yanbo Liang Priority: Minor PySpark {{SparkContext.addFile}} should support adding files recursively under a directory, as the Scala API already does.
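The recursive behavior being requested can be illustrated in plain Python: with a `recursive=True` flag, every regular file below the given directory would be shipped, which is essentially a walk of the tree. The helper below is a hypothetical sketch of that enumeration step only, not Spark code.

```python
import os
import tempfile

def list_files_recursively(root):
    """Enumerate all regular files under `root` -- the set of files an
    addFile(path, recursive=True) call would be expected to ship."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return sorted(found)

# Build a small directory tree to walk.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "sub"))
for rel in ("a.txt", os.path.join("sub", "b.txt")):
    with open(os.path.join(root, rel), "w") as f:
        f.write("data")

files = list_files_recursively(root)
```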
[jira] [Created] (SPARK-17577) SparkR support add files to Spark job and get by executors
Yanbo Liang created SPARK-17577: --- Summary: SparkR support add files to Spark job and get by executors Key: SPARK-17577 URL: https://issues.apache.org/jira/browse/SPARK-17577 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Yanbo Liang Scala/Python users can add files to a Spark job via the {{--files}} submit option or {{SparkContext.addFile()}}, and can then retrieve an added file with {{SparkFiles.get(filename)}}. We should also support this for SparkR users, since SparkR users may install third-party R packages on each executor. For example, SparkR users can first download third-party R packages to the driver, add those files to the Spark job through this API, and each executor can install the packages with {{install.packages}}.
[jira] [Commented] (SPARK-17471) Add compressed method for Matrix class
[ https://issues.apache.org/jira/browse/SPARK-17471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487035#comment-15487035 ] Yanbo Liang commented on SPARK-17471: - [~sethah] I'm sorry, I have some urgent matters to deal with these days, so please feel free to take over this task. Thanks! > Add compressed method for Matrix class > -- > > Key: SPARK-17471 > URL: https://issues.apache.org/jira/browse/SPARK-17471 > Project: Spark > Issue Type: New Feature > Components: ML > Reporter: Seth Hendrickson > > Vectors in Spark have a {{compressed}} method which selects either a sparse or dense representation by minimizing storage requirements. Matrices should also have this method, which is now explicitly needed in {{LogisticRegression}} since we have implemented multiclass regression. > The compressed method should also offer the option to store row-major or column-major, and if nothing is specified it should select the representation with the lower storage cost (for sparse).
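The storage trade-off behind a {{compressed}} method can be sketched with back-of-the-envelope byte counts. The formulas below are illustrative approximations (8-byte double values, 4-byte indices and column pointers, CSC layout), not Spark's actual accounting:

```python
def dense_bytes(num_rows, num_cols):
    # A dense matrix stores every entry as an 8-byte double.
    return 8 * num_rows * num_cols

def sparse_csc_bytes(num_rows, num_cols, nnz):
    # CSC layout: 8 bytes per stored value, 4 bytes per row index,
    # and 4 bytes per column pointer (num_cols + 1 of them).
    return 8 * nnz + 4 * nnz + 4 * (num_cols + 1)

def pick_representation(num_rows, num_cols, nnz):
    # A compressed() method would pick whichever layout is smaller.
    if sparse_csc_bytes(num_rows, num_cols, nnz) < dense_bytes(num_rows, num_cols):
        return "sparse"
    return "dense"

# A 1000x1000 matrix with 1% nonzeros is far cheaper stored sparse...
mostly_zero = pick_representation(1000, 1000, 10_000)
# ...while a nearly full matrix is cheaper stored dense.
mostly_full = pick_representation(1000, 1000, 999_000)
```

With these constants the break-even point is roughly two-thirds density: sparse storage costs about 12 bytes per nonzero versus 8 bytes per entry dense, which is why multinomial logistic regression coefficient matrices (often very sparse after L1 regularization) benefit from compression.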
[jira] [Comment Edited] (SPARK-17471) Add compressed method for Matrix class
[ https://issues.apache.org/jira/browse/SPARK-17471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15477376#comment-15477376 ] Yanbo Liang edited comment on SPARK-17471 at 9/9/16 3:46 PM: - [~sethah] I think this task duplicates SPARK-17137, which will add compressed support for multinomial logistic regression coefficients. I'm working on that one and have some {{Matrix}} compression performance test results. I will post them here for discussion as soon as possible. Thanks! was (Author: yanboliang): [~sethah] I think this task duplicates SPARK-17137, which will add compressed support for multinomial logistic regression coefficients. I'm working on that one and have some {{Matrix}} compression performance test result. I will post them here for discussion as soon as possible. Thanks! > Add compressed method for Matrix class > -- > > Key: SPARK-17471 > URL: https://issues.apache.org/jira/browse/SPARK-17471 > Project: Spark > Issue Type: New Feature > Components: ML > Reporter: Seth Hendrickson > > Vectors in Spark have a {{compressed}} method which selects either a sparse or dense representation by minimizing storage requirements. Matrices should also have this method, which is now explicitly needed in {{LogisticRegression}} since we have implemented multiclass regression. > The compressed method should also offer the option to store row-major or column-major, and if nothing is specified it should select the representation with the lower storage cost (for sparse).
[jira] [Commented] (SPARK-17471) Add compressed method for Matrix class
[ https://issues.apache.org/jira/browse/SPARK-17471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15477376#comment-15477376 ] Yanbo Liang commented on SPARK-17471: - [~sethah] I think this task duplicates SPARK-17137, which will add compressed support for multinomial logistic regression coefficients. I'm working on that one and have some {{Matrix}} compression performance test result. I will post them here for discussion as soon as possible. Thanks! > Add compressed method for Matrix class > -- > > Key: SPARK-17471 > URL: https://issues.apache.org/jira/browse/SPARK-17471 > Project: Spark > Issue Type: New Feature > Components: ML > Reporter: Seth Hendrickson > > Vectors in Spark have a {{compressed}} method which selects either a sparse or dense representation by minimizing storage requirements. Matrices should also have this method, which is now explicitly needed in {{LogisticRegression}} since we have implemented multiclass regression. > The compressed method should also offer the option to store row-major or column-major, and if nothing is specified it should select the representation with the lower storage cost (for sparse).
[jira] [Comment Edited] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15477150#comment-15477150 ] Yanbo Liang edited comment on SPARK-17428 at 9/9/16 2:14 PM: - Yeah, I agree to start with something simple and iterate later. I will do some experiments to verify whether it works well for my use case. Thanks for all your help! [~shivaram] [~sunrui] [~felixcheung] [~zjffdu] was (Author: yanboliang): Yeah, I agree to start with something simple and iterate later. I will do some experiments to verify whether it works well for the my use case. Thanks for all your help! [~shivaram] [~sunrui] [~felixcheung] [~zjffdu] > SparkR executors/workers support virtualenv > --- > > Key: SPARK-17428 > URL: https://issues.apache.org/jira/browse/SPARK-17428 > Project: Spark > Issue Type: New Feature > Components: SparkR > Reporter: Yanbo Liang > > Many users need to use third-party R packages on executors/workers, but SparkR cannot satisfy this requirement elegantly. For example, you have to work with the cluster's IT/administrators to deploy these R packages on each executor/worker node, which is very inflexible. > I think we should support third-party R packages for SparkR users, as we do for jar packages, in the following two scenarios: > 1, Users can install R packages from CRAN or a custom CRAN-like repository on each executor. > 2, Users can load their local R packages and install them on each executor. > To achieve this goal, the first thing is to make SparkR executors support a virtualenv-like mechanism, as Python does with conda. I have investigated and found that packrat (http://rstudio.github.io/packrat/) is one of the candidates for supporting virtualenv for R. Packrat is a dependency management system for R that can isolate the dependent R packages in its own private package space. Then SparkR users can install third-party packages in the application scope (destroyed after the application exits) and don't need to bother IT/administrators to install these packages manually. > I would like to know whether this makes sense.
[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15477150#comment-15477150 ] Yanbo Liang commented on SPARK-17428: - Yeah, I agree to start with something simple and iterate later. I will do some experiments to verify whether it works well for the my use case. Thanks for all your help! [~shivaram] [~sunrui] [~felixcheung] [~zjffdu] > SparkR executors/workers support virtualenv > --- > > Key: SPARK-17428 > URL: https://issues.apache.org/jira/browse/SPARK-17428 > Project: Spark > Issue Type: New Feature > Components: SparkR > Reporter: Yanbo Liang > > Many users need to use third-party R packages on executors/workers, but SparkR cannot satisfy this requirement elegantly. For example, you have to work with the cluster's IT/administrators to deploy these R packages on each executor/worker node, which is very inflexible. > I think we should support third-party R packages for SparkR users, as we do for jar packages, in the following two scenarios: > 1, Users can install R packages from CRAN or a custom CRAN-like repository on each executor. > 2, Users can load their local R packages and install them on each executor. > To achieve this goal, the first thing is to make SparkR executors support a virtualenv-like mechanism, as Python does with conda. I have investigated and found that packrat (http://rstudio.github.io/packrat/) is one of the candidates for supporting virtualenv for R. Packrat is a dependency management system for R that can isolate the dependent R packages in its own private package space. Then SparkR users can install third-party packages in the application scope (destroyed after the application exits) and don't need to bother IT/administrators to install these packages manually. > I would like to know whether this makes sense.
[jira] [Resolved] (SPARK-17464) SparkR spark.als arguments reg should be 0.1 by default
[ https://issues.apache.org/jira/browse/SPARK-17464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-17464. - Resolution: Fixed Assignee: Yanbo Liang Fix Version/s: 2.1.0 > SparkR spark.als arguments reg should be 0.1 by default > --- > > Key: SPARK-17464 > URL: https://issues.apache.org/jira/browse/SPARK-17464 > Project: Spark > Issue Type: Bug > Components: ML, SparkR > Reporter: Yanbo Liang > Assignee: Yanbo Liang > Priority: Minor > Fix For: 2.1.0 > > > The SparkR spark.als argument {{reg}} should default to 0.1, to be consistent with ML.
[jira] [Resolved] (SPARK-17456) Utility for parsing Spark versions
[ https://issues.apache.org/jira/browse/SPARK-17456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-17456. - Resolution: Fixed Fix Version/s: 2.1.0 > Utility for parsing Spark versions > -- > > Key: SPARK-17456 > URL: https://issues.apache.org/jira/browse/SPARK-17456 > Project: Spark > Issue Type: New Feature > Components: Spark Core > Reporter: Joseph K. Bradley > Assignee: Joseph K. Bradley > Priority: Minor > Fix For: 2.1.0 > > > There are many hacks within Spark's codebase to identify and compare Spark > versions. We should add a simple utility to standardize these code paths, > especially since there have been mistakes made in the past. This will let us > add unit tests as well. This initial patch will only add methods for > extracting major and minor versions as Int types in Scala.
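A minimal sketch of the kind of utility described, extracting major and minor versions as ints from a version string. The actual patch is Scala; the Python version and its regex below are assumptions for illustration only:

```python
import re

# Accepts "2.1.0", "1.6.3", or a bare "2.1"; the regex is an assumption.
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)(\..*)?$")

def major_minor_version(spark_version):
    """Return (major, minor) as ints from a string like '2.1.0'.

    Raises ValueError on unparseable input so callers fail loudly
    instead of silently mis-comparing versions.
    """
    m = _VERSION_RE.match(spark_version)
    if m is None:
        raise ValueError("unparseable Spark version: %r" % spark_version)
    return int(m.group(1)), int(m.group(2))

# Tuples compare lexicographically, so version gates read naturally:
supports_feature = major_minor_version("2.1.0") >= (2, 0)
```

Returning a tuple rather than a formatted string is the key design point: tuple comparison gives correct ordering ((2, 10) > (2, 9)) where string comparison would not.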
[jira] [Created] (SPARK-17464) SparkR spark.als arguments reg should be 0.1 by default
Yanbo Liang created SPARK-17464: --- Summary: SparkR spark.als arguments reg should be 0.1 by default Key: SPARK-17464 URL: https://issues.apache.org/jira/browse/SPARK-17464 Project: Spark Issue Type: Bug Components: ML, SparkR Reporter: Yanbo Liang Priority: Minor The SparkR spark.als argument {{reg}} should default to 0.1, to be consistent with ML.
[jira] [Comment Edited] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15473643#comment-15473643 ] Yanbo Liang edited comment on SPARK-17428 at 9/8/16 11:46 AM: -- [~sunrui] [~shivaram] [~felixcheung] Thanks for your replies. Yes, we can compile packages on the driver and send them to the executors, but that involves some issues: * Usually the Spark job is not run as root, yet we need root privileges to install R packages on the executors, so installation is not permitted. * After we run a SparkR job, the executors' R libraries will be polluted, and another job running on that executor may fail due to a conflict. * The architectures of the driver and the executors may differ, so packages compiled on the driver may not work when sent to the executors if they depend on architecture-specific code. SparkR cannot currently solve these issues. I investigated and found packrat can help us in this direction, but it may need more experiments and study to verify. If this proposal makes sense, I can work on this feature. Please feel free to let me know any concerns. Thanks! was (Author: yanboliang): [~sunrui] [~shivaram] [~felixcheung] Thanks for your replies. Yes, we can compile packages on the driver and send them to the executors, but that involves some issues: * Usually the Spark job is not run as root, yet we need root privileges to install R packages on the executors, so installation is not permitted. * After we run a SparkR job, the executors' R libraries will be polluted, and another job running on that executor may fail due to a conflict. * The architectures of the driver and the executors may differ, so packages compiled on the driver may not work when sent to the executors if they depend on architecture-specific code. SparkR cannot currently solve these issues. I investigated and found packrat can help us in this direction, but it may need more experiments and study. If this proposal makes sense, I can work on this feature. Please feel free to let me know any concerns. Thanks! > SparkR executors/workers support virtualenv > --- > > Key: SPARK-17428 > URL: https://issues.apache.org/jira/browse/SPARK-17428 > Project: Spark > Issue Type: New Feature > Components: SparkR > Reporter: Yanbo Liang > > Many users need to use third-party R packages on executors/workers, but SparkR cannot satisfy this requirement elegantly. For example, you have to work with the cluster's IT/administrators to deploy these R packages on each executor/worker node, which is very inflexible. > I think we should support third-party R packages for SparkR users, as we do for jar packages, in the following two scenarios: > 1, Users can install R packages from CRAN or a custom CRAN-like repository on each executor. > 2, Users can load their local R packages and install them on each executor. > To achieve this goal, the first thing is to make SparkR executors support a virtualenv-like mechanism, as Python does with conda. I have investigated and found that packrat (http://rstudio.github.io/packrat/) is one of the candidates for supporting virtualenv for R. Packrat is a dependency management system for R that can isolate the dependent R packages in its own private package space. Then SparkR users can install third-party packages in the application scope (destroyed after the application exits) and don't need to bother IT/administrators to install these packages manually. > I would like to know whether this makes sense.
[jira] [Comment Edited] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15473643#comment-15473643 ] Yanbo Liang edited comment on SPARK-17428 at 9/8/16 11:45 AM: -- [~sunrui] [~shivaram] [~felixcheung] Thanks for your replies. Yes, we can compile packages on the driver and send them to the executors, but that involves some issues: * Usually the Spark job is not run as root, yet we need root privileges to install R packages on the executors, so installation is not permitted. * After we run a SparkR job, the executors' R libraries will be polluted, and another job running on that executor may fail due to a conflict. * The architectures of the driver and the executors may differ, so packages compiled on the driver may not work when sent to the executors if they depend on architecture-specific code. SparkR cannot currently solve these issues. I investigated and found packrat can help us in this direction, but it may need more experiments and study. If this proposal makes sense, I can work on this feature. Please feel free to let me know any concerns. Thanks! was (Author: yanboliang): [~sunrui] [~shivaram] [~felixcheung] Thanks for your replies. Yes, we can compile packages on the driver and send them to the executors, but that involves some issues: * Usually the Spark job is not run as root, yet we need root privileges to install R packages on the executors, so installation is not permitted. * After we run a SparkR job, the executors' R libraries will be polluted, and another job running on that executor may fail due to a conflict. * The architectures of the driver and the executors may differ, so packages compiled on the driver may not work when sent to the executors if they depend on architecture-specific code. SparkR cannot currently solve these issues. I investigated and found packrat can help us in this direction, but it may need more experiments. If this proposal makes sense, I can work on this feature. Please feel free to let me know any concerns. Thanks! > SparkR executors/workers support virtualenv > --- > > Key: SPARK-17428 > URL: https://issues.apache.org/jira/browse/SPARK-17428 > Project: Spark > Issue Type: New Feature > Components: SparkR > Reporter: Yanbo Liang > > Many users need to use third-party R packages on executors/workers, but SparkR cannot satisfy this requirement elegantly. For example, you have to work with the cluster's IT/administrators to deploy these R packages on each executor/worker node, which is very inflexible. > I think we should support third-party R packages for SparkR users, as we do for jar packages, in the following two scenarios: > 1, Users can install R packages from CRAN or a custom CRAN-like repository on each executor. > 2, Users can load their local R packages and install them on each executor. > To achieve this goal, the first thing is to make SparkR executors support a virtualenv-like mechanism, as Python does with conda. I have investigated and found that packrat (http://rstudio.github.io/packrat/) is one of the candidates for supporting virtualenv for R. Packrat is a dependency management system for R that can isolate the dependent R packages in its own private package space. Then SparkR users can install third-party packages in the application scope (destroyed after the application exits) and don't need to bother IT/administrators to install these packages manually. > I would like to know whether this makes sense.
[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15473643#comment-15473643 ] Yanbo Liang commented on SPARK-17428: - [~sunrui] [~shivaram] [~felixcheung] Thanks for your replies. Yes, we can compile packages on the driver and send them to the executors, but that involves some issues: * Usually the Spark job is not run as root, yet we need root privileges to install R packages on the executors, so installation is not permitted. * After we run a SparkR job, the executors' R libraries will be polluted, and another job running on that executor may fail due to a conflict. * The architectures of the driver and the executors may differ, so packages compiled on the driver may not work when sent to the executors if they depend on architecture-specific code. SparkR cannot currently solve these issues. I investigated and found packrat can help us in this direction, but it may need more experiments. If this proposal makes sense, I can work on this feature. Please feel free to let me know any concerns. Thanks! > SparkR executors/workers support virtualenv > --- > > Key: SPARK-17428 > URL: https://issues.apache.org/jira/browse/SPARK-17428 > Project: Spark > Issue Type: New Feature > Components: SparkR > Reporter: Yanbo Liang > > Many users need to use third-party R packages on executors/workers, but SparkR cannot satisfy this requirement elegantly. For example, you have to work with the cluster's IT/administrators to deploy these R packages on each executor/worker node, which is very inflexible. > I think we should support third-party R packages for SparkR users, as we do for jar packages, in the following two scenarios: > 1, Users can install R packages from CRAN or a custom CRAN-like repository on each executor. > 2, Users can load their local R packages and install them on each executor. > To achieve this goal, the first thing is to make SparkR executors support a virtualenv-like mechanism, as Python does with conda. I have investigated and found that packrat (http://rstudio.github.io/packrat/) is one of the candidates for supporting virtualenv for R. Packrat is a dependency management system for R that can isolate the dependent R packages in its own private package space. Then SparkR users can install third-party packages in the application scope (destroyed after the application exits) and don't need to bother IT/administrators to install these packages manually. > I would like to know whether this makes sense.
[jira] [Comment Edited] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15469736#comment-15469736 ] Yanbo Liang edited comment on SPARK-17428 at 9/7/16 6:40 AM: - cc [~shivaram] [~felixcheung] [~sunrui] was (Author: yanboliang): cc [~shivaram] [~felixcheung] > SparkR executors/workers support virtualenv > --- > > Key: SPARK-17428 > URL: https://issues.apache.org/jira/browse/SPARK-17428 > Project: Spark > Issue Type: New Feature > Components: SparkR > Reporter: Yanbo Liang > > Many users need to use third-party R packages on executors/workers, but SparkR cannot satisfy this requirement elegantly. For example, you have to work with the cluster's IT/administrators to deploy these R packages on each executor/worker node, which is very inflexible. > I think we should support third-party R packages for SparkR users, as we do for jar packages, in the following two scenarios: > 1, Users can install R packages from CRAN or a custom CRAN-like repository on each executor. > 2, Users can load their local R packages and install them on each executor. > To achieve this goal, the first thing is to make SparkR executors support a virtualenv-like mechanism, as Python does with conda. I have investigated and found that packrat (http://rstudio.github.io/packrat/) is one of the candidates for supporting virtualenv for R. Packrat is a dependency management system for R that can isolate the dependent R packages in its own private package space. Then SparkR users can install third-party packages in the application scope (destroyed after the application exits) and don't need to bother IT/administrators to install these packages manually. > I would like to know whether this makes sense.
[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15469736#comment-15469736 ] Yanbo Liang commented on SPARK-17428: - cc [~shivaram] [~felixcheung] > SparkR executors/workers support virtualenv > --- > > Key: SPARK-17428 > URL: https://issues.apache.org/jira/browse/SPARK-17428 > Project: Spark > Issue Type: New Feature > Components: SparkR > Reporter: Yanbo Liang > > Many users need to use third-party R packages on executors/workers, but SparkR cannot satisfy this requirement elegantly. For example, you have to work with the cluster's IT/administrators to deploy these R packages on each executor/worker node, which is very inflexible. > I think we should support third-party R packages for SparkR users, as we do for jar packages, in the following two scenarios: > 1, Users can install R packages from CRAN or a custom CRAN-like repository on each executor. > 2, Users can load their local R packages and install them on each executor. > To achieve this goal, the first thing is to make SparkR executors support a virtualenv-like mechanism, as Python does with conda. I have investigated and found that packrat (http://rstudio.github.io/packrat/) is one of the candidates for supporting virtualenv for R. Packrat is a dependency management system for R that can isolate the dependent R packages in its own private package space. Then SparkR users can install third-party packages in the application scope (destroyed after the application exits) and don't need to bother IT/administrators to install these packages manually. > I would like to know whether this makes sense.
[jira] [Updated] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-17428: Description: Many users need to use third-party R packages on executors/workers, but SparkR cannot satisfy this requirement elegantly. For example, you have to work with the cluster's IT/administrators to deploy these R packages on each executor/worker node, which is very inflexible. I think we should support third-party R packages for SparkR users, as we do for jar packages, in the following two scenarios: 1, Users can install R packages from CRAN or a custom CRAN-like repository on each executor. 2, Users can load their local R packages and install them on each executor. To achieve this goal, the first thing is to make SparkR executors support a virtualenv-like mechanism, as Python does with conda. I have investigated and found that packrat (http://rstudio.github.io/packrat/) is one of the candidates for supporting virtualenv for R. Packrat is a dependency management system for R that can isolate the dependent R packages in its own private package space. Then SparkR users can install third-party packages in the application scope (destroyed after the application exits) and don't need to bother IT/administrators to install these packages manually. I would like to know whether this makes sense. was: Many users need to use third-party R packages on executors/workers, but SparkR cannot satisfy this requirement elegantly. For example, you have to work with the cluster's IT/administrators to deploy these R packages on each executor/worker node, which is very inflexible. I think we should support third-party R packages for SparkR users, as we do for jar packages, in the following two scenarios: 1, Users can install R packages from CRAN or a custom CRAN-like repository on each executor. 2, Users can load their local R packages and install them on each executor. To achieve this goal, the first thing is to make SparkR executors support a virtualenv-like mechanism, as Python does with conda. I have investigated and found that packrat is one of the candidates for supporting virtualenv for R. Packrat is a dependency management system for R that can isolate the dependent R packages in its own private package space. Then SparkR users can install third-party packages in the application scope (destroyed after the application exits) and don't need to bother IT/administrators to install these packages manually. > SparkR executors/workers support virtualenv > --- > > Key: SPARK-17428 > URL: https://issues.apache.org/jira/browse/SPARK-17428 > Project: Spark > Issue Type: New Feature > Components: SparkR > Reporter: Yanbo Liang > > Many users need to use third-party R packages on executors/workers, but SparkR cannot satisfy this requirement elegantly. For example, you have to work with the cluster's IT/administrators to deploy these R packages on each executor/worker node, which is very inflexible. > I think we should support third-party R packages for SparkR users, as we do for jar packages, in the following two scenarios: > 1, Users can install R packages from CRAN or a custom CRAN-like repository on each executor. > 2, Users can load their local R packages and install them on each executor. > To achieve this goal, the first thing is to make SparkR executors support a virtualenv-like mechanism, as Python does with conda. I have investigated and found that packrat (http://rstudio.github.io/packrat/) is one of the candidates for supporting virtualenv for R. Packrat is a dependency management system for R that can isolate the dependent R packages in its own private package space. Then SparkR users can install third-party packages in the application scope (destroyed after the application exits) and don't need to bother IT/administrators to install these packages manually. > I would like to know whether this makes sense.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17428) SparkR executors/workers support virtualenv
Yanbo Liang created SPARK-17428: --- Summary: SparkR executors/workers support virtualenv Key: SPARK-17428 URL: https://issues.apache.org/jira/browse/SPARK-17428 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Yanbo Liang Many users need to use third-party R packages in executors/workers, but SparkR cannot satisfy this requirement elegantly. For example, you have to ask the IT/administrators of the cluster to deploy these R packages on each executor/worker node, which is very inflexible. I think we should support third-party R packages for SparkR users, as we do for jar packages, in the following two scenarios: 1. Users can install R packages from CRAN or a custom CRAN-like repository on each executor. 2. Users can ship their local R packages and install them on each executor. To achieve this goal, the first step is to make SparkR executors support a virtualenv-like mechanism, as Python does with conda. I have investigated and found that packrat is one of the candidates for supporting virtualenv for R. Packrat is a dependency management system for R that can isolate the dependent R packages in its own private package space. SparkR users could then install third-party packages in the application scope (destroyed after the application exits) and would not need to bother IT/administrators to install these packages manually. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
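The proposal above mirrors what Python users already get from virtualenv/conda: a per-application private package library that is created when the application starts and destroyed when it exits. As a rough illustration of that lifecycle only — this is not SparkR or packrat API, just a pure-Python sketch using the standard `venv` module, with a hypothetical `run_with_private_env` helper:

```python
# Sketch of application-scoped package isolation, analogous to what the JIRA
# proposes for SparkR via packrat. Python's stdlib `venv` plays the role of
# the private package library: created per application, destroyed on exit.
import os
import shutil
import tempfile
import venv

def run_with_private_env(app):
    # Create a throwaway environment visible only to this application.
    env_dir = tempfile.mkdtemp(prefix="app-env-")
    try:
        venv.EnvBuilder(with_pip=False).create(env_dir)
        # Packages installed into env_dir would be private to this app;
        # here we only demonstrate that the isolated library location exists.
        return app(env_dir)
    finally:
        # "Destroyed after the application exits" -- app-scoped isolation.
        shutil.rmtree(env_dir, ignore_errors=True)

result = run_with_private_env(lambda d: os.path.isdir(d))
print(result)  # True while the app runs; the env is gone afterwards
```

The same pattern applied to R would let each Spark application carry its own packrat library without touching the cluster-wide R installation.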
[jira] [Resolved] (SPARK-17197) PySpark LiR/LoR supports tree aggregation level configurable
[ https://issues.apache.org/jira/browse/SPARK-17197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-17197. - Resolution: Fixed Assignee: Yanbo Liang Fix Version/s: 2.1.0 > PySpark LiR/LoR supports tree aggregation level configurable > > > Key: SPARK-17197 > URL: https://issues.apache.org/jira/browse/SPARK-17197 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Minor > Fix For: 2.1.0 > > > SPARK-17090 made the tree aggregation level in LiR/LoR configurable; this > JIRA makes PySpark support this feature as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
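The "tree aggregation level" being exposed here controls how many rounds of intermediate combining happen between the per-partition aggregation and the final combine on the driver, so the driver never has to merge every partition's result at once. A minimal pure-Python sketch of that idea — not Spark's `treeAggregate` implementation; `fan_in` and the function names are illustrative:

```python
from functools import reduce

def tree_aggregate(partitions, seq_op, comb_op, zero, depth=2, fan_in=2):
    """Combine per-partition results in `depth` rounds instead of all at
    once on the driver -- the idea behind treeAggregate's depth knob."""
    # Per-partition aggregation (the seqOp phase).
    results = [reduce(seq_op, part, zero) for part in partitions]
    # Intermediate combine rounds: each round merges groups of `fan_in`.
    for _ in range(depth - 1):
        if len(results) <= 1:
            break
        results = [reduce(comb_op, results[i:i + fan_in])
                   for i in range(0, len(results), fan_in)]
    # Final combine on the "driver".
    return reduce(comb_op, results, zero)

data = [[1, 2], [3, 4], [5, 6], [7, 8]]
total = tree_aggregate(data, lambda a, b: a + b, lambda a, b: a + b, 0, depth=3)
print(total)  # 36
```

Any `depth` yields the same answer for an associative combiner; a larger depth only spreads the combine work over more rounds, which is why it is safe to make the level configurable.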
[jira] [Resolved] (SPARK-14378) Review spark.ml parity for regression, except trees
[ https://issues.apache.org/jira/browse/SPARK-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-14378. - Resolution: Done > Review spark.ml parity for regression, except trees > --- > > Key: SPARK-14378 > URL: https://issues.apache.org/jira/browse/SPARK-14378 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > > Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all > functionality. List all missing items. > This only covers Scala since we can compare Scala vs. Python in spark.ml > itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14378) Review spark.ml parity for regression, except trees
[ https://issues.apache.org/jira/browse/SPARK-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436529#comment-15436529 ] Yanbo Liang commented on SPARK-14378: - Yes, I think we can resolve this as DONE. Thanks! > Review spark.ml parity for regression, except trees > --- > > Key: SPARK-14378 > URL: https://issues.apache.org/jira/browse/SPARK-14378 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > > Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all > functionality. List all missing items. > This only covers Scala since we can compare Scala vs. Python in spark.ml > itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-8519) Blockify distance computation in k-means
[ https://issues.apache.org/jira/browse/SPARK-8519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-8519: --- Comment: was deleted (was: User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/10306) > Blockify distance computation in k-means > > > Key: SPARK-8519 > URL: https://issues.apache.org/jira/browse/SPARK-8519 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > Labels: advanced > > The performance of pairwise distance computation in k-means can benefit from > BLAS Level 3 matrix-matrix multiplications. This requires that we update the > implementation to use blocks. Even for sparse data, we might be able to see > some performance gain. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
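The blockified computation this issue describes typically rests on the expansion ||x - c||^2 = ||x||^2 - 2 x·c + ||c||^2: the cross terms for a whole block of points can then be produced by one matrix-matrix product, which is where a BLAS Level 3 GEMM pays off. A pure-Python sketch of that scheme — the naive `matmul` stands in for the BLAS call, and the function names are illustrative, not Spark's internals:

```python
def matmul(A, B):
    """Naive matrix product standing in for a BLAS Level 3 GEMM call."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def blockified_sq_dists(points, centers, block_size=2):
    """Squared distances via ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2,
    computing the cross terms one block of points at a time."""
    c_norms = [sum(v * v for v in c) for c in centers]
    centers_t = [list(col) for col in zip(*centers)]  # d x k
    out = []
    for start in range(0, len(points), block_size):
        block = points[start:start + block_size]
        cross = matmul(block, centers_t)  # block x centers^T, one GEMM
        for x, row in zip(block, cross):
            x_norm = sum(v * v for v in x)
            out.append([x_norm - 2 * dot + cn for dot, cn in zip(row, c_norms)])
    return out

pts = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
ctrs = [[0.0, 0.0], [3.0, 4.0]]
print(blockified_sq_dists(pts, ctrs))  # [[0.0, 25.0], [25.0, 0.0], [2.0, 13.0]]
```

Because the point and center norms are computed once, almost all of the arithmetic lands in the single `matmul` per block; even for sparse inputs a blocked product can beat per-pair distance loops.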
[jira] [Updated] (SPARK-14381) Review spark.ml parity for feature transformers
[ https://issues.apache.org/jira/browse/SPARK-14381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-14381: Fix Version/s: (was: 2.1.0) > Review spark.ml parity for feature transformers > --- > > Key: SPARK-14381 > URL: https://issues.apache.org/jira/browse/SPARK-14381 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Assignee: Xusen Yin > > Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all > functionality. List all missing items. > This only covers Scala since we can compare Scala vs. Python in spark.ml > itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14381) Review spark.ml parity for feature transformers
[ https://issues.apache.org/jira/browse/SPARK-14381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436268#comment-15436268 ] Yanbo Liang commented on SPARK-14381: - Resolved this, thanks for working on it. > Review spark.ml parity for feature transformers > --- > > Key: SPARK-14381 > URL: https://issues.apache.org/jira/browse/SPARK-14381 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Assignee: Xusen Yin > > Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all > functionality. List all missing items. > This only covers Scala since we can compare Scala vs. Python in spark.ml > itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14381) Review spark.ml parity for feature transformers
[ https://issues.apache.org/jira/browse/SPARK-14381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-14381. - Resolution: Done Assignee: Xusen Yin Fix Version/s: 2.1.0 > Review spark.ml parity for feature transformers > --- > > Key: SPARK-14381 > URL: https://issues.apache.org/jira/browse/SPARK-14381 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Assignee: Xusen Yin > Fix For: 2.1.0 > > > Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all > functionality. List all missing items. > This only covers Scala since we can compare Scala vs. Python in spark.ml > itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14378) Review spark.ml parity for regression, except trees
[ https://issues.apache.org/jira/browse/SPARK-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436264#comment-15436264 ] Yanbo Liang edited comment on SPARK-14378 at 8/25/16 4:30 AM: -- * GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel) ** single-row prediction SPARK-10413 ** initialModel (Hold off until SPARK-10780 resolved) ** setValidateData (not important for public API) ** optimizer interface SPARK-17136 ** LBFGS set numCorrections (not important for public API) ** PMML SPARK-11239 ** toString: print summary * IsotonicRegressionModel ** single-row prediction SPARK-10413 * StreamingLinearRegression was (Author: yanboliang): * GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel) ** single-row prediction SPARK-10413 ** initialModel (Hold off until SPARK-10780 resolved) ** setValidateData (not important for public API) ** optimizer interface SPARK-17136 ** LBFGS set numCorrections (not important for public API) ** PMML SPARK-11237 ** toString: print summary * IsotonicRegressionModel ** single-row prediction SPARK-10413 * StreamingLinearRegression > Review spark.ml parity for regression, except trees > --- > > Key: SPARK-14378 > URL: https://issues.apache.org/jira/browse/SPARK-14378 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > > Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all > functionality. List all missing items. > This only covers Scala since we can compare Scala vs. Python in spark.ml > itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14378) Review spark.ml parity for regression, except trees
[ https://issues.apache.org/jira/browse/SPARK-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436264#comment-15436264 ] Yanbo Liang edited comment on SPARK-14378 at 8/25/16 4:29 AM: -- * GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel) ** single-row prediction SPARK-10413 ** initialModel (Hold off until SPARK-10780 resolved) ** setValidateData (not important for public API) ** optimizer interface SPARK-17136 ** LBFGS set numCorrections (not important for public API) ** PMML SPARK-11237 ** toString: print summary * IsotonicRegressionModel ** single-row prediction SPARK-10413 * StreamingLinearRegression was (Author: yanboliang): * GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel) ** single-row prediction SPARK-10413 ** initialModel (Hold off until SPARK-10780 resolved) ** setValidateData (not important for public API) ** optimizer interface SPARK-17136 ** LBFGS set numCorrections (not important for public API) ** toString: print summary * IsotonicRegressionModel ** single-row prediction SPARK-10413 > Review spark.ml parity for regression, except trees > --- > > Key: SPARK-14378 > URL: https://issues.apache.org/jira/browse/SPARK-14378 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > > Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all > functionality. List all missing items. > This only covers Scala since we can compare Scala vs. Python in spark.ml > itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14378) Review spark.ml parity for regression, except trees
[ https://issues.apache.org/jira/browse/SPARK-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436264#comment-15436264 ] Yanbo Liang edited comment on SPARK-14378 at 8/25/16 4:26 AM: -- * GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel) ** single-row prediction SPARK-10413 ** initialModel (Hold off until SPARK-10780 resolved) ** setValidateData (not important for public API) ** optimizer interface SPARK-17136 ** LBFGS set numCorrections (not important for public API) ** toString: print summary * IsotonicRegressionModel ** single-row prediction SPARK-10413 was (Author: yanboliang): * GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel) ** single-row prediction SPARK-10413 ** initialModel (Hold off until SPARK-10780 resolved) ** setValidateData (not important for public API) ** optimizer interface SPARK-17136 ** LBFGS set numCorrections (not important for public API) ** toString: print summary SPARK-14712 * IsotonicRegressionModel ** single-row prediction SPARK-10413 > Review spark.ml parity for regression, except trees > --- > > Key: SPARK-14378 > URL: https://issues.apache.org/jira/browse/SPARK-14378 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > > Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all > functionality. List all missing items. > This only covers Scala since we can compare Scala vs. Python in spark.ml > itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14378) Review spark.ml parity for regression, except trees
[ https://issues.apache.org/jira/browse/SPARK-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang reassigned SPARK-14378: --- Assignee: Yanbo Liang > Review spark.ml parity for regression, except trees > --- > > Key: SPARK-14378 > URL: https://issues.apache.org/jira/browse/SPARK-14378 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > > Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all > functionality. List all missing items. > This only covers Scala since we can compare Scala vs. Python in spark.ml > itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14378) Review spark.ml parity for regression, except trees
[ https://issues.apache.org/jira/browse/SPARK-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436264#comment-15436264 ] Yanbo Liang edited comment on SPARK-14378 at 8/25/16 4:25 AM: -- * GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel) ** single-row prediction SPARK-10413 ** initialModel (Hold off until SPARK-10780 resolved) ** setValidateData (not important for public API) ** optimizer interface SPARK-17136 ** LBFGS set numCorrections (not important for public API) ** toString: print summary SPARK-14712 * IsotonicRegressionModel ** single-row prediction SPARK-10413 was (Author: yanboliang): * GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel) ** single-row prediction SPARK-10413 ** initialModel ** setValidateData (not important for public API) ** optimizer interface SPARK-17136 ** LBFGS set numCorrections (not important for public API) ** toString: print summary SPARK-14712 * IsotonicRegressionModel ** single-row prediction SPARK-10413 > Review spark.ml parity for regression, except trees > --- > > Key: SPARK-14378 > URL: https://issues.apache.org/jira/browse/SPARK-14378 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all > functionality. List all missing items. > This only covers Scala since we can compare Scala vs. Python in spark.ml > itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14378) Review spark.ml parity for regression, except trees
[ https://issues.apache.org/jira/browse/SPARK-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436264#comment-15436264 ] Yanbo Liang commented on SPARK-14378: - * GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel) ** single-row prediction SPARK-10413 ** initialModel ** setValidateData (not important for public API) ** optimizer interface SPARK-17136 ** LBFGS set numCorrections (not important for public API) ** toString: print summary SPARK-14712 * IsotonicRegressionModel ** single-row prediction SPARK-10413 > Review spark.ml parity for regression, except trees > --- > > Key: SPARK-14378 > URL: https://issues.apache.org/jira/browse/SPARK-14378 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all > functionality. List all missing items. > This only covers Scala since we can compare Scala vs. Python in spark.ml > itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces
[ https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436191#comment-15436191 ] Yanbo Liang edited comment on SPARK-17163 at 8/25/16 3:22 AM: -- Exposing a {{family}} or similar parameter sounds good to me. One question: {quote} When the family is set to "binomial" we produce normal logistic regression with pivoting and when it is set to "multinomial" (default) it produces logistic regression with pivoting. {quote} Should it be {{when it is set to "multinomial" (default) it produces logistic regression {color:red}without{color} pivoting}} ? Thanks! was (Author: yanboliang): Exposing a {{family}} or similar parameter sounds good to me. > Decide on unified multinomial and binary logistic regression interfaces > --- > > Key: SPARK-17163 > URL: https://issues.apache.org/jira/browse/SPARK-17163 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > Before the 2.1 release, we should finalize the API for logistic regression. > After SPARK-7159, we have both LogisticRegression and > MultinomialLogisticRegression models. This may be confusing to users and, is > a bit superfluous since MLOR can do basically all of what BLOR does. We > should decide if it needs to be changed and implement those changes before 2.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
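The pivoting question raised in this comment can be checked numerically: for two classes, an unpivoted (multinomial-style) softmax over per-class margins yields exactly the same probability as a pivoted (binomial-style) sigmoid over the coefficient difference. A small sketch with made-up coefficient values, not tied to Spark's implementation:

```python
import math

def softmax(scores):
    m = max(scores)  # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Unpivoted (multinomial-style): one margin per class.
b = [0.3, 1.7]          # per-class coefficients for a 1-d feature (made up)
x = 2.0
p_mlor = softmax([bi * x for bi in b])[1]

# Pivoted (binomial-style): class 0 is the pivot, single coefficient b1 - b0.
p_blor = sigmoid((b[1] - b[0]) * x)

print(abs(p_mlor - p_blor) < 1e-12)  # True: the parameterizations agree
```

The two parameterizations only diverge once regularization is applied, since the penalty sees two coefficient vectors in one case and their difference in the other — which is exactly why a `family`-style switch matters.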
[jira] [Comment Edited] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces
[ https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436191#comment-15436191 ] Yanbo Liang edited comment on SPARK-17163 at 8/25/16 3:22 AM: -- Exposing a {{family}} or similar parameter sounds good to me. One more question: {quote} When the family is set to "binomial" we produce normal logistic regression with pivoting and when it is set to "multinomial" (default) it produces logistic regression with pivoting. {quote} Should it be {{when it is set to "multinomial" (default) it produces logistic regression {color:red}without{color} pivoting}} ? Thanks! was (Author: yanboliang): Exposing a {{family}} or similar parameter sounds good to me. One question: {quote} When the family is set to "binomial" we produce normal logistic regression with pivoting and when it is set to "multinomial" (default) it produces logistic regression with pivoting. {quote} Should it be {{when it is set to "multinomial" (default) it produces logistic regression {color:red}without{color} pivoting}} ? Thanks! > Decide on unified multinomial and binary logistic regression interfaces > --- > > Key: SPARK-17163 > URL: https://issues.apache.org/jira/browse/SPARK-17163 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > Before the 2.1 release, we should finalize the API for logistic regression. > After SPARK-7159, we have both LogisticRegression and > MultinomialLogisticRegression models. This may be confusing to users and, is > a bit superfluous since MLOR can do basically all of what BLOR does. We > should decide if it needs to be changed and implement those changes before 2.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces
[ https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436191#comment-15436191 ] Yanbo Liang edited comment on SPARK-17163 at 8/25/16 3:14 AM: -- Exposing a {{family}} or similar parameter sounds good to me. was (Author: yanboliang): Exposing a {{family}} or similar parameter to control pivoting sounds good to me. > Decide on unified multinomial and binary logistic regression interfaces > --- > > Key: SPARK-17163 > URL: https://issues.apache.org/jira/browse/SPARK-17163 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > Before the 2.1 release, we should finalize the API for logistic regression. > After SPARK-7159, we have both LogisticRegression and > MultinomialLogisticRegression models. This may be confusing to users and, is > a bit superfluous since MLOR can do basically all of what BLOR does. We > should decide if it needs to be changed and implement those changes before 2.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces
[ https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434412#comment-15434412 ] Yanbo Liang edited comment on SPARK-17163 at 8/25/16 3:12 AM: -- I think it's hard to unify binary and multinomial logistic regression without making any breaking change. * Like [~sethah] said, we need to find a way to unify the representation of {{coefficients}} and {{intercept}}. I think flattening the matrix into a vector is still a compromise; the best representation would be a matrix for {{coefficients}} and a vector for {{intercept}}, even for a binary classification problem. This would be consistent with other ML models such as {{NaiveBayesModel}}, which also supports multi-class classification. But it would introduce a breaking change. * MLOR and LOR return different results for binary classification when regularization is used. * The current LOR code base provides both {{setThreshold}} and {{setThresholds}} for binary logistic regression, and they have some interactions. If we make MLOR and LOR share the old LOR code base, it will also introduce breaking changes to these APIs. FYI: SPARK-11834 and SPARK-11543. * Model store/load compatibility. Here we have two choices: consolidate them, which will introduce a breaking change, or keep them separate. -I would prefer to keep LOR and MLOR as different APIs, but I don't hold this opinion strongly if you have a better proposal. Thanks!- was (Author: yanboliang): I think it's hard to unify binary and multinomial logistic regression without making any breaking change. * Like [~sethah] said, we need to find a way to unify the representation of {{coefficients}} and {{intercept}}. I think flattening the matrix into a vector is still a compromise; the best representation would be a matrix for {{coefficients}} and a vector for {{intercept}}, even for a binary classification problem. 
This would be consistent with other ML models such as {{NaiveBayesModel}}, which also supports multi-class classification. But it would introduce a big breaking change. * MLOR and LOR return different results for binary classification when regularization is used. * The current LOR code base provides both {{setThreshold}} and {{setThresholds}} for binary logistic regression, and they have some interactions. If we make MLOR and LOR share the old LOR code base, it will also introduce breaking changes to these APIs. FYI: SPARK-11834 and SPARK-11543. * Model store/load compatibility. Here we have two choices: consolidate them, which will introduce a breaking change, or keep them separate. -I would prefer to keep LOR and MLOR as different APIs, but I don't hold this opinion strongly if you have a better proposal. Thanks!- > Decide on unified multinomial and binary logistic regression interfaces > --- > > Key: SPARK-17163 > URL: https://issues.apache.org/jira/browse/SPARK-17163 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > Before the 2.1 release, we should finalize the API for logistic regression. > After SPARK-7159, we have both LogisticRegression and > MultinomialLogisticRegression models. This may be confusing to users and is > a bit superfluous since MLOR can do basically all of what BLOR does. We > should decide if it needs to be changed and implement those changes before 2.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
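The representation trade-off in the first bullet above — a numClasses x numFeatures {{coefficients}} matrix versus a flattened vector — is easy to state concretely: the flattened form is lossless but hides the class structure. A sketch with illustrative helper names (not Spark's API):

```python
def flatten(coef_matrix):
    """Row-major flatten of a numClasses x numFeatures coefficient matrix,
    the compromise representation discussed in the comment above."""
    return [v for row in coef_matrix for v in row]

def unflatten(vec, num_classes, num_features):
    """Recover the per-class rows; the caller must remember both shapes."""
    assert len(vec) == num_classes * num_features
    return [vec[i * num_features:(i + 1) * num_features]
            for i in range(num_classes)]

coefs = [[0.1, 0.2, 0.3],   # class 0
         [0.4, 0.5, 0.6]]   # class 1
vec = flatten(coefs)
print(vec)                            # [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
print(unflatten(vec, 2, 3) == coefs)  # True: the round trip is lossless
```

The matrix form needs no out-of-band shape bookkeeping, which is the consistency argument made above via {{NaiveBayesModel}}.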
[jira] [Commented] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces
[ https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436191#comment-15436191 ] Yanbo Liang commented on SPARK-17163: - Exposing a {{family}} or similar parameter to control pivoting sounds good to me. > Decide on unified multinomial and binary logistic regression interfaces > --- > > Key: SPARK-17163 > URL: https://issues.apache.org/jira/browse/SPARK-17163 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > Before the 2.1 release, we should finalize the API for logistic regression. > After SPARK-7159, we have both LogisticRegression and > MultinomialLogisticRegression models. This may be confusing to users and, is > a bit superfluous since MLOR can do basically all of what BLOR does. We > should decide if it needs to be changed and implement those changes before 2.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces
[ https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434798#comment-15434798 ] Yanbo Liang edited comment on SPARK-17163 at 8/24/16 12:12 PM: --- Thinking more about this problem, I have changed my mind and now support consolidating MLOR and LOR into one, since I saw a lot of duplicated code between them. I think it's worth making the breaking change; otherwise, it will require extra effort to maintain both. Thanks! was (Author: yanboliang): Thinking more about this problem, I have changed my mind and now support consolidating MLOR and LOR into one, since I saw a lot of duplicated code between them. I think it's worth making the breaking change; otherwise, it will require effort to maintain both. Thanks! > Decide on unified multinomial and binary logistic regression interfaces > --- > > Key: SPARK-17163 > URL: https://issues.apache.org/jira/browse/SPARK-17163 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > Before the 2.1 release, we should finalize the API for logistic regression. > After SPARK-7159, we have both LogisticRegression and > MultinomialLogisticRegression models. This may be confusing to users and is > a bit superfluous since MLOR can do basically all of what BLOR does. We > should decide if it needs to be changed and implement those changes before 2.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces
[ https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434412#comment-15434412 ] Yanbo Liang edited comment on SPARK-17163 at 8/24/16 12:10 PM: --- I think it's hard to unify binary and multinomial logistic regression without making any breaking change. * Like [~sethah] said, we need to find a way to unify the representation of {{coefficients}} and {{intercept}}. I think flattening the matrix into a vector is still a compromise; the best representation would be a matrix for {{coefficients}} and a vector for {{intercept}}, even for a binary classification problem. This would be more or less consistent with other ML models such as {{NaiveBayesModel}}, which also supports multi-class classification. But it would introduce a big breaking change. * MLOR and LOR return different results for binary classification when regularization is used. * The current LOR code base provides both {{setThreshold}} and {{setThresholds}} for binary logistic regression, and they have some interactions. If we make MLOR and LOR share the old LOR code base, it will also introduce breaking changes to these APIs. FYI: SPARK-11834 and SPARK-11543. * Model store/load compatibility. Here we have two choices: consolidate them, which will introduce a breaking change, or keep them separate. -I would prefer to keep LOR and MLOR as different APIs, but I don't hold this opinion strongly if you have a better proposal. Thanks!- was (Author: yanboliang): I think it's hard to unify binary and multinomial logistic regression without making any breaking change. * Like [~sethah] said, we need to find a way to unify the representation of {{coefficients}} and {{intercept}}. I think flattening the matrix into a vector is still a compromise; the best representation would be a matrix for {{coefficients}} and a vector for {{intercept}}, even for a binary classification problem. 
This would be more or less consistent with other ML models such as {{NaiveBayesModel}}, which also supports multi-class classification. But it would introduce a big breaking change. * MLOR and LOR return different results for binary classification when regularization is used. * The current LOR code base provides both {{setThreshold}} and {{setThresholds}} for binary logistic regression, and they have some interactions. If we make MLOR and LOR share the old LOR code base, it will also introduce breaking changes to these APIs. FYI: SPARK-11834 and SPARK-11543. * Model store/load compatibility. Here we have two choices: consolidate them, which will introduce a breaking change, or keep them separate. -I would prefer to keep LOR and MLOR as different APIs, but I don't hold this opinion strongly if you have a better proposal. Thanks!- > Decide on unified multinomial and binary logistic regression interfaces > --- > > Key: SPARK-17163 > URL: https://issues.apache.org/jira/browse/SPARK-17163 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > Before the 2.1 release, we should finalize the API for logistic regression. > After SPARK-7159, we have both LogisticRegression and > MultinomialLogisticRegression models. This may be confusing to users and is > a bit superfluous since MLOR can do basically all of what BLOR does. We > should decide if it needs to be changed and implement those changes before 2.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces
[ https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434798#comment-15434798 ] Yanbo Liang commented on SPARK-17163: - Thinking more about this problem, I have changed my mind: I now support consolidating MLOR and LOR into one, since there is a lot of duplicated code between them. I think the breaking change is worth making; otherwise it will take ongoing effort to maintain both. Thanks!
[jira] [Updated] (SPARK-17197) PySpark LiR/LoR supports tree aggregation level configurable
[ https://issues.apache.org/jira/browse/SPARK-17197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-17197: Priority: Minor (was: Major) > PySpark LiR/LoR supports tree aggregation level configurable > > > Key: SPARK-17197 > URL: https://issues.apache.org/jira/browse/SPARK-17197 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Priority: Minor > > SPARK-17090 made the tree aggregation level in LiR/LoR configurable; this JIRA adds the same support to PySpark.
[jira] [Created] (SPARK-17197) PySpark LiR/LoR supports tree aggregation level configurable
Yanbo Liang created SPARK-17197: --- Summary: PySpark LiR/LoR supports tree aggregation level configurable Key: SPARK-17197 URL: https://issues.apache.org/jira/browse/SPARK-17197 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: Yanbo Liang SPARK-17090 made the tree aggregation level in LiR/LoR configurable; this JIRA adds the same support to PySpark.
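The tree aggregation being exposed here can be sketched in pure Python (a toy model under my own assumptions, not Spark's implementation): partial results are merged in layers whose count is controlled by a depth parameter, instead of reducing everything on the driver at once.

```python
from functools import reduce

def tree_aggregate(partials, combine, depth=2, fanin=2):
    """Toy model of Spark's treeAggregate depth parameter: run depth-1
    intermediate rounds that merge `fanin` partial results at a time,
    then combine whatever remains in one final reduce."""
    level = list(partials)
    for _ in range(depth - 1):
        if len(level) <= 1:
            break
        level = [reduce(combine, level[i:i + fanin])
                 for i in range(0, len(level), fanin)]
    return reduce(combine, level)

# e.g. summing per-partition partial gradients (here just numbers)
parts = [1, 2, 3, 4, 5, 6, 7, 8]
assert tree_aggregate(parts, lambda a, b: a + b, depth=3) == sum(parts)
```

A larger depth trades extra shuffle rounds for a smaller final merge on the driver, which is why it helps LiR/LoR with many partitions.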
[jira] [Assigned] (SPARK-11215) Add multiple columns support to StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-11215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang reassigned SPARK-11215: --- Assignee: Yanbo Liang > Add multiple columns support to StringIndexer > - > > Key: SPARK-11215 > URL: https://issues.apache.org/jira/browse/SPARK-11215 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > Add multiple-column support to StringIndexer so that users can transform > multiple input columns to multiple output columns simultaneously. See > the discussion in SPARK-8418.
[jira] [Commented] (SPARK-17169) To use scala macros to update code when SharedParamsCodeGen.scala changed
[ https://issues.apache.org/jira/browse/SPARK-17169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15430282#comment-15430282 ] Yanbo Liang commented on SPARK-17169: - Meanwhile, it would be better to do compile-time code-gen for the Python params as well, that is, to run {{python _shared_params_code_gen.py > shared.py}} automatically. > To use scala macros to update code when SharedParamsCodeGen.scala changed > - > > Key: SPARK-17169 > URL: https://issues.apache.org/jira/browse/SPARK-17169 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Qian Huang >Priority: Minor > > As commented in the file SharedParamsCodeGen.scala, we have to manually run > build/sbt "mllib/runMain org.apache.spark.ml.param.shared.SharedParamsCodeGen" > to generate and update it. > It would be better to do compile-time code-gen for this using Scala macros > rather than running the script manually as described above.
[jira] [Commented] (SPARK-17086) QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data
[ https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15430009#comment-15430009 ] Yanbo Liang commented on SPARK-17086: - We should not throw an exception in this case. If the number of distinct input values is less than {{numBuckets}}, we should simply return an array of the distinct elements as splits. But we should not actually count the distinct input elements, which is very expensive; instead, we can collapse adjacent splits produced by {{approxQuantile}} that are equal. > QuantileDiscretizer throws InvalidArgumentException (parameter splits given > invalid value) on valid data > > > Key: SPARK-17086 > URL: https://issues.apache.org/jira/browse/SPARK-17086 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Barry Becker > > I discovered this bug when working with a build from the master branch (which > I believe is 2.1.0). This used to work fine when running Spark 1.6.2. > I have a dataframe with an "intData" column that has values like > {code} > 1 3 2 1 1 2 3 2 2 2 1 3 > {code} > I have a stage in my pipeline that uses the QuantileDiscretizer to produce > equal-weight splits like this > {code} > new QuantileDiscretizer() > .setInputCol("intData") > .setOutputCol("intData_bin") > .setNumBuckets(10) > .fit(df) > {code} > But when that gets run it (incorrectly) throws this error: > {code} > parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0, > 3.0, Infinity] > {code} > I don't think duplicate splits should be generated, should they?
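The collapsing step suggested in the comment is simple to sketch. A hypothetical helper (not Spark code) that removes adjacent duplicate boundaries from the splits {{approxQuantile}} returns:

```python
def collapse_duplicate_splits(splits):
    """Drop adjacent duplicates so bucket boundaries are strictly
    increasing, as proposed above. Hypothetical helper, not Spark code."""
    out = [splits[0]]
    for s in splits[1:]:
        if s != out[-1]:
            out.append(s)
    return out

# The invalid splits from the bug report become valid boundaries:
raw = [float("-inf"), 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, float("inf")]
assert collapse_duplicate_splits(raw) == [float("-inf"), 1.0, 2.0, 3.0, float("inf")]
```

This yields fewer buckets than requested when the data has few distinct values, which matches the behavior proposed in the comment.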
[jira] [Resolved] (SPARK-15018) PySpark ML Pipeline raises unclear error when no stages set
[ https://issues.apache.org/jira/browse/SPARK-15018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-15018. - Resolution: Fixed Fix Version/s: 2.1.0 > PySpark ML Pipeline raises unclear error when no stages set > --- > > Key: SPARK-15018 > URL: https://issues.apache.org/jira/browse/SPARK-15018 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Minor > Fix For: 2.1.0 > > > When fitting a PySpark Pipeline with no stages, it should work as an identity > transformer. Instead the following error is raised: > {noformat} > Traceback (most recent call last): > File "./spark/python/pyspark/ml/base.py", line 64, in fit > return self._fit(dataset) > File "./spark/python/pyspark/ml/pipeline.py", line 99, in _fit > for stage in stages: > TypeError: 'NoneType' object is not iterable > {noformat} > The param {{stages}} needs to be an empty list and {{getStages}} should call > {{getOrDefault}}. > Also, since the default value {{None}} is then changed to an empty list > {{[]}}, this never changes the value when it is passed in as a keyword argument. > Instead, the {{kwargs}} value should be changed directly if {{stages is > None}}. > For example > {noformat} > if stages is None: > stages = [] > {noformat} > should be > {noformat} > if stages is None: > kwargs['stages'] = [] > {noformat} > However, since there is no default value in the Scala implementation, > assigning a default here is not needed and should be cleaned up. The pydocs > should better indicate that stages is required to be a list.
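The {{kwargs}} fix described above hinges on a Python subtlety: once the constructor has snapshotted its arguments into a dict, rebinding the local name {{stages}} no longer affects that snapshot. A simplified sketch (illustrative names, not the actual pyspark.ml code, which uses {{self._input_kwargs}}):

```python
# Simplified sketch of the fix described in the issue; class and
# attribute names here are illustrative, not the real pyspark.ml ones.
class Pipeline:
    def __init__(self, stages=None):
        kwargs = {"stages": stages}    # snapshot of the keyword arguments
        if stages is None:
            kwargs["stages"] = []      # fix: update the dict, not the local
        self._params = kwargs

    def getStages(self):
        return self._params["stages"]

assert Pipeline().getStages() == []            # identity pipeline, no TypeError
assert Pipeline(stages=["tok"]).getStages() == ["tok"]
```

Writing `stages = []` instead would rebind only the local variable, leaving the snapshot holding `None` and reproducing the original error.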
[jira] [Commented] (SPARK-17138) Python API for multinomial logistic regression
[ https://issues.apache.org/jira/browse/SPARK-17138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429258#comment-15429258 ] Yanbo Liang commented on SPARK-17138: - [~WeichenXu123] Please hold off on this task: SPARK-17163 is discussing unified multinomial and binary logistic regression interfaces, which may affect the Python API. Please wait for SPARK-17163 to be merged first. Thanks! > Python API for multinomial logistic regression > -- > > Key: SPARK-17138 > URL: https://issues.apache.org/jira/browse/SPARK-17138 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > Once [SPARK-7159|https://issues.apache.org/jira/browse/SPARK-7159] is merged, > we should make a Python API for it.
[jira] [Commented] (SPARK-17137) Add compressed support for multinomial logistic regression coefficients
[ https://issues.apache.org/jira/browse/SPARK-17137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429255#comment-15429255 ] Yanbo Liang commented on SPARK-17137: - Yes, I will do some performance tests to weigh the trade-off. Thanks. > Add compressed support for multinomial logistic regression coefficients > --- > > Key: SPARK-17137 > URL: https://issues.apache.org/jira/browse/SPARK-17137 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson >Priority: Minor > > For sparse coefficients in MLOR, such as under high L1 regularization, it may > be more efficient to store the coefficients in a compressed format. We can add this > option to MLOR and do some performance tests to verify the > improvement.
[jira] [Commented] (SPARK-17136) Design optimizer interface for ML algorithms
[ https://issues.apache.org/jira/browse/SPARK-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429253#comment-15429253 ] Yanbo Liang commented on SPARK-17136: - Yes, only first-order optimizers scale well in the number of features, so only this case needs to be considered. I recently worked on SPARK-10078 to support vector-free L-BFGS as an optimizer for Spark, which also involves designing an optimizer interface, so I can take on this issue as well. I will first investigate how other packages in Python/R/Matlab define such interfaces, post the findings here, and then we can discuss how to design the optimizer interface for Spark. Thanks! > Design optimizer interface for ML algorithms > > > Key: SPARK-17136 > URL: https://issues.apache.org/jira/browse/SPARK-17136 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > We should consider designing an interface that allows users to use their own > optimizers in some of the ML algorithms, similar to MLlib.
[jira] [Commented] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator
[ https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429196#comment-15429196 ] Yanbo Liang commented on SPARK-17134: - [~qhuang] Please feel free to take this task and do the performance investigation. Thanks! > Use level 2 BLAS operations in LogisticAggregator > - > > Key: SPARK-17134 > URL: https://issues.apache.org/jira/browse/SPARK-17134 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > Multinomial logistic regression uses the LogisticAggregator class for gradient > updates. We should look into refactoring MLOR to use level 2 BLAS operations > for the updates. Performance testing should be done to demonstrate the improvement.
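The refactoring idea can be illustrated with numpy standing in for BLAS (a sketch under that assumption, not the proposed LogisticAggregator code): for one sample, the K class margins can come from a single level-2 matrix-vector product instead of K separate level-1 dot products.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 5, 8
coef = rng.standard_normal((K, n))  # (numClasses x numFeatures)
x = rng.standard_normal(n)          # one sample's feature vector

# level-1 style: K separate dot products
margins_dot = np.array([coef[k] @ x for k in range(K)])
# level-2 style: one gemv-like matrix-vector product
margins_gemv = coef @ x

assert np.allclose(margins_dot, margins_gemv)
```

The results are identical; the level-2 form lets the BLAS library vectorize across classes in one call, which is the performance gain the issue asks to measure.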
[jira] [Resolved] (SPARK-17141) MinMaxScaler behaves weird when min and max have the same value and some values are NaN
[ https://issues.apache.org/jira/browse/SPARK-17141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-17141. - Resolution: Fixed Fix Version/s: 2.1.0 > MinMaxScaler behaves weird when min and max have the same value and some > values are NaN > --- > > Key: SPARK-17141 > URL: https://issues.apache.org/jira/browse/SPARK-17141 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.2, 2.0.0 > Environment: Databrick's Community, Spark 2.0 + Scala 2.10 >Reporter: Alberto Bonsanto >Assignee: Yanbo Liang >Priority: Minor > Fix For: 2.1.0 > > > When you have a {{DataFrame}} with a column named {{features}}, which is a > {{DenseVector}} and the *maximum* and *minimum* and some values are > {{Double.NaN}} they get replaced by 0.5, and they should remain with the same > value, I believe. > I know how to fix it, but I haven't ever made a pull request. You can check > the bug in this > [notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2485090270202665/3126465289264547/8589256059752547/latest.html] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17141) MinMaxScaler behaves weird when min and max have the same value and some values are NaN
[ https://issues.apache.org/jira/browse/SPARK-17141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang reassigned SPARK-17141: --- Assignee: Yanbo Liang > MinMaxScaler behaves weird when min and max have the same value and some > values are NaN > --- > > Key: SPARK-17141 > URL: https://issues.apache.org/jira/browse/SPARK-17141 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.2, 2.0.0 > Environment: Databrick's Community, Spark 2.0 + Scala 2.10 >Reporter: Alberto Bonsanto >Assignee: Yanbo Liang >Priority: Minor > > When you have a {{DataFrame}} with a column named {{features}}, which is a > {{DenseVector}} and the *maximum* and *minimum* and some values are > {{Double.NaN}} they get replaced by 0.5, and they should remain with the same > value, I believe. > I know how to fix it, but I haven't ever made a pull request. You can check > the bug in this > [notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2485090270202665/3126465289264547/8589256059752547/latest.html] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17141) MinMaxScaler behaves weird when min and max have the same value and some values are NaN
[ https://issues.apache.org/jira/browse/SPARK-17141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15427860#comment-15427860 ] Yanbo Liang commented on SPARK-17141: - In the existing code, {{MinMaxScaler}} handles NaN values inconsistently. * If a column is constant, that is max == min, the {{MinMaxScalerModel}} transformation will output 0.5 for all rows, even when the original value is NaN. * Otherwise, NaN values remain NaN after transformation. I think we should unify the behavior by keeping NaN values under all conditions, since we don't know how to transform a NaN value. Python's scikit-learn throws an exception when there is NaN in the dataset. > MinMaxScaler behaves weird when min and max have the same value and some > values are NaN > --- > > Key: SPARK-17141 > URL: https://issues.apache.org/jira/browse/SPARK-17141 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.2, 2.0.0 > Environment: Databrick's Community, Spark 2.0 + Scala 2.10 >Reporter: Alberto Bonsanto >Priority: Minor > > When you have a {{DataFrame}} with a column named {{features}}, which is a > {{DenseVector}} and the *maximum* and *minimum* and some values are > {{Double.NaN}} they get replaced by 0.5, and they should remain with the same > value, I believe. > I know how to fix it, but I haven't ever made a pull request. You can check > the bug in this > [notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2485090270202665/3126465289264547/8589256059752547/latest.html] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
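The unified behavior proposed above (NaN stays NaN; a constant column maps to the midpoint of the output range) can be sketched in plain Python. This is an illustration of the rule under discussion, not Spark's actual implementation:

```python
import math

def min_max_scale(values, out_min=0.0, out_max=1.0):
    """Min-max scale one column: NaN stays NaN (the proposed unified
    behavior), and a constant column (max == min) maps to the midpoint."""
    finite = [v for v in values if not math.isnan(v)]
    lo, hi = min(finite), max(finite)
    result = []
    for v in values:
        if math.isnan(v):
            result.append(float("nan"))  # never invent a value for NaN
        elif hi == lo:
            result.append(0.5 * (out_min + out_max))  # constant column
        else:
            result.append((v - lo) / (hi - lo) * (out_max - out_min) + out_min)
    return result
```

With this rule, `min_max_scale([1.0, 1.0, nan])` keeps the NaN in place while the constant entries map to 0.5, which is exactly the distinction the existing code blurs.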
[jira] [Comment Edited] (SPARK-17141) MinMaxScaler behaves weird when min and max have the same value and some values are NaN
[ https://issues.apache.org/jira/browse/SPARK-17141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15427860#comment-15427860 ] Yanbo Liang edited comment on SPARK-17141 at 8/19/16 9:01 AM: -- In the existing code, {{MinMaxScaler}} handles NaN values inconsistently. * If a column is constant, that is max == min, the {{MinMaxScalerModel}} transformation will output 0.5 for all rows, even when the original value is NaN. * Otherwise, NaN values remain NaN after transformation. I think we should unify the behavior by keeping NaN values under all conditions, since we don't know how to transform a NaN value. Python's scikit-learn throws an exception when there is NaN in the dataset. was (Author: yanboliang): In the existing code, {{MinMaxScaler}} handle NaN value indeterminately. * If a column has identity value, that is max == min, {{MinMaxScalerModel}} transformation will output 0.5 for all rows even the original value is NaN. * Otherwise, it will remain NaN after transformation. I think we should unify the behavior by remaining NaN value at any condition, since we don't know how to transform a NaN value. In Python sklearn, it will throw exception when there is NaN in the dataset. > MinMaxScaler behaves weird when min and max have the same value and some > values are NaN > --- > > Key: SPARK-17141 > URL: https://issues.apache.org/jira/browse/SPARK-17141 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.2, 2.0.0 > Environment: Databrick's Community, Spark 2.0 + Scala 2.10 >Reporter: Alberto Bonsanto >Priority: Minor > > When you have a {{DataFrame}} with a column named {{features}}, which is a > {{DenseVector}} and the *maximum* and *minimum* and some values are > {{Double.NaN}} they get replaced by 0.5, and they should remain with the same > value, I believe. > I know how to fix it, but I haven't ever made a pull request. 
You can check > the bug in this > [notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2485090270202665/3126465289264547/8589256059752547/latest.html] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17141) MinMaxScaler behaves weird when min and max have the same value and some values are NaN
[ https://issues.apache.org/jira/browse/SPARK-17141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-17141: Priority: Minor (was: Trivial) > MinMaxScaler behaves weird when min and max have the same value and some > values are NaN > --- > > Key: SPARK-17141 > URL: https://issues.apache.org/jira/browse/SPARK-17141 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.2, 2.0.0 > Environment: Databrick's Community, Spark 2.0 + Scala 2.10 >Reporter: Alberto Bonsanto >Priority: Minor > > When you have a {{DataFrame}} with a column named {{features}}, which is a > {{DenseVector}} and the *maximum* and *minimum* and some values are > {{Double.NaN}} they get replaced by 0.5, and they should remain with the same > value, I believe. > I know how to fix it, but I haven't ever made a pull request. You can check > the bug in this > [notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2485090270202665/3126465289264547/8589256059752547/latest.html] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15018) PySpark ML Pipeline fails when no stages set
[ https://issues.apache.org/jira/browse/SPARK-15018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-15018: Shepherd: Yanbo Liang Assignee: Bryan Cutler > PySpark ML Pipeline fails when no stages set > > > Key: SPARK-15018 > URL: https://issues.apache.org/jira/browse/SPARK-15018 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Reporter: Bryan Cutler >Assignee: Bryan Cutler > > When fitting a PySpark Pipeline with no stages, it should work as an identity > transformer. Instead the following error is raised: > {noformat} > Traceback (most recent call last): > File "./spark/python/pyspark/ml/base.py", line 64, in fit > return self._fit(dataset) > File "./spark/python/pyspark/ml/pipeline.py", line 99, in _fit > for stage in stages: > TypeError: 'NoneType' object is not iterable > {noformat} > The param {{stages}} should be added to the default param list and > {{getStages}} should call {{getOrDefault}}. > Also, since the default value of {{None}} is then changed to an empty list > {{[]}}, this never changes the value if passed in as a keyword argument. > Instead, the {{kwargs}} value should be changed directly if {{stages is > None}}. > For example > {noformat} > if stages is None: > stages = [] > {noformat} > should be this > {noformat} > if stages is None: > kwargs['stages'] = [] > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
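The fix described in SPARK-15018 hinges on a Python subtlety: once the keyword arguments have been captured into a dict, rebinding the local name `stages` does nothing to that dict. A minimal sketch of the pattern (simplified names; not the actual PySpark code):

```python
class Pipeline:
    """Toy stand-in for the keyword-capture pattern in PySpark's
    Pipeline.__init__ (names simplified for illustration)."""
    def __init__(self, stages=None):
        kwargs = {"stages": stages}   # keyword args captured up front
        if stages is None:
            # Wrong fix: `stages = []` rebinds only the local name and
            # leaves kwargs["stages"] as None. Patch the dict instead:
            kwargs["stages"] = []
        self._paramMap = kwargs       # params are set from the captured dict

    def getStages(self):
        return self._paramMap["stages"]
```

With this change, `Pipeline().getStages()` returns `[]` instead of `None`, so fitting an empty pipeline can behave as an identity transformer.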
[jira] [Commented] (SPARK-17137) Add compressed support for multinomial logistic regression coefficients
[ https://issues.apache.org/jira/browse/SPARK-17137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15427563#comment-15427563 ] Yanbo Liang commented on SPARK-17137: - I think we should provide a transparent interface to users rather than exposing a param to control whether to output dense or sparse coefficients. Spark MLlib's {{Vector.compressed}} returns a vector in either dense or sparse format, whichever uses less storage. I would like to do the performance tests for this issue. Thanks! > Add compressed support for multinomial logistic regression coefficients > --- > > Key: SPARK-17137 > URL: https://issues.apache.org/jira/browse/SPARK-17137 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson >Priority: Minor > > For sparse coefficients in MLOR, such as under high L1 regularization, it may > be more efficient to store coefficients in compressed format. We can add this > option to MLOR and perhaps do some performance tests to verify > improvements. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
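The "whichever uses less storage" rule behind {{Vector.compressed}} can be sketched as follows. The per-element byte costs here are rough assumptions for illustration, not MLlib's exact accounting:

```python
def compressed(values):
    """Pick dense vs. sparse storage for a vector by estimated size.
    Assumed costs: dense ~8 bytes/element; sparse ~12 bytes/nonzero
    (8-byte value + 4-byte index)."""
    n = len(values)
    nonzero = [(i, v) for i, v in enumerate(values) if v != 0.0]
    if 12 * len(nonzero) < 8 * n:
        indices = [i for i, _ in nonzero]
        vals = [v for _, v in nonzero]
        return ("sparse", (n, indices, vals))
    return ("dense", list(values))
```

A mostly-zero coefficient vector (e.g. under strong L1 regularization) comes back sparse, a dense one stays dense, and callers never have to choose — which is the "transparent interface" argued for above.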
[jira] [Commented] (SPARK-17136) Design optimizer interface for ML algorithms
[ https://issues.apache.org/jira/browse/SPARK-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15427539#comment-15427539 ] Yanbo Liang commented on SPARK-17136: - I would like to know whether users' own optimizers would follow some standard API, similar to breeze {{LBFGS}}, or something else? > Design optimizer interface for ML algorithms > > > Key: SPARK-17136 > URL: https://issues.apache.org/jira/browse/SPARK-17136 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > We should consider designing an interface that allows users to use their own > optimizers in some of the ML algorithms, similar to MLlib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator
[ https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15427529#comment-15427529 ] Yanbo Liang edited comment on SPARK-17134 at 8/19/16 3:04 AM: -- This is interesting. We are also trying to use BLAS to accelerate linear algebra operations in other algorithms such as {{KMeans/ALS}}, and I have some basic performance test results. I would like to contribute to this task after SPARK-7159 is finished. Thanks! was (Author: yanboliang): This is interesting. We also trying to use BLAS to accelerate linear algebra operations in other algorithms such as {{KMeans/ALS}} and I have some basic performance test result. I would like to contribute to this task. Thanks! > Use level 2 BLAS operations in LogisticAggregator > - > > Key: SPARK-17134 > URL: https://issues.apache.org/jira/browse/SPARK-17134 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > Multinomial logistic regression uses LogisticAggregator class for gradient > updates. We should look into refactoring MLOR to use level 2 BLAS operations > for the updates. Performance testing should be done to show improvements. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator
[ https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15427529#comment-15427529 ] Yanbo Liang commented on SPARK-17134: - This is interesting. We are also trying to use BLAS to accelerate linear algebra operations in other algorithms such as {{KMeans/ALS}}, and I have some basic performance test results. I would like to contribute to this task. Thanks! > Use level 2 BLAS operations in LogisticAggregator > - > > Key: SPARK-17134 > URL: https://issues.apache.org/jira/browse/SPARK-17134 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > Multinomial logistic regression uses LogisticAggregator class for gradient > updates. We should look into refactoring MLOR to use level 2 BLAS operations > for the updates. Performance testing should be done to show improvements. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17086) QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data
[ https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15426254#comment-15426254 ] Yanbo Liang edited comment on SPARK-17086 at 8/18/16 10:49 AM: --- [~sowen] The bucket defined by [1.0, 1.0) will only receive the value 1.0, I think this scenario is OK. But if we provide the splits as {{[-Infinity, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, Infinity]}}, it will output {{[-Infinity, 1.0), 1.0, 1.0, [1.0, 2.0), 2.0, 2.0, [2.0, 3.0), 3.0, [3.0, Infinity]}}. From the document, {{QuantileDiscretizer}} takes a column with continuous features and outputs a column with binned categorical features. So I think it does not make sense if we put the same continuous value into different categorical features. Thanks. was (Author: yanboliang): [~sowen] The bucket defined by [1.0, 1.0) will only receive the value 1.0, I think this scenario is OK. But if we provide the splits as {{[-Infinity, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, Infinity]}}, it will output {{[-Infinity, 1.0), 1.0, 1.0, [1.0, 2.0), 2.0, 2.0, [2.0, 3.0), [3.0, 3.0), [3.0, Infinity]}}. From the document, {{QuantileDiscretizer}} takes a column with continuous features and outputs a column with binned categorical features. So I think it does not make sense if we put the same continuous value into different categorical features. Thanks. > QuantileDiscretizer throws InvalidArgumentException (parameter splits given > invalid value) on valid data > > > Key: SPARK-17086 > URL: https://issues.apache.org/jira/browse/SPARK-17086 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Barry Becker > > I discovered this bug when working with a build from the master branch (which > I believe is 2.1.0). This used to work fine when running spark 1.6.2. 
> I have a dataframe with an "intData" column that has values like > {code} > 1 3 2 1 1 2 3 2 2 2 1 3 > {code} > I have a stage in my pipeline that uses the QuantileDiscretizer to produce > equal weight splits like this > {code} > new QuantileDiscretizer() > .setInputCol("intData") > .setOutputCol("intData_bin") > .setNumBuckets(10) > .fit(df) > {code} > But when that gets run it (incorrectly) throws this error: > {code} > parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0, > 3.0, Infinity] > {code} > I don't think that there should be duplicate splits generated should there be? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
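One way to avoid the invalid-splits error discussed above is to collapse duplicate quantile boundaries before building buckets, accepting fewer buckets than requested. A sketch of that idea (an assumed approach for illustration, not the actual Spark patch):

```python
def distinct_splits(candidates):
    """Turn approximate quantile boundaries into strictly increasing
    splits; duplicates collapse, so the effective bucket count may be
    smaller than the requested numBuckets."""
    splits = [float("-inf")]
    for c in sorted(candidates):
        if c > splits[-1]:
            splits.append(c)
    splits.append(float("inf"))
    return splits
```

For the quantile candidates `[1.0, 1.0, 2.0, 2.0, 3.0, 3.0]` from the report, this yields `[-Infinity, 1.0, 2.0, 3.0, Infinity]` — four buckets instead of the requested ten, but a valid strictly increasing split array.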
[jira] [Commented] (SPARK-17086) QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data
[ https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15426254#comment-15426254 ] Yanbo Liang commented on SPARK-17086: - [~sowen] The bucket defined by [1.0, 1.0) will only receive the value 1.0; I think this scenario is OK. But if we provide the splits as {{[-Infinity, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, Infinity]}}, it will output {{[-Infinity, 1.0), 1.0, 1.0, [1.0, 2.0), 2.0, 2.0, [2.0, 3.0), 3.0, [3.0, Infinity]}}. From the document, {{QuantileDiscretizer}} takes a column with continuous features and outputs a column with binned categorical features. So I think it does not make sense if we put the same continuous value into different categorical features. Thanks. > QuantileDiscretizer throws InvalidArgumentException (parameter splits given > invalid value) on valid data > > > Key: SPARK-17086 > URL: https://issues.apache.org/jira/browse/SPARK-17086 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Barry Becker > > I discovered this bug when working with a build from the master branch (which > I believe is 2.1.0). This used to work fine when running spark 1.6.2. > I have a dataframe with an "intData" column that has values like > {code} > 1 3 2 1 1 2 3 2 2 2 1 3 > {code} > I have a stage in my pipeline that uses the QuantileDiscretizer to produce > equal weight splits like this > {code} > new QuantileDiscretizer() > .setInputCol("intData") > .setOutputCol("intData_bin") > .setNumBuckets(10) > .fit(df) > {code} > But when that gets run it (incorrectly) throws this error: > {code} > parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0, > 3.0, Infinity] > {code} > I don't think that there should be duplicate splits generated should there be? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17086) QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data
[ https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15425949#comment-15425949 ] Yanbo Liang commented on SPARK-17086: - If the number of distinct input values is less than {{numBuckets}}, we cannot split the data into that many buckets. We should figure out a proper way to identify this condition and throw a corresponding exception. > QuantileDiscretizer throws InvalidArgumentException (parameter splits given > invalid value) on valid data > > > Key: SPARK-17086 > URL: https://issues.apache.org/jira/browse/SPARK-17086 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Barry Becker > > I discovered this bug when working with a build from the master branch (which > I believe is 2.1.0). This used to work fine when running spark 1.6.2. > I have a dataframe with an "intData" column that has values like > {code} > 1 3 2 1 1 2 3 2 2 2 1 3 > {code} > I have a stage in my pipeline that uses the QuantileDiscretizer to produce > equal weight splits like this > {code} > new QuantileDiscretizer() > .setInputCol("intData") > .setOutputCol("intData_bin") > .setNumBuckets(10) > .fit(df) > {code} > But when that gets run it (incorrectly) throws this error: > {code} > parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0, > 3.0, Infinity] > {code} > I don't think that there should be duplicate splits generated should there be? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17090) Make tree aggregation level in linear/logistic regression configurable
[ https://issues.apache.org/jira/browse/SPARK-17090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15425913#comment-15425913 ] Yanbo Liang commented on SPARK-17090: - Making the aggregation depth configurable is necessary when scaling Linear/Logistic Regression to high dimensions. I vote to expose an expert param to make it configurable. > Make tree aggregation level in linear/logistic regression configurable > -- > > Key: SPARK-17090 > URL: https://issues.apache.org/jira/browse/SPARK-17090 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Seth Hendrickson >Priority: Minor > > Linear/logistic regression use treeAggregate with default aggregation depth > for collecting coefficient gradient updates to the driver. For high > dimensional problems, this can cause an OOM error on the driver. We should make > it configurable, perhaps via an expert param, so that users can avoid this > problem if their data has many features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
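The idea behind a configurable depth: instead of the driver merging every partition's gradient in one round, partial results are combined in `depth` rounds, so the final merge touches far fewer items. A small sketch of the aggregation shape (illustrative only; Spark's `treeAggregate` does this with extra shuffle stages across executors):

```python
import functools
import math

def tree_aggregate(partition_results, combine, depth=2):
    """Merge partition results in `depth` rounds; each round reduces
    groups of roughly n**(1/depth) items, limiting how many partial
    results the final (driver-side) merge has to handle."""
    items = list(partition_results)
    for _ in range(depth - 1):
        if len(items) <= 1:
            break
        fanout = max(2, int(math.ceil(len(items) ** (1.0 / depth))))
        items = [functools.reduce(combine, items[i:i + fanout])
                 for i in range(0, len(items), fanout)]
    return functools.reduce(combine, items)
```

With 64 partition gradients and `depth=2`, the final reduce sees 8 partial sums instead of 64; for high-dimensional gradient vectors that difference is what keeps the driver from running out of memory.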
[jira] [Commented] (SPARK-16993) model.transform without label column in random forest regression
[ https://issues.apache.org/jira/browse/SPARK-16993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422384#comment-15422384 ] Yanbo Liang commented on SPARK-16993: - [~dulajrajitha] I cannot reproduce the reported issue; the following code works well. {code} import org.apache.spark.ml.Pipeline import org.apache.spark.ml.feature.VectorIndexer import org.apache.spark.ml.regression.RandomForestRegressor val data = spark.read.format("libsvm").load("/Users/yliang/data/trunk0/spark/data/mllib/sample_libsvm_data.txt") val featureIndexer = new VectorIndexer() .setInputCol("features") .setOutputCol("indexedFeatures") .setMaxCategories(4) .fit(data) val trainingData = data val testData = data.drop("label") val rf = new RandomForestRegressor() .setLabelCol("label") .setFeaturesCol("indexedFeatures") val pipeline = new Pipeline() .setStages(Array(featureIndexer, rf)) val model = pipeline.fit(trainingData) val predictions = model.transform(testData) predictions.select("prediction", "features").show(5) {code} Could you tell me whether this code snippet coincides with your issue? If yes, I think it's not a bug. Thanks! > model.transform without label column in random forest regression > > > Key: SPARK-16993 > URL: https://issues.apache.org/jira/browse/SPARK-16993 > Project: Spark > Issue Type: Question > Components: Java API, ML >Reporter: Dulaj Rajitha > > I need to use a separate data set for prediction (not as shown in the example's > training data split). > But those data do not have the label column (since these are the data whose > labels need to be predicted), > but model.transform reports that the label column is missing. > org.apache.spark.sql.AnalysisException: cannot resolve 'label' given input > columns: [id,features,prediction] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17048) ML model read for custom transformers in a pipeline does not work
[ https://issues.apache.org/jira/browse/SPARK-17048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422365#comment-15422365 ] Yanbo Liang commented on SPARK-17048: - [~taras.matyashov...@gmail.com] Would you mind sharing your code or providing a simple example so that others can help you diagnose this issue? Thanks! > ML model read for custom transformers in a pipeline does not work > -- > > Key: SPARK-17048 > URL: https://issues.apache.org/jira/browse/SPARK-17048 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.0 > Environment: Spark 2.0.0 > Java API >Reporter: Taras Matyashovskyy > Labels: easyfix, features > Original Estimate: 2h > Remaining Estimate: 2h > > 0. Use Java API :( > 1. Create any custom ML transformer > 2. Make it MLReadable and MLWritable > 3. Add to pipeline > 4. Evaluate model, e.g. CrossValidationModel, and save results to disk > 5. For custom transformer you can use DefaultParamsReader and > DefaultParamsWriter, for instance > 6. Load model from saved directory > 7. All out-of-the-box objects are loaded successfully, e.g. Pipeline, > Evaluator, etc. > 8. Your custom transformer will fail with NPE > Reason: > ReadWrite.scala:447 > cls.getMethod("read").invoke(null).asInstanceOf[MLReader[T]].load(path) > In Java this only works for static methods. > As we are implementing MLReadable or MLWritable, this call should be an > instance method call. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17033) GaussianMixture should use treeAggregate to improve performance
[ https://issues.apache.org/jira/browse/SPARK-17033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-17033. - Resolution: Fixed Assignee: Yanbo Liang Fix Version/s: 2.1.0 > GaussianMixture should use treeAggregate to improve performance > --- > > Key: SPARK-17033 > URL: https://issues.apache.org/jira/browse/SPARK-17033 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Minor > Fix For: 2.1.0 > > > {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to > improve performance and scalability. In my test on a dataset with 200 features > and 1M instances, I found a 20% performance improvement. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16934) Update LogisticCostAggregator serialization code to make it consistent with LinearRegression
[ https://issues.apache.org/jira/browse/SPARK-16934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-16934. - Resolution: Fixed Assignee: Weichen Xu Fix Version/s: 2.1.0 Target Version/s: 2.1.0 > Update LogisticCostAggregator serialization code to make it consistent with > LinearRegression > > > Key: SPARK-16934 > URL: https://issues.apache.org/jira/browse/SPARK-16934 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Weichen Xu >Assignee: Weichen Xu > Fix For: 2.1.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Update LogisticCostAggregator serialization code to make it consistent with > LinearRegression -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17033) GaussianMixture should use treeAggregate to improve performance
[ https://issues.apache.org/jira/browse/SPARK-17033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-17033: Description: {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to improve performance and scalability. In my test of dataset with 200 features and 1M instance, I found there are 20% increased performance. (was: {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to improve performance and scalability. In my test of dataset with 200 features and 1M instance, I found there are 15% increased performance.) > GaussianMixture should use treeAggregate to improve performance > --- > > Key: SPARK-17033 > URL: https://issues.apache.org/jira/browse/SPARK-17033 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Yanbo Liang >Priority: Minor > > {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to > improve performance and scalability. In my test of dataset with 200 features > and 1M instance, I found there are 20% increased performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17033) GaussianMixture should use treeAggregate to improve performance
[ https://issues.apache.org/jira/browse/SPARK-17033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-17033: Description: {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to improve performance and scalability. In my test of dataset with 200 features and 1M instance, I found there are 15% increased performance. (was: {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to improve performance and scalability. In my test of dataset with 200 features and 1M instance, I found there are 20% increased performance.) > GaussianMixture should use treeAggregate to improve performance > --- > > Key: SPARK-17033 > URL: https://issues.apache.org/jira/browse/SPARK-17033 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Yanbo Liang >Priority: Minor > > {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to > improve performance and scalability. In my test of dataset with 200 features > and 1M instance, I found there are 15% increased performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17033) GaussianMixture should use treeAggregate to improve performance
[ https://issues.apache.org/jira/browse/SPARK-17033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-17033: Description: {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to improve performance and scalability. In my test of dataset with 200 features and 1M instance, I found there is 20% increased performance. (was: {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to improve performance and scalability. In my test of dataset with 200 features and 1M instance, I found there are 20% increased performance.) > GaussianMixture should use treeAggregate to improve performance > --- > > Key: SPARK-17033 > URL: https://issues.apache.org/jira/browse/SPARK-17033 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Yanbo Liang >Priority: Minor > > {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to > improve performance and scalability. In my test of dataset with 200 features > and 1M instance, I found there is 20% increased performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17033) GaussianMixture should use treeAggregate to improve performance
Yanbo Liang created SPARK-17033: --- Summary: GaussianMixture should use treeAggregate to improve performance Key: SPARK-17033 URL: https://issues.apache.org/jira/browse/SPARK-17033 Project: Spark Issue Type: Improvement Reporter: Yanbo Liang Priority: Minor {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to improve performance and scalability. In my test on a dataset with 200 features and 1M instances, I found a 20% performance improvement. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org