[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62299039 @shivaram > IMHO it would be good to have the developer API updates as well and test a couple of more pipelines before we push this out. I'll try to get a branch based on this PR ready next week for feedback. Not sure if we want to do a mega-PR though; hopefully it can be kept as a separate follow-up. > Also I am not sure I fully understand the difference between the User API and the Developer API These are loose terms; part of the "Developer" API will actually be public. E.g., Classifier will be public since it will be needed for the boosting API. But most users won't have to worry about these abstract classes, and the classes will include some private[ml] methods to make developers' lives easier. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62242562 I will make one under `org.apache.spark.examples` (though it is still under `spark`).
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62241597 This is a good point; maybe we should have an example or a test class that is not in the org.apache.spark.ml package and tries to create a pipeline stage.
Github user tomerk commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62234569 Is there anything that can be done about these two lines, which appear in every transform and fit function?

```scala
import dataset.sqlContext._
val map = this.paramMap ++ paramMap
```

Beyond being boilerplate, the second line is an issue because, at least last I checked, `++` in ParamMap is marked as `private[ml]`, so it can't be used by custom pipeline stages.
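One way this merge could be hoisted out of user code (a hypothetical sketch with toy stand-ins for `Param`/`ParamMap`, not the actual `org.apache.spark.ml` classes) is a template method on the base class that merges the embedded and passed-in maps exactly once, so concrete stages implement only a hook that receives the already-merged map:

```scala
// Toy stand-ins for illustration; the real classes live in org.apache.spark.ml.param.
case class Param[T](name: String, default: Option[T] = None)

class ParamMap(val entries: Map[Param[_], Any] = Map.empty) {
  def put[T](param: Param[T], value: T): ParamMap =
    new ParamMap(entries + (param -> value))
  def get[T](param: Param[T]): Option[T] =
    entries.get(param).map(_.asInstanceOf[T]).orElse(param.default)
  // Merge; entries in `other` win over entries in `this`.
  def ++(other: ParamMap): ParamMap = new ParamMap(entries ++ other.entries)
}

abstract class Stage {
  protected def paramMap: ParamMap // params embedded in the stage instance
  // Template method: the merge happens exactly once, here in the framework.
  final def fit(extra: ParamMap): String = fitImpl(paramMap ++ extra)
  // Custom stages implement only this; `merged` already reflects `extra`.
  protected def fitImpl(merged: ParamMap): String
}
```

With this shape a custom stage never touches `++` at all, so its visibility would no longer matter to third-party pipeline stages.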
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62230172 @shivaram @tomerk I updated this PR and removed `modelParams`. The param traits are marked as `private[ml]`. As I mentioned, I have no particular preference here. I also added a simple text classification pipeline with a (simple) tokenizer and hashingTF. For the codegen of getters/setters, I asked @heathermiller and @marmbrus. The answer is no at this time: the `@BeanProperty` logic is hard-coded in the Scala compiler.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62227735 [Test build #23077 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23077/consoleFull) for PR 3099 at commit [`7772430`](https://github.com/apache/spark/commit/77724301621df221517cd0211c06ba6cafc9f661).
* This patch **fails RAT tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `abstract class Estimator[M <: Model] extends PipelineStage with Params`
  * `abstract class Evaluator extends Identifiable`
  * `trait Identifiable extends Serializable`
  * `abstract class Model extends Transformer`
  * `class Pipeline extends Estimator[PipelineModel]`
  * `class PipelineModel(`
  * `abstract class Transformer extends PipelineStage with Params`
  * `trait CrossValidatorParams extends Params`
  * `class CrossValidator extends Estimator[CrossValidatorModel] with CrossValidatorParams with Logging`
  * `class HashingTF extends Transformer with HasInputCol with HasOutputCol`
  * `class LogisticRegression extends Estimator[LogisticRegressionModel] with LogisticRegressionParams`
  * `class StandardScaler extends Estimator[StandardScalerModel] with StandardScalerParams`
  * `class Tokenizer extends Transformer with HasInputCol with HasOutputCol`
  * `class Param[T] (`
  * `class DoubleParam(parent: Params, name: String, doc: String, default: Option[Double] = None)`
  * `class IntParam(parent: Params, name: String, doc: String, default: Option[Int] = None)`
  * `class FloatParam(parent: Params, name: String, doc: String, default: Option[Float] = None)`
  * `class LongParam(parent: Params, name: String, doc: String, default: Option[Long] = None)`
  * `class BooleanParam(parent: Params, name: String, doc: String, default: Option[Boolean] = None)`
  * `case class ParamPair[T](param: Param[T], value: T)`
  * `trait Params extends Identifiable`
  * `class ParamGridBuilder`
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62227719 [Test build #23077 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23077/consoleFull) for PR 3099 at commit [`7772430`](https://github.com/apache/spark/commit/77724301621df221517cd0211c06ba6cafc9f661).
* This patch merges cleanly.
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62229656 Thanks for the replies. I'm summarizing the discussion so far and including some replies. I think we are getting close to a good API! IMHO it would be good to have the developer API updates as well and test a couple of more pipelines before we push this out.

Dataset
1. We all seem to agree that there is a need for a higher-level Dataset API or more flexible functions in SchemaRDD for new operations. `appendCol` seems like it'll be very useful, for one.
2. We also seem to have agreement that we will have an additional, simpler Transformer API from RDD to RDD. The one using Datasets will still be there for cases where the simpler API isn't enough.

Also I am not sure I fully understand the difference between the User API and the Developer API (developers are the main users for Pipelines?).

Pipelines
1. Batching, loops etc.: @mengxr -- I'll try out the parallel pipeline + feature assembler and let you know how it goes. One other requirement that we have for large pipelines is that we lazily evaluate each batch to fit in memory and use boosting. I'll try to see if we can do something similar here and get back to you.
2. Constructors, params: @mateiz -- Completely agree on the binary compatibility problems and the trouble we had with default constructor args. However I think some members are required as a part of the class vs. being optional parameters. For example, the regularization value is definitely a parameter with a default value of 0.0 and we want to tune its value etc. On the other hand, in say [HadamardProduct](https://github.com/shivaram/spark-ml/blob/master/src/main/scala/ml/RandomSignNode.scala#L19) or even [LinearModel](https://github.com/shivaram/spark-ml/blob/master/src/main/scala/ml/MultiClassLinearRegressionEstimator.scala#L81) the signs and weights are a part of the object. You almost never want to replace these values, as doing so would result in creating a completely different pipeline. So I think there are roles for both -- and we need to be careful, especially when we are saving and loading pipelines etc.
3. Evaluators: I can see that these are useful for model selection inside estimators, but as @jkbradley said we need to figure out a better way to chain them to a pipeline. FWIW my example was very simple, just trying to compute test error for a single model and not doing any model selection.
4. Parameter setters, passing, maps etc. -- We seem to have reached a nice design point on this! I agree that the implicit mapping was a bit tedious and `map(param)` is fine.
5. Parameter traits like HasInputCol -- This is the one issue where we don't have great ideas so far, I guess. On the one hand, having too many traits seems wasteful. On the other hand, the amount of cruft code without them is also tedious. One idea I had was to try out annotations (like `@Param featureCol: String`) and auto-generating code for setters and getters. More knowledgeable Java/Scala people may know more. (@JoshRosen?)
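For reference, the trait idea in point 5 can be sketched as follows (a simplified, hypothetical model of `Params`/`HasInputCol`, not the actual classes in this PR): each trait contributes one param plus its getter/setter, and a concrete stage mixes in only the traits it needs.

```scala
// Simplified stand-ins for illustration only; the real traits carry Param
// objects with documentation strings rather than a string-keyed map.
trait Params {
  private var values = Map.empty[String, Any]
  protected def set(name: String, value: Any): this.type = { values += name -> value; this }
  protected def get[T](name: String, default: T): T =
    values.getOrElse(name, default).asInstanceOf[T]
}

trait HasInputCol extends Params {
  def setInputCol(value: String): this.type = set("inputCol", value)
  def getInputCol: String = get("inputCol", "input")
}

trait HasOutputCol extends Params {
  def setOutputCol(value: String): this.type = set("outputCol", value)
  def getOutputCol: String = get("outputCol", "output")
}

// A stage picks up both params; chained setters keep the concrete type.
class TokenizerSketch extends HasInputCol with HasOutputCol
```

The trade-off discussed above is visible here: the setters come for free, but any doc comment on `setInputCol` is shared by every stage that mixes the trait in.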
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62227737 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23077/ Test FAILed.
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62223666 On the parameters vs constructors, the problem with default args to constructors is that they make it very difficult to add settings to an algorithm. Such changes will break binary compatibility in Scala and may also break calls to the algorithm if you change the order. In the current MLlib API I believe all the constructors are package-private because of this. The other thing with setters is that the API looks more similar in Java to the other two languages. This makes it easier for people to learn Spark by grabbing snippets of code and moving them across languages. Of course it has to be weighed against utility though, if this stuff gets out of hand.
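The evolvability point can be illustrated with a small sketch (a hypothetical class, not the real MLlib API): because each setting is a chained setter rather than a constructor argument, a later release can add `setTol` below without changing any existing signature, so code compiled against the old version keeps linking.

```scala
// Hypothetical sketch of the setter pattern under discussion.
class LogisticRegressionSketch {
  private var maxIter: Int = 100
  private var regParam: Double = 0.0
  private var tol: Double = 0.001 // added later without breaking old callers

  def setMaxIter(value: Int): this.type = { maxIter = value; this }
  def setRegParam(value: Double): this.type = { regParam = value; this }
  def setTol(value: Double): this.type = { tol = value; this }

  // For illustration: show the effective configuration.
  def describe: String = s"maxIter=$maxIter regParam=$regParam tol=$tol"
}
```

As noted above, `new LogisticRegressionSketch().setMaxIter(50).setRegParam(0.1)` also reads nearly identically in Java and Scala, unlike default constructor arguments.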
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62207734 I'd like to second the thanks to you both for trying out the new API! Some thoughts:

@shivaram About your Dataset API comments: As @mengxr said, I am planning several abstractions which should help with the boilerplate. I agree with your proposal for sticking with the familiar RDD[MyType] API where possible and letting abstractions handle the boilerplate of working with SchemaRDD. When that is not possible, I still hope to provide some helper functions to reduce boilerplate. I have this set of classes partly sketched out and will send a WIP PR once this PR gets merged.

@shivaram About your Pipelines API comments:
* Loops in a pipeline: What @mengxr suggested might work for the FFT thing, but general pipelines with cycles, etc. are definitely future work.
* Parameters vs. constructors: Instinctively, I agree about having at least some parameters specified in a constructor, especially when they are required parameters (e.g., the Estimator for CrossValidation). However, @mengxr convinced me that it makes things difficult. E.g., for CrossValidation, you really don't want a CV instance to be tied to a particular estimator, since you may want to run CV to choose between several Estimators.
* Chaining evaluators to a pipeline: Initially, the two ways to get evaluations will be to look at Transformers created by fitting Estimators (to see training evaluation metrics) and to compute metrics on your own using the new columns in the SchemaRDD produced by transform (to get test metrics). Later on, it would be great to allow users to insert Evaluators into Pipelines, to compute custom metrics more easily.

@tomerk About a few comments:
* "There are a lot of parameter traits": I too am ambivalent here. It may save a little code duplication, but may also discourage people from writing customized documentation.
Github user tomerk commented on a diff in the pull request: https://github.com/apache/spark/pull/3099#discussion_r20034471

--- Diff: mllib/src/main/scala/org/apache/spark/ml/example/StandardScaler.scala ---

```diff
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.example
+
+import org.apache.spark.ml._
+import org.apache.spark.ml.param._
+import org.apache.spark.mllib.feature
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.sql.SchemaRDD
+import org.apache.spark.sql.catalyst.analysis.Star
+import org.apache.spark.sql.catalyst.dsl._
+import org.apache.spark.sql.catalyst.expressions.Row
+
+class StandardScaler extends Estimator[StandardScalerModel] with HasInputCol {
+
+  def setInputCol(value: String): this.type = { set(inputCol, value); this }
+
+  override val modelParams: StandardScalerModelParams = new StandardScalerModelParams {}
+
+  override def fit(dataset: SchemaRDD, paramMap: ParamMap): StandardScalerModel = {
+    import dataset.sqlContext._
+    val map = this.paramMap ++ paramMap
```

--- End diff --

This bit of boilerplate that appears in every transform and fit implementation should probably be moved into the interface somehow.
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62199983 @tomerk

> At @shivaram's suggestion, I started porting over a simple text classifier pipeline that was already using an Estimator/Transformer abstraction of RDD[U] to RDD[V] transforms to this interface. The almost-complete port (the imports got messed up when moving files around) can be found at shivaram/spark-ml@522aec7.

Thanks for testing on the newsgroup20 pipeline!

> The trickiest bit by far was all of the implicit conversions. I ended up needing to use several types of implicit conversion imports (case class -> schema RDD, spark sql dsl, parameter map, etc.) They also got mysteriously deleted by the IDE as I moved files between projects. I ended up having to copy and paste these whenever appropriate because I couldn't keep track of them.

I removed implicitMapping because I found that `map(inputCols)` is actually shorter than `(inputCols: String)`. As for Scala IDEs, I don't think we can trust them to translate code.

> Like Shivaram, I'm also not familiar with the Spark SQL dsl, so here I also had to copy and paste code. It's unclear what syntax is valid and what isn't. For example, is saying "as outputCol" enough, or is "as Symbol(outputCol)" required?

`as(String)` was recently added to Spark master. If you are not on the latest version, you need `as Symbol(outputCol)`.

> There is a lot of boilerplate code. It was easier to write the Transformers in the form RDD[U] to RDD[V] instead of SchemaRDD to SchemaRDD, so I fully agree with Shivaram on that front. Potentially, certain interfaces along those lines (iterator to iterator transformers that can be applied to RDDs using mappartitions) could make it easier to have transformers not depend on local Spark Contexts to execute.

We will keep that option. Note that the strongly-typed approach will face trouble when we need to deal with extra columns, e.g., event ids, or support more features later, e.g., weighted instances.

> I found the parameter mapping in estimators fairly verbose, I like Shivaram's idea of having the estimators pass everything to the transformers no matter what.

Yes, I like that idea too.

> Estimators requiring the transformers they output to extend Model didn't make sense to me. Certain estimators, such as to choose only the most frequent tokens in a collection to keep for each document, don't seem like they should output models. On that front, should it be required for estimators to specify the type of transformer they output? It can be convenient sometimes to just inline an anonymous Transformer to output without making it a top-level class.

The set of most frequent tokens could be viewed as a descriptive model in your case. I did try estimators without generic model types, but I don't remember what went wrong that made me switch to generics.

> There are a lot of parameter traits: HasRegParam, HasMaxIter, HasScoreCol, HasFeatureCol Does it make sense to have this many specific parameter traits if we still have to maintain boilerplate setters code for Java anyway?

I have no preference on this one. I put those in before I realized the problem with Java. I marked those as `private[ml]` in the current version and I'm okay with removing them completely.
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62196531

> @mengxr @jkbradley I tried to port one of our simple image processing pipelines to the new interface today and the code for this is at https://github.com/shivaram/spark-ml/blob/master/src/main/scala/ml/MnistRandomFFTPipeline.scala .

Thanks for testing with a real pipeline!

> Note that I tried to write this as an application layer on top of Spark as I think our primary goal is to enable custom pipelines, not just those using our provided components.

Exactly.

> But as the SQL component itself is pretty new, I think my skill set maybe representative of the average Spark application developer.

We are at the stage of learning Spark SQL as well. Note that this component will be marked as alpha.

> Because we are using SchemaRDDs directly I found that there is 5-10 lines of boiler plate code in every transform or fit function. This is usually selecting one or more input columns and getting out an RDD which can be processed and then adding an output column at the end. It would be good if we can have wrappers which do this automatically (I have one proposal below)

The Scala DSL is new. @liancheng helped add the implicit `.call` method to UDF. I think when we fully understand the operations we need, we can work with @marmbrus to add those wrappers.

> The UDF interface or Row => Row feels pretty restrictive. For numerical computations I often want to do things like mapPartitions or use broadcast variables etc. In those cases I am not sure how to directly use UDFs.

We are not restricted by UDFs. As you commented, we can create a new RDD and zip it back.

> And the main reason I used UDFs was to handle the semantics of appending an output column. Is there any other API other than using select Star(None) ... ? It'd be great if we had something like dataset.zip() which is easier / more flexible ?

I think the only boilerplate code here is `Star(None)`. The input columns and output column are required. I'm sure it is easy to add operations like

~~~
dataset.appendCol(predict('features) as 'prediction)
~~~

> One proposal I had for simplifying things is to define another transformer API which is less general but easier to use (we can keep the existing ones as base classes).

I like this idea. We will also keep the `predict` and `predictValues` methods that operate on normal RDDs. Note that the main focus of this PR is user-facing APIs: how users create pipelines and specify parameters. @jkbradley is working on the developer-side APIs.

Pipelines API

> Multiple batches of transformers -- One of the main problems I ran into was that I couldn't see an easy way to have loops inside a pipeline.

We can create a transformer like the following:

~~~
val featurePipeline = ... // the pipeline you created
val par = new ParallelPipeline()
  .setPipeline(featurePipeline)
  .setParamMaps(Array(
    ParamMap(softMax.outputCol -> "f0"),
    ParamMap(softMax.outputCol -> "f1"),
    ParamMap(softMax.outputCol -> "f2")))
val fvAssembler = new FeatureVectorAssembler()
  .setInputCols(Array("f0", "f1", "f2"))
val pipeline = new Pipeline()
  .setStages(Array(par, fvAssembler, linearSolver))
~~~

> Passing parameters in Estimators -- I finally understood the problem you were describing originally! However I solved it slightly differently than having 2 sets of parameters (this is in MultiClassLinearRegressionEstimator.scala). I made the parent Estimator class take in as params all the parameters required for the child Transformer class as well and then passed them along. I thought this was cleaner than having modelParams and params as two different members.

Yes, this is cleaner. So we separate Estimator from the output Model but both share the same set of params. I like this idea, and then we don't need `trainingParams` :)

> Parameters vs. Constructors -- While I agree with your comment that constructor arguments make binary compatibility tricky, I ran into a couple of cases where creating a transformer without a particular argument doesn't make sense. Will we have a guideline that things which won't vary should be constructor arguments? I think it'll be good to come up with some distinction that makes it clear for the programmer.

Could you list a few examples? Having many parameters is common for ML components. I feel it is hard to decide what parameters won't vary. My understanding is that `LogisticRegression` is just a placeholder with parameters to be configured, and validation will happen at runtime (not added yet).

> Chaining evaluators to a Pipeline -- I wrote a simple evaluator that computed test error at the end but wasn't sure how to chain this to a pipeline.
Github user tomerk commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62112717 At @shivaram's suggestion, I started porting over a simple text classifier pipeline that was already using an Estimator/Transformer abstraction of RDD[U] to RDD[V] transforms to this interface. The almost-complete port (the imports got messed up when moving files around) can be found at https://github.com/shivaram/spark-ml/commit/522aec73172b28a4bc1b22df030a459fddbd93dd. Beyond what Shivaram already mentioned, here are my thoughts:

1. The trickiest bit by far was all of the implicit conversions. I ended up needing to use several types of implicit conversion imports (case class -> SchemaRDD, Spark SQL DSL, parameter map, etc.). They also got mysteriously deleted by the IDE as I moved files between projects. I ended up having to copy and paste these whenever appropriate because I couldn't keep track of them.
2. Like Shivaram, I'm also not familiar with the Spark SQL DSL, so here I also had to copy and paste code. It's unclear what syntax is valid and what isn't. For example, is saying "as outputCol" enough, or is "as Symbol(outputCol)" required?
3. There is a lot of boilerplate code. It was easier to write the Transformers in the form RDD[U] to RDD[V] instead of SchemaRDD to SchemaRDD, so I fully agree with Shivaram on that front. Potentially, certain interfaces along those lines (iterator-to-iterator transformers that can be applied to RDDs using mapPartitions) could make it easier to have transformers not depend on local SparkContexts to execute.
4. I found the parameter mapping in estimators fairly verbose; I like Shivaram's idea of having the estimators pass everything to the transformers no matter what.
5. Estimators requiring the transformers they output to extend Model didn't make sense to me. Certain estimators, such as one that chooses only the most frequent tokens in a collection to keep for each document, don't seem like they should output models. On that front, should it be required for estimators to specify the type of transformer they output? It can be convenient sometimes to just inline an anonymous Transformer to output without making it a top-level class.
6. There are a lot of parameter traits: HasRegParam, HasMaxIter, HasScoreCol, HasFeatureCol. Does it make sense to have this many specific parameter traits if we still have to maintain boilerplate setter code for Java anyway?
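The anonymous-Transformer idea in point 5 can be made concrete with a sketch (toy stand-ins for Estimator/Transformer over plain collections, not the PR's SchemaRDD-based API; `FrequentTokenFilter` is a hypothetical name):

```scala
// Toy abstractions to illustrate returning an anonymous transformer from fit
// without declaring a top-level model class.
abstract class Transformer { def transform(tokens: Seq[String]): Seq[String] }
abstract class Estimator { def fit(docs: Seq[Seq[String]]): Transformer }

// An estimator that keeps only tokens appearing at least `minCount` times.
class FrequentTokenFilter(minCount: Int) extends Estimator {
  def fit(docs: Seq[Seq[String]]): Transformer = {
    val counts = docs.flatten.groupBy(identity).map { case (t, xs) => (t, xs.size) }
    val keep = counts.filter(_._2 >= minCount).keySet
    // The fitted "model" is just an anonymous Transformer closing over `keep`.
    new Transformer {
      def transform(tokens: Seq[String]): Seq[String] = tokens.filter(keep)
    }
  }
}
```

Whether the closed-over token set counts as a "model" is exactly the question raised above; the sketch only shows that the inline form is syntactically cheap.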
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62105053 @mengxr @jkbradley I tried to port one of our simple image processing pipelines to the new interface today and the code for this is at https://github.com/shivaram/spark-ml/blob/master/src/main/scala/ml/MnistRandomFFTPipeline.scala . Note that I tried to write this as an application layer on top of Spark as I think our primary goal is to enable custom pipelines, not just those using our provided components. I had some thoughts about the Pipelines and Dataset APIs while writing this code and I've detailed them below. It'll be great if you could also look over the code to see if I am missing something. Dataset API I should preface this section by saying that my knowledge of SparkSQL and SchemaRDD is very limited. I haven't written much code using this and I referred to your code + the programming guide for most part. But as the SQL component itself is pretty new, I think my skill set maybe representative of the average Spark application developer. That said please let me know if I missed something obvious. 1. Because we are using SchemaRDDs directly I found that there is 5-10 lines of boiler plate code in every transform or fit function. This is usually selecting one or more input columns and getting out an RDD which can be processed and then adding an output column at the end. It would be good if we can have wrappers which do this automatically (I have one proposal below) 2. The UDF interface or Row => Row feels pretty restrictive. For numerical computations I often want to do things like mapPartitions or use broadcast variables etc. In those cases I am not sure how to directly use UDFs. 3. And the main reason I used UDFs was to handle the semantics of appending an output column. Is there any other API other than using `select Star(None) ...` ? It'd be great if we had something like dataset.zip() which is easier / more flexible ? 
One proposal I had for simplifying things is to define another transformer API which is less general but easier to use (we can keep the existing ones as base classes). This API might look something like this:
```
abstract class SimpleTransformer[U, V] extends Transformer with HasInputCol with HasOutputCol {
  def transform(input: RDD[U]): RDD[V]
}
```
We would then have wrapper code that selects the input column, casts it to `U`, and similarly takes the `V` output and adds it as the outputCol of the dataset. There are a couple of advantages with this: first, the RDD API is flexible, well documented, and well understood by programmers. Second, this also provides a good boundary for unit testing and gives the programmer some type safety within their transformer.

Pipelines API

1. Multiple batches of transformers -- One of the main problems I ran into was that I couldn't see an easy way to have loops inside a pipeline. By loops I mean repeating some part of the pipeline many times. In this example I wanted to run RandomSignNode followed by FFT a number of times and then concatenate their output as features for Regression. There is a similar pattern in the pipeline described in [Evan's blog post](https://amplab.cs.berkeley.edu/2014/09/24/4643/).
2. Passing parameters in Estimators -- I finally understood the problem you were describing originally! However, I solved it slightly differently than having 2 sets of parameters (this is in MultiClassLinearRegressionEstimator.scala). I made the parent Estimator class take in as params all the parameters required for the child Transformer class as well, and then passed them along. I thought this was cleaner than having modelParams and params as two different members.
3. Parameters vs. Constructors -- While I agree with your comment that constructor arguments make binary compatibility tricky, I ran into a couple of cases where creating a transformer without a particular argument doesn't make sense.
Will we have a guideline that things which won't vary should be constructor arguments? I think it'll be good to come up with some distinction that makes it clear for the programmer.
4. Chaining evaluators to a Pipeline -- I wrote a simple evaluator that computed test error at the end but wasn't sure how to chain this to a pipeline. Is this something we don't intend to support?
5. Strings, Vectors etc. -- There were a couple of minor things that weren't ideal, but I think we will have to live with them. With every node having an input and output column, there are lots of strings floating around which the user needs to get right. Also, going from Spark's Vector / Matrix classes to Breeze and back to Spark classes gets tedious. We could add commonly used operators like add, multiply etc. to our public APIs and that might help a bit.
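shivaram's proposed wrapper could look roughly like the following Spark-free Python sketch, where a plain list of dict rows stands in for a SchemaRDD and all names (`SimpleTransformer`, `transform_values`, `Doubler`) are hypothetical illustrations, not the proposed API:

```python
class SimpleTransformer:
    """Hypothetical wrapper: handles input-column selection and
    output-column appending, so a concrete transformer only
    implements a typed function over the input column's values."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def transform_values(self, values):
        raise NotImplementedError  # the user's typed logic goes here

    def transform(self, dataset):
        # dataset: list of dict rows, standing in for a SchemaRDD
        inputs = [row[self.input_col] for row in dataset]   # select input col
        outputs = self.transform_values(inputs)             # user logic
        return [dict(row, **{self.output_col: out})         # append output col
                for row, out in zip(dataset, outputs)]

class Doubler(SimpleTransformer):
    def transform_values(self, values):
        return [2 * v for v in values]

rows = [{"x": 1}, {"x": 2}]
out = Doubler("x", "y").transform(rows)
# out == [{"x": 1, "y": 2}, {"x": 2, "y": 4}]
```

The unit-testing boundary shivaram mentions falls naturally on `transform_values`, which can be tested without any dataset machinery.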
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62034301 [Test build #23011 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23011/consoleFull) for PR 3099 at commit [`2d040b3`](https://github.com/apache/spark/commit/2d040b3b99be4192b6301452f04c1a758b0e6a84).
* This patch **fails RAT tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `abstract class Estimator[M <: Model] extends PipelineStage with Params`
  * `abstract class Evaluator extends Identifiable`
  * `trait Identifiable extends Serializable`
  * `abstract class Model extends Transformer`
  * `class Pipeline extends Estimator[PipelineModel]`
  * `class PipelineModel(val transformers: Array[Transformer]) extends Model`
  * `abstract class Transformer extends PipelineStage with Params`
  * `class CrossValidator extends Estimator[CrossValidatorModel] with Params`
  * `class CrossValidatorModel(bestModel: Model, metric: Double) extends Model`
  * `class LogisticRegressionModel(`
  * `class StandardScaler extends Estimator[StandardScalerModel] with HasInputCol`
  * `trait StandardScalerModelParams extends Params with HasInputCol with HasOutputCol`
  * `class DoubleParam(parent: Params, name: String, doc: String, default: Option[Double] = None)`
  * `class IntParam(parent: Params, name: String, doc: String, default: Option[Int] = None)`
  * `case class ParamPair[T](param: Param[T], value: T)`
  * `trait Params extends Identifiable`
  * `class ParamGridBuilder`
  * `trait HasRegParam extends Params`
  * `trait HasMaxIter extends Params`
  * `trait HasFeaturesCol extends Params`
  * `trait HasLabelCol extends Params`
  * `trait HasScoreCol extends Params`
  * `trait HasThreshold extends Params`
  * `trait HasMetricName extends Params`
  * `trait HasInputCol extends Params`
  * `trait HasOutputCol extends Params`
--- If your project is set up for it, you can reply to this email and have your reply
appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62034255 [Test build #23011 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23011/consoleFull) for PR 3099 at commit [`2d040b3`](https://github.com/apache/spark/commit/2d040b3b99be4192b6301452f04c1a758b0e6a84). * This patch merges cleanly.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62034309 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23011/ Test FAILed.
Github user etrain commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61978539 Oh, and I'm not saying "let's not support PMML" - I'm saying let's have a sensible way of handing off models to other JVM programs that doesn't involve writing to XML in the middle.
Github user etrain commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61977531 I've got several other comments on this PR - mostly good, and will leave some more detailed comments in a bit. TL;DR - What's wrong with Java Serialization? But - PMML support seems a little like overkill here. I admit I'm fairly ignorant to the details of PMML, but my understanding is that it is designed to facilitate transfer of models between languages - e.g. SAS to Java to R, etc. While I'm sure it's general purpose enough to capture complex pipelines, I'd be surprised if it can do so efficiently. @jegonzal and @dcrankshaw are talking about a java-based runtime environment for serving complex pipelines trained via spark and MLlib. Spark is already pretty good about shipping JVM code around to remote sites and executing it - it does so by serializing and shipping standard Java objects, and I don't see why we should follow a very different pattern here. This design has been part of MLlib since day 1. I don't want to speak for these two, but I don't think they have a problem having a runtime dependency on MLlib or some other JVM-based machine learning library, their main issue is that they don't want to have to call into a batch system (aka spark) to execute their models. @jkbradley - I agree that some models or transformers are going to require a lot of state, and potentially distributed computation, but this should be the exception, not the rule. In general, an Estimator should compute a fairly small object (Transformer) which is small enough to be passed around and doesn't need cluster resources to run. In the case of outlier removal, for example, I'd imagine that the Estimator would take several passes over the data and compute some sufficient statistics to remove new points. For cases like ALS, where the model is really big, this is exactly where @jegonzal and @dcrankshaw's research comes in. 
To make an analogy - just as I'd happily use Spark to compute a distributed inverted index, I certainly wouldn't use it to do point queries on that index. So some interface for transmitting this large distributed state to a system more prepared to answer point queries based on that state is required. At any rate - this is great stuff, I just want to make sure we don't get lost in the weeds of supporting the general case at the expense of streamlined support for the obvious case.
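etrain's point - that an Estimator usually makes passes over the data but returns a small, self-contained Transformer that can be shipped and run without cluster resources - can be sketched Spark-free in Python, using his outlier-removal example. All names are hypothetical, and `pickle` stands in for the plain JVM serialization he advocates:

```python
import pickle

class OutlierRemoverModel:
    """Tiny transformer: holds only sufficient statistics, so it is
    cheap to serialize and needs no cluster to apply."""
    def __init__(self, mean, std, k=3.0):
        self.mean, self.std, self.k = mean, std, k

    def transform(self, values):
        # Keep points within k standard deviations of the training mean.
        return [v for v in values if abs(v - self.mean) <= self.k * self.std]

class OutlierRemoverEstimator:
    def fit(self, values):
        # In Spark this would be a couple of passes over an RDD;
        # plain Python stands in for the distributed computation.
        n = len(values)
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / n
        return OutlierRemoverModel(mean, var ** 0.5)

model = OutlierRemoverEstimator().fit([1.0, 1.1, 0.9, 1.0])
payload = pickle.dumps(model)      # small object, cheap to hand off
served = pickle.loads(payload)     # e.g. in a separate serving process
```

For the genuinely large models (ALS factors, inverted indexes), this hand-off pattern breaks down, which is exactly the serving-system research he points to.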
Github user vruusmann commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61949888 @jegonzal PMML is essentially a domain-specific language (DSL) for the domain of predictive analytics applications. It is commonly used only for the representation of statistical and data mining models (i.e. simple "model export" functionality). But it can do much more, including chaining multiple models together into a so-called model ensemble, implementing complex data pre- and post-processing logic, adding annotations and machine-readable documentation to the solution, and so on. I have a separate project in the works that deals specifically with translating memory- and CPU-intensive parts of PMML documents into Java source code for more efficient execution.
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61935382 Both of those solutions seem reasonable. However, I wonder if PMML is sufficient to express complex model pipelines? The latter solution seems much more interesting and would perhaps be something @dcrankshaw might want to investigate.
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61934791 @jegonzal @jkbradley (This may be slightly off-topic.) To serve models, which are usually long pipelines in practice, I'm thinking of the following:
1. Serialize to a portable format, like PMML. For example, Google Prediction API accepts PMML with transformations. @selvinsource is working on #3062 and @srowen is helping.
2. Code generation. This is less portable than PMML, but faster and maybe easier to implement. For example, we can do the following:
~~~
val model = pipeline.fit(dataset)
val code = model.generateCode(schema)
~~~
Then the compiled code can make predictions on normal Java input, without dependencies on Spark.
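To make option 2 concrete, here is a Spark-free Python sketch of the code-generation idea. `FittedLinearModel` and `generate_code` are hypothetical stand-ins, not the proposed MLlib API: a fitted model emits standalone scoring source with its coefficients baked in, and the compiled result has no dependency on the training library.

```python
class FittedLinearModel:
    """Hypothetical fitted linear model used to illustrate codegen."""
    def __init__(self, weights, intercept):
        self.weights, self.intercept = weights, intercept

    def generate_code(self, feature_names):
        # Emit a self-contained scoring function as source text;
        # the learned coefficients are inlined as literals.
        terms = " + ".join(
            f"{w!r} * row[{name!r}]"
            for w, name in zip(self.weights, feature_names))
        return f"def predict(row):\n    return {self.intercept!r} + {terms}\n"

model = FittedLinearModel(weights=[2.0, -1.0], intercept=0.5)
src = model.generate_code(["age", "height"])
ns = {}
exec(compile(src, "<generated>", "exec"), ns)  # the "compile" step
score = ns["predict"]({"age": 3.0, "height": 2.0})
# score == 0.5 + 2.0*3.0 + (-1.0)*2.0 == 4.5
```

In the real proposal the emitted source would be Java and compiled with a JVM compiler, but the shape is the same: generation happens once at export time, and scoring needs only the generated artifact.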
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61934417 @jkbradley Right now we are planning to serve linear combinations of models derived from MLlib (currently latent factor models, naive bayes, and decision trees). Though I agree that in some cases (e.g., latent factor models) serving is less trivial (that's the research). Still, it would be good to think of models as output to be consumed by systems beyond Spark, should people want to do things beyond computing test error; admittedly that's all I ever do :-).
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61932628 @jegonzal Do you feel like most model classes should be serve-able by design? Or would you recommend having lighter model classes which can be built from regular model classes, serialized, and shipped for serving? When you have examples of serve-able models available, it would be great to see them to plan ahead for Spark. Thanks!
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61932391 @mengxr @shivaram Adding to the conversation...

> In addition to the SchemaRDD based transformer API, it would be good to have a single Row transformation API too.

I agree, but per-Row transformations may need to be kept to subclasses of Transformer. E.g., a Transformer which filters out outlier instances might only be able to transform a full RDD, whereas most Classifier/Regressor concepts can operate per-Row. I plan to do a PR for some of these abstractions after Pipelines are merged in. I think @mengxr covered the other answers so far.
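The distinction jkbradley draws - per-Row transforms as a subclass, with whole-dataset transforms at the base - can be sketched in plain Python; all class names here are hypothetical illustrations, not the proposed API:

```python
class Transformer:
    def transform(self, rows):
        """Whole-dataset transform: may need global information."""
        raise NotImplementedError

class RowTransformer(Transformer):
    """Subclass for transforms that only need one row at a time."""
    def transform_row(self, row):
        raise NotImplementedError

    def transform(self, rows):
        # A per-row transform trivially lifts to the dataset level.
        return [self.transform_row(r) for r in rows]

class Thresholder(RowTransformer):
    """Classifier-style: each row is labeled independently."""
    def transform_row(self, row):
        return dict(row, label=1 if row["score"] > 0.5 else 0)

class TopHalfFilter(Transformer):
    """Outlier-filter-style: needs the full dataset to pick a cutoff,
    so it cannot implement a per-row interface."""
    def transform(self, rows):
        cutoff = sorted(r["score"] for r in rows)[len(rows) // 2]
        return [r for r in rows if r["score"] >= cutoff]
```

The point of the hierarchy is that boosting or classifier abstractions can demand the stricter `RowTransformer` contract, while the pipeline itself only relies on the dataset-level `transform`.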
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61931724 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22991/ Test FAILed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61931722 [Test build #22991 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22991/consoleFull) for PR 3099 at commit [`fd751fc`](https://github.com/apache/spark/commit/fd751fc038bdab9367d9fd07635e9ec30526f7a2).
* This patch **fails RAT tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `abstract class Estimator[M <: Model] extends PipelineStage with Params`
  * `abstract class Evaluator extends Identifiable`
  * `trait Identifiable extends Serializable`
  * `abstract class Model extends Transformer`
  * `class Pipeline extends Estimator[PipelineModel]`
  * `class PipelineModel(val transformers: Array[Transformer]) extends Model`
  * `abstract class Transformer extends PipelineStage with Params`
  * `trait HasEstimator extends Params`
  * `trait HasEvaluator extends Params`
  * `trait HasEstimatorParamMaps extends Params`
  * `class CrossValidatorModel(bestModel: Model, metric: Double) extends Model`
  * `class LogisticRegressionModel(`
  * `class StandardScaler extends Transformer with Params with HasInputCol with HasOutputCol`
  * `class DoubleParam(parent: Params, name: String, doc: String, default: Option[Double] = None)`
  * `class IntParam(parent: Params, name: String, doc: String, default: Option[Int] = None)`
  * `case class ParamPair[T](param: Param[T], value: T)`
  * `trait Params extends Identifiable`
  * `class ParamGridBuilder`
  * `trait HasRegParam extends Params`
  * `trait HasMaxIter extends Params`
  * `trait HasFeaturesCol extends Params`
  * `trait HasLabelCol extends Params`
  * `trait HasScoreCol extends Params`
  * `trait HasThreshold extends Params`
  * `trait HasMetricName extends Params`
  * `trait HasInputCol extends Params`
  * `trait HasOutputCol extends Params`
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61931707 [Test build #22991 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22991/consoleFull) for PR 3099 at commit [`fd751fc`](https://github.com/apache/spark/commit/fd751fc038bdab9367d9fd07635e9ec30526f7a2). * This patch merges cleanly.
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/3099#discussion_r19927481
--- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala (Apache license header omitted) ---
```
package org.apache.spark.ml

import scala.annotation.varargs

import org.apache.spark.ml.param.{ParamMap, ParamPair, Params}
import org.apache.spark.sql.SchemaRDD

/**
 * Abstract class for estimators that fits models to data.
 */
abstract class Estimator[M <: Model] extends PipelineStage with Params {

  /**
   * Fits a single model to the input data with optional parameters.
   *
   * @param dataset input dataset
   * @param paramPairs optional list of param pairs, overwrite embedded params
   * @return fitted model
   */
  @varargs
  def fit(dataset: SchemaRDD, paramPairs: ParamPair[_]*): M = {
    val map = new ParamMap()
    paramPairs.foreach(map.put(_))
    fit(dataset, map)
  }

  /**
   * Fits a single model to the input data with provided parameter map.
   *
   * @param dataset input dataset
   * @param paramMap parameters
   * @return fitted model
   */
  def fit(dataset: SchemaRDD, paramMap: ParamMap): M

  /**
   * Fits multiple models to the input data with multiple sets of parameters.
   * The default implementation uses a for loop on each parameter map.
   * Subclasses could overwrite this to optimize multi-model training.
   *
   * @param dataset input dataset
   * @param paramMaps an array of parameter maps
   * @return fitted models, matching the input parameter maps
   */
  def fit(dataset: SchemaRDD, paramMaps: Array[ParamMap]): Seq[M] = { // how to return an array?
    paramMaps.map(fit(dataset, _))
```
--- End diff --
.toArray ?
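The three `fit` overloads quoted above (varargs param pairs, one ParamMap, an array of ParamMaps with a default one-model-per-map loop) can be sketched Spark-free in Python; the names here are hypothetical, and `fit_multi` stands in for the array overload since Python has no overloading:

```python
class Estimator:
    def fit(self, dataset, param_map=None):
        """Fit one model; param_map (a dict) overrides embedded params."""
        raise NotImplementedError

    def fit_multi(self, dataset, param_maps):
        # Default implementation: one independent fit per parameter map.
        # A subclass could override this to share work across models
        # (the "multi-model training" optimization in the javadoc).
        return [self.fit(dataset, m) for m in param_maps]

class MeanShift(Estimator):
    """Toy estimator: the 'model' is just a shifted mean."""
    def fit(self, dataset, param_map=None):
        shift = (param_map or {}).get("shift", 0.0)
        return sum(dataset) / len(dataset) + shift

est = MeanShift()
models = est.fit_multi([1.0, 3.0], [{"shift": 0.0}, {"shift": 1.0}])
# models == [2.0, 3.0]
```

The point of keeping the array overload on the base class is exactly what the comment thread raises: a CrossValidator can hand all its candidate maps to one call and let a smart estimator amortize the shared work.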
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61930130 [Test build #22987 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22987/consoleFull) for PR 3099 at commit [`3f810cd`](https://github.com/apache/spark/commit/3f810cd31f451a9b0019cd8f69e9fbe400dd4269).
* This patch **fails RAT tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `abstract class Estimator[M <: Model] extends PipelineStage with Params`
  * `abstract class Evaluator extends Identifiable`
  * `trait Identifiable extends Serializable`
  * `abstract class Model extends Transformer`
  * `class Pipeline extends Estimator[PipelineModel]`
  * `class PipelineModel(val transformers: Array[Transformer]) extends Model`
  * `abstract class Transformer extends PipelineStage with Params`
  * `trait HasEstimator extends Params`
  * `trait HasEvaluator extends Params`
  * `trait HasEstimatorParamMaps extends Params`
  * `class CrossValidatorModel(bestModel: Model, metric: Double) extends Model`
  * `class LogisticRegressionModel(`
  * `class StandardScaler extends Transformer with Params with HasInputCol with HasOutputCol`
  * `class DoubleParam(parent: Params, name: String, doc: String, default: Option[Double] = None)`
  * `class IntParam(parent: Params, name: String, doc: String, default: Option[Int] = None)`
  * `case class ParamPair[T](param: Param[T], value: T)`
  * `trait Params extends Identifiable`
  * `class ParamGridBuilder`
  * `trait HasRegParam extends Params`
  * `trait HasMaxIter extends Params`
  * `trait HasFeaturesCol extends Params`
  * `trait HasLabelCol extends Params`
  * `trait HasScoreCol extends Params`
  * `trait HasThreshold extends Params`
  * `trait HasMetricName extends Params`
  * `trait HasInputCol extends Params`
  * `trait HasOutputCol extends Params`
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61930125 [Test build #22987 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22987/consoleFull) for PR 3099 at commit [`3f810cd`](https://github.com/apache/spark/commit/3f810cd31f451a9b0019cd8f69e9fbe400dd4269). * This patch merges cleanly.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61930133 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22987/ Test FAILed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61929839 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22986/ Test FAILed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61929837 [Test build #22986 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22986/consoleFull) for PR 3099 at commit [`9f408ed`](https://github.com/apache/spark/commit/9f408edef1c7f23323bf8a9d86ae69cbbda7002e).
* This patch **fails RAT tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `abstract class Estimator[M <: Model] extends PipelineStage with Params`
  * `abstract class Evaluator extends Identifiable`
  * `trait Identifiable extends Serializable`
  * `abstract class Model extends Transformer`
  * `class Pipeline extends Estimator[PipelineModel]`
  * `class PipelineModel(val transformers: Array[Transformer]) extends Model`
  * `abstract class Transformer extends PipelineStage with Params`
  * `trait HasEstimator extends Params`
  * `trait HasEvaluator extends Params`
  * `trait HasEstimatorParamMaps extends Params`
  * `class CrossValidatorModel(bestModel: Model, metric: Double) extends Model`
  * `class LogisticRegressionModel(`
  * `class StandardScaler extends Transformer with Params with HasInputCol with HasOutputCol`
  * `class DoubleParam(parent: Params, name: String, doc: String, default: Option[Double] = None)`
  * `class IntParam(parent: Params, name: String, doc: String, default: Option[Int] = None)`
  * `case class ParamPair[T](param: Param[T], value: T)`
  * `trait Params extends Identifiable`
  * `class ParamGridBuilder`
  * `trait HasRegParam extends Params`
  * `trait HasMaxIter extends Params`
  * `trait HasFeaturesCol extends Params`
  * `trait HasLabelCol extends Params`
  * `trait HasScoreCol extends Params`
  * `trait HasThreshold extends Params`
  * `trait HasMetricName extends Params`
  * `trait HasInputCol extends Params`
  * `trait HasOutputCol extends Params`
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61929832 [Test build #22986 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22986/consoleFull) for PR 3099 at commit [`9f408ed`](https://github.com/apache/spark/commit/9f408edef1c7f23323bf8a9d86ae69cbbda7002e). * This patch merges cleanly.
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61923441 @manishamde Thanks for your feedback!

> Like @jkbradley, I prefer lr.setMaxIter(50) over lr.set(lr.maxIter, 50).

Me too, and this is implemented in the current version.

> Also, prefer to avoid passing parameters to fit like lr.fit(dataset, lr.maxIter -> 50).

1. It is useful when you want to try some parameters interactively. For example, suppose I'm using the default settings and I want to see how each parameter affects the result. I can try:

~~~
val lr = new LogisticRegression()
  .setMaxIter(20)
  .setRegParam(0.1)
lr.fit(dataset, lr.maxIter -> 50)
lr.fit(dataset, lr.regParam -> 1.0) // note that maxIter goes back to the default setting
~~~

If we only allow setters, the code becomes

~~~
val previousMaxIter = lr.getMaxIter
lr.setMaxIter(50)
lr.fit(dataset)
lr.setMaxIter(previousMaxIter)

val previousRegParam = lr.getRegParam
lr.setRegParam(1.0)
lr.fit(dataset)
lr.setRegParam(previousRegParam)
~~~

2. Another reason I want to have parameters specified in `fit` is multi-model training, as described in the design doc.

> Constructors with getters and setters as @shivaram pointed out will be great. The LOC reduction is important and should not be discounted.

Besides the binary compatibility issue and the Java issue I mentioned, it doesn't save you many characters:

~~~
val lr = new LogisticRegression(maxIter = 50, regParam = 0.1)

val lr = new LogisticRegression()
  .setMaxIter(50)
  .setRegParam(0.1)
~~~

> Do we plan to provide syntactic sugar such as a predict method when we use model to transform a dataset? For me, transform fits well with the feature engineering stage and predict after the model training has been performed.

I think we should keep methods that operate on normal RDDs and individual instances.

> It will be great to see the corresponding examples in Python. The getters/setters would map well to Python properties. Also, it will be nice to do an apples-to-apples comparison with the scikit-learn pipeline.

We need to deal with the serialization of objects and parameters. @davies is the expert. I expect the Python API to be very similar to the Scala/Java API.

> Finally, how do we plan to programmatically answer (developer/user) queries about algorithm properties such as multiclass classification support, internal storage format, etc.

This is beyond this PR. SPARK-3702 is relevant to your question, which @jkbradley is working on.
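mengxr's argument above — that a fit-time parameter overrides, but does not mutate, the values attached to the estimator — can be sketched standalone. This is a hedged illustration with made-up names (`TinyEstimator`, `effectiveParams`), not the PR's actual code:

```scala
// Minimal sketch (hypothetical names, not the PR's classes) of why
// fit-time overrides compose nicely: the override map wins for one
// call, while the values attached to the estimator stay untouched.
case class Param[T](name: String)

class TinyEstimator {
  // params attached to the estimator (the "setter" state)
  var embedded: Map[String, Any] = Map("maxIter" -> 20, "regParam" -> 0.1)

  val maxIter: Param[Int] = Param("maxIter")
  val regParam: Param[Double] = Param("regParam")

  def set[T](param: Param[T], value: T): this.type = {
    embedded += (param.name -> value); this
  }

  // fit-time overrides take precedence but do not mutate `embedded`
  def effectiveParams(overrides: (Param[_], Any)*): Map[String, Any] =
    embedded ++ overrides.map { case (p, v) => p.name -> v }
}

object Demo extends App {
  val lr = new TinyEstimator
  println(lr.effectiveParams(lr.maxIter -> 50)("maxIter")) // 50 for this call only
  println(lr.embedded("maxIter"))                          // still 20 afterwards
}
```

With only setters, the caller has to save and restore the old value by hand, which is exactly the boilerplate shown in the comment above.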
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61922586 The model serving work would really benefit from being able to evaluate models without requiring a Spark context, especially since we are shooting for latencies in the tens of milliseconds. More generally, we should think about how one might want to use the artifact of the pipeline. I suspect there are uses that may exist outside of Spark, and the extent to which the models themselves are "portable functions" could enable greater adoption.
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61922261 @shivaram Thanks for your feedback!

> Could we have constructors also with getters and setters?

It would be hard to maintain binary compatibility in the long run. After we remove the @Experimental tag, each time we add a new parameter we would need a new auxiliary constructor. Setters also make it easier to deprecate parameters. Constructors with default arguments would also make Scala code different from Java's, which we try to avoid. Using getters and setters, the Java code `JavaLogisticRegressionSuite` looks very similar to the Scala code `LogisticRegressionSuite`.

> Also, FWIW I am not very sure about having setters and getters in traits like HasRegParam etc. Doesn't this create a lot of classes as we add more, each having 2-3 params?

I'm okay with either approach. I feel this is a nice feature of Scala. Some parameters are very common, e.g., `regParam`, `maxIter`, and `featuresCol`, and it is tedious to copy the same code around. We could also group parameter traits into something like `IterativeParams` or `RegressionModelParams` later. However, when I coded this up, I found Java doesn't interpret the return type correctly, so I have to override the setters in each class, which is bad.

> Passing params across modules.

A model parameter like `featuresCol` is set before training, so we still need to deal with the logic after training: 1) if `lr.model.featuresCol` is set, use it; 2) if `lr.model.featuresCol` is not set, use `lr.featuresCol`, or keep the default if `lr.featuresCol` is not set either.

> In addition to the SchemaRDD based transformer API, it would be good to have a single Row transformation API too.

I agree. We need to know the schema in order to transform individual instances, and the row object doesn't have a reference to its schema. We can add `schema` as a parameter, which is required to transform a row. Btw, we will keep the `predict` method that works on a single Vector.

> Other than the API, the biggest worry I have is in terms of memory management.

We can call `unpersist` inside `StandardScaler` after computing the mean and the standard deviation, so it is not an issue here, but I got your point. We can add pipeline components that persist/checkpoint input datasets. When to `unpersist` is always the problem. I'll think about it.

> Also if you don't mind I'll try to port some of the image processing pipelines we have to this class structure by forking your branch. I feel that'll help me figure out what features are easy to use etc.

That's great! I feel that making the code compile is really helpful for seeing the trade-offs. But please understand that this is a WIP and I may update the code.
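The `this.type` trait-setter pattern mengxr describes — and why chaining works in Scala but degrades for Java callers — can be sketched as follows. The trait and class names are illustrative, not the PR's:

```scala
// Sketch (assumed names) of the shared-param-trait pattern under discussion.
// In Scala, returning this.type lets calls chain on the concrete class.
// From Java, however, the erased return type of setMaxIter is the trait
// itself, so a Java caller loses the concrete type unless each class
// overrides the setter — the problem mengxr mentions above.
trait HasMaxIter {
  protected var maxIterValue: Int = 100
  def setMaxIter(value: Int): this.type = { maxIterValue = value; this }
  def getMaxIter: Int = maxIterValue
}

trait HasRegParam {
  protected var regParamValue: Double = 0.0
  def setRegParam(value: Double): this.type = { regParamValue = value; this }
  def getRegParam: Double = regParamValue
}

class LogisticRegressionSketch extends HasMaxIter with HasRegParam {
  def describe: String = s"maxIter=$getMaxIter, regParam=$getRegParam"
}

object TraitDemo extends App {
  // Chaining works in Scala because this.type resolves to the concrete class.
  val lr = new LogisticRegressionSketch().setMaxIter(50).setRegParam(0.1)
  println(lr.describe) // maxIter=50, regParam=0.1
}
```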
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3099#discussion_r19923238 --- Diff: mllib/src/main/scala/org/apache/spark/ml/parameters.scala ---

```
@@ -0,0 +1,267 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml
+
+import java.lang.reflect.Modifier
+
+import scala.language.implicitConversions
+import scala.collection.mutable
+
+/**
+ * A param with self-contained documentation and optionally a default value.
+ *
+ * @param parent parent object
+ * @param name param name
+ * @param doc documentation
+ * @param default optional default value
+ * @tparam T param value type
+ */
+class Param[T] private (
+    val parent: Identifiable,
+    val name: String,
+    val doc: String,
+    val default: Option[T]) extends Serializable {
+
+  /**
+   * Creates a param without a default value.
+   *
+   * @param parent parent object
+   * @param name param name
+   * @param doc documentation
+   */
+  def this(parent: Identifiable, name: String, doc: String) = this(parent, name, doc, None)
+
+  /**
+   * Creates a param with a default value.
+   *
+   * @param parent parent object
+   * @param name param name
+   * @param doc documentation
+   * @param default default value
+   */
+  def this(parent: Identifiable, name: String, doc: String, default: T) =
+    this(parent, name, doc, Some(default))
+
+  /**
+   * Creates a param pair with the given value (for Java).
+   */
+  def w(value: T): ParamPair[T] = ParamPair(this, value)
+
+  /**
+   * Creates a param pair with the given value (for Scala).
+   */
+  def ->(value: T): ParamPair[T] = ParamPair(this, value)
+
+  override def toString: String = {
+    if (default.isDefined) {
+      s"$name: $doc (default: ${default.get})"
+    } else {
+      s"$name: $doc"
+    }
+  }
+}
+
+/**
+ * A param and its value.
+ */
+case class ParamPair[T](param: Param[T], value: T)
+
+/**
+ * Trait for components that take parameters.
+ */
+trait Params {
+
+  /** Returns all params. */
+  def params: Array[Param[_]] = {
+    val methods = this.getClass.getMethods
+    methods.filter { m =>
+      Modifier.isPublic(m.getModifiers) &&
+        classOf[Param[_]].isAssignableFrom(m.getReturnType) &&
+        m.getParameterTypes.isEmpty
+    }.map(m => m.invoke(this).asInstanceOf[Param[_]])
+  }
+
+  /** Gets a param by its name. */
+  def getParam(paramName: String): Param[Any] = {
+    val m = this.getClass.getMethod(paramName)
+    assert(Modifier.isPublic(m.getModifiers) &&
+      classOf[Param[_]].isAssignableFrom(m.getReturnType))
+    m.invoke(this).asInstanceOf[Param[Any]]
+  }
+
+  /**
+   * Validates parameters specified by the input parameter map.
+   * Raises an exception if any parameter belonging to this object is invalid.
+   */
+  def validateParams(paramMap: ParamMap): Unit = {}
+
+  /**
+   * Returns the documentation of all params.
+   */
+  def explainParams(): String = params.mkString("\n")
+}
+
+/**
+ * Trait for instances that hold their own param maps.
+ */
+trait OwnParamMap {
+
+  /**
+   * Internal param map.
+   */
+  val paramMap: ParamMap = ParamMap.empty
+
+  /**
+   * Sets a parameter in the own parameter map.
+   */
+  def set[T](param: Param[T], value: T): this.type = {
+    paramMap.put(param.asInstanceOf[Param[Any]], value)
+    this
+  }
+}
+
+private[ml] object Params {
+
+  /**
+   * Returns a Params implementation without any params.
+   */
+  val empty: Params = new Params {
+    override def params: Array[Param[_]] = Array.empty
+  }
+}
+
+/**
+ * A param to value map.
+ */
+class ParamMap private[ml] (
+    priv
```
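The reflection-based `params` discovery in the quoted diff — any public no-arg method returning a `Param[_]` counts as a parameter — can be demonstrated in isolation. This is a hedged, self-contained sketch using a stand-in class `P` for `Param[_]`:

```scala
// Self-contained sketch of the reflection trick used by Params.params above:
// every public no-arg method whose return type is assignable to P is
// discovered as a parameter. P is a stand-in, not the real Param class.
import java.lang.reflect.Modifier

class P(val name: String) // stand-in for Param[_]

trait HasParams {
  def params: Array[P] = {
    this.getClass.getMethods.filter { m =>
      Modifier.isPublic(m.getModifiers) &&
        classOf[P].isAssignableFrom(m.getReturnType) &&
        m.getParameterTypes.isEmpty
    }.map(m => m.invoke(this).asInstanceOf[P])
  }
}

class Example extends HasParams {
  // Scala vals compile to public accessor methods, so both are discovered.
  val maxIter = new P("maxIter")
  val regParam = new P("regParam")
}

object ReflectDemo extends App {
  println(new Example().params.map(_.name).sorted.mkString(",")) // maxIter,regParam
}
```

Note that `params` itself is not picked up, since its return type is `Array[P]`, not `P`.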
Github user manishamde commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61916011 I have a few comments on the API:

1. Like @jkbradley, I prefer ```lr.setMaxIter(50)``` over ```lr.set(lr.maxIter, 50)```. Also, I prefer to avoid passing parameters to fit like ```lr.fit(dataset, lr.maxIter -> 50)```.
2. Constructors with getters and setters as @shivaram pointed out will be great. The LOC reduction is important and should not be discounted.
3. Do we plan to provide syntactic sugar such as a ```predict``` method when we use ```model``` to transform a dataset? For me, ```transform``` fits well with the feature engineering stage and ```predict``` after the model training has been performed.
4. It will be great to see the corresponding examples in Python. The getters/setters would map well to Python properties. Also, it will be nice to do an apples-to-apples comparison with the scikit-learn pipeline.
5. Finally, how do we plan to programmatically answer (developer/user) queries about algorithm properties such as multiclass classification support, internal storage format, etc.?
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61913282 @mengxr Thanks for putting this together! I have some high-level comments from looking at the code.

1. Could we have constructors also with getters and setters? This is the same style we use in MLlib and it will make users' code much cleaner.

```
val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
val lr = new LogisticRegression()
  .setFeaturesCol("scaledFeatures")
```

will become one-liners like

```
val scaler = new StandardScaler(inputCol = "features", outputCol = "scaledFeatures")
val lr = new LogisticRegression(featuresCol = "scaledFeatures")
```

Also, we would still have an empty constructor that just sets the default values that exist for Params right now. FWIW I am not very sure about having setters and getters in traits like HasRegParam etc. Doesn't this create a lot of classes as we add more, each having 2-3 params?

2. Passing params across modules: The code to pass params from Estimator to Transformer is a little clunky right now. I think this can also be solved with the constructor interface above. For example, in LogisticRegression.scala,

```
val lrm = new LogisticRegressionModel(lr.run(instances).weights)
if (!lrm.paramMap.contains(lrm.featuresCol) && map.contains(lrm.featuresCol)) {
  lrm.setFeaturesCol(featuresCol)
}
```

can become

```
val lrm = new LogisticRegressionModel(lr.run(instances).weights, featuresCol = featuresCol)
```

3. In addition to the SchemaRDD-based transformer API, it would be good to have a single-Row transformation API too. If you want to reuse the same transformers for training and for online prediction, you would want to use the same Pipeline, but with one element at a time or something like that. There are some folks in the AMPLab (cc @jegonzal @dcrankshaw) who are working on model serving, and this would be useful for integration with such systems. Also, I can see the single-element API being very useful for unit testing.

4. Other than the API, the biggest worry I have is in terms of memory management. Because most of the transformations can be lazy operations, it is very tricky to figure out when the programmer should call persist/unpersist. For example, in StandardScaler.scala we now have `input.cache()`, but AFAIK there is no simple way to unpersist this RDD if the Scaler is being run lazily.

Also, if you don't mind, I'll try to port some of the image processing pipelines we have to this class structure by forking your branch. I feel that'll help me figure out what features are easy to use etc.
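Shivaram's point 3 — that a per-row transform needs the schema passed alongside the row, since a row carries no schema reference of its own — could look roughly like this. All names here are illustrative, not a proposed API:

```scala
// Hedged sketch of the "single Row" idea: the row itself is just a
// sequence of values, so the transform takes an explicit schema to
// locate the input column. Hypothetical names throughout.
case class Schema(fieldNames: Seq[String]) {
  def indexOf(name: String): Int = fieldNames.indexOf(name)
}

class RowScaler(inputCol: String, factor: Double) {
  def transformRow(schema: Schema, row: Seq[Any]): Seq[Any] = {
    val i = schema.indexOf(inputCol)
    row.updated(i, row(i).asInstanceOf[Double] * factor)
  }
}

object RowDemo extends App {
  val schema = Schema(Seq("id", "feature"))
  val scaler = new RowScaler("feature", 2.0)
  println(scaler.transformRow(schema, Seq(1, 3.0))) // List(1, 6.0)
}
```

A serving system could hold the schema once and push rows through one at a time, without a Spark context.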
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61905231 [Test build #22962 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22962/consoleFull) for PR 3099 at commit [`1ef26e0`](https://github.com/apache/spark/commit/1ef26e07af574dee8432a1509d0efb46c6074b1d). * This patch **fails RAT tests**. * This patch merges cleanly. * This patch adds no public classes.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61905232 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22962/ Test FAILed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61905223 [Test build #22962 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22962/consoleFull) for PR 3099 at commit [`1ef26e0`](https://github.com/apache/spark/commit/1ef26e07af574dee8432a1509d0efb46c6074b1d). * This patch merges cleanly.
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61864795 I vote for the lr.setRegParam(0.1) setup. I also vote for setting parameters beforehand, and not allowing parameters in fit().
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3099#discussion_r19897343 --- Diff: mllib/src/main/scala/org/apache/spark/ml/parameters.scala ---
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61858097 Both the reference and the class internals are immutable, no? Typical Java conventions would put such a variable in all caps, though maybe in Scala it's different?
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61854896 @sryza `maxIter` is not a constant. It is a parameter key in the current design.
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61853890 If maxIter is a constant, would it be clearer to use MAX_ITER?
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61785223 @mengxr Yes I agree with your conclusion. With a generic get/set by key method, I suppose you don't strictly need all the particular getters and setters. I suppose that's simpler, and if there's just one way to do it, there's no confusion about key vs getter/setter. I don't know if you want to remove the getters/setters entirely, but it wouldn't be crazy.
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61775481 @srowen Glad to hear that you like it :) Your feedback will be greatly appreciated, but I don't want to waste your time on minor details. Let's run the discussion in the main thread, so we can still see it when I update the code. One thing I have been struggling with is the way we handle parameters here. Please check the last section of the PR description. Basically, I think

~~~
val lr = new LogisticRegression()
  .setMaxIter(50)
  .setRegParam(0.1)
~~~

looks better than

~~~
val lr = new LogisticRegression()
lr.set(lr.maxIter, 50)
  .set(lr.regParam, 0.1)
~~~

But if we use setters/getters, it is a little weird that `lr.maxIter` is a parameter key instead of its value. Or we can let `lr.maxIter` store the value (so the setters/getters can be generated using @BeanProperty), and then use something like `lr.maxIter_` (underscore) as the parameter key. The code would become

~~~
val lr = new LogisticRegression()
  .setRegParam(0.1) // attach regParam = 0.1 to the object "lr"
val lrParamMaps = new ParamMapBuilder()
  .addMulti(lr.maxIter_, Array(10, 20)) // specify parameters (with underscore) outside the object "lr"
  .build()
// multi-model training
val models = lr.fit(dataset, lrParamMaps)
~~~

Does it look better than the current version?
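The "param field is a key, not a value" convention under discussion can be sketched minimally. The real Param in the PR also carries a parent, doc string, and optional default, which this illustration omits:

```scala
// Minimal sketch: a Param field acts as a typed key, and pairing a key
// with a value produces a ParamPair. Illustrative, stripped-down names.
class Param[T](val name: String) {
  def ->(value: T): ParamPair[T] = ParamPair(this, value) // Scala sugar
  def w(value: T): ParamPair[T] = ParamPair(this, value)  // Java-friendly alias
}

case class ParamPair[T](param: Param[T], value: T)

object KeyDemo extends App {
  val maxIter = new Param[Int]("maxIter")
  val pair = maxIter -> 50 // the member `->` wins over Predef's tuple arrow
  println(pair.param.name + "=" + pair.value) // maxIter=50
}
```

This is why `lr.maxIter -> 50` in the `fit` examples above type-checks: the field is a `Param[Int]` key, and `-> 50` builds the key/value pair.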
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/3099#discussion_r19859297 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Model.scala --- @@ -0,0 +1,6 @@ +package org.apache.spark.ml --- End diff -- Oh Ok, is it worth me reviewing this yet? I'm liking it already.
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3099#discussion_r19859152 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Model.scala --- @@ -0,0 +1,6 @@ +package org.apache.spark.ml --- End diff -- Hey @srowen, please ignore these. I'm not trying to let it pass Jenkins now.
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/3099#discussion_r19859074 --- Diff: mllib/src/main/scala/org/apache/spark/ml/example/CrossValidator.scala --- @@ -0,0 +1,64 @@ +package org.apache.spark.ml.example --- End diff -- Same, copyright header
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/3099#discussion_r19859052 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Model.scala --- @@ -0,0 +1,6 @@ +package org.apache.spark.ml --- End diff -- (Missing copyright header)
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61748064 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22907/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61748036 [Test build #22907 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22907/consoleFull) for PR 3099 at commit [`376db0a`](https://github.com/apache/spark/commit/376db0add8ec05252a1e8ffc2f622f3dfab1b0cd).
* This patch merges cleanly.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61748060 [Test build #22907 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22907/consoleFull) for PR 3099 at commit [`376db0a`](https://github.com/apache/spark/commit/376db0add8ec05252a1e8ffc2f622f3dfab1b0cd).
* This patch **fails RAT tests**.
* This patch merges cleanly.
* This patch adds no public classes.
GitHub user mengxr opened a pull request: https://github.com/apache/spark/pull/3099 [WIP][SPARK-3530][MLLIB] pipeline and parameters

### This is a WIP, so please do not spend time on individual lines.

This PR adds the package `org.apache.spark.ml` with pipeline and parameter support, as discussed on the JIRA. This is joint work of @jkbradley @etrain @shivaram and many others who helped on the design, also with help from @marmbrus and @liancheng on the Spark SQL side. The design doc can be found at: https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing

Again, though it compiles, this is a WIP, so please do not spend time on individual lines.

The new set of APIs, inspired by the AMPLab MLI project and scikit-learn, leverages Spark SQL's schema support and execution plan optimization. It introduces the following components that help build a practical pipeline:

1. Transformer, which transforms one dataset into another
2. Estimator, which fits models to data, where models are themselves transformers
3. Evaluator, which evaluates model output and returns a scalar metric
4. Pipeline, a simple pipeline that consists of transformers and estimators

Parameters can be supplied at fit/transform time or embedded in the component itself:

1. Param: a strongly typed parameter key with self-contained doc
2. ParamMap: a param -> value map
3. Params: trait for components with parameters
4. OwnParamMap: trait for components with an embedded parameter map

For any component that implements `Params`, users can easily check the docs by calling `explainParams`:

~~~
> val lr = new LogisticRegression
> lr.explainParams
maxIter: max number of iterations (default: 100)
regParam: regularization constant (default: 0.1)
labelCol: label column name (default: label)
featuresCol: features column name (default: features)
~~~

or check an individual param:

~~~
> lr.maxIter
maxIter: max number of iterations (default: 100)
~~~

**Please start with the example code in `LogisticRegressionSuite.scala`, where I put three examples:**

1. Run a simple logistic regression job:

~~~
val lr = new LogisticRegression
lr.set(lr.maxIter, 10)
  .set(lr.regParam, 1.0)
val model = lr.fit(dataset)
model.transform(dataset, model.threshold -> 0.8) // overwrite threshold
  .select('label, 'score, 'prediction).collect()
  .foreach(println)
~~~

2. Run logistic regression with cross-validation and grid search, using areaUnderROC as the metric:

~~~
val lr = new LogisticRegression
val cv = new CrossValidator
val eval = new BinaryClassificationEvaluator
val lrParamMaps = new ParamGridBuilder()
  .addMulti(lr.regParam, Array(0.1, 100.0))
  .addMulti(lr.maxIter, Array(0, 5))
  .build()
cv.set(cv.estimator, lr.asInstanceOf[Estimator[_]])
  .set(cv.estimatorParamMaps, lrParamMaps)
  .set(cv.evaluator, eval)
  .set(cv.numFolds, 3)
val bestModel = cv.fit(dataset)
~~~

3. Run a pipeline consisting of a standard scaler and a logistic regression component:

~~~
val scaler = new StandardScaler
scaler.set(scaler.inputCol, "features")
  .set(scaler.outputCol, "scaledFeatures")
val lr = new LogisticRegression
lr.set(lr.featuresCol, "scaledFeatures")
val pipeline = new Pipeline
val model = pipeline.fit(dataset, pipeline.stages -> Array(scaler, lr))
model.transform(dataset)
  .select('label, 'score, 'prediction)
  .collect()
  .foreach(println)
~~~

**What is missing now and will be added soon:**

1. Runtime checks of schemas, so that before we touch the data, we go through the schema and make sure column names and types match the input parameters.
2. Java examples.
3. Serialization.
4. Storing training parameters in trained models.

**What I'm not very confident about and definitely need feedback on:**

The usage of parameters is a little messy. The interface I do want to keep is the one that supports multi-model training:

~~~
def fit(dataset: SchemaRDD, paramMaps: Array[ParamMap]): Array[Model]
~~~

which leaves room for the algorithm to optimize the training phase. This is different from scikit-learn's design, where `fit` returns the estimator itself; for large-scale datasets, we do want to train in parallel and return multiple models. Now there are two places where you can specify parameters:

1. using the embedded paramMap:

~~~
lr.set(lr.maxIter, 50)
~~~

2. at fit/transform time:

~~~
lr.fit(dataset, lr.maxIter -> 50)
~~~

The la
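To make the Param / ParamMap / Params abstractions described above concrete, here is a hypothetical, much-simplified Scala sketch. It is not the actual `org.apache.spark.ml` code from this PR; the class and method names (`explainParams`, the `->` sugar, default fallback in `ParamMap`) merely mirror the behavior the description shows:

```scala
// Simplified sketch of the Param / ParamMap / Params idea -- NOT the real
// org.apache.spark.ml implementation, just an illustration of the design.

// A strongly typed parameter key with self-contained documentation.
case class Param[T](name: String, doc: String, default: Option[T]) {
  // Sugar for building (param -> value) pairs, as in `lr.maxIter -> 50`.
  def ->(value: T): (Param[T], T) = (this, value)
  override def toString: String =
    default.map(d => s"$name: $doc (default: $d)").getOrElse(s"$name: $doc")
}

// A param -> value map; lookups fall back to the param's declared default.
class ParamMap(private val map: Map[Param[_], Any] = Map.empty) {
  def put[T](pair: (Param[T], T)): ParamMap = new ParamMap(map + pair)
  def apply[T](param: Param[T]): T =
    map.get(param).orElse(param.default)
      .map(_.asInstanceOf[T])
      .getOrElse(throw new NoSuchElementException(param.name))
}

// Trait for components that carry parameters and can explain them.
trait Params {
  def params: Seq[Param[_]]
  def explainParams(): String = params.mkString("\n")
}

// A toy component exposing two of the params from the example output above.
class LogisticRegression extends Params {
  val maxIter = Param[Int]("maxIter", "max number of iterations", Some(100))
  val regParam = Param[Double]("regParam", "regularization constant", Some(0.1))
  def params: Seq[Param[_]] = Seq(maxIter, regParam)
}

object ParamSketch extends App {
  val lr = new LogisticRegression
  println(lr.explainParams())
  // An override supplied at fit/transform time, per the second style above:
  val paramMap = new ParamMap().put(lr.maxIter -> 50)
  println(paramMap(lr.maxIter))  // 50, overriding the default of 100
  println(paramMap(lr.regParam)) // falls back to the default 0.1
}
```

The key design point this illustrates is that a `Param[T]` is a typed key rather than a bare string, so overrides passed at fit/transform time type-check at compile time while defaults still live next to their documentation.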