Github user tomerk commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62112717

At @shivaram's suggestion, I started porting a simple text classifier pipeline to this interface; it was already using an Estimator/Transformer abstraction over RDD[U] to RDD[V] transforms. The almost-complete port (the imports got mangled when moving files around) can be found at https://github.com/shivaram/spark-ml/commit/522aec73172b28a4bc1b22df030a459fddbd93dd. Beyond what Shivaram already mentioned, here are my thoughts:

1. The trickiest part by far was all of the implicit conversions. I ended up needing several kinds of implicit-conversion imports (case class -> SchemaRDD, the Spark SQL DSL, parameter maps, etc.), and the IDE mysteriously deleted them as I moved files between projects. Because I couldn't keep track of them, I ended up copying and pasting them wherever they were needed.

2. Like Shivaram, I'm not familiar with the Spark SQL DSL, so here too I had to copy and paste code. It's unclear what syntax is valid and what isn't. For example, is saying "as outputCol" enough, or is "as Symbol(outputCol)" required?

3. There is a lot of boilerplate code. It was easier to write the Transformers as RDD[U] to RDD[V] rather than SchemaRDD to SchemaRDD, so I fully agree with Shivaram on that front. Interfaces along those lines (iterator-to-iterator transformers that can be applied to RDDs using mapPartitions) could also make it easier for transformers not to depend on a local SparkContext to execute.

4. I found the parameter mapping in estimators fairly verbose. I like Shivaram's idea of having estimators pass everything through to the transformers regardless.

5. Requiring the transformers that estimators output to extend Model didn't make sense to me. Certain estimators, such as one that chooses only the most frequent tokens in a collection to keep for each document, don't seem like they should output models. On that front, should estimators be required to specify the type of transformer they output? It can be convenient to inline an anonymous Transformer without making it a top-level class.

6. There are a lot of parameter traits: HasRegParam, HasMaxIter, HasScoreCol, HasFeatureCol.... Does it make sense to have this many specific parameter traits if we still have to maintain boilerplate setter code for Java anyway?
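To make point 3 concrete, here is a minimal sketch of what an iterator-to-iterator transformer might look like. The trait and names (`IteratorTransformer`, `applyLocally`, `Tokenizer`) are hypothetical illustrations, not part of this PR's API; on a cluster, `transform` would be handed to `rdd.mapPartitions`, while locally the same function runs on a plain iterator with no SparkContext involved:

```scala
// Hypothetical iterator-to-iterator transformer abstraction (names are
// illustrative, not from the PR).
trait IteratorTransformer[U, V] extends Serializable {
  def transform(it: Iterator[U]): Iterator[V]

  // On a cluster this would be rdd.mapPartitions(transform);
  // here is the local equivalent on a plain Seq, needing no SparkContext.
  def applyLocally(data: Seq[U]): Seq[V] =
    transform(data.iterator).toSeq
}

// Example: a tokenizer written as a pure Iterator[String] => Iterator[Seq[String]]
object Tokenizer extends IteratorTransformer[String, Seq[String]] {
  def transform(it: Iterator[String]): Iterator[Seq[String]] =
    it.map(_.toLowerCase.split("\\s+").toSeq)
}
```

Because the core logic is just a function over iterators, the same transformer is trivially testable locally, e.g. `Tokenizer.applyLocally(Seq("Hello World"))`.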
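For point 5, here is a sketch (with hypothetical, simplified `Estimator`/`Transformer` signatures, not the PR's actual API) of how an estimator can return an anonymous transformer when no named model class seems warranted, using the most-frequent-tokens example:

```scala
// Simplified, hypothetical interfaces for illustration only.
trait Transformer[U, V] extends Serializable { def transform(in: U): V }

trait Estimator[U, V] {
  // Returning the abstract Transformer type lets fit() inline an
  // anonymous transformer instead of declaring a top-level "Model" class.
  def fit(data: Seq[U]): Transformer[U, V]
}

// Estimator that keeps only the k most frequent tokens per document;
// the result is just a filter, not a "Model" in any meaningful sense.
class TopTokensEstimator(k: Int) extends Estimator[Seq[String], Seq[String]] {
  def fit(data: Seq[Seq[String]]): Transformer[Seq[String], Seq[String]] = {
    val counts = data.flatten.groupBy(identity).map { case (t, v) => (t, v.size) }
    val keep = counts.toSeq.sortBy(-_._2).take(k).map(_._1).toSet
    // Anonymous transformer: no named output class needed.
    new Transformer[Seq[String], Seq[String]] {
      def transform(doc: Seq[String]): Seq[String] = doc.filter(keep)
    }
  }
}
```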
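For point 6, a simplified stand-in showing where the setter boilerplate comes from. This is not the real spark.ml `Params` machinery, just an illustration of the pattern: the `Has*` traits can share getters, but each concrete class still repeats one setter per parameter so that chained calls return the concrete type (which Java callers need):

```scala
// Simplified stand-in for the parameter traits discussed in point 6;
// the real Param/Params machinery is more involved.
trait Params {
  private val paramMap = scala.collection.mutable.Map[String, Any]()
  protected def set(name: String, value: Any): this.type = {
    paramMap(name) = value; this
  }
  protected def get[T](name: String, default: T): T =
    paramMap.getOrElse(name, default).asInstanceOf[T]
}

trait HasRegParam extends Params {
  def getRegParam: Double = get("regParam", 0.0)
}

trait HasMaxIter extends Params {
  def getMaxIter: Int = get("maxIter", 100)
}

// Each concrete class still repeats a setter per parameter so chained
// calls keep the concrete return type -- the boilerplate at issue.
class LogisticRegressionSketch extends HasRegParam with HasMaxIter {
  def setRegParam(value: Double): this.type = set("regParam", value)
  def setMaxIter(value: Int): this.type = set("maxIter", value)
}
```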