Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/4460#issuecomment-75106841

> Rename FeatureType? and what's its value for AttributeGroup? GROUP or null?

I wish we could use `type`, but it is already taken by Scala. `DataType` is taken by SQL. So `DatumType` or `MLDataType`? ... I don't really have good suggestions. I'm not sure whether we should make `AttributeGroup` an `Attribute`. What is the benefit of making it an `Attribute`?

> You could imagine a more elaborate hierarchy of types: discrete is a special case of continuous, ordinal is a special case of discrete. It's nice to have that expressiveness; it adds somewhat to the complexity for the caller and the code. Maybe you could argue that the schema should force an interpretation for the algorithm. But I kind of like it. The type objects would have methods like isContinuous, isCategorical. Should I make a fuller hierarchy or stick to adding BINARY?

I think having a full hierarchy is a good idea. Could you list all of the types you want to include? Then we can check the complexity.

Btw, I don't know whether we should have ML attributes attached to string columns. It seems to me that a string column should be mapped to an integer column first to become an ML column with attribute. Hopefully that reduces the complexity.
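To make the complexity question concrete, here is a minimal sketch of what a fuller hierarchy with `isContinuous` / `isCategorical` might look like. Everything in it is hypothetical: the name `FeatureKind`, the `isA` helper, the placement of binary under ordinal, and the reading of "categorical" as "discrete or narrower" are assumptions made for illustration, not anything in the PR or in Spark.

```scala
// Hypothetical sketch only: these names do not exist in Spark.
// Each kind points at the more general kind it is a special case of.
sealed abstract class FeatureKind(val name: String, val parent: Option[FeatureKind]) {

  /** True if this kind is `other` or a (transitive) special case of it. */
  def isA(other: FeatureKind): Boolean =
    this == other || parent.exists(_.isA(other))

  /** Assumption: anything at or below Continuous may be read as continuous. */
  def isContinuous: Boolean = isA(FeatureKind.Continuous)

  /** Assumption: "categorical" means Discrete or anything narrower. */
  def isCategorical: Boolean = isA(FeatureKind.Discrete)
}

object FeatureKind {
  // Chain taken from the quoted comment: ordinal <: discrete <: continuous.
  case object Continuous extends FeatureKind("continuous", None)
  case object Discrete extends FeatureKind("discrete", Some(Continuous))
  case object Ordinal extends FeatureKind("ordinal", Some(Discrete))
  // Assumption: binary is placed under ordinal; the thread does not say where it belongs.
  case object Binary extends FeatureKind("binary", Some(Ordinal))
}
```

With this shape, `FeatureKind.Binary.isContinuous` holds because the check walks up the parent chain, while `FeatureKind.Continuous.isCategorical` does not. Adding a kind costs one line, but every algorithm still has to decide which interpretation it accepts, which is the caller-side complexity mentioned above.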