Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-75304366

> Don't you always have to look at the data to determine how many unique values a column has, regardless of type?

No, not if we already have ML attributes saved together with the data or defined by users.

> String and int are encodings, but attribute types like categorical and continuous are interpretations. Those seem orthogonal to me, and I thought Attribute was only metadata representing the attribute type, whereas the RDD schema already knows the actual column data types.

Conceptually, this is true. But adding restrictions would simplify our implementation. The restriction I proposed is that data stored in columns with ML attributes must be values (Float/Double/Int/Long/Boolean/Vector), so algorithms and transformers don't need to handle special types.

Consider a vector assembler that merges multiple columns into a vector column. If it has to handle string columns, it must call some indexer to turn strings into indices before merging them, and that piece of code would likely appear in every algorithm and its unit tests. If we instead require users to turn string columns into numeric ones first, the implementation of the rest of the pipeline can be simplified.

> From a user perspective, I'd be surprised if I had to encode categorical string values, as it seems like something a framework can easily do for me. There's nothing inherently strange about providing a string-valued column that is (necessarily) categorical, and in fact they often are strings in input. But there's no reason it couldn't encode the categorical values internally in a different way if needed. Are you just referring to the latter?

Sure, anything can be done within the framework. scikit-learn has a clear separation of string values and numeric values.
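To make the proposed division of labor concrete, here is a minimal, self-contained sketch (plain Python, with hypothetical names rather than any Spark API) of the two stages: an indexer turns string categories into numeric indices first, so a vector assembler only ever sees numeric columns and needs no type-specific code.

```python
def string_indexer(values):
    """Map each distinct string to a numeric index (order of first appearance)."""
    index = {}
    out = []
    for v in values:
        if v not in index:
            index[v] = len(index)
        out.append(float(index[v]))
    return out, index

def vector_assembler(*columns):
    """Merge numeric columns row-wise into feature vectors.

    Because indexing already happened upstream, no string handling
    (and no per-algorithm duplication of it) is needed here.
    """
    return [list(row) for row in zip(*columns)]

colors, mapping = string_indexer(["red", "blue", "red"])
features = vector_assembler(colors, [1.5, 2.0, 0.5])
# features == [[0.0, 1.5], [1.0, 2.0], [0.0, 0.5]]
```

This mirrors the scikit-learn convention mentioned above: encoding is a separate transformer step, and downstream algorithms consume purely numeric data.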
All string values must be encoded into categorical columns through transformers before calling ML algorithms, and all ML algorithms take a matrix `X` and a vector `y`. That doesn't seem to surprise users much (hopefully). In MLlib, we will provide transformers that turn strings into categorical columns in various ways.

> I agree that discrete and ordinal don't come up. I don't think they're required as types, but they may allow optimizations. For example, there's no point in checking decision rules like >= 3.4, >= 3.5, >= 3.6 for a discrete feature. They're all the same rule. That optimization doesn't exist yet. I can't actually think of a realistic optimization that would depend on knowing a value is ordinal (1, 2, 3, ...). I'd drop that, maybe.

For trees, if a feature takes integer values and there is a split `> 3.4`, then there won't be further splits between 3 and 4, because all points are already separated. It looks okay to me that we have a split `> 3.4` while all values are integers. We can definitely add this attribute back if it becomes necessary.
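The point about redundant decision rules can be seen in a few lines: for an integer-valued feature, every threshold strictly between two consecutive integers induces exactly the same partition of the data, so a tree gains nothing from trying several of them.

```python
# Integer-valued feature: any threshold in the open interval (3, 4)
# separates the data identically.
values = [1, 2, 3, 4, 5, 6]

def partition(threshold):
    """Which rows satisfy the decision rule `value > threshold`."""
    return [v > threshold for v in values]

# "> 3.4", "> 3.5", and "> 3.6" are all the same rule on this feature.
assert partition(3.4) == partition(3.5) == partition(3.6)
```

This is why knowing a feature is discrete only enables skipping duplicate candidate splits, not a different result, and why a split like `> 3.4` over integer values is harmless.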