Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-75304366
  
    > Don't you always have to look at the data to determine how many unique 
values a column has, regardless of type?
    
    No, not if we already have ML attributes saved together with the data or defined by users.
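
    To make that concrete, here is a minimal sketch of the user-defined case, assuming the attribute API proposed in this PR (exact names may still change) and a made-up `workclass` column:

```scala
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.sql.DataFrame

// Declare "workclass" as categorical with a known set of values, so that
// downstream algorithms can read the cardinality from column metadata
// instead of scanning the data. (Column names here are made up.)
def declareCategorical(df: DataFrame): DataFrame = {
  val attr = NominalAttribute.defaultAttr
    .withName("workclass")
    .withValues("private", "government", "self-employed")
  df.select(df("age"), df("workclass").as("workclass", attr.toMetadata()))
}
```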
    
    > String and int are encodings, but attribute types like categorical and 
continuous are interpretations. Those seem orthogonal to me and I thought 
Attribute was only metadata representing the attribute type, whereas the RDD 
schema already knows the actual column data types.
    
    Conceptually, this is true. But adding restrictions would simplify our implementation. The restriction I proposed is that data stored in columns with ML attributes must be plain values (Float/Double/Int/Long/Boolean/Vector), so algorithms and transformers don't need to handle special types. Consider a vector assembler that merges multiple columns into a single vector column: if it had to handle string columns, it would need to call some indexer to turn the strings into indices before merging them, and that piece of code would probably be repeated in every algorithm and its unit tests. If we force users to turn string columns into numeric ones first, the implementation of the rest of the pipeline can be simplified.
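
    As a rough sketch (not the actual implementation) of why the restriction helps: with numeric-only inputs, the core of a vector assembler is just packing values into a vector, with no per-type dispatch or embedded indexer.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// With the restriction, every input value is already a number, so assembling
// one row is a one-liner.
def assembleRow(values: Seq[Double]): Vector = Vectors.dense(values.toArray)

// Without the restriction, every transformer would need a type dispatch like
// the following (plus an indexer of its own) before doing its real work:
//   value match {
//     case d: Double => d
//     case s: String => stringIndexer.indexOf(s) // extra state, extra tests
//   }
```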
    
    > From a user perspective, I'd be surprised if I had to encode categorical 
string values as it seems like something a framework can easily do for me. 
There's nothing inherently strange about providing a string-valued column that 
is (necessarily) categorical and in fact they often are strings in input. But 
there's no reason it couldn't encode the categorical values internally in a 
different way if needed. Are you just referring to the latter? sure, anything 
can be done within the framework.
    
    scikit-learn has a clear separation between string values and numeric values. All string values must be encoded into numeric (categorical) columns through transformers before calling ML algorithms, and all ML algorithms take a feature matrix `X` and a label vector `y`. That hasn't surprised users much (hopefully). In MLlib, we will provide transformers that turn string columns into categorical columns in various ways.
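
    For illustration only (names are hypothetical, not a committed API), such a transformer could be as simple as learning a label-to-index map and emitting a numeric column:

```scala
// Hypothetical string indexer: fit learns the mapping, transform applies it.
// A real transformer would also attach a nominal attribute to the output
// column and decide how to handle unseen labels.
class SimpleStringIndexer {
  private var labelToIndex: Map[String, Double] = Map.empty

  def fit(labels: Seq[String]): this.type = {
    labelToIndex = labels.distinct.zipWithIndex
      .map { case (label, i) => label -> i.toDouble }
      .toMap
    this
  }

  def transform(labels: Seq[String]): Seq[Double] = labels.map(labelToIndex)
}
```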
    
    > I agree that discrete and ordinal don't come up. I don't think they're 
required as types, but may allow optimizations. For example, there's no point 
in checking decision rules like >= 3.4, >= 3.5, >= 3.6 for a discrete feature. 
They're all the same rule. That optimization doesn't exist yet. I can't 
actually think of a realistic optimization that would depend on knowing a value 
is ordinal (1,2,3,...) I'd drop that maybe.
    
    For trees, if the features are integers and there is a split `> 3.4`, then there won't be additional splits between 3 and 4, because every threshold in that range separates the points the same way. It looks okay to me to have a split `> 3.4` even though all values are integers. We can definitely add this attribute back if it becomes necessary.
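
    A tiny illustration of why those rules collapse into one on integer-valued data (thresholds picked just for this example):

```scala
// Any threshold strictly between 3 and 4 partitions integer values
// identically, so a tree only needs to consider one split in that range.
val values = Seq(1, 2, 3, 4, 5)
val partitions = Seq(3.4, 3.5, 3.6).map(t => values.partition(_ > t))
assert(partitions.forall(_ == partitions.head))
```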

