[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

MLnick Mon, 17 Jul 2017 04:07:49 -0700

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    @sethah thanks for reviewing. 
    
    _For the first question:_
    
    Yes, currently categorical columns that are numeric need to be 
explicitly encoded as strings. I mentioned this as a follow-up improvement. It's 
easy to handle; I'm just not certain of the API yet. Here are the 
two options I see:
    
    1. User can specify param `categoricalCols` to explicitly set the 
categorical columns. But do we then assume that all string columns not in that 
list are also categorical, i.e. that this param is effectively only for numeric 
columns that must be treated as categorical? Or do we ignore all other 
non-numeric columns?
    2. User can specify param `realCols` to explicitly set the numeric columns. 
All other columns are treated as categorical.
    
    We could potentially offer both, though I lean towards (2) above, since 
the default use case will be encoding many (usually high-cardinality) 
categorical columns, with maybe a few real columns in there.
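
    To make option (2) concrete, here is a minimal sketch in plain Python, 
not the transformer in this PR. The `hash_row` function and its `real_cols` 
argument are hypothetical names that mirror the proposed `realCols` param, and 
`zlib.crc32` is only a stand-in for whatever hash function the real 
implementation uses.

```python
import zlib


def hash_row(row, num_features=16, real_cols=()):
    """Hash one row (a dict of column -> value) into a sparse vector.

    Columns listed in `real_cols` (hypothetical, mirrors `realCols`) are
    treated as numeric: the column name is hashed and the value itself is
    the feature value. Every other column is treated as categorical:
    the string "column=value" is hashed and the feature value is 1.0.
    """
    vec = {}
    for col, value in row.items():
        if col in real_cols:
            key, weight = col, float(value)
        else:
            key, weight = f"{col}={value}", 1.0
        # crc32 is a stand-in hash chosen for this sketch only.
        idx = zlib.crc32(key.encode("utf-8")) % num_features
        # Colliding features are summed into the same slot.
        vec[idx] = vec.get(idx, 0.0) + weight
    return vec


row = {"price": 2.5, "zip": 94107}

# `price` is real, `zip` is numeric but treated as categorical.
mixed = hash_row(row, real_cols=("price",))

# With no real columns declared, both columns are hashed as categories.
all_categorical = hash_row(row)
```

    With this shape of API, a numeric column such as `zip` becomes 
categorical simply by leaving it out of `real_cols`, so the user does not have 
to cast it to a string first.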
    
    _For the second issue:_
    
    There is no way (at least that I know of) to provide a `dropLast` feature, 
since we don't know how many features there are. The whole point of hashing is 
to avoid keeping the `feature <-> index` mapping, for speed and memory 
efficiency.
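
    The contrast can be sketched in plain Python as follows. This is an 
illustration under assumptions, not Spark code: `one_hot_index`, `one_hot` and 
`hashed` are hypothetical helpers, and `zlib.crc32` again stands in for the 
real hash function.

```python
import zlib


def one_hot_index(categories):
    """Fit step: build and store the feature <-> index mapping."""
    return {c: i for i, c in enumerate(sorted(set(categories)))}


def one_hot(value, index, drop_last=False):
    """Encode using the stored mapping; the last category can be dropped
    because the mapping tells us which category is last."""
    size = len(index) - 1 if drop_last else len(index)
    vec = [0.0] * size
    i = index[value]
    if i < size:
        vec[i] = 1.0
    return vec


def hashed(col, value, num_features):
    """Encode with no fit step and no stored mapping. The encoder never
    sees the full category set, so it cannot tell which one is 'last'."""
    vec = [0.0] * num_features
    vec[zlib.crc32(f"{col}={value}".encode("utf-8")) % num_features] = 1.0
    return vec


index = one_hot_index(["a", "b", "c"])
dropped = one_hot("c", index, drop_last=True)   # last category -> all zeros
kept = one_hot("a", index, drop_last=True)
h = hashed("x", "c", num_features=8)            # always one active slot
```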



