Hello everyone,

In MLlib, I’m trying to rely essentially on pipelines to create features out of the Titanic dataset and showcase the power of feature hashing. I want to:
- Apply bucketization on some columns (QuantileDiscretizer is fine).
- Cross all my columns with each other to get cross features.
- Hash all of these cross features into a vector.
- Feed that vector to a logistic regression.

Looking at the documentation, it looks like the only way to hash features is the *FeatureHasher* transformation. It takes multiple columns as input, whose types can be numeric, bool, or string (but not vector/array).

But now I’m left wondering how I can create my cross-feature columns. I’m looking for a transformation that could take two columns as input and return a numeric, bool, or string column. I didn't manage to find anything that does the job. There are multiple transformations, such as VectorAssembler, that operate on vectors, but a vector is not a type accepted by FeatureHasher.

Of course, I could try to combine columns directly in my dataframe (before the pipeline kicks in), but then I would no longer be able to benefit from QuantileDiscretizer and other cool functions.

Am I missing something in the transformation API? Or is my approach to hashing wrong? Or should we consider extending the API somehow?

Thank you, kind regards,

David
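
P.S. To make the intended flow concrete, here is a minimal plain-Python sketch of the first three steps (bucketize, cross, hash) on a single row. It is not Spark code and does not use any MLlib API: the function names, the sample passenger, and the split points are all hypothetical, and CRC32 is only a stand-in for whatever hash function FeatureHasher uses internally. The crossing step in the middle is the part I cannot find as a pipeline stage.

```python
import zlib
from itertools import combinations


def bucketize(value, splits):
    """Return the index of the bucket that `value` falls into, given sorted split points."""
    return sum(value >= s for s in splits)


def cross_features(row):
    """Build one string token per unordered pair of columns, e.g. 'pclass=3&sex=male'."""
    cols = sorted(row)
    return [f"{a}={row[a]}&{b}={row[b]}" for a, b in combinations(cols, 2)]


def hash_features(tokens, num_features=16):
    """Hashing trick: add 1.0 at index hash(token) mod num_features for each token."""
    vec = [0.0] * num_features
    for tok in tokens:
        # CRC32 is an illustrative, deterministic stand-in, not Spark's hash function.
        vec[zlib.crc32(tok.encode("utf-8")) % num_features] += 1.0
    return vec


# Hypothetical passenger; the split points are made up for illustration.
passenger = {"sex": "male", "pclass": 3, "age": 22.0}

# Step 1: replace the continuous column by its bucket index.
row = dict(passenger, age=bucketize(passenger["age"], splits=[18.0, 40.0, 60.0]))

# Step 2: cross every pair of columns into a string feature.
crossed = cross_features(row)

# Step 3: hash the crossed features into a fixed-size vector.
vector = hash_features(crossed)
```

The resulting `vector` is what I would like to hand to the logistic regression as the last pipeline stage.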