Hello everyone,

With MLlib, I’m trying to rely mostly on pipelines to build features
from the Titanic dataset and to showcase the power of feature hashing. I
want to:

- Apply bucketization on some columns (QuantileDiscretizer is fine).

- Then cross all my columns with each other to get cross features.

- Then hash all of these cross features into a vector.

- Then give that vector to a logistic regression.
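
To make the crossing and hashing steps concrete, here is a rough plain-Python
sketch of what I have in mind for a single row. It does not use Spark at all:
the function names, the column names (sex, pclass, age_bucket), the vector
size of 16, and the use of md5 as a stable hash are all assumptions made
purely for illustration, not anything taken from MLlib.

```python
import hashlib
from itertools import combinations


def stable_hash(text: str) -> int:
    # md5 is only a stand-in for a hash that is stable across processes.
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)


def hash_cross_features(row: dict, num_features: int = 16) -> list:
    # Start from an all-zero vector of fixed size.
    vec = [0.0] * num_features
    # Cross every pair of columns (sorted so the pair order is deterministic).
    for a, b in combinations(sorted(row), 2):
        token = f"{a}={row[a]}&{b}={row[b]}"
        # Hashing trick: the token's hash picks the slot to increment.
        vec[stable_hash(token) % num_features] += 1.0
    return vec


# Hypothetical row, already bucketized (age_bucket is a bucket index).
row = {"sex": "female", "pclass": 1, "age_bucket": 2}
vec = hash_cross_features(row)
```

With three columns there are three pairs, so the entries of vec sum to 3.0
even if two crosses happen to collide in the same slot. What I am missing is
a pipeline stage that produces the crossed token for each pair of columns.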

Looking at the documentation, it seems that the only way to hash features
is the *FeatureHasher* transformer. It takes multiple columns as input,
whose types can be numeric, boolean, or string (but not vector or array).

But now I’m left wondering how I can create my cross-feature columns. I’m
looking for a transformer that takes two columns as input and returns a
numeric, boolean, or string column. I didn't manage to find anything that
does the job. There are several transformers, such as VectorAssembler,
that operate on vectors, but vector is not a type accepted by the
FeatureHasher.

Of course, I could try to combine columns directly in my DataFrame (before
the pipeline kicks in), but then I would no longer be able to benefit
from QuantileDiscretizer and other useful transformers on those columns.


Am I missing something in the transformer API? Or is my approach to
hashing wrong? Or should we consider extending the API somehow?



Thank you, kind regards,

David
