Handling nulls in vector columns is non-trivial

Franklyn D'souza Wed, 21 Jun 2017 18:04:34 -0700

I just wanted to highlight some of the rough edges around using vectors in
columns in dataframes.


If there is a null in a dataframe column containing vectors, pyspark ml
models like logistic regression will fail outright.

However, from what I've read, there is no good way to fill in these nulls
with empty vectors.

It's not possible to create a literal vector column expression and coalesce
it with the column from pyspark.

So we're left with writing a Python UDF that does this coalesce. This is
really inefficient on large datasets and becomes a bottleneck for ML
pipelines working with real-world data.

I'd like to know how other users are dealing with this and what plans there
are to extend vector support for dataframes.

Thanks!

Franklyn
