Re: Handling nulls in vector columns is non-trivial

2017-06-24 Thread Timur Shenkao
Hi Franklyn, I had the same problem as you with vectors & maps. I tried: 1) UDF --> cumbersome and difficult to maintain: one has to re-write / re-implement UDFs, extensive docs have to be provided for colleagues, and something weird may happen when you migrate to a new Spark version. 2) RDD / DataF
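The UDF approach Timur mentions can be sketched roughly as follows. This is a hedged illustration, not the UDFs his team actually wrote: to keep it runnable without a Spark cluster, a sparse vector is modeled as a plain `(size, {index: value})` tuple, and the "UDF" is just the Python function one would register.

```python
# Hedged sketch of option 1 (a null-filling UDF), modeled without Spark.
# A SparseVector of size n is represented here as (n, {index: value}).

def fill_null_vector(vec, size=10):
    """The function one would wrap in a pyspark UDF: replace a null
    vector with an empty sparse vector of the expected dimension."""
    return vec if vec is not None else (size, {})

# A column of vectors with nulls, as in the thread's example data.
column = [(10, {1: 44.0}), None, (10, {1: 23.0}), None]
filled = [fill_null_vector(v) for v in column]
print(filled[1])  # (10, {})
```

The maintenance cost Timur describes comes from the fact that the real version must also carry the `VectorUDT` return type and be re-registered and re-tested on every Spark upgrade.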

Re: Handling nulls in vector columns is non-trivial

2017-06-23 Thread Franklyn D'souza
As a reference, this is what is required to coalesce a vector column in pyspark. df = sc.sql.createDataFrame([(SparseVector(10,{1:44}),), (None,), (SparseVector(10,{1:23}),), (None,), (SparseVector(10,{1:35}),)], schema=schema) empty_vector = sc.sql.createDataFrame([(SparseVector(10, {}),)], schema=
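The truncated snippet above appears to build a one-row DataFrame holding an empty vector and coalesce the vector column against it. A Spark-free sketch of that coalesce step (vectors again modeled as `(size, {index: value})` tuples; the names are illustrative, not Franklyn's actual code):

```python
def coalesce(value, default):
    """First non-null argument, like SQL COALESCE."""
    return value if value is not None else default

# The same five-row column as in the quoted snippet.
empty_vector = (10, {})
column = [(10, {1: 44.0}), None, (10, {1: 23.0}), None, (10, {1: 35.0})]
coalesced = [coalesce(v, empty_vector) for v in column]
```

In real pyspark this takes noticeably more ceremony, because a vector constant cannot be produced with `lit()` and has to come from a separate DataFrame or a UDF, which is the rough edge the thread is about.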

Re: Handling nulls in vector columns is non-trivial

2017-06-22 Thread Franklyn D'souza
We've developed Scala UDFs internally to address some of these issues and we'd love to upstream them back to Spark. Just trying to figure out what the vector support looks like on the roadmap. Would it be best to put this functionality into the Imputer, VectorAssembler, or maybe try to give it mor

Re: Handling nulls in vector columns is non-trivial

2017-06-21 Thread Franklyn D'souza
From the documentation it states that `The input columns should be of DoubleType or FloatType.`, so I don't think that is what I'm looking for. Also, in general the API around vectors is highly lacking, especially from the pyspark side. Very common vector operations like addition, subtraction, and d
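The missing elementwise operations Franklyn mentions are simple to state; since pyspark's `Vectors` expose no such operators, users end up writing something like this themselves. A hedged sketch of sparse-vector addition in plain Python, with vectors as `(size, {index: value})` tuples:

```python
def sparse_add(a, b):
    """Elementwise sum of two sparse vectors, each (size, {index: value})."""
    size_a, vals_a = a
    size_b, vals_b = b
    assert size_a == size_b, "vectors must have the same dimension"
    out = dict(vals_a)
    for i, v in vals_b.items():
        out[i] = out.get(i, 0.0) + v
    return (size_a, out)

print(sparse_add((10, {1: 44.0}), (10, {1: 23.0, 3: 5.0})))
# (10, {1: 67.0, 3: 5.0})
```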

Re: Handling nulls in vector columns is non-trivial

2017-06-21 Thread Maciej Szymkiewicz
Since 2.2 there is Imputer: https://github.com/apache/spark/blob/branch-2.2/examples/src/main/python/ml/imputer_example.py which should at least partially address the problem. On 06/22/2017 03:03 AM, Franklyn D'souza wrote: > I just wanted to highlight some of the rough edges around using > vect
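Imputer replaces nulls/NaNs in numeric columns with a per-column statistic (the mean by default). A minimal Spark-free sketch of that semantics; the real pyspark API is shown in the linked example, and note it only covers DoubleType/FloatType columns, which is why it does not fully solve the vector case:

```python
def impute_mean(column):
    """Replace None entries with the mean of the non-null entries,
    mirroring what Imputer(strategy='mean') does for one numeric column."""
    observed = [x for x in column if x is not None]
    mean = sum(observed) / len(observed)
    return [mean if x is None else x for x in column]

print(impute_mean([1.0, None, 3.0]))  # [1.0, 2.0, 3.0]
```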

Handling nulls in vector columns is non-trivial

2017-06-21 Thread Franklyn D'souza
I just wanted to highlight some of the rough edges around using vectors in columns in dataframes. If there is a null in a dataframe column containing vectors, pyspark ml models like logistic regression will completely fail. However, from what I've read there is no good way to fill in these nulls wi
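To make the failure mode concrete: a null in a vector column means downstream numeric code receives `None` and raises immediately rather than skipping the row. A toy dot-product over a column containing a null (not Spark's actual internals, just the shape of the failure):

```python
def dot(weights, vec):
    """Weighted sum over a sparse vector (size, {index: value});
    raises TypeError when vec is None, like a model hitting a null row."""
    size, vals = vec
    return sum(weights[i] * v for i, v in vals.items())

weights = [0.5] * 10
rows = [(10, {1: 44.0}), None]
try:
    scores = [dot(weights, r) for r in rows]
except TypeError as exc:
    print("model evaluation fails on the null row:", exc)
```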