Re: VectorAssembler handling null values

2016-04-20 Thread Koert Kuipers
Thanks for that, it's good to know that functionality exists. But shouldn't a decision tree be able to handle missing (aka null) values more intelligently than simply using replacement values? See for example here:

Re: VectorAssembler handling null values

2016-04-20 Thread John Trengrove
You could handle null values in a preprocessing step using the DataFrame.na functions, e.g. DataFrame.na.fill(). For reference: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions John On 21 April 2016 at 03:41, Andres Perez
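To illustrate the suggestion, here is a minimal sketch of that preprocessing step: fill nulls with sentinel values via DataFrame.na.fill before handing the columns to VectorAssembler. The column names ("f2", "f3") and the sentinel values (0, 0.0) are hypothetical, chosen to mirror the kind of schema discussed below.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.{DataFrame, SparkSession}

object NaFillSketch {
  // Replace nulls with per-column sentinel values so VectorAssembler can run.
  // Column names and sentinels here are hypothetical.
  def fillNulls(df: DataFrame): DataFrame =
    df.na.fill(Map("f2" -> 0, "f3" -> 0.0))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("na-fill-sketch").getOrCreate()
    import spark.implicits._

    // None becomes a null cell in the resulting nullable columns.
    val df = Seq(
      (Some(15), Some(3.5)),
      (Some(13), Option.empty[Double]),
      (Option.empty[Int], Some(1.6))
    ).toDF("f2", "f3")

    val assembler = new VectorAssembler()
      .setInputCols(Array("f2", "f3"))
      .setOutputCol("features")

    // Without the fill step, this transform would fail on the rows containing nulls.
    assembler.transform(fillNulls(df)).show()

    spark.stop()
  }
}
```

Note that this trades the null for an in-band value, which is exactly the concern raised above about decision trees: the model cannot distinguish a true 0 from a filled-in missing value.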

Re: VectorAssembler handling null values

2016-04-20 Thread Andres Perez
So the missing data could be on a one-off basis, or from fields that are in general optional, or from, say, a count that is only relevant for certain cases (very sparse):

f1|f2|f3|optF1|optF2|sparseF1
a|15|3.5|cat1|142L|
b|13|2.4|cat2|64L|catA
c|2|1.6|||
d|27|5.1||0|

-Andy On Wed, Apr 20, 2016

Re: VectorAssembler handling null values

2016-04-19 Thread Nick Pentreath
Could you provide an example of what your input data looks like? Supporting missing values in a sparse result vector makes sense. On Tue, 19 Apr 2016 at 23:55, Andres Perez wrote: > Hi everyone. org.apache.spark.ml.feature.VectorAssembler currently cannot > handle null

VectorAssembler handling null values

2016-04-19 Thread Andres Perez
Hi everyone. org.apache.spark.ml.feature.VectorAssembler currently cannot handle null values. This presents a problem for us as we wish to run a decision tree classifier on sometimes sparse data. Is there a particular reason VectorAssembler is implemented in this way, and can anyone recommend the
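For reference, a minimal reproduction of the limitation being described, assuming hypothetical column names ("f1", "f3"): with default settings, assembling a row that contains a null does not fail when transform is called, but when the result is materialized.

```scala
import scala.util.Try
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object NullAssembleRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("null-repro").getOrCreate()
    import spark.implicits._

    // The second row has a null in f3; VectorAssembler has no strategy for it.
    val df = Seq(
      (Some(1.0), Some(2.0)),
      (Some(3.0), Option.empty[Double])
    ).toDF("f1", "f3")

    val assembler = new VectorAssembler()
      .setInputCols(Array("f1", "f3"))
      .setOutputCol("features")

    // transform is lazy; the failure surfaces only when the result is collected.
    val result = Try(assembler.transform(df).collect())
    println(s"assembly succeeded: ${result.isSuccess}")

    spark.stop()
  }
}
```

As an aside for readers of this archive: later Spark releases (2.4+, well after this 2016 thread) added a handleInvalid parameter to VectorAssembler ("error", "skip", "keep"), which addresses part of this, though "keep" still only substitutes NaN rather than treating missingness as a first-class signal.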