You could handle null values in a preprocessing step using the DataFrame.na functions, e.g. DataFrame.na.fill().
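For example, something along these lines (a rough sketch, assuming Spark 2.x with a SparkSession in scope; the column names mirror your sample data below, and the -1 sentinel is purely illustrative):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.ml.feature.VectorAssembler

    val spark = SparkSession.builder().appName("na-fill-example").getOrCreate()
    import spark.implicits._

    // Toy frame mirroring the sparse example quoted below (empty -> null).
    val df = Seq(
      ("a", 15, 3.5, Option(142L)),
      ("b", 13, 2.4, Option(64L)),
      ("c", 2, 1.6, Option.empty[Long]),
      ("d", 27, 5.1, Option(0L))
    ).toDF("f1", "f2", "f3", "optF2")

    // Fill nulls with a per-column sentinel before assembling; for a
    // tree-based model a value outside the real domain can stand in
    // for "missing".
    val filled = df.na.fill(Map("optF2" -> -1L))

    val assembler = new VectorAssembler()
      .setInputCols(Array("f2", "f3", "optF2"))
      .setOutputCol("features")

    assembler.transform(filled).show()

A sentinel like this is reasonable for decision trees, which just split on it; for models sensitive to magnitudes you might instead drop the rows with DataFrame.na.drop() or impute a more meaningful value.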
For reference: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions

John

On 21 April 2016 at 03:41, Andres Perez <and...@tresata.com> wrote:

> so the missing data could be on a one-off basis, or from fields that are
> in general optional, or from, say, a count that is only relevant for
> certain cases (very sparse):
>
> f1|f2|f3|optF1|optF2|sparseF1
> a|15|3.5|cat1|142L|
> b|13|2.4|cat2|64L|catA
> c|2|1.6|||
> d|27|5.1||0|
>
> -Andy
>
> On Wed, Apr 20, 2016 at 1:38 AM, Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>
>> Could you provide an example of what your input data looks like?
>> Supporting missing values in a sparse result vector makes sense.
>>
>> On Tue, 19 Apr 2016 at 23:55, Andres Perez <and...@tresata.com> wrote:
>>
>>> Hi everyone. org.apache.spark.ml.feature.VectorAssembler currently
>>> cannot handle null values. This presents a problem for us as we wish to
>>> run a decision tree classifier on sometimes sparse data. Is there a
>>> particular reason VectorAssembler is implemented in this way, and can
>>> anyone recommend the best path for enabling VectorAssembler to build
>>> vectors for data that will contain empty values?
>>>
>>> Thanks!
>>>
>>> -Andres