You could handle null values with the DataFrame.na functions in a
preprocessing step, e.g. DataFrame.na.fill().

For reference:

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions
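
For example, something like this (an untested sketch; the column names
and fill values are taken from your sample data below and are just
placeholders you would need to adjust):

    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.sql.DataFrame

    // df is assumed to have the columns from your sample:
    // f1|f2|f3|optF1|optF2|sparseF1
    def assembleWithDefaults(df: DataFrame): DataFrame = {
      // Replace nulls with sentinel defaults before assembling.
      val filled = df.na.fill(Map(
        "optF2"    -> 0L,        // assumed default for the optional count
        "sparseF1" -> "missing"  // assumed default for the sparse category
      ))

      // Note: VectorAssembler only accepts numeric, boolean, and vector
      // columns, so string fields like optF1/sparseF1 would first need a
      // StringIndexer (and possibly OneHotEncoder) stage.
      new VectorAssembler()
        .setInputCols(Array("f2", "f3", "optF2"))
        .setOutputCol("features")
        .transform(filled)
    }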

John

On 21 April 2016 at 03:41, Andres Perez <and...@tresata.com> wrote:

> So the missing data could occur on a one-off basis, come from fields
> that are optional in general, or come from, say, a count that is only
> relevant in certain cases (very sparse):
>
> f1|f2|f3|optF1|optF2|sparseF1
> a|15|3.5|cat1|142L|
> b|13|2.4|cat2|64L|catA
> c|2|1.6|||
> d|27|5.1||0|
>
> -Andy
>
> On Wed, Apr 20, 2016 at 1:38 AM, Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>
>> Could you provide an example of what your input data looks like?
>> Supporting missing values in a sparse result vector makes sense.
>>
>> On Tue, 19 Apr 2016 at 23:55, Andres Perez <and...@tresata.com> wrote:
>>
>>> Hi everyone. org.apache.spark.ml.feature.VectorAssembler currently
>>> cannot handle null values. This presents a problem for us, as we wish to
>>> run a decision tree classifier on sometimes-sparse data. Is there a
>>> particular reason VectorAssembler is implemented this way, and can anyone
>>> recommend the best path toward enabling VectorAssembler to build vectors
>>> for data that will contain empty values?
>>>
>>> Thanks!
>>>
>>> -Andres
>>>
>>>
>
