In the forum mentioned above, the following solution is suggested:

The problem is in lines 113 and 114 of QuantileDiscretizer.scala and can be
fixed by changing line 113 as follows:
before: val requiredSamples = math.max(numBins * numBins, 10000)
after:  val requiredSamples = math.max(numBins * numBins, 10000.0)
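For context, the one-character fix seems to work because with the integer
literal 10000, requiredSamples is an Int, and the later division by the row
count (a Long) is integer division, which truncates to 0 for any dataset
larger than requiredSamples rows — so the sampling fraction collapses to
zero. A minimal Scala sketch of the arithmetic (numBins and the row count
below are illustrative, not taken from the Spark source):

```scala
val numBins = 5
val count = 2500000L // illustrative dataset row count

// As written: requiredSamples is an Int, so dividing by a Long row count
// performs integer division and truncates 10000 / 2500000 to 0.
val requiredSamplesInt = math.max(numBins * numBins, 10000)
val fractionBad = math.min(requiredSamplesInt / count, 1.0) // 0.0

// With the 10000.0 literal, requiredSamples is a Double, the division is
// floating-point, and the sampling fraction stays non-zero.
val requiredSamplesDouble = math.max(numBins * numBins, 10000.0)
val fractionGood = math.min(requiredSamplesDouble / count, 1.0) // 0.004
```

With a sampling fraction of 0.0, the discretizer sees (almost) no data and
degenerate split points, which matches the two-bucket behaviour reported
below.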

Is there another way?
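One possible workaround, rather than patching Spark itself, is to compute
the split points yourself and hand them to Bucketizer, bypassing
QuantileDiscretizer's sampling entirely. A sketch, assuming Spark 2.0+ is
available (DataFrame.stat.approxQuantile was added in 2.0; the column name
"C4" and dataframe df3 are taken from the code quoted below, and the 0.001
relative error is an arbitrary choice):

```scala
import org.apache.spark.ml.feature.Bucketizer

// Approximate quintile boundaries, computed over the full column rather
// than a (possibly empty) sample.
val probs = Array(0.2, 0.4, 0.6, 0.8)
val quantiles = df3.stat.approxQuantile("C4", probs, 0.001)

// Bucketizer requires strictly increasing splits bracketed by
// -Infinity/+Infinity; distinct guards against duplicate quantiles.
val splits =
  (Double.NegativeInfinity +: quantiles.distinct.sorted) :+ Double.PositiveInfinity

val bucketizer = new Bucketizer()
  .setInputCol("C4")
  .setOutputCol("C4_Q")
  .setSplits(splits)

val result = bucketizer.transform(df3)
```

Note that if the column has few distinct values, distinct may leave fewer
than five buckets, which is the honest answer rather than a bug.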


2016-07-11 18:28 GMT-04:00 Pasquinell Urbani <
pasquinell.urb...@exalitica.com>:

> Hi all,
>
> We have a dataframe with 2.5 million records and 13 features. We want
> to perform a logistic regression on this data, but first we need to
> discretize each column using QuantileDiscretizer. This should
> improve the model's performance by reducing the influence of outliers.
>
> For small dataframes QuantileDiscretizer works perfectly (see the ml
> example:
> https://spark.apache.org/docs/1.6.0/ml-features.html#quantilediscretizer),
> but for large dataframes it tends to split the column into only the values 0
> and 1 (even though the number of buckets is set to 5). Here is my
> code:
>
> val discretizer = new QuantileDiscretizer()
>   .setInputCol("C4")
>   .setOutputCol("C4_Q")
>   .setNumBuckets(5)
>
> val result = discretizer.fit(df3).transform(df3)
> result.show()
>
> I found the same problem reported here:
> https://issues.apache.org/jira/browse/SPARK-13444, but there is no
> solution yet.
>
> Am I configuring the function incorrectly? Should I pre-process the
> data (e.g. with z-scores)? Can somebody help me with this?
>
> Regards
>
