In the forum mentioned above, the following solution is suggested. The problem is in lines 113 and 114 of QuantileDiscretizer.scala and can be fixed by changing line 113 like so:

before: val requiredSamples = math.max(numBins * numBins, 10000)
after:  val requiredSamples = math.max(numBins * numBins, 10000.0)
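To see why the Int-to-Double change matters: if the code later divides requiredSamples by the row count to get a sampling fraction (as the JIRA discussion suggests), an Int numerator makes that integer division, which truncates to 0 for any dataset larger than requiredSamples. A minimal sketch, with no Spark needed, assuming the fraction is computed as requiredSamples / count (the division itself is illustrative; only the before/after lines come from the patch):

    object SamplingFractionDemo {
      def main(args: Array[String]): Unit = {
        val numBins = 5
        val count = 2500000L // roughly the dataframe size from the question

        // Before: requiredSamples is inferred as Int, so dividing by a Long
        // row count is integer division and truncates to 0.
        val requiredSamplesInt = math.max(numBins * numBins, 10000)
        val fractionBefore = math.min(requiredSamplesInt / count, 1.0)
        println(fractionBefore) // 0.0 -- almost nothing gets sampled

        // After: the 10000.0 literal makes requiredSamples a Double, so the
        // division is floating-point and yields a usable fraction.
        val requiredSamplesDouble = math.max(numBins * numBins, 10000.0)
        val fractionAfter = math.min(requiredSamplesDouble / count, 1.0)
        println(fractionAfter) // 0.004
      }
    }

With a 0.0 fraction the quantile summary is built from a near-empty sample, which would explain the collapse to only two buckets on large dataframes.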
Is there another way?

2016-07-11 18:28 GMT-04:00 Pasquinell Urbani <pasquinell.urb...@exalitica.com>:

> Hi all,
>
> We have a dataframe with 2.5 million records and 13 features. We want
> to perform a logistic regression with this data, but first we need to
> divide each column into discrete values using QuantileDiscretizer. This
> will improve the performance of the model by reducing the influence of
> outliers.
>
> For small dataframes QuantileDiscretizer works perfectly (see the ml
> example:
> https://spark.apache.org/docs/1.6.0/ml-features.html#quantilediscretizer),
> but for large dataframes it tends to split the column into only the
> values 0 and 1, even though the number of buckets is set to 5. Here is
> my code:
>
> val discretizer = new QuantileDiscretizer()
>   .setInputCol("C4")
>   .setOutputCol("C4_Q")
>   .setNumBuckets(5)
>
> val result = discretizer.fit(df3).transform(df3)
> result.show()
>
> I found the same problem reported here:
> https://issues.apache.org/jira/browse/SPARK-13444 . But there is no
> solution yet.
>
> Am I configuring the function incorrectly? Should I pre-process the
> data (e.g. with z-scores)? Can somebody help me deal with this?
>
> Regards
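Until a patched release is available, one way to sidestep the buggy sampling path is to compute the split points yourself and hand them to Bucketizer, which does no sampling. A rough sketch; `df3` and column "C4" are taken from the question, and the percentile computation below is illustrative only, not an efficient production approach (on Spark 2.0+ you could use df3.stat.approxQuantile instead):

    import org.apache.spark.ml.feature.Bucketizer

    val col = "C4"
    val numBuckets = 5

    // Approximate the quantile boundaries from a sorted sample of the column.
    val sample = df3.select(col).sample(withReplacement = false, 0.01)
      .rdd.map(_.getDouble(0)).collect().sorted
    val innerSplits = (1 until numBuckets).map { i =>
      sample((i * sample.length) / numBuckets)
    }.distinct.toArray // Bucketizer requires strictly increasing splits

    val splits =
      Double.NegativeInfinity +: innerSplits :+ Double.PositiveInfinity

    val bucketizer = new Bucketizer()
      .setInputCol(col)
      .setOutputCol(col + "_Q")
      .setSplits(splits)

    val result = bucketizer.transform(df3)
    result.show()

Since the splits are computed explicitly, you also get to inspect them before bucketing, which makes it easy to check whether the quantiles themselves are degenerate.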