You can use CrossValidator or TrainValidationSplit together with
ParamGridBuilder and an Evaluator to choose model hyperparameters (e.g.
numFeatures) empirically, as described here:
http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation
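You can refer to the following code snippet to set numFeatures for HashingTF:
    import org.apache.spark.ml.feature.HashingTF

    // n is the number of hash buckets, i.e. the dimension of the feature vector
    val hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("features")
      .setNumFeatures(n)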
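To tie this to the cross-validation approach mentioned above, here is a minimal sketch of a parameter grid over numFeatures. It assumes a text-classification pipeline with a "text" input column and a binary "label" column, a LogisticRegression classifier, and three arbitrary candidate values; none of these come from the original question, so adjust them to your data. The `training` DataFrame is hypothetical and the fit call is left commented out.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Assumed pipeline: tokenize -> hash terms -> logistic regression.
val tokenizer = new Tokenizer()
  .setInputCol("text")   // assumed input column name
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Candidate values for numFeatures (arbitrary choices for illustration).
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(1 << 10, 1 << 14, 1 << 18))
  .build()

// Each candidate is scored by k-fold cross-validation with the evaluator.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// `training` is a hypothetical DataFrame with "text" and "label" columns:
// val cvModel = cv.fit(training)
// cvModel.bestModel then holds the pipeline fitted with the best numFeatures.
```

Swapping CrossValidator for TrainValidationSplit (with setTrainRatio instead of setNumFolds) evaluates each candidate only once, which is cheaper but noisier.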
2015-10-16 0:17 GMT+08:00 Nick Pentreath:
Setting numFeatures higher than the vocabulary size will tend to reduce the
chance of hash collisions, but it's not strictly necessary; it becomes a
memory / accuracy trade-off.
Surprisingly, the impact of moderate hash collisions on model performance is
often not significant.
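As a rough illustration of how the collision count falls as numFeatures grows, here is a small plain-Scala sketch. It uses String.hashCode with a non-negative modulo as a stand-in for the hashing step; this is an assumption for illustration and not necessarily the hash function HashingTF uses in your Spark version. The `collisions` helper and the synthetic vocabulary are hypothetical.

```scala
// Map a possibly negative hash code into the range [0, mod).
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw
}

// Hypothetical helper: how many distinct terms land in a bucket that
// is already occupied by another distinct term.
def collisions(vocab: Seq[String], numFeatures: Int): Int = {
  val terms = vocab.distinct
  val buckets = terms.map(t => nonNegativeMod(t.hashCode, numFeatures)).toSet
  terms.size - buckets.size
}

// Synthetic vocabulary of 5000 distinct terms (made up for the example).
val vocab = (1 to 5000).map(i => s"word$i")

for (n <- Seq(1 << 10, 1 << 12, 1 << 16, 1 << 20)) {
  println(s"numFeatures=$n collisions=${collisions(vocab, n)}")
}
```

With 1024 buckets and 5000 terms most terms must share a bucket; larger power-of-two sizes can only split buckets further, at the cost of a wider feature vector.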
So it may