Re: How to specify the numFeatures in HashingTF

2016-01-02 Thread Chris Fregly
You can use CrossValidator/TrainValidationSplit with ParamGridBuilder and an Evaluator to empirically choose the model hyperparameters (e.g. numFeatures), per the following: http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation
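A minimal sketch of that approach, in the spirit of the linked guide. The column names ("text", "label"), the LogisticRegression stage, the candidate values, and the object and method names are all assumptions for illustration, not something taken from this thread:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder}
import org.apache.spark.sql.DataFrame

object NumFeaturesTuning {
  // Hypothetical candidate sizes; powers of two are a common choice.
  val candidates: Array[Int] = Array(1 << 10, 1 << 14, 1 << 18)

  def buildCrossValidator(): CrossValidator = {
    // Assumed input schema: a string column "text" and a numeric column "label".
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // One ParamMap per candidate numFeatures value.
    val paramGrid = new ParamGridBuilder()
      .addGrid(hashingTF.numFeatures, candidates)
      .build()

    new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)
  }

  // Fits one pipeline per (fold, candidate) pair and keeps the best-scoring setting.
  def tune(training: DataFrame): CrossValidatorModel =
    buildCrossValidator().fit(training)
}
```

Calling NumFeaturesTuning.tune(trainingDF) returns a CrossValidatorModel whose bestModel is the pipeline refit with the winning numFeatures. TrainValidationSplit can be swapped in the same way when a single train/validation split is enough.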

Re: How to specify the numFeatures in HashingTF

2016-01-01 Thread Yanbo Liang
You can refer to the following code snippet to set numFeatures for HashingTF:

    val hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("features")
      .setNumFeatures(n)

2015-10-16 0:17 GMT+08:00 Nick Pentreath :
> Setting the numfeatures higher

Re: How to specify the numFeatures in HashingTF

2015-10-15 Thread Nick Pentreath
Setting numFeatures higher than the vocabulary size will tend to reduce the chance of hash collisions, but it's not strictly necessary - it becomes a memory/accuracy trade-off. Surprisingly, the impact of moderate hash collisions on model performance is often not significant. So it may
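To get a rough feel for how the collision count falls as numFeatures grows, here is a small self-contained sketch in plain Scala. It uses scala.util.hashing.MurmurHash3 as a stand-in hash, which is an assumption: it is not guaranteed to match the hash HashingTF uses internally, so the counts are only indicative. The object name and the synthetic vocabulary are hypothetical.

```scala
import scala.util.hashing.MurmurHash3

object HashCollisionSketch {
  // Map a possibly negative hash code into the range [0, mod).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val r = x % mod
    if (r < 0) r + mod else r
  }

  // Number of distinct terms that end up sharing a bucket with another term:
  // (distinct terms) - (distinct buckets actually used).
  def collisions(vocab: Seq[String], numFeatures: Int): Int = {
    val terms = vocab.distinct
    val usedBuckets = terms
      .map(t => nonNegativeMod(MurmurHash3.stringHash(t), numFeatures))
      .distinct
      .size
    terms.size - usedBuckets
  }

  def main(args: Array[String]): Unit = {
    // Synthetic vocabulary of 10,000 distinct terms (illustrative only).
    val vocab = (0 until 10000).map(i => s"term$i")
    for (n <- Seq(1 << 10, 1 << 14, 1 << 18, 1 << 20))
      println(s"numFeatures=$n collisions=${collisions(vocab, n)}")
  }
}
```

When numFeatures is smaller than the vocabulary, at least (vocabulary size - numFeatures) collisions are unavoidable; above that, collisions become rarer but memory per feature vector grows, which is the trade-off described above.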