Re: How to specify the numFeatures in HashingTF

2016-01-02 Thread Chris Fregly
You can use CrossValidator/TrainValidationSplit with ParamGridBuilder
and an Evaluator to choose the model hyperparameters (e.g., numFeatures)
empirically, per the following:

http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation

http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-train-validation-split
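
A minimal sketch of that approach, assuming a text-classification pipeline of
Tokenizer, HashingTF, and LogisticRegression (the column names, grid values,
and `trainingDF` are illustrative, not from this thread):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression()
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Treat numFeatures as a hyperparameter: try several sizes and let
// cross-validation pick the one with the best evaluation metric.
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(1 << 14, 1 << 16, 1 << 18))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// val cvModel = cv.fit(trainingDF)  // trainingDF: your labeled DataFrame
```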

On Fri, Jan 1, 2016 at 7:48 AM, Yanbo Liang  wrote:

> You can refer to the following code snippet to set numFeatures for HashingTF:
>
> val hashingTF = new HashingTF()
>   .setInputCol("words")
>   .setOutputCol("features")
>   .setNumFeatures(n)
>
>
> 2015-10-16 0:17 GMT+08:00 Nick Pentreath :
>
>> Setting numFeatures higher than the vocabulary size will tend to reduce
>> the chance of hash collisions, but it's not strictly necessary - it
>> becomes a memory / accuracy trade-off.
>>
>> Surprisingly, the impact of moderate hash collisions on model performance
>> is often not significant.
>>
>> So it may be worth trying a few settings (lower than the vocabulary size,
>> higher, etc.) and seeing what the impact is on the evaluation metrics.
>>
>> —
>> Sent from Mailbox 
>>
>>
>> On Thu, Oct 15, 2015 at 5:46 PM, Jianguo Li 
>> wrote:
>>
>>> Hi,
>>>
>>> There is a parameter in HashingTF called "numFeatures". I was
>>> wondering what the best way is to set its value. In the use case of
>>> text categorization, do you need to know the number of words in your
>>> vocabulary in advance, or do you set it to a large value, greater
>>> than the number of words in your vocabulary?
>>>
>>> Thanks,
>>>
>>> Jianguo
>>>
>>
>>
>


-- 

*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com


Re: How to specify the numFeatures in HashingTF

2016-01-01 Thread Yanbo Liang
You can refer to the following code snippet to set numFeatures for HashingTF:

val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setNumFeatures(n)


2015-10-16 0:17 GMT+08:00 Nick Pentreath :

> Setting numFeatures higher than the vocabulary size will tend to reduce
> the chance of hash collisions, but it's not strictly necessary - it
> becomes a memory / accuracy trade-off.
>
> Surprisingly, the impact of moderate hash collisions on model performance
> is often not significant.
>
> So it may be worth trying a few settings (lower than the vocabulary size,
> higher, etc.) and seeing what the impact is on the evaluation metrics.
>
> —
> Sent from Mailbox 
>
>
> On Thu, Oct 15, 2015 at 5:46 PM, Jianguo Li 
> wrote:
>
>> Hi,
>>
>> There is a parameter in HashingTF called "numFeatures". I was
>> wondering what the best way is to set its value. In the use case of
>> text categorization, do you need to know the number of words in your
>> vocabulary in advance, or do you set it to a large value, greater than
>> the number of words in your vocabulary?
>>
>> Thanks,
>>
>> Jianguo
>>
>
>


Re: How to specify the numFeatures in HashingTF

2015-10-15 Thread Nick Pentreath
Setting numFeatures higher than the vocabulary size will tend to reduce the
chance of hash collisions, but it's not strictly necessary - it becomes a
memory / accuracy trade-off.

Surprisingly, the impact of moderate hash collisions on model performance is
often not significant.

So it may be worth trying a few settings (lower than the vocabulary size,
higher, etc.) and seeing what the impact is on the evaluation metrics.

—
Sent from Mailbox

On Thu, Oct 15, 2015 at 5:46 PM, Jianguo Li 
wrote:

> Hi,
> There is a parameter in HashingTF called "numFeatures". I was wondering
> what the best way is to set its value. In the use case of text
> categorization, do you need to know the number of words in your vocabulary
> in advance, or do you set it to a large value, greater than the number of
> words in your vocabulary?
> Thanks,
> Jianguo
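
The memory / accuracy trade-off Nick describes can be illustrated without
Spark. The sketch below mimics HashingTF-style bucketing (a non-negative
modulo of the term's hashCode) and counts how many distinct feature indices a
small vocabulary occupies at different numFeatures settings; the vocabulary
and sizes are made up for illustration:

```scala
object HashingDemo {
  // Bucket a term the way a hashing TF scheme does:
  // non-negative hashCode modulo numFeatures.
  def bucket(term: String, numFeatures: Int): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // How many distinct feature indices does this vocabulary occupy?
  // Fewer than vocab.size means some terms collided.
  def distinctBuckets(vocab: Seq[String], numFeatures: Int): Int =
    vocab.map(bucket(_, numFeatures)).distinct.size

  def main(args: Array[String]): Unit = {
    val vocab = Seq("spark", "hashing", "feature", "vector", "text",
                    "model", "train", "test", "word", "count")
    for (n <- Seq(4, 16, 1 << 10)) {
      val used = distinctBuckets(vocab, n)
      println(s"numFeatures=$n: $used/${vocab.size} distinct buckets")
    }
  }
}
```

With numFeatures smaller than the vocabulary, collisions are unavoidable
(pigeonhole principle); as numFeatures grows, terms increasingly map to
unique indices, at the cost of a larger feature vector.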