Re: How to specify the numFeatures in HashingTF

2016-01-02 Thread Chris Fregly
You can use CrossValidator / TrainValidationSplit with ParamGridBuilder
and an Evaluator to empirically choose model hyperparameters (such as
numFeatures) per the following guides; a short sketch follows the links:

http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation

http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-train-validation-split
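
As an illustration only, here is a minimal sketch of that approach. It
assumes a typical tokenizer -> HashingTF -> LogisticRegression pipeline and
a "training" DataFrame with "text" and "label" columns; the column names,
stages, and candidate grid values are assumptions, not something prescribed
by the guides above.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Candidate numFeatures values to search over (2^10, 2^16, 2^20).
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(1024, 65536, 1048576))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

// CrossValidator re-fits the pipeline for each parameter combination and
// keeps the one with the best cross-validated evaluator metric.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(training)  // "training" is the assumed labeled DataFrame

TrainValidationSplit can be dropped in for CrossValidator in the same way
when a single train/validation split is sufficient.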

On Fri, Jan 1, 2016 at 7:48 AM, Yanbo Liang  wrote:

> You can refer to the following code snippet to set numFeatures for HashingTF:
>
> val hashingTF = new HashingTF()
>   .setInputCol("words")
>   .setOutputCol("features")
>   .setNumFeatures(n)
>
>
> 2015-10-16 0:17 GMT+08:00 Nick Pentreath :
>
>> Setting numFeatures higher than the vocabulary size will tend to reduce
>> the chance of hash collisions, but it's not strictly necessary; it becomes
>> a memory / accuracy trade-off.
>>
>> Surprisingly, moderate hash collisions often have no significant impact
>> on model performance.
>>
>> So it may be worth trying a few settings (lower than the vocabulary size,
>> higher, etc.) and seeing what the impact is on your evaluation metrics.
>>
>> On Thu, Oct 15, 2015 at 5:46 PM, Jianguo Li 
>> wrote:
>>
>>> Hi,
>>>
>>> There is a parameter in HashingTF called "numFeatures". I was wondering
>>> what the best way is to set the value of this parameter. In the use case
>>> of text categorization, do you need to know the number of words in your
>>> vocabulary in advance? Or do you set it to a large value, greater than
>>> the number of words in your vocabulary?
>>>
>>> Thanks,
>>>
>>> Jianguo
>>>
>>
>>
>


-- 

*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com


Re: How to specify the numFeatures in HashingTF

2016-01-01 Thread Yanbo Liang
You can refer to the following code snippet to set numFeatures for HashingTF:

val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setNumFeatures(n)
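
As a minimal usage sketch (illustrative only): assuming "wordsData" is a
DataFrame with a tokenized "words" column, e.g. the output of Tokenizer, the
transformer adds the hashed term-frequency vectors as a "features" column:

// Hash each document's words into an n-dimensional term-frequency vector.
val featurized = hashingTF.transform(wordsData)
featurized.select("words", "features").show(5)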


2015-10-16 0:17 GMT+08:00 Nick Pentreath :

> Setting numFeatures higher than the vocabulary size will tend to reduce
> the chance of hash collisions, but it's not strictly necessary; it becomes
> a memory / accuracy trade-off.
>
> Surprisingly, moderate hash collisions often have no significant impact
> on model performance.
>
> So it may be worth trying a few settings (lower than the vocabulary size,
> higher, etc.) and seeing what the impact is on your evaluation metrics.
>
> On Thu, Oct 15, 2015 at 5:46 PM, Jianguo Li 
> wrote:
>
>> Hi,
>>
>> There is a parameter in HashingTF called "numFeatures". I was wondering
>> what the best way is to set the value of this parameter. In the use case
>> of text categorization, do you need to know the number of words in your
>> vocabulary in advance? Or do you set it to a large value, greater than
>> the number of words in your vocabulary?
>>
>> Thanks,
>>
>> Jianguo
>>
>
>


How to specify the numFeatures in HashingTF

2015-10-15 Thread Jianguo Li
Hi,

There is a parameter in HashingTF called "numFeatures". I was wondering
what the best way is to set the value of this parameter. In the use case of
text categorization, do you need to know the number of words in your
vocabulary in advance? Or do you set it to a large value, greater than the
number of words in your vocabulary?

Thanks,

Jianguo


Re: How to specify the numFeatures in HashingTF

2015-10-15 Thread Nick Pentreath
Setting numFeatures higher than the vocabulary size will tend to reduce the
chance of hash collisions, but it's not strictly necessary; it becomes a
memory / accuracy trade-off.

Surprisingly, moderate hash collisions often have no significant impact on
model performance.

So it may be worth trying a few settings (lower than the vocabulary size,
higher, etc.) and seeing what the impact is on your evaluation metrics.
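
A minimal sketch of such a sweep, assuming a pipeline whose stages include
the hashingTF transformer, plus hypothetical "training" and "test"
DataFrames; the candidate values and metric are illustrative:

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// F1 on a held-out test set; swap in whichever evaluator suits your task.
val evaluator = new MulticlassClassificationEvaluator().setMetricName("f1")

for (n <- Seq(1024, 65536, 1048576)) {
  hashingTF.setNumFeatures(n)              // update the stage, then re-fit
  val model = pipeline.fit(training)
  val f1 = evaluator.evaluate(model.transform(test))
  println(s"numFeatures = $n -> f1 = $f1")
}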

On Thu, Oct 15, 2015 at 5:46 PM, Jianguo Li 
wrote:

> Hi,
> There is a parameter in HashingTF called "numFeatures". I was wondering
> what the best way is to set the value of this parameter. In the use case of
> text categorization, do you need to know the number of words in your
> vocabulary in advance? Or do you set it to a large value, greater than the
> number of words in your vocabulary?
> Thanks,
> Jianguo