Yes, I agree. Maybe we can add a prune step or a minSupport parameter
to prune with. But then again, a lot depends on the tokenizer used.
Numeral-plus-string-literal combinations, like say 100-sanfrancisco-ugs,
show up a lot in the Wikipedia data, and they add more to the feature
count than English words do.
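
A minimal sketch of what such a minSupport prune could look like,
assuming a token -> count map as input (the names prune_features and
min_support are hypothetical, not anything in Mahout):

```python
from collections import Counter

def prune_features(token_counts, min_support=2):
    """Drop features seen fewer than min_support times (e.g. singletons)."""
    return {tok: n for tok, n in token_counts.items() if n >= min_support}

# Toy example: a one-off numeral/literal mashup like "100-sanfrancisco-ugs"
# falls below the support threshold and is pruned, while repeated English
# words survive.
counts = Counter(["the", "the", "city", "city", "100-sanfrancisco-ugs"])
pruned = prune_features(counts, min_support=2)
```

Even min_support=2 would strip out all the singleton features, which
should cut that 31 million feature count down substantially.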

Robin

On Wed, Jul 22, 2009 at 11:41 PM, Ted Dunning<[email protected]> wrote:
> I could be mis-reading this, but it looks like you are saying that you have
> 31 million features.  That is, to put it mildly, a bit absurd.  Something is
> whacked to get that many features.  At the very least, singletons should not
> be used as features.
>
> On Wed, Jul 22, 2009 at 9:14 AM, Grant Ingersoll <[email protected]>wrote:
>
>>  Where are the <label,feature> values stored?
>>>>
>>>>
>>> tf-Idf Folder part-****
>>>
>>
>> That's 1.28 GB.  Count: 31,216,595
>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>