Yes, I agree. Maybe we can add a prune step, or a minSupport parameter that controls pruning. But then again, a lot depends on the tokenizer used. Numeral-plus-string combinations like, say, 100-sanfrancisco-ugs show up a lot in the Wikipedia data, and they add more to the feature count than English words do.
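Something like the following could do the dictionary-side prune. This is just a sketch, not actual Mahout code: the `minSupport` threshold and the regex for dropping digit-bearing tokens are assumptions for illustration.

```java
import java.util.*;
import java.util.regex.Pattern;

public class MinSupportPrune {

    // Hypothetical filter: drop any token containing a digit,
    // e.g. "100-sanfrancisco-ugs" style Wikipedia artifacts.
    private static final Pattern HAS_DIGIT = Pattern.compile(".*\\d.*");

    // Keep only tokens seen at least minSupport times; this would
    // also remove singletons when minSupport >= 2.
    static Map<String, Integer> prune(List<String> tokens, int minSupport) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens) {
            if (HAS_DIGIT.matcher(t).matches()) {
                continue; // skip numeral/string combinations entirely
            }
            counts.merge(t, 1, Integer::sum);
        }
        counts.values().removeIf(c -> c < minSupport);
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
            "apache", "mahout", "apache", "100-sanfrancisco-ugs", "singleton");
        // Only "apache" survives: it appears twice, the digit token is
        // filtered, and the singletons fall below minSupport = 2.
        System.out.println(prune(tokens, 2)); // prints {apache=2}
    }
}
```

With minSupport of 2 or 3 the long tail of singleton features (most of the 31 million) would disappear before tf-idf is computed.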
Robin

On Wed, Jul 22, 2009 at 11:41 PM, Ted Dunning <[email protected]> wrote:
> I could be mis-reading this, but it looks like you are saying that you have
> 31 million features. That is, to put it mildly, a bit absurd. Something is
> whacked to get that many features. At the very least, singletons should not
> be used as features.
>
> On Wed, Jul 22, 2009 at 9:14 AM, Grant Ingersoll <[email protected]> wrote:
>
>>> Where are the <label,feature> values stored?
>>>
>>> tf-Idf Folder part-****
>>
>> That's 1.28 GB. Count: 31,216,595
>
> --
> Ted Dunning, CTO
> DeepDyve
