Yes, this makes sense to me. I think I'll just keep all words, including
stop words, and if performance ever becomes an issue, I'll look at
bigrams again. But I think there's a good chance that I'll never see
significant impact either way.
Thanks guys!
Grant Ingersoll wrote:
Yep, still good reasons like I said, but becoming less important as
the hardware, etc. gets faster and cheaper, IMO, especially in the
context of more advanced search capabilities.
On Mar 3, 2008, at 10:49 AM, Mathieu Lecarme wrote:
Not sure, you might want to ask on Nutch. From a strict language
standpoint, the notion of a stopword in my mind is a bit dubious.
If the word really has no meaning, then why does the language have
it to begin with? In a search context, it has been treated as of
minimal use in the early days mostly because of space and memory
considerations. Now a days, as we get more sophisticated in our
search capabilities, I think it can be useful for doing better
phrase matching, etc. as well as more advanced NLP search. Now it
seems like the general response is disk is cheap, why throw away
information?
To limit writing on disk, to simplify merge ?
I don't know the ratio of stop word in current texts.
M.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]