I have some basic questions about Nutch. Can someone point me in the
right direction or, if you have time, just blast out an answer?

Question One:
I can see the terms that come from the web page. Is there a way to
also record in the index which HTML element each term came from? In
other words, if "ice cream" came from an <h1> tag, I want to know.
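
To make Question One concrete, here is a rough sketch of the kind of
information I would like to end up with. It is plain Python using only
the standard library, not Nutch or Lucene code; the names
TagAwareTermExtractor and extract_terms are made up for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed
# onto the stack of open elements.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagAwareTermExtractor(HTMLParser):
    """Collects (term, enclosing_tag) pairs while parsing HTML."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements, innermost last
        self.terms = []   # (lowercased term, innermost enclosing tag)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        enclosing = self.stack[-1] if self.stack else None
        for word in data.split():
            self.terms.append((word.lower(), enclosing))


def extract_terms(html):
    parser = TagAwareTermExtractor()
    parser.feed(html)
    parser.close()
    return parser.terms


print(extract_terms("<h1>Ice Cream</h1><p>is cold</p>"))
```

What I am asking is whether the index can carry that second piece of
information (the enclosing tag) alongside each term.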

Question Two:
"Ice Cream" is really two words, so in the index it will be stored as
two entries. How can I tell Nutch (Lucene) that this and other phrases
should be treated as one token? I assume I will need to supply a
dictionary of these terms, but is it possible, and if so, how?
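
As an illustration of the behavior I am after in Question Two, here is
a minimal sketch in plain Python. It is not Nutch or Lucene code, and
merge_phrases is a hypothetical helper name; it just does a greedy
longest match of the token stream against a phrase dictionary.

```python
def merge_phrases(tokens, phrases):
    """Join runs of tokens that form a dictionary phrase into one token.

    Matching is case-insensitive and greedy: at each position the
    longest dictionary phrase wins.
    """
    phrase_set = {tuple(p.lower().split()) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=1)

    out = []
    i = 0
    while i < len(tokens):
        # Try the longest possible phrase first, down to two words.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in phrase_set:
                out.append(" ".join(candidate))
                i += n
                break
        else:
            # No multi-word phrase starts here; keep the single token.
            out.append(tokens[i].lower())
            i += 1
    return out


print(merge_phrases("I like Ice Cream a lot".split(), ["ice cream"]))
```

With the dictionary ["ice cream"], the two tokens "Ice" and "Cream"
come out as the single token "ice cream"; everything else is left as
individual tokens.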

Question Three (I will start hunting for this):
I have not looked into this yet, but since I am asking questions: how
can I add more stop words to the stop word list?

Question Four (I will start hunting for this):
Last one, I promise. About the indexes themselves: is there a written
explanation of each of the fields in the index?


Thanks for the help
Ray
