Original tags, attribute defs, multiword tokens, how is this done.

Lukas, Ray Tue, 17 Mar 2009 07:04:38 -0700

 
I have some basic questions about Nutch. Can someone point me in the
right direction or, if you have time, maybe just blast out an answer?


Question One:
I can see the terms that come from the web page. Can I also add the
tags they originally came from to the index? In other words, if "ice
cream" came from an <h1> tag, I want to know that.

Question Two:
"Ice Cream" is really one term, but it is two words, so in the index it
will be stored as two entries. How can I tell Nutch (Lucene) that this
and other multiword terms are to be treated as one token? I know that
somehow I will need to supply a dictionary of these terms, but is it
possible, and if so, how?

Question Three (I will start hunting for this):
I have not hunted around for this yet, but since I am asking
questions: how can I add more stop words to the stop word list?

Question Four (I will start hunting for this):
Last one, I promise. About the indexes themselves: is there a written
explanation of each of the fields in the index?


Thanks for the help
Ray
