Jack Tang wrote:
> Hello cao,
> 
> I tried Chinese specified tokenizer today. And it was so odd, and I
> could not get the query result in nutch either.
> So, I think maybe there are some differences between Nutch's query and
> Luke's query.
> Anyone can explain?
> 
> 
> Thanks, 
>  /Jack

Hello Jack,

I believe there is a mismatch somewhere between the way the standard
Nutch analyzer produces index terms, and the way the query is re-written
into terms.
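
To illustrate what such a mismatch looks like, here is a toy sketch in
plain Java. It does not use Nutch or Lucene at all: the two tokenizers
(bigrams, unigrams) and the class name are made up for the example, and
your real analyzer will behave differently. The point is only that a
query finds nothing when its terms are produced differently from the
terms in the index.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TermMismatchDemo {

    // Hypothetical index-time tokenizer: overlapping character bigrams.
    static List<String> bigrams(String text) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            terms.add(text.substring(i, i + 2));
        }
        return terms;
    }

    // Hypothetical query-time tokenizer: one term per character.
    static List<String> unigrams(String text) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            terms.add(text.substring(i, i + 1));
        }
        return terms;
    }

    // Simplified matching rule: every query term must exist in the index.
    static boolean allTermsIndexed(Set<String> index, List<String> queryTerms) {
        return index.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        // Four CJK characters, written as escapes to keep the source ASCII.
        String document = "\u4e2d\u6587\u5206\u8bcd";
        String query = "\u4e2d\u6587"; // the first two characters

        Set<String> index = new HashSet<>(bigrams(document));

        // Query tokenized differently from the index: no match.
        System.out.println(allTermsIndexed(index, unigrams(query))); // false
        // Query tokenized the same way as the index: match.
        System.out.println(allTermsIndexed(index, bigrams(query)));  // true
    }
}
```

The steps below are a way to find out which side (indexing or query
translation) produces the unexpected terms in the real system.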

Here's what you can do to debug this problem:

* create a small segment with well-known content, e.g. by crawling 2-3
pages with the crawl tool. Make sure your analyzer is the one used when
the content is parsed.

* index the segment, and then open the index with Luke. If you are
curious to see how your text was tokenized, you can open individual
documents (using "Reconstruct & Edit"), and look at each field,
especially at the "Tokenized" or "Restored" content, to confirm that
the terms you expect actually made it into the index.

* to make things simpler, create a Nutch query that contains only a
single well-known term that is not a stopword, and then translate it
into a Lucene query with the following command line:

        ./nutch org.apache.nutch.searcher.Query

Please note that because the Nutch analyzer creates word n-grams, your
translated Lucene query will most likely look much more complicated
than you would expect... ;-) That's normal.

* then copy & paste this translated Lucene query into the search box in
Luke. *Make sure* to select a WhitespaceAnalyzer, so that the query is
unchanged by the QueryParser in Luke! You should probably check this by
clicking on the small "Update" button and comparing the "Parsed" form
with the unparsed query above - they should be identical.

Press "Search" and see if you get some results. With the setup above,
you follow the same process of parsing, analyzing and searching that
Nutch itself uses.
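
To show why the analyzer choice in Luke matters, here is another toy
sketch in plain Java (again no Lucene classes are used). The query
string, the class name and the "normalizing" tokenizer are all invented
for illustration, and the real translated query from Nutch will look
different. Splitting only on whitespace leaves every term as it was,
while an analyzer that lowercases and splits on punctuation rewrites
the query into different terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WhitespaceCheckDemo {

    // Splits on whitespace only; each term is kept exactly as typed.
    static List<String> whitespaceTerms(String query) {
        return Arrays.asList(query.trim().split("\\s+"));
    }

    // Hypothetical normalizing tokenizer: lowercases the input and
    // splits on anything that is not a letter or digit.
    static List<String> normalizingTerms(String query) {
        List<String> terms = new ArrayList<>();
        for (String t : query.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Invented example query, not real Nutch output.
        String query = "+content:Foo-Bar title:Baz";

        String viaWhitespace = String.join(" ", whitespaceTerms(query));
        String viaNormalizing = String.join(" ", normalizingTerms(query));

        System.out.println(viaWhitespace.equals(query));  // true
        System.out.println(viaNormalizing.equals(query)); // false
    }
}
```

This is the same comparison you do by eye in Luke between the "Parsed"
form and the query you pasted in.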

-- 
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers