RE: Using Nutch for crawling and Lucene for searching (Wildcard/Fuzzy)

Patrick Markiewicz Wed, 15 Oct 2008 06:40:14 -0700

Hi Matt,
     The Lucene documentation says that the Analyzer used for searching
needs to be the same type as the one that indexed the content.  I'm not
sure what you're using for searching, but wherever you reference an
analyzer in Lucene, you need to change it from StandardAnalyzer to
AnalyzerFactory.get(NutchConfiguration.create()).get("en") (which
requires importing Nutch-specific classes).  To display the URL, you
need to reference the "url" field instead of the "path" field that the
Lucene demo uses by default.  Use Luke to see which field holds the page
content.  The field name in your search code may have to change from
"contents" to "content".
     To be honest, I never tried just changing the "path" field to
"url".  You could try that and see if the StandardAnalyzer would work,
but I don't have enough knowledge about the analyzers to know if that
would work.
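
To illustrate why either mismatch leads to zero hits, here is a toy
inverted index in plain Java.  It is only a sketch: it does not use the
Lucene or Nutch APIs, and every name in it (AnalyzerMismatchDemo,
ToyAnalyzer, LOWERCASING, WHITESPACE) is made up for the example.  The
point is that a query term must match an indexed term exactly, in the
same field, so both the analyzer and the field name have to agree with
what was used at index time.

```java
import java.util.*;

public class AnalyzerMismatchDemo {
    // Hypothetical stand-in for an analyzer: turns text into index terms.
    public interface ToyAnalyzer { List<String> tokens(String text); }

    // Lowercases and splits on anything that is not a letter or digit.
    public static final ToyAnalyzer LOWERCASING = text -> {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    };

    // Splits on whitespace only and keeps the original case.
    public static final ToyAnalyzer WHITESPACE = text -> {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    };

    // field name -> term -> ids of documents containing that term
    private final Map<String, Map<String, Set<Integer>>> postings = new HashMap<>();

    public void add(int docId, String field, String text, ToyAnalyzer analyzer) {
        for (String term : analyzer.tokens(text)) {
            postings.computeIfAbsent(field, f -> new HashMap<>())
                    .computeIfAbsent(term, t -> new TreeSet<>())
                    .add(docId);
        }
    }

    // Returns the documents that contain every analyzed query term in the field.
    public Set<Integer> search(String field, String query, ToyAnalyzer analyzer) {
        Map<String, Set<Integer>> terms =
                postings.getOrDefault(field, Collections.emptyMap());
        Set<Integer> hits = null;
        for (String term : analyzer.tokens(query)) {
            Set<Integer> docs = terms.getOrDefault(term, Collections.emptySet());
            if (hits == null) {
                hits = new TreeSet<>(docs);
            } else {
                hits.retainAll(docs);
            }
        }
        return hits == null ? new TreeSet<>() : hits;
    }

    public static void main(String[] args) {
        AnalyzerMismatchDemo index = new AnalyzerMismatchDemo();
        index.add(1, "content", "Apache Nutch Crawler", LOWERCASING);

        // Same analyzer, same field: the term "nutch" is found.
        System.out.println(index.search("content", "Nutch", LOWERCASING));  // [1]
        // Different analyzer: the query term stays "Nutch", which was never indexed.
        System.out.println(index.search("content", "Nutch", WHITESPACE));   // []
        // Wrong field name: nothing was indexed under "contents".
        System.out.println(index.search("contents", "Nutch", LOWERCASING)); // []
    }
}
```

The real analyzers differ in subtler ways than these two, so whether a
given pair happens to be compatible is something to check with Luke
rather than assume.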


Patrick

-----Original Message-----
From: Matthias W. [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, October 15, 2008 6:22 AM
To: [email protected]
Subject: Re: Using Nutch for crawling and Lucene for searching
(Wildcard/Fuzzy)


Thanks, but what does this mean for me?
I already tried to search the index with the Lucene webapp
(lucenewebapp.war from the Lucene package), pointing it at my Nutch
indexes 'nutchcrawl/index' and 'nutchcrawl/indexes/part-00000', but with
both of them I get no results.
And my index is correct, because with Luke and the Nutch webapp I get
results.

Andrzej Bialecki wrote:
> 
> Matthias W. wrote:
>> Hi,
>> I want to use Nutch for crawling contents and Lucene webapp to search
>> the Nutch-created index.
>> I thought nutch creates a Lucene interoperable index, but when I'm
>> searching
>> the index with the Lucene webapp I get no results.
>> I'm using Nutch 0.9 and Lucene 2.4.0.
>> Should I use an older Lucene version like 2.0 or is this not crucial?
>> 
>> I want to use Lucene, because of its Wildcardsearch and Fuzzysearch,
>> ...
>> Are there other possibilities to solve this?
> 
> Nutch indexes are plain Lucene indexes. The only difference is that as a
> side-effect of map-reduce processing these indexes may come in several
> parts, found in subdirectories named like part-xxxxx. Each subdirectory
> holds a valid Lucene index.
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 
> 
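
Since each part-xxxxx subdirectory is described above as a valid Lucene
index of its own, a search tool has to be pointed at every one of them.
A minimal sketch in plain Java that lists those subdirectories follows.
The class name PartIndexDirs and the default path 'nutchcrawl/indexes'
are assumptions for illustration, and opening each directory with Lucene
is left out.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PartIndexDirs {
    // Returns the part-* subdirectories of indexesDir, sorted by path.
    public static List<Path> list(Path indexesDir) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                     Files.newDirectoryStream(indexesDir, "part-*")) {
            for (Path p : stream) {
                // Skip plain files whose names happen to start with "part-".
                if (Files.isDirectory(p)) {
                    parts.add(p);
                }
            }
        }
        Collections.sort(parts);
        return parts;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location; pass the real directory as an argument.
        Path indexesDir = Paths.get(args.length > 0 ? args[0] : "nutchcrawl/indexes");
        for (Path part : list(indexesDir)) {
            System.out.println(part);
        }
    }
}
```

Each printed path can then be opened separately in Luke or in search
code to check that it returns results.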

-- 
View this message in context:
http://www.nabble.com/Using-Nutch-for-crawling-and-Lucene-for-searching-%28Wildcard-Fuzzy%29-tp19990219p19990671.html
Sent from the Nutch - User mailing list archive at Nabble.com.
