> What you should do is compare the structure Nutch uses with the
> structure you use, and somehow combine the two. For most of the fields,
> you should converge on the Nutch version. Other than that, once the
> index is created by Nutch, it is plain Lucene. You can merge the
> indexes, run a MultiSearcher, or open separate
> DistributedSearch$Clients and combine the results from the separate
> indexes on the fly. However, there is an issue with summaries. Do you
> intend to use them?

I see. I don't think I can unify the index fields, since we use a very
granular field structure for our DB content. It would be ok to have the
results displayed on the web page separated, with the first paragraph
showing the DB search results and the second one for the Nutch results,
effectively running and querying the two indexes separately.
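
A rough sketch of what "two separate result blocks" could look like, in plain
Java. The IndexSearch interface and all names are hypothetical stand-ins, not
Lucene or Nutch types; a real version would wrap the existing Lucene searcher
and a Nutch search bean behind them. It is written for a current JDK, so
generics and lambdas would have to go for Java 1.4.2.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: query two independent indexes, render two result blocks. */
class TwoSectionSearch {

    /** Hypothetical abstraction over one index; not a Lucene or Nutch type. */
    interface IndexSearch {
        List<String> search(String query) throws Exception;
    }

    /** Renders one titled block; a failing backend only affects its own block. */
    static String renderSection(String title, IndexSearch backend, String query) {
        StringBuilder out = new StringBuilder(title).append('\n');
        try {
            List<String> hits = backend.search(query);
            if (hits.isEmpty()) {
                out.append("  (no results)\n");
            }
            for (String hit : hits) {
                out.append("  - ").append(hit).append('\n');
            }
        } catch (Exception e) {
            // E.g. a query one backend cannot parse, or an index that is offline.
            out.append("  (search unavailable)\n");
        }
        return out.toString();
    }

    /** DB results first, crawled-page results second, each queried on its own. */
    static String renderPage(IndexSearch db, IndexSearch web, String query) {
        return renderSection("Database results", db, query)
             + renderSection("Web results", web, query);
    }

    public static void main(String[] args) {
        IndexSearch db = q -> Arrays.asList("Product 42", "Product 43");
        IndexSearch web = q -> { throw new IllegalStateException("index offline"); };
        System.out.print(renderPage(db, web, "lucene"));
    }
}
```

Catching the exception per block means a query that one engine rejects still
yields results from the other one.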

Further issues:
- Are Lucene and Nutch queries compatible? I've heard the "Query" class
hierarchy is different in Nutch. Basically, a query that works in Lucene
(possibly containing boolean operators, phrases etc.) should not throw an
exception in Nutch and should return sensible results.

- I need to exclude things like header, footer and navigation from the
crawled pages and only index the content of a certain area. Can this be done
in Nutch? I found some vague hints pointing to HtmlParser and Plugins...

- My working environment for the current search is Java 1.4.2 and Lucene
2.1. I guess I have to use Nutch 0.8 (since 0.9 switched to Java 1.5) and
hope it can cope with the newer Lucene version?
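
To make the content-area question above concrete: assuming the page templates
could be changed to wrap the main content in marker comments, the filtering
itself would be simple string handling like the sketch below. The marker
strings and the class name are made up for illustration, and this is not Nutch
API; where such logic would hook into Nutch (presumably a parse plugin) is
exactly what I am unsure about.

```java
/** Hypothetical sketch: keep only the markup between two marker comments. */
class ContentAreaExtractor {

    // Assumed markers; the real templates would have to emit them.
    static final String START = "<!-- content-start -->";
    static final String END = "<!-- content-end -->";

    /**
     * Returns the markup between the markers, trimmed.
     * Falls back to the whole page if either marker is missing.
     */
    static String extract(String html) {
        int start = html.indexOf(START);
        if (start < 0) {
            return html;
        }
        int from = start + START.length();
        int end = html.indexOf(END, from);
        if (end < 0) {
            return html;
        }
        return html.substring(from, end).trim();
    }

    public static void main(String[] args) {
        String page = "<div id=\"nav\">Home</div>"
                + START + "<p>Article text</p>" + END
                + "<div id=\"footer\">Imprint</div>";
        System.out.println(extract(page)); // prints <p>Article text</p>
    }
}
```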

Thanks a lot for your help so far!


Regards,

Michael Böckling

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
