[ https://issues.apache.org/jira/browse/NUTCH-442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Doğacan Güney updated NUTCH-442:
--------------------------------

    Attachment: RFC_multiple_search_backends.patch

Here is my (very large - sorry) patch for this issue. The patch consists of two parts:

1) Support for multiple indexing backends: see NUTCH-520 for details. This patch includes the latest patch from NUTCH-520.

2) Support for multiple search backends:

* DistributedSearch.Client is removed.

* Search is divided into three main parts:
  - SearchBean: implements Searcher and HitDetailer
  - SegmentBean: implements HitContent and HitSummarizer
  - HitInlinks: same as before

  This division may seem arbitrary (and it actually is), but these abstractions are sufficient for both Solr and Nutch's own search server to work. If further abstractions are needed later for new search backends, they can be added.

  This division also has a nice side effect. Currently, a search server searches Lucene indexes _and_ generates summaries for results. After this patch, it is possible to start a search server that searches an index and a 'segment server' (that returns cached content of pages, generates summaries, etc.) separately. The DistributedSearch$IndexServer (uses LuceneSearchBean) and DistributedSearch$SegmentServer (uses FetchedSegments) classes are added for this.

* The SearchBean hierarchy looks like this:

    SearchBean (extends Searcher, HitDetailer)
    RPCSearchBean (extends SearchBean, VersionedProtocol)
    LuceneSearchBean (implements RPCSearchBean; searches Lucene indexes, which may be local or on DFS; can also respond to RPC requests)
    SolrSearchBean (implements SearchBean; processes responses from a Solr server)
    DistributedSearchBean (implements SearchBean; is also a container of SearchBeans. This class implements the searching part of DistributedSearch$Client: it sends requests to multiple beans in parallel and merges their results. It does not use the RPC.call API, since not all beans support Hadoop's RPC; instead it uses a thread pool for parallel requests.)

* The locations of remote Nutch/Lucene servers are still read from crawl/search-servers.txt. The locations of Solr servers are read from crawl/solr-servers.txt (yes, it supports searching more than one Solr server).

* DistributedSearchBean routinely pings its beans. If a bean fails to respond, it is removed from the list of active search servers so that it doesn't block searching. For example, if a Solr server dies, DistributedSearchBean notices this and stops sending search requests to it. Later, when the Solr server comes back up, DistributedSearchBean re-adds it to the active list.

* SegmentBean is similar:

    SegmentBean (extends HitContent, HitSummarizer)
    RPCSegmentBean (extends SegmentBean, VersionedProtocol)
    FetchedSegments (similar to the older version)

* DistributedSearch$SegmentServer (which uses FetchedSegments internally) reads its config from crawl/segment-servers.txt.

* I also added a couple of utility classes for sending requests to Solr and processing responses (under o.a.n.util.solr).

Sorry if the description is a bit complex (the code itself should be easy to understand, though). Comments, suggestions, reviews and all other sorts of feedback are welcome.

> Integrate Solr/Nutch
> --------------------
>
>                 Key: NUTCH-442
>                 URL: https://issues.apache.org/jira/browse/NUTCH-442
>             Project: Nutch
>          Issue Type: New Feature
>         Environment: Ubuntu linux
>            Reporter: rubdabadub
>         Attachments: RFC_multiple_search_backends.patch
>
>
> Hi:
> I tried out Sami's patch for Solr/Nutch, which can be found here
> (http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html),
> and I can confirm it worked :-) That led me to request the following:
> I would be very grateful if this could be included in Nutch 0.9, as I
> am trying to eliminate my Python-based crawler, which posts documents to Solr.
> As I am in a corporate environment, I can't install the trunk version in
> production, so I am asking for this to be included in the 0.9 release. I
> hope my wish will be granted.
> I look forward to getting some feedback.
> Thank you.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
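
For readers who want a concrete picture of the DistributedSearchBean behaviour described in the patch notes above (parallel requests through a thread pool, merged results, and a ping-maintained list of active servers), here is a minimal Java sketch. It is not code from the patch: the Backend interface, the Hit class, the FanOutSearchSketch class, the 5-second timeout and the merge-by-score rule are all hypothetical stand-ins chosen for illustration, and the real SearchBean API may look quite different.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class FanOutSearchSketch {

    /** Hypothetical stand-in for a search backend; NOT the patch's SearchBean. */
    interface Backend {
        List<Hit> search(String query, int numHits) throws Exception;
        boolean ping();
    }

    /** Hypothetical hit: a document id plus a score used for merging. */
    static final class Hit {
        final String id;
        final float score;
        Hit(String id, float score) { this.id = id; this.score = score; }
    }

    private final List<Backend> all;
    private volatile List<Backend> active = Collections.emptyList();
    private final ExecutorService pool;

    FanOutSearchSketch(List<Backend> backends) {
        this.all = new ArrayList<>(backends);
        // Daemon threads so an idle pool never keeps the JVM alive.
        this.pool = Executors.newFixedThreadPool(Math.max(1, backends.size()), r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);
            return t;
        });
        refreshActive();
    }

    /** Pings every backend: dead ones drop out, recovered ones are re-added. */
    void refreshActive() {
        List<Backend> alive = new ArrayList<>();
        for (Backend b : all) {
            boolean ok;
            try { ok = b.ping(); } catch (RuntimeException e) { ok = false; }
            if (ok) alive.add(b);
        }
        active = alive;
    }

    /** Queries all active backends in parallel and returns the top hits by score. */
    List<Hit> search(String query, int numHits) throws InterruptedException {
        List<Callable<List<Hit>>> tasks = new ArrayList<>();
        for (Backend b : active) {
            tasks.add(() -> b.search(query, numHits));
        }
        List<Hit> merged = new ArrayList<>();
        // Assumed timeout of 5 seconds; tasks still running then are cancelled.
        for (Future<List<Hit>> f : pool.invokeAll(tasks, 5, TimeUnit.SECONDS)) {
            try {
                merged.addAll(f.get());
            } catch (ExecutionException | CancellationException e) {
                // A failed or timed-out backend simply contributes no hits.
            }
        }
        merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(merged.subList(0, Math.min(numHits, merged.size())));
    }

    void close() { pool.shutdown(); }
}
```

In a long-running server, refreshActive() would be called periodically (for example from a scheduled task), which corresponds to the "routinely sends pings" behaviour; the sketch leaves the scheduling out to stay short.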