On Friday 15 July 2011 11:07:36 Julien Nioche wrote:
> > On Thursday 14 July 2011 15:03:34 Julien Nioche wrote:
> > > Have been thinking about this again. We could make it so that the
> > > indexer does not necessarily require a linkDB: some people are not
> > > particularly interested in getting the anchors. At the moment you have
> > > to have a linkDB. This would make it a bit simpler (and quicker) to
> > > index within a crawl iteration. Any thoughts on this?
> >
> > It still requires the CrawlDB right? Or are you suggesting we index
> > without mapping through the CrawlDB?
>
> My suggestion was only to make the linkDB optional to start with; it
> would still require the crawldb.
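For context, the change under discussion would affect the indexer invocation roughly as follows. This is only a sketch: the first command reflects the Nutch 1.x `solrindex` usage of the time, while the second (linkdb-less) form is hypothetical, since the proposal had not been implemented when this was written.

```shell
# Behaviour at the time: the linkdb argument is mandatory,
# even if you don't need anchor text in the index.
bin/nutch solrindex http://localhost:8983/solr/ \
    crawl/crawldb crawl/linkdb crawl/segments/*

# Proposed behaviour (hypothetical syntax): index straight from the
# crawldb and segments, skipping the linkdb and anchors entirely.
bin/nutch solrindex http://localhost:8983/solr/ \
    crawl/crawldb crawl/segments/*
```

Skipping the linkdb matters for throughput because `invertlinks` is a full MapReduce pass over all segments in every crawl cycle.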
Not having to use (and create) a linkdb would be great in many cases.
Inverting links in every crawl cycle drops average throughput
considerably.

> > And at which point during the crawl cycle? The fetcher with parsing
> > enabled?
>
> It would still require the parsing to have been done (as part of
> fetching or separately - it does not matter) and an updated crawldb, so
> after the update. Think about it in the following way: the indexer
> remains as it is, but people will be able to do without the linkdb if
> they don't need the anchors. That's all.
>
> > In that case, do we need url filtering and normalizing in the parse
> > job?
>
> See the other thread - we already have that.
>
> > Anyway, take care of memory. The indexer can fill up your heap real
> > quick, even with smaller add buffers. And of course handle indexing
> > failures. It's not uncommon for requests to time out. And there is
> > also the problem of an unhappily timed commit, which currently stops
> > all indexing to Solr (there's an issue for this, I remember).
>
> These are general issues with indexing regardless of where it is called
> (end of crawl vs end of loop).
>
> Julien
>
> On 12 July 2011 18:23, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > Thanks for the responses :)
> >
> > > > So the size of the segments then I guess would determine the
> > > > latency between crawling and indexing.
> > >
> > > The size of your crawldb may matter even more in some cases. If your
> > > segment has just one file and your crawldb many millions, the
> > > indexing takes forever.
> >
> > I and my colleague will look more into the scripts to see how the
> > diffs get pushed to Solr.
> > Thanks again
> >
> > M
> >
> > On Tue, Jul 12, 2011 at 6:12 PM, lewis john mcgibbney <
> > lewis.mcgibb...@gmail.com> wrote:
> > > To add to Julien's comments, there was a contribution made by
> > > Gabriele a while ago which addressed this issue (however I have not
> > > used his scripts extensively). They might be worth a look. Try the
> > > link below:
> > > http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script
> > >
> > > On Tue, Jul 12, 2011 at 2:15 PM, Julien Nioche <
> > > lists.digitalpeb...@gmail.com> wrote:
> > >> Hi Matthew,
> > >>
> > >> This is usually achieved by writing a script containing the
> > >> individual Nutch commands (as opposed to calling 'nutch crawl') and
> > >> indexing at the end of a generate-fetch-parse-update-linkdb
> > >> sequence. You don't need any plugins for that.
> > >>
> > >> HTH
> > >>
> > >> Julien
> > >>
> > >> On 12 July 2011 13:35, Matthew Painter <matthew.pain...@kusiri.com>
> > >> wrote:
> > >>> Hi all,
> > >>>
> > >>> I was wondering about the feasibility of creating a plugin for
> > >>> nutch that creates a solr update command and adds it to a queue
> > >>> for indexing after it first parses the page, rather than when
> > >>> crawling has finished.
> > >>>
> > >>> This would allow you to do "real-time" indexing when crawling.
> > >>>
> > >>> Drawbacks: not able to use the graph to give relevancy
> > >>> information.
> > >>>
> > >>> Wondering what initial thoughts are about this?
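The script-based approach Julien describes can be sketched roughly as below. This is an illustrative reconstruction, not an official Nutch script: the paths, `-topN` value, depth, and Solr URL are all assumptions, and it mirrors the classic Nutch 1.x generate-fetch-parse-update-linkdb-index sequence.

```shell
#!/bin/sh
# Sketch of a Nutch 1.x crawl loop using individual commands instead of
# 'nutch crawl'. All paths and parameters here are illustrative.
CRAWL_DIR=crawl
SOLR_URL=http://localhost:8983/solr/
DEPTH=3

# Seed the crawldb with the URL list.
bin/nutch inject $CRAWL_DIR/crawldb urls

for i in $(seq 1 $DEPTH); do
  # Generate a fetch list, then pick the newest segment directory.
  bin/nutch generate $CRAWL_DIR/crawldb $CRAWL_DIR/segments -topN 1000
  SEGMENT=$(ls -d $CRAWL_DIR/segments/* | tail -1)
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  # Update the crawldb with fetch status and newly discovered links.
  bin/nutch updatedb $CRAWL_DIR/crawldb $SEGMENT
done

# Invert links once at the end (the step the thread proposes making
# optional), then index everything to Solr in a single pass.
bin/nutch invertlinks $CRAWL_DIR/linkdb -dir $CRAWL_DIR/segments
bin/nutch solrindex $SOLR_URL $CRAWL_DIR/crawldb $CRAWL_DIR/linkdb \
    $CRAWL_DIR/segments/*
```

Moving `solrindex` inside the loop (after `updatedb`) is what "index within a crawl iteration" would mean; with the linkdb made optional, `invertlinks` could be dropped from that per-iteration cost.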
> > >>>
> > >>> Thanks :)
> > >>
> > >> --
> > >> *Open Source Solutions for Text Engineering
> > >>
> > >> http://digitalpebble.blogspot.com/
> > >> http://www.digitalpebble.com
> > >
> > > --
> > > *Lewis*
> >
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350