> On Thursday 14 July 2011 15:03:34 Julien Nioche wrote:
>> Have been thinking about this again. We could make it so that the
>> indexer does not necessarily require a linkDB: some people are not
>> particularly interested in getting the anchors. At the moment you have
>> to have a linkDB.
>>
>> This would make it a bit simpler (and quicker) to index within a crawl
>> iteration. Any thoughts on this?
>
> It still requires the CrawlDB right? Or are you suggesting we index
> without mapping through the CrawlDB?
My suggestion was only to make the linkDB optional to start with; it would
still require the crawldb.

> And at which point during the crawl cycle? The fetcher with parsing
> enabled?

It would still require the parsing to have been done (as part of fetching
or separately - it does not matter) and an updated crawldb, so after the
update. Think about it in the following way: the indexer remains as it is,
but people will be able to do without the linkdb if they don't need the
anchors. That's all.

> In that case, do we need url filtering and normalizing in the parse job?

See the other thread - we already have that.

> Anyway, take care of memory. The indexer can fill up your heap real
> quick, even with smaller add buffers. And of course handle indexing
> failure. It's not uncommon for requests to time out. And there is also
> the problem of an unhappily timed commit, which currently stops all
> indexing to Solr (there's an issue for this, I remember).

These are general issues with indexing, regardless of where it is called
(end of crawl vs end of loop).

Julien

> On 12 July 2011 18:23, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>> Thanks for the responses :)
>>>
>>> So the size of the segments then I guess would determine the latency
>>> between crawling and indexing.
>>
>> The size of your crawldb may matter even more in some cases. If your
>> segment has just one file and your crawldb many millions, the indexing
>> takes forever.
>>
>>> I and my colleague will look more into the scripts to see how the
>>> diffs get pushed to Solr.
>>> Thanks again
>>>
>>> M
>>
>> On Tue, Jul 12, 2011 at 6:12 PM, lewis john mcgibbney
>> <lewis.mcgibb...@gmail.com> wrote:
>>> To add to Julien's comments, there was a contribution made by Gabriele
>>> a while ago which addressed this issue (however, I have not used his
>>> scripts extensively). They might be of interest for a look. Try the
>>> link below:
>>>
>>> http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script
>>>
>>> On Tue, Jul 12, 2011 at 2:15 PM, Julien Nioche
>>> <lists.digitalpeb...@gmail.com> wrote:
>>>> Hi Matthew,
>>>>
>>>> This is usually achieved by writing a script containing the
>>>> individual Nutch commands (as opposed to calling 'nutch crawl') and
>>>> indexing at the end of a generate-fetch-parse-update-linkdb sequence.
>>>> You don't need any plugins for that.
>>>>
>>>> HTH
>>>>
>>>> Julien
>>>>
>>>> On 12 July 2011 13:35, Matthew Painter <matthew.pain...@kusiri.com>
>>>> wrote:
>>>>> Hi all,
>>>>>
>>>>> I was wondering about the feasibility of creating a plugin for
>>>>> Nutch that creates a Solr update command and adds it to a queue
>>>>> for indexing after it first parses the page, rather than when
>>>>> crawling has finished.
>>>>>
>>>>> This would allow you to do "real-time" indexing when crawling.
>>>>>
>>>>> Drawback: not able to use the graph to give relevancy information.
>>>>>
>>>>> Wondering what initial thoughts are about this?
>>>>> Thanks :)
>>>>
>>>> --
>>>> Open Source Solutions for Text Engineering
>>>>
>>>> http://digitalpebble.blogspot.com/
>>>> http://www.digitalpebble.com
>>>
>>> --
>>> Lewis
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
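For anyone following the thread, the scripted per-iteration sequence discussed above might look roughly like the sketch below. This is a minimal illustration, not a tested script: it assumes a Nutch 1.x layout (bin/nutch, crawl/crawldb, crawl/segments), the segment name and -topN value are placeholders, and NUTCH defaults to echo so running it as-is only prints the commands (a dry run) rather than invoking Nutch.

```shell
#!/bin/sh
set -e

# Dry-run by default: the commands are printed, not executed.
# In a real deployment, run with NUTCH=bin/nutch from the Nutch home dir.
NUTCH="${NUTCH:-echo bin/nutch}"

CRAWLDB=crawl/crawldb
LINKDB=crawl/linkdb
SEGMENTS=crawl/segments
SEGMENT=$SEGMENTS/20110714120000   # placeholder: use the segment generate just created
SOLR_URL="http://localhost:8983/solr"

# One crawl iteration: generate a segment, fetch and parse it,
# then fold the results back into the crawldb.
$NUTCH generate $CRAWLDB $SEGMENTS -topN 1000
$NUTCH fetch $SEGMENT
$NUTCH parse $SEGMENT              # or fetch with parsing enabled
$NUTCH updatedb $CRAWLDB $SEGMENT

# Anchor text requires the linkdb. Under the proposal in this thread,
# these linkdb steps could be skipped when anchors are not needed.
$NUTCH invertlinks $LINKDB -dir $SEGMENTS

# Index the freshly updated segment at the end of the iteration.
$NUTCH solrindex $SOLR_URL $CRAWLDB $LINKDB $SEGMENT
```

Indexing at the end of each iteration like this keeps the latency between crawling and indexing bounded by the segment size, as discussed above.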