Re: Real-time Solr integration

Julien Nioche Thu, 14 Jul 2011 06:04:06 -0700

I have been thinking about this again. We could make it so that the indexer
does not necessarily require a linkDB: some people are not particularly
interested in getting the anchors. At the moment you have to have a linkDB.


This would make it a bit simpler (and quicker) to index within a crawl
iteration. Any thoughts on this?
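
For reference, here is a rough sketch in Python of one crawl iteration as a script of individual Nutch commands, as discussed further down the thread. The crawl directory, the Solr URL, the -topN value and the helper names are illustrative assumptions, not part of Nutch. The commands themselves are meant to be the usual Nutch 1.x ones; check them against your version. The invertlinks step is the one that could be skipped if the indexer no longer required a linkDB.

```python
import os
import subprocess

NUTCH = "bin/nutch"  # assumed location of the Nutch launcher script


def generate_command(crawl_dir, top_n):
    """Command that selects the next batch of URLs into a new segment."""
    return [NUTCH, "generate", crawl_dir + "/crawldb",
            crawl_dir + "/segments", "-topN", str(top_n)]


def newest_segment(segments_dir):
    """Return the most recent segment directory.

    Segment directories are named by timestamp, so the
    lexicographically largest name is the newest one.
    """
    names = sorted(os.listdir(segments_dir))
    if not names:
        raise RuntimeError("no segments under " + segments_dir)
    return os.path.join(segments_dir, names[-1])


def segment_commands(crawl_dir, segment, solr_url):
    """Commands run on one freshly generated segment, in order."""
    crawldb = crawl_dir + "/crawldb"
    linkdb = crawl_dir + "/linkdb"
    return [
        [NUTCH, "fetch", segment],
        [NUTCH, "parse", segment],
        [NUTCH, "updatedb", crawldb, segment],
        # The linkdb is built here only so the indexer can read anchors.
        [NUTCH, "invertlinks", linkdb, segment],
        [NUTCH, "solrindex", solr_url, crawldb, linkdb, segment],
    ]


def run_iteration(crawl_dir, solr_url, top_n=1000):
    """Run generate, then fetch/parse/update/linkdb/index on the new segment."""
    subprocess.check_call(generate_command(crawl_dir, top_n))
    segment = newest_segment(crawl_dir + "/segments")
    for command in segment_commands(crawl_dir, segment, solr_url):
        subprocess.check_call(command)
```

Calling run_iteration("crawl", "http://localhost:8983/solr/") in a loop would index after every iteration, so the segment size (the -topN value here) bounds how far the index lags behind the crawl.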

On 12 July 2011 18:23, Markus Jelsma <markus.jel...@openindex.io> wrote:

>
> > Thanks for the responses :)
> >
> > So the size of the segments, then, I guess would determine the latency
> > between crawling and indexing.
>
> The size of your crawldb may matter even more in some cases. If your segment
> has just one file and your crawldb has many millions, the indexing takes
> forever.
>
> >
> > My colleague and I will look more into the scripts to see how the diffs
> > get pushed to Solr.
> >
> > Thanks again
> >
> > M
> >
> >
> > On Tue, Jul 12, 2011 at 6:12 PM, lewis john mcgibbney <lewis.mcgibb...@gmail.com> wrote:
> > > To add to Julien's comments, there was a contribution made by Gabriele a
> > > while ago which addressed this issue (however, I have not used his
> > > scripts extensively). They might be worth a look. Try the link below:
> > >
> > > http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script
> > >
> > > On Tue, Jul 12, 2011 at 2:15 PM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:
> > >> Hi Matthew,
> > >>
> > >> This is usually achieved by writing a script containing the individual
> > >> Nutch commands (as opposed to calling 'nutch crawl') and indexing at the
> > >> end of a generate-fetch-parse-update-linkdb sequence. You don't need
> > >> any plugins for that.
> > >>
> > >> HTH
> > >>
> > >> Julien
> > >>
> > >> On 12 July 2011 13:35, Matthew Painter <matthew.pain...@kusiri.com> wrote:
> > >>> Hi all,
> > >>>
> > >>> I was wondering about the feasibility of creating a plugin for Nutch
> > >>> that creates a Solr update command and adds it to a queue for
> > >>> indexing right after the page is first parsed, rather than when
> > >>> crawling has finished.
> > >>>
> > >>> This would allow you to do "real-time" indexing when crawling.
> > >>>
> > >>> Drawback: the link graph could not be used to give relevancy information.
> > >>>
> > >>> Wondering what initial thoughts are about this?
> > >>>
> > >>> Thanks :)
> > >>
> > >> --
> > >> *
> > >> *Open Source Solutions for Text Engineering
> > >>
> > >> http://digitalpebble.blogspot.com/
> > >> http://www.digitalpebble.com
> > >
> > > --
> > > *Lewis*
>



-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
