Will take care of this one later:
https://issues.apache.org/jira/browse/NUTCH-1054
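
For reference, a minimal sketch of what the change could look like from the
command line. The first form is the current syntax (check 'bin/nutch solrindex'
usage for your version), where the linkdb argument is mandatory; the second is
only an assumed shape for the optional-linkdb invocation, not the committed
interface, and the Solr URL and paths are placeholders:

    # today: the linkdb argument is mandatory
    bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb \
        crawl/linkdb crawl/segments/20110715120000

    # assumed form once NUTCH-1054 lands: skip the linkdb entirely
    # when anchors are not needed
    bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb \
        crawl/segments/20110715120000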

On 15 July 2011 12:46, Markus Jelsma <markus.jel...@openindex.io> wrote:

>
>
> On Friday 15 July 2011 11:07:36 Julien Nioche wrote:
> > > On Thursday 14 July 2011 15:03:34 Julien Nioche wrote:
> > > > Have been thinking about this again. We could make it so that the
> > > > indexer does not necessarily require a linkDB: some people are not
> > > > particularly interested in getting the anchors. At the moment you
> > > > have to have a linkDB. This would make it a bit simpler (and quicker)
> > > > to index within a crawl iteration. Any thoughts on this?
> > >
> > > It still requires the CrawlDB, right? Or are you suggesting we index
> > > without mapping through the CrawlDB?
> >
> > My suggestion was only to make the linkDB optional to start with; it
> > would still require the crawldb.
>
> Not having to use (and create) a link db would be great in many cases.
> Inverting links in every crawl cycle drops average throughput considerably.
>
> >
> > > And at which point during the crawl cycle? The fetcher with parsing
> > > enabled?
> >
> > It would still require the parsing to have been done (as part of fetching
> > or separately, it does not matter) and an updated crawldb, so after the
> > update. Think about it in the following way: the indexer remains as it
> > is, but people will be able to do without the linkdb if they don't need
> > the anchors. That's all.
> >
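To make that concrete, here is a minimal sketch of one crawl iteration with
indexing at the end of the loop. The paths and topN value are assumptions for
illustration; the final no-linkdb indexer call is the assumed syntax discussed
above, and the script assumes fetcher.parse is false so parsing runs as a
separate step:

    #!/bin/bash
    # one iteration: generate -> fetch -> parse -> updatedb -> index
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    # pick up the segment that generate just created (newest directory)
    SEGMENT=crawl/segments/`ls -t crawl/segments | head -1`
    bin/nutch fetch $SEGMENT
    bin/nutch parse $SEGMENT
    bin/nutch updatedb crawl/crawldb $SEGMENT
    # only needed if you want anchors in the index:
    # bin/nutch invertlinks crawl/linkdb $SEGMENT
    # index right after the update; shown in the assumed no-linkdb form
    bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb $SEGMENT
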
> > > In that case, do we need url filtering and normalizing in the parse
> > > job?
> >
> > See the other thread - we already have that.
> >
> > > Anyway, take care of memory. The indexer can fill up your heap real
> > > quick, even with smaller add buffers. And of course handle indexing
> > > failure: it's not uncommon for requests to time out. There is also the
> > > problem of an unhappily timed commit, which currently stops all
> > > indexing to Solr (there's an issue for this, I remember).
> >
> > These are general issues with the indexing regardless of where it is
> > called (end of crawl vs end of loop).
> >
> > Julien
> >
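On the heap and timeout concerns above: the size of the add buffer is
controlled by the solr.commit.size property (check conf/nutch-default.xml for
your version). A small sketch, assuming the job accepts ToolRunner-style -D
properties on the command line; setting the property in nutch-site.xml works
as well:

    # send smaller batches to Solr to keep the heap in check
    bin/nutch solrindex -Dsolr.commit.size=100 \
        http://localhost:8983/solr/ crawl/crawldb crawl/linkdb $SEGMENT
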
> > > > On 12 July 2011 18:23, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > > > > > Thanks for the responses :)
> > > > > >
> > > > > > So the size of the segments, then, I guess would determine the
> > > > > > latency between crawling and indexing.
> > > > >
> > > > > The size of your crawldb may matter even more in some cases. If your
> > > > > segment has just one file and your crawldb many millions, the
> > > > > indexing takes forever.
> > > > >
> > > > > > My colleague and I will look more into the scripts to see how the
> > > > > > diffs get pushed to Solr.
> > > > > >
> > > > > > Thanks again
> > > > > >
> > > > > > M
> > > > > >
> > > > > >
> > > > > > On Tue, Jul 12, 2011 at 6:12 PM, lewis john mcgibbney
> > > > > > <lewis.mcgibb...@gmail.com> wrote:
> > > > > > > To add to Julien's comments, there was a contribution made by
> > > > > > > Gabriele a while ago which addressed this issue (however, I have
> > > > > > > not used his scripts extensively). They might be worth a look.
> > > > > > > Try the link below:
> > > > > > >
> > > > > > > http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script
> > > > > > >
> > > > > > > On Tue, Jul 12, 2011 at 2:15 PM, Julien Nioche
> > > > > > > <lists.digitalpeb...@gmail.com> wrote:
> > > > > > >> Hi Matthew,
> > > > > > >>
> > > > > > >> This is usually achieved by writing a script containing the
> > > > > > >> individual Nutch commands (as opposed to calling 'nutch crawl')
> > > > > > >> and indexing at the end of a generate-fetch-parse-update-linkdb
> > > > > > >> sequence. You don't need any plugins for that.
> > > > > > >>
> > > > > > >> HTH
> > > > > > >>
> > > > > > >> Julien
> > > > > > >>
> > > > > > >> On 12 July 2011 13:35, Matthew Painter
> > > > > > >> <matthew.pain...@kusiri.com> wrote:
> > > > > > >>> Hi all,
> > > > > > >>>
> > > > > > >>> I was wondering about the feasibility of creating a plugin
> > > > > > >>> for Nutch that creates a Solr update command and adds it to a
> > > > > > >>> queue for indexing after it first parses the page, rather
> > > > > > >>> than when crawling has finished.
> > > > > > >>>
> > > > > > >>> This would allow you to do "real-time" indexing when crawling.
> > > > > > >>>
> > > > > > >>> Drawbacks: not able to use the graph to give relevancy
> > > > > > >>> information.
> > > > > > >>>
> > > > > > >>> Wondering what initial thoughts are about this?
> > > > > > >>>
> > > > > > >>> Thanks :)
> > > > > > >>
> > > > > > >> --
> > > > > > >> Open Source Solutions for Text Engineering
> > > > > > >>
> > > > > > >> http://digitalpebble.blogspot.com/
> > > > > > >> http://www.digitalpebble.com
> > > > > > >
> > > > > > > --
> > > > > > > *Lewis*
> > >
> > > --
> > > Markus Jelsma - CTO - Openindex
> > > http://www.linkedin.com/in/markus17
> > > 050-8536620 / 06-50258350
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
