Please see comments below

On Thu, Jul 14, 2011 at 12:52 PM, Chris Alexander <
chris.alexan...@kusiri.com> wrote:

> Hi Lewis,
>
> First of all, thanks for the fantastic reply, most useful. I am working on
> testing out the functions you mention, of which I was not previously aware.
>

Yes, there has been a lot of activity recently, even between the 1.3 release
and 1.4-dev.


> There are a few offshoot questions from this, the answers to which aren't
> immediately apparent.
>
> When a solrindex is run doing an update of a previous index, is it the case
> that all of the content is copied into the solr index again (overwriting
> unchanged files, for example) or is only the changed data modified in the
> index? We came to this question because we are thinking of running a
> rolling
> crawl (i.e. restart a new crawl when the previous one has terminated) and
> clearly if it re-adds already existing and unchanged data on each loop
> round
> then this would negatively impact performance and would increase the amount
> of compacting required in Solr. From my simple testing it looks like the
> date is not updated in the Solr index, implying that it is not modified?
> But
> I could use confirmation of this as it's a fairly important issue.
>

Let's take solrindex and solrdedup and leave solrclean aside for the time
being, as that is a different matter: it deals with removing a certain type
of 'broken' document rather than comparing docs in our Solr index and acting
accordingly.

Solrindex - no data is technically copied; instead, it is indexed from the
crawldb based upon whatever content and metadata we wished to extract with
our parsers (check out the plugins), along with the URL links present in the
linkdb. When fetching is undertaken, each URL is given a unique fetch time
in milliseconds; this way we can disambiguate between several pages which
may be present in the Solr index and run the deduplication command
accordingly. At the moment, commits from all reducers to the Solr instance
are handled in one go, and yes, you are correct, this has been identified as
fairly expensive, since resources for crawls and the subsequent Solr
communication jobs increase proportionately. To prevent Nutch sending
'already existing and unchanged data', every page is given a metatag
relating to a lastModified value. This means that any page which has not
been modified since the last crawl will be skipped during fetching. Does
this clear any of this up for you?
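
If a concrete shape for this helps, a minimal cycle of the commands involved
might look like the sketch below. It is only a sketch under assumptions: the
crawl/ directory, the urls/ seed directory, the -topN value and the Solr URL
http://localhost:8983/solr/ are placeholders for whatever your setup uses.

  #!/bin/bash
  # One crawl/index cycle with Nutch 1.3 (paths and Solr URL are assumed).
  SOLR=http://localhost:8983/solr/
  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  SEGMENT=`ls -d crawl/segments/* | tail -1`   # newest segment
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch solrindex $SOLR crawl/crawldb crawl/linkdb $SEGMENT
  bin/nutch solrdedup $SOLR

Because the generate step only selects URLs whose fetch time is due, looping
a script like this should not re-fetch pages that are not yet due for a
re-crawl.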

>
> The second point is relating to removing documents from the index. In the
> scenario we are working on, a list of primary URLs is used to direct the
> start of the crawl. When a new site is to be crawled, its homepage URL is
> added to the seed urls file for the next crawl (it may also have a filter
> added to the filtering file to restrict the crawling spread). When a site
> is
> no longer desired in the index, its URL is removed from the seed urls file.
> When the next index is run, does this mean that the pages crawled under the
> previous URL will be removed from the solr index because they were not
> crawled on that occasion, or will they have to be removed manually by some
> other mechanism? From my simple testing it looks like they are not removed
> automatically.
>

You are correct here, they most certainly are not removed automatically. I
commented on a similar post a while ago. What happens if you were to remove
a URL from the seed list, recrawl (and have the pages automatically removed
from your index), then find out tomorrow or in the near future that you need
to re-add that URL to your seed list? Automatic removal would not be a
sustainable way to maintain an index.
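
If a site genuinely has to go, the simplest mechanism I'm aware of is to
issue a delete-by-query directly against Solr, separately from the crawl
itself. A rough example, assuming the schema that ships with Nutch (which
indexes a 'host' field) and a Solr core at http://localhost:8983/solr/, with
www.example.com standing in for the site you want to drop:

  curl "http://localhost:8983/solr/update?commit=true" \
    -H "Content-Type: text/xml" \
    --data-binary "<delete><query>host:www.example.com</query></delete>"

That keeps removing a site from the seed list and removing its pages from
the index as two separate, deliberate steps.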

>
> I just found the db.ignore.external.links configuration value - which will
> solve a lot of the issues previously mentioned in passing regarding
> filtering the URLs to crawl.
>

Yes, I would say that experience using the properties in nutch-site.xml and
your various URLFilters in a well-tuned fashion should yield better results
over time.
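
For reference, that property goes in nutch-site.xml and the per-site
restriction is usually expressed in regex-urlfilter.txt; a rough example,
with example.com as a placeholder:

  <!-- nutch-site.xml: ignore outlinks leading to external hosts -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>

  # regex-urlfilter.txt: only accept example.com and its subdomains
  +^http://([a-z0-9-]+\.)*example\.com/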

>
> Thanks again for the help (and apologies for the huge e-mail)
>
> Chris
>
> On 14 July 2011 10:59, lewis john mcgibbney <lewis.mcgibb...@gmail.com
> >wrote:
>
> > Hi Chris,
> >
> > Yes a Nutch 1.3 crawl and Solr index bash script is something that has
> not
> > been added to the wiki yet. I think this is partly because there are very
> > few adjustments to be made to the comprehensive Nutch 1.2 scripts
> currently
> > available on the Nutch wiki. This would however be a great addition if we
> > could get the time to post one. The point of focus I pick up from your
> > thread is that you require a script for a "way of re-crawling previously
> > crawled pages only a certain amount of time after they were last crawled
> > etc". Generally speaking (at this stage anyway), I'll assume that "etc"
> > just
> > means various other property changes within nutch-site.xml.
> >
> > My recommended steps would be something like
> >
> > inject
> > generate
> > fetch
> > parse
> > updatedb
> > invertlinks
> > solrindex
> > solrdedup
> > solrclean
> >
> > We can obviously schedule Nutch to crawl regularly in addition to
> > configuration options in nutch-site.xml therefore "pages only a certain
> > amount of time after they were last crawled" can be dealt with. In
> addition
> > (literally within the last few days) there are now various elements
> > included
> > within the Nutch/Solr communication which could be taken into
> consideration
> > such as HTTP authentication for all Solr communication (Markus, 1.4-dev),
> > so
> > this will feature with the next release.
> >
> > Finally Nutch removes pages from a Solr index in two ways. Solrdedup [1,
> 3]
> > and Solrclean [2, 4]... descriptions for both can be read in the
> > accompanying references below.
> >
> > There was a comment on the list recently regarding manually removing a
> page
> > which is not required in the index any more, however I'm guessing that
> this
> > is not what you require?
> >
> > [1] http://wiki.apache.org/nutch/bin/nutch%20solrdedup
> > [2] http://wiki.apache.org/nutch/bin/nutch%20solrclean
> > [3]
> >
> >
> http://svn.apache.org/repos/asf/nutch/tags/release-1.3/src/java/org/apache/nutch/indexer/solr/SolrDeleteDuplicates.java
> > [4]
> >
> >
> http://svn.apache.org/repos/asf/nutch/tags/release-1.3/src/java/org/apache/nutch/indexer/solr/SolrClean.java
> >
> >
> > On Wed, Jul 13, 2011 at 6:38 PM, Chris Alexander <
> > chris.alexan...@kusiri.com
> > > wrote:
> >
> > > Hi,
> > >
> > > I have been looking up re-crawling mechanisms with Nutch, and just
> about
> > > all
> > > I have come across is designed for pre-1.3 versions using the non-Solr
> > > index. We're using 1.3 with the Solr index (just because that was the
> > > latest
> > > version we downloaded to try out and we are already using Solr), and
> > seeing
> > > as this is the way that the project will be moving forward, is there a
> > > documented / supported / recommended way of re-crawling previously
> > crawled
> > > pages only a certain amount of time after they were last crawled etc.
> > with
> > > the current release? Please feel free to just point me to any docs or
> > > scripts that do it, but all I have found so far seems to support only
> the
> > > previous version's internal indexing.
> > >
> > > I guess this leads on to the question of how / whether Nutch removes
> > pages
> > > from its index, for example if we want to remove a whole load of pages
> > from
> > > the index, is this something that Nutch supports?
> > >
> > > Thanks again for the assistance.
> > >
> > > Chris
> > >
> >
> >
> >
> > --
> > *Lewis*
> >
>



-- 
*Lewis*
