All,

Can you please help me with my problem? I have posted my question a few times, but I still can't solve it. I would appreciate it if anyone could help me with this.
I am trying to set up Nutch 0.9 to crawl www.yahoo.com (my setup works for cnn.com and msn.com, but not for yahoo.com). I am using this command: "bin/nutch crawl urls -dir crawl -depth 3". But after the command finishes, no links have been fetched. The only strange thing I see in the Hadoop log is this warning:

2007-04-16 23:22:48,062 WARN regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default

Is that something I need to set up before www.yahoo.com can be crawled? Here is the output:

crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20070416230326
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20070416230326
Fetcher: threads: 10
fetching http://www.yahoo.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20070416230326]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20070416230338
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: crawl/segments/20070416230326
LinkDb: done
Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20070416230326
Indexing [http://www.yahoo.com/] with analyzer [EMAIL PROTECTED] (null)
Optimizing index.
merging segments _ram_0 (1 docs) into _0 (1 docs)
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl/indexes
Dedup: done
merging indexes to: crawl/index
Adding crawl/indexes/part-00000
done merging
crawl finished: crawl
CrawlDb topN: starting (topN=25, min=0.0)
CrawlDb db: crawl/crawldb
CrawlDb topN: collecting topN scores.
CrawlDb topN: done

Match
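P.S. One thing I still need to rule out, and this is only a guess on my part: as far as I understand, the one-step "bin/nutch crawl" command filters every outlink through conf/crawl-urlfilter.txt, and the stock Nutch 0.9 file rejects any URL containing query characters. If most of the yahoo.com front-page links carry '?' or '=' parameters, that rule alone would leave nothing to fetch at depth 2, which would match the "0 records selected for fetching" line above. The relevant lines look roughly like this (the yahoo.com line is my own edit in place of the MY.DOMAIN.NAME placeholder, not something from the default file):

    # conf/crawl-urlfilter.txt (used by the one-step "bin/nutch crawl")

    # skip URLs containing certain characters as probable queries, etc.;
    # outlinks with '?' or '=' in them are dropped by this rule
    -[?*!@=]

    # accept hosts in the domain you want (stock file has MY.DOMAIN.NAME here)
    +^http://([a-z0-9]*\.)*yahoo.com/

    # skip everything else
    -.

The RegexURLNormalizer warning itself just seems to say that there are no 'outlink'-scoped rules and the default rules are used instead, so it is probably unrelated.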
On 4/18/07, c wanek <[EMAIL PROTECTED]> wrote:
> Thanks Raj,
>
> First of all, here's some info I didn't include in the original question:
> I'm using Nutch 0.9, and my attempt to add to my index is basically a
> variation of the recrawl script at
> http://wiki.apache.org/nutch/IntranetRecrawl#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03 :
>
> inject new urls
> fetch loop to "depth":
> {
>   generate
>   fetch
>   update db
> }
> merge segments
> invert links
> index
> dedup
> merge indexes
>
> (I wind up with the merged index in 'index/merge-output' instead of
> 'index', but I thought perhaps I could deal with that weirdness once my
> index has the stuff I want...)
>
> Now, perhaps I don't understand the patch you pointed me to, but it seems
> that it is only meant to avoid recrawling content that hasn't changed. It
> doesn't really have anything to do with avoiding a rebuild of the entire
> index if I add a document. Or does it, and I just missed it?
>
> Does Nutch have the ability to add to an index without a complete rebuild,
> or is a complete rebuild required if I add even a single document?
>
> Furthermore, even if I were to decide that the complete rebuild is
> acceptable, Nutch is still discarding my custom fields from all documents
> that are not being updated. Why is this happening?
>
> I appreciate the help; thanks.
> -Charlie
>
> On 4/14/07, rubdabadub <[EMAIL PROTECTED]> wrote:
> >
> > Hi Charlie:
> >
> > On 4/14/07, c wanek <[EMAIL PROTECTED]> wrote:
> > > Greetings,
> > >
> > > Now I'm at the point where I would like to add to my crawl with a new
> > > set of seed urls. Using a variation on the recrawl script on the wiki,
> > > I can make this happen, but I am running into what is, for me, a
> > > showstopper issue. The custom fields I added to the documents of the
> > > first crawl are lost when the documents from the second crawl are
> > > added to the index.
> >
> > Nutch is all about writing once. All operations write once; this is how
> > map-reduce works. This is why incremental crawling is difficult. But :-)
> >
> > http://issues.apache.org/jira/browse/NUTCH-61
> >
> > Like you, many others want this to happen. And to the best of my
> > knowledge, Andrzej Bialecki will be addressing the issue after the 0.9
> > release, which is due anytime now :-)
> >
> > So you might give it a go with NUTCH-61, but NOTE that it doesn't work
> > with the current trunk.
> >
> > Regards
> > raj
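For anyone following along, the steps Charlie lists above map onto the individual Nutch 0.9 commands roughly as in the sketch below. The directory layout, the depth of 3, and the -topN value are placeholders of mine rather than anything from this thread, so treat the wiki page linked above as the tested version:

    #!/bin/sh
    # Rough sketch of the recrawl sequence described above (Nutch 0.9).
    cd /path/to/nutch            # placeholder install location
    depth=3                      # number of generate/fetch/updatedb rounds

    # inject any new seed urls into the existing crawldb
    bin/nutch inject crawl/crawldb urls

    # generate / fetch / updatedb loop
    i=1
    while [ "$i" -le "$depth" ]; do
      bin/nutch generate crawl/crawldb crawl/segments -topN 1000
      segment=`ls -d crawl/segments/* | tail -1`
      bin/nutch fetch "$segment"
      bin/nutch updatedb crawl/crawldb "$segment"
      i=`expr $i + 1`
    done

    # merge the segments into a single new segment
    bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments

    # invert links, then index and dedup over the merged segment
    bin/nutch invertlinks crawl/linkdb -dir crawl/MERGEDsegments
    bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb crawl/MERGEDsegments/*
    bin/nutch dedup crawl/NEWindexes

    # merge the part-indexes into one index
    bin/nutch merge crawl/MERGEDindex crawl/NEWindexes

Swapping the merged index in for the live one, and cleaning up the old segments and temporary directories, is left out of this sketch; the wiki script handles those details.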
