I've not looked, but do they have a robots.txt file or a robots META tag set 
that may be stopping things?
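For example, a quick way to check (assuming you have curl handy) would be
something like:

   curl http://www.yahoo.com/robots.txt

and, for the META side, viewing the page source for a tag along the lines of
<meta name="robots" content="noindex,nofollow">. That's just a sketch of what
I mean, not a statement about their actual settings.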

rp

Meryl Silverburgh wrote:
> All,
>
> Can you please help me with my problem? I have posted my question a
> few times, but I still can't solve it. I would appreciate it if anyone
> can help me with this.
>
> I am trying to set up Nutch 0.9 to crawl www.yahoo.com (my setup
> works for cnn.com and msn.com, but not yahoo.com).
> I am using this command "bin/nutch crawl urls -dir crawl -depth 3".
>
> But after the command, no links have been fetched.
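> In case it matters, my urls seed file just contains http://www.yahoo.com/.
> I am also wondering whether the URL filter could be involved. As far as I
> know, the crawl command in 0.9 applies conf/crawl-urlfilter.txt, so a line
> roughly like the following would be needed for yahoo.com links to pass,
> though I am not sure this is the cause:
>
>   +^http://([a-z0-9]*\.)*yahoo.com/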
>
> The only strange thing I see in the Hadoop log is this warning:
>
> 2007-04-16 23:22:48,062 WARN  regex.RegexURLNormalizer - can't find
> rules for scope 'outlink', using default
>
> Is that something I need to set up before www.yahoo.com can be crawled?
>
> Here is the output:
> crawl started in: crawl
> rootUrlDir = urls
> threads = 10
> depth = 3
> Injector: starting
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20070416230326
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl/segments/20070416230326
> Fetcher: threads: 10
> fetching http://www.yahoo.com/
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20070416230326]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20070416230338
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=1 - no more URLs to fetch.
> LinkDb: starting
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: crawl/segments/20070416230326
> LinkDb: done
> Indexer: starting
> Indexer: linkdb: crawl/linkdb
> Indexer: adding segment: crawl/segments/20070416230326
> Indexing [http://www.yahoo.com/] with analyzer
> [EMAIL PROTECTED] (null)
> Optimizing index.
> merging segments _ram_0 (1 docs) into _0 (1 docs)
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: crawl/indexes
> Dedup: done
> merging indexes to: crawl/index
> Adding crawl/indexes/part-00000
> done merging
> crawl finished: crawl
> CrawlDb topN: starting (topN=25, min=0.0)
> CrawlDb db: crawl/crawldb
> CrawlDb topN: collecting topN scores.
> CrawlDb topN: done
> Match
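> For what it's worth, one way I thought I could dig further (assuming I am
> reading the 0.9 tools correctly) is to dump the crawl db stats and the
> fetched segment, e.g.:
>
>   bin/nutch readdb crawl/crawldb -stats
>   bin/nutch readseg -dump crawl/segments/20070416230326 yahoo_dump
>
> to see whether the yahoo.com page was actually parsed and whether any of
> its outlinks made it into the crawl db. (yahoo_dump is just a scratch
> output directory.)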
>
> On 4/18/07, c wanek <[EMAIL PROTECTED]> wrote:
>> Thanks Raj,
>>
>> First of all, here's some info I didn't include in the original question:
>> I'm using Nutch 0.9, and my attempt to add to my index is basically a
>> variation of the recrawl script at:
>> http://wiki.apache.org/nutch/IntranetRecrawl#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03
>>
>> inject new urls
>> fetch loop to "depth":
>> {
>>    generate
>>    fetch
>>    update db
>> }
>> merge segments
>> invert links
>> index
>> dedup
>> merge indexes
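>>
>> Concretely, the script I'm running does roughly the following (directory
>> names like new_urls, crawl/MERGEDsegments and crawl/NEWindexes are just my
>> placeholders; this is a sketch of the shape of it, not the wiki script
>> verbatim):
>>
>>    # inject the new seed urls into the existing crawldb
>>    bin/nutch inject crawl/crawldb new_urls
>>
>>    # generate/fetch/update loop, once per depth level
>>    for i in 1 2 3; do
>>       bin/nutch generate crawl/crawldb crawl/segments
>>       segment=`ls -d crawl/segments/* | tail -1`
>>       bin/nutch fetch $segment
>>       bin/nutch updatedb crawl/crawldb $segment
>>    done
>>
>>    # merge segments, rebuild linkdb, reindex, dedup, merge indexes
>>    bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments
>>    bin/nutch invertlinks crawl/linkdb -dir crawl/MERGEDsegments
>>    bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb \
>>       crawl/MERGEDsegments/*
>>    bin/nutch dedup crawl/NEWindexes
>>    bin/nutch merge crawl/index crawl/NEWindexes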
>>
>> (I wind up with the merged index in 'index/merge-output', instead of
>> 'index', but I thought perhaps I could deal with that weirdness when my
>> index has the stuff I want...)
>>
>>
>>
>> Now, perhaps I don't understand the patch you pointed me to, but it
>> seems that it is only meant to avoid recrawling content that hasn't
>> changed.  It doesn't really have to do with avoiding a rebuild of the
>> entire index if I add a document.  Or does it, and I just missed it?
>>
>> Does Nutch have the ability to add to an index without a complete
>> rebuild, or is a complete rebuild required if I add even a single
>> document?
>>
>> Furthermore, even if I were to decide that the complete rebuild is
>> acceptable, Nutch is still discarding my custom fields from all documents
>> that are not being updated.  Why is this happening?
>>
>> I appreciate the help; thanks.
>> -Charlie
>>
>>
>> On 4/14/07, rubdabadub <[EMAIL PROTECTED]> wrote:
>> >
>> > Hi Charlie:
>> >
>> > On 4/14/07, c wanek <[EMAIL PROTECTED]> wrote:
>> > > Greetings,
>> > >
>> > > Now I'm at the point where I would like to add to my crawl, with a
>> > > new set of seed urls.  Using a variation on the recrawl script on the
>> > > wiki, I can make this happen, but I am running into what is, for me,
>> > > a showstopper issue.  The custom fields I added to the documents of
>> > > the first crawl are lost when the documents from the second crawl are
>> > > added to the index.
>> >
>> > Nutch is all about writing once. All operations write once; this is how
>> > map-reduce works. This is why incremental crawling is difficult. But :-)
>> >
>> > http://issues.apache.org/jira/browse/NUTCH-61
>> >
>> > Like you, many others want this to happen. And to the best of my
>> > knowledge, Andrzej Bialecki will be addressing the issue after the 0.9
>> > release, which is due anytime now :-)
>> >
>> > So you might give it a go with NUTCH-61, but NOTE that it doesn't work
>> > with the current trunk.
>> >
>> > Regards
>> > raj
>> >
>>
>

