All,

Can you please help me with my problem? I have posted my question a few times, but I still can't solve it. I would appreciate it if anyone could help me with this.
I am trying to set up Nutch 0.9 to crawl www.yahoo.com (my setup works for cnn.com and msn.com, but not for yahoo.com). I am using this command: "bin/nutch crawl urls -dir crawl -depth 3". But after the command finishes, no links have been fetched. The only strange thing I see in the Hadoop log is this warning:

2007-04-16 23:22:48,062 WARN regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default

Is that something I need to set up before www.yahoo.com can be crawled? Here is the output:

crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20070416230326
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20070416230326
Fetcher: threads: 10
fetching http://www.yahoo.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20070416230326]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20070416230338
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: crawl/segments/20070416230326
LinkDb: done
Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20070416230326
Indexing [http://www.yahoo.com/] with analyzer [EMAIL PROTECTED] (null)
Optimizing index.
merging segments _ram_0 (1 docs) into _0 (1 docs)
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl/indexes
Dedup: done
merging indexes to: crawl/index
Adding crawl/indexes/part-00000
done merging
crawl finished: crawl
CrawlDb topN: starting (topN=25, min=0.0)
CrawlDb db: crawl/crawldb
CrawlDb topN: collecting topN scores.
CrawlDb topN: done

Match
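P.S. One thing I still need to rule out, and this is only a guess on my part: as far as I understand, the one-step "bin/nutch crawl" command filters every outlink through conf/crawl-urlfilter.txt, and the stock Nutch 0.9 file rejects any URL containing query characters. If most of the yahoo.com front-page links carry '?' or '=' parameters, that rule alone would leave nothing to fetch at depth 2, which would match the "0 records selected for fetching" line above. The relevant lines look roughly like this (the yahoo.com line is my own edit in place of the MY.DOMAIN.NAME placeholder, not something from the default file):

    # conf/crawl-urlfilter.txt (used by the one-step "bin/nutch crawl")

    # skip URLs containing certain characters as probable queries, etc.;
    # outlinks with '?' or '=' in them are dropped by this rule
    -[?*!@=]

    # accept hosts in the domain you want (stock file has MY.DOMAIN.NAME here)
    +^http://([a-z0-9]*\.)*yahoo.com/

    # skip everything else
    -.

The RegexURLNormalizer warning itself just seems to say that there are no 'outlink'-scoped rules and the default rules are used instead, so it is probably unrelated.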
On 4/18/07, c wanek <[EMAIL PROTECTED]> wrote:
> Thanks Raj,
>
> First of all, here's some info I didn't include in the original question:
> I'm using Nutch 0.9, and my attempt to add to my index is basically a
> variation of the recrawl script at
> http://wiki.apache.org/nutch/IntranetRecrawl#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03 :
>
> inject new urls
> fetch loop to "depth":
> {
>   generate
>   fetch
>   update db
> }
> merge segments
> invert links
> index
> dedup
> merge indexes
>
> (I wind up with the merged index in 'index/merge-output' instead of
> 'index', but I thought perhaps I could deal with that weirdness once my
> index has the stuff I want...)
>
> Now, perhaps I don't understand the patch you pointed me to, but it seems
> that it is only meant to avoid recrawling content that hasn't changed. It
> doesn't really have anything to do with avoiding a rebuild of the entire
> index if I add a document. Or does it, and I just missed it?
>
> Does Nutch have the ability to add to an index without a complete rebuild,
> or is a complete rebuild required if I add even a single document?
>
> Furthermore, even if I were to decide that the complete rebuild is
> acceptable, Nutch is still discarding my custom fields from all documents
> that are not being updated. Why is this happening?
>
> I appreciate the help; thanks.
> -Charlie
>
> On 4/14/07, rubdabadub <[EMAIL PROTECTED]> wrote:
> >
> > Hi Charlie:
> >
> > On 4/14/07, c wanek <[EMAIL PROTECTED]> wrote:
> > > Greetings,
> > >
> > > Now I'm at the point where I would like to add to my crawl with a new
> > > set of seed urls. Using a variation on the recrawl script on the wiki,
> > > I can make this happen, but I am running into what is, for me, a
> > > showstopper issue. The custom fields I added to the documents of the
> > > first crawl are lost when the documents from the second crawl are
> > > added to the index.
> >
> > Nutch is all about writing once. All operations write once; this is how
> > map-reduce works. This is why incremental crawling is difficult. But :-)
> >
> > http://issues.apache.org/jira/browse/NUTCH-61
> >
> > Like you, many others want this to happen. And to the best of my
> > knowledge, Andrzej Bialecki will be addressing the issue after the 0.9
> > release, which is due anytime now :-)
> >
> > So you might give it a go with NUTCH-61, but NOTE that it doesn't work
> > with the current trunk.
> >
> > Regards
> > raj
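For anyone following along, the steps Charlie lists above map onto the individual Nutch 0.9 commands roughly as in the sketch below. The directory layout, the depth of 3, and the -topN value are placeholders of mine rather than anything from this thread, so treat the wiki page linked above as the tested version:

    #!/bin/sh
    # Rough sketch of the recrawl sequence described above (Nutch 0.9).
    cd /path/to/nutch            # placeholder install location
    depth=3                      # number of generate/fetch/updatedb rounds

    # inject any new seed urls into the existing crawldb
    bin/nutch inject crawl/crawldb urls

    # generate / fetch / updatedb loop
    i=1
    while [ "$i" -le "$depth" ]; do
      bin/nutch generate crawl/crawldb crawl/segments -topN 1000
      segment=`ls -d crawl/segments/* | tail -1`
      bin/nutch fetch "$segment"
      bin/nutch updatedb crawl/crawldb "$segment"
      i=`expr $i + 1`
    done

    # merge the segments into a single new segment
    bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments

    # invert links, then index and dedup over the merged segment
    bin/nutch invertlinks crawl/linkdb -dir crawl/MERGEDsegments
    bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb crawl/MERGEDsegments/*
    bin/nutch dedup crawl/NEWindexes

    # merge the part-indexes into one index
    bin/nutch merge crawl/MERGEDindex crawl/NEWindexes

Swapping the merged index in for the live one, and cleaning up the old segments and temporary directories, is left out of this sketch; the wiki script handles those details.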
