I tried your suggestion too, but it still does not fetch anything:
$ bin/nutch fetch segments/20070417230152 -threads 20
Fetcher: starting
Fetcher: segment: segments/20070417230152
Fetcher: threads: 20
fetching http://www.yahoo.com/
Fetcher: done

and the index command does not work either:

$ bin/nutch index ./crawldb ./linkdb segments/20070417230152
Usage: <index> <crawldb> <linkdb> <segment> ...

On 4/17/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote:
> Thanks, but I did have this in crawl-urlfilter.txt:
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*(yahoo.com)/
>
> On 4/17/07, Ian Holsman <[EMAIL PROTECTED]> wrote:
> > I'll try.
> >
> > First: I don't use 'crawl'; I do it the long-winded way, which I find
> > works better.
> > From what I can guess, I'm thinking you haven't modified the
> > regex-urlfilter.txt file to allow yahoo to be crawled. You would need
> > to add:
> >
> > +^http://([a-z0-9]*\.)*yahoo.com/
> >
> > The easiest documentation on how to get all this working is here:
> > http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html
> >
> > Page down to the section 'setting up nutch' and follow the 4-step
> > process documented there.
> >
> > Just change the last nutch command from:
> >
> > bin/nutch org.apache.nutch.indexer.SolrIndexer $BASEDIR/crawldb $BASEDIR/linkdb $SEGMENT
> >
> > to:
> >
> > bin/nutch index $BASEDIR/crawldb $BASEDIR/linkdb $SEGMENT
> >
> > and it should do all the right things.
> >
> > regards,
> > Ian
> > (P.S. I'm no expert, just 1-2 steps ahead of where you are.)
> >
> > On 18/04/2007, at 12:12 PM, Meryl Silverburgh wrote:
> >
> > > Ian,
> > >
> > > Can you please help me with my problem too?
> > >
> > > I am trying to set up Nutch 0.9 to crawl www.yahoo.com.
> > > I am using this command: "bin/nutch crawl urls -dir crawl -depth 3".
> > >
> > > But after the command, no links have been fetched.
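[Editor's note] The crawl-urlfilter.txt rule discussed above can be sanity-checked outside Nutch with `grep -E`. This is only a rough stand-in for Nutch's urlfilter-regex plugin (which applies each `+`/`-` rule in file order), and note the escaped dot in `yahoo\.com` below is stricter than the thread's `(yahoo.com)`, whose unescaped dot matches any character:

```shell
# Rough stand-in for Nutch's regex-urlfilter check: the '+' prefix in the
# config file means "accept"; the regex itself is what follows it.
# yahoo\.com (escaped dot) is stricter than the thread's (yahoo.com).
pattern='^http://([a-z0-9]*\.)*yahoo\.com/'
echo 'http://www.yahoo.com/'  | grep -Eq "$pattern" && echo 'www.yahoo.com: accepted'
echo 'http://www.google.com/' | grep -Eq "$pattern" || echo 'www.google.com: rejected'
```

If the seed URL itself fails this check, the Generator will select nothing on the next round, which matches the "0 records selected for fetching" output below.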
> > >
> > > The only strange thing I see in the hadoop log is this warning:
> > >
> > > 2007-04-16 23:22:48,062 WARN regex.RegexURLNormalizer - can't find
> > > rules for scope 'outlink', using default
> > >
> > > Is that something I need to set up before www.yahoo.com can be crawled?
> > >
> > > Here is the output:
> > >
> > > crawl started in: crawl
> > > rootUrlDir = urls
> > > threads = 10
> > > depth = 3
> > > Injector: starting
> > > Injector: crawlDb: crawl/crawldb
> > > Injector: urlDir: urls
> > > Injector: Converting injected urls to crawl db entries.
> > > Injector: Merging injected urls into crawl db.
> > > Injector: done
> > > Generator: Selecting best-scoring urls due for fetch.
> > > Generator: starting
> > > Generator: segment: crawl/segments/20070416230326
> > > Generator: filtering: false
> > > Generator: topN: 2147483647
> > > Generator: jobtracker is 'local', generating exactly one partition.
> > > Generator: Partitioning selected urls by host, for politeness.
> > > Generator: done.
> > > Fetcher: starting
> > > Fetcher: segment: crawl/segments/20070416230326
> > > Fetcher: threads: 10
> > > fetching http://www.yahoo.com/
> > > Fetcher: done
> > > CrawlDb update: starting
> > > CrawlDb update: db: crawl/crawldb
> > > CrawlDb update: segments: [crawl/segments/20070416230326]
> > > CrawlDb update: additions allowed: true
> > > CrawlDb update: URL normalizing: true
> > > CrawlDb update: URL filtering: true
> > > CrawlDb update: Merging segment data into db.
> > > CrawlDb update: done
> > > Generator: Selecting best-scoring urls due for fetch.
> > > Generator: starting
> > > Generator: segment: crawl/segments/20070416230338
> > > Generator: filtering: false
> > > Generator: topN: 2147483647
> > > Generator: jobtracker is 'local', generating exactly one partition.
> > > Generator: 0 records selected for fetching, exiting ...
> > > Stopping at depth=1 - no more URLs to fetch.
> > > LinkDb: starting
> > > LinkDb: linkdb: crawl/linkdb
> > > LinkDb: URL normalize: true
> > > LinkDb: URL filter: true
> > > LinkDb: adding segment: crawl/segments/20070416230326
> > > LinkDb: done
> > > Indexer: starting
> > > Indexer: linkdb: crawl/linkdb
> > > Indexer: adding segment: crawl/segments/20070416230326
> > > Indexing [http://www.yahoo.com/] with analyzer [EMAIL PROTECTED] (null)
> > > Optimizing index.
> > > merging segments _ram_0 (1 docs) into _0 (1 docs)
> > > Indexer: done
> > > Dedup: starting
> > > Dedup: adding indexes in: crawl/indexes
> > > Dedup: done
> > > merging indexes to: crawl/index
> > > Adding crawl/indexes/part-00000
> > > done merging
> > > crawl finished: crawl
> > > CrawlDb topN: starting (topN=25, min=0.0)
> > > CrawlDb db: crawl/crawldb
> > > CrawlDb topN: collecting topN scores.
> > > CrawlDb topN: done
> > > Match
> > >
> > > On 4/17/07, Ian Holsman <[EMAIL PROTECTED]> wrote:
> > >> Hi Anita,
> > >>
> > >> I tried crawling autos.aol.com, and I could find pages similar to
> > >> what you're looking at in 3 crawls. (I injected http://autos.aol.com/
> > >> and added autos.aol.com to my regex filter to allow it.)
> > >>
> > >> e.g.
> > >> fetching http://autos.aol.com/bmw-650-2007:8774-photos
> > >> fetching http://autos.aol.com/article/general/v2/_a/auto-financing-101/20060818153509990001
> > >> fetching http://autos.aol.com/options_trimless?v=8544
> > >> fetching http://autos.aol.com/toyota-camry-hybrid-2007:8322-overviewl
> > >> fetching http://autos.aol.com/bmw-m-2007:8905-overview
> > >> fetching http://autos.aol.com/getaquote?myid=8623
> > >> fetching http://autos.aol.com/options_trimless?v=8226
> > >> fetching http://autos.aol.com/options_trimless?v=7803
> > >> fetching http://autos.aol.com/article/power/v2/_a/2006-dodge-charger-srt8/20061030193309990001
> > >> fetching http://autos.aol.com/bmw-x3-2007:8770-specs
> > >> fetching http://autos.aol.com/saturn-vue-2007:8371-overview
> > >> fetching http://autos.aol.com/aston-martin-vanquish-2006:8115-overview
> > >> fetching http://autos.aol.com/options_trimless?v=8394
> > >> fetching http://autos.aol.com/jaguar-listings:JA---
> > >> fetching http://autos.aol.com/volkswagen-rabbit-2007:8554-overview
> > >> fetching http://autos.aol.com/bmw-x5-2007:8817-overview
> > >> fetching http://autos.aol.com/audi-a4-2007:8622-specs
> > >> fetching http://autos.aol.com/options_trimless?v=8416
> > >> fetching http://autos.aol.com/getaquote?myid=8774
> > >>
> > >> The difference is that I am using the latest Nutch (SVN head), and
> > >> am just using a local store, not Hadoop.
> > >>
> > >> What I would do next, if I were you, is check your regex filters to
> > >> make sure you are not blocking things with a colon ':' in them for
> > >> some strange reason, and possibly upgrade to the latest and greatest
> > >> version of Nutch (0.9.1).
> > >>
> > >> regards,
> > >> Ian.
> > >>
> > >> On 18/04/2007, at 5:56 AM, [EMAIL PROTECTED] wrote:
> > >>
> > >> > Hi,
> > >> >
> > >> > I am a new Nutch user, and am using Nutch 0.8.1 with Hadoop.
> > >> > The domain I am trying to crawl is http://autos.aol.com. I am
> > >> > crawling to a depth of 10.
> > >> > There are certain pages that Nutch could not fetch. An example
> > >> > would be http://autos.aol.com/acura-rl-2006:8060-review.
> > >> >
> > >> > The referring URL to this page is
> > >> > http://autos.aol.com/acura-rl-2007:8060-review. This URL was there
> > >> > in the fetch list.
> > >> >
> > >> > If I do a mini crawl pointing directly at
> > >> > http://autos.aol.com/acura-rl-2007:8060-review, then the page
> > >> > http://autos.aol.com/acura-rl-2006:8060-review gets fetched.
> > >> >
> > >> > Does anyone have any ideas on why I am seeing this behavior?
> > >> >
> > >> > Thanks,
> > >> > Anita Bidari (X55746)
> > >>
> > >> Ian Holsman
> > >> [EMAIL PROTECTED]
> > >> http://parent-chatter.com -- what do parents know?
> >
> > --
> > Ian Holsman
> > [EMAIL PROTECTED]
> > http://zyons.com/ build a Community with Django

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
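[Editor's note] Ian's colon hypothesis can be tested directly: a "skip URLs containing certain characters" deny rule in regex-urlfilter.txt that includes ':' would silently drop the autos.aol.com review pages. The check below is a sketch; it strips the scheme first, because the `://` in every URL would otherwise trip any naive colon test:

```shell
# Does this URL carry a ':' outside the scheme? That is the kind of URL a
# hypothetical deny rule including ':' in regex-urlfilter.txt would drop.
url='http://autos.aol.com/acura-rl-2006:8060-review'
rest="${url#http://}"                      # drop the scheme's own '://'
case "$rest" in
  *:*) echo "contains ':' outside the scheme -- a colon deny rule would drop it" ;;
  *)   echo "no ':' outside the scheme" ;;
esac
```

If your filter file does have such a rule, either remove the ':' from its character class or add an accept rule for these URLs above it, since Nutch applies rules in file order.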
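[Editor's note] Ian's "long-winded way" corresponds to running the crawl phases by hand, as in the foofactory post he links. A sketch of one cycle, assuming the standard Nutch 0.9 subcommands (inject, generate, fetch, updatedb, invertlinks, index); the segment timestamp and the crawl/ base dir are illustrative, and NUTCH defaults to a dry-run echo so the sketch runs without a Nutch install:

```shell
# One hand-run crawl cycle (the "long-winded way"). NUTCH defaults to a
# dry-run echo; set NUTCH=bin/nutch to execute for real.
NUTCH="${NUTCH:-echo bin/nutch}"
BASEDIR=crawl

$NUTCH inject $BASEDIR/crawldb urls                 # seed the crawldb from urls/
$NUTCH generate $BASEDIR/crawldb $BASEDIR/segments  # pick URLs due for fetching
SEGMENT=$BASEDIR/segments/20070417230152            # in practice: the newest dir under segments/
$NUTCH fetch $SEGMENT -threads 20                   # fetch the generated segment
$NUTCH updatedb $BASEDIR/crawldb $SEGMENT           # fold fetch results back into the crawldb
$NUTCH invertlinks $BASEDIR/linkdb $SEGMENT         # build/extend the linkdb
$NUTCH index $BASEDIR/index $BASEDIR/crawldb $BASEDIR/linkdb $SEGMENT  # first arg = output index dir
```

Repeat the generate/fetch/updatedb trio once per depth level before inverting links and indexing. Note that `index` takes the output directory as its first argument; the `Usage: <index> <crawldb> <linkdb> <segment> ...` error earlier in the thread is what you get when that argument is left out.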
