I tried your suggestion too, but it still does not fetch anything:
$ bin/nutch fetch segments/20070417230152 -threads 20
Fetcher: starting
Fetcher: segment: segments/20070417230152
Fetcher: threads: 20
fetching http://www.yahoo.com/
Fetcher: done

and the index command does not work either:

$ bin/nutch index ./crawldb ./linkdb segments/20070417230152
Usage: <index> <crawldb> <linkdb> <segment> ...

On 4/17/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote:
> Thanks, but I did have this in crawl-urlfilter.txt:
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*(yahoo.com)/
>
> On 4/17/07, Ian Holsman <[EMAIL PROTECTED]> wrote:
> > I'll try.
> >
> > First: I don't use 'crawl'; I do it the long-winded way, which I find
> > works better.
> > From what I can guess, I'm thinking you haven't modified the
> > regex-urlfilter.txt file to allow yahoo to be crawled. You would need
> > to add:
> >
> > +^http://([a-z0-9]*\.)*yahoo.com/
> >
> > The easiest documentation on how to get all this working is here:
> > http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html
> >
> > Page down to the section 'setting up nutch' and follow the 4-step
> > process documented there.
> >
> > Just change the last nutch command from:
> >
> > bin/nutch org.apache.nutch.indexer.SolrIndexer $BASEDIR/crawldb $BASEDIR/linkdb $SEGMENT
> >
> > to:
> >
> > bin/nutch index $BASEDIR/crawldb $BASEDIR/linkdb $SEGMENT
> >
> > and it should do all the right things.
> >
> > regards,
> > Ian
> > (P.S. I'm no expert, just 1-2 steps ahead of where you are.)
> >
> > On 18/04/2007, at 12:12 PM, Meryl Silverburgh wrote:
> >
> > > Ian,
> > >
> > > Can you please help me with my problem too?
> > >
> > > I am trying to set up Nutch 0.9 to crawl www.yahoo.com.
> > > I am using this command: "bin/nutch crawl urls -dir crawl -depth 3".
> > >
> > > But after the command, no links have been fetched.
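[Editor's note] The crawl-urlfilter.txt rule discussed above can be sanity-checked outside Nutch with `grep -E`. This is only a rough stand-in for Nutch's urlfilter-regex plugin (which applies each `+`/`-` rule in file order), and note the escaped dot in `yahoo\.com` below is stricter than the thread's `(yahoo.com)`, whose unescaped dot matches any character:

```shell
# Rough stand-in for Nutch's regex-urlfilter check: the '+' prefix in the
# config file means "accept"; the regex itself is what follows it.
# yahoo\.com (escaped dot) is stricter than the thread's (yahoo.com).
pattern='^http://([a-z0-9]*\.)*yahoo\.com/'
echo 'http://www.yahoo.com/'  | grep -Eq "$pattern" && echo 'www.yahoo.com: accepted'
echo 'http://www.google.com/' | grep -Eq "$pattern" || echo 'www.google.com: rejected'
```

If the seed URL itself fails this check, the Generator will select nothing on the next round, which matches the "0 records selected for fetching" output below.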
> > >
> > > The only strange thing I see in the hadoop log is this warning:
> > >
> > > 2007-04-16 23:22:48,062 WARN regex.RegexURLNormalizer - can't find
> > > rules for scope 'outlink', using default
> > >
> > > Is that something I need to set up before www.yahoo.com can be crawled?
> > >
> > > Here is the output:
> > >
> > > crawl started in: crawl
> > > rootUrlDir = urls
> > > threads = 10
> > > depth = 3
> > > Injector: starting
> > > Injector: crawlDb: crawl/crawldb
> > > Injector: urlDir: urls
> > > Injector: Converting injected urls to crawl db entries.
> > > Injector: Merging injected urls into crawl db.
> > > Injector: done
> > > Generator: Selecting best-scoring urls due for fetch.
> > > Generator: starting
> > > Generator: segment: crawl/segments/20070416230326
> > > Generator: filtering: false
> > > Generator: topN: 2147483647
> > > Generator: jobtracker is 'local', generating exactly one partition.
> > > Generator: Partitioning selected urls by host, for politeness.
> > > Generator: done.
> > > Fetcher: starting
> > > Fetcher: segment: crawl/segments/20070416230326
> > > Fetcher: threads: 10
> > > fetching http://www.yahoo.com/
> > > Fetcher: done
> > > CrawlDb update: starting
> > > CrawlDb update: db: crawl/crawldb
> > > CrawlDb update: segments: [crawl/segments/20070416230326]
> > > CrawlDb update: additions allowed: true
> > > CrawlDb update: URL normalizing: true
> > > CrawlDb update: URL filtering: true
> > > CrawlDb update: Merging segment data into db.
> > > CrawlDb update: done
> > > Generator: Selecting best-scoring urls due for fetch.
> > > Generator: starting
> > > Generator: segment: crawl/segments/20070416230338
> > > Generator: filtering: false
> > > Generator: topN: 2147483647
> > > Generator: jobtracker is 'local', generating exactly one partition.
> > > Generator: 0 records selected for fetching, exiting ...
> > > Stopping at depth=1 - no more URLs to fetch.
> > > LinkDb: starting
> > > LinkDb: linkdb: crawl/linkdb
> > > LinkDb: URL normalize: true
> > > LinkDb: URL filter: true
> > > LinkDb: adding segment: crawl/segments/20070416230326
> > > LinkDb: done
> > > Indexer: starting
> > > Indexer: linkdb: crawl/linkdb
> > > Indexer: adding segment: crawl/segments/20070416230326
> > > Indexing [http://www.yahoo.com/] with analyzer [EMAIL PROTECTED] (null)
> > > Optimizing index.
> > > merging segments _ram_0 (1 docs) into _0 (1 docs)
> > > Indexer: done
> > > Dedup: starting
> > > Dedup: adding indexes in: crawl/indexes
> > > Dedup: done
> > > merging indexes to: crawl/index
> > > Adding crawl/indexes/part-00000
> > > done merging
> > > crawl finished: crawl
> > > CrawlDb topN: starting (topN=25, min=0.0)
> > > CrawlDb db: crawl/crawldb
> > > CrawlDb topN: collecting topN scores.
> > > CrawlDb topN: done
> > > Match
> > >
> > > On 4/17/07, Ian Holsman <[EMAIL PROTECTED]> wrote:
> > >> Hi Anita,
> > >>
> > >> I tried crawling autos.aol.com, and I could find pages similar to
> > >> what you're looking at in 3 crawls. (I injected http://autos.aol.com/
> > >> and added autos.aol.com to my regex filter to allow it.)
> > >>
> > >> e.g.
> > >> fetching http://autos.aol.com/bmw-650-2007:8774-photos
> > >> fetching http://autos.aol.com/article/general/v2/_a/auto-financing-101/20060818153509990001
> > >> fetching http://autos.aol.com/options_trimless?v=8544
> > >> fetching http://autos.aol.com/toyota-camry-hybrid-2007:8322-overviewl
> > >> fetching http://autos.aol.com/bmw-m-2007:8905-overview
> > >> fetching http://autos.aol.com/getaquote?myid=8623
> > >> fetching http://autos.aol.com/options_trimless?v=8226
> > >> fetching http://autos.aol.com/options_trimless?v=7803
> > >> fetching http://autos.aol.com/article/power/v2/_a/2006-dodge-charger-srt8/20061030193309990001
> > >> fetching http://autos.aol.com/bmw-x3-2007:8770-specs
> > >> fetching http://autos.aol.com/saturn-vue-2007:8371-overview
> > >> fetching http://autos.aol.com/aston-martin-vanquish-2006:8115-overview
> > >> fetching http://autos.aol.com/options_trimless?v=8394
> > >> fetching http://autos.aol.com/jaguar-listings:JA---
> > >> fetching http://autos.aol.com/volkswagen-rabbit-2007:8554-overview
> > >> fetching http://autos.aol.com/bmw-x5-2007:8817-overview
> > >> fetching http://autos.aol.com/audi-a4-2007:8622-specs
> > >> fetching http://autos.aol.com/options_trimless?v=8416
> > >> fetching http://autos.aol.com/getaquote?myid=8774
> > >>
> > >> The difference is that I am using the latest Nutch (SVN head), and
> > >> am just using a local store, not Hadoop.
> > >>
> > >> What I would do next, if I were you, is check your regex filters to
> > >> make sure you are not blocking things with a colon ':' in them for
> > >> some strange reason, and possibly upgrade to the latest and greatest
> > >> version of Nutch (0.9.1).
> > >>
> > >> regards,
> > >> Ian.
> > >>
> > >> On 18/04/2007, at 5:56 AM, [EMAIL PROTECTED] wrote:
> > >>
> > >> > Hi,
> > >> >
> > >> > I am a new Nutch user, and am using Nutch 0.8.1 with Hadoop.
> > >> > The domain I am trying to crawl is http://autos.aol.com. I am
> > >> > crawling to a depth of 10.
> > >> > There are certain pages that Nutch could not fetch. An example
> > >> > would be http://autos.aol.com/acura-rl-2006:8060-review.
> > >> >
> > >> > The referring URL to this page is
> > >> > http://autos.aol.com/acura-rl-2007:8060-review. This URL was there
> > >> > in the fetch list.
> > >> >
> > >> > If I do a mini crawl pointing directly at
> > >> > http://autos.aol.com/acura-rl-2007:8060-review, then the page
> > >> > http://autos.aol.com/acura-rl-2006:8060-review gets fetched.
> > >> >
> > >> > Does anyone have any ideas on why I am seeing this behavior?
> > >> >
> > >> > Thanks,
> > >> > Anita Bidari (X55746)
> > >>
> > >> Ian Holsman
> > >> [EMAIL PROTECTED]
> > >> http://parent-chatter.com -- what do parents know?
> >
> > --
> > Ian Holsman
> > [EMAIL PROTECTED]
> > http://zyons.com/ build a Community with Django

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
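[Editor's note] Ian's colon hypothesis can be tested directly: a "skip URLs containing certain characters" deny rule in regex-urlfilter.txt that includes ':' would silently drop the autos.aol.com review pages. The check below is a sketch; it strips the scheme first, because the `://` in every URL would otherwise trip any naive colon test:

```shell
# Does this URL carry a ':' outside the scheme? That is the kind of URL a
# hypothetical deny rule including ':' in regex-urlfilter.txt would drop.
url='http://autos.aol.com/acura-rl-2006:8060-review'
rest="${url#http://}"                      # drop the scheme's own '://'
case "$rest" in
  *:*) echo "contains ':' outside the scheme -- a colon deny rule would drop it" ;;
  *)   echo "no ':' outside the scheme" ;;
esac
```

If your filter file does have such a rule, either remove the ':' from its character class or add an accept rule for these URLs above it, since Nutch applies rules in file order.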
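[Editor's note] Ian's "long-winded way" corresponds to running the crawl phases by hand, as in the foofactory post he links. A sketch of one cycle, assuming the standard Nutch 0.9 subcommands (inject, generate, fetch, updatedb, invertlinks, index); the segment timestamp and the crawl/ base dir are illustrative, and NUTCH defaults to a dry-run echo so the sketch runs without a Nutch install:

```shell
# One hand-run crawl cycle (the "long-winded way"). NUTCH defaults to a
# dry-run echo; set NUTCH=bin/nutch to execute for real.
NUTCH="${NUTCH:-echo bin/nutch}"
BASEDIR=crawl

$NUTCH inject $BASEDIR/crawldb urls                 # seed the crawldb from urls/
$NUTCH generate $BASEDIR/crawldb $BASEDIR/segments  # pick URLs due for fetching
SEGMENT=$BASEDIR/segments/20070417230152            # in practice: the newest dir under segments/
$NUTCH fetch $SEGMENT -threads 20                   # fetch the generated segment
$NUTCH updatedb $BASEDIR/crawldb $SEGMENT           # fold fetch results back into the crawldb
$NUTCH invertlinks $BASEDIR/linkdb $SEGMENT         # build/extend the linkdb
$NUTCH index $BASEDIR/index $BASEDIR/crawldb $BASEDIR/linkdb $SEGMENT  # first arg = output index dir
```

Repeat the generate/fetch/updatedb trio once per depth level before inverting links and indexing. Note that `index` takes the output directory as its first argument; the `Usage: <index> <crawldb> <linkdb> <segment> ...` error earlier in the thread is what you get when that argument is left out.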
