No, I am not. Can you please tell me how I can use Luke to
troubleshoot my problem?
On 4/16/07, songjue <[EMAIL PROTECTED]> wrote:
Really strange. Did you try Luke? It's much more convenient for debugging.
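For example, a minimal sketch (the jar name is a placeholder here; use
whatever the all-in-one jar from the Luke download is actually called):

$ java -jar lukeall.jar
# in the open dialog, point the path at the Nutch index, e.g. crawl/index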
songjue
2007-04-16
From: Meryl Silverburgh
Sent: 2007-04-16 12:15:39
To: [EMAIL PROTECTED]
Cc:
Subject: Re: Crawl www.yahoo.com with nutch
I have used this command to crawl, up to a depth of 6:
$ bin/nutch crawl urls -dir crawl -depth 6
and then this command to read the links:
$ bin/nutch readdb crawl/crawldb -topN 50 test
but I only get 10 links. Can you please tell me why?
$ more test/part-00000
2.1111112 http://www.yahoo.com/
0.11111111 http://srd.yahoo.com/hp5-v
0.11111111 http://www.yahoo.com/+document.cookie+
0.11111111 http://www.yahoo.com/1.0
0.11111111 http://www.yahoo.com/2.0.0
0.11111111 http://www.yahoo.com/r/hf
0.11111111 http://www.yahoo.com/r/hq
0.11111111 http://www.yahoo.com/r\/1m
0.11111111 http://www.yahoo.com/s/557762
0.11111111 http://www.yahoo.com/s/557770
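(In case the totals help: -topN only dumps the N highest-scoring entries
that actually exist in the db, so readdb's stats output is a quick way to
see how many urls were stored at all. A sketch with the stock readdb
options:)

$ bin/nutch readdb crawl/crawldb -stats
# prints TOTAL urls plus per-status counts (fetched, unfetched, ...)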
On 4/15/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote:
> I am using 0.9 too.
>
> I am now getting further, but I get a bunch of NullPointerException:
>
> fetch of http://www.yahoo.com/s/557760 failed with:
> java.lang.NullPointerException
> fetch of http://www.yahoo.com/r/hq failed with: java.lang.NullPointerException
> fetch of http://www.yahoo.com/s/557762 failed with:
> java.lang.NullPointerException
>
>
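> To dig into the NullPointerException, one option is to raise the
> fetcher's log level in conf/log4j.properties (a sketch; the logger name
> assumes the stock 0.9 package layout):
>
> log4j.logger.org.apache.nutch.fetcher=DEBUG
>
> and then re-run the crawl and look for the full stack trace in
> logs/hadoop.log.
>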
> On 4/15/07, songjue <[EMAIL PROTECTED]> wrote:
> > I tried this; Nutch 0.9 works just fine. What's your Nutch version?
> >
> >
> >
> > songjue
> > 2007-04-16
> >
> >
> >
> > From: Meryl Silverburgh
> > Sent: 2007-04-16 11:33:05
> > To: [EMAIL PROTECTED]
> > Cc:
> > Subject: Crawl www.yahoo.com with nutch
> >
> > I set up Nutch to crawl; in my input file, I only have one site:
> > "http://www.yahoo.com"
> >
> > $ bin/nutch crawl urls -dir crawl -depth 3
> >
> > and I have added 'yahoo.com' as my domain name in crawl-urlfilter.txt
> >
> > # accept hosts in MY.DOMAIN.NAME
> > +^http://([a-zA-Z0-9]*\.)*(cnn.com|yahoo.com)/
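> >
> > (A side note: the unescaped dots in that pattern are regex wildcards,
> > so it matches slightly more than intended; a stricter version escapes
> > them:)
> >
> > +^http://([a-zA-Z0-9]*\.)*(cnn\.com|yahoo\.com)/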
> >
> > But no links are being fetched. When I change the link to www.cnn.com,
> > it works. Can you please tell me what I need to do to make
> > www.yahoo.com work?
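> >
> > One way to check whether the filters themselves are dropping the
> > outlinks is to run a few through the filter chain by hand (a sketch,
> > assuming your build includes URLFilterChecker; it reads urls from
> > stdin and prints each one prefixed with + if accepted, - if rejected):
> >
> > $ echo 'http://www.yahoo.com/r/hq' | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined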
> >
> > $ bin/nutch crawl urls -dir crawl -depth 3
> > crawl started in: crawl
> > rootUrlDir = urls
> > threads = 10
> > depth = 3
> > Injector: starting
> > Injector: crawlDb: crawl/crawldb
> > Injector: urlDir: urls
> > Injector: Converting injected urls to crawl db entries.
> > Injector: Merging injected urls into crawl db.
> > Injector: done
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: starting
> > Generator: segment: crawl/segments/20070415222440
> > Generator: filtering: false
> > Generator: topN: 2147483647
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: Partitioning selected urls by host, for politeness.
> > Generator: done.
> > Fetcher: starting
> > Fetcher: segment: crawl/segments/20070415222440
> > Fetcher: threads: 10
> > fetching http://www.yahoo.com/
> > Fetcher: done
> > CrawlDb update: starting
> > CrawlDb update: db: crawl/crawldb
> > CrawlDb update: segments: [crawl/segments/20070415222440]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: true
> > CrawlDb update: URL filtering: true
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: done
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: starting
> > Generator: segment: crawl/segments/20070415222449
> > Generator: filtering: false
> > Generator: topN: 2147483647
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: 0 records selected for fetching, exiting ...
> > Stopping at depth=1 - no more URLs to fetch.
> > LinkDb: starting
> > LinkDb: linkdb: crawl/linkdb
> > LinkDb: URL normalize: true
> > LinkDb: URL filter: true
> > LinkDb: adding segment: crawl/segments/20070415222440
> > LinkDb: done
> > Indexer: starting
> > Indexer: linkdb: crawl/linkdb
> > Indexer: adding segment: crawl/segments/20070415222440
> > Indexing [http://www.yahoo.com/] with analyzer
> > [EMAIL PROTECTED] (null)
> > Optimizing index.
> > merging segments _ram_0 (1 docs) into _0 (1 docs)
> > Indexer: done
> > Dedup: starting
> > Dedup: adding indexes in: crawl/indexes
> > Dedup: done
> > merging indexes to: crawl/index
> > Adding crawl/indexes/part-00000
> > done merging
> > crawl finished: crawl
> >
>