It's a cool tool: you can browse and manage the index you have crawled. For
example, it can display the number of documents, fields, and terms in your
index, run searches, and much more. I hope it is helpful for you.
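
(If it helps, Luke is launched straight from its jar and then pointed at the
index directory, e.g. the crawl/index produced by the commands below; the jar
name lukeall.jar is just a guess here, it varies by version:)

$ java -jar lukeall.jar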
songjue
2007-04-17
From: Meryl Silverburgh
Sent: 2007-04-17 02:35:34
To: [EMAIL PROTECTED]
Cc:
Subject: Re: Re: Crawl www.yahoo.com with nutch
No, I am not. Can you please tell me how I can use Luke to troubleshoot
my problem?
On 4/16/07, songjue <[EMAIL PROTECTED] > wrote:
> Really strange. Did you try Luke? It's much more convenient for debugging.
>
>
>
>
> songjue
> 2007-04-16
>
>
>
> From: Meryl Silverburgh
> Sent: 2007-04-16 12:15:39
> To: [EMAIL PROTECTED]
> Cc:
> Subject: Re: Crawl www.yahoo.com with nutch
>
> I have used this command to crawl, up to a depth of 6:
> $ bin/nutch crawl urls -dir crawl -depth 6
> and then this command to read the links:
> $ bin/nutch readdb crawl/crawldb -topN 50 test
>
> but I only get 10 links. Can you please tell me why?
> $ more test/part-00000
> 2.1111112 http://www.yahoo.com/
> 0.11111111 http://srd.yahoo.com/hp5-v
> 0.11111111 http://www.yahoo.com/+document.cookie+
> 0.11111111 http://www.yahoo.com/1.0
> 0.11111111 http://www.yahoo.com/2.0.0
> 0.11111111 http://www.yahoo.com/r/hf
> 0.11111111 http://www.yahoo.com/r/hq
> 0.11111111 http://www.yahoo.com/r\/1m
> 0.11111111 http://www.yahoo.com/s/557762
> 0.11111111 http://www.yahoo.com/s/557770
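>
> (One way to check, as a rough sketch: the -stats option of readdb prints the
> total URL count in the crawldb, so if it also reports around 10 URLs, the db
> itself only holds 10 links and nothing is being truncated by -topN:)
>
> $ bin/nutch readdb crawl/crawldb -stats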
>
>
> On 4/15/07, Meryl Silverburgh <[EMAIL PROTECTED] > wrote:
> > I am using 0.9 too.
> >
> > I am getting further now, but I get a bunch of NullPointerExceptions:
> >
> > fetch of http://www.yahoo.com/s/557760 failed with:
> > java.lang.NullPointerException
> > fetch of http://www.yahoo.com/r/hq failed with:
> > java.lang.NullPointerException
> > fetch of http://www.yahoo.com/s/557762 failed with:
> > java.lang.NullPointerException
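> >
> > (A sketch of where I would look first: Nutch writes the full stack traces
> > to logs/hadoop.log by default, so the tail of that file should show what
> > is actually null:)
> >
> > $ tail -n 100 logs/hadoop.log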
> >
> >
> > On 4/15/07, songjue <[EMAIL PROTECTED] > wrote:
> > > I tried this; Nutch 0.9 works just fine. What's your Nutch version?
> > >
> > >
> > >
> > > songjue
> > > 2007-04-16
> > >
> > >
> > >
> > > From: Meryl Silverburgh
> > > Sent: 2007-04-16 11:33:05
> > > To: [EMAIL PROTECTED]
> > > Cc:
> > > Subject: Crawl www.yahoo.com with nutch
> > >
> > > I set up Nutch to crawl; in my input file, I only have one site:
> > > "http://www.yahoo.com"
> > >
> > > $ bin/nutch crawl urls -dir crawl -depth 3
> > >
> > > and I have added 'yahoo.com' as my domain name in crawl-urlfilter.txt
> > >
> > > # accept hosts in MY.DOMAIN.NAME
> > > +^http://([a-zA-Z0-9]*\.)*(cnn.com|yahoo.com)/
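> > >
> > > (A side note on that pattern: the dots are unescaped and so match any
> > > character; a stricter variant, if that ever matters, would be:)
> > >
> > > +^http://([a-zA-Z0-9]*\.)*(cnn\.com|yahoo\.com)/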
> > >
> > > But no links are being fetched. When I change the link to www.cnn.com,
> > > it works. Can you please tell me what I need to do to make
> > > www.yahoo.com work?
> > >
> > > $ bin/nutch crawl urls -dir crawl -depth 3
> > > crawl started in: crawl
> > > rootUrlDir = urls
> > > threads = 10
> > > depth = 3
> > > Injector: starting
> > > Injector: crawlDb: crawl/crawldb
> > > Injector: urlDir: urls
> > > Injector: Converting injected urls to crawl db entries.
> > > Injector: Merging injected urls into crawl db.
> > > Injector: done
> > > Generator: Selecting best-scoring urls due for fetch.
> > > Generator: starting
> > > Generator: segment: crawl/segments/20070415222440
> > > Generator: filtering: false
> > > Generator: topN: 2147483647
> > > Generator: jobtracker is 'local', generating exactly one partition.
> > > Generator: Partitioning selected urls by host, for politeness.
> > > Generator: done.
> > > Fetcher: starting
> > > Fetcher: segment: crawl/segments/20070415222440
> > > Fetcher: threads: 10
> > > fetching http://www.yahoo.com/
> > > Fetcher: done
> > > CrawlDb update: starting
> > > CrawlDb update: db: crawl/crawldb
> > > CrawlDb update: segments: [crawl/segments/20070415222440]
> > > CrawlDb update: additions allowed: true
> > > CrawlDb update: URL normalizing: true
> > > CrawlDb update: URL filtering: true
> > > CrawlDb update: Merging segment data into db.
> > > CrawlDb update: done
> > > Generator: Selecting best-scoring urls due for fetch.
> > > Generator: starting
> > > Generator: segment: crawl/segments/20070415222449
> > > Generator: filtering: false
> > > Generator: topN: 2147483647
> > > Generator: jobtracker is 'local', generating exactly one partition.
> > > Generator: 0 records selected for fetching, exiting ...
> > > Stopping at depth=1 - no more URLs to fetch.
> > > LinkDb: starting
> > > LinkDb: linkdb: crawl/linkdb
> > > LinkDb: URL normalize: true
> > > LinkDb: URL filter: true
> > > LinkDb: adding segment: crawl/segments/20070415222440
> > > LinkDb: done
> > > Indexer: starting
> > > Indexer: linkdb: crawl/linkdb
> > > Indexer: adding segment: crawl/segments/20070415222440
> > > Indexing [http://www.yahoo.com/] with analyzer
> > > [EMAIL PROTECTED] (null)
> > > Optimizing index.
> > > merging segments _ram_0 (1 docs) into _0 (1 docs)
> > > Indexer: done
> > > Dedup: starting
> > > Dedup: adding indexes in: crawl/indexes
> > > Dedup: done
> > > merging indexes to: crawl/index
> > > Adding crawl/indexes/part-00000
> > > done merging
> > > crawl finished: crawl
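> > >
> > > (For reference, a way to see exactly what that single fetch produced is
> > > to dump the segment with readseg; the output directory name segdump is
> > > arbitrary:)
> > >
> > > $ bin/nutch readseg -dump crawl/segments/20070415222440 segdump
> > > $ more segdump/dump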
> > >
> >
>