Hi:

What I've detected using Nutch 2 (bin/nutch crawl ...) is that on every
cycle (generate, fetch, update) the system fetches all the URLs stored in
the database (Accumulo, MySQL or whatever you use). For example, if my
seeds.txt contains http://localhost, which points to a local Apache with a
welcome page (no outlinks), and I run:

 ./bin/nutch crawl conf/urls -depth 10

Nutch will fetch http://localhost on every step, i.e. ten times. In a wild
crawl, every round fetches all the links fetched before plus the new links
due in the current round. Looking at the fetcher code (FetcherJob.java),
the "run" method reads the batch id from its arguments, but the generator
stores that id in the configuration. As the fetcher can't see the
generator's batch id, it fetches "all". A simple solution (it works for
me :) ) is to check whether the batchId argument is null and, if so, read
the value from the configuration:


 public Map<String,Object> run(Map<String,Object> args) throws Exception {
   ...
   String batchId = (String) args.get(Nutch.ARG_BATCH);
   ...
   if (batchId == null) {
     // The batch id was not passed as an argument: fall back to the id the
     // generator stored in the configuration, or fetch all if that is
     // missing too.
     batchId = getConf().get(GeneratorJob.BATCH_ID, Nutch.ALL_BATCH_ID_STR);
   }
   ...
 }
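
For what it's worth, here is a tiny standalone sketch of the hand-off this
fixes. It is not Nutch code: the class name, the configuration key and the
"-all" sentinel below are placeholders of mine standing in for
GeneratorJob.BATCH_ID and Nutch.ALL_BATCH_ID_STR; inside FetcherJob you
would of course use the real constants, as in the snippet above.

 import java.util.HashMap;
 import java.util.Map;
 import org.apache.hadoop.conf.Configuration;

 public class BatchIdFallbackDemo {
   // Placeholder names; the real constants live in GeneratorJob and Nutch.
   static final String BATCH_ID_KEY = "generate.batch.id";
   static final String ALL_BATCHES  = "-all";

   public static void main(String[] cliArgs) {
     Configuration conf = new Configuration();
     // Generator side: the batch id ends up only in the configuration.
     conf.set(BATCH_ID_KEY, "1358412345-67890");

     // Fetcher side: the crawl driver never put the batch id into the args map.
     Map<String, Object> args = new HashMap<String, Object>();
     String batchId = (String) args.get("batch");

     // Without the fallback, batchId stays null and the fetcher would fetch
     // everything; with it, only the freshly generated batch is fetched.
     if (batchId == null) {
       batchId = conf.get(BATCH_ID_KEY, ALL_BATCHES);
     }
     System.out.println("Fetching batch: " + batchId);
   }
 }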



Regards.






On Thu, Jan 17, 2013 at 11:47 AM, 高睿 <[email protected]> wrote:

> Yes, that is my case.
> Removing all previous data is an option, but the data will be lost.
> I want to write a plugin to empty 'outlinks' for the article pages. So the
> crawling will terminate at the article urls, and therefore no additional
> links will be stored in the DB.
>
>
>
> At 2013-01-16 04:24:11,"Sebastian Nagel" <[email protected]>
> wrote: >Hi,no > >did I understood you correctly? >- feed.txt is placed in
> the seed url folder and >- contains URLs of the 50 article lists >If yes: >
> -depth 2 >will crawl these 50 URLs and for each article list all its 30
> outlinks, >in total 50 + 50*30 = 1550 documents. > >If you continue
> crawling Nutch fetch the outlinks of the 1500 docs fetched >in the second
> cycle, and then the links found again, and so on: it will >continue to
> crawl the whole web. To limit the crawl to exactly the 1550 docs >either
> remove all previously crawled data to start again from scratch >or have a
> look at the plugin "scoring-depth" (it's new and, >unfortunately, not yet
> adapted to 2.x, see https://issues.apache.org/jira/browse/NUTCH-1331 >and
> https://issues.apache.org/jira/browse/NUTCH-1508). > >The option name
> -depth does not mean a "limitation of a certain linkage depth" (that's >the
> meaning in "scoring-depth") but the number of crawl cycles or rounds. >If a
> crawl is started from scratch the results are identical in most cases. >
> >Sebastian > >On 01/15/2013 06:53 PM, 高睿 wrote: >> I'm not quite sure about
> your question here. I'm using the Nutch2.1 default configuration, and run
> command: bin/nutch crawl urls -solr 
> http://localhost:8080/solr/collection2-threads 10 -depth 2 -topN 1000 >> The 
> 'urls' folder includes the blog
> index pages (each index page includes a list of article pages). >> I think
> the plugin 'parse-html' and 'parse-tika' are currently responsible for
> parse the links from the html. Should I clean the outlinks in an additional
> Parse plugin in order to prevent nutch from crawling the outlinks in the
> article page? >>  >>  >>  >> At 2013-01-15 13:31:11,"Lewis John Mcgibbney" <
> [email protected]> wrote: >>> I take it you are updating the
> database with the crawl data? This will mark >>> all links extracted during
> parse phase (depending upon your config) as due >>> for fetching. When you
> generate these links will be populated within the >>> batchId's and Nutch
> will attempt to fetch them. >>> Please also search out list archives for
> the definition of the depth >>> parameter. >>> Lewis >>> >>> On Monday,
> January 14, 2013, 高睿 <[email protected]> wrote: >>>> Hi, >>>> >>>> I'm
> customizing nutch 2.1 for crawling blogs from several authors. Each >>>
> author's blog has list page and article pages. >>>> >>>> Say, I want to
> crawl articles in 50 article lists (each have 30 >>> articles). I add the
> article list links in the feed.txt, and specify >>> '-depth 2' and '-topN
> 2000'. My expectation is each time I run nutch, it >>> will crawl all the
> list pages and the articles in each list. But, actually, >>> it seems the
> urls that nutch crawled becomes more and more, and takes more >>> and more
> time (3 hours -> more than 24 hours). >>>> >>>> Could someone explain me
> what happens? Does nutch 2.1 always start >>> crawling from the seed folder
> and follow the 'depth' parameter? What should >>> I do to meet my
> requirement? >>>> Thanks. >>>> >>>> Regards, >>>> Rui >>>> >>> >>> --  >>>
> *Lewis*
