Hi Alvaro,

Would you please open a Jira ticket and submit a patch for this? I would be very willing to help integrate it into the development branch. Thank you for dropping in on this one.

Lewis
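For anyone following along, here is a rough, standalone illustration of the fallback Alvaro describes in the quoted message below. The constant values are placeholders standing in for Nutch.ARG_BATCH, GeneratorJob.BATCH_ID and Nutch.ALL_BATCH_ID_STR, so treat this as a sketch rather than the actual patch:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;

    /**
     * Sketch of the batch-id fallback: prefer the id passed in the job
     * arguments, then the id the generator stored in the configuration,
     * and only fetch "all" if both are missing. The key and constant
     * values below are placeholders, not the real Nutch constants.
     */
    public class BatchIdFallbackSketch {
      static final String ARG_BATCH = "batch";                // stands in for Nutch.ARG_BATCH
      static final String BATCH_ID_KEY = "generate.batch.id"; // stands in for GeneratorJob.BATCH_ID
      static final String ALL_BATCH_ID = "-all";              // stands in for Nutch.ALL_BATCH_ID_STR

      static String resolveBatchId(Map<String, Object> args, Configuration conf) {
        String batchId = (String) args.get(ARG_BATCH);
        if (batchId == null) {
          // Fall back to what the generator wrote into the configuration,
          // or fetch everything if that is missing too.
          batchId = conf.get(BATCH_ID_KEY, ALL_BATCH_ID);
        }
        return batchId;
      }

      public static void main(String[] argv) {
        Configuration conf = new Configuration();
        conf.set(BATCH_ID_KEY, "1364888350-1234567890");
        // Prints the generator's id because the args map does not carry one.
        System.out.println(resolveBatchId(new HashMap<String, Object>(), conf));
      }
    }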
On Tue, Apr 2, 2013 at 1:19 PM, Alvaro Cabrerizo <[email protected]> wrote:
> Hi:
>
> What I've detected using Nutch 2 (bin/nutch crawl ...) is that on every
> cycle (generate, fetch, update) the system fetches all the URLs stored in
> the database (Accumulo, MySQL or whatever you use). For example, if my
> seeds.txt contains http://localhost, which points to a local Apache with a
> welcome page (no outlinks), and I run:
>
> ./bin/nutch crawl conf/urls -depth 10
>
> then on every step Nutch will fetch http://localhost, i.e. ten times. When
> wild crawling, every step will fetch all the links fetched before plus all
> the new links it should fetch during the current round. Looking at the
> fetcher code (FetcherJob.java), the "run" method gets the batch id from its
> arguments, but the generator stores this value in the configuration. As the
> fetcher can't get the generator's id, it fetches "all". A simple solution
> (it works for me :)) could be to check whether the batchId argument is
> null, and if so, get the value from the configuration:
>
> public Map<String,Object> run(Map<String,Object> args) throws Exception {
>   ...
>   String batchId = (String) args.get(Nutch.ARG_BATCH);
>   ...
>   if (batchId == null) { // the argument value is null
>     // get the value stored in the configuration, or fetch all if that is null
>     batchId = getConf().get(GeneratorJob.BATCH_ID, Nutch.ALL_BATCH_ID_STR);
>   }
>
> Regards.
>
> On Thu, Jan 17, 2013 at 11:47 AM, 高睿 <[email protected]> wrote:
>
> > Yes, that is my case.
> > Removing all previous data is an option, but the data will be lost.
> > I want to write a plugin to empty 'outlinks' for the article pages, so
> > the crawl is terminated at the article URLs and no additional links are
> > stored in the DB.
> >
> > At 2013-01-16 04:24:11, "Sebastian Nagel" <[email protected]> wrote:
> > > Hi,
> > > did I understand you correctly?
> > > - feed.txt is placed in the seed url folder and
> > > - contains the URLs of the 50 article lists
> > > If yes:
> > >   -depth 2
> > > will crawl these 50 URLs and, for each article list, all of its 30
> > > outlinks, in total 50 + 50*30 = 1550 documents.
> > >
> > > If you continue crawling, Nutch fetches the outlinks of the 1500 docs
> > > fetched in the second cycle, then the links found in those, and so on:
> > > it will continue to crawl the whole web. To limit the crawl to exactly
> > > the 1550 docs, either remove all previously crawled data to start again
> > > from scratch, or have a look at the plugin "scoring-depth" (it's new
> > > and, unfortunately, not yet adapted to 2.x, see
> > > https://issues.apache.org/jira/browse/NUTCH-1331 and
> > > https://issues.apache.org/jira/browse/NUTCH-1508).
> > >
> > > The option name -depth does not mean a "limitation to a certain linkage
> > > depth" (that's its meaning in "scoring-depth") but the number of crawl
> > > cycles or rounds. If a crawl is started from scratch the results are
> > > identical in most cases.
> > >
> > > Sebastian
> > >
> > > On 01/15/2013 06:53 PM, 高睿 wrote:
> > > > I'm not quite sure about your question here. I'm using the Nutch 2.1
> > > > default configuration and run the command:
> > > > bin/nutch crawl urls -solr http://localhost:8080/solr/collection2 -threads 10 -depth 2 -topN 1000
> > > > The 'urls' folder includes the blog index pages (each index page
> > > > includes a list of article pages).
> > > > I think the plugins 'parse-html' and 'parse-tika' are currently
> > > > responsible for parsing the links from the HTML.
> > > > Should I clean the outlinks in an additional Parse plugin in order
> > > > to prevent Nutch from crawling the outlinks on the article pages?
> > > >
> > > > At 2013-01-15 13:31:11, "Lewis John Mcgibbney" <[email protected]> wrote:
> > > > > I take it you are updating the database with the crawl data? This
> > > > > will mark all links extracted during the parse phase (depending on
> > > > > your config) as due for fetching. When you generate, these links
> > > > > will be populated within the batchIds and Nutch will attempt to
> > > > > fetch them.
> > > > > Please also search our list archives for the definition of the
> > > > > depth parameter.
> > > > > Lewis
> > > > >
> > > > > On Monday, January 14, 2013, 高睿 <[email protected]> wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I'm customizing Nutch 2.1 for crawling blogs from several
> > > > > > authors. Each author's blog has a list page and article pages.
> > > > > >
> > > > > > Say I want to crawl the articles in 50 article lists (each with
> > > > > > 30 articles). I add the article list links to feed.txt and
> > > > > > specify '-depth 2' and '-topN 2000'. My expectation is that each
> > > > > > time I run Nutch, it will crawl all the list pages and the
> > > > > > articles in each list. But actually the set of URLs Nutch crawls
> > > > > > keeps growing, and each run takes more and more time (3 hours ->
> > > > > > more than 24 hours).
> > > > > >
> > > > > > Could someone explain to me what happens? Does Nutch 2.1 always
> > > > > > start crawling from the seed folder and follow the 'depth'
> > > > > > parameter? What should I do to meet my requirement?
> > > > > > Thanks.
> > > > > >
> > > > > > Regards,
> > > > > > Rui
> > > > > >
> > > > > --
> > > > > *Lewis*

--
*Lewis*
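P.S. Regarding Rui's idea further down the thread of emptying the 'outlinks' of the article pages: a very rough sketch of such a plugin against the Nutch 2.x ParseFilter extension point might look like the following. The class name and the "/article/" URL test are made-up examples, and the exact interface should be double-checked against your 2.x checkout:

    import java.util.Collection;
    import java.util.Collections;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.Outlink;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.parse.ParseFilter;
    import org.apache.nutch.storage.WebPage;
    import org.w3c.dom.DocumentFragment;

    /**
     * Rough sketch of a ParseFilter that drops all outlinks from pages
     * whose URL looks like an article page, so the crawl stops expanding
     * there. The "/article/" pattern and the class itself are hypothetical.
     */
    public class ArticleOutlinkStripper implements ParseFilter {

      private Configuration conf;

      @Override
      public Parse filter(String url, WebPage page, Parse parse,
          HTMLMetaTags metaTags, DocumentFragment doc) {
        if (parse != null && url.contains("/article/")) {
          // With no outlinks left, the db update step has nothing new
          // to schedule from this page.
          parse.setOutlinks(new Outlink[0]);
        }
        return parse;
      }

      @Override
      public Collection<WebPage.Field> getFields() {
        return Collections.<WebPage.Field>emptySet(); // no extra WebPage fields needed
      }

      @Override
      public void setConf(Configuration conf) {
        this.conf = conf;
      }

      @Override
      public Configuration getConf() {
        return conf;
      }
    }

The plugin would still need the usual plugin.xml registering it at the ParseFilter extension point and an entry in plugin.includes before Nutch would pick it up.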

