Hi Alvaro,

Would you please open a Jira ticket and submit a patch for this? I would be very willing to help integrate it into the development branch. Thank you for dropping in on this one.

Lewis
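For anyone following along, here is a rough, standalone illustration of the fallback Alvaro describes in the quoted message below. The constant values are placeholders standing in for Nutch.ARG_BATCH, GeneratorJob.BATCH_ID and Nutch.ALL_BATCH_ID_STR, so treat this as a sketch rather than the actual patch:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;

    /**
     * Sketch of the batch-id fallback: prefer the id passed in the job
     * arguments, then the id the generator stored in the configuration,
     * and only fetch "all" if both are missing. The key and constant
     * values below are placeholders, not the real Nutch constants.
     */
    public class BatchIdFallbackSketch {
      static final String ARG_BATCH = "batch";                // stands in for Nutch.ARG_BATCH
      static final String BATCH_ID_KEY = "generate.batch.id"; // stands in for GeneratorJob.BATCH_ID
      static final String ALL_BATCH_ID = "-all";              // stands in for Nutch.ALL_BATCH_ID_STR

      static String resolveBatchId(Map<String, Object> args, Configuration conf) {
        String batchId = (String) args.get(ARG_BATCH);
        if (batchId == null) {
          // Fall back to what the generator wrote into the configuration,
          // or fetch everything if that is missing too.
          batchId = conf.get(BATCH_ID_KEY, ALL_BATCH_ID);
        }
        return batchId;
      }

      public static void main(String[] argv) {
        Configuration conf = new Configuration();
        conf.set(BATCH_ID_KEY, "1364888350-1234567890");
        // Prints the generator's id because the args map does not carry one.
        System.out.println(resolveBatchId(new HashMap<String, Object>(), conf));
      }
    }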
On Tue, Apr 2, 2013 at 1:19 PM, Alvaro Cabrerizo <[email protected]> wrote:
> Hi:
>
> What I've detected using Nutch 2 (bin/nutch crawl ...) is that on every
> cycle (generate, fetch, update) the system fetches all the URLs stored in
> the database (Accumulo, MySQL or whatever you use). For example, if my
> seeds.txt contains http://localhost, which points to a local Apache with a
> welcome page (no outlinks), and I run:
>
> ./bin/nutch crawl conf/urls -depth 10
>
> then on every step Nutch will fetch http://localhost, i.e. ten times. When
> wild crawling, every step will fetch all the links fetched before plus all
> the new links it should fetch during the current round. Looking at the
> fetcher code (FetcherJob.java), the "run" method gets the batch id from its
> arguments, but the generator stores this value in the configuration. As the
> fetcher can't get the generator's id, it fetches "all". A simple solution
> (it works for me :)) could be to check whether the batchId argument is
> null, and if so, get the value from the configuration:
>
> public Map<String,Object> run(Map<String,Object> args) throws Exception {
>   ...
>   String batchId = (String) args.get(Nutch.ARG_BATCH);
>   ...
>   if (batchId == null) { // the argument value is null
>     // get the value stored in the configuration, or fetch all if that is null
>     batchId = getConf().get(GeneratorJob.BATCH_ID, Nutch.ALL_BATCH_ID_STR);
>   }
>
> Regards.
>
> On Thu, Jan 17, 2013 at 11:47 AM, 高睿 <[email protected]> wrote:
>
> > Yes, that is my case.
> > Removing all previous data is an option, but the data will be lost.
> > I want to write a plugin to empty 'outlinks' for the article pages, so
> > the crawl is terminated at the article URLs and no additional links are
> > stored in the DB.
> >
> > At 2013-01-16 04:24:11, "Sebastian Nagel" <[email protected]> wrote:
> > > Hi,
> > > did I understand you correctly?
> > > - feed.txt is placed in the seed url folder and
> > > - contains the URLs of the 50 article lists
> > > If yes:
> > >   -depth 2
> > > will crawl these 50 URLs and, for each article list, all of its 30
> > > outlinks, in total 50 + 50*30 = 1550 documents.
> > >
> > > If you continue crawling, Nutch fetches the outlinks of the 1500 docs
> > > fetched in the second cycle, then the links found in those, and so on:
> > > it will continue to crawl the whole web. To limit the crawl to exactly
> > > the 1550 docs, either remove all previously crawled data to start again
> > > from scratch, or have a look at the plugin "scoring-depth" (it's new
> > > and, unfortunately, not yet adapted to 2.x, see
> > > https://issues.apache.org/jira/browse/NUTCH-1331 and
> > > https://issues.apache.org/jira/browse/NUTCH-1508).
> > >
> > > The option name -depth does not mean a "limitation to a certain linkage
> > > depth" (that's its meaning in "scoring-depth") but the number of crawl
> > > cycles or rounds. If a crawl is started from scratch the results are
> > > identical in most cases.
> > >
> > > Sebastian
> > >
> > > On 01/15/2013 06:53 PM, 高睿 wrote:
> > > > I'm not quite sure about your question here. I'm using the Nutch 2.1
> > > > default configuration and run the command:
> > > > bin/nutch crawl urls -solr http://localhost:8080/solr/collection2 -threads 10 -depth 2 -topN 1000
> > > > The 'urls' folder includes the blog index pages (each index page
> > > > includes a list of article pages).
> > > > I think the plugins 'parse-html' and 'parse-tika' are currently
> > > > responsible for parsing the links from the HTML.
> > > > Should I clean the outlinks in an additional Parse plugin in order
> > > > to prevent Nutch from crawling the outlinks on the article pages?
> > > >
> > > > At 2013-01-15 13:31:11, "Lewis John Mcgibbney" <[email protected]> wrote:
> > > > > I take it you are updating the database with the crawl data? This
> > > > > will mark all links extracted during the parse phase (depending on
> > > > > your config) as due for fetching. When you generate, these links
> > > > > will be populated within the batchIds and Nutch will attempt to
> > > > > fetch them.
> > > > > Please also search our list archives for the definition of the
> > > > > depth parameter.
> > > > > Lewis
> > > > >
> > > > > On Monday, January 14, 2013, 高睿 <[email protected]> wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I'm customizing Nutch 2.1 for crawling blogs from several
> > > > > > authors. Each author's blog has a list page and article pages.
> > > > > >
> > > > > > Say I want to crawl the articles in 50 article lists (each with
> > > > > > 30 articles). I add the article list links to feed.txt and
> > > > > > specify '-depth 2' and '-topN 2000'. My expectation is that each
> > > > > > time I run Nutch, it will crawl all the list pages and the
> > > > > > articles in each list. But actually the set of URLs Nutch crawls
> > > > > > keeps growing, and each run takes more and more time (3 hours ->
> > > > > > more than 24 hours).
> > > > > >
> > > > > > Could someone explain to me what happens? Does Nutch 2.1 always
> > > > > > start crawling from the seed folder and follow the 'depth'
> > > > > > parameter? What should I do to meet my requirement?
> > > > > > Thanks.
> > > > > >
> > > > > > Regards,
> > > > > > Rui
> > > > > >
> > > > > --
> > > > > *Lewis*

--
*Lewis*
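P.S. Regarding Rui's idea further down the thread of emptying the 'outlinks' of the article pages: a very rough sketch of such a plugin against the Nutch 2.x ParseFilter extension point might look like the following. The class name and the "/article/" URL test are made-up examples, and the exact interface should be double-checked against your 2.x checkout:

    import java.util.Collection;
    import java.util.Collections;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.Outlink;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.parse.ParseFilter;
    import org.apache.nutch.storage.WebPage;
    import org.w3c.dom.DocumentFragment;

    /**
     * Rough sketch of a ParseFilter that drops all outlinks from pages
     * whose URL looks like an article page, so the crawl stops expanding
     * there. The "/article/" pattern and the class itself are hypothetical.
     */
    public class ArticleOutlinkStripper implements ParseFilter {

      private Configuration conf;

      @Override
      public Parse filter(String url, WebPage page, Parse parse,
          HTMLMetaTags metaTags, DocumentFragment doc) {
        if (parse != null && url.contains("/article/")) {
          // With no outlinks left, the db update step has nothing new
          // to schedule from this page.
          parse.setOutlinks(new Outlink[0]);
        }
        return parse;
      }

      @Override
      public Collection<WebPage.Field> getFields() {
        return Collections.<WebPage.Field>emptySet(); // no extra WebPage fields needed
      }

      @Override
      public void setConf(Configuration conf) {
        this.conf = conf;
      }

      @Override
      public Configuration getConf() {
        return conf;
      }
    }

The plugin would still need the usual plugin.xml registering it at the ParseFilter extension point and an entry in plugin.includes before Nutch would pick it up.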

