Re: [nodejs] Web scraping and Memory leaking issue

rhasson Mon, 02 Jul 2012 21:12:42 -0700

Have you looked at Cheerio (https://github.com/MatthewMueller/cheerio)?
I've been using it instead of JSDom; it's faster and more lightweight. If
you're doing heavy scraping, I would recommend checking it out.


On Monday, July 2, 2012 10:30:25 AM UTC-4, ec.developer wrote:
>
> Charles, thanks for your suggestion. About the global links_grabbed - I am
> sure there could be a better solution, but in my case it is not so
> significant.
> Just for testing, I tried storing 200 thousand long links in an
> array and then printed the memory used; the amount was very
> small. So I didn't focus on this :)
>
> On Monday, July 2, 2012 4:44:24 PM UTC+3, Charles Care wrote:
>>
>> Hi, 
>>
>> I had a play with your code and found a couple of things. It's 
>> probably worth trying to avoid the global variable links_grabbed as 
>> it's just getting larger and larger as you crawl. I know you need it 
>> to avoid parsing the same site twice, but perhaps you could find a 
>> more lightweight data structure? I'd probably be tempted to keep this 
>> state in a Redis set (or something similar). 
>>
>> Also, I'm not an expert on scraper, but I *seemed* to get a 
>> performance improvement when I modified the code that pushed new urls 
>> onto your links stack. 
>>
>> I added a String() conversion: e.g. 
>>
>> ... 
>> links.push(String(link)); 
>> ... 
>>
>> which meant I wasn't keeping the jquery link around in a stack. Hope that 
>> helps, 
>>
>> Charles 
>>
>>
>>
>> On 2 July 2012 14:08, ec.developer <ec.develo...@gmail.com> wrote: 
>> > Hi all,
>> > I've created a small app which searches for Not Found [404] errors on a
>> > specified website. I use the node-scraper module
>> > (https://github.com/mape/node-scraper/), which uses node's native request
>> > module and jsdom for parsing the HTML.
>> > My app recursively searches for links on each webpage and then calls
>> > the scraper for each link found. The problem is that after scanning
>> > 100 pages (and collecting over 200 links to be scanned) the RSS memory usage
>> > is >200MB (and it still increases on each iteration). So after scanning over
>> > 300-400 pages, I get a memory allocation error.
>> > The code is provided below.
>> > Any hints?
>> > 
>> > var scraper = require('scraper'),
>> >     util = require('util');
>> >
>> > var checkDomain = process.argv[2].replace("https://", "").replace("http://", ""),
>> >     links = [process.argv[2]],
>> >     links_grabbed = [];
>> >
>> > var link_check = links.pop();
>> > links_grabbed.push(link_check);
>> > scraper(link_check, parseData);
>> >
>> > function parseData(err, jQuery, url)
>> > {
>> >     var ramUsage = bytesToSize(process.memoryUsage().rss);
>> >     process.stdout.write("\rLinks checked: " +
>> >         (Object.keys(links_grabbed).length) + "/" + links.length + " [" + ramUsage + "] ");
>> >
>> >     if( err ) {
>> >         console.log("%s [%s], source - %s", err.uri, err.http_status,
>> >             links_grabbed[err.uri].src);
>> >     }
>> >     else {
>> >         jQuery('a').each(function() {
>> >             var link = jQuery(this).attr("href").trim();
>> >
>> >             if( link.indexOf("/")==0 )
>> >                 link = "http://" + checkDomain + link;
>> >
>> >             if( links.indexOf(link)==-1 && links_grabbed.indexOf(link)==-1 && ["#",
>> >                 ""].indexOf(link)==-1 && (link.indexOf("http://" + checkDomain)==0 ||
>> >                 link.indexOf("https://"+checkDomain)==0) )
>> >                 links.push(link);
>> >         });
>> >     }
>> >
>> >     if( links.length>0 ) {
>> >         var link_check = links.pop();
>> >         links_grabbed.push(link_check);
>> >         scraper(link_check, parseData);
>> >     }
>> >     else {
>> >         util.log("Scraping is done. Bye bye =)");
>> >         process.exit(0);
>> >     }
>> > }
>> > 
>> > -- 
>> > Job Board: http://jobs.nodejs.org/ 
>> > Posting guidelines: 
>> > https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines 
>> > You received this message because you are subscribed to the Google 
>> > Groups "nodejs" group. 
>> > To post to this group, send email to nodejs@googlegroups.com 
>> > To unsubscribe from this group, send email to 
>> > nodejs+unsubscr...@googlegroups.com 
>> > For more options, visit this group at 
>> > http://groups.google.com/group/nodejs?hl=en?hl=en 
>>
>
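
A minimal sketch of the two ideas Charles raises in the quoted thread above: a
lighter-weight structure for the visited links, and String() conversion before
queueing. It assumes a Node version with the built-in Set. The createFrontier
helper and its method names are made up for illustration; they are not part of
node-scraper or of the original app, and this does not claim to fix the leak.

```javascript
// Sketch only: crawl state kept in a Set plus a plain stack of strings.
// createFrontier, add, next, seenCount and pendingCount are hypothetical names.
function createFrontier(startUrl) {
  var seen = new Set();   // every URL ever queued, for O(1) duplicate checks
  var pending = [];       // URLs still waiting to be fetched

  function add(link) {
    // String() guarantees the queue holds a primitive string rather than
    // whatever object the parser handed back (Charles's suggestion).
    var url = String(link).trim();
    if (url === '' || url === '#' || seen.has(url)) {
      return false;       // empty, fragment-only, or already queued
    }
    seen.add(url);
    pending.push(url);
    return true;
  }

  function next() {
    return pending.pop(); // undefined once nothing is left to crawl
  }

  add(startUrl);
  return {
    add: add,
    next: next,
    seenCount: function () { return seen.size; },
    pendingCount: function () { return pending.length; }
  };
}
```

In the original parseData callback, the combined links.indexOf(...) and
links_grabbed.indexOf(...) checks could then become a single frontier.add(link)
call, and links.pop() could become frontier.next(). The Set replaces two linear
array scans per link; whether that changes the RSS growth would still need to be
measured.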

