Have you looked at Cheerio (https://github.com/MatthewMueller/cheerio)? I've been using it instead of JSDom, and it's faster and more lightweight. If you're doing heavy scraping, I would recommend checking it out.
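
Separately, on Charles's point further down about links_grabbed growing and about pushing String(link): here is a rough sketch of one way to do it, not something I've run against your app. It keeps every queued URL as a plain string in a Set, so the duplicate check doesn't scan an array with indexOf and no jQuery objects stay referenced. The names createFrontier, add, next, seenCount and pendingCount are made up for illustration (they are not part of node-scraper or Cheerio), and it assumes a Node version with a built-in Set; a plain object used as a map would do the same job.

```javascript
// Hypothetical helper: tracks which URLs have been queued and which are pending.
function createFrontier(startUrl) {
  var seen = new Set();  // every URL ever queued, stored as primitive strings
  var pending = [];      // URLs still waiting to be fetched

  function add(link) {
    // String() yields a primitive string, so no wrapper object is retained.
    var url = String(link).trim();
    if (url === '' || url === '#' || seen.has(url)) {
      return false;      // empty, fragment-only, or already queued
    }
    seen.add(url);
    pending.push(url);
    return true;
  }

  function next() {
    return pending.pop(); // undefined once nothing is left to crawl
  }

  add(startUrl);
  return {
    add: add,
    next: next,
    seenCount: function () { return seen.size; },
    pendingCount: function () { return pending.length; }
  };
}

// Usage: the second add is a duplicate and is ignored.
var frontier = createFrontier('http://example.com/');
frontier.add('http://example.com/about');
frontier.add('http://example.com/about');
```

In parseData you would call add() for each href and next() where the code currently does links.pop(). The Set still grows with the number of distinct URLs, so for very large crawls an external store such as the Redis set Charles mentions may still be the better fit.
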
On Monday, July 2, 2012 10:30:25 AM UTC-4, ec.developer wrote:
>
> Charles, thanks for your suggestion. About the global links_grabbed - I am
> sure there could be a better solution, but in my case it is not so
> significant.
> I tried, just for testing, storing 200 thousand large links in an
> array and then printed the memory used, and the amount was very
> small. So I didn't focus on this :)
>
> On Monday, July 2, 2012 4:44:24 PM UTC+3, Charles Care wrote:
>>
>> Hi,
>>
>> I had a play with your code and found a couple of things. It's
>> probably worth trying to avoid the global variable links_grabbed as
>> it's just getting larger and larger as you crawl. I know you need it
>> to avoid parsing the same site twice, but perhaps you could find a
>> more lightweight data structure? I'd probably be tempted to keep this
>> state in a Redis set (or something similar).
>>
>> Also, I'm not an expert on scraper, but I *seemed* to get a
>> performance improvement when I modified the code that pushed new urls
>> onto your links stack.
>>
>> I added a String() conversion: e.g.
>>
>> ...
>> links.push(String(link));
>> ...
>>
>> which meant I wasn't keeping the jquery link around in a stack. Hope that
>> helps,
>>
>> Charles
>>
>> On 2 July 2012 14:08, ec.developer <[email protected]> wrote:
>> > Hi all,
>> > I've created a small app, which searches for Not Found [404] exceptions
>> > on a specified website. I use the node-scraper module
>> > (https://github.com/mape/node-scraper/), which uses node's request
>> > module and jsdom for parsing the html.
>> > My app recursively searches for links on each webpage, and then calls
>> > the scraper for each found link. The problem is that after scanning
>> > 100 pages (and collecting over 200 links to be scanned) the RSS memory
>> > usage is >200MB (and it still increases on each iteration).
>> > So after scanning over 300-400 pages, I get a memory allocation error.
>> > The code is provided below.
>> > Any hints?
>> >
>> > var scraper = require('scraper'),
>> >     util = require('util');
>> >
>> > var checkDomain = process.argv[2].replace("https://", "").replace("http://", ""),
>> >     links = [process.argv[2]],
>> >     links_grabbed = [];
>> >
>> > var link_check = links.pop();
>> > links_grabbed.push(link_check);
>> > scraper(link_check, parseData);
>> >
>> > function parseData(err, jQuery, url)
>> > {
>> >     var ramUsage = bytesToSize(process.memoryUsage().rss);
>> >     process.stdout.write("\rLinks checked: " +
>> >         (Object.keys(links_grabbed).length) + "/" + links.length + " [" + ramUsage + "] ");
>> >
>> >     if( err ) {
>> >         console.log("%s [%s], source - %s", err.uri, err.http_status,
>> >             links_grabbed[err.uri].src);
>> >     }
>> >     else {
>> >         jQuery('a').each(function() {
>> >             var link = jQuery(this).attr("href").trim();
>> >
>> >             if( link.indexOf("/")==0 )
>> >                 link = "http://" + checkDomain + link;
>> >
>> >             if( links.indexOf(link)==-1 && links_grabbed.indexOf(link)==-1 && ["#", ""].indexOf(link)==-1 &&
>> >                 (link.indexOf("http://" + checkDomain)==0 || link.indexOf("https://"+checkDomain)==0) )
>> >                 links.push(link);
>> >         });
>> >     }
>> >
>> >     if( links.length>0 ) {
>> >         var link_check = links.pop();
>> >         links_grabbed.push(link_check);
>> >         scraper(link_check, parseData);
>> >     }
>> >     else {
>> >         util.log("Scraping is done. Bye bye =)");
>> >         process.exit(0);
>> >     }
>> > }
>> >
>> > --
>> > Job Board: http://jobs.nodejs.org/
>> > Posting guidelines:
>> > https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
>> > You received this message because you are subscribed to the Google
>> > Groups "nodejs" group.
>> > To post to this group, send email to [email protected]
>> > To unsubscribe from this group, send email to
>> > [email protected]
>> > For more options, visit this group at
>> > http://groups.google.com/group/nodejs?hl=en?hl=en
