It's hard to read your code inside the email (you could use a gist, etc.). You pop links off your links array, check each page, and push that page's links back onto the array. But this way the number of pages under investigation at any given time grows exponentially, hence the high memory footprint. So you need to throttle this.

What I would do is use async.queue [1] with a task function that calls jsdom/cheerio and, once the links have been pushed to the array, calls its callback. You should also bind q.drain and q.saturated so you know when to stop sending new tasks to the queue and when to start again.

This way you get a consistently low memory footprint with the same performance.
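
Roughly something like this (an untested sketch; the concurrency of 5 and the in-memory seen object are just placeholders, and I'm assuming the (err, jQuery, url) callback signature from your code):

    var async = require('async'),
        scraper = require('scraper');

    var seen = {};

    // worker: scrape one url, then queue any new links it finds
    var q = async.queue(function (url, callback) {
        scraper(url, function (err, jQuery) {
            if (!err) {
                jQuery('a').each(function () {
                    var href = jQuery(this).attr('href');
                    if (href && !seen[href]) {
                        seen[href] = true;
                        q.push(String(href)); // String() so no jsdom objects are kept alive
                    }
                });
            }
            callback();
        });
    }, 5); // at most 5 pages in flight at any time

    q.drain = function () {
        console.log('Scraping is done.');
    };

    seen[process.argv[2]] = true;
    q.push(process.argv[2]); // seed the crawl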

Hope I helped,
Dan Milon.

[1] https://github.com/caolan/async/#queue

On 07/03/2012 01:42 PM, ec.developer wrote:
Thanks for cheerio =)) I've replaced jsdom with cheerio. Now, after 6000 pages are checked, only ~200MB of RSS memory is used. It continues growing, but not as fast as before.

On Tuesday, July 3, 2012 7:13:54 AM UTC+3, node-code wrote:

    +1 for Cheerio.

    On Tue, Jul 3, 2012 at 9:42 AM, rhasson <rhas...@gmail.com> wrote:

        Have you looked at Cheerio
        (https://github.com/MatthewMueller/cheerio)? I've been using it
        over JSDom and it's faster and lightweight. If you're doing
        heavy scraping I would recommend checking it out.
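
        Basic usage is tiny; roughly this (a quick sketch, with node's
        request module fetching the page and example.com standing in
        for a real URL):

            var request = require('request'),
                cheerio = require('cheerio');

            request('http://example.com/', function (err, res, body) {
                if (err) throw err;
                var $ = cheerio.load(body); // parse the HTML, no DOM emulation
                $('a').each(function () {
                    console.log($(this).attr('href'));
                });
            });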


        On Monday, July 2, 2012 10:30:25 AM UTC-4, ec.developer wrote:

            Charles, thanks for your suggestion. About the global
            links_grabbed - I'm sure there could be a better solution,
            but in my case it's not that significant. Just for testing,
            I tried storing 200 thousand large links in an array and
            then output the memory used, and the amount was very small.
            So I didn't focus on this :)
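
            For reference, the test was roughly this (the URL below is
            made up):

                var links = [];
                for (var i = 0; i < 200000; i++) {
                    links.push('http://example.com/some/fairly/long/path?page=' + i);
                }
                // print RSS in megabytes after filling the array
                console.log(process.memoryUsage().rss / 1024 / 1024 + ' MB');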

            On Monday, July 2, 2012 4:44:24 PM UTC+3, Charles Care wrote:

                Hi,

                I had a play with your code and found a couple of
                things. It's
                probably worth trying to avoid the global variable
                links_grabbed as
                it's just getting larger and larger as you crawl. I
                know you need it
                to avoid parsing the same site twice, but perhaps you
                could find a
                more lightweight data structure? I'd probably be
                tempted to keep this
                state in a Redis set (or something similar).
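
                Something along these lines, assuming the node_redis
                client (the set key 'links_grabbed' is just an example
                name):

                    var redis = require('redis'),
                        client = redis.createClient();

                    // SADD returns 1 only the first time a member is
                    // added, so a single round trip both records the
                    // url and says whether it was seen before
                    function markIfNew(url, callback) {
                        client.sadd('links_grabbed', url, function (err, added) {
                            if (err) return callback(err);
                            callback(null, added === 1); // true -> new, go scrape it
                        });
                    }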

                Also, I'm not an expert on scraper, but I *seemed* to
                get a
                performance improvement when I modified the code that
                pushed new urls
                onto your links stack.

                I added a String() conversion: e.g.

                ...
                links.push(String(link));
                ...

                which meant I wasn't keeping the jQuery-wrapped link
                object (and the DOM tree it references) alive in the
                stack. Hope that helps,

                Charles



                On 2 July 2012 14:08, ec.developer
                <ec.develo...@gmail.com> wrote:
                > Hi all,
                > I've created a small app which searches for Not Found [404]
                > errors on a specified website. I use the node-scraper module
                > (https://github.com/mape/node-scraper/), which uses node's
                > native request module and jsdom for parsing the HTML.
                > My app recursively searches for links on each webpage, and
                > then calls the scraping code for each found link. The
                > problem is that after scanning 100 pages (and collecting
                > over 200 links to be scanned), the RSS memory usage exceeds
                > 200MB and still increases on each iteration. So after
                > scanning 300-400 pages, I get a memory allocation error.
                > The code is provided below.
                > Any hints?
                >
                > var scraper = require('scraper'),
                >     util = require('util');
                >
                > // strip the scheme to get the domain the crawl is restricted to
                > var checkDomain = process.argv[2].replace("https://", "").replace("http://", ""),
                >     links = [process.argv[2]],
                >     links_grabbed = [];
                >
                > var link_check = links.pop();
                > links_grabbed.push(link_check);
                > scraper(link_check, parseData);
                >
                > function parseData(err, jQuery, url)
                > {
                >     // bytesToSize: formatting helper, not shown here
                >     var ramUsage = bytesToSize(process.memoryUsage().rss);
                >     process.stdout.write("\rLinks checked: " + links_grabbed.length +
                >         "/" + links.length + " [" + ramUsage + "] ");
                >
                >     if (err) {
                >         console.log("%s [%s]", err.uri, err.http_status);
                >     }
                >     else {
                >         jQuery('a').each(function() {
                >             var link = jQuery(this).attr("href").trim();
                >
                >             // make root-relative links absolute
                >             if (link.indexOf("/") == 0)
                >                 link = "http://" + checkDomain + link;
                >
                >             // queue only unseen, non-empty, same-domain links
                >             if (links.indexOf(link) == -1 &&
                >                 links_grabbed.indexOf(link) == -1 &&
                >                 ["#", ""].indexOf(link) == -1 &&
                >                 (link.indexOf("http://" + checkDomain) == 0 ||
                >                  link.indexOf("https://" + checkDomain) == 0))
                >                 links.push(link);
                >         });
                >     }
                >
                >     if (links.length > 0) {
                >         var link_check = links.pop();
                >         links_grabbed.push(link_check);
                >         scraper(link_check, parseData);
                >     }
                >     else {
                >         util.log("Scraping is done. Bye bye =)");
                >         process.exit(0);
                >     }
                > }
                >