It is hard to read your code inside the email (you can use a gist, etc.). You pop your links array all the time, check a page, and push the page's links back onto the array.
But this way the number of pages under investigation at any given time keeps growing, hence the high memory footprint. So you need to throttle this.
What I would do is use async.queue [1] with a task function that calls jsdom/cheerio and, after the links are pushed to the array, calls its callback.
You should also bind q.drain and q.saturated so you know when to stop sending new tasks to the queue and when to start again.
This way you keep a constantly low memory footprint with the same performance.
Hope I helped,
Dan Milon.
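The idea can be sketched without the async dependency: a fixed-size worker pool pulls urls from a queue, so only `concurrency` pages are in flight at once no matter how many links get discovered. The page graph and the worker's "scrape" below are stand-ins, not the real jsdom/cheerio code; async.queue gives you push/drain/saturated ready-made.

```javascript
// Hand-rolled sketch of what async.queue does: at most `concurrency`
// tasks run at a time; the rest wait in the tasks array.
function makeQueue(worker, concurrency) {
  var tasks = [];
  var running = 0;
  var q = {
    drain: null, // called when the queue empties, like async.queue's q.drain
    push: function (task) {
      tasks.push(task);
      process.nextTick(next);
    }
  };
  function next() {
    while (running < concurrency && tasks.length > 0) {
      running++;
      worker(tasks.shift(), function () {
        running--;
        if (tasks.length === 0 && running === 0 && q.drain) q.drain();
        else next();
      });
    }
  }
  return q;
}

// Fake site: each "page" lists the pages it links to (hypothetical data).
var pages = { a: ['b', 'c'], b: ['c'], c: [] };
var seen = {};
var q = makeQueue(function (url, done) {
  // Real code would fetch the url and parse it with jsdom/cheerio here.
  (pages[url] || []).forEach(function (link) {
    if (!seen[link]) { seen[link] = true; q.push(link); }
  });
  setImmediate(done); // simulate async I/O before releasing the worker slot
}, 2);
q.drain = function () { console.log('crawled', Object.keys(seen).join(',')); };
seen.a = true;
q.push('a'); // prints "crawled a,b,c"
```

With concurrency 2, discovering a thousand links only grows the (cheap) url queue, never the number of pages being parsed at once.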
[1] https://github.com/caolan/async/#queue
On 07/03/2012 01:42 PM, ec.developer wrote:
Thanks for cheerio =)) I have replaced jsdom with cheerio. Now, after 6000 pages are checked, only ~200MB of RSS memory is used. It continues growing, but not as fast as before.
On Tuesday, July 3, 2012 7:13:54 AM UTC+3, node-code wrote:
+1 for Cheerio.
On Tue, Jul 3, 2012 at 9:42 AM, rhasson <rhas...@gmail.com> wrote:
Have you looked at Cheerio (https://github.com/MatthewMueller/cheerio)? I've been using it over JSDom, and it's faster and more lightweight. If you're doing heavy scraping I would recommend checking it out.
On Monday, July 2, 2012 10:30:25 AM UTC-4, ec.developer wrote:
Charles, thanks for your suggestion. About the global links_grabbed: I am sure there could be a better solution, but in my case it is not significant. Just for testing, I stored 200 thousand long links in an array and then printed the memory used, and the amount was very small. So I didn't focus on this :)
On Monday, July 2, 2012 4:44:24 PM UTC+3, Charles Care wrote:
Hi,
I had a play with your code and found a couple of things. It's probably worth trying to avoid the global variable links_grabbed, as it just gets larger and larger as you crawl. I know you need it to avoid parsing the same site twice, but perhaps you could find a more lightweight data structure? I'd probably be tempted to keep this state in a Redis set (or something similar).
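A sketch of that check-and-add step, assuming Redis's SADD semantics (the reply is 1 only when the member was new, so membership test and insert happen in one step). A plain in-process Set stands in here so the sketch runs without a Redis server; `markIfNew` is a hypothetical helper, not part of any library.

```javascript
// Visited-url set: add and report whether the url was new, in one call,
// mirroring what client.sadd('visited', url) would reply from Redis.
var visited = new Set();

function markIfNew(url) {
  if (visited.has(url)) return false; // already crawled or queued
  visited.add(url);
  return true;
}

console.log(markIfNew('http://example.com/')); // true  (first sighting)
console.log(markIfNew('http://example.com/')); // false (duplicate)
```

The Redis version has the same shape but keeps the set out of the crawler's heap, so it survives restarts and can be shared between worker processes.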
Also, I'm not an expert on scraper, but I *seemed* to get a performance improvement when I modified the code that pushed new urls onto your links stack. I added a String() conversion, e.g.

...
links.push(String(link));
...

which meant I wasn't keeping the jquery link around in a stack.
Hope that helps,
Charles
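One way to picture why that can matter: if the value you push still references the parsed document (as a wrapper object would), the whole document stays reachable from your stack; String() extracts just the primitive text. The `makeLink` helper and its fake document below are hypothetical stand-ins for a jsdom result, used only to illustrate the retention pattern.

```javascript
// A link-like object that drags a large "document" along with it.
function makeLink(href) {
  var doc = { html: new Array(1e6).join('x') }; // stand-in for a parsed DOM
  return { doc: doc, toString: function () { return href; } };
}

var heavy = []; // retains every fake document
var light = []; // retains only the url strings
for (var i = 0; i < 5; i++) {
  var link = makeLink('http://example.com/' + i);
  heavy.push(link);         // the whole document stays alive via link.doc
  light.push(String(link)); // toString() yields a plain primitive string
}
console.log(typeof light[0]); // string
```

Once the loop's `link` variable goes out of scope, the documents behind `light` can be garbage-collected; the ones behind `heavy` cannot.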
On 2 July 2012 14:08, ec.developer <ec.develo...@gmail.com> wrote:
> Hi all,
> I've created a small app which searches for Not Found [404] errors on a specified website. I use the node-scraper module (https://github.com/mape/node-scraper/), which uses node's native request module and jsdom for parsing the HTML.
> My app recursively searches for links on each webpage, then calls the scraping routine for each link found. The problem is that after scanning 100 pages (and collecting over 200 links to be scanned), the RSS memory usage is >200MB, and it still increases on each iteration. So after scanning 300-400 pages, I get a memory allocation error.
> The code is provided below.
> Any hints?
>
> var scraper = require('scraper'),
>     util = require('util');
>
> var checkDomain = process.argv[2].replace("https://", "").replace("http://", ""),
>     links = [process.argv[2]],
>     links_grabbed = [];
>
> var link_check = links.pop();
> links_grabbed.push(link_check);
> scraper(link_check, parseData);
>
> function parseData(err, jQuery, url)
> {
>     var ramUsage = bytesToSize(process.memoryUsage().rss);
>     process.stdout.write("\rLinks checked: " + (Object.keys(links_grabbed).length) + "/" + links.length + " [" + ramUsage + "] ");
>
>     if( err ) {
>         console.log("%s [%s], source - %s", err.uri, err.http_status, links_grabbed[err.uri].src);
>     }
>     else {
>         jQuery('a').each(function() {
>             var link = jQuery(this).attr("href").trim();
>
>             if( link.indexOf("/")==0 )
>                 link = "http://" + checkDomain + link;
>
>             if( links.indexOf(link)==-1 && links_grabbed.indexOf(link)==-1 && ["#", ""].indexOf(link)==-1 && (link.indexOf("http://" + checkDomain)==0 || link.indexOf("https://" + checkDomain)==0) )
>                 links.push(link);
>         });
>     }
>
>     if( links.length>0 ) {
>         var link_check = links.pop();
>         links_grabbed.push(link_check);
>         scraper(link_check, parseData);
>     }
>     else {
>         util.log("Scraping is done. Bye bye =)");
>         process.exit(0);
>     }
> }
>
> --
> Job Board: http://jobs.nodejs.org/
> Posting guidelines: https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
> You received this message because you are subscribed to the Google Groups "nodejs" group.
> To post to this group, send email to nodejs@googlegroups.com
> To unsubscribe from this group, send email to nodejs+unsubscr...@googlegroups.com
> For more options, visit this group at http://groups.google.com/group/nodejs?hl=en