In my case PhantomJS had issues like:

- Sometimes a site uses broken HTML, and jQuery gives a different result in PhantomJS than in Chrome.
- There were cases when I couldn't trigger a 'click' event in PhantomJS, because the site registered its onclick function in some strange way.
- PhantomJS is incompatible with Node.js. There are some non-standard bindings; I tried three of them, but for me all of them were very unstable, so I gave up in the end.
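To make the kind of flow involved concrete: a typical crawling step is "click, wait until an element appears, then branch on whether it did". Below is a minimal sketch of that pattern in plain Node.js. The `page` object, its `click` and `exists` methods, the selectors, and the `waitFor` and `makeFakePage` helpers are all hypothetical names made up for illustration; they are not the API of Selenium, PhantomJS, or any binding. A real browser driver would sit behind `page`.

```javascript
// Poll `check` until it returns a truthy value or `timeoutMs` elapses.
// Resolves to true if the condition was met, false on timeout.
async function waitFor(check, timeoutMs = 2000, intervalMs = 50) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await check()) return true;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false;
}

// One conditional step: click, wait for a result element, branch on the outcome.
// `page` is a hypothetical wrapper around whatever browser driver is in use.
async function openDetails(page) {
  await page.click('#show-details');
  const appeared = await waitFor(() => page.exists('#details'), 500, 10);
  if (appeared) {
    await page.click('#details .next');
    return 'details';
  }
  await page.click('#fallback');
  return 'fallback';
}

// In-memory stand-in for a browser, used only to exercise the flow.
// Pass `detailsAppearAfterMs: null` to simulate an element that never shows up.
function makeFakePage({ detailsAppearAfterMs }) {
  const clicks = [];
  let clickedAt = null;
  return {
    clicks,
    async click(selector) {
      clicks.push(selector);
      if (selector === '#show-details') clickedAt = Date.now();
    },
    async exists(selector) {
      if (selector !== '#details' || clickedAt === null) return false;
      if (detailsAppearAfterMs === null) return false;
      return Date.now() - clickedAt >= detailsAppearAfterMs;
    },
  };
}
```

With the fake page, `openDetails(makeFakePage({ detailsAppearAfterMs: 20 }))` takes the "details" branch, and passing `null` takes the fallback branch after the timeout. Each extra condition in a real flow is one more `waitFor` plus an `if`.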
import.io is an interesting idea and seems like a useful service. Sadly, our case was a little more complicated: there are lots of complex interactions (click here, wait until something appears there; if it appears, go here next; if it doesn't, go there; and so on). I doubt you can program such behavior using a GUI or some sort of DSL.

Also, we run the crawler heavily and it consumes a huge amount of resources (99% of them go to Selenium and the browser emulators), so it is costly even if you pay only for the physical servers. If, on the other hand, you use a service provided by another company, you pay twice, for the servers and for their service, and it would cost us even more. In our case it was cheaper to spend one month developing such a service ourselves.

On Saturday, 26 April 2014 17:29:56 UTC+4, Duy Nguyen wrote:
>
> I did a scraper with phantomjs before, it works great but I think you
> should take a look at https://import.io/
>
> On Sat, Apr 26, 2014 at 7:42 AM, Alexey Petrushin
> <alexey.p...@gmail.com> wrote:
>
>> I finished such project recently - Crawler for JavaScript Sites, with
>> Browser Emulator (Selenium).
>>
>> It's a private project, but I wrote some details about it and how it
>> works, maybe it will be interested for someone.
>>
>> http://alex-craft.com/blog/2014/crawling-javascript-sites
>>
>> On Thursday, 16 January 2014 06:09:48 UTC+4, Victor Hooi wrote:
>>
>>> Hi,
>>>
>>> I'm wondering if anybody knows of any web-scraping frameworks in Node.JS?
>>>
>>> Previously, there was node.io (https://github.com/chriso/node.io),
>>> however, the project was recently discontinued.
>>>
>>> Googling for Node.JS and web scraping, most of the guides online just
>>> talk about using requests and cheerio - it works, but you need to handle a
>>> whole bunch of things yourself (throttling, distributing jobs,
>>> configuration, managing jobs etc.).
>>>
>>> On the Python side, I know of Scrapy (https://github.com/scrapy/scrapy),
>>> which is using Twisted for asynchronicity
>>>
>>> On the Ruby side, Nokogiri (http://nokogiri.org/) is meant to be good,
>>> although I haven't dived into it much.
>>>
>>> Is there anything equivalent in the Node world? Or what approaches are
>>> people using to tackle this problem?
>>>
>>> Cheers,
>>> Victor
>>>
>
> --
> Nguyen Hai Duy

--
--
Job Board: http://jobs.nodejs.org/
Posting guidelines: https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
You received this message because you are subscribed to the Google
Groups "nodejs" group.
To post to this group, send email to nodejs@googlegroups.com
To unsubscribe from this group, send email to
nodejs+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/nodejs?hl=en?hl=en

---
You received this message because you are subscribed to the Google Groups "nodejs" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nodejs+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.