Interesting project! It's nice to see the bits on Scrapy in your paper - thanks! We're delighted it was so useful for the BlogForever crawler. It's great to see your crawler released as open source too.
I thought Scrapely could have been a nice comparison: your approach takes better advantage of the fact that you have many examples (from the feeds), whereas Scrapely is designed to work with very little example data, so I expect your approach would compare favorably.

I see you favor using id and class attributes - something we are considering for Scrapely too, as it currently relies exclusively on HTML structure.

Do you plan to release the testing / evaluation part?

Should we put a link to BlogForever on the companies page <http://scrapy.org/companies/>?

Good luck with the conference submission!

Shane

On 1 February 2014 16:52, <[email protected]> wrote:
> Hello everyone,
>
> I'm very happy to announce the release of the BlogForever crawler! Our work
> is entirely based on Scrapy, and we wanted to thank you for the amazing
> work you did on this framework, without which we could not have
> accomplished a fraction of what we did during the last 6 months.
>
> The crawler targets blogs, and is able to automatically extract blog post
> articles, titles, authors and comments. It's open source and comes with
> tests and documentation: <https://github.com/BlogForever/crawler>. We've
> also written and submitted a paper for the WIMS14 conference, where we
> present our algorithm for content extraction and a high-level overview of
> the crawler architecture, available at
> <https://github.com/OlivierBlanvillain/blogforever-crawler-publication/blob/master/tex/main.pdf>.
>
> I believe the following parts of our project might be useful for other
> applications:
>
> - The content extractor interface is similar to that of Scrapely
>   <https://github.com/scrapy/scrapely> (sadly, we discovered Scrapely too
>   late to include it in our evaluation). It's very fast and robust: we
>   reached a 93% success rate on blog article extraction over 2300 blog
>   posts.
>
> - To crawl blogs mixed with other resources (such as a wiki or a forum),
>   we use a simple machine-learning-based priority predictor to favor
>   crawling URLs that link to blog posts. This lets us get the most out of
>   a limited number of page downloads, which might otherwise get stuck in
>   irrelevant portions of the blog.
>
> - We use PhantomJS to do JavaScript rendering, take screenshots and fake
>   some user interaction to deal with Disqus comments. We keep a pool of
>   reusable browsers, which allows us to take full advantage of all
>   processors (with JavaScript rendering on, the crawling bottleneck is
>   the CPU).
>
> If you take the time to read the paper (8 pages) or the code, don't
> hesitate to send comments or feedback!
>
> Regards,
> Olivier Blanvillain
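
For readers who haven't used Scrapely: it learns an extraction template from a single annotated page, which is why it works with so little example data, as Shane notes above. A minimal sketch of its train/scrape interface, with placeholder URLs and field values:

    from scrapely import Scraper

    scraper = Scraper()

    # Train on one example page by naming the values that appear on it;
    # Scrapely infers a template from the surrounding HTML structure.
    scraper.train('http://example.com/blog/some-post',  # placeholder URL
                  {'title': 'Some post title', 'author': 'Jane Doe'})

    # Extract the same fields from a similar page on the same site.
    print(scraper.scrape('http://example.com/blog/another-post'))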
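The priority predictor described in the second bullet of the quoted mail maps naturally onto Scrapy's per-request priorities: requests with higher priority are scheduled first. A rough sketch of the idea, where predict_blog_post_score is a hypothetical stand-in for the learned model from the paper:

    import scrapy

    def predict_blog_post_score(url):
        # Hypothetical stand-in for a trained model: rate how likely this
        # URL is to lead to blog posts rather than wiki or forum pages.
        return 1.0 if '/blog/' in url else 0.0

    class BlogSpider(scrapy.Spider):
        name = 'blog'
        start_urls = ['http://example.com/']  # placeholder

        def parse(self, response):
            for href in response.css('a::attr(href)').extract():
                url = response.urljoin(href)
                # Scrapy downloads higher-priority requests first, so a
                # limited page budget is spent on likely blog posts.
                yield scrapy.Request(url, callback=self.parse,
                                     priority=int(10 * predict_blog_post_score(url)))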
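The browser pool in the third bullet is the usual pattern of keeping a few long-lived headless browsers behind a thread-safe queue, so rendering can keep every core busy without paying browser start-up costs per page. A rough sketch using Selenium's PhantomJS driver (since deprecated in later Selenium releases); the pool size and the render helper are assumptions, not the crawler's actual code:

    import queue
    from selenium import webdriver

    POOL_SIZE = 4  # assumption: roughly one browser per CPU core

    # Start the browsers once and reuse them; launching PhantomJS for
    # every page would dominate the crawl time.
    pool = queue.Queue()
    for _ in range(POOL_SIZE):
        pool.put(webdriver.PhantomJS())

    def render(url):
        browser = pool.get()  # blocks until a browser is free
        try:
            browser.get(url)
            browser.save_screenshot('screenshot.png')
            return browser.page_source  # HTML after JavaScript ran
        finally:
            pool.put(browser)  # return the browser to the pool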
