Interesting project! It's nice to see the bits on Scrapy in your paper - thanks! We're delighted it was so useful for the BlogForever crawler. It's great to see your crawler released as open source too.
I thought Scrapely could have been a nice comparison: your approach takes better advantage of the fact that you have many examples (from the feeds), whereas Scrapely is designed to work with very little example data, so I expect your approach would compare favorably.

I see you favor using id and class attributes - something we are considering for Scrapely too, as it currently relies exclusively on HTML structure.

Do you plan to release the testing / evaluation part?

Should we put a link to BlogForever on the companies page <http://scrapy.org/companies/>?

Good luck with the conference submission!

Shane

On 1 February 2014 16:52, <[email protected]> wrote:
> Hello everyone,
>
> I'm very happy to announce the release of the BlogForever crawler! Our work
> is entirely based on Scrapy, and we wanted to thank you for the amazing
> work you did on this framework, without which we could not have
> accomplished a fraction of what we did during the last 6 months.
>
> The crawler targets blogs, and is able to automatically extract blog post
> articles, titles, authors and comments. It's open source and comes with
> tests and documentation: <https://github.com/BlogForever/crawler>. We've
> also written and submitted a paper for the WIMS14 conference, where we
> present our algorithm for content extraction and a high-level overview of
> the crawler architecture, available at
> <https://github.com/OlivierBlanvillain/blogforever-crawler-publication/blob/master/tex/main.pdf>.
>
> I believe the following parts of our project might be useful for other
> applications:
>
> - The content extractor interface is similar to that of Scrapely
>   <https://github.com/scrapy/scrapely> (sadly, we discovered Scrapely too
>   late to include it in our evaluation). It's very fast and robust: we
>   reached a 93% success rate on blog article extraction over 2300 blog
>   posts.
>
> - To crawl blogs mixed with other resources (such as a wiki or a forum),
>   we use a simple machine-learning-based priority predictor to favor
>   crawling URLs that link to blog posts. This lets us get the most out of
>   a limited number of page downloads, which might otherwise get stuck in
>   irrelevant portions of the blog.
>
> - We use PhantomJS to do JavaScript rendering, take screenshots and fake
>   some user interaction to deal with Disqus comments. We keep a pool of
>   reusable browsers, which allows us to take full advantage of all
>   processors (with JavaScript rendering on, the crawling bottleneck is
>   the CPU).
>
> If you take the time to read the paper (8 pages) or the code, don't
> hesitate to send comments or feedback!
>
> Regards,
> Olivier Blanvillain
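
For readers who haven't used Scrapely: it learns an extraction template from a single annotated page, which is why it works with so little example data, as Shane notes above. A minimal sketch of its train/scrape interface, with placeholder URLs and field values:

    from scrapely import Scraper

    scraper = Scraper()

    # Train on one example page by naming the values that appear on it;
    # Scrapely infers a template from the surrounding HTML structure.
    scraper.train('http://example.com/blog/some-post',  # placeholder URL
                  {'title': 'Some post title', 'author': 'Jane Doe'})

    # Extract the same fields from a similar page on the same site.
    print(scraper.scrape('http://example.com/blog/another-post'))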
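The priority predictor described in the second bullet of the quoted mail maps naturally onto Scrapy's per-request priorities: requests with higher priority are scheduled first. A rough sketch of the idea, where predict_blog_post_score is a hypothetical stand-in for the learned model from the paper:

    import scrapy

    def predict_blog_post_score(url):
        # Hypothetical stand-in for a trained model: rate how likely this
        # URL is to lead to blog posts rather than wiki or forum pages.
        return 1.0 if '/blog/' in url else 0.0

    class BlogSpider(scrapy.Spider):
        name = 'blog'
        start_urls = ['http://example.com/']  # placeholder

        def parse(self, response):
            for href in response.css('a::attr(href)').extract():
                url = response.urljoin(href)
                # Scrapy downloads higher-priority requests first, so a
                # limited page budget is spent on likely blog posts.
                yield scrapy.Request(url, callback=self.parse,
                                     priority=int(10 * predict_blog_post_score(url)))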
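The browser pool in the third bullet is the usual pattern of keeping a few long-lived headless browsers behind a thread-safe queue, so rendering can keep every core busy without paying browser start-up costs per page. A rough sketch using Selenium's PhantomJS driver (since deprecated in later Selenium releases); the pool size and the render helper are assumptions, not the crawler's actual code:

    import queue
    from selenium import webdriver

    POOL_SIZE = 4  # assumption: roughly one browser per CPU core

    # Start the browsers once and reuse them; launching PhantomJS for
    # every page would dominate the crawl time.
    pool = queue.Queue()
    for _ in range(POOL_SIZE):
        pool.put(webdriver.PhantomJS())

    def render(url):
        browser = pool.get()  # blocks until a browser is free
        try:
            browser.get(url)
            browser.save_screenshot('screenshot.png')
            return browser.page_source  # HTML after JavaScript ran
        finally:
            pool.put(browser)  # return the browser to the pool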
