Re: Announcing the BlogForever crawler

Olivier Blanvillain Fri, 03 Jul 2015 01:46:46 -0700

Hi, as far as I know nobody continued the development after me, so
what's on GitHub should be the latest version.


On Fri, Jul 3, 2015 at 10:16 AM, Atrijit Dasgupta
<[email protected]> wrote:
> We are evaluating the BlogForever crawler for a project, and from the
> following page we cannot get the source code:
>
> http://blogforever.eu/blog/2013/10/08/blogforever-platform-released/
>
> - the http://invenio-software.org/repo/blogforever/ resource link gives an
> error.
>
> However, a copy of the source code is available at
> https://github.com/BlogForever/crawler, uploaded there by Mr. Olivier
> Blanvillain.
>
> Is the GitHub source code the final version? We downloaded and ran it, and
> while it works perfectly with the example sites provided, when we try to
> crawl other blogs it seems to get into infinite loops and does not
> produce any output.
>
> And if we cannot get the source code from the blogforever.eu page, is there
> another location from which we can get it?
>
> Thanks
>
> Atrijit Dasgupta
>
> On Tuesday, 4 February 2014 21:28:43 UTC+5:30, Olivier Blanvillain wrote:
>>
>> > Do you plan to release the testing / evaluation part?
>>
>> The GitHub repository of the paper [1] contains scripts and instructions to
>> reproduce the results we present in our evaluation. I could not include the
>> dataset we used because it's not publicly available, but it should be
>> reasonably easy to request it or manually build a small one. (I've not yet
>> included a license because I'm not really sure how it works with the text of
>> the paper, but everything else will be MIT.)
>>
>>
>> > Should we put a link to BlogForever on the companies page?
>>
>> At the moment the BlogForever web site is not really up to date, and we still
>> have a bit of work (mostly the connection to our back-end) before putting the
>> crawler in production. The first instance will likely be hosted on CERN
>> servers to monitor high-energy physics related blogs. I suggest waiting for
>> this one to be up and running before adding a link; we will get back to you
>> when it is!
>>
>> Cheers,
>> Olivier
>>
>> [1]: https://github.com/OlivierBlanvillain/blogforever-crawler-publication
>>
>>
>> On Monday, February 3, 2014 11:51:56 PM UTC+1, shane wrote:
>>>
>>> Interesting project!
>>>
>>> It's nice to see the bits on Scrapy in your paper - thanks! We're
>>> delighted it was so useful for the BlogForever crawler. It's great to see
>>> your crawler released as open source too.
>>>
>>> I thought Scrapely could have been a nice comparison. Your approach
>>> takes better advantage of the fact that you have many examples (from the
>>> feeds), whereas Scrapely is designed for working with very little example
>>> data, so I expect your approach would compare favorably. I see you favor
>>> using id and class attributes, something we are considering for Scrapely
>>> too, as it currently relies exclusively on HTML structure. Do you plan to
>>> release the testing / evaluation part?
>>>
>>> Should we put a link to BlogForever on the companies page?
>>>
>>> Good luck with the conference submission!
>>>
>>> Shane
>>>
>>>
>>>
>>> On 1 February 2014 16:52, <[email protected]> wrote:
>>>>
>>>> Hello everyone,
>>>>
>>>> I'm very happy to announce the release of the BlogForever crawler! Our work
>>>> is entirely based on Scrapy, and we wanted to thank you for the amazing work
>>>> you did on this framework, without which we could not have accomplished a
>>>> fraction of what we did during the last 6 months.
>>>>
>>>> The crawler targets web blogs, and is able to automatically extract blog post
>>>> articles, titles, authors and comments. It's open source and comes with tests
>>>> and documentation: <https://github.com/BlogForever/crawler>. We've also
>>>> written and submitted a paper for the WIMS14 conference where we present our
>>>> algorithm for content extraction and a high-level overview of the crawler
>>>> architecture, available at
>>>>
>>>> <https://github.com/OlivierBlanvillain/blogforever-crawler-publication/blob/master/tex/main.pdf>.
>>>>
>>>> I believe that the following parts of our project might be useful for other
>>>> applications:
>>>>
>>>> - The content extractor interface is similar to that of Scrapely
>>>>   <https://github.com/scrapy/scrapely> (sadly, we discovered Scrapely too late
>>>>   to include it in our evaluation). It's very fast and robust: we got to a 93%
>>>>   success rate on blog article extraction over 2300 blog posts.
>>>>
>>>> - To crawl blogs mixed up with other resources (such as a wiki or a forum), we
>>>>   use a simple machine-learning based priority predictor to favor crawling
>>>>   URLs with links to blog posts. This makes it possible to get the best out of
>>>>   a limited number of page downloads, which might otherwise get stuck in
>>>>   irrelevant portions of the blog.
>>>>
>>>> - We use PhantomJS to do JavaScript rendering, take screenshots and fake some
>>>>   user interaction to deal with Disqus comments. We have a pool of reusable
>>>>   browsers, which lets us take full advantage of the processors (with
>>>>   JavaScript rendering on, the crawling bottleneck is CPU).
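
As a rough illustration of the priority idea in the second bullet above, here
is a minimal sketch of a URL frontier that spends a limited download budget on
the most promising links first. It is not the BlogForever implementation: the
hand-written `score` function is a hypothetical stand-in for the learned
predictor, and the names `Frontier`, `crawl_order`, `BLOG_HINTS` and
`OFF_TOPIC_HINTS` are assumptions made up for this example.

```python
import heapq
import itertools
import re

# Hypothetical hand-picked URL hints; the real predictor is learned from data.
BLOG_HINTS = re.compile(r"/(\d{4}/\d{2}|blog|post|archive)(/|$)")
OFF_TOPIC_HINTS = re.compile(r"/(wiki|forum|login)(/|$)")


def score(url):
    """Toy stand-in for a learned predictor: higher means more promising."""
    value = 0
    if BLOG_HINTS.search(url):
        value += 2
    if OFF_TOPIC_HINTS.search(url):
        value -= 2
    return value


class Frontier:
    """URL queue that always hands out the highest-scoring unseen URL."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker: keeps insertion order

    def push(self, url):
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the best first.
        heapq.heappush(self._heap, (-score(url), next(self._counter), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def crawl_order(urls, budget):
    """Return the URLs that would be fetched under a page-download budget."""
    frontier = Frontier()
    for url in urls:
        frontier.push(url)
    order = []
    while len(frontier) and len(order) < budget:
        order.append(frontier.pop())
    return order
```

With a budget of two pages, a dated post URL and a neutral page would be
fetched before wiki or forum URLs. In Scrapy itself, a comparable effect can be
obtained by passing such a score as the `priority` argument of
`scrapy.Request`, since requests with a higher priority are scheduled earlier.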
>>>>
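
The pool of reusable browsers from the last bullet above could look roughly
like the sketch below. This is only an assumption about the general pattern,
not the crawler's actual code: `FakeBrowser` is a hypothetical placeholder for
a PhantomJS process, and `BrowserPool` and `render_all` are names invented for
the example. The point is that browsers are started once and handed back after
each page instead of being launched per request.

```python
import queue
from contextlib import contextmanager


class FakeBrowser:
    """Hypothetical placeholder for an expensive-to-start browser process."""

    def __init__(self):
        self.pages_rendered = 0

    def render(self, url):
        self.pages_rendered += 1
        return "<html><!-- rendered %s --></html>" % url


class BrowserPool:
    """Fixed-size pool: browsers are created once and reused across pages."""

    def __init__(self, size, factory=FakeBrowser):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def browser(self):
        instance = self._idle.get()  # blocks until a browser is free
        try:
            yield instance
        finally:
            self._idle.put(instance)  # always hand the browser back


def render_all(pool, urls):
    """Render each URL with whichever pooled browser is available."""
    pages = []
    for url in urls:
        with pool.browser() as instance:
            pages.append(instance.render(url))
    return pages
```

Since rendering is CPU-bound, one plausible choice is to size the pool near
the number of available cores (for example via `os.cpu_count()`) and call
`pool.browser()` from several worker threads; `queue.Queue` is thread-safe, so
the same pool object can be shared between them.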
>>>> If you take the time to read the paper (8 pages) or the code, don't hesitate
>>>> to send comments or feedback!
>>>>
>>>> Regards,
>>>> Olivier Blanvillain
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "scrapy-users" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at http://groups.google.com/group/scrapy-users.
>>>> For more options, visit https://groups.google.com/groups/opt_out.
>>>
>>>
