We are evaluating the BlogForever crawler for a project, and we cannot get the source code from the following page: http://blogforever.eu/blog/2013/10/08/blogforever-platform-released/ - the http://invenio-software.org/repo/blogforever/ resource link gives an error. However, a copy of the source code is available at https://github.com/BlogForever/crawler, uploaded there by Mr. Olivier Blanvillain. Is the GitHub source code the final version? We downloaded and ran it, and while it works perfectly with the example sites provided, when we try to crawl other blogs it seems to get stuck in infinite loops and does not produce any output. And if we cannot get the source code from the blogforever.eu page, is there another location from which we can get it?

Thanks,
Atrijit Dasgupta

On Tuesday, 4 February 2014 21:28:43 UTC+5:30, Olivier Blanvillain wrote:

> > Do you plan to release the testing / evaluation part?

> The GitHub repository of the paper [1] contains scripts and instructions to reproduce the results we present in our evaluation. I could not include the dataset we used because it's not publicly available, but it should be reasonably easy to request it or manually build a small one. (I've not yet included a license because I'm not really sure how it works with the text of the paper, but everything else will be MIT.)

> > Should we put a link to BlogForever on the companies page?

> At the moment the BlogForever web site is not really up-to-date, and we still have a bit of work to do (mostly the connection to our back-end) before putting the crawler into production. The first instance will likely be hosted on CERN servers to monitor high-energy-physics-related blogs. I suggest waiting for this one to be up and running before adding a link; we will get back to you when it is!

> Cheers,
> Olivier

> [1]: https://github.com/OlivierBlanvillain/blogforever-crawler-publication

> On Monday, February 3, 2014 11:51:56 PM UTC+1, shane wrote:

>> Interesting project!

>> It's nice to see the bits on Scrapy in your paper - thanks! We're delighted it was so useful for the BlogForever crawler. It's great to see your crawler released as open source too.

>> I thought Scrapely could have been a nice comparison. Your approach takes better advantage of the fact that you have many examples (from the feeds), whereas Scrapely is designed to work with very little example data, so I expect your approach would compare favorably. I see you favor using id and class attributes - something we are considering for Scrapely too, as it currently relies exclusively on HTML structure. Do you plan to release the testing / evaluation part?

>> Should we put a link to BlogForever on the companies page <http://scrapy.org/companies/>?

>> Good luck with the conference submission!

>> Shane

>> On 1 February 2014 16:52, <[email protected]> wrote:

>>> Hello everyone,

>>> I'm very happy to announce the release of the BlogForever crawler! Our work is entirely based on Scrapy, and we wanted to thank you for the amazing work you did on this framework, without which we could not have accomplished a fraction of what we did during the last 6 months.

>>> The crawler targets web blogs, and is able to automatically extract blog post articles, titles, authors and comments. It's open source and comes with tests and documentation: <https://github.com/BlogForever/crawler>.
>>> We've also written and submitted a paper for the WIMS14 conference, where we present our algorithm for content extraction and a high-level overview of the crawler architecture, available at <https://github.com/OlivierBlanvillain/blogforever-crawler-publication/blob/master/tex/main.pdf>.

>>> I believe that the following parts of our project might be useful for other applications:

>>> - The content extractor interface is similar to the one of Scrapely <https://github.com/scrapy/scrapely> (sadly we discovered Scrapely too late to include it in our evaluation). It's very fast and robust: we got to a 93% success rate on blog article extraction over 2300 blog posts.

>>> - To crawl blogs mixed up with other resources (such as a wiki or a forum), we use a simple machine-learning-based priority predictor to favor crawling URLs with links to blog posts. This allows us to get the best out of a limited number of page downloads, which might otherwise get stuck in irrelevant portions of the blog.

>>> - We use PhantomJS to do JavaScript rendering, take screenshots and fake some user interaction to deal with Disqus comments. We have a pool of reusable browsers, which allows us to take full advantage of the processors (with JavaScript rendering on, the crawling bottleneck is the CPU).

>>> If you take the time to read the paper (8 pages) or the code, don't hesitate to send comments or feedback!

>>> Regards,
>>> Olivier Blanvillain
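Shane's Scrapely suggestion and Olivier's note that the BlogForever extractor has "an interface similar to the one of Scrapely" both refer to a train-on-examples-then-extract workflow. As a point of reference, here is a minimal sketch of that style of interface using Scrapely itself (not the BlogForever code); the URLs and field values are placeholders.

```python
# Sketch of a Scrapely-style "train on one example, extract from similar
# pages" interface. URLs and field values below are placeholders.
from scrapely import Scraper

scraper = Scraper()

# Train with one annotated example: map field names to the exact text
# that appears on the training page.
train_url = "http://example-blog.com/2014/01/first-post/"
scraper.train(train_url, {
    "title": "First post",
    "author": "Jane Doe",
})

# Extract the same fields from a structurally similar page.
print(scraper.scrape("http://example-blog.com/2014/02/second-post/"))
```

The difference Shane points out is where the examples come from: Scrapely is built for very few hand-annotated examples, while the BlogForever extractor learns from the many examples it gets automatically from the blog's feed.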
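The priority-predictor idea maps onto Scrapy's request priorities: requests that look more likely to lead to blog posts are scheduled first, so a limited download budget is spent on the interesting part of the site. The sketch below is a stand-in heuristic to illustrate the mechanism, not the machine-learning predictor from the BlogForever paper, and the spider name and start URL are placeholders.

```python
# Illustrative sketch: bias the crawl frontier with Scrapy request priorities.
# The scoring function is a crude heuristic, not the learned predictor
# described in the BlogForever paper.
import scrapy


def blog_post_score(url, link_text):
    """Rough proxy for 'how likely this link leads to blog posts'."""
    score = 0
    if any(token in url for token in ("/blog/", "/post/", "/archive")):
        score += 2
    if any(word in link_text.lower() for word in ("post", "article", "archive")):
        score += 1
    return score


class BlogSpider(scrapy.Spider):
    name = "blog_priority_sketch"
    start_urls = ["http://example-blog.com/"]  # placeholder

    def parse(self, response):
        for link in response.css("a"):
            href = link.css("::attr(href)").get()
            if not href:
                continue
            url = response.urljoin(href)
            text = " ".join(link.css("::text").getall())
            # Scrapy downloads higher-priority requests before lower ones.
            yield scrapy.Request(url, priority=blog_post_score(url, text))
```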
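The "pool of reusable browsers" can be pictured as a checkout/return queue of headless browser instances, sized to the number of CPU cores since rendering is the bottleneck. A minimal sketch of that pattern follows, using the Selenium PhantomJS driver that was current in 2014 (since deprecated in favour of headless Chrome/Firefox); the pool size, screenshot path and URLs are assumptions, and this is not the BlogForever implementation.

```python
# Sketch of a pool of reusable headless browsers for JavaScript rendering.
# Uses the 2014-era Selenium PhantomJS driver; pool size is a placeholder.
from queue import Queue
from selenium import webdriver

POOL_SIZE = 4  # roughly one browser per CPU core; rendering is CPU-bound

pool = Queue()
for _ in range(POOL_SIZE):
    pool.put(webdriver.PhantomJS())


def render(url):
    """Check a browser out of the pool, render the page, return the HTML."""
    browser = pool.get()  # blocks until a browser is free
    try:
        browser.get(url)
        browser.save_screenshot("page.png")  # optional snapshot of the page
        return browser.page_source           # DOM after JavaScript has run
    finally:
        pool.put(browser)  # return the browser so it can be reused


def shutdown():
    while not pool.empty():
        pool.get().quit()
```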
