Re: Announcing the BlogForever crawler

Atrijit Dasgupta Fri, 03 Jul 2015 07:31:38 -0700

Thanks Olivier ...

On Friday, 3 July 2015 14:16:01 UTC+5:30, Olivier Blanvillain wrote:
>
> Hi, as far as I know nobody continued the development after me, so 
> what's on GitHub should be the latest version. 
>
> On Fri, Jul 3, 2015 at 10:16 AM, Atrijit Dasgupta 
> <[email protected]> wrote: 
> > We are evaluating the BlogForever crawler for a project, and from the 
> > following page we cannot get the source code: 
> > 
> > http://blogforever.eu/blog/2013/10/08/blogforever-platform-released/ 
> > 
> > - the http://invenio-software.org/repo/blogforever/ resource link gives 
> > an error. 
> > 
> > However, a copy of the source code is available at 
> > https://github.com/BlogForever/crawler, uploaded there by Mr. Olivier 
> > Blanvillain. 
> > 
> > Is the GitHub source code the final version? We downloaded and ran it, 
> > and while it works perfectly with the example sites provided, when we 
> > try to crawl other blogs, it seems to get into infinite loops and does 
> > not produce any output. 
> > 
> > And if we cannot get the source code from the blogforever.eu page, is 
> > there another location from which we can get it? 
> > 
> > Thanks 
> > 
> > Atrijit Dasgupta 
> > 
> > On Tuesday, 4 February 2014 21:28:43 UTC+5:30, Olivier Blanvillain wrote: 
> >> 
> >> > Do you plan to release the testing / evaluation part? 
> >> 
> >> The GitHub repository of the paper [1] contains scripts and instructions 
> >> to reproduce the results we present in our evaluation. I could not 
> >> include the dataset we used because it's not publicly available, but it 
> >> should be reasonably easy to request it or manually build a small one. 
> >> (I've not yet included a license because I'm not really sure how it 
> >> works with the text of the paper, but everything else will be MIT.) 
> >> 
> >> 
> >> > Should we put a link to BlogForever on the companies page? 
> >> 
> >> At the moment the BlogForever web site is not really up-to-date, and we 
> >> still have a bit of work (mostly the connection to our back-end) before 
> >> putting the crawler in production. The first instance will likely be 
> >> hosted on CERN servers to monitor high-energy physics related blogs. I 
> >> suggest waiting for this one to be up and running before adding a link; 
> >> we will get back to you when it is! 
> >> 
> >> Cheers, 
> >> Olivier 
> >> 
> >> [1]: https://github.com/OlivierBlanvillain/blogforever-crawler-publication 
> >> 
> >> 
> >> On Monday, February 3, 2014 11:51:56 PM UTC+1, shane wrote: 
> >>> 
> >>> Interesting project! 
> >>> 
> >>> It's nice to see the bits on Scrapy in your paper - thanks! We're 
> >>> delighted it was so useful for the BlogForever crawler. It's great to 
> >>> see your crawler released as open source too. 
> >>> 
> >>> I thought Scrapely could have been a nice comparison. Your approach 
> >>> takes better advantage of the fact that you have many examples (from 
> >>> the feeds), whereas Scrapely is designed for working with very little 
> >>> example data, so I expect your approach would compare favorably. I see 
> >>> you favor using id and class attributes - something we are considering 
> >>> for Scrapely too, as it currently relies exclusively on HTML structure. 
> >>> Do you plan to release the testing / evaluation part? 
> >>> 
> >>> Should we put a link to BlogForever on the companies page? 
> >>> 
> >>> Good luck with the conference submission! 
> >>> 
> >>> Shane 
> >>> 
> >>> 
> >>> 
> >>> On 1 February 2014 16:52, <[email protected]> wrote: 
> >>>> 
> >>>> Hello everyone, 
> >>>> 
> >>>> I'm very happy to announce the release of the BlogForever crawler! Our 
> >>>> work is entirely based on Scrapy, and we wanted to thank you for the 
> >>>> amazing work you did on this framework, without which we could not have 
> >>>> accomplished a fraction of what we did during the last 6 months. 
> >>>> 
> >>>> The crawler targets web blogs, and is able to automatically extract 
> >>>> blog post articles, titles, authors and comments. It's open source and 
> >>>> comes with tests and documentation: 
> >>>> <https://github.com/BlogForever/crawler>. We've also written and 
> >>>> submitted a paper for the WIMS14 conference where we present our 
> >>>> algorithm for content extraction and a high level overview of the 
> >>>> crawler architecture, available at 
> >>>> 
> >>>> <https://github.com/OlivierBlanvillain/blogforever-crawler-publication/blob/master/tex/main.pdf>. 
> >>>> 
> >>>> I believe that the following parts of our project might be useful for 
> >>>> other applications: 
> >>>> 
> >>>> - The content extractor interface is similar to the one of Scrapely 
> >>>>   <https://github.com/scrapy/scrapely> (sadly we discovered Scrapely 
> >>>>   too late to include it in our evaluation). It's very fast and robust: 
> >>>>   we got to a 93% success rate on blog article extraction over 2300 
> >>>>   blog posts. 
> >>>> 
> >>>> - To crawl blogs mixed up with other resources (such as a wiki or a 
> >>>>   forum), we use a simple machine-learning based priority predictor to 
> >>>>   favor crawling URLs with links to blog posts. This allows us to get 
> >>>>   the best out of a limited number of page downloads, as the crawl might 
> >>>>   otherwise get stuck in irrelevant portions of the blog. 
> >>>> 
> >>>> - We use PhantomJS to do JavaScript rendering, take screenshots and 
> >>>>   fake some user interaction to deal with Disqus comments. We have a 
> >>>>   pool of reusable browsers, which allows us to take full advantage of 
> >>>>   processors (with JavaScript rendering on, the crawling bottleneck is 
> >>>>   CPU). 
> >>>> 
> >>>> If you take the time to read the paper (8 pages) or the code, don't 
> >>>> hesitate to send comments or feedback! 
> >>>> 
> >>>> Regards, 
> >>>> Olivier Blanvillain 
> >>>> 
> >>>> -- 
> >>>> You received this message because you are subscribed to the Google 
> >>>> Groups "scrapy-users" group. 
> >>>> To unsubscribe from this group and stop receiving emails from it, 
> send 
> >>>> an email to [email protected]. 
> >>>> To post to this group, send email to [email protected]. 
> >>>> Visit this group at http://groups.google.com/group/scrapy-users. 
> >>>> For more options, visit https://groups.google.com/groups/opt_out. 
> >>> 
> >>> 
>


