Thanks Olivier ...

On Friday, 3 July 2015 14:16:01 UTC+5:30, Olivier Blanvillain wrote:
>
> Hi, as far as I know nobody continued the development after me, so
> what's on GitHub should be the latest version.
>
> On Fri, Jul 3, 2015 at 10:16 AM, Atrijit Dasgupta
> <[email protected]> wrote:
> > We are evaluating the BlogForever crawler for a project, and from the
> > following page we cannot get the source code:
> >
> > http://blogforever.eu/blog/2013/10/08/blogforever-platform-released/
> >
> > - the http://invenio-software.org/repo/blogforever/ resource link
> > gives an error.
> >
> > However, a copy of the source code is available at
> > https://github.com/BlogForever/crawler, uploaded there by Mr. Olivier
> > Blanvillain.
> >
> > Is the GitHub source code the final version? We downloaded and ran it,
> > and while it works perfectly with the example sites provided, when we
> > try to crawl other blogs etc., it seems to get into infinite loops and
> > does not produce any output.
> >
> > And if we cannot get the source code from the blogforever.eu page, is
> > there another location where we can get it?
> >
> > Thanks
> >
> > Atrijit Dasgupta
> >
> > On Tuesday, 4 February 2014 21:28:43 UTC+5:30, Olivier Blanvillain
> > wrote:
> >>
> >> > Do you plan to release the testing / evaluation part?
> >>
> >> The GitHub repository of the paper [1] contains scripts and
> >> instructions to reproduce the results we present in our evaluation.
> >> I could not include the dataset we used because it's not publicly
> >> available, but it should be reasonably easy to request it or manually
> >> build a small one. (I've not yet included a license because I'm not
> >> really sure how it works with the text of the paper, but everything
> >> else will be MIT.)
> >>
> >> > Should we put a link to BlogForever on the companies page?
> >>
> >> At the moment the BlogForever web site is not really up to date, and
> >> we still have a bit of work to do (mostly the connection to our
> >> back-end) before putting the crawler in production. The first
> >> instance will likely be hosted on CERN servers to monitor high-energy
> >> physics related blogs. I suggest waiting for this one to be up and
> >> running before adding a link; we will get back to you when it is!
> >>
> >> Cheers,
> >> Olivier
> >>
> >> [1]:
> >> https://github.com/OlivierBlanvillain/blogforever-crawler-publication
> >>
> >> On Monday, February 3, 2014 11:51:56 PM UTC+1, shane wrote:
> >>>
> >>> Interesting project!
> >>>
> >>> It's nice to see the bits on Scrapy in your paper - thanks! We're
> >>> delighted it was so useful for the BlogForever crawler. It's great
> >>> to see your crawler released as open source too.
> >>>
> >>> I thought Scrapely could have been a nice comparison. Your approach
> >>> takes better advantage of the fact that you have many examples (from
> >>> the feeds), whereas Scrapely is designed for working with very
> >>> little example data, so I expect your approach would compare
> >>> favorably. I see you favor using id and class attributes - something
> >>> we are considering for Scrapely too, as it currently relies
> >>> exclusively on HTML structure. Do you plan to release the testing /
> >>> evaluation part?
> >>>
> >>> Should we put a link to BlogForever on the companies page?
> >>>
> >>> Good luck with the conference submission!
> >>>
> >>> Shane
> >>>
> >>> On 1 February 2014 16:52, <[email protected]> wrote:
> >>>>
> >>>> Hello everyone,
> >>>>
> >>>> I'm very happy to announce the release of the BlogForever crawler!
> >>>> Our work is entirely based on Scrapy, and we wanted to thank you
> >>>> for the amazing work you did on this framework, without which we
> >>>> could not have accomplished a fraction of what we did during the
> >>>> last 6 months.
> >>>>
> >>>> The crawler targets web blogs, and is able to automatically extract
> >>>> blog post articles, titles, authors and comments. It's open source
> >>>> and comes with tests and documentation:
> >>>> <https://github.com/BlogForever/crawler>. We've also written and
> >>>> submitted a paper for the WIMS14 conference where we present our
> >>>> algorithm for content extraction and a high-level overview of the
> >>>> crawler architecture, available at
> >>>>
> >>>> <https://github.com/OlivierBlanvillain/blogforever-crawler-publication/blob/master/tex/main.pdf>.
> >>>>
> >>>> I believe that the following parts of our project might be useful
> >>>> for other applications:
> >>>>
> >>>> - The content extractor interface is similar to the one of Scrapely
> >>>>   <https://github.com/scrapy/scrapely> (sadly we discovered
> >>>>   Scrapely too late to include it in our evaluation). It's very
> >>>>   fast and robust: we reached a 93% success rate on blog article
> >>>>   extraction over 2300 blog posts.
> >>>>
> >>>> - To crawl blogs mixed up with other resources (such as a wiki or a
> >>>>   forum), we use a simple machine-learning based priority predictor
> >>>>   to favor crawling URLs with links to blog posts. This lets us get
> >>>>   the best out of a limited number of page downloads, which might
> >>>>   otherwise get stuck in irrelevant portions of the blog.
> >>>>
> >>>> - We use PhantomJS to do JavaScript rendering, take screenshots and
> >>>>   fake some user interaction to deal with Disqus comments. We have
> >>>>   a pool of reusable browsers, which lets us take full advantage of
> >>>>   the processors (with JavaScript rendering on, the crawling
> >>>>   bottleneck is CPU).
> >>>>
> >>>> If you take the time to read the paper (8 pages) or the code, don't
> >>>> hesitate to send comments or feedback!
> >>>>
> >>>> Regards,
> >>>> Olivier Blanvillain
> >>>>
> >>>> --
> >>>> You received this message because you are subscribed to the Google
> >>>> Groups "scrapy-users" group.
> >>>> To unsubscribe from this group and stop receiving emails from it,
> >>>> send an email to [email protected].
> >>>> To post to this group, send email to [email protected].
> >>>> Visit this group at http://groups.google.com/group/scrapy-users.
> >>>> For more options, visit https://groups.google.com/groups/opt_out.
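For readers of the thread: the priority predictor mentioned in the
announcement can be pictured roughly as below. This is only a sketch under
stated assumptions, not the BlogForever implementation. The names score_url,
Frontier and crawl_order are hypothetical, and the hand-written URL heuristic
merely stands in for the machine-learning model described in the paper. It
shows the general idea: keep pending URLs in a priority queue, and spend a
limited page budget on the ones that score highest.

```python
import heapq
import itertools
import re

# Hypothetical stand-in for the learned predictor. The real crawler uses a
# machine-learning model; this heuristic only illustrates the interface.
DATE_PATH = re.compile(r"/\d{4}/\d{2}/")


def score_url(url):
    """Return a rough score; higher means 'more likely to lead to posts'."""
    score = 0.0
    if DATE_PATH.search(url):
        score += 1.0  # dated paths often belong to blog posts
    if "/blog" in url:
        score += 0.5
    if "/wiki" in url or "/forum" in url:
        score -= 1.0  # resources the crawl should not get stuck in
    return score


class Frontier:
    """Priority queue of URLs; the highest-scoring URL is popped first."""

    def __init__(self, score=score_url):
        self._score = score
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker, keeps FIFO order

    def push(self, url):
        if url in self._seen:  # never queue the same URL twice
            return
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated.
        entry = (-self._score(url), next(self._counter), url)
        heapq.heappush(self._heap, entry)

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def crawl_order(urls, budget):
    """Return the first `budget` URLs in the order they would be fetched."""
    frontier = Frontier()
    for url in urls:
        frontier.push(url)
    order = []
    while frontier and len(order) < budget:
        order.append(frontier.pop())
    return order
```

Within Scrapy itself, a score like this could presumably be mapped to the
integer priority argument of Request, which the scheduler uses to order
requests. Whether the BlogForever crawler does it that way is not stated in
the thread.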
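The "pool of reusable browsers" from the announcement can be sketched in a
similar spirit. Again, this is an illustration under assumptions and not the
project's code: FakeBrowser is a hypothetical stand-in for a PhantomJS
process, and BrowserPool and render_all are made-up names. The point is only
that browsers are started once, handed out on demand, and returned to the
pool instead of being restarted for every page.

```python
import queue
from contextlib import contextmanager


class FakeBrowser:
    """Hypothetical stand-in for a PhantomJS process (costly to start)."""

    started = 0  # counts how many browsers were ever created

    def __init__(self):
        FakeBrowser.started += 1
        self.id = FakeBrowser.started

    def render(self, url):
        return "rendered %s with browser %d" % (url, self.id)


class BrowserPool:
    """Fixed-size pool; browsers are created once and then reused."""

    def __init__(self, size, factory=FakeBrowser):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def browser(self):
        b = self._idle.get()  # blocks until a browser is free
        try:
            yield b
        finally:
            self._idle.put(b)  # always hand the browser back


def render_all(pool, urls):
    """Render every URL, borrowing a browser from the pool each time."""
    results = []
    for url in urls:
        with pool.browser() as b:
            results.append(b.render(url))
    return results
```

Since the announcement says the bottleneck is CPU when JavaScript rendering
is on, a pool size close to the number of cores (for example
os.cpu_count()) seems a plausible starting point, though the thread does
not say what size the project used.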
