Hi, as far as I know nobody continued the development after me, so what's on GitHub should be the latest version.
On Fri, Jul 3, 2015 at 10:16 AM, Atrijit Dasgupta <[email protected]> wrote:
> We are evaluating the BlogForever crawler for a project, and we cannot get
> the source code from the following page:
>
> http://blogforever.eu/blog/2013/10/08/blogforever-platform-released/
>
> The http://invenio-software.org/repo/blogforever/ resource link gives an
> error.
>
> However, a copy of the source code is available at
> https://github.com/BlogForever/crawler, uploaded there by Mr. Olivier
> Blanvillain.
>
> Is the GitHub source code the final version? We downloaded and ran it, and
> while it works perfectly with the example sites provided, when we try to
> crawl other blogs it seems to get into infinite loops and does not
> produce any output.
>
> And if we cannot get the source code from the blogforever.eu page, is there
> another location from which we can get it?
>
> Thanks
>
> Atrijit Dasgupta
>
> On Tuesday, 4 February 2014 21:28:43 UTC+5:30, Olivier Blanvillain wrote:
>>
>> > Do you plan to release the testing / evaluation part?
>>
>> The GitHub repository of the paper [1] contains scripts and instructions to
>> reproduce the results we present in our evaluation. I could not include the
>> dataset we used because it's not publicly available, but it should be
>> reasonably easy to request it or manually build a small one. (I've not yet
>> included a license because I'm not really sure how it works with the text of
>> the paper, but everything else will be MIT.)
>>
>> > Should we put a link to BlogForever on the companies page?
>>
>> At the moment the BlogForever web site is not really up to date, and we
>> still have a bit of work to do (mostly the connection to our back-end)
>> before putting the crawler in production. The first instance will likely be
>> hosted on CERN servers to monitor high-energy physics related blogs. I
>> suggest waiting for this one to be up and running before adding a link; we
>> will get back to you when it is!
>>
>> Cheers,
>> Olivier
>>
>> [1]: https://github.com/OlivierBlanvillain/blogforever-crawler-publication
>>
>>
>> On Monday, February 3, 2014 11:51:56 PM UTC+1, shane wrote:
>>>
>>> Interesting project!
>>>
>>> It's nice to see the bits on Scrapy in your paper - thanks! We're
>>> delighted it was so useful for the BlogForever crawler. It's great to see
>>> your crawler released as open source too.
>>>
>>> I thought Scrapely could have been a nice comparison. Your approach takes
>>> better advantage of the fact that you have many examples (from the feeds),
>>> whereas Scrapely is designed for working with very little example data, so
>>> I expect your approach would compare favorably. I see you favor using id
>>> and class attributes - something we are considering for Scrapely too, as it
>>> currently relies exclusively on HTML structure. Do you plan to release the
>>> testing / evaluation part?
>>>
>>> Should we put a link to BlogForever on the companies page?
>>>
>>> Good luck with the conference submission!
>>>
>>> Shane
>>>
>>>
>>>
>>> On 1 February 2014 16:52, <[email protected]> wrote:
>>>>
>>>> Hello everyone,
>>>>
>>>> I'm very happy to announce the release of the BlogForever crawler! Our
>>>> work is entirely based on Scrapy, and we wanted to thank you for the
>>>> amazing work you did on this framework, without which we could not have
>>>> accomplished a fraction of what we did during the last 6 months.
>>>>
>>>> The crawler targets web blogs, and is able to automatically extract blog
>>>> post articles, titles, authors and comments. It's open source and comes
>>>> with tests and documentation: <https://github.com/BlogForever/crawler>.
>>>> We've also written and submitted a paper for the WIMS14 conference where
>>>> we present our algorithm for content extraction and a high-level overview
>>>> of the crawler architecture, available at
>>>>
>>>> <https://github.com/OlivierBlanvillain/blogforever-crawler-publication/blob/master/tex/main.pdf>.
>>>>
>>>> I believe that the following parts of our project might be useful for
>>>> other applications:
>>>>
>>>> - The content extractor interface is similar to the one of Scrapely
>>>>   <https://github.com/scrapy/scrapely> (sadly, we discovered Scrapely too
>>>>   late to include it in our evaluation). It's very fast and robust: we
>>>>   got to a 93% success rate on blog article extraction over 2300 blog
>>>>   posts.
>>>>
>>>> - To crawl blogs mixed up with other resources (such as a wiki or a
>>>>   forum), we use a simple machine-learning based priority predictor to
>>>>   favor crawling URLs with links to blog posts. This allows us to get the
>>>>   best out of a limited number of page downloads, which might otherwise
>>>>   get stuck in irrelevant portions of the blog.
>>>>
>>>> - We use PhantomJS to do JavaScript rendering, take screenshots and fake
>>>>   some user interaction to deal with Disqus comments. We have a pool of
>>>>   reusable browsers, which allows us to take full advantage of the
>>>>   processors (with JavaScript rendering on, the crawling bottleneck is
>>>>   CPU).
>>>>
>>>> If you take the time to read the paper (8 pages) or the code, don't
>>>> hesitate to send comments or feedback!
>>>>
>>>> Regards,
>>>> Olivier Blanvillain
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "scrapy-users" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at http://groups.google.com/group/scrapy-users.
>>>> For more options, visit https://groups.google.com/groups/opt_out.
