Hi,

I don’t think Alexander is doing anything wrong. In fact, he’s
asking for input on his web crawling framework on the Nutch user
list which I imagine contains many people interested in distributed
web crawling. 

There doesn’t appear to be a direct Nutch connection in his framework;
however, it uses other Apache technologies (Kafka, HBase, etc.) that we
are using, or thinking of using, and that are of interest at least from
my perspective as a Nutch developer and PMC member. There are also
several efforts to figure out how to use Scrapy with Nutch, and this may
be an interesting connection.

If Alexander and people like him who aren’t using Nutch per se never
came to the Nutch list to discuss common web crawling topics of
interest, we’d continue to have our silos, our own separate lists, and
our own discussions, instead of trying to work together as a broader
community. We’d also miss out on opportunities where, in the future, we
could perhaps share more than just ideas, but software too.

I applaud Alexander for coming to this list rather than staying in his
own silo, and for seeking input from the Apache Nutch community.

Thank you, Alexander.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





-----Original Message-----
From: Jessica Glover <glover.jessic...@gmail.com>
Reply-To: "user@nutch.apache.org" <user@nutch.apache.org>
Date: Friday, October 2, 2015 at 8:45 AM
To: "user@nutch.apache.org" <user@nutch.apache.org>
Subject: Re: Frontera: large-scale, distributed web crawling framework

>Hmm... you're asking for a free consultation on an open source software
>user mailing list? First, this doesn't exactly seem like the appropriate
>place for that. Second, offer some incentive if you want someone to help
>you with your business.
>
>On Fri, Oct 2, 2015 at 11:33 AM, Alexander Sibiryakov <sixty-...@yandex.ru> wrote:
>
>> Hi Nutch users!
>>
>> For the last 8 months at Scrapinghub we’ve been working on a new web
>> crawling framework called Frontera. It is a distributed implementation
>> of the crawl frontier part of a web crawler, the component which
>> decides what to crawl next, when, and when to stop. So, it’s not a
>> complete web crawler. However, it suggests an overall crawler design,
>> and there is a clean and tested way to build such a crawler in half a
>> day from existing components.
>>
>> Here is a list of the main features:
>> - Online operation: scheduling of new batches, updating of DB state. No
>>   need to stop crawling to change the crawling strategy.
>> - Storage abstraction: write your own backend (SQLAlchemy and HBase
>>   backends are included); see the sketch after this list.
>> - Canonical URL resolution abstraction: each document has many URLs, so
>>   which one should be used? We provide a place where you can code your
>>   own logic.
>> - Scrapy ecosystem: good documentation, big community, ease of
>>   customization.
>> - Communication layer is Apache Kafka: easy to plug in somewhere and
>>   debug.
>> - Crawling strategy abstraction: the crawling goal, URL ordering, and
>>   scoring model are coded in a separate module.
>> - Polite by design: each website is downloaded by at most one spider
>>   process.
>> - Workers are implemented in Python.
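>>
>> To make the storage abstraction concrete, here is a rough, hypothetical
>> sketch in plain Python (not Frontera’s actual API; the class and method
>> names are illustrative assumptions only) of what a minimal in-memory
>> frontier backend could look like:
>>
>>     class InMemoryBackend(object):
>>         """Keeps frontier state (seen URLs, pending queue) in memory."""
>>
>>         def __init__(self):
>>             self.seen = set()
>>             self.queue = []
>>
>>         def add_seeds(self, urls):
>>             # Start the crawl from a set of seed URLs.
>>             for url in urls:
>>                 if url not in self.seen:
>>                     self.seen.add(url)
>>                     self.queue.append(url)
>>
>>         def page_crawled(self, url, extracted_links):
>>             # Schedule newly discovered links that have not been seen.
>>             for link in extracted_links:
>>                 if link not in self.seen:
>>                     self.seen.add(link)
>>                     self.queue.append(link)
>>
>>         def get_next_requests(self, max_n):
>>             # Decide what to crawl next; here, simple FIFO ordering.
>>             batch, self.queue = self.queue[:max_n], self.queue[max_n:]
>>             return batch
>>
>> A distributed backend would keep the same interface but store this
>> state in a database such as HBase and exchange batches over Kafka.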
>> In general, such a web crawler should be very easy to customize and
>> easy to plug into existing infrastructure, and its online operation
>> could be useful for crawling frequently changing web pages, news
>> websites for example. We tested it at some scale by crawling part of
>> the Spanish internet; you can find details in my presentation.
>>
>> 
>> http://events.linuxfoundation.org/sites/events/files/slides/Frontera-crawling%20the%20spanish%20web.pdf
>>
>> The project is currently on GitHub; it’s open source, under its own
>> license.
>> https://github.com/scrapinghub/frontera
>> https://github.com/scrapinghub/distributed-frontera
>>
>> The questions are: what do you guys think? Is this a useful thing? If
>> yes, what kinds of use cases do you see? Currently, I’m looking for
>> businesses that could benefit from it; please write to me if you have
>> any ideas on that.
>>
>> A.
