Frontera: large-scale, distributed web crawling framework

2015-10-02 Thread Alexander Sibiryakov
Hi Nutch users! For the last 8 months at Scrapinghub we’ve been working on a new web crawling framework called Frontera. It is a distributed implementation of the crawl frontier part of a web crawler, the component which decides what to crawl next, when, and when to stop. So, it’s not a complete web crawler

Re: Frontera: large-scale, distributed web crawling framework

2015-10-02 Thread Jessica Glover
Hmm... you're asking for a free consultation on an open source software user mailing list? First, this doesn't exactly seem like the appropriate place for that. Second, offer some incentive if you want someone to help you with your business. On Fri, Oct 2, 2015 at 11:33 AM, Alexander Sibiryakov w

Re: Frontera: large-scale, distributed web crawling framework

2015-10-02 Thread Jessica Glover
Sorry, just re-read and saw that it's open source and under what license? I apologize if you're not trying to sell this. On Fri, Oct 2, 2015 at 11:45 AM, Jessica Glover wrote: > Hmm... you're asking for a free consultation on an open source software > user mailing list? First, this doesn't exact

Re: Frontera: large-scale, distributed web crawling framework

2015-10-02 Thread Mattmann, Chris A (3980)
Hi, I don’t think Alexander is doing anything wrong. In fact, he’s asking for input on his web crawling framework on the Nutch user list which I imagine contains many people interested in distributed web crawling. There doesn’t appear to be a direct Nutch connection here in his framework, howeve

Re: Frontera: large-scale, distributed web crawling framework

2015-10-02 Thread Jessica Glover
Alexander, I apologize. I misunderstood the intent of your message and I was very rude in my response. I will think about what you've asked and get back to you. Also, I enjoyed your slide presentation. It's very pleasing to the eye. Sincerely, Jessica On Fri, Oct 2, 2015 at 11:51 AM, Mattmann, C

Subscription to nutch list

2015-10-02 Thread Disha Punjabi
Hi, I want to subscribe to the nutch mailing list. Best, Disha

Re: Subscription to nutch list

2015-10-02 Thread Girish Rao
Send an email to user-subscr...@nutch.apache.org and if you want to join the dev mailing list send email to: dev-subscr...@nutch.apache.org Instructions on: http://nutch.apache.org/mailing_lists.html Regards Girish On Fri, Oct 2, 2015 at 12:09 PM, Disha Punjabi wrote: > Hi, > I want to subs

Re: Remove Header Footer and Menus from crawled content

2015-10-02 Thread Camilo Tejeiro
@marora: I am glad it helps! @john: I think you don't have to patch or modify the parse-html plugin; you can build a parse-filter that is executed afterwards. This is the way I am doing it currently, because I read somewhere (I don't remember where) that it is good practice to extend the parse-html plu

Re: nutch 2.3.1 doesn't crawl

2015-10-02 Thread Drulea, Sherban
Seems like the problem is with the generator. It doesn't generate any links to crawl. Is there any way to debug why the generator doesn't work? On 10/1/15, 6:39 PM, "Drulea, Sherban" wrote: >Hi All, > >Thanks for pointing me to the 2.3.1 release. It works without error but >doesn't crawl. I'm

Re: Apache Nutch Output structure

2015-10-02 Thread Lewis John Mcgibbney
Hi Folks, On Fri, Oct 2, 2015 at 4:33 PM, wrote: > > I already went through the page but it gives only technical information > about the directories but no information related to relation amongst these > folders and what they really mean in terms of crawled output. > I agree to an extent. I've

Re: Apache Nutch Python-Nutchpy

2015-10-02 Thread Lewis John Mcgibbney
Hi Sanjay, On Fri, Oct 2, 2015 at 4:33 PM, wrote: > > I want to use the apache nutch python nutchpy library for analyzing the > crawl data generated from apache nutch. > Can anyone please point me to the documentation for the nutchpy library that explains > how I can interact with crawl data using python nut