Hi,
First of all, thanks! We greatly appreciate it.
We have been using Nutch for a very long time now, but we have diverged a lot
from the default codebase in order to make it suited for our purposes.
Therefore we could never really integrate it with Nutch development
itself. For example, a custom component we built is a "component fetcher",
which fetches outlink URLs directly within the fetcher job itself to speed
up certain vertical crawls. The way we implemented it prevented us from
integrating it into Nutch itself, although we did make some attempts
(see the mailing list thread [1] for details). Other customizations include
persisting parsed results to HBase and creating a Lucene index from HBase.
However, the recent developments with Nutchgora sparked our interest, and
we decided to become more involved. The fact that crawling can be fully
maintained within HBase itself is especially cool. (We are big fans of
Hadoop and Lucene too.) Staying closer to an actively maintained codebase
is of course the best way to go. Our main goal for now is a healthy
Nutchgora branch that is able to perform crawling on a large scale
(40+ machines) using HBase as a backend!
By the way, Mathijs and I will be attending the upcoming HadoopWorld, so
if any of you are going too, please let us know and we can arrange a
meet and greet.
Cheers!
[1] http://lucene.472066.n3.nabble.com/Component-fetching-during-parsing-vertical-crawling-td981098.html
On 10/28/2011 02:26 PM, Markus Jelsma wrote:
Cheers!
On Friday 28 October 2011 14:21:25 Julien Nioche wrote:
Hi,
A while back the Nutch PMC nominated Ferdy Galema for Nutch committership
and PMC membership. The vote in the Nutch PMC has concluded and I'm
happy to announce that Ferdy is now a Nutch committer.
Ferdy, feel free to say a little bit about yourself. Your account has been
created and you should have committer rights. Your first task will be to
check that it works by adding yourself to the list of committers on the
website (see Wiki for instructions).
Well done and welcome on board
Julien