Hi all,
First, an intro. I am another Nutch newbie, and I am finding 0.7.2 to be quite an effective single-machine crawler. I am not new to Java, data, or the Internet. I run an email list called 'tagdb' for people interested in the db problems of creating folksonomy applications, and also a blog called tagschema ( http://tagschema.com ). For completely different reasons I am interested in MapReduce (outside the Nutch context), so I am also interested in seeing how Hadoop evolves.

Personally, I see a lot of value in retaining the 0.7.2 code base while evolving 0.8 into the medium-to-high-end space as a *separate* code line. The ability to keep the db formats compatible would be nice, to allow reuse of existing results, but it is not necessary.

As a potential developer, I would like to volunteer for the ongoing maintenance and evolution of 0.7.2 as an effective single-machine crawler. I understand that the current developer community is more interested in moving the MapReduce-based architecture forward, and as I said, I am interested in that too. But it would be a shame if the perfectly fine 0.7.2 code were orphaned, and I would like to step forward and put my money where my mouth is.

I don't know what it would take to maintain separate versions the way the Tomcat folks do, but it seems there is a need. Consider this a proposal to maintain two separate versions by continuing bug-fix releases of 0.7 until one of two things happens:

a) 0.8 evolves into something satisfactory for use also as a single-machine search engine, and everyone is happy moving to it; or
b) a critical mass of developers steps forward to support the ongoing development of 0.7.2 into, say, Nutch-lite, always and only meant for single-machine use.

Please feel free to shoot this down if I am "smoking rope", as the famous newscaster says ....

Nitin Borwankar
http://tagschema.com

On Tue, 14 Nov 2006 00:53:27 +0100, "Nutch Newbie" <[EMAIL PROTECTED]> said:
> Actually we are saying the same thing.
> Sorry, I was not really pointing any fingers; apologies if it came across that way. I was just stating the fact of why things didn't get solved: as you pointed out, the active developers are on large installs, not on small installs.
>
> However, if the ambition of the project is to address medium-size installs, then there has to be some effort from committers to make sure not to introduce code that just benefits the big 1000-machine installs or the active developers. Correct? (Again, no pointing fingers :-). Otherwise you are just forgetting the little guys and not giving them the chance to develop and contribute.
>
> I completely understand your view, and I am aware of the Hadoop work in progress.
>
> Regards,
>
> On 11/14/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> > (Sorry for the long post, but I felt this issue needs to be made very clear ...)
> >
> > Nutch Newbie wrote:
> > > Here are some general comments:
> > >
> > > The problem is in Hadoop, i.e. map-reduce, i.e. processing. Hadoop-206 is not solved. Have a look:
> > >
> > > http://www.mail-archive.com/hadoop-user%40lucene.apache.org/msg00521.html
> > >
> > > Well, again, it is wishful thinking to ask for many developers, patches, and bug reports and bug fixes without focusing on the needs of such developers. Same example again: Hadoop-206 was reported and it is still not solved. So how do you expect to get more developers, when
> >
> > Before we get carried away, let me state clearly that reporting a problem and providing a fix for a problem are two different things. Hadoop-206 is a problem report, but without a fix. If there were a fix for it, it would most probably have been applied a long time ago. The reason it's not solved is that it's not a high-priority issue for the active developers, and there is no easy fix to be applied.
> >
> > If this issue is a high priority for you, then fix it and provide a patch so that others may benefit from it - that's how Open Source projects work. Pointing fingers and saying "you should have done this or that a long time ago" won't fix the stuff by itself. Are you a developer? Then fix it. If not, then you should now understand why we kindly _ask_ for more developers to get involved. Reporting problems is very useful and crucial, but so is having the skilled manpower to fix them.
> >
> > > See, when the focus of the development is on solving the 1000-machine / large install case, then issues like 206 never get solved. Thus asking for more developers to provide bug fixes is wishful thinking.
> >
> > No, we ask because we really need developers who could help us, who take the initiative to fix something if it's broken in their particular use case.
> >
> > The focus is on large clusters because that's what the majority of active developers use. If there were more active developers with a focus on small clusters (or single-machine deployments) - hint, hint - the focus would move in this direction. There is no conspiracy here, nor do we willfully ignore the needs of people with small deployments - it's just a matter of what the priority is among active developers.
> >
> > Complaining about this won't help as much as providing actual patches to solve issues. Until then, a faster single-machine deployment is a "nice to have" thing, but not the top priority.
> >
> > > Sorry, if I knew how to solve the map/reduce problem I would fix it and submit a patch, and I am sure I am not the only one here. The map/reduce stuff is not really a walk in the park :-).
> > >
> > > The current direction of Nutch development is geared towards large installs, and it is great software. However, let's not pretend/preach that Nutch is good for small installs; Nutch left that life when it embraced Map/Reduce, i.e.
> > > starting from 0.8.
> >
> > You need to take into account that this is the first official release of Nutch after major brain surgery, so it's no wonder things are a little bit twitchy ;) There are in fact very few places in Nutch, if any, that still use the same data models and algorithms as they did in the 0.7 era.
> >
> > Having said that, I just did a crawl of 1 mln pages within ~30 hours, on a single machine, which should give me a 100 mln collection within 2 months. This speed is acceptable for me, even if it's slower than 0.7, and if one day I want to go beyond 100 mln pages I know that I will be able to do it - which _cannot_ be said about 0.7 ... So you can look at it as a tradeoff.
> >
> > (BTW: the issue with the slow reduce phase is well known, and people from the Hadoop project are working on it even as we speak.)
> >
> > Oh, and regarding the subject of this thread - the strategic direction of Nutch is to provide a viable platform for medium- to large-scale search engines, be they Internet-wide or Intranet / constrained to a specific area. This was the original goal of the project, and it still reflects our ambitions. HOWEVER, if a significant part of the active community is focused on small / embedded deployments, then you need to make your voice heard _and_ start contributing to the project so that it becomes a viable solution for your needs as well.
> >
> > I hope this long answer helps you understand why things are the way they are ... ;)
> >
> > --
> > Best regards,
> > Andrzej Bialecki <><
> >  ___. ___ ___ ___ _ _   __________________________________
> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > ___|||__||  \|  || |   Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com

--
Nitin Borwankar
[EMAIL PROTECTED]

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
