Re: [Nutch-general] Strategic Direction of Nutch

Anthony May Mon, 13 Nov 2006 17:38:02 -0800

This is one of the options that I have suggested for our organisation to
adopt.


Anthony May
Web Developer
NZQA

>>> [EMAIL PROTECTED] 14/11/2006 2:05 p.m. >>>


Hi all,

First an intro. I am another Nutch newbie and am finding 0.7.2 to be
quite an effective single machine crawler. 
I am not new to Java or data or the Internet. I run an email list called
'tagdb' for people interested in db problems in creating folksonomy
applications, also a blog called tagschema ( http://tagschema.com )

For completely different reasons I am interested in MapReduce (outside
the Nutch context) so I am also interested in seeing how Hadoop evolves.
Personally I see a lot of value in retaining the 0.7.2 code base while
evolving 0.8 into the medium to high end space as a *separate* code line.
The ability to keep db formats compatible would be nice to allow reuse
of existing results but is not necessary.

As a potential developer I would like to volunteer for the ongoing
maintenance and evolution of 0.7.2 as an effective single machine crawler.
I understand that the current developer community is more interested in
moving MapReduce based architecture forward and as I said I am also
interested in that.
But it would be a shame if the just fine 0.7.2 code was orphaned and I
would like to step forward and put my money where my mouth is.
I don't know what it would take to maintain separate versions like the
Tomcat folks do but it seems there is a need.

Consider this a proposal to maintain two separate versions by continuing
bug fix versions of 0.7 until one of two things happens:

a) 0.8 evolves into something satisfactory for use also as a single
machine search engine and everyone is happy moving to it
b) a critical mass of developers steps forward to support the ongoing
development of 0.7.2 into say Nutch-lite always and only meant for
single machine use.

Please feel free to shoot this down if I am "smoking rope", as a famous
newscaster says ....


Nitin Borwankar
http://tagschema.com 




On Tue, 14 Nov 2006 00:53:27 +0100, "Nutch Newbie"
<[EMAIL PROTECTED]> said:
> Actually we are saying the same thing. Sorry, I was not really pointing
> any fingers; apologies if it came across that way. I was just stating
> the reason why things didn't get solved: as you pointed out, active
> developers are on large installs and not on small installs.
> 
> However if the ambition of the project is to address medium size
> installs, then there has to be some effort from committers to make sure
> not to introduce code that just benefits the big 1000 machine installs
> or the active developers. Correct? (Again no pointing fingers :-).
> Otherwise you are just forgetting the little guys and not giving them
> the chance to develop and contribute.
> 
> I completely understand your view and I am aware of Hadoop work in
> progress.
> 
> Regards,
> On 11/14/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> > (Sorry for the long post, but I felt this issue needs to be made very
> > clear ...)
> >
> > Nutch Newbie wrote:
> > > Here are some general comments:
> > >
> > > The problem is in Hadoop i.e. map-reduce, i.e. processing. Hadoop-206
> > > is not solved. Have a look.
> > >
> > > http://www.mail-archive.com/hadoop-user%40lucene.apache.org/msg00521.html
> > >
> > > Well, again it's wishful thinking to ask for many developers, patches,
> > > bug reports and bug fixes without focusing on the needs of such
> > > developers. Same example again! Hadoop-206 was reported and it is
> > > still not solved. So how do you expect to get more developers? when
> >
> > Before we get carried away, let me state clearly that reporting a
> > problem and providing a fix for a problem are two different things -
> > Hadoop-206 is a problem report, but without a fix. If there was a fix
> > for it, it would most probably have been applied a long time ago. The
> > reason it's not solved is that it's not a high priority issue for
> > active developers, and there is no easy fix to be applied.
> >
> > If this issue is a high priority for you, then fix it and provide a
> > patch so that others may benefit from it - that's how Open Source
> > projects work. Pointing fingers and saying "you should have done this
> > or that long time ago" won't fix the stuff by itself. Are you a
> > developer? Then fix it. If not, then you should now understand why we
> > kindly _ask_ for more developers to get involved. Reporting problems
> > is very useful and crucial, but so is having the skilled manpower to
> > fix them.
> >
> > >
> > > See, when the focus of the development is to solve 1000 machine /
> > > large installs, then issues like 206 are never solved. Thus asking
> > > for more developers to provide bug fixes is wishful thinking.
> >
> > No, we ask because we really need developers who could help us, who
> > take initiative to fix something if it's broken in their particular
> > use case.
> >
> > The focus is on large clusters because that's what the majority of
> > active developers use. If there were more active developers with a
> > focus on small clusters (or single machine deployments) - hint, hint -
> > the focus would move in this direction. There is no conspiracy here,
> > nor do we willfully ignore the needs of people with small deployments -
> > it's just a matter of what is the priority among active developers.
> >
> > Complaining about this won't help as much as providing actual patches
> > to solve issues. Until then, a faster single-machine deployment is a
> > "nice to have" thing, but not the top priority.
> >
> > >
> > > Sorry, if I knew how to solve the map/reduce problem I would fix it
> > > and submit a patch, and I am sure I am not the only one here.
> > > Map/reduce stuff is not really a walk in the park :-).
> > >
> > > The current direction of Nutch development is geared towards large
> > > installs and it's great software. However let's not pretend/preach
> > > that Nutch is good for small installs; Nutch left that life when it
> > > embraced Map/Reduce, i.e. starting from 0.8.
> >
> > You need to take into account that this is the first official release
> > of Nutch after a major brain surgery, so it's no wonder things are a
> > little bit twitchy ;) There are in fact very few, if any, places in
> > Nutch that still use the same data models and algorithms as they did
> > in the 0.7 era.
> >
> > Having said that, I just did a crawl of 1 mln pages within ~30 hours,
> > on a single machine, which should give me a 100 mln collection within
> > 2 months. This speed is acceptable for me, even if it's slower than
> > 0.7, and if one day I want to go beyond 100 mln pages I know that I
> > will be able to do it - which _cannot_ be said about 0.7 ... So, you
> > can look at it as a tradeoff.
> >
> > (BTW: the issue with slow reduce phase is well known, and people from
> > the Hadoop project are working on it even as we speak).
> >
> > Oh, and regarding the subject of this thread - the strategic direction
> > of Nutch is to provide a viable platform for medium to large scale
> > search engines, be they Internet-wide or Intranet / constrained to a
> > specific area. This was the original goal of the project, and it still
> > reflects our ambitions. HOWEVER, if a significant part of the active
> > community is focused on small / embedded deployments, then you need to
> > make your voice heard _and_ start contributing to the project so that
> > it becomes a viable solution also to your needs.
> >
> > I hope this long answer helps you to understand why things are the way
> > they are ... ;)
> >
> > --
> > Best regards,
> > Andrzej Bialecki     <><
> >  ___. ___ ___ ___ _ _   __________________________________
> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com
> >
> >
> >
-- 
  Nitin Borwankar
  [EMAIL PROTECTED] 

