Hi all,
First, an intro. I am another Nutch newbie, and I am finding 0.7.2 to be quite an effective single-machine crawler. I am not new to Java, data, or the Internet. I run an email list called 'tagdb' for people interested in the db problems of creating folksonomy applications, and also a blog called tagschema ( http://tagschema.com ). For completely different reasons I am interested in MapReduce (outside the Nutch context), so I am also interested in seeing how Hadoop evolves.

Personally, I see a lot of value in retaining the 0.7.2 code base while evolving 0.8 into the medium-to-high-end space as a *separate* code line. The ability to keep the db formats compatible would be nice, to allow reuse of existing results, but it is not necessary.

As a potential developer, I would like to volunteer for the ongoing maintenance and evolution of 0.7.2 as an effective single-machine crawler. I understand that the current developer community is more interested in moving the MapReduce-based architecture forward, and as I said, I am interested in that too. But it would be a shame if the perfectly fine 0.7.2 code were orphaned, and I would like to step forward and put my money where my mouth is.

I don't know what it would take to maintain separate versions the way the Tomcat folks do, but it seems there is a need. Consider this a proposal to maintain two separate versions by continuing bug-fix releases of 0.7 until one of two things happens:

a) 0.8 evolves into something satisfactory for use also as a single-machine search engine, and everyone is happy moving to it; or
b) a critical mass of developers steps forward to support the ongoing development of 0.7.2 into, say, Nutch-lite, always and only meant for single-machine use.

Please feel free to shoot this down if I am "smoking rope", as the famous newscaster says ....

Nitin Borwankar
http://tagschema.com

On Tue, 14 Nov 2006 00:53:27 +0100, "Nutch Newbie" <[EMAIL PROTECTED]> said:
> Actually we are saying the same thing.
> Sorry, I was not really pointing any fingers; apologies if it came across that way. I was just stating the fact of why things didn't get solved: as you pointed out, the active developers are on large installs, not on small installs.
>
> However, if the ambition of the project is to address medium-size installs, then there has to be some effort from committers to make sure not to introduce code that just benefits the big 1000-machine installs or the active developers. Correct? (Again, no pointing fingers :-). Otherwise you are just forgetting the little guys and not giving them the chance to develop and contribute.
>
> I completely understand your view, and I am aware of the Hadoop work in progress.
>
> Regards,
>
> On 11/14/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> > (Sorry for the long post, but I felt this issue needs to be made very clear ...)
> >
> > Nutch Newbie wrote:
> > > Here are some general comments:
> > >
> > > The problem is in Hadoop, i.e. map-reduce, i.e. processing. Hadoop-206 is not solved. Have a look:
> > >
> > > http://www.mail-archive.com/hadoop-user%40lucene.apache.org/msg00521.html
> > >
> > > Well, again, it is wishful thinking to ask for many developers, patches, and bug reports and bug fixes without focusing on the needs of such developers. Same example again: Hadoop-206 was reported and it is still not solved. So how do you expect to get more developers, when
> >
> > Before we get carried away, let me state clearly that reporting a problem and providing a fix for a problem are two different things. Hadoop-206 is a problem report, but without a fix. If there were a fix for it, it would most probably have been applied a long time ago. The reason it's not solved is that it's not a high-priority issue for the active developers, and there is no easy fix to be applied.
> >
> > If this issue is a high priority for you, then fix it and provide a patch so that others may benefit from it - that's how Open Source projects work. Pointing fingers and saying "you should have done this or that a long time ago" won't fix the stuff by itself. Are you a developer? Then fix it. If not, then you should now understand why we kindly _ask_ for more developers to get involved. Reporting problems is very useful and crucial, but so is having the skilled manpower to fix them.
> >
> > > See, when the focus of the development is on solving the 1000-machine / large install case, then issues like 206 never get solved. Thus asking for more developers to provide bug fixes is wishful thinking.
> >
> > No, we ask because we really need developers who could help us, who take the initiative to fix something if it's broken in their particular use case.
> >
> > The focus is on large clusters because that's what the majority of active developers use. If there were more active developers with a focus on small clusters (or single-machine deployments) - hint, hint - the focus would move in this direction. There is no conspiracy here, nor do we willfully ignore the needs of people with small deployments - it's just a matter of what the priority is among active developers.
> >
> > Complaining about this won't help as much as providing actual patches to solve issues. Until then, a faster single-machine deployment is a "nice to have" thing, but not the top priority.
> >
> > > Sorry, if I knew how to solve the map/reduce problem I would fix it and submit a patch, and I am sure I am not the only one here. The map/reduce stuff is not really a walk in the park :-).
> > >
> > > The current direction of Nutch development is geared towards large installs, and it is great software. However, let's not pretend/preach that Nutch is good for small installs; Nutch left that life when it embraced Map/Reduce, i.e.
> > > starting from 0.8.
> >
> > You need to take into account that this is the first official release of Nutch after major brain surgery, so it's no wonder things are a little bit twitchy ;) There are in fact very few places in Nutch, if any, that still use the same data models and algorithms as they did in the 0.7 era.
> >
> > Having said that, I just did a crawl of 1 mln pages within ~30 hours, on a single machine, which should give me a 100 mln collection within 2 months. This speed is acceptable for me, even if it's slower than 0.7, and if one day I want to go beyond 100 mln pages I know that I will be able to do it - which _cannot_ be said about 0.7 ... So you can look at it as a tradeoff.
> >
> > (BTW: the issue with the slow reduce phase is well known, and people from the Hadoop project are working on it even as we speak.)
> >
> > Oh, and regarding the subject of this thread - the strategic direction of Nutch is to provide a viable platform for medium- to large-scale search engines, be they Internet-wide or Intranet / constrained to a specific area. This was the original goal of the project, and it still reflects our ambitions. HOWEVER, if a significant part of the active community is focused on small / embedded deployments, then you need to make your voice heard _and_ start contributing to the project so that it becomes a viable solution for your needs as well.
> >
> > I hope this long answer helps you understand why things are the way they are ... ;)
> >
> > --
> > Best regards,
> > Andrzej Bialecki <><
> >  ___. ___ ___ ___ _ _   __________________________________
> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > ___|||__||  \|  || |   Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com

--
Nitin Borwankar
[EMAIL PROTECTED]

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
