Re: Next Nutch release

Dennis Kubes Sat, 20 Jan 2007 11:55:05 -0800


Andrzej Bialecki wrote:

Dennis Kubes wrote:
I completely agree with this. I am interested in devoting as much time as possible to seeing the success of Nutch, Hadoop, and Lucene. As our business grows I would also be willing to devote developers full time to work on Nutch, Hadoop, and Lucene.
I think that at least one company needs to come out and have a production search engine that is competition, however small, to the googles and yahoos of the world, built on Nutch and Hadoop. I thought that was the original goal of Nutch. I know there are some out there right now like Mozdex, but I mean a true billion page system. I think the .8 codebase, and yes improvements could be made, is capable of supporting such a system. I think then you will see many more developers become interested in the project. If you build it they will come.
Sure, I'd love to point people to such a system. But did you do a calculation how much money in the initial investment, and then ongoing costs, is needed to maintain such an index? It cannot happen just because of someone's goodwill, there must be a sound business idea behind it, and a team of dedicated people to make it happen and persevere - not just to demonstrate how good Nutch is, but to keep up for the sake of their own business.

I completely agree. We have been working on this business for almost a year. We received significant seed capital to build the alpha version of the search, which is complete, and are in the process of securing first round private equity funding to scale to 100M pages this year and up to 1B pages in year 2.

Yes the initial investment for hardware, data center costs, marketing costs, and most importantly development staff for say a 1 billion page index capable of supporting 100 queries per second constant is around 5M and as it grows into the 10-20 billion range costs can grow as high as 100M.

I think what many people don't understand is that search is as much a hardware (electricity, bandwidth) issue as it is a software issue. I know that we couldn't have developed the systems we have without Nutch, Hadoop, and Lucene and that I personally and we as a company are completely committed to their development.

I will say that it is difficult for people to understand how to get more involved. I have been working with Nutch and Hadoop for almost a year now on a daily basis and only now am I understanding how to contribute through jira, etc. There needs to be more guidance in helping developers contribute. For example, if you want to develop a new piece of functionality, then do x, y, and z. Here is how to patch your system. If you want to develop a patch then here are the steps. I have programmed in Java for many years but haven't worked on many open source projects before. The process of how they work isn't explicit and it needs to be.
Hmm. I might not be objective here anymore. There is however some documentation already on the Wiki, which explains how to contribute - if you feel it's inadequate please use your hard-earned experience to improve it.

I am in the middle of writing a new wiki page for contributing that will go into much more detail about the process.

We worked up many patches for issues we came up against in the .8 and .4 codebases but they were never contributed because, as stupid as it might sound, we really don't know how to give it back. The best thing I thought I could do was to help answer questions on the list. Again just need a little guidance.
Are you willing to spend the time and do the required refactoring? Anyone else?
Yes, I am and I currently have 2 other developers that can help.
Sounds great. We could start by creating a new page on Wiki, which would collect our vision for Nutch - as I mentioned to Stefan, I think we should take a step back, and think about the strategy for the next 1-2 years of Nutch development, and what is the target audience.

I am all for this, just understand this is a new process for me so will need some guidance.

Sure if we start a 2.x branch and if I'm not developing for the trash or "jira nirvana", I can imagine to contribute. I
Just a quick comment: "jira nirvana" (which I believe refers to patches sitting idle in Jira for a long time) is not caused by ill will or disrespect for contributors, but foremost by limited human resources. If we want to maintain a certain level of quality, these patches cannot be applied blindly, but need to be reviewed, analyzed, applied, tested, and committed. That's an awful lot of work for 2-3 people, who also have other things to do ...
It is very less attractive to developers spending weeks to find a bug like the regular expression one. Then such a bug sits there for month in the jira being rejected. Sure if nobody of the contributors run nutch with a 500 mio url
It's not being rejected - see the comments on that issue, there is an overall agreement that it's ok; it simply hasn't been applied yet. See above for the why.
I'm slowly coming to a point where I should be able to fix it - but let's not throw out the baby with the water ...
Wow, I hold my finger crossed!
There is a great book on this. It is 0691122024. Andrzej send me your address and I will buy and ship you a copy if you don't have it.
Too late :) I found it two weeks ago, and it's already on its merry way - but thanks for the offer.
We would also be willing to help develop this functionality further.
I started working on a testbed as a part of another commercial project, it's likely that I could get a release from the customer to contribute this code to the project. A testbed is a prerequisite for any serious work on ranking and web graph.
(It's quite unfortunate that the best-of-breed open source framework for working with web graphs is licensed under LGPL ...)
I can definitely see a desire to re-write but I think even if you re-write you are still going to have the same problem. Search is hard and without guidance we can't get enough developers to understand what they need to know to help.
Indeed. People often don't appreciate how much heuristics and trials, beyond pure academic-level IR, is needed to come up with a system that gives a decent quality of results, and is manageable. Nutch may not be perfect, but there's a lot of this specific knowledge already accumulated here.

Absolutely and it is not knowledge that is easily found elsewhere.

At this time I don't think it is a design problem I think it is a people problem. I will be more than willing to head up training, documenting, and helping developers get up to speed. I just need direction in this area myself.
I believe that at this point it's crucial to keep the project well-focused (at the moment I think the main focus is on larger installations, and not the small ones), and also to make Nutch attractive to developers as a reusable "search engine" component.

I think there are two areas. One is to keep the focus as you stated above. The other is to provide a path to get more people involved. If no one objects I will continue working on such a path.

Let's continue the discussion. I'll create the page on Wiki, please feel free to add your thoughts.

Will do.
