Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-15 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: I tested it on a 5 mln index. Thanks, this is great data! Can you please tell a bit more about the experiments? In particular: . How were scores assigned to pages? Link analysis? log(number of incoming links) or OPIC? log() . How were

Nutch design queries

2005-12-15 Thread Mike Cannon-Brookes
Hey guys, Been playing with Nutch quite a bit lately, here's a random grab-bag of queries / questions / problems I've encountered. - Classloading - I have had many problems with NutchConf due to the way it loads its resources. In a J2EE scenario, it's simply evil :) Would there be any great

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-15 Thread Doug Cutting
Andrzej Bialecki wrote: . How were the queries generated? From a log or randomly? Queries have been picked up manually, to test the worst performing cases from a real query log. So, for example, the 50% error rate might not be typical, but could be worst-case. . When results differed

Re: vote results.

2005-12-15 Thread Andrzej Bialecki
Stefan Groschupf wrote: Hi, I counted the votes manually, I hope I didn't overlook something. I didn't filter out issues that are 0.8 related, since it is good to know community wishes anyway. :-) Shouldn't the period for voting be a bit longer? I didn't have time to vote yet... Anyway,

Re: vote for issues to fix in 0.7.2

2005-12-15 Thread Florent Gluck
I hope it's not too late to accept my votes. Here they are:
NUTCH-136 mapreduce segment generator generates 50 % less than excepted urls +1
NUTCH-121 SegmentReader for mapred +1
NUTCH-108 tasktracker crashs when reconnecting to a new jobtracker. +1
Thanks, --Flo

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-15 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: . How were the queries generated? From a log or randomly? Queries have been picked up manually, to test the worst performing cases from a real query log. So, for example, the 50% error rate might not be typical, but could be worst-case.

Re: mapreduce fetcher doesn't fetch all urls

2005-12-15 Thread Doug Cutting
Stefan Groschupf wrote: In case you set up one thread per host, you have at most as many connections to one host as you have boxes. In my case that is not that many. Anything more than one is not generally considered polite. Also it is a reproducible bug that the segment is every time
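
The politeness rule discussed above (no more than one connection to a host at a time) can be illustrated with a minimal sketch. This is not the Nutch fetcher's code: the class PerHostLimiterSketch and its methods are hypothetical names introduced here, and the limit only holds inside a single JVM, so it does not address the multi-box case Stefan describes.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Hypothetical illustration only; not taken from the Nutch code base.
class PerHostLimiterSketch {
    // One single-permit semaphore per host name, created on first use.
    private final ConcurrentHashMap<String, Semaphore> permits = new ConcurrentHashMap<>();

    // Returns true if the caller may open a connection to the host now.
    public boolean tryAcquire(String host) {
        return permits.computeIfAbsent(host, h -> new Semaphore(1)).tryAcquire();
    }

    // Must be called exactly once after each successful tryAcquire.
    public void release(String host) {
        Semaphore s = permits.get(host);
        if (s != null) {
            s.release();
        }
    }

    public static void main(String[] args) {
        PerHostLimiterSketch limiter = new PerHostLimiterSketch();
        System.out.println(limiter.tryAcquire("a.example")); // first connection to this host
        System.out.println(limiter.tryAcquire("a.example")); // second one is refused
        limiter.release("a.example");
        System.out.println(limiter.tryAcquire("a.example")); // allowed again after release
    }
}
```

A fetcher thread that is refused would normally move on to a URL from another host rather than block, which is why tryAcquire is used instead of acquire in this sketch.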

Re: Nutch design queries

2005-12-15 Thread Doug Cutting
Mike Cannon-Brookes wrote: Hey guys, Hi, Mike! Welcome. - Classloading - I have had many problems with NutchConf due to the way it loads its resources. In a J2EE scenario, it's simply evil :) Would there be any great problem with switching its classloader to

Re: [Fwd: Crawler submits forms?]

2005-12-15 Thread Doug Cutting
Andrzej Bialecki wrote: Please also don't forget that the trunk/ will soon be invaded by the code from mapred, I guess some time around the middle of January (Doug?) Thinking about this more, perhaps we should do it sooner. There's already a branch for 0.7.x releases, so what point is there

JUnit test failures

2005-12-15 Thread Piotr Kosiorowski
Hi, I have problems with JUnit tests in the trunk and mapred branches. TestFetcher fails in both branches. The same test executes correctly in the 0.7 branch. Is it only my problem (environment setup) or are others having it too? I would suspect some changes in redirect handling. Regards, Piotr

Re: [Fwd: Crawler submits forms?]

2005-12-15 Thread Piotr Kosiorowski
Doug Cutting wrote: Andrzej Bialecki wrote: Please also don't forget that the trunk/ will soon be invaded by the code from mapred, I guess some time around the middle of January (Doug?) Thinking about this more, perhaps we should do it sooner. There's already a branch for 0.7.x releases,

Re: [Fwd: Crawler submits forms?]

2005-12-15 Thread Doug Cutting
Andrzej Bialecki wrote: I agree. I just thought that we would prepare the release based on the code in trunk/, and in that case we would like to wait with the merge before we do the release. My definition of trunk is that it should be where the majority of development happens. It is what we

Re: [Fwd: Crawler submits forms?]

2005-12-15 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: I agree. I just thought that we would prepare the release based on the code in trunk/, and in that case we would like to wait with the merge before we do the release. My definition of trunk is that it should be where the majority of development

Re: [Fwd: Crawler submits forms?]

2005-12-15 Thread Doug Cutting
Andrzej Bialecki wrote: Yes, we just need to make sure that all important bits from trunk are on the 0.7 branch, before we start. I will sync mapred with the trunk prior to the merge, so we should still be able to get anything we need after mapred is merged back to trunk. BTW, we're pretty

Re: Nutch design queries

2005-12-15 Thread Mike Cannon-Brookes
Wow - great responses all. 0.7 vs 0.8 - apologies if I'm using an old version. I'm using the latest binary release. I'll switch to latest SVN HEAD and see how that works in my application. Is there any concrete timeline on 0.8? I'm very glad to see the statics generally being reduced. I also

[jira] Created: (NUTCH-142) NutchConf should use the thread context classloader

2005-12-15 Thread Mike Cannon-Brookes (JIRA)
NutchConf should use the thread context classloader
---
Key: NUTCH-142
URL: http://issues.apache.org/jira/browse/NUTCH-142
Project: Nutch
Type: Improvement
Versions: 0.7
Reporter: Mike Cannon-Brookes
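
For readers unfamiliar with the idea in the issue title, the following is a minimal sketch of looking up a resource through the thread context classloader, with a fallback to the loader of the current class. It is not the NutchConf code or the proposed patch; the class ContextClassLoaderSketch and its methods are hypothetical, and the resource name "nutch-default.xml" is used only as an example.

```java
import java.net.URL;

// Hypothetical illustration only; not the actual NutchConf implementation.
class ContextClassLoaderSketch {
    // Prefer the thread context classloader, which a J2EE container sets
    // per web application; fall back to the loader of this class.
    public static ClassLoader pickLoader() {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = ContextClassLoaderSketch.class.getClassLoader();
        }
        return cl;
    }

    // Returns the URL of the named resource, or null if it is not found.
    public static URL findResource(String name) {
        return pickLoader().getResource(name);
    }

    public static void main(String[] args) {
        URL url = findResource("nutch-default.xml"); // example name, an assumption
        System.out.println(url == null ? "not found on classpath" : url.toString());
    }
}
```

In a container, the context classloader usually sees the web application's own classpath, which is the situation the J2EE complaint in this thread appears to be about.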

Re: vote results.

2005-12-15 Thread Jérôme Charron
Just continue voting I will continue with my tally sheet. :-) Why not create a wiki page... so that you don't have to do this bad work. Jérôme

Re: Nutch design queries

2005-12-15 Thread Mike Cannon-Brookes
Filed as http://issues.apache.org/jira/browse/NUTCH-142 I didn't think there was much point creating a patch for a 1 line fix :) m On 12/16/05, Mike Cannon-Brookes [EMAIL PROTECTED] wrote: Wow - great responses all. 0.7 vs 0.8 - apologies if I'm using an old version. I'm using the latest

Re: Nutch design queries

2005-12-15 Thread Doug Cutting
Mike Cannon-Brookes wrote: 0.7 vs 0.8 - apologies if I'm using an old version. I'm using the latest binary release. I'll switch to latest SVN HEAD and see how that works in my application. The mapred branch will soon be moved to trunk, so you might be better off starting there, since a lot

Re: Nutch design queries

2005-12-15 Thread Doug Cutting
Doug Cutting wrote: Once the mapred branch is folded in then there's a bunch of stuff that's obsoleted that needs to be removed. I'd like to get dynamic configuration in, if possible. For reference, I found the message I posted about this a while back:

mapred merge to trunk

2005-12-15 Thread Doug Cutting
Sami Siren wrote: +1. I think this is good time to merge now as the mapred is fully usable. Barring objections, I will do this tomorrow morning, Pacific time. Doug

Re: version branches / two products

2005-12-15 Thread David Wallace
My apologies if this is the second time I've sent this. Andrzej Bialecki wrote: Please also don't forget that the trunk/ will soon be invaded by the code from mapred, I guess some time around the middle of January (Doug?) Doug Cutting wrote: Thinking about this more, perhaps we should do it

Re: version branches / two products

2005-12-15 Thread Andrzej Bialecki
David Wallace wrote: Would it be worthwhile discussing the pros and cons of having two completely separate Nutch products? If it is, then now is probably the right time to do so. My take on this: * it's too costly (in terms of available human resources) to maintain both versions for a

[jira] Created: (NUTCH-143) Improper error numbers returned on exit

2005-12-15 Thread Rod Taylor (JIRA)
Improper error numbers returned on exit
---
Key: NUTCH-143
URL: http://issues.apache.org/jira/browse/NUTCH-143
Project: Nutch
Type: Bug
Versions: 0.8-dev
Reporter: Rod Taylor
Nutch does not obey standard command line
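
The issue text is cut off here, so the exact complaint is not shown. Going by the title alone, a common convention is that a command-line tool exits with 0 on success and a non-zero status on failure. The sketch below illustrates that convention only; the class ExitCodeSketch, the "noop" command, and the specific non-zero values are assumptions, not Nutch code.

```java
// Hypothetical illustration only; not taken from the Nutch code base.
class ExitCodeSketch {
    static final int EXIT_OK = 0;      // success
    static final int EXIT_FAILURE = 1; // the command ran and failed (assumed value)
    static final int EXIT_USAGE = 2;   // bad invocation (assumed value)

    // Runs a command and returns the status the process should exit with.
    public static int run(String[] args) {
        if (args.length == 0) {
            System.err.println("usage: ExitCodeSketch <command>");
            return EXIT_USAGE;
        }
        try {
            execute(args[0]);
            return EXIT_OK;
        } catch (Exception e) {
            System.err.println("error: " + e.getMessage());
            return EXIT_FAILURE;
        }
    }

    // Stand-in for real work: only the made-up command "noop" succeeds.
    static void execute(String command) throws Exception {
        if (!"noop".equals(command)) {
            throw new IllegalArgumentException("unknown command: " + command);
        }
    }

    public static void main(String[] args) {
        // Demo only: prints the status for three sample invocations.
        // A real entry point would instead call System.exit(run(args)).
        System.out.println(run(new String[] {"noop"}));
        System.out.println(run(new String[] {"bogus"}));
        System.out.println(run(new String[0]));
    }
}
```

Keeping the status computation in run() and the System.exit call in the entry point lets shell scripts test the result with the usual "if command; then" idiom while the logic stays testable.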