[Nutch-dev] [ nutch-Bugs-1088877 ] Crawler ends prematurely with an IOException

2004-12-20 Thread SourceForge.net
Bugs item #1088877, was opened at 2004-12-21 08:30 Message generated for change (Tracker Item Submitted) made by Item Submitter You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=491356&aid=1088877&group_id=59548 Category: tools Group: None Status: Open Resolution: Non

Re: [Nutch-dev] make BasicUrlNormalizer.java thread safe

2004-12-20 Thread John X
On Mon, Dec 20, 2004 at 03:40:44PM -0800, Doug Cutting wrote: > John X wrote: > >BasicUrlNormalizer.java should be made thread safe as > > > >< public String normalize(String urlString) > >--- > > > >> public synchronized String normalize(String urlString) > > > > > >If no objection, I will c

Re: [Nutch-dev] make BasicUrlNormalizer.java thread safe

2004-12-20 Thread Doug Cutting
John X wrote: BasicUrlNormalizer.java should be made thread safe as < public String normalize(String urlString) --- > public synchronized String normalize(String urlString) If no objection, I will commit it later. Good catch. In general, we should be careful not to synchronize too much. I t

Re: [Nutch-dev] make BasicUrlNormalizer.java thread safe

2004-12-20 Thread ogjunk-nutch
I don't think this is necessary - normalize(String) doesn't use any global vars... well, it does, but are those regexp-related built-in classes not thread safe? Otis --- John X <[EMAIL PROTECTED]> wrote: > BasicUrlNormalizer.java should be made thread safe as > > < public String normalize(

[Nutch-dev] make BasicUrlNormalizer.java thread safe

2004-12-20 Thread John X
BasicUrlNormalizer.java should be made thread safe as < public String normalize(String urlString) --- > public synchronized String normalize(String urlString) If no objection, I will commit it later. John
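
The trade-off debated in this thread can be sketched as follows. This is a minimal illustration, not the actual BasicUrlNormalizer: the class name, the slash-collapsing rule, and the use of java.util.regex are all assumptions made for the example (the real class may rely on a different regex library whose matcher objects are not safe to share). It shows the proposed synchronized method next to a lock-free variant that is safe only because the shared Pattern is immutable and each call creates its own Matcher.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a URL normalizer; NOT the actual Nutch class.
class SketchUrlNormalizer {
    // java.util.regex.Pattern is immutable and may be shared across threads.
    // The rule itself (collapse repeated slashes outside "://") is an assumption.
    private static final Pattern DUPLICATE_SLASHES = Pattern.compile("(?<!:)/{2,}");

    // Variant 1: the change proposed in the thread. Every call on this
    // instance is serialized, which is safe but can become a bottleneck.
    public synchronized String normalizeSynchronized(String urlString) {
        return collapse(urlString);
    }

    // Variant 2: no lock. Safe here only because no mutable state is shared:
    // each call builds a fresh Matcher, which is not thread safe by itself.
    public String normalizeUnsynchronized(String urlString) {
        return collapse(urlString);
    }

    private static String collapse(String urlString) {
        Matcher matcher = DUPLICATE_SLASHES.matcher(urlString);
        return matcher.replaceAll("/");
    }
}
```

If the normalizer keeps a mutable matcher or buffer in an instance field, only the synchronized variant (or a per-thread copy of that state) is safe, which is why the answer depends on what the regex classes in use guarantee.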

Re: [Nutch-dev] refetching all pages to update anchor text?

2004-12-20 Thread Doug Cutting
Matt Kangas wrote: Hmm... perhaps I'm not clear then on the purpose of these two tools. Is this new tool not simply a SegmentMergeTool that does a smarter merge, albeit at a cost? No, this is a segment updater tool. The anchors and scores are copied from the database into a segment when its fetch

Re: [Nutch-dev] AND/OR/NOT do not work?

2004-12-20 Thread Michael Sashnikov
Thank you everybody for your replies. I agree OR is not that important. I thought Nutch uses 100% of Lucene functionality including its query parser, which is very powerful. Michael From: Doug Cutting <[EMAIL PROTECTED]> Reply-To: [EMAIL PROTECTED] To: [EMAIL PROTECTED] Subject: Re: [Nutch-dev]

Re: [Nutch-dev] refetching all pages to update anchor text?

2004-12-20 Thread Matt Kangas
Hmm... perhaps I'm not clear then on the purpose of these two tools. Is this new tool not simply a SegmentMergeTool that does a smarter merge, albeit at a cost? On a related note, would a single-site crawler have any need of this new tool? My sense is that inlink-based ranking has little value in

Re: [Nutch-dev] improve fetcher thread handling?

2004-12-20 Thread Luke Baker
On 12/20/2004 05:55 AM, Stefan Groschupf wrote: [snip] Right, I had some scenarios where many pages were on one host but there were still other hosts, and since a lot of threads were sleeping the fetcher was running at only half power. While it certainly feels that the fetcher is running with only

Re: [Nutch-dev] improve fetcher thread handling?

2004-12-20 Thread John X
On Mon, Dec 20, 2004 at 11:55:09AM +0100, Stefan Groschupf wrote: > >I don't know the implementation details, but in the extreme scenario > >you described, where number of threads is set to 100, but only 1 is > >active, because there is only 1 host to crawl, there is not much you > >can do - you d

Re: [Nutch-dev] refetching all pages to update anchor text?

2004-12-20 Thread Doug Cutting
Matt Kangas wrote: Doug, if this were implemented, where would be a good place to put it? SegmentMergeTool? I think it should be a new standalone tool. It would be invoked by CrawlTool, and probably also from the tutorial in the "whole web crawling" section. Doug

Re: [Nutch-dev] AND/OR/NOT do not work?

2004-12-20 Thread Doug Cutting
Michael Sashnikov wrote: I believe Lucene does support AND/OR/NOT and other keywords. Does anybody know why Nutch does not? Without this feature Nutch is not too useful. Nutch currently supports required terms and prohibited terms (equivalent to AND and NOT). That is what most people use. Google

[Nutch-dev] DistributedAnalysisTool and boost

2004-12-20 Thread Christophe Noel
Please, I need your help! I tried to find how and where the document boost factor is computed. In class DistributedAnalysisTool I found some methods modifying the Page.score, and I read the IndexSegment "compute boost" part: boost = score exp(scorePower). But could you please briefly explain m
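
One plausible reading of the quoted formula is that the boost is the page score raised to the power scorePower. The sketch below shows that reading only; the class and method names are hypothetical, and it is not taken from IndexSegment.

```java
// Hypothetical helper illustrating one reading of "boost = score exp(scorePower)".
// This is an assumption about the formula, not code from Nutch.
class BoostSketch {
    // Returns score raised to scorePower. With scorePower below 1, large
    // scores are dampened; with scorePower of 0, every page gets a boost of 1.
    static float boost(float score, float scorePower) {
        return (float) Math.pow(score, scorePower);
    }
}
```

Under this reading, a score of 4 with a scorePower of 0.5 gives a boost of 2, so the exponent controls how strongly link-derived scores influence ranking.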

Re: [Nutch-dev] improve fetcher thread handling?

2004-12-20 Thread Stefan Groschupf
I don't know the implementation details, but in the extreme scenario you described, where number of threads is set to 100, but only 1 is active, because there is only 1 host to crawl, there is not much you can do - you don't want more than 1 thread hitting the same host. Sure, in case you do a int
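
The constraint discussed here, at most one thread hitting a given host at a time, can be sketched with a shared set of busy hosts. This is only an illustration under assumed names (HostGate, tryAcquire, release); it does not describe how the Nutch fetcher is implemented.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-host gate; names and design are assumptions for illustration.
class HostGate {
    // Thread-safe set of hosts that currently have a fetch in progress.
    private final Set<String> busyHosts = ConcurrentHashMap.newKeySet();

    // Returns true if the caller may fetch from this host now.
    // The add is atomic, so only one thread can win for a given host.
    boolean tryAcquire(String host) {
        return busyHosts.add(host);
    }

    // Must be called once the fetch for this host has finished.
    void release(String host) {
        busyHosts.remove(host);
    }
}
```

A fetcher thread whose tryAcquire call returns false could pick a URL from a different host instead of sleeping, which is one possible way to keep threads busy when many pages share a host.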

Re: [Nutch-dev] Implementing geography-by-IP filtering?

2004-12-20 Thread Stefan Groschupf
Matt, What do you think is the next step? Should I simply write an implementation and post it to the list? Well, feel free! The question is whether you need a quick solution or a good solution. For a good solution I would suggest changing the fetcherFilter from the old but still used interface - configu