Bugs item #1088877, was opened at 2004-12-21 08:30
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=491356&aid=1088877&group_id=59548
Category: tools
Group: None
Status: Open
Resolution: None
On Mon, Dec 20, 2004 at 03:40:44PM -0800, Doug Cutting wrote:
> John X wrote:
> >BasicUrlNormalizer.java should be made thread safe as
> >
> >< public String normalize(String urlString)
> >---
> >
> >> public synchronized String normalize(String urlString)
> >
> >
> >If no objection, I will c
John X wrote:
BasicUrlNormalizer.java should be made thread safe as
< public String normalize(String urlString)
---
> public synchronized String normalize(String urlString)
If no objection, I will commit it later.
Good catch. In general, we should be careful not to synchronize too
much. I t
I don't think this is necessary - normalize(String) doesn't use any
global vars. Well, it does, but are those regexp-related built-in
classes not thread safe?
Otis
--- John X <[EMAIL PROTECTED]> wrote:
> BasicUrlNormalizer.java should be made thread safe as
>
> < public String normalize(
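The question above turns on which objects are shared between calls. A minimal sketch, assuming java.util.regex rather than whatever regexp classes the normalizer actually uses: a compiled Pattern is immutable and can be shared freely, while a Matcher holds per-match state and cannot. The class name, the pattern, and both method names are hypothetical and are not taken from BasicUrlNormalizer.java.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical normalizer for illustration only; not the BasicUrlNormalizer source. */
class SketchUrlNormalizer {
    // A compiled Pattern is immutable, so one instance can serve every thread.
    private static final Pattern DUPLICATE_SLASHES = Pattern.compile("(?<!:)/{2,}");

    // A Matcher carries mutable match state; sharing one instance needs a lock.
    private final Matcher sharedMatcher = DUPLICATE_SLASHES.matcher("");

    /** Reuses the shared Matcher, so all callers are serialized by the lock. */
    public synchronized String normalizeLocked(String urlString) {
        return sharedMatcher.reset(urlString).replaceAll("/");
    }

    /** Creates a fresh Matcher per call, so no lock is needed. */
    public String normalizeUnlocked(String urlString) {
        return DUPLICATE_SLASHES.matcher(urlString).replaceAll("/");
    }
}
```

If the shared state is of the first kind, synchronized is the simplest fix; if it can be made per-call, the lock can be avoided, which fits the caution about not synchronizing too much.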
BasicUrlNormalizer.java should be made thread safe as
< public String normalize(String urlString)
---
> public synchronized String normalize(String urlString)
If no objection, I will commit it later.
John
__
http://www.neasys.com - A Good Place to B
Matt Kangas wrote:
Hmm... perhaps I'm not clear then on the purpose of these two tools.
Is this new tool not simply a SegmentMergeTool that does a smarter
merge, albeit at a cost?
No, this is a segment updater tool. The anchors and scores are copied
from the database into a segment when its fetch
Thank you everybody for your replies. I agree OR is not that important. I
had thought Nutch used 100% of Lucene's functionality, including its query
parser, which is very powerful.
Michael
From: Doug Cutting <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: Re: [Nutch-dev]
Hmm... perhaps I'm not clear then on the purpose of these two tools.
Is this new tool not simply a SegmentMergeTool that does a smarter
merge, albeit at a cost?
On a related note, would a single-site crawler have any need of this
new tool? My sense is that inlink-based ranking has little value in
On 12/20/2004 05:55 AM, Stefan Groschupf wrote:
[snip]
Right, I had some scenarios where many pages were on one host but there
were still other hosts, and since a lot of threads were sleeping the
fetcher was running at only half power.
While it certainly feels that the fetcher is running with only
On Mon, Dec 20, 2004 at 11:55:09AM +0100, Stefan Groschupf wrote:
> >I don't know the implementation details, but in the extreme scenario
> >you described, where number of threads is set to 100, but only 1 is
> >active, because there is only 1 host to crawl, there is not much you
> >can do - you d
Matt Kangas wrote:
Doug, if this were implemented, where would be a good place to put it?
SegmentMergeTool?
I think it should be a new standalone tool. It would be invoked by
CrawlTool, and probably also from the tutorial in the "whole web
crawling" section.
Doug
---
Michael Sashnikov wrote:
I believe Lucene does support AND/OR/NOT and other keywords. Does anybody
know why Nutch does not? Without this feature Nutch is not too useful.
Nutch currently supports required terms and prohibited terms (equivalent
to AND and NOT). That is what most people use. Google
Please, I need your help!
I tried to find how & where the document boost factor was computed...
In class DistributedAnalysisTool I found some methods modifying the
Page.score and I read the IndexSegment "compute boost" part...
boost = score exp(scorePower)
But could you please briefly explain m
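One possible reading of the formula quoted above is that the boost is the page score raised to the power scorePower. That reading is an assumption: the notation could also mean score multiplied by e^scorePower, and the sketch below does not reproduce the IndexSegment code. The class and method names are hypothetical.

```java
/** Hypothetical illustration of boost = score ^ scorePower; not Nutch source. */
class BoostSketch {
    /**
     * Raises the page score to a configurable power.
     * A power below 1 compresses the gap between high and low scores.
     */
    static float boost(float score, float scorePower) {
        return (float) Math.pow(score, scorePower);
    }
}
```

Under this reading, a scorePower of 1 leaves the score unchanged, and a scorePower of 0.5 turns a score of 4 into a boost of 2.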
I don't know the implementation details, but in the extreme scenario
you described, where number of threads is set to 100, but only 1 is
active, because there is only 1 host to crawl, there is not much you
can do - you don't want more than 1 thread hitting the same host.
Sure, in case you do a int
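The constraint described above, no more than one thread hitting the same host, can be sketched as a small gate that worker threads consult before fetching. This is a hypothetical illustration and not the Nutch fetcher's actual code; the class and method names are invented for the example.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical per-host gate; not taken from the Nutch fetcher. */
class HostGate {
    // Hosts that currently have a fetch in progress.
    private final Set<String> busyHosts = ConcurrentHashMap.newKeySet();

    /** Returns true if the caller may fetch from this host now, false if it is busy. */
    boolean tryAcquire(String host) {
        return busyHosts.add(host); // add() is atomic and returns false if already present
    }

    /** Marks the host as free again once the fetch has finished. */
    void release(String host) {
        busyHosts.remove(host);
    }
}
```

A thread that gets false can pick a URL from a different host instead of sleeping, which only helps when the fetch list contains more than one host.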
Matt,
What do you think is the next step? Should I simply write an
implementation and post it to the list?
Well, feel free!
The question is whether you need a quick solution or a good solution.
For a good solution I would suggest changing the fetcherFilter from the
old but still used interface - configu