RE: [Nutch-dev] Exploding number links due to bad sites

2005-01-10 Thread Chirag Chaman
Kashif: Unfortunately, it seems that a subtle change at the top of the page (in this case the bold title name) will result in a duplicate page, as the MD5s will not be the same. You can end up with 50-100 pages of such junk. This goes back to what I brought up a fe

RE: [Nutch-dev] Exploding number links due to bad sites

2005-01-10 Thread Kashif Khadim
Duplicate pages are a big problem, and they include spam. As I grow my index, these duplicate pages grow too, and I am tired of getting this spam with the same content on hundreds of sites. They show up at the top of the search results and continue across many pages of results full of duplicate content. One examp

Re: [Nutch-dev] About SegmentMergeTool

2005-01-10 Thread Kashif Khadim
Hi Andrzej, Thanks for your great help. The main reason I am trying this is to remove duplicates, as the DeleteDuplicate tool doesn't work well for me and I end up with many sites with the same content. I will work on what you suggested. Thanks again, Kashif Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

RE: [Nutch-dev] Exploding number links due to bad sites

2005-01-10 Thread Chirag Chaman
Doug: Well, sites that point to the same content are mirrors 90% of the time. One example is www.cricinfo.org (located across 8 countries): while the main page is always the same, the links it leads into may be very different (see it while a cricket match is in session and all mirror sites h

Re: [Nutch-dev] refetching all pages to update anchor text?

2005-01-10 Thread Xin-Yi Liu
matt, how far along are you on this tool? i'm basically done (just need to do some more testing). sorry about the duplicated effort -- we needed this quickly so i just banged it out. --- Matt Kangas <[EMAIL PROTECTED]> wrote: > Doug, thanks for the tips! I'll try to take a stab > at it ASAP. >

Re: [Nutch-dev] About SegmentMergeTool

2005-01-10 Thread Andrzej Bialecki
Kashif Khadim wrote: Hi, I am using SegmentMergeTool and it is taking very long; I want to know how much time to expect for this tool to finish. After reading a segment with 20 entries it just sits there for two days. Is this normal? Definitely not normal... Is the process swapping? You can get

[Nutch-dev] [ nutch-Bugs-1099077 ] [PATCH] ArrayIndexOutOfBounds during fetch

2005-01-10 Thread SourceForge.net
Bugs item #1099077, was opened at 2005-01-09 12:15 Message generated for change (Comment added) made by cutting You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=491356&aid=1099077&group_id=59548 Category: fetcher Group: None >Status: Closed >Resolution: Fixed Priorit

Re: [Nutch-dev] Exploding number links due to bad sites

2005-01-10 Thread Doug Cutting
Chirag Chaman wrote: - Do a breadth-first crawl (which Nutch does) - For each page fetched, generate an MD5 hash - IF the MD5 hash is in "WebDB", do not add the data to the segment and mark the link for deletion [ ... ] The above will AT MOST add one page that is bad and all the other will be i
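The hash check proposed above can be sketched roughly as follows. This is only an illustration under assumptions, not Nutch code: the class `Md5DedupSketch` and its in-memory `seenHashes` set are hypothetical stand-ins for whatever the WebDB would actually store. It also shows the weakness raised elsewhere in this thread: a small change in the page produces a different MD5, so the page is not recognized as a duplicate.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: exact-duplicate detection by MD5 of the page content.
class Md5DedupSketch {
    // Stand-in for the content hashes the WebDB would hold (assumption).
    private final Set<String> seenHashes = new HashSet<>();

    // Returns the MD5 digest of the content as a lowercase hex string.
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    // Returns true if the content is new (keep it), false if its hash was seen before.
    boolean accept(String content) {
        return seenHashes.add(md5Hex(content));
    }

    public static void main(String[] args) {
        Md5DedupSketch dedup = new Md5DedupSketch();
        System.out.println(dedup.accept("<html>same body</html>"));
        System.out.println(dedup.accept("<html>same body</html>"));
        // A changed title defeats the exact-hash check.
        System.out.println(dedup.accept("<html><b>Title</b>same body</html>"));
    }
}
```

Because the comparison is on the exact digest, this only catches byte-identical content; near-duplicates would need a different technique.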

[Nutch-dev] Broken link in www.nutch.org

2005-01-10 Thread Roy at SEVENtwentyfour
There appears to be a problem on this page of your site. On page http://www.nutch.org/docs/nl/ when you click on "donateur", the link to http://www.nutch.org/docs/nl/donate.html gives the error: Not found. As recommended by the Robot Guidelines, this email is to explain our robot

Re: [Nutch-dev] Objects to bits in Lucene vs. Nutch

2005-01-10 Thread Jeremy Calvert
JC>> we would like to perform Lucene searches over JC>> several Nutch-generated indexes residing on JC>> separate servers ... [including] lucene queries JC>> such as span, fuzzy, range, etc. DC> It should not be hard to implement these as Nutch DC> QueryFilter plugins. Thus, one could add DC>

Re: [Nutch-dev] ArrayIndexOutOfBoundsException during fetch

2005-01-10 Thread Doug Cutting
I just committed a variation of this patch. Instead of allocating a new Perl5Matcher for each call, I used a ThreadLocal to cache one Perl5Matcher per thread. Thanks! Doug Piotr Kosiorowski wrote: Hello, I had a look at this issue this week and I think I have identified the problem. I have plan
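The per-thread caching pattern described above can be sketched like this. It is not the committed patch: to stay self-contained it uses `java.util.regex.Matcher` (also not safe for concurrent use) in place of Perl5Matcher, and the class name, pattern, and `firstHref` helper are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: one non-thread-safe matcher cached per thread.
class PerThreadMatcherSketch {
    // Illustrative pattern only; the real fetcher code uses its own expressions.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    // Each thread lazily gets its own Matcher instead of allocating one per call
    // or sharing a single instance across threads.
    private static final ThreadLocal<Matcher> MATCHER =
        ThreadLocal.withInitial(() -> HREF.matcher(""));

    // Returns the first href value in the input, or null if there is none.
    static String firstHref(String html) {
        Matcher m = MATCHER.get().reset(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstHref("<a href=\"http://example.com/\">x</a>"));
    }
}
```

Compared with allocating a matcher on every call, this keeps allocation to one object per thread while still avoiding shared mutable state.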

RE: [Nutch-dev] ArrayIndexOutOfBoundsException during fetch

2005-01-10 Thread Tim England
Piotr, I got the latest nightly build, but it looks like this patch didn't make it in. I will go ahead and apply this manually and give it a try. Thanks! -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of Piotr Kosiorowski Sent: Saturday, January 08, 2005

[Nutch-dev] About SegmentMergeTool

2005-01-10 Thread Kashif Khadim
Hi, I am using SegmentMergeTool and it is taking very long; I want to know how much time to expect for this tool to finish. After reading a segment with 20 entries it just sits there for two days. Is this normal? Thanks, Kashif

Re: [Nutch-dev] PruneIndexTool

2005-01-10 Thread Andrzej Bialecki
Kashif Khadim wrote: Hi, How can I remove URLs whose domain ends with ".org" using this tool? If I match on the URL, it also deletes sites like http://somesites.com/org/, and those sites do not end with the domain ".org". I want to have only ".com" sites for some index. Current index structure doesn't

[Nutch-dev] PruneIndexTool

2005-01-10 Thread Kashif Khadim
Hi, How can I remove URLs whose domain ends with ".org" using this tool? If I match on the URL, it also deletes sites like http://somesites.com/org/, and those sites do not end with the domain ".org". I want to have only ".com" sites for some index. Thanks, Kashif
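The distinction in the question above (a host ending in ".org" versus a path containing "/org/") can be illustrated with a small check on the parsed host. This is only a sketch and says nothing about how PruneIndexTool itself selects documents; the class `HostSuffixSketch` and the method `hostEndsWith` are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical sketch: test the host part of a URL, not the whole URL string.
class HostSuffixSketch {
    // Returns true only if the URL's host ends with the given suffix, e.g. ".org".
    static boolean hostEndsWith(String url, String suffix) {
        try {
            String host = new URI(url).getHost();
            return host != null && host.toLowerCase(Locale.ROOT).endsWith(suffix);
        } catch (URISyntaxException e) {
            // Unparseable URLs are treated as non-matching in this sketch.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(hostEndsWith("http://www.nutch.org/docs/nl/", ".org"));
        System.out.println(hostEndsWith("http://somesites.com/org/", ".org"));
    }
}
```

Matching on the host rather than the full URL string is what keeps http://somesites.com/org/ out of the ".org" set.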

[Nutch-dev] DistributedAnalysisTool basic question

2005-01-10 Thread Christophe Noel
My question is really basic, but I cannot find the answer! Which class uses DistributedAnalysisTool and LinkAnalysisTool? Thanks a lot! Christophe.

Re: [Nutch-dev] [announce] Parser-Plugin for MS PowerPoint slides

2005-01-10 Thread Stephan Lagraulet
Hi Stephan, I also didn't find anything regarding PowerPoint parsers. I'm working on a project using Lucene and Nutch parsers, so I also investigated this. I actually created an MS PowerPoint parser this morning, using POI via the code submitted at this URL: http://www.mail-archive.com/poi-user@jakarta.

[Nutch-dev] [announce] Parser-Plugin for MS PowerPoint slides

2005-01-10 Thread Strittmatter, Stephan
Dear Nutch developers, during the last weeks I looked into Nutch and found that MS PowerPoint slides are currently not supported. I am not sure, but I have also not found any hint in the mailing lists that someone has already implemented a parser plugin for this document type. I