Kashif:
Unfortunately it seems that given a subtle change in the
page (in this case the Bold Title name) at the top of the page, this will result
in a duplicate page, as the MD5s will not be the same. You can end up with
50-100 pages of such junk.
This goes back to what I brought up a fe
Duplicate pages are a big problem, and they also include spam. As I grow my index these duplicate pages grow too, and I am tired of getting this spam with the same content on hundreds of sites. They show up at the top of search results and continue for many pages of results full of duplicate content.
One examp
Hi Andrzej,
Thanks for your great help. The main reason I am trying this is to remove duplicates, as the
DeleteDuplicate tool doesn't work well for me and I end up with many sites with the same contents. I will work on what you suggested.
Thanks Again,
Kashif
Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
Doug:
Well, sites that do point to the same content are 90% of the time mirrors.
Examples are www.cricinfo.org (located across 8 countries), while the main
page is always the same the links that traverse into may be very different
(see it while a cricket match is in session and all mirror sites h
matt,
how far along are you on this tool? i'm basically
done (just need to do some more testing). sorry about
the duplicated effort -- we needed this quickly so i
just banged it out.
--- Matt Kangas <[EMAIL PROTECTED]> wrote:
> Doug, thanks for the tips! I'll try to take a stab
> at it ASAP.
>
Kashif Khadim wrote:
Hi,
I am using SegmentMergeTool and it is taking very long. I want to know how
much time to expect for this tool to finish. After reading a segment with
20 entries it just sits there for two days. Is this normal?
Definitely not normal... Is the process swapping? You can get
Bugs item #1099077, was opened at 2005-01-09 12:15
Message generated for change (Comment added) made by cutting
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=491356&aid=1099077&group_id=59548
Category: fetcher
Group: None
>Status: Closed
>Resolution: Fixed
Priorit
Chirag Chaman wrote:
- Do a breadth-first crawl (which Nutch does)
- For each page fetched, generate MD5 hash
- IF MD5 hash is in "WebDB"
do not add the data to the segment
mark the Link for deletion
[ ... ]
The above will AT MOST add one page that is bad and all the other will be
i
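A minimal sketch of the hash check described in the steps above, assuming an in-memory set stands in for the "WebDB" lookup. The class and method names are hypothetical and are not part of Nutch. It also shows the weakness raised elsewhere in this thread: any small change to a page yields a different MD5, so near-duplicates are not caught.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not a Nutch class: remembers the MD5 of every page
// body seen so far and reports whether a newly fetched page is a repeat.
final class Md5DuplicateFilter {
    // Assumption: a plain in-memory set stands in for the "WebDB" lookup.
    private final Set<String> seenHashes = new HashSet<>();

    // Returns the MD5 of the content as 32 lowercase hex characters.
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Returns true if this exact content has not been seen before,
    // i.e. the page should be added to the segment.
    boolean shouldAdd(String content) {
        return seenHashes.add(md5Hex(content));
    }
}
```

With this sketch, the first page with a given body is kept and every byte-identical copy after it is rejected, while a copy that differs only in its title is treated as new.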
There appears to be a problem on this page of your site.
On page http://www.nutch.org/docs/nl/
when you click on "donateur",
the link to http://www.nutch.org/docs/nl/donate.html
gives the error: Not found.
As recommended by the Robot Guidelines, this email is to explain
our robot
JC>> we would like to perform Lucene searches over
JC>> several Nutch-generated indexes residing on
JC>> separate servers ... [including] lucene queries
JC>> such as span, fuzzy, range, etc.
DC> It should not be hard to implement these as Nutch
DC> QueryFilter plugins. Thus, one could add
DC>
I just committed a variation of this patch. Instead of allocating a new
Perl5Matcher for each call, I used a ThreadLocal to cache one
Perl5Matcher per thread.
Thanks!
Doug
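The per-thread caching pattern mentioned above can be sketched as follows. This is an illustration only: it uses java.util.regex.Matcher as a stand-in for Perl5Matcher so that it runs without extra libraries, and the class name and pattern are hypothetical, not taken from the committed patch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; java.util.regex.Matcher stands in for
// Perl5Matcher, which is not part of the JDK.
final class PerThreadMatcher {
    // Assumed pattern, chosen only for illustration.
    private static final Pattern HTTP_URL = Pattern.compile("^https?://.*");

    // One Matcher per thread, created lazily the first time a thread asks.
    private static final ThreadLocal<Matcher> MATCHER =
        ThreadLocal.withInitial(() -> HTTP_URL.matcher(""));

    static boolean isHttpUrl(String input) {
        // reset(input) reuses the cached Matcher instead of allocating one.
        return MATCHER.get().reset(input).matches();
    }
}
```

Because a matcher holds mutable state, sharing one instance across threads would be unsafe; keeping one per thread avoids both the sharing and the per-call allocation.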
Piotr Kosiorowski wrote:
Hello,
I had a look at this issue this week and I think I have identified the
problem. I have plan
Piotr,
I got the latest nightly build, but it looks like this patch didn't make it
in. I will go ahead and apply this manually and give it a try.
Thanks!
-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Piotr
Kosiorowski
Sent: Saturday, January 08, 2005
Hi,
I am using SegmentMergeTool and it is taking very long. I want to know how much time to expect for this tool to finish. After reading a segment with 20 entries it just sits there for two days. Is this normal?
Thanks
Kashif
Kashif Khadim wrote:
Hi,
How can I remove URLs whose domain ends with something like ".org" with this tool? If I
try with url, it also deletes sites like http://somesites.com/org/,
and these sites do not end with the domain ".org". I want to have only ".com"
sites for some index.
Current index structure doesn't
Hi,
How can I remove URLs whose domain ends with something like ".org" with this tool? If I try with url, it also deletes sites like http://somesites.com/org/, and these sites do not end with the domain ".org". I want to have only ".com" sites for some index.
Thanks,
Kashif
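The difference between matching ".org" anywhere in a URL and matching it at the end of the host name can be sketched like this. It is a generic illustration using java.net.URI; the class and method names are hypothetical, and it does not show how the tool discussed in this thread works.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: tests the host part of a URL only,
// so a path segment such as "/org/" cannot trigger a match.
final class HostSuffixFilter {
    static boolean hostEndsWith(String url, String suffix) {
        try {
            String host = new URI(url).getHost();
            // getHost() returns null when the URL has no parsable host.
            return host != null
                && host.toLowerCase(Locale.ROOT).endsWith(suffix.toLowerCase(Locale.ROOT));
        } catch (URISyntaxException e) {
            return false; // Unparsable URLs never match.
        }
    }
}
```

Under this check, http://somesites.com/org/ counts as a ".com" site and not as a ".org" site, which is the behavior asked for above.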
My question is really basic, but I cannot find the answer:
which class uses DistributedAnalysisTool and LinkAnalysisTool?
Thanks a lot !
Christophe.
Hi Stephan,
I also didn't find anything regarding PowerPoint parsers. I'm working
on a project using Lucene and Nutch parsers, and I also investigated this.
I actually created an MS Powerpoint this morning, using POI via the code
submitted on this url :
http://www.mail-archive.com/poi-user@jakarta.
Dear Nutch developers,
over the last few weeks I looked into Nutch
and found that MS PowerPoint slides are currently not supported. I am not
sure, but I have also not found any hint on the mailing lists that someone
has already implemented a parser plugin for this document type.
I