Suppose we have to fetch 3 pages.
Page A is http://something/login.php
Page B is http://yyy/rrr/ which, when fetched, redirects to page A
Page C is http://yyy/ttt/ which, when fetched, redirects to page A
When fetching A, B, and C, the fetcher will fetch
A
B
A
C
A
Is there any way to prevent the repeated fetching of page A?
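A minimal sketch of one way to avoid this, assuming you can hook into the step that resolves redirect targets (the class and method names below are illustrative only, not Nutch API): keep a set of URLs that have already been fetched and skip a redirect target that is already in it.

    import java.util.HashSet;
    import java.util.Set;

    // Illustration only: remember every URL that has already been fetched so
    // that a page reached through several redirects (page A above) is fetched
    // at most once.
    public class RedirectDeduper {
        private final Set<String> fetched = new HashSet<String>();

        /** Returns true the first time a URL is seen, false afterwards. */
        public synchronized boolean shouldFetch(String url) {
            return fetched.add(url);
        }
    }

    // Usage sketch: before following a redirect to "target",
    //   if (deduper.shouldFetch(target)) { fetch(target); }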
Hello all,
Some time ago I sent a patch for:
http://issues.apache.org/jira/browse/NUTCH-7 (analyze runs for an
excessive amount of time and creates huge temp files until it runs out
of disk space (if you let the db grow)).
I know PageRank computation is not actively maintained and we will
probably
Hi,
since my first experiments were successful, I am now starting to
implement Nutch in my Website Visualisation Tool.
So here are my first questions:
1. I put a class into my project that works similarly to the
CrawlerTool.java main class.
This works fine if you have written the urls
[ http://issues.apache.org/jira/browse/NUTCH-78?page=all ]
Matthias Jaekle updated NUTCH-78:
---------------------------------
Attachment: de.properties.tgz
anchors_de.properties, cached_de.properties, explain_de.properties,
search_de.properties, text_de.properties
> German texts on website
German texts on website
---
Key: NUTCH-78
URL: http://issues.apache.org/jira/browse/NUTCH-78
Project: Nutch
Type: Improvement
Components: searcher
Reporter: Matthias Jaekle
Priority: Minor
Attachments: de.properties.tgz
The German pr
I was using a nightly build that Piotr had given me, the nutch-nightly.jar
(actually it was nutch-dev0.7.jar or something of that nature). I tested it on
the Windows platform. I had 5 machines running it: 2 at 100 Mbit, both quad P3
Xeon, 1 Pentium 4 3 GHz with Hyper-Threading, 1 AMD Athlon XP 2600+ and 1
Hello,
In my experience it is very important to use anchor text, giving it
quite a high boost. It allows me to return http://www.aa.com when a user
searches for "American Airlines" - without using anchor text this was
impossible to achieve - a lot of sites (spam or not) with "american
airlines" in url a
I think it was already answered by Doug earlier in this thread.
"... Yes. It is alpha-quality, not yet release-worthy, but it works. If
you're an experienced Java developer, I'd encourage you to give it a
try. If you're a user who doesn't want to look beyond the config files,
then I'd wait a bit
Is the mapreduce working yet?
I would also like to test it.
-J
- Original Message -
From: "Piotr Kosiorowski" <[EMAIL PROTECTED]>
To:
Sent: Friday, August 05, 2005 8:06 AM
Subject: Re: Fw: Re: near-term plan
> I am not sure what you exactly did in this test but I understand you
> were u
I am not sure exactly what you did in this test, but I understand you
were using the jar file prepared by me (it was nutch from trunk + ndfs
patches). As these patches were applied by Andrzej some time ago, we
can assume you were using NDFS code from trunk.
Because a lot of work went into the mapreduce bra
-- Forwarded Message ---
From: "webmaster" <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Sent: Thu, 4 Aug 2005 19:42:53 -0500
Subject: Re: near-term plan
I was using a nightly build that Piotr had given me, the nutch-nightly.jar
(actually it was nutch-dev0.7.jar or something
Hello,
A very basic facility seems to be missing in Nutch. If I have a list of
2000 URLs in the Nutch DB and want to ignore external links, I have to
build a regex-filter covering the thousands of different domains I want
to crawl. There is no parameter to crawl only those domains and ignore
external links.
At
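One workaround, sketched below, is to generate the regex-filter entries from the seed list instead of writing them by hand. This is only an illustration: the file names, the http-only patterns, and the exact escaping are assumptions, and the output would still need review before being used as a regex-urlfilter.txt.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.net.URL;
    import java.util.LinkedHashSet;
    import java.util.Set;

    // Illustration only: read a seed list (one URL per line) and print
    // regex-urlfilter-style rules that accept the seeded hosts and reject
    // everything else, so external links are effectively ignored.
    public class SeedHostFilter {
        public static void main(String[] args) throws Exception {
            Set<String> hosts = new LinkedHashSet<String>();
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            for (String line; (line = in.readLine()) != null; ) {
                line = line.trim();
                if (line.length() == 0) continue;
                hosts.add(new URL(line).getHost());
            }
            in.close();
            for (String host : hosts) {
                // escape literal dots so each host matches exactly
                System.out.println("+^http://" + host.replace(".", "\\.") + "/");
            }
            System.out.println("-.");  // reject every other URL
        }
    }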