[Nutch-dev] fetching redirect bug?

2005-08-05 Thread EM
Suppose we have to fetch 3 pages: page A is http://something/login.php; page B is http://yyy/rrr/, which, when fetched, redirects to page A; page C is http://yyy/ttt/, which, when fetched, redirects to page A. When fetching A, B, C, the fetcher will fetch A, B, A, C, A. Is there any way to prevent the r
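
The behaviour described in this message can be reproduced with a small simulation. This is a minimal sketch, not Nutch's fetcher: the function `fetch_order`, its `redirects` mapping (standing in for real HTTP redirect responses), and the `max_hops` guard are all hypothetical names introduced for illustration. It shows how remembering already-requested URLs would turn the sequence A, B, A, C, A into A, B, C.

```python
def fetch_order(urls, redirects, dedupe=True, max_hops=5):
    """Return the sequence of URLs that would actually be requested.

    `redirects` maps a URL to its redirect target (hypothetical data that
    stands in for real HTTP 3xx responses). `max_hops` is an assumed guard
    against redirect loops.
    """
    fetched = set()   # URLs already requested during this run
    order = []        # request sequence, in order
    for url in urls:
        current, hops = url, 0
        while current is not None and hops <= max_hops:
            if dedupe and current in fetched:
                break  # already requested once; do not fetch it again
            fetched.add(current)
            order.append(current)
            current = redirects.get(current)  # None when there is no redirect
            hops += 1
    return order


A = "http://something/login.php"
B = "http://yyy/rrr/"
C = "http://yyy/ttt/"
redirects = {B: A, C: A}

without_dedupe = fetch_order([A, B, C], redirects, dedupe=False)
with_dedupe = fetch_order([A, B, C], redirects, dedupe=True)
```

Here `without_dedupe` follows every redirect and so requests page A three times, while `with_dedupe` stops as soon as a redirect points at a URL that has already been requested.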

[Nutch-dev] NUTCH-7 bug

2005-08-05 Thread Piotr Kosiorowski
Hello all, Some time ago I sent a patch for: http://issues.apache.org/jira/browse/NUTCH-7 (analyze runs for an excessive amount of time and creates huge temp files until it runs out of disk space (if you let the db grow)). I know PageRank computation is not actively maintained and we will probably

[Nutch-dev] Crawling directly from URL and Questions about using the index

2005-08-05 Thread Nils Hoeller
Hi, since my first experiments were successful, I am now starting to implement Nutch in my Website Visualisation Tool. So I now have my first questions: 1. I put a class into my project that works similarly to the CrawlerTool.java main class. This works fine if you have written the URLs

[Nutch-dev] [jira] Updated: (NUTCH-78) German texts on website

2005-08-05 Thread Matthias Jaekle (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-78?page=all ] Matthias Jaekle updated NUTCH-78: - Attachment: de.properties.tgz anchors_de.properties, cached_de.properties, explain_de.properties, search_de.properties, text_de.properties > German texts on w

[Nutch-dev] [jira] Created: (NUTCH-78) German texts on website

2005-08-05 Thread Matthias Jaekle (JIRA)
German texts on website --- Key: NUTCH-78 URL: http://issues.apache.org/jira/browse/NUTCH-78 Project: Nutch Type: Improvement Components: searcher Reporter: Matthias Jaekle Priority: Minor Attachments: de.properties.tgz The German pr

[Nutch-dev] Re: near-term plan

2005-08-05 Thread webmaster
I was using a nightly build that Piotr had given me, the nutch-nightly.jar (actually it was nutch-dev0.7.jar or something of that nature). I tested it on the Windows platform. I had 5 machines running it: 2 at 100 Mbit, both quad P3 Xeon, 1 Pentium 4 3 GHz with Hyper-Threading, 1 AMD Athlon XP 2600+ and 1

[Nutch-dev] Re: Strange search results

2005-08-05 Thread Howie Wang
Hello, In my experience it is very important to use anchor text, giving it quite a high boost. It allows me to return http://www.aa.com when a user searches for "American Airlines" - without using anchor text it was impossible to achieve - a lot of sites (spam or not) with "american airlines" in url a

[Nutch-dev] Re: Strange search results

2005-08-05 Thread Piotr Kosiorowski
Hello, In my experience it is very important to use anchor text, giving it quite a high boost. It allows me to return http://www.aa.com when a user searches for "American Airlines" - without using anchor text it was impossible to achieve - a lot of sites (spam or not) with "american airlines" in url an

[Nutch-dev] Re: Fw: Re: near-term plan

2005-08-05 Thread Piotr Kosiorowski
I think it was already answered by Doug earlier in this thread. "... Yes. It is alpha-quality, not yet release-worthy, but it works. If you're an experienced Java developer, I'd encourage you to give it a try. If you're a user who doesn't want to look beyond the config files, then I'd wait a bit

[Nutch-dev] Re: Fw: Re: near-term plan

2005-08-05 Thread Jay Pound
Is the mapreduce working yet? I would also like to test it. -J - Original Message - From: "Piotr Kosiorowski" <[EMAIL PROTECTED]> To: Sent: Friday, August 05, 2005 8:06 AM Subject: Re: Fw: Re: near-term plan > I am not sure what you exactly did in this test but I understand you > were u

[Nutch-dev] Re: Fw: Re: near-term plan

2005-08-05 Thread Piotr Kosiorowski
I am not sure what exactly you did in this test, but I understand you were using the jar file prepared by me (it was nutch from trunk + ndfs patches). As these patches were applied by Andrzej some time ago, we can assume you were using NDFS code from trunk. Because a lot of work went into mapreduce bra

[Nutch-dev] Fw: Re: near-term plan

2005-08-05 Thread webmaster
-- Forwarded Message --- From: "webmaster" <[EMAIL PROTECTED]> To: nutch-dev@lucene.apache.org Sent: Thu, 4 Aug 2005 19:42:53 -0500 Subject: Re: near-term plan I was using a nightly build that Piotr had given me, the nutch-nightly.jar (actually it was nutch-dev0.7.jar or something

[Nutch-dev] Ignore external links from crawled domains

2005-08-05 Thread Christophe Noel
Hello, A very basic facility seems to be missing in Nutch. If I have a list of 2000 URLs in the Nutch DB and want to ignore external links, I have to build a regex filter with thousands of different domains I want to crawl. There is no parameter to crawl only those domains and ignore external links. At
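
The facility requested here amounts to deriving the allowed hosts from the seed list instead of writing one regex per domain. The following is a minimal sketch of that idea, not a Nutch feature or API: `seed_hosts` and `keep_internal` are hypothetical helper names, and the example URLs are made up.

```python
from urllib.parse import urlsplit


def seed_hosts(seed_urls):
    """Collect the set of hostnames that appear in the seed URL list."""
    hosts = set()
    for url in seed_urls:
        host = urlsplit(url).hostname  # lower-cased; None if the URL has no host
        if host:
            hosts.add(host)
    return hosts


def keep_internal(links, hosts):
    """Keep only the outlinks whose host is one of the seed hosts."""
    return [link for link in links if urlsplit(link).hostname in hosts]


# Hypothetical seed list and outlinks, for illustration only.
seeds = ["http://example.org/", "http://docs.example.net/start.html"]
outlinks = [
    "http://example.org/about.html",      # same host as a seed: kept
    "http://other.example.com/page",      # external host: dropped
    "http://DOCS.example.net/guide",      # host matching is case-insensitive: kept
    "mailto:someone@example.org",         # no host at all: dropped
]

allowed = seed_hosts(seeds)
internal = keep_internal(outlinks, allowed)
```

This matches on exact hostnames only. Treating `www.example.org` and `example.org` as the same site would need an extra normalisation step, which is left out of this sketch.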