[Nutch-dev] Re: svn commit: r384219 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

2006-03-08 Thread David Wallace
I know this is off on a tangent, but: One huge advantage to filtering in the FetchListTool (or is that the Generator, I'm still on 0.7?) is that you can generate separate fetch lists for separate "scopes", or subsets of your crawl data. You can then give your users some control over which of se

[Nutch-dev] Re: no static NutchConf

2006-01-04 Thread David Wallace
Hi Stefan, I think these are fine things to be doing. Just two points: (1) Why not just always pass the NutchConf to the constructor of any class that needs it? Instead of distinguishing between the case of whether the class will use 1 or 2 configuration parameters; or more than that. Just for

[Nutch-dev] Re: version branches / two products

2005-12-15 Thread David Wallace
Would it be worthwhile discussing the pros and cons of having two completely separate Nutch products? If it is, then now is probably the right time to do so. Regards, David Wallace.

[Nutch-dev] Filter at fetch list time

2005-04-26 Thread David Wallace
Hi all. Has anyone written a version of the FetchListTool that only adds a URL to the fetch list if it complies with a particular Regex URL filter? If so, would they be prepared to share? I need to do something like this, but I dislike re-inventing wheels. Essentially, I'm doing an intranet-ty

Re: [Nutch-dev] Crawl-urlfilter cann't deals with relative urls appropriately ??

2005-04-14 Thread David Wallace
Hello Cao, The problem is not that the URLs are relative - it's the ? and = characters. Try changing the line [EMAIL PROTECTED] to [EMAIL PROTECTED] and the problem will go away. Kind regards, David. From: "cao yuzhong" <[EMAIL PROTECTED]> To: nutch-dev@incubator.apache.org Date: Thu, 14 Apr 2

Re: [Nutch-dev] Feature request - pluggable Analyzer

2005-04-13 Thread David Wallace
OK Jack, but the details of my analyser aren't particularly exciting. I need to index a site that has a mixture of documents in English and Te Reo Maori (indigenous language of New Zealand). Vowels in Te Reo Maori are sometimes written with short overlines (also known as macrons), to indicate a

[Nutch-dev] Feature request - pluggable Analyzer

2005-04-11 Thread David Wallace
Hi all, I have found a need to do document analysis other than that which is provided by the NutchDocumentAnalyzer class. I have written my own Analyzer class, and I need to plug it into the Nutch framework. What I've done is the following, and I'd like to suggest that it be made part of the main

[Nutch-dev] Question re index merge call in crawl tool

2005-04-11 Thread David Wallace
Hi all, I am trying to understand Nutch a little better, so that I can evaluate its suitability for a project I am soon to embark on. I have been studying the code in CrawlTool.java (used for an "intranet search"). The line that bothers me is the call to IndexMerger.main(), near the end of main()