I know this is off on a tangent, but:
One huge advantage of filtering in the FetchListTool (or is that the
Generator, I'm still on 0.7?) is that you can generate separate fetch
lists for separate "scopes", or subsets of your crawl data. You can
then give your users some control over which of se
Hi Stefan,
I think these are fine things to be doing. Just two points:
(1) Why not just always pass the NutchConf to the constructor of any
class that needs it, instead of distinguishing between classes that use
one or two configuration parameters and classes that use more than
that? Just for
Would it be worthwhile discussing the pros and cons of having two completely separate Nutch products? If it is, then now is probably the right time to do so.
Regards,
David Wallace.
This email may contain legally privileged information and
is intended only for the addressee. It is not necessarily the
Hi all.
Has anyone written a version of the FetchListTool that only adds a URL
to the fetch list if it complies with a particular Regex URL filter? If
so, would they be prepared to share? I need to do something like this,
but I dislike re-inventing wheels.
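For anyone wanting a starting point, here is a minimal sketch of the idea in plain Java. It is not Nutch code: the class name ScopedFetchList, the rule syntax (a leading + or - followed by a regex, first matching rule wins, no match means reject) and the example.com URLs are all assumptions made for illustration. Wiring something like this into the FetchListTool itself is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: keeps only the candidate URLs
// accepted by an ordered list of +/- regex rules.
public class ScopedFetchList {

    // One rule: include (+) or exclude (-), plus the regex to look for.
    static final class Rule {
        final boolean include;
        final Pattern pattern;

        Rule(String line) {
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            this.include = sign == '+';
            this.pattern = Pattern.compile(line.substring(1));
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    public ScopedFetchList(List<String> ruleLines) {
        for (String line : ruleLines) {
            rules.add(new Rule(line));
        }
    }

    // The first rule whose pattern is found in the URL decides the outcome.
    // A URL that matches no rule is rejected.
    public boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.include;
            }
        }
        return false;
    }

    // Builds the fetch list for one scope from a list of candidate URLs.
    public List<String> select(List<String> candidates) {
        List<String> selected = new ArrayList<>();
        for (String url : candidates) {
            if (accepts(url)) {
                selected.add(url);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Assumed rules: skip images, then accept one intranet host.
        ScopedFetchList scope = new ScopedFetchList(Arrays.asList(
                "-\\.(gif|jpg|png)$",
                "+^http://intranet\\.example\\.com/"));
        List<String> candidates = Arrays.asList(
                "http://intranet.example.com/docs/a.html",
                "http://intranet.example.com/logo.gif",
                "http://www.example.org/");
        for (String url : scope.select(candidates)) {
            System.out.println(url);
        }
    }
}
```

One filter instance per scope, each with its own rule list, would give one fetch list per scope.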
Essentially, I'm doing an intranet-ty
Hello Cao,
The problem is not that the URLs are relative - it's the ? and =
characters. Try changing the line
[EMAIL PROTECTED] to [EMAIL PROTECTED]
and the problem will go away.
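To illustrate the effect, here is a small standalone Java sketch. The two patterns in it are assumptions chosen for the example (the actual lines were stripped from this message by the archive), and the class name and URL are hypothetical. It only shows that an exclusion rule whose character class contains ? and = rejects any URL with a query string, while the same rule without those two characters lets it through.

```java
import java.util.regex.Pattern;

// Hypothetical demonstration, not Nutch code.
public class QueryCharRule {

    // Assumed exclusion patterns, for illustration only.
    static final Pattern WITH_QUERY_CHARS = Pattern.compile("[?*!@=]");
    static final Pattern WITHOUT_QUERY_CHARS = Pattern.compile("[*!@]");

    // An exclusion rule rejects a URL when its pattern occurs anywhere in it.
    static boolean rejected(Pattern exclusion, String url) {
        return exclusion.matcher(url).find();
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/view.jsp?id=42";
        System.out.println("with ? and =:    rejected = " + rejected(WITH_QUERY_CHARS, url));
        System.out.println("without ? and =: rejected = " + rejected(WITHOUT_QUERY_CHARS, url));
    }
}
```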
Kind regards,
David.
From: "cao yuzhong" <[EMAIL PROTECTED]>
To: nutch-dev@incubator.apache.org
Date: Thu, 14 Apr 2
OK Jack, but the details of my analyser aren't particularly exciting.
I need to index a site that has a mixture of documents in English and
Te Reo Maori (indigenous language of New Zealand). Vowels in Te Reo
Maori are sometimes written with short overlines (also known as
macrons), to indicate a
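As a rough illustration of one way to handle macrons at analysis time, the sketch below folds macronised vowels to their plain forms using the standard java.text.Normalizer. This is an assumption about a possible approach, not a description of the analyser discussed in this message, and the class name MacronFolder is hypothetical. It strips every combining mark, not only macrons.

```java
import java.text.Normalizer;

// Hypothetical helper: folds accented characters to their base letters,
// so that a word indexes the same with or without macrons.
public class MacronFolder {

    static String fold(String text) {
        // NFD splits a macronised vowel into the base vowel followed by
        // the combining macron (U+0304).
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // \p{M} matches combining marks; removing them leaves the base letters.
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // \u0101 is "a" with a macron.
        System.out.println(fold("M\u0101ori"));
        System.out.println(fold("wh\u0101nau"));
    }
}
```

Applying the same folding to both indexed text and query text would let either spelling find the same documents.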
Hi all,
I have found a need to do document analysis other than that which is
provided by the NutchDocumentAnalyzer class. I have written my own
Analyzer class, and I need to plug it into the Nutch framework. What
I've done is the following, and I'd like to suggest that it be made part
of the main
Hi all,
I am trying to understand Nutch a little better, so that I can evaluate
its suitability for a project I am soon to embark on. I have been
studying the code in CrawlTool.java (used for an "intranet search").
The line that bothers me is the call to IndexMerger.main(), near the end
of main()