John:
This is a very good idea -- and one that we currently implement as a "hack"
(i.e. it is very slow).
Here are a few things that we faced:
1. At times we need to reprocess rules. Example:
- Run URL filter and remove URL
- Run RegexURL filter to transform passed url to another URL
- Now, it may be required to run URL filter again
Thus, having a way to reject in RegexURL would be nice. That would
also make URLFilter redundant.
2. 3xx forwards -- they seem to get by the filters, as the first URL gets
recorded. There needs to be a way for a 3xx forward to junk the old URL
and take the new one, or keep both (user defined). The resulting URL
should then be checked against the filters. Hence the need to be able to
call the plugin from protocol-http.
3. As rules grow, filtering becomes slow -- prior to using Nutch we were
using a commercial RETE rules engine into which we had loaded the REs as
rules. This improved speed immensely. Maybe overkill for now. Below is a
simpler way to do this.
Here's what we're planning on building -- is this helpful? How would this
fit in with plugins?
<GROUP> Rule Group Name
  <RULE>
    <MATCH> RE to match </MATCH>
    <ACTION> Discard/Substitution/GoTo </ACTION>
    <SUBSTITUTION> Substitution </SUBSTITUTION>
    <GOTO>RuleGroupToSendProcess</GOTO>
    <STOP> 0 or 1 - 0 would mean keep processing more rules </STOP>
  </RULE>
</GROUP>
Here's how this would work.
- Each file has a "Default" group, under which all rules are kept.
- For more advanced rules, one could send control to another RuleGroup on
match (helpful when you want specific groups of rules for a certain domain,
extension, etc.) -- this will cut down the number of rules to look at.
- Stop either exits upon a match (1) or keeps processing more rules in the
same group (0).
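To make the control flow above concrete, here is a rough Java sketch of how
such rule groups might be evaluated. It is only an illustration under some
assumptions, not what we have built: the class and method names (RuleGroups,
Rule, filter), the hop limit on GoTo chains, and the example rules and URLs
are all made up for this mail. It also assumes that Discard always ends
processing, that GoTo hands control to the target group without returning,
and that Stop only matters for Substitution rules.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical evaluator for the <GROUP>/<RULE> layout sketched above.
class RuleGroups {
    enum Action { DISCARD, SUBSTITUTION, GOTO }

    // One <RULE>. 'arg' is the replacement text for SUBSTITUTION,
    // the target group name for GOTO, and unused for DISCARD.
    static final class Rule {
        final Pattern match;
        final Action action;
        final String arg;
        final boolean stop;

        Rule(String regex, Action action, String arg, boolean stop) {
            this.match = Pattern.compile(regex);
            this.action = action;
            this.arg = arg;
            this.stop = stop;
        }
    }

    // Assumed safety limit so that a GoTo cycle cannot loop forever.
    private static final int MAX_HOPS = 16;

    private final Map<String, List<Rule>> groups = new HashMap<>();

    void add(String group, Rule rule) {
        groups.computeIfAbsent(group, k -> new ArrayList<>()).add(rule);
    }

    // Returns the (possibly rewritten) URL, or null if a rule discards it.
    String filter(String url) {
        String group = "Default";
        int hops = 0;
        while (true) {
            String next = null;
            List<Rule> rules = groups.getOrDefault(group, Collections.<Rule>emptyList());
            for (Rule r : rules) {
                Matcher m = r.match.matcher(url);
                if (!m.find()) {
                    continue;                 // rule does not apply
                }
                if (r.action == Action.DISCARD) {
                    return null;              // reject the URL outright
                }
                if (r.action == Action.GOTO) {
                    next = r.arg;             // hand control to another group
                    break;
                }
                url = m.replaceAll(r.arg);    // SUBSTITUTION
                if (r.stop) {
                    break;                    // Stop = 1: leave this group
                }
            }
            if (next == null) {
                return url;
            }
            if (++hops > MAX_HOPS) {
                throw new IllegalStateException("GoTo loop near group " + next);
            }
            group = next;
        }
    }

    public static void main(String[] args) {
        RuleGroups rg = new RuleGroups();
        rg.add("Default", new Rule("\\.(gif|jpg)$", Action.DISCARD, null, true));
        rg.add("Default", new Rule("^http://example\\.com/", Action.GOTO, "ExampleCom", true));
        rg.add("ExampleCom", new Rule("[?&]sessionid=[^&]*", Action.SUBSTITUTION, "", true));

        System.out.println(rg.filter("http://example.com/a?sessionid=42")); // http://example.com/a
        System.out.println(rg.filter("http://example.com/logo.gif"));       // null
    }
}
```

Because a Substitution rule can be followed by further rules in the same
group (Stop = 0), a rewritten URL can be rejected later in the same pass,
which would cover the reprocessing case in point 1. The same filter() call
could in principle be applied to the target of a 3xx forward (point 2), but
how that would be wired into protocol-http is not shown here.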
-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of John X
Sent: Monday, January 31, 2005 7:53 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: [Nutch-dev] make URLFilter as plugin
Hi, All,
I propose to define plugin extension point for URLFilter, and convert
current RegexURLFilter.java, PrefixURLFilter.java, etc., into plugins.
However there is one requirement, different from other plugin extensions: we
should be able to specify the order by which plugins are loaded and applied.
I have not checked, but I assume, by default, we can always name plugins in
alphabetical order.
Stefan: any better way to do this?
If no one thinks this is a bad idea, I am going to start work on it right
away.
John
-------------------------------------------------------
This SF.Net email is sponsored by: IntelliVIEW -- Interactive Reporting Tool
for open source databases. Create drag-&-drop reports. Save time by over
75%! Publish reports on the web. Export to DOC, XLS, RTF, etc.
Download a FREE copy at http://www.intelliview.com/go/osdn_nl
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers