John:

We did some research and ran some tests on our system to better understand
the needs.

NOTE: The findings below are based on keeping 90% OF NUTCH INSTALLS in mind
-- simple, out of the box, small/medium-size index.

In short, using a rules engine for URL filtering is a waste of resources.
The rules engine should only be used for filtering pages/URLs based on
content -- for example, removing a page because it matched adult content, is
less than 2K, has a low text/tag ratio, etc.

Here are the reasons why it's bad for URL filtering:

- First, let's get JESS out of the picture: it's not open source, and its use
in anything even close to commercial requires a commercial license ($$$).

- The startup time involved in loading a rules engine is high, and they are
not lightweight either (500-800K memory footprint).

- Most of the rules for URL filtering are regular expressions (REs), and REs
execute pretty fast in Java. Thus, we did not see a substantial increase in
performance until we went past 100+ rules. Even then, given the startup time
for the engine, plain simple stuff won out. NOTE: speed can be improved by
keeping the engine loaded in memory at all times (instead of loading it each
time the fetch process is run. Given that WE do frequent indexing with
segments of approx. 1,000 pages, we'd need it in memory at all times. This
may not be the case if segments are 100,000 or so). Either way, introducing
a rules engine would require a change to the way the plug-in is called, as
you need to create the Rete network upon startup/first call.

- A rules engine is best when you have a lot of 'if..then..else' statements
and the "facts" are unknown until runtime -- which is why it's great for
content filtering and bad for URL filtering. With URLs, we know what they
are before we even begin; with content, we get all the details during
parsing and need to make a decision at that point.

- Even for the regex filter, unless we are talking about 100+ filters, the
startup time and the requirement to change the code make other, simpler
options more viable -- for example, the XML-based ACTION/GROUP option I
described earlier.
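To make the comparison concrete, here is a minimal sketch of the "plain
simple stuff" that won out: an ordered list of accept/deny regular
expressions where the first match wins. The class name and rules are
hypothetical, not Nutch's actual regex filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of a plain regex URL filter: an ordered list of
// accept/deny patterns, first match wins -- no rules engine required.
public class SimpleRegexUrlFilter {
    private static class Rule {
        final boolean accept;
        final Pattern pattern;
        Rule(boolean accept, String regex) {
            this.accept = accept;
            this.pattern = Pattern.compile(regex);
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    public void addRule(boolean accept, String regex) {
        rules.add(new Rule(accept, regex));
    }

    // Returns the URL if an accepting rule matches first, null if rejected
    // (or if no rule matches -- reject by default).
    public String filter(String url) {
        for (Rule r : rules) {
            if (r.pattern.matcher(url).find()) {
                return r.accept ? url : null;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        SimpleRegexUrlFilter f = new SimpleRegexUrlFilter();
        f.addRule(false, "\\.(gif|jpg|zip)$"); // deny binary extensions
        f.addRule(true, "^http://");           // accept plain http URLs
        System.out.println(f.filter("http://example.com/page.html")); // kept
        System.out.println(f.filter("http://example.com/img.jpg"));   // null
    }
}
```

The pre-compiled `Pattern` objects are the whole trick: compilation happens
once at load time, and each lookup is just a linear scan of fast matchers.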


So, how does one attack the problem (assuming you're looking at a larger
deployment)?

We found that the bottleneck to a faster crawl and index is due to the
following:
1. WebDB Size
2. Recrawling Blocked URLs (not remembering domain status across crawls)

Point 1 should be intuitive -- the larger the DB, the more time it takes to
sort. The second point relates to the fact that the fetcher does not
remember the status of a domain across crawls -- if you are blocked from a
particular domain, future fetch lists should not even contain URLs from that
domain/directory. Another issue is when a domain is down -- this status
should also be stored for a period of time (say, 12 hours).

Also, to reduce the size of the WebDB and store only "fetchable" URLs in
it, I think we should add only those links to the DB that would otherwise
pass the filters specified by the user (i.e., run the filters before adding
links to the DB).
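A rough sketch of that idea, with hypothetical names (the `UnaryOperator`
stand-in mirrors the URLFilter contract of returning the URL to keep it, or
null to drop it):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch: run the user's URL filters over outlinks *before*
// they are added to the WebDB, so the DB only ever holds fetchable URLs.
public class PreDbLinkFilter {
    // Stand-in for a URLFilter: returns the URL to keep it, null to drop it.
    private final List<UnaryOperator<String>> filters = new ArrayList<>();

    public void addFilter(UnaryOperator<String> f) {
        filters.add(f);
    }

    // A link survives only if every filter in the chain passes it;
    // rejected links never reach the WebDB at all.
    public List<String> filterOutlinks(List<String> outlinks) {
        List<String> kept = new ArrayList<>();
        outer:
        for (String url : outlinks) {
            String u = url;
            for (UnaryOperator<String> f : filters) {
                u = f.apply(u);
                if (u == null) continue outer; // rejected before DB insert
            }
            kept.add(u);
        }
        return kept;
    }
}
```

The point is simply to move the existing filter chain upstream of the DB
insert, rather than filtering at fetch-list generation time.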

To achieve the above, we're creating a simple external database, which runs
as a service and keeps the status of domains. The DB will serve two
functions: (a) act more or less as a cache for robots.txt files and down
domains; (b) provide users with a way to block domains/directories.

The goal is to catch and remove non-crawlable URLs before they make it to
the fetchlist, or, better yet, before they are added to the WebDB. A simple
Java API will allow a check to be made for a URL (think of this like a DNS
server).
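As a sketch of what that DNS-like check might look like (all class and
method names here are illustrative assumptions, not an existing interface),
the service boils down to a table of domain statuses with expiry times:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the domain-status service described above: a
// DNS-like lookup answering "may I fetch this domain right now?".
public class DomainStatusService {
    public enum Status { OK, BLOCKED, DOWN }

    private static class Entry {
        final Status status;
        final long expiresAt; // epoch millis; Long.MAX_VALUE = never expires
        Entry(Status status, long expiresAt) {
            this.status = status;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<String, Entry> table = new ConcurrentHashMap<>();

    // User-supplied permanent block (function b in the text).
    public void block(String domain) {
        table.put(domain, new Entry(Status.BLOCKED, Long.MAX_VALUE));
    }

    // Remember a down domain for a while, e.g. 12 hours (function a).
    public void markDown(String domain, long ttlMillis) {
        table.put(domain,
                new Entry(Status.DOWN, System.currentTimeMillis() + ttlMillis));
    }

    // The check the fetcher (or the WebDB update) would call per URL.
    public Status lookup(String domain) {
        Entry e = table.get(domain);
        if (e == null || System.currentTimeMillis() > e.expiresAt) {
            table.remove(domain); // expired or unknown: treat as fetchable
            return Status.OK;
        }
        return e.status;
    }
}
```

In a real deployment the table would live in the external hsql/Berkeley DB
service rather than in memory, but the lookup contract would be the same.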

I would appreciate your (and anyone else's) input on any other needs this
should incorporate. For simplicity and development speed, this will be built
on hsql or Berkeley DB as the underlying database (unless there is a better
option; both of these are GPL).



 
-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of John X
Sent: Tuesday, February 08, 2005 6:02 PM
To: Chirag Chaman
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: [Nutch-dev] make URLFilter as plugin

On Tue, Feb 08, 2005 at 09:41:28AM -0500, Chirag Chaman wrote:
> John:
> 
> We tested with QuickRules (YasuTech).
> The only non-commercial one I've used is Jess -- though it may have 
> license issues.
> 
> I know there is a big move to get open source XML rules engine made, 
> especially since the RFC is now stabilized, so there should be some 
> strong products coming out (hopefully soon).
> 
> I think for now, something simple that incorporates GROUP and STOP 
> should be sufficient for 80% of the needs (80/20 rule), as it will be 
> flexible and fast (and you can skip over unnecessary rules).
>  
> If you need any help -- please let me know (I'm not the best coder 
> around, but can definitely have one of my engineers follow your lead).

The current interface, URLFilter.java, may be too simple.
If you or your engineers can make a suggestion/evaluation for typical Nutch
needs, that would be great. The best would be some sample code with Jess.
This is only about URL filtering.

Thanks,

John




-------------------------------------------------------
SF email is sponsored by - The IT Product Guide Read honest & candid reviews
on hundreds of IT Products from real users.
Discover which products truly live up to the hype. Start reading now.
http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers



