Hi Chirag,

Thanks for your detailed report. Do you think a rules engine would be a good
fit for UrlNormalizer? Could Nutch benefit from a rules engine in other ways?
John

On Mon, Feb 14, 2005 at 04:08:19PM -0500, Chirag Chaman wrote:
> John:
>
> We did some research and ran some tests on our system to better understand
> the needs.
>
> NOTE: The findings below are based on keeping 90% OF NUTCH INSTALLS in
> mind -- simple, out of the box, small/medium-size index.
>
> In short, using a rules engine for URL filtering is a waste of resources.
> The rules engine should only be used for filtering pages/URLs based on
> content -- for example, removing a page because it matched adult content,
> was less than 2K, had a low text/tag ratio, etc.

Do you have any benchmark numbers? How about using a rules engine for the
UrlNormalizer?

> Here are the reasons why it's bad for URL filtering:
>
> - First, let's get JESS out of the picture: it's not open source, and its
> use in anything even close to commercial requires a commercial license
> ($$$).
>
> - The startup time involved with loading the rules engine is high, and
> rules engines are not lightweight either (500-800K memory footprint).

Is there any lightweight implementation with a smaller memory footprint?

> - Most of the rules for URL filtering are regular expressions (REs), and
> REs execute pretty fast in Java. Thus, we did not see a substantial
> increase in performance unless we went to 100+ rules. Even then, with the
> startup time for the engine, the plain simple approach won out. NOTE:
> speed can be improved by keeping the engine loaded in memory at all times
> (instead of loading it each time the fetch process is run). Given that WE
> do frequent indexing with segments of approx. 1000 pages, we'd need it in
> memory at all times. This may not be the case if segments are 100,000 or
> so. Either way, introducing a rules engine would require a change to the
> way the plug-in is called, as you need to create the RETE network upon
> startup/first call.
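For concreteness, the plain-regex approach being compared against the rules
engine can be sketched in a few lines of Java -- a hypothetical minimal
first-match filter, not the actual Nutch RegexURLFilter plugin:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/**
 * Minimal sketch of regex-based URL filtering: each rule is a compiled
 * Pattern plus an accept/deny flag, and the first matching rule decides.
 * Hypothetical example class, not the Nutch RegexURLFilter plugin itself.
 */
public class SimpleRegexUrlFilter {
    private static class Rule {
        final Pattern pattern;
        final boolean accept;
        Rule(String regex, boolean accept) {
            this.pattern = Pattern.compile(regex);
            this.accept = accept;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    public void addRule(String regex, boolean accept) {
        rules.add(new Rule(regex, accept));
    }

    /** Returns the URL if accepted, or null if it is filtered out. */
    public String filter(String url) {
        for (Rule r : rules) {
            if (r.pattern.matcher(url).find()) {
                return r.accept ? url : null;
            }
        }
        return null; // no rule matched: reject by default
    }

    public static void main(String[] args) {
        SimpleRegexUrlFilter f = new SimpleRegexUrlFilter();
        f.addRule("\\.(gif|jpg|png|css|js)$", false); // deny image/style files
        f.addRule("^https?://", true);                // accept plain http(s) URLs
        System.out.println(f.filter("http://example.com/index.html")); // prints the URL
        System.out.println(f.filter("http://example.com/logo.gif"));   // prints null
    }
}
```

Since the patterns are compiled once and reused per URL, there is no per-call
startup cost, which matches the observation that plain REs win out below
100+ rules.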
> - The rules engine is best when using a lot of 'if..then..else' statements
> and the "facts" are unknown until runtime (which is why it's great for
> content filtering and bad for URL filtering). With URLs we know what they
> are before we even begin; with content we get all the details during
> parsing and need to make a decision at that point.
>
> - Even for the regex filter, unless we are talking about 100+ filters, the
> startup time and the requirement to change the code make other, simpler
> options more viable -- for example, the XML-based ACTION/GROUP option I
> described earlier.
>
> So, how does one attack the problem (assuming you're looking at a larger
> deployment)?
>
> We found that the bottlenecks to a faster crawl and index are the
> following:
> 1. WebDB size
> 2. Recrawling blocked URLs (not remembering domain status across crawls)
>
> Point 1 should be intuitive -- the larger the DB, the more time it takes
> to sort. The second point relates to the fact that the fetcher does not
> remember the status of a domain across crawls -- if you are blocked from a
> particular domain, future fetchlists should not even contain URLs from
> that domain/directory. Another issue is when a domain is down -- this
> should also be stored for a period of time (say 12 hours).
>
> Also, to reduce the size of the WebDB and only store "fetchable" URLs in
> it, I think we should only add those links to the DB that would otherwise
> pass the filters specified by the user (i.e. run the filters before adding
> links to the DB).
>
> To achieve the above, we're creating a simple external database, which
> runs like a service and keeps the status of domains. The DB will serve two
> functions: a. act more or less as a cache for robots.txt files and down
> domains; b. provide users with a way to block domains/directories.
>
> The goal is to catch and remove non-crawlable URLs before they make it to
> the fetchlist, or better yet before they get added to the WebDB.
> A simple Java API will allow for a check to be made for a URL (think of
> this like a DNS server).
>
> I would appreciate your (and anyone else's) input on any other needs this
> should incorporate. This will be created using hsql or Berkeley DB (unless
> there is a better option; both of these are GPL) as the underlying
> database, for simplicity and development speed.
>
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of John X
> Sent: Tuesday, February 08, 2005 6:02 PM
> To: Chirag Chaman
> Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: Re: [Nutch-dev] make URLFilter as plugin
>
> On Tue, Feb 08, 2005 at 09:41:28AM -0500, Chirag Chaman wrote:
> > John:
> >
> > We tested with QuickRules (YasuTech).
> > The only non-commercial one I've used is Jess -- though it may have
> > license issues.
> >
> > I know there is a big move to get an open-source XML rules engine made,
> > especially since the RFC has now stabilized, so there should be some
> > strong products coming out (hopefully soon).
> >
> > I think for now, something simple that incorporates GROUP and STOP
> > should be sufficient for 80% of the needs (80/20 rule), as it will be
> > flexible and fast (and you can skip over unnecessary rules).
> >
> > If you need any help -- please let me know (I'm not the best coder
> > around, but I can definitely have one of my engineers follow your lead).
>
> The current interface URLFilter.java may be too simple.
> If you or your engineers can make a suggestion/evaluation for typical
> Nutch needs, that would be great. The best would be some sample code with
> Jess. This is only about URL filtering.
>
> Thanks,
>
> John
>
> -------------------------------------------------------
> SF email is sponsored by - The IT Product Guide
> Read honest & candid reviews on hundreds of IT Products from real users.
> Discover which products truly live up to the hype. Start reading now.
> http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
> _______________________________________________
> Nutch-developers mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/nutch-developers
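The domain-status service Chirag proposes (a DNS-like lookup that caches
blocked and down domains so their URLs never reach the fetchlist) could
expose an API along these lines. This is a hypothetical in-memory sketch
with invented class and method names; the actual service would persist its
state in hsql or Berkeley DB as described above:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch of the proposed domain-status service: records
 * whether a domain is blocked (e.g. by robots.txt or by the user) or
 * temporarily down, so fetchlist generation can skip its URLs. "Down"
 * entries expire after a TTL (12 hours in the proposal); blocked entries
 * do not expire. A real implementation would be backed by hsql or
 * Berkeley DB rather than an in-memory map.
 */
public class DomainStatusCache {
    public enum Status { OK, BLOCKED, DOWN }

    private static class Entry {
        final Status status;
        final long expiresAtMillis; // Long.MAX_VALUE = never expires
        Entry(Status status, long expiresAtMillis) {
            this.status = status;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> byDomain = new ConcurrentHashMap<>();
    private final long downTtlMillis;

    public DomainStatusCache(long downTtlMillis) {
        this.downTtlMillis = downTtlMillis;
    }

    /** Permanently block a domain (robots.txt disallow, user blocklist). */
    public void markBlocked(String domain) {
        byDomain.put(domain, new Entry(Status.BLOCKED, Long.MAX_VALUE));
    }

    /** Record that a domain is down; the entry expires after the TTL. */
    public void markDown(String domain) {
        byDomain.put(domain,
            new Entry(Status.DOWN, System.currentTimeMillis() + downTtlMillis));
    }

    /** Look up a domain, like a DNS query; expired entries read as OK. */
    public Status lookup(String domain) {
        Entry e = byDomain.get(domain);
        if (e == null || System.currentTimeMillis() > e.expiresAtMillis) {
            if (e != null) {
                byDomain.remove(domain, e); // drop only the expired entry
            }
            return Status.OK;
        }
        return e.status;
    }
}
```

Fetchlist generation would then call `lookup()` on each candidate URL's host
and drop anything not reporting OK, before the URL is ever added to the WebDB.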
