Just fixed this (and made some other improvements as well). See: http://issues.apache.org/jira/browse/NUTCH-410
Hope this is useful; feedback welcome.

-Doug


Neal Richter-3 wrote:
>
> Doug,
>
> I think it sounds like a good idea.  It eliminates the need to order the
> rules precisely...
>
> We don't iterate them in HtDig and it's been on my todo list for a while
> as well.
>
> I would iterate until no matches, some max iteration number, or the URL is
> obviously junk.
>
> For the max iteration number I would use the number of rewrite rules you
> have.  So if you have 10 rules, you iterate on all 10 rules 10 times.
> That will cover the case where your rules 'chain' in a 10 step sequence.
> Sure it's an edge case to do that, but I can see rule sets where you
> construct 3-step chains (like swapping strings or something).
>
> Thanks
>
> Neal
>
> On 8/30/06, Doug Cook <[EMAIL PROTECTED]> wrote:
>>
>> Hi,
>>
>> I've run across a few patterns in URLs where applying a normalization
>> puts the URL in a form matching another normalization pattern (or even
>> the same one). But that pattern won't get executed because the patterns
>> are applied only once.
>>
>> Should normalization iterate until no patterns match (with, perhaps, some
>> limit to the number of iterations to prevent loops from pattern
>> mistakes)?
>>
>> It's a minor problem; it doesn't seem to affect too many URLs for things
>> like session ID removal, since finding two session IDs in the same URL is
>> rare (but does happen -- that's how I noticed this). I could imagine it
>> being much more significant, however, if other Nutch users out there are
>> using "broader" normalization patterns.
>>
>> Any philosophical/practical objections? (it's early, I've only had 1
>> coffee, and I've probably missed something obvious!)
>>
>> I'll file an issue and add it to my queue of things to do if people think
>> its a good idea.
>>
>> -Doug
>> --
>> View this message in context:
>> http://www.nabble.com/Should-URL-normalization-iterate--tf2190244.html#a6059957
>> Sent from the Nutch - Dev forum at Nabble.com.
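
For anyone skimming the thread, here is a minimal sketch of the iteration discussed above: apply every rule in order, and repeat until a pass changes nothing or a pass limit is hit. This is not the code from NUTCH-410 and does not use any Nutch API; the class name IterativeNormalizer, the Rule helper, and the three example rules are made up for illustration. The pass limit is a parameter; the example uses Neal's suggestion of one pass per rule.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class IterativeNormalizer {

    /** Hypothetical rule type: a compiled regex plus its replacement string. */
    public static final class Rule {
        final Pattern pattern;
        final String replacement;

        public Rule(String regex, String replacement) {
            this.pattern = Pattern.compile(regex);
            this.replacement = replacement;
        }
    }

    /**
     * Applies all rules in order, repeating until a full pass leaves the
     * URL unchanged or maxPasses passes have run. The limit guards against
     * rule sets that rewrite each other's output forever.
     */
    public static String normalize(String url, List<Rule> rules, int maxPasses) {
        String current = url;
        for (int pass = 0; pass < maxPasses; pass++) {
            String next = current;
            for (Rule rule : rules) {
                next = rule.pattern.matcher(next).replaceAll(rule.replacement);
            }
            if (next.equals(current)) {
                break; // fixed point: no rule matched on this pass
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        // Illustrative rules, deliberately listed in the "wrong" order so
        // that each one only becomes applicable after a later rule has run.
        List<Rule> rules = Arrays.asList(
            new Rule("[?&]$", ""),                        // drop a dangling ? or &
            new Rule("([?&])sid=[0-9a-f]+(&|$)", "$1"),   // strip a made-up session ID
            new Rule("&amp;", "&"));                      // undo an HTML-escaped ampersand

        String url = "http://example.com/p?a=1&amp;sid=abc123";
        System.out.println(normalize(url, rules, 1));            // single pass
        System.out.println(normalize(url, rules, rules.size())); // one pass per rule
    }
}
```

With a single pass only the last rule fires, so the session ID survives; with one pass per rule the three rules chain and the URL comes out as http://example.com/p?a=1. Capping at the rule count covers a chain as long as the rule set, which is the edge case Neal describes, while still terminating if two rules undo each other.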
