Re: Much faster RegExp lib needed in nutch?
Doug,

>> Instead I would suggest going a step further by adding a (configurable)
>> timeout mechanism and skipping bad records in general. Processing such
>> big data and losing all of it because of just one bad record is very sad.
>
> That's a good suggestion. Ideally we could use Thread.interrupt(), but
> that won't stop a thread in a tight loop. The only other option is
> Thread.stop(), which isn't generally safe. The safest thing to do is to
> restart the task in such a way that the bad entry is skipped.

Sounds like a lot of overhead, but I agree there is no other choice.

>> As far as I know, Google's MapReduce skips bad records also.
>
> Yes, the paper says that, when a job fails, they can restart it, skipping
> the bad entry. I don't think they skip without restarting the task. In
> Hadoop I think this could correspond to removing the task that failed and
> replacing it with two tasks: one whose input split includes entries before
> the bad entry, and one whose input split includes those after.

It would be very nice if there were a way to recycle the already processed
records and just add a new task that processes the records from the bad
record + 1 to the end of the split.

> But determining which entry failed is hard. Unless we report every single
> entry processed to the TaskTracker (which would be too expensive for many
> map functions) then it is hard to know exactly where things were when the
> process died.

Something that pops up in my mind would be splitting the task until we find
the one record that fails. Of course this is expensive, since we may have
to process many small tasks.

> We could instead include the number of entries processed in each status
> message, and the maximum count of entries before another status will be
> sent.

This sounds interesting. We would require some more metadata in the
reporter, but this is scheduled for Hadoop 0.2. In this change I would love
to see the ability to attach custom metadata to the report (MapWritable?)
also. In combination with a public API that allows access to these task
reports, we could have a kind of lock manager as described in the BigTable
talk.

> This way the task child can try to send, e.g., about one report per
> second to its parent TaskTracker, and adaptively determine how many
> entries between reports. So, for the first report it can guess that it
> will process only 1 entry before the next report. Then it processes the
> first entry and can now estimate how many entries it can process in the
> next second, and reports this as the maximum number of entries before the
> next report. Then it processes entries until either the reported max or
> one second is exceeded, and then makes its next status report. And so on.
> If the child hangs, then one can identify the range of entries that it
> was in down to one second. If each entry takes longer than one second to
> process then we'd know the exact entry. Unfortunately, this would not
> work with the Nutch Fetcher, which processes entries in separate threads,
> not strictly ordered...

Well, it would work for all map and reduce tasks. MapRunnable
implementations can take care of bad records by themselves, since there we
have full access to the record reader.

Stefan
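Stefan's last point — that a MapRunnable-style driver owns the read loop and can therefore skip bad records itself — can be sketched as follows. This is only an illustration: the interfaces are simplified stand-ins, not the actual Hadoop API.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.BiConsumer;

// Simplified stand-in for Hadoop's RecordReader: yields key/value pairs.
interface SimpleRecordReader {
    boolean hasNext();
    String[] next(); // {key, value}
}

public class SkippingMapRunner {
    // Drives the map loop itself, so a failure on one record can be
    // caught and logged instead of killing the whole task.
    public static int run(SimpleRecordReader reader,
                          BiConsumer<String, String> mapper) {
        int skipped = 0;
        while (reader.hasNext()) {
            String[] kv = reader.next();
            try {
                mapper.accept(kv[0], kv[1]);
            } catch (RuntimeException e) {
                skipped++; // log and skip the bad record; keep the task alive
            }
        }
        return skipped;
    }
}
```

This only handles records that throw; a record that *hangs* the mapper still needs the timeout/reporting machinery discussed above.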
Re: Much faster RegExp lib needed in nutch?
Stefan Groschupf wrote:
> Instead I would suggest going a step further by adding a (configurable)
> timeout mechanism and skipping bad records in general. Processing such
> big data and losing all of it because of just one bad record is very sad.

That's a good suggestion. Ideally we could use Thread.interrupt(), but that won't stop a thread in a tight loop. The only other option is Thread.stop(), which isn't generally safe. The safest thing to do is to restart the task in such a way that the bad entry is skipped.

> As far as I know, Google's MapReduce skips bad records also.

Yes, the paper says that, when a job fails, they can restart it, skipping the bad entry. I don't think they skip without restarting the task. In Hadoop I think this could correspond to removing the task that failed and replacing it with two tasks: one whose input split includes entries before the bad entry, and one whose input split includes those after. Or we could keep a list of bad entry indexes and send these along with the task. I prefer splitting the task.

But determining which entry failed is hard. Unless we report every single entry processed to the TaskTracker (which would be too expensive for many map functions) then it is hard to know exactly where things were when the process died. We could instead include the number of entries processed in each status message, and the maximum count of entries before another status will be sent. This way the task child can try to send, e.g., about one report per second to its parent TaskTracker, and adaptively determine how many entries between reports. So, for the first report it can guess that it will process only 1 entry before the next report. Then it processes the first entry and can now estimate how many entries it can process in the next second, and reports this as the maximum number of entries before the next report. Then it processes entries until either the reported max or one second is exceeded, and then makes its next status report. And so on.

If the child hangs, then one can identify the range of entries that it was in down to one second. If each entry takes longer than one second to process then we'd know the exact entry. Unfortunately, this would not work with the Nutch Fetcher, which processes entries in separate threads, not strictly ordered...

Doug
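Doug's adaptive reporting scheme can be sketched like this. It is a rough illustration, not Hadoop's actual reporting API; sendReport is a hypothetical hook standing in for the message to the TaskTracker.

```java
// Sketch of the adaptive status-report scheme: report roughly once per
// interval, and tell the tracker the maximum number of entries that will
// be processed before the next report arrives.
public class AdaptiveReporter {
    private final long intervalMs;
    private long lastReportAt;
    private int sinceReport = 0;
    private int maxBeforeNext = 1; // first report: guess just one entry
    private long totalProcessed = 0;

    public AdaptiveReporter(long intervalMs) {
        this.intervalMs = intervalMs;
        this.lastReportAt = System.currentTimeMillis();
    }

    // Call after each entry; returns true if a status report was sent.
    public boolean entryProcessed() {
        sinceReport++;
        totalProcessed++;
        long now = System.currentTimeMillis();
        if (sinceReport >= maxBeforeNext || now - lastReportAt >= intervalMs) {
            long elapsed = Math.max(1, now - lastReportAt);
            // Estimate how many entries fit into the next interval.
            maxBeforeNext = (int) Math.max(1, sinceReport * intervalMs / elapsed);
            sendReport(totalProcessed, maxBeforeNext);
            lastReportAt = now;
            sinceReport = 0;
            return true;
        }
        return false;
    }

    // Hypothetical hook: a real task would send this to its TaskTracker.
    protected void sendReport(long processed, int maxBeforeNext) { }
}
```

If the child then hangs, the tracker knows the bad entry lies within at most maxBeforeNext entries after the last reported count.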
Re: Much faster RegExp lib needed in nutch?
>> Beside that, we should perhaps add a kind of timeout to the URL filter
>> in general.
>
> I think this is overkill. There is already a Hadoop task timeout. Is that
> not sufficient?

No! What happens is that the URL filter hangs and then the complete task times out, instead of just skipping this URL. After 4 retries the complete job is killed and all fetched data is lost, in my case 5 million URLs each time. :-( This was the real reason for the problem described on hadoop-dev. Instead I would suggest going a step further by adding a (configurable) timeout mechanism and skipping bad records in general. Processing such big data and losing all of it because of just one bad record is very sad. As far as I know, Google's MapReduce skips bad records also.

Stefan
Re: Much faster RegExp lib needed in nutch?
Stefan Groschupf wrote:
> Beside that, we should perhaps add a kind of timeout to the URL filter in
> general.

I think this is overkill. There is already a Hadoop task timeout. Is that not sufficient?

Doug
Re: Much faster RegExp lib needed in nutch?
Jérôme Charron wrote:
>> 3. Add new plugins that use dk.brics.automaton.RegExp, using different
>> default regex file names. Then folks can, if they choose, configure
>> things to use these faster regex libraries, but only if they're willing
>> to write the simpler regexes that it supports. If, over time, we find
>> that the most useful regexes are easily converted, then we could switch
>> the default to this.
>
> +1
> I will do it this way. Thanks Doug.

Yes, I prefer it this way too; then it's clear that it's different and should be treated differently.

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
Re: Much faster RegExp lib needed in nutch?
> If it were easy to implement all java regex features in
> dk.brics.automaton.RegExp, then they probably would have. Alternately,
> if they'd implemented all java regex features, it probably wouldn't be
> so fast. So I worry that attempts to translate are doomed. Better to
> accept the differences: if you want the speed, you must use restricted
> regexes.

That's right. It is a deterministic automaton => more speed, but less functionality.

> 3. Add new plugins that use dk.brics.automaton.RegExp, using different
> default regex file names. Then folks can, if they choose, configure
> things to use these faster regex libraries, but only if they're willing
> to write the simpler regexes that it supports. If, over time, we find
> that the most useful regexes are easily converted, then we could switch
> the default to this.

+1
I will do it this way. Thanks Doug.

Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Re: Much faster RegExp lib needed in nutch?
> Beside that, we should perhaps add a kind of timeout to the URL filter in
> general. Since it can happen that a user configures a regex for his Nutch
> setup that runs into the same problem we just ran into.
> Something like the attachment below. Would you agree? I can create a
> serious patch and test it if we are interested in adding this as a
> fallback into the sources.

+1 as a short-term solution. In the long term, I think we should try to reproduce it and analyze what really happens. (I will commit some minimal unit tests in the next few days.)

Regards
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Re: Much faster RegExp lib needed in nutch?
Beside that, we should perhaps add a kind of timeout to the URL filter in general, since it can happen that a user configures a regex for his Nutch setup that runs into the same problem we just ran into. Something like the patch attached below. Would you agree? I can create a serious patch and test it if we are interested in adding this as a fallback into the sources. At least this would protect Nutch against wrong user configurations. :-)

Index: src/plugin/urlfilter-regex/src/java/org/apache/nutch/net/RegexURLFilter.java
===================================================================
--- src/plugin/urlfilter-regex/src/java/org/apache/nutch/net/RegexURLFilter.java (revision 383682)
+++ src/plugin/urlfilter-regex/src/java/org/apache/nutch/net/RegexURLFilter.java (working copy)
@@ -75,14 +75,20 @@
   public synchronized String filter(String url) {
     Iterator i = rules.iterator();
     while (i.hasNext()) {
-      Rule r = (Rule) i.next();
-      Matcher matcher = r.pattern.matcher(url);
-
-      if (matcher.find()) {
-        //System.out.println("Matched " + r.regex);
-        return r.sign ? url : null;
-      }
+      MatcherThread mt = new MatcherThread();
+      mt.rule = (Rule) i.next();
+      mt.url = url;
+      mt.start();
+      try {
+        synchronized (mt.monitor) {
+          if (!mt.done) {
+            mt.monitor.wait(1000);
+          }
+        }
+      } catch (InterruptedException e) {}
+      if (!mt.done) {
+        mt.stop();  // matcher hung; deprecated, but interrupt() won't break a regex loop
+        continue;   // skip this rule instead of timing out the whole task
+      }
+      if (mt.matched) {
+        return mt.rule.sign ? url : null;
+      }
     }
     return null; // assume no go
@@ -87,6 +93,24 @@
     return null; // assume no go
   }

+  class MatcherThread extends Thread {
+    private Object monitor = new Object();
+    private String url;
+    private Rule rule;
+    private boolean matched = false;
+    private boolean done = false;
+    public void run() {
+      boolean m = this.rule.pattern.matcher(url).find();
+      synchronized (monitor) {
+        this.matched = m;
+        this.done = true;
+        this.monitor.notify();
+      }
+    }
+  }
+
 // // Format of configuration file is

On 16.03.2006 at 18:10, Jérôme Charron wrote:
>>> 1. Keep the well-known perl syntax for regexps (and then find a way to
>>> "simulate" them with automaton's "limited" syntax)?
>>
>> My vote would be for option 1. It's less work for everyone (except for
>> the person incorporating the new library :)
>
> That's my preferred solution too. The first challenge is to see how to
> translate the regexps used in the default regexp-urlfilter templates
> provided by Nutch. For now, the only one I don't see how to translate
> from perl syntax to dk.brics.automaton syntax is this regexp:
> -.*(/.+?)/.*?\1/.*?\1/.*
> In fact, automaton doesn't support capturing groups (Anders Moeller has
> confirmed). We cannot remove this regexp from the urlfilter, but we
> cannot handle it with automaton. So, two solutions:
> 1. Keep java regexp ...
> 2. Switch to automaton and provide a java implementation of this regexp
> (it is more a protection pattern than really a filter pattern, and it
> could probably be hard-coded).
> I'm waiting for your suggestions...
> Regards
> Jérôme

---
blog: http://www.find23.org
company: http://www.media-style.com
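For what it's worth, killing a hung matcher thread with Thread.stop() is deprecated; the same per-rule deadline can be expressed with java.util.concurrent. Below is a sketch under that assumption (the names are hypothetical, not actual Nutch code): each match attempt runs as a Future with a timeout, and a timed-out rule is simply skipped.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedMatch {
    // Runs one match attempt with a deadline. Returns the matcher's verdict,
    // or null when it did not finish in time (the caller then skips the rule).
    public static Boolean matchWithTimeout(Callable<Boolean> matchTask,
                                           long timeoutMs) {
        // Daemon worker: a hung matcher never blocks JVM shutdown. Note that
        // java.util.regex does not check the interrupt flag, so a truly tight
        // matching loop may keep spinning in the background after cancel.
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);
            return t;
        });
        try {
            return pool.submit(matchTask).get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return null; // deadline exceeded: treat as "skip this rule"
        } catch (Exception e) {
            return null; // interrupted or matcher threw: also skip
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The trade-off is that the hung thread leaks until the JVM exits, whereas Thread.stop() reclaims it immediately at the cost of safety.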
Re: Much faster RegExp lib needed in nutch?
Jérôme Charron wrote:
> So, two solutions:
> 1. Keep java regexp ...
> 2. Switch to automaton and provide a java implementation of this regexp
> (it is more a protection pattern than really a filter pattern, and it
> could probably be hard-coded).

If it were easy to implement all java regex features in dk.brics.automaton.RegExp, then they probably would have. Alternately, if they'd implemented all java regex features, it probably wouldn't be so fast. So I worry that attempts to translate are doomed. Better to accept the differences: if you want the speed, you must use restricted regexes.

How about:

3. Add new plugins that use dk.brics.automaton.RegExp, using different default regex file names. Then folks can, if they choose, configure things to use these faster regex libraries, but only if they're willing to write the simpler regexes that it supports. If, over time, we find that the most useful regexes are easily converted, then we could switch the default to this.

Doug
Re: Much faster RegExp lib needed in nutch?
>>> 1. Keep the well-known perl syntax for regexps (and then find a way to
>>> "simulate" them with automaton's "limited" syntax)?
>>
>> My vote would be for option 1. It's less work for everyone (except for
>> the person incorporating the new library :)
>
> That's my preferred solution too. The first challenge is to see how to
> translate the regexps used in the default regexp-urlfilter templates
> provided by Nutch. For now, the only one I don't see how to translate
> from perl syntax to dk.brics.automaton syntax is this regexp:
> -.*(/.+?)/.*?\1/.*?\1/.*
> In fact, automaton doesn't support capturing groups (Anders Moeller has
> confirmed). We cannot remove this regexp from the urlfilter, but we
> cannot handle it with automaton. So, two solutions:
> 1. Keep java regexp ...
> 2. Switch to automaton and provide a java implementation of this regexp
> (it is more a protection pattern than really a filter pattern, and it
> could probably be hard-coded).
> I'm waiting for your suggestions...

I've pinged Terence Parr, the ANTLR author. I heard that the new version (ANTLR 3) has a fast FSM inside it. If so, somebody could write an ANTLR grammar to convert the Nutch regexes into another ANTLR grammar that, when processed by ANTLR, creates a URL parser/validator. It's almost too easy... :) Anyway, waiting to hear back from Ter.

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
Re: Much faster RegExp lib needed in nutch?
>> 1. Keep the well-known perl syntax for regexps (and then find a way to
>> "simulate" them with automaton's "limited" syntax)?
>
> My vote would be for option 1. It's less work for everyone (except for
> the person incorporating the new library :)

That's my preferred solution too. The first challenge is to see how to translate the regexps used in the default regexp-urlfilter templates provided by Nutch. For now, the only one I don't see how to translate from perl syntax to dk.brics.automaton syntax is this regexp:

-.*(/.+?)/.*?\1/.*?\1/.*

In fact, automaton doesn't support capturing groups (Anders Moeller has confirmed). We cannot remove this regexp from the urlfilter, but we cannot handle it with automaton. So, two solutions:
1. Keep java regexp ...
2. Switch to automaton and provide a java implementation of this regexp (it is more a protection pattern than really a filter pattern, and it could probably be hard-coded).
I'm waiting for your suggestions...

Regards
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Re: Much faster RegExp lib needed in nutch?
> The probability of encountering a $ sign somewhere inside a URL is not
> insignificant... I agree that it's very unlikely (perhaps even illegal)
> to use ^ in URLs, but $ is sometimes used.

I'd have to take a look at the spec, but I think both characters should be URL-encoded anyway. Maybe it'd be a good idea to include a URL-normalizing filter that would encode everything properly (according to www-url-encoding) before regexping?

D.
Re: Much faster RegExp lib needed in nutch?
Incze Lajos wrote:
>> * simulate ^ and $ operators by prepending and appending special start
>> and end markers to the input string.
>>
>> E.g.
>>   String START = "__START__";
>>   String END = "__END__";
>>   inputString = START + inputString + END;
>
> What about
>   char START = '^';
>   char END = '$';
>   inputString = START + inputString + END;
> ?

The probability of encountering a $ sign somewhere inside a URL is not insignificant... I agree that it's very unlikely (perhaps even illegal) to use ^ in URLs, but $ is sometimes used.

--
Best regards,
Andrzej Bialecki
http://www.sigram.com
Re: Much faster RegExp lib needed in nutch?
> Thanks to everybody for your suggestions. But really, my problem is not
> technical, but "political": what should we do if we switch to the
> automaton regexp lib?
> 1. Keep the well-known perl syntax for regexps (and then find a way to
> "simulate" them with automaton's "limited" syntax)?
> 2. Switch to the automaton "limited" syntax (=> must be well documented)

My vote would be for option 1. It's less work for everyone (except for the person incorporating the new library :)
Re: Much faster RegExp lib needed in nutch?
Thanks to everybody for your suggestions. But really, my problem is not technical, but "political": what should we do if we switch to the automaton regexp lib?

1. Keep the well-known perl syntax for regexps (and then find a way to "simulate" them with automaton's "limited" syntax)?
2. Switch to the automaton "limited" syntax (=> must be well documented)

What is your feeling about this?

Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Re: Much faster RegExp lib needed in nutch?
I've been watching the discussion of faster regex libs with much interest. But if regex speed is a problem, would using fewer regexes be a good answer? Protocol and extension filtering could be done by another URLFilter plugin that is dedicated to this task and uses more lightweight string-chopping techniques. That way full regex support could be retained for the tasks where it's really needed.

On Mar 13, 2006, at 12:31 PM, Howie Wang wrote:
>> I have made some quick tests with regex-urlfilter... The major problem
>> is that it doesn't use the Perl syntax... For instance, it doesn't
>> support the boundary matchers ^ and $ (which are used in nutch)
>
> Are there other ways to match start/end of string in the other regex
> library? I use "^http" a lot because a lot of sites pass around urls in
> the query string, and I don't want them (e.g.
> http://del.icio.us/howie?url=http://lucene.apache.org/nutch)
>
> Howie

--
Matt Kangas / [EMAIL PROTECTED]
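The "lightweight string-chopping" idea could look something like the sketch below. PrefixSuffixFilter is a hypothetical name, not an existing Nutch plugin, and the protocol/extension lists are illustrative only.

```java
import java.util.Locale;
import java.util.Set;

// Sketch of a non-regex URL filter: accept only known protocols and reject
// known-unwanted file extensions using plain string operations.
public class PrefixSuffixFilter {
    private static final Set<String> PROTOCOLS =
            Set.of("http://", "https://", "ftp://");
    private static final Set<String> BAD_EXTENSIONS =
            Set.of(".gif", ".jpg", ".png", ".zip", ".exe");

    // Returns the URL if accepted, null if rejected (urlfilter convention).
    public static String filter(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        boolean okProtocol = false;
        for (String p : PROTOCOLS) {
            if (lower.startsWith(p)) { okProtocol = true; break; }
        }
        if (!okProtocol) return null;
        for (String ext : BAD_EXTENSIONS) {
            if (lower.endsWith(ext)) return null;
        }
        return url; // passed: let the (slower) regex filters handle the rest
    }
}
```

startsWith/endsWith are O(prefix length) with no backtracking, which is the point of keeping the regex engine out of this common path.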
Re: Much faster RegExp lib needed in nutch?
> * simulate ^ and $ operators by prepending and appending special start
> and end markers to the input string.
>
> E.g.
>   String START = "__START__";
>   String END = "__END__";
>   inputString = START + inputString + END;

What about

  char START = '^';
  char END = '$';
  inputString = START + inputString + END;

?

incze
Re: Much faster RegExp lib needed in nutch?
> I have made some quick tests with regex-urlfilter... The major problem is
> that it doesn't use the Perl syntax... For instance, it doesn't support
> the boundary matchers ^ and $ (which are used in nutch)

Are there other ways to match start/end of string in the other regex library? I use "^http" a lot because a lot of sites pass around urls in the query string, and I don't want them (e.g. http://del.icio.us/howie?url=http://lucene.apache.org/nutch)

Howie
Re: Much faster RegExp lib needed in nutch?
Jérôme Charron wrote:
>> I forgot to add: it is also slightly incompatible with Perl regex (and
>> consequently with Java regex). I don't remember the details, they are
>> somewhere in the docs, but the incompatibility is caused by some rarely
>> used operators not being implemented... so I guess we could live with it.
>
> I have made some quick tests with regex-urlfilter... The major problem is
> that it doesn't use the Perl syntax... For instance, it doesn't support
> the boundary matchers ^ and $ (which are used in nutch)
> Of course, the regexps can easily be rewritten. But how will we deal with
> this if we switch to dk.brics.automaton?
> * Change the syntax used in Nutch?
> * Convert perl syntax to dk.brics.automaton?

* simulate ^ and $ operators by prepending and appending special start and end markers to the input string.

E.g.
  String START = "__START__";
  String END = "__END__";
  inputString = START + inputString + END;

> I will make some benchmarks based on regexp-urlfilter to evaluate whether
> dk.brics.automaton integration in Nutch is interesting (given the needed
> changes in the regexp syntax)

--
Best regards,
Andrzej Bialecki
http://www.sigram.com
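The marker trick can be made concrete. An automaton-style engine only does full-string matches, so an anchored pattern is rewritten (anchors stripped, markers or ".*" padded in) and the input is wrapped in the same markers. The sketch below uses java.util.regex's full-match mode (Pattern.matches) to stand in for the automaton's run(); the names and marker characters are illustrative assumptions.

```java
import java.util.regex.Pattern;

// Simulating ^ and $ for a full-match-only engine: strip the anchors from
// the pattern, pad with markers/".*", and wrap the input in the markers.
public class AnchorSimulation {
    // Control characters assumed never to occur in real URLs (unlike '$',
    // which does appear in URLs, per Andrzej's point above).
    static final String START = "\u0001";
    static final String END = "\u0002";

    // Rewrite an anchored substring-style regex into a full-match regex.
    public static String toFullMatch(String regex) {
        String body = regex;
        String prefix = ".*", suffix = ".*";
        if (body.startsWith("^")) { body = body.substring(1); prefix = START; }
        if (body.endsWith("$")) { body = body.substring(0, body.length() - 1); suffix = END; }
        return prefix + body + suffix;
    }

    public static String wrap(String input) {
        return START + input + END;
    }

    public static boolean find(String regex, String input) {
        // Pattern.matches is full-match, like an automaton's run().
        return Pattern.matches(toFullMatch(regex), wrap(input));
    }
}
```

With markers that cannot occur in the input, "^http" becomes "\u0001http.*" and can no longer match a URL that merely *contains* "http" in its query string.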
Re: Much faster RegExp lib needed in nutch?
> * Change the syntax used in Nutch?

+1. My point of view is that we can do that for Nutch 0.8, as far as we document it (see nutch-user). :-)

Stefan
Re: Much faster RegExp lib needed in nutch?
> I forgot to add: it is also slightly incompatible with Perl regex (and
> consequently with Java regex). I don't remember the details, they are
> somewhere in the docs, but the incompatibility is caused by some rarely
> used operators not being implemented... so I guess we could live with it.

I have made some quick tests with regex-urlfilter... The major problem is that it doesn't use the Perl syntax... For instance, it doesn't support the boundary matchers ^ and $ (which are used in nutch). Of course, the regexps can easily be rewritten. But how will we deal with this if we switch to dk.brics.automaton?
* Change the syntax used in Nutch?
* Convert perl syntax to dk.brics.automaton?

I will make some benchmarks based on regexp-urlfilter to evaluate whether dk.brics.automaton integration in Nutch is interesting (given the needed changes in the regexp syntax).

Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Re: Much faster RegExp lib needed in nutch?
Jérôme Charron wrote:
>> Thanks for volunteering, you're welcome ... ;-)
>
> Good job Andrzej! ;-)
> So, it's now on my todo list to check the perl5 compatibility issue and
> to provide some benchmarks to the community...

I was kidding (sort of). I'll be glad to help testing / porting our regexps, if you need help.

--
Best regards,
Andrzej Bialecki
http://www.sigram.com
Re: Much faster RegExp lib needed in nutch?
> Thanks for volunteering, you're welcome ... ;-)

Good job Andrzej! ;-)
So, it's now on my todo list to check the perl5 compatibility issue and to provide some benchmarks to the community...

Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Re: Much faster RegExp lib needed in nutch?
Jérôme Charron wrote:
>> It's not only faster, it also scales better for large and complex
>> expressions, and it is also possible to build automata from several
>> expressions with AND/OR operators, which is the use case we have in
>> regexp-urlfilter.
>
> It seems awesome!

I forgot to add: it is also slightly incompatible with Perl regex (and consequently with Java regex). I don't remember the details, they are somewhere in the docs, but the incompatibility is caused by some rarely used operators not being implemented... so I guess we could live with it.

> Does somebody plan to switch to this lib in nutch?

Thanks for volunteering, you're welcome ... ;-)

> Is the BSD license compatible with the ASF one?

Yes.

--
Best regards,
Andrzej Bialecki
http://www.sigram.com
Re: Much faster RegExp lib needed in nutch?
> It's not only faster, it also scales better for large and complex
> expressions, and it is also possible to build automata from several
> expressions with AND/OR operators, which is the use case we have in
> regexp-urlfilter.

It seems awesome! Does somebody plan to switch to this lib in nutch? Is the BSD license compatible with the ASF one?

Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Re: Much faster RegExp lib needed in nutch?
Jack Tang wrote:
> Hi all
> RegExp is widely used in nutch, and I am now wondering whether the
> JDK/Jakarta classes are fast enough. Here are the benchmarks I found on
> the web: http://tusker.org/regex/regex_benchmark.html
> It seems dk.brics.automaton.RegExp is the fastest among the libs.

It's not only faster, it also scales better for large and complex expressions, and it is also possible to build automata from several expressions with AND/OR operators, which is the use case we have in regexp-urlfilter.

--
Best regards,
Andrzej Bialecki
http://www.sigram.com
Much faster RegExp lib needed in nutch?
Hi all

RegExp is widely used in nutch, and I am now wondering whether the JDK/Jakarta classes are fast enough. Here are the benchmarks I found on the web: http://tusker.org/regex/regex_benchmark.html
It seems dk.brics.automaton.RegExp is the fastest among the libs.

/Jack
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars