Re: Much faster RegExp lib needed in nutch?
Doug,

> > Instead I would suggest going a step further by adding a (configurable)
> > timeout mechanism and skipping bad records during reducing in general.
> > Processing such big data and losing all of it because of just one bad
> > record is very sad.
>
> That's a good suggestion. Ideally we could use Thread.interrupt(), but
> that won't stop a thread in a tight loop. The only other option is
> Thread.stop(), which isn't generally safe. The safest thing to do is to
> restart the task in such a way that the bad entry is skipped.

Sounds like a lot of overhead, but I agree there is no other choice.

> > As far as I know, Google's MapReduce skips bad records also.
>
> Yes, the paper says that, when a job fails, they can restart it, skipping
> the bad entry. I don't think they skip without restarting the task. In
> Hadoop I think this could correspond to removing the task that failed and
> replacing it with two tasks: one whose input split includes entries before
> the bad entry, and one whose input split includes those after.

It would be very nice if there were any chance of recycling the already
processed records and just adding a new task that processes the records
from the bad record + 1 to the end of the split.

> But determining which entry failed is hard. Unless we report every single
> entry processed to the TaskTracker (which would be too expensive for many
> map functions) then it is hard to know exactly where things were when the
> process dies.

Something that pops up in my mind would be splitting the task until we find
the one record that fails. Of course this is expensive, since we may have
to process many small tasks.

> We could instead include the number of entries processed in each status
> message, and the maximum count of entries before another status will be
> sent.

This sounds interesting. We would require some more metadata in the
reporter, but this is scheduled for Hadoop 0.2. In this change I would love
to also see the ability to put custom metadata into the report
(MapWritable?). In combination with a public API that allows access to
these task reports, we could have a kind of lock manager as described in
the BigTable talk.

> This way the task child can try to send, e.g., about one report per second
> to its parent TaskTracker, and adaptively determine how many entries
> between reports. So, for the first report it can guess that it will
> process only 1 entry before the next report. Then it processes the first
> entry and can now estimate how many entries it can process in the next
> second, and reports this as the maximum number of entries before the next
> report. Then it processes entries until either the reported max or one
> second is exceeded, and then makes its next status report. And so on. If
> the child hangs, then one can identify the range of entries that it was in
> down to one second. If each entry takes longer than one second to process
> then we'd know the exact entry. Unfortunately, this would not work with
> the Nutch Fetcher, which processes entries in separate threads, not
> strictly ordered...

Well, it would work for all map and reduce tasks. MapRunnable
implementations can take care of bad records by themselves, since there we
have full access to the record reader.

Stefan
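For illustration only, here is a minimal sketch of the adaptive reporting
loop Doug describes above (about one report per second, with the entry
count and the promised maximum before the next report). The StatusReporter
and Entry types are hypothetical stand-ins, not the actual Hadoop task or
reporter API.

  // Illustrative sketch: report roughly once per second while iterating over
  // the entries of a split, and tell the TaskTracker how many entries may be
  // processed before the next report is due.
  public class AdaptiveReportingLoop {

    interface StatusReporter {
      // report(entriesProcessed, maxEntriesBeforeNextReport)
      void report(long entriesProcessed, long maxBeforeNext);
    }

    interface Entry { void process(); }

    static void run(java.util.Iterator<Entry> entries, StatusReporter reporter) {
      long processed = 0;
      long maxBeforeNext = 1;          // first guess: one entry per report
      reporter.report(processed, maxBeforeNext);

      long windowStart = System.currentTimeMillis();
      long inWindow = 0;

      while (entries.hasNext()) {
        entries.next().process();
        processed++;
        inWindow++;

        long elapsed = System.currentTimeMillis() - windowStart;
        if (inWindow >= maxBeforeNext || elapsed >= 1000) {
          // Estimate how many entries fit into the next second and promise a
          // report before exceeding that count.
          double perMillis = elapsed > 0 ? (double) inWindow / elapsed : inWindow;
          maxBeforeNext = Math.max(1, (long) (perMillis * 1000));
          reporter.report(processed, maxBeforeNext);
          windowStart = System.currentTimeMillis();
          inWindow = 0;
        }
      }
    }
  }

If the child then hangs, the last report bounds the suspect range to at
most maxBeforeNext entries, i.e. roughly one second of work.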
Re: Much faster RegExp lib needed in nutch?
Stefan Groschupf wrote:
> Instead I would suggest going a step further by adding a (configurable)
> timeout mechanism and skipping bad records during reducing in general.
> Processing such big data and losing all of it because of just one bad
> record is very sad.

That's a good suggestion. Ideally we could use Thread.interrupt(), but that
won't stop a thread in a tight loop. The only other option is Thread.stop(),
which isn't generally safe. The safest thing to do is to restart the task in
such a way that the bad entry is skipped.

> As far as I know, Google's MapReduce skips bad records also.

Yes, the paper says that, when a job fails, they can restart it, skipping
the bad entry. I don't think they skip without restarting the task. In
Hadoop I think this could correspond to removing the task that failed and
replacing it with two tasks: one whose input split includes entries before
the bad entry, and one whose input split includes those after. Or we could
keep a list of bad entry indexes and send these along with the task. I
prefer splitting the task.

But determining which entry failed is hard. Unless we report every single
entry processed to the TaskTracker (which would be too expensive for many
map functions) then it is hard to know exactly where things were when the
process dies. We could instead include the number of entries processed in
each status message, and the maximum count of entries before another status
will be sent. This way the task child can try to send, e.g., about one
report per second to its parent TaskTracker, and adaptively determine how
many entries between reports. So, for the first report it can guess that it
will process only 1 entry before the next report. Then it processes the
first entry and can now estimate how many entries it can process in the
next second, and reports this as the maximum number of entries before the
next report. Then it processes entries until either the reported max or one
second is exceeded, and then makes its next status report. And so on. If
the child hangs, then one can identify the range of entries that it was in
down to one second. If each entry takes longer than one second to process
then we'd know the exact entry.

Unfortunately, this would not work with the Nutch Fetcher, which processes
entries in separate threads, not strictly ordered...

Doug
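For illustration, a rough sketch of the "replace the failed task with two
tasks" idea above, assuming byte-offset splits in the style of a FileSplit.
The SplitRange class, and the assumption that the bad entry's offset and
length are known, are hypothetical; this is not actual Hadoop code.

  // Illustrative only: given a split [start, start+length) and the offset of
  // the entry that made the task fail, produce two replacement splits that
  // cover everything before and everything after the bad entry.
  public class BadRecordSplitter {

    static class SplitRange {
      final String path;
      final long start;
      final long length;
      SplitRange(String path, long start, long length) {
        this.path = path; this.start = start; this.length = length;
      }
    }

    static java.util.List<SplitRange> splitAround(SplitRange failed,
                                                  long badEntryStart,
                                                  long badEntryLength) {
      java.util.List<SplitRange> replacements = new java.util.ArrayList<SplitRange>();
      long end = failed.start + failed.length;
      // Entries before the bad one.
      if (badEntryStart > failed.start) {
        replacements.add(new SplitRange(failed.path, failed.start,
                                        badEntryStart - failed.start));
      }
      // Entries after the bad one.
      long afterStart = badEntryStart + badEntryLength;
      if (afterStart < end) {
        replacements.add(new SplitRange(failed.path, afterStart, end - afterStart));
      }
      return replacements;
    }
  }

The hard part, as noted above, is obtaining badEntryStart at all, which is
what the adaptive reporting scheme is meant to narrow down.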
Re: Much faster RegExp lib needed in nutch?
> > Beside that, we maybe should add a kind of timeout to the url filter in
> > general.
>
> I think this is overkill. There is already a Hadoop task timeout. Is that
> not sufficient?

No! What happens is that the url filter hangs, and then the complete task
is timed out instead of just skipping this url. After 4 retries the
complete job is killed and all fetched data are lost, in my case 5 million
urls every time. :-( This was the real reason for the problem described on
hadoop-dev.

Instead I would suggest going a step further by adding a (configurable)
timeout mechanism and skipping bad records during reducing in general.
Processing such big data and losing all of it because of just one bad
record is very sad. As far as I know, Google's MapReduce skips bad records
also.

Stefan
Re: Much faster RegExp lib needed in nutch?
Stefan Groschupf wrote:
> Beside that, we maybe should add a kind of timeout to the url filter in
> general.

I think this is overkill. There is already a Hadoop task timeout. Is that
not sufficient?

Doug
Re: Much faster RegExp lib needed in nutch?
Jérôme Charron wrote:
> > 3. Add new plugins that use dk.brics.automaton.RegExp, using different
> > default regex file names. Then folks can, if they choose, configure
> > things to use these faster regex libraries, but only if they're willing
> > to write the simpler regexes that it supports. If, over time, we find
> > that the most useful regexes are easily converted, then we could switch
> > the default to this.
>
> +1. I will do it this way. Thanks Doug.

Yes, I prefer it this way too; then it's clear that it's different and
should be treated differently.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
Re: Much faster RegExp lib needed in nutch?
> If it were easy to implement all java regex features in
> dk.brics.automaton.RegExp, then they probably would have. Alternately,
> if they'd implemented all java regex features, it probably wouldn't be
> so fast. So I worry that attempts to translate are doomed. Better to
> accept the differences: if you want the speed, you must use restricted
> regexes.

That's right. It is a deterministic API => more speed, but less
functionality.

> 3. Add new plugins that use dk.brics.automaton.RegExp, using different
> default regex file names. Then folks can, if they choose, configure
> things to use these faster regex libraries, but only if they're willing
> to write the simpler regexes that it supports. If, over time, we find
> that the most useful regexes are easily converted, then we could switch
> the default to this.

+1. I will do it this way. Thanks Doug.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
Re: Much faster RegExp lib needed in nutch?
> Beside that, we maybe should add a kind of timeout to the url filter in
> general. Since it can happen that a user configures a regex for his nutch
> setup that runs into the same problem as we just ran into.
> Something like the attached below.
> Would you agree? I can create a serious patch and test it if we are
> interested in adding this as a fallback into the sources.

+1 as a short-term solution.
In the long term, I think we should try to reproduce it and analyze what
really happens. (I will commit some minimal unit tests in the next few
days.)

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
Re: Much faster RegExp lib needed in nutch?
Beside that, we maybe should add a kind of timeout to the url filter in
general, since it can happen that a user configures a regex for his nutch
setup that runs into the same problem as we just ran into. Something like
the patch attached below. Would you agree? I can create a serious patch and
test it if we are interested in adding this as a fallback into the sources.
At least this would protect nutch against wrong user configurations. :-)

Index: src/plugin/urlfilter-regex/src/java/org/apache/nutch/net/RegexURLFilter.java
===================================================================
--- src/plugin/urlfilter-regex/src/java/org/apache/nutch/net/RegexURLFilter.java  (revision 383682)
+++ src/plugin/urlfilter-regex/src/java/org/apache/nutch/net/RegexURLFilter.java  (working copy)
@@ -75,14 +75,20 @@
   public synchronized String filter(String url) {
     Iterator i=rules.iterator();
+    MatcherThread mt;
     while(i.hasNext()) {
-      Rule r=(Rule) i.next();
-      Matcher matcher = r.pattern.matcher(url);
-
-      if (matcher.find()) {
-        //System.out.println("Matched " + r.regex);
-        return r.sign ? url : null;
-      }
+      mt = new MatcherThread();
+      mt.rule = (Rule) i.next();
+      mt.start();
+      try {
+        synchronized (mt.monitor) {
+          if (!mt.done) {
+            mt.monitor.wait(1000);
+          }
+        }
+      } catch (InterruptedException e) {}
+      mt.stop();
+      return mt.found ? url : null;
     };

     return null;   // assume no go
@@ -87,6 +93,24 @@
     return null;   // assume no go
   }

+  class MatcherThread extends Thread {
+    private Object monitor = new Object();
+    private String url;
+    private Rule rule;
+    private boolean found = false;
+    private boolean done = false;
+    public void run() {
+      Matcher matcher = this.rule.pattern.matcher(url);
+      if (matcher.find()) {
+        this.found = rule.sign;
+      }
+      synchronized (monitor) {
+        this.monitor.notify();
+        this.done = true;
+      }
+    }
+  }

   //
   // Format of configuration file is

On 16.03.2006 at 18:10, Jérôme Charron wrote:

> > 1. Keeps the well-known perl syntax for regexp (and then find a way to
> > "simulate" them with automaton "limited" syntax)?
> > My vote would be for option 1. It's less work for everyone
> > (except for the person incorporating the new library :)
>
> That's my preferred solution too.
> The first challenge is to see how to translate the regexps used in the
> default regexp-urlfilter templates provided by Nutch.
> For now, the only thing I don't see how to translate from perl syntax to
> dk.brics.automaton syntax is this regexp:
> -.*(/.+?)/.*?\1/.*?\1/.*
> In fact, automaton doesn't support capturing groups (Anders Moeller has
> confirmed). We cannot remove this regexp from urlfilter, but we cannot
> handle it with automaton. So, two solutions:
> 1. Keep java regexp ...
> 2. Switch to automaton and provide a java implementation of this regexp
> (it is more a protection pattern than really a filter pattern, and it
> could probably be hard-coded).
> I'm waiting for your suggestions...
>
> Regards
>
> Jérôme
>
> --
> http://motrech.free.fr/
> http://www.frutch.org/

-------------------------------
blog:    http://www.find23.org
company: http://www.media-style.com
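As a point of comparison only, and not something proposed in the thread: on
Java 5 and later, a similar per-URL timeout could be sketched with
java.util.concurrent instead of a hand-rolled wait()/Thread.stop(). The
class name and the timeout handling below are assumptions; like the patch
above, it cannot actually kill a match stuck in a tight backtracking loop.

  // Sketch of a timeout-bounded regex match using java.util.concurrent.
  // Note: cancel(true) only interrupts the worker thread; a match stuck in a
  // tight loop will keep running in the background until it finishes.
  import java.util.concurrent.*;
  import java.util.regex.*;

  public class TimedMatcher {
    private static final ExecutorService POOL = Executors.newCachedThreadPool();

    public static boolean find(final Pattern pattern, final String url,
                               long timeoutMillis) {
      Future<Boolean> result = POOL.submit(new Callable<Boolean>() {
        public Boolean call() {
          return pattern.matcher(url).find();
        }
      });
      try {
        return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
      } catch (TimeoutException e) {
        result.cancel(true);   // best effort; see note above
        return false;          // treat a hung rule as "no match"
      } catch (Exception e) {
        return false;
      }
    }
  }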
Re: Much faster RegExp lib needed in nutch?
Jérôme Charron wrote:
> So, two solutions:
> 1. Keep java regexp ...
> 2. Switch to automaton and provide a java implementation of this regexp
> (it is more a protection pattern than really a filter pattern, and it
> could probably be hard-coded).

If it were easy to implement all java regex features in
dk.brics.automaton.RegExp, then they probably would have. Alternately, if
they'd implemented all java regex features, it probably wouldn't be so
fast. So I worry that attempts to translate are doomed. Better to accept
the differences: if you want the speed, you must use restricted regexes.

How about:

3. Add new plugins that use dk.brics.automaton.RegExp, using different
default regex file names. Then folks can, if they choose, configure things
to use these faster regex libraries, but only if they're willing to write
the simpler regexes that it supports. If, over time, we find that the most
useful regexes are easily converted, then we could switch the default to
this.

Doug
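For illustration, a minimal sketch of what the matching core of such an
automaton-based plugin might look like, assuming dk.brics.automaton's
RegExp, Automaton and RunAutomaton classes. Rule-file parsing and plugin
wiring are omitted, the class name is hypothetical, and the rules are
assumed to already be written in automaton's restricted dialect (no
capturing groups or backreferences).

  import dk.brics.automaton.RegExp;
  import dk.brics.automaton.RunAutomaton;

  public class AutomatonUrlFilter {
    private final RunAutomaton[] runners;
    private final boolean[] signs;       // true for +rules, false for -rules

    public AutomatonUrlFilter(String[] regexes, boolean[] signs) {
      this.runners = new RunAutomaton[regexes.length];
      this.signs = signs;
      for (int i = 0; i < regexes.length; i++) {
        // Deterministic automaton: linear-time matching, but a restricted
        // regex dialect compared to java.util.regex.
        runners[i] = new RunAutomaton(new RegExp(regexes[i]).toAutomaton());
      }
    }

    public String filter(String url) {
      for (int i = 0; i < runners.length; i++) {
        // run() tests the whole string, so rules typically need a trailing .*
        if (runners[i].run(url)) {
          return signs[i] ? url : null;
        }
      }
      return null;  // no rule matched: reject, like the existing regex filter
    }
  }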
Re: Much faster RegExp lib needed in nutch?
> > > 1. Keeps the well-known perl syntax for regexp (and then find a way to
> > > "simulate" them with automaton "limited" syntax)?
> > My vote would be for option 1. It's less work for everyone
> > (except for the person incorporating the new library :)
>
> That's my preferred solution too.
> The first challenge is to see how to translate the regexps used in the
> default regexp-urlfilter templates provided by Nutch.
> For now, the only thing I don't see how to translate from perl syntax to
> dk.brics.automaton syntax is this regexp:
> -.*(/.+?)/.*?\1/.*?\1/.*
> In fact, automaton doesn't support capturing groups (Anders Moeller has
> confirmed). We cannot remove this regexp from urlfilter, but we cannot
> handle it with automaton. So, two solutions:
> 1. Keep java regexp ...
> 2. Switch to automaton and provide a java implementation of this regexp
> (it is more a protection pattern than really a filter pattern, and it
> could probably be hard-coded).
> I'm waiting for your suggestions...

I've pinged Terence Parr - ANTLR author. I heard that the new version
(ANTLR 3) has a fast FSM inside it. If so, somebody could write an ANTLR
grammar to convert the Nutch regex into another ANTLR grammar that, when
processed by ANTLR, creates a URL parser/validator. It's almost too
easy... :)

Anyway, waiting to hear back from Ter.

-- Ken

--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
Re: Crawling Accuracy
> I have posted this before on "nutch user", but since that time I have
> done some additional testing and I feel that this has more to do with
> developers.
>
> I have about 450 seed sites (in the quality and environmental areas) and
> I used the crawl method (Nutch 0.7.1.x) to depth 4, and then used the
> whole-web method to depth 6 and some more sites (in this case not all)
> to depth 7. I restrained the outlinks to 50, used the default
> crawl-urlfilter (+^http://([a-z0-9]*\.)*NAMEOFSITE/, one for every site)
> and got about 523,000 pages.
>
> Doing some searches I noted that I only got few results for some terms.
> For instance "nureg", a document used by the Nuclear Regulatory
> Commission (NRC), yielded only a little more than 20 documents (there
> are more than 3,000 of them). Then I tried "site:www.nrc.gov http", and
> found only 82 pages. This site has more than 10,000 pages! I tried
> site:www.epa.gov http and only got 2413 pages (also, this site has more
> than 10,000 pages). The results were similar for other very large (and
> not dynamic) sites.
>
> Experimenting further I crawled, using the crawl method, depth 7, only
> some sites, one at a time. For instance, http://www.nrc.gov/ with the
> filter +^http://([a-z0-9]*\.)*nrc.gov/ (and -.), increasing
> "http.max.delays" to 10 and "http.timeout" to 2, and the results were
> very poor: looking for "http" resulted in only 58 results. Searching for
> "nureg" I only found 13 results, but for "adobe" (which should be
> blocked by the filter, but maybe not by the "outlinks rule", I don't
> know) I got 4. Performing the same testing on other sites, like
> www.epa.gov, www.iaea.org, and www.iso.org, the results were very
> similar: a very, very small percentage of the site pages indexed.
>
> So I am posting those results, which may constitute, in themselves, an
> issue that should perhaps be dealt with. Maybe this is not a problem if
> you try to index the whole web, I don't know, but for niche sites, like
> mine, it seems to be.

I think you're probably running into the limited number of domains problem
that many vertical crawlers encounter. The default Nutch settings are for a
maximum of one fetcher thread per domain. This is the safe setting for
polite crawling, unless you enjoy getting blacklisted :)

So if you have only a few domains (e.g. just one for your test case of just
nrc.gov), you're going to get a lot of retry timeout errors as threads
"block" because another thread is already fetching a page from the same
domain. Which means that your effective throughput per domain is going to
be limited to the rate at which individual pages can be downloaded,
including the delay that your Nutch configuration specifies between each
request.

If you assume a page takes 1 second to download (counting connection setup
time), plus there's a 5 second delay between requests, you're getting 10
pages/minute from any given domain. If you have 10M domains, no problem,
but if you only have a limited number of domains, you run into
inefficiencies in how Nutch handles fetcher threads that will severely
constrain your crawl performance.

We're in the middle of a project to improve throughput in this kind of
environment, but haven't yet finished.

-- Ken

--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
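A back-of-the-envelope sketch of the arithmetic above, using the example
numbers from this mail (1 second per page plus a 5 second politeness
delay); nothing here is actual Nutch code, and the figures are only the
illustrative values quoted above.

  // Rough per-domain throughput estimate for a polite, one-thread-per-domain crawl.
  public class CrawlThroughput {
    public static void main(String[] args) {
      double downloadSeconds = 1.0;   // time to fetch one page, incl. connection setup
      double delaySeconds = 5.0;      // configured politeness delay between requests
      double pagesPerMinute = 60.0 / (downloadSeconds + delaySeconds);

      int pagesOnSite = 10000;        // e.g. a site the size of www.nrc.gov
      double hours = pagesOnSite / pagesPerMinute / 60.0;

      System.out.printf("~%.0f pages/minute per domain -> ~%.0f hours for %d pages%n",
                        pagesPerMinute, hours, pagesOnSite);
      // With these numbers: ~10 pages/minute, so roughly 17 hours of continuous
      // fetching for 10,000 pages from a single domain - which is why a short,
      // single-site crawl looks "incomplete" long before it actually is.
    }
  }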
Re: Much faster RegExp lib needed in nutch?
> > 1. Keeps the well-known perl syntax for regexp (and then find a way to
> > "simulate" them with automaton "limited" syntax)?
> My vote would be for option 1. It's less work for everyone
> (except for the person incorporating the new library :)

That's my preferred solution too.
The first challenge is to see how to translate the regexps used in the
default regexp-urlfilter templates provided by Nutch.
For now, the only thing I don't see how to translate from perl syntax to
dk.brics.automaton syntax is this regexp:
-.*(/.+?)/.*?\1/.*?\1/.*
In fact, automaton doesn't support capturing groups (Anders Moeller has
confirmed). We cannot remove this regexp from urlfilter, but we cannot
handle it with automaton. So, two solutions:
1. Keep java regexp ...
2. Switch to automaton and provide a java implementation of this regexp (it
is more a protection pattern than really a filter pattern, and it could
probably be hard-coded).
I'm waiting for your suggestions...

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
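For concreteness, a small java.util.regex demonstration (not from the
thread) of what that rule's capturing group and \1 backreferences do: they
spot a path segment that repeats, the classic symptom of a crawler trap.
The example URLs are made up. A DFA-based library like dk.brics.automaton
has no equivalent of \1, which is why this rule cannot be translated
directly.

  import java.util.regex.Pattern;

  public class BackreferenceDemo {
    public static void main(String[] args) {
      // The filter rule minus its leading '-' sign: the group (/.+?) captures
      // one path segment and \1 requires that exact segment to appear twice more.
      Pattern trap = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/.*");

      String looping = "http://example.org/a/foo/b/foo/c/foo/d";   // hypothetical URL
      String normal  = "http://example.org/a/b/c/d";               // hypothetical URL

      System.out.println(trap.matcher(looping).matches());  // true: "/foo" repeats
      System.out.println(trap.matcher(normal).matches());   // false
    }
  }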
Crawling Accuracy
I have posted this before on "nutch user", but since that time I have done
some additional testing and I feel that this has more to do with
developers.

I have about 450 seed sites (in the quality and environmental areas) and I
used the crawl method (Nutch 0.7.1.x) to depth 4, and then used the
whole-web method to depth 6 and some more sites (in this case not all) to
depth 7. I restrained the outlinks to 50, used the default crawl-urlfilter
(+^http://([a-z0-9]*\.)*NAMEOFSITE/, one for every site) and got about
523,000 pages.

Doing some searches I noted that I only got few results for some terms. For
instance "nureg", a document used by the Nuclear Regulatory Commission
(NRC), yielded only a little more than 20 documents (there are more than
3,000 of them). Then I tried "site:www.nrc.gov http", and found only 82
pages. This site has more than 10,000 pages! I tried site:www.epa.gov http
and only got 2413 pages (also, this site has more than 10,000 pages). The
results were similar for other very large (and not dynamic) sites.

Experimenting further I crawled, using the crawl method, depth 7, only some
sites, one at a time. For instance, http://www.nrc.gov/ with the filter
+^http://([a-z0-9]*\.)*nrc.gov/ (and -.), increasing "http.max.delays" to
10 and "http.timeout" to 2, and the results were very poor: looking for
"http" resulted in only 58 results. Searching for "nureg" I only found 13
results, but for "adobe" (which should be blocked by the filter, but maybe
not by the "outlinks rule", I don't know) I got 4. Performing the same
testing on other sites, like www.epa.gov, www.iaea.org, and www.iso.org,
the results were very similar: a very, very small percentage of the site
pages indexed.

So I am posting those results, which may constitute, in themselves, an
issue that should perhaps be dealt with. Maybe this is not a problem if you
try to index the whole web, I don't know, but for niche sites, like mine,
it seems to be.

Thanks
[jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever
[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12370686 ]

Stefan Groschupf commented on NUTCH-233:
----------------------------------------

Sorry, I don't have such a url, since this happens while reducing a fetch.
Reducing provides no logging, and the map data are deleted if the job fails
because of a timeout. :(

> wrong regular expression hang reduce process for ever
> ------------------------------------------------------
>
>          Key: NUTCH-233
>          URL: http://issues.apache.org/jira/browse/NUTCH-233
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>     Priority: Blocker
>      Fix For: 0.8-dev
>
> Looks like the expression ".*(/.+?)/.*?\1/.*?\1/" in regex-urlfilter.txt
> isn't compatible with java.util.regex, which is actually used in the
> regex url filter.
> Maybe it was missed when the regular expression package was changed.
> The problem was that, while reducing a fetch map output, the reducer hangs
> forever, since the output format was applying the urlfilter to a url that
> causes the hang.
> 060315 230823 task_r_3n4zga at
> java.lang.Character.codePointAt(Character.java:2335)
> 060315 230823 task_r_3n4zga at
> java.util.regex.Pattern$Dot.match(Pattern.java:4092)
> 060315 230823 task_r_3n4zga at
> java.util.regex.Pattern$Curly.match1(Pattern.java:
> I changed the regular expression to ".*(/[^/]+)/[^/]+\1/[^/]+\1/" and now
> the fetch job works. (Thanks to Grant and Chris B. for helping to find the
> new regex.)
> However, people may want to review it and suggest improvements. The old
> regex would match "abcd/foo/bar/foo/bar/foo/", and so will the new one.
> But the old regex would also match "abcd/foo/bar/xyz/foo/bar/foo/", which
> the new regex will not match.
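A small, hypothetical check of the behavioural difference described in the
issue, using the two example strings from the report; this is not part of
the issue or of any Nutch test suite.

  import java.util.regex.Pattern;

  public class RegexFixCheck {
    public static void main(String[] args) {
      // Old rule: lazy quantifiers plus backreferences - prone to heavy backtracking.
      Pattern oldRule = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");
      // New rule: "[^/]+" segments cannot straddle slashes, which bounds backtracking.
      Pattern newRule = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

      String direct   = "abcd/foo/bar/foo/bar/foo/";      // repeated segment pair
      String indirect = "abcd/foo/bar/xyz/foo/bar/foo/";  // repetition with a gap

      System.out.println(oldRule.matcher(direct).find());    // true
      System.out.println(newRule.matcher(direct).find());    // true
      System.out.println(oldRule.matcher(indirect).find());  // true
      System.out.println(newRule.matcher(indirect).find());  // false, as noted above
    }
  }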
[jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever
[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12370685 ]

Jerome Charron commented on NUTCH-233:
---------------------------------------

Stefan,
I have created a small unit test for urlfilter-regexp and I don't notice
any incompatibility in java.util.regex with this regexp.
Could you please provide the urls that cause the problem so that I can add
them to my unit tests?
Thanks

Jérôme

> wrong regular expression hang reduce process for ever
> ------------------------------------------------------
>
>          Key: NUTCH-233
>          URL: http://issues.apache.org/jira/browse/NUTCH-233
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>     Priority: Blocker
>      Fix For: 0.8-dev
>
> Looks like the expression ".*(/.+?)/.*?\1/.*?\1/" in regex-urlfilter.txt
> isn't compatible with java.util.regex, which is actually used in the
> regex url filter.
> Maybe it was missed when the regular expression package was changed.
> The problem was that, while reducing a fetch map output, the reducer hangs
> forever, since the output format was applying the urlfilter to a url that
> causes the hang.
> 060315 230823 task_r_3n4zga at
> java.lang.Character.codePointAt(Character.java:2335)
> 060315 230823 task_r_3n4zga at
> java.util.regex.Pattern$Dot.match(Pattern.java:4092)
> 060315 230823 task_r_3n4zga at
> java.util.regex.Pattern$Curly.match1(Pattern.java:
> I changed the regular expression to ".*(/[^/]+)/[^/]+\1/[^/]+\1/" and now
> the fetch job works. (Thanks to Grant and Chris B. for helping to find the
> new regex.)
> However, people may want to review it and suggest improvements. The old
> regex would match "abcd/foo/bar/foo/bar/foo/", and so will the new one.
> But the old regex would also match "abcd/foo/bar/xyz/foo/bar/foo/", which
> the new regex will not match.