This was where ours was freezing as well. I don't know why that regex causes it, other than that it is greedy and backtracks heavily. I will try to take a look at the JS parser when I get some time.
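For what it's worth, the 200 ms cut-off Mike suggests below can be implemented without a per-rule watchdog: Java's regex engine calls charAt() on the input for every character it examines, including while backtracking, so wrapping the URL in a CharSequence that checks the thread's interrupt flag lets a runaway match be aborted from outside. A rough sketch, not Nutch code — class and method names here are made up, and the second example uses a known-pathological pattern (a+)+b as a stand-in for our rule, since its blow-up is guaranteed:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Pattern;

public class TimedRegex {

    // The regex engine reads the input only through charAt(), so checking
    // the interrupt flag there gives us a hook to abort a runaway match.
    static class InterruptibleCharSequence implements CharSequence {
        private final CharSequence inner;
        InterruptibleCharSequence(CharSequence inner) { this.inner = inner; }
        public char charAt(int index) {
            if (Thread.currentThread().isInterrupted()) {
                throw new RuntimeException("regex match interrupted");
            }
            return inner.charAt(index);
        }
        public int length() { return inner.length(); }
        public CharSequence subSequence(int start, int end) {
            return new InterruptibleCharSequence(inner.subSequence(start, end));
        }
        public String toString() { return inner.toString(); }
    }

    /** Runs find() with a deadline; returns null when the match timed out. */
    static Boolean timedFind(Pattern p, String input, long timeoutMs)
            throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<Boolean> f = pool.submit(
                () -> p.matcher(new InterruptibleCharSequence(input)).find());
            try {
                return f.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                f.cancel(true); // interrupt -> charAt() throws -> matcher aborts
                return null;
            } catch (ExecutionException e) {
                return null;    // the interrupt surfaced as an exception
            }
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A well-behaved match finishes normally.
        System.out.println(timedFind(Pattern.compile("/site/\\d+"),
                "http://www.foreverwomen.com/site/724586/", 200)); // true

        // A catastrophically backtracking pattern is cut off after 200 ms
        // instead of hanging the reducer.
        StringBuilder as = new StringBuilder();
        for (int i = 0; i < 40; i++) as.append('a');
        System.out.println(timedFind(Pattern.compile("(a+)+b"),
                as.toString(), 200)); // null
    }
}
```

This only rejects the pathological URL instead of fixing the rule itself; the rule would still need rewriting (or the js-parser fixing) for a real cure.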
Dennis Mike Smith wrote:
> Thank you guys for the hints and help. I managed to find the root of
> the problem. When the reducer got into the freezing state I dumped the
> core, and here are the results:
>
>   at java.util.regex.Pattern$Curly.match1(Pattern.java:4250)
>   at java.util.regex.Pattern$Curly.match(Pattern.java:4199)
>   at java.util.regex.Pattern$Single.match(Pattern.java:3314)
>   at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
>   at java.util.regex.Pattern$Curly.match1(Pattern.java:4250)
>   at java.util.regex.Pattern$Curly.match(Pattern.java:4199)
>   at java.util.regex.Pattern$Single.match(Pattern.java:3314)
>   at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
>   at java.util.regex.Pattern$Curly.match0(Pattern.java:4235)
>   at java.util.regex.Pattern$Curly.match(Pattern.java:4197)
>   at java.util.regex.Pattern$Start.match(Pattern.java:3019)
>   at java.util.regex.Matcher.search(Matcher.java:1092)
>   at java.util.regex.Matcher.find(Matcher.java:528)
>   at org.apache.nutch.urlfilter.regex.RegexURLFilter$Rule.match(RegexURLFilter.java:86)
>   at org.apache.nutch.urlfilter.api.RegexURLFilterBase.filter(RegexURLFilterBase.java:116)
>   - locked <0x00002aaafcc20468> (a org.apache.nutch.urlfilter.regex.RegexURLFilter)
>   at org.apache.nutch.net.URLFilters.filter(URLFilters.java:82)
>   at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:120)
>   at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:97)
>   at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:263)
>   at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:277)
>   at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1211)
>
> The reducers freeze because the output format applies the URL filter at
> this stage.
> I added log info to the output format class in order to catch the URLs
> that cause the problem. With 4,500,000 URLs across 9 machines it was a
> real pain to catch them; here are 3 trouble-making URLs:
>
> http://www.modelpower.com/
> http://www.discountedboots.com/
> http://www.foreverwomen.com/site/724586/
>
> Then I tried a local crawl using these URLs and put some logging at
> RegexURLFilter.java:86. I caught the regex (-.*(/.+?)/.*?\1/.*?\1/)
> taking more than 10 minutes. The problem is that the JavaScript parser
> parses some bogus links like this:
>
> http://www.discountedboots.com/<SELECT%20%20NAME%3D%22EDIT_BROWSE%22
> ………
>
> These links are very, very long and contain a lot of '/' characters.
> They are created from script calls like these:
>
>   drawBrowseMenu('<SELECT%20%20NAME%3D%22EDIT_BROWSE%22>........
>   makeDropBox('<SELECT%20%20NAME%3D%22EDIT_BROWSE%22>........
>
> This was the cause on every page I managed to catch.
>
> I am not sure what it is about (-.*(/.+?)/.*?\1/.*?\1/) that causes
> such a long delay, or an infinite loop. At the least, I guess the
> js-parser needs to be fixed to ignore these things. Alternatively, we
> could have a timer thread in RegexURLFilter.java so that when filtering
> a URL takes more than 200 ms, the URL is rejected.
>
> What do you guys think?
>
> Thanks, Mike
>
> On 10/18/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>> Mike Smith wrote:
>> > I am in the same state again, and the same reduce jobs keep failing
>> > on different machines. I cannot get the dump using 'kill -3 <pid>';
>> > it does not make the thread quit. Also, I tried to place some logging
>> > into FetcherOutputFormat, but because of this bug:
>> >
>> > https://issues.apache.org/jira/browse/HADOOP-406
>> >
>> > logging is not possible in the child threads. Do you have any idea
>> > why the reducers don't catch the QUIT signal. I am running the
>> > latest version from SVN; otherwise I could log some key/value and
>> > URL-filtering information at the reduce stage.
>>
>> SIGQUIT should not make the JVM quit; it should produce a thread dump
>> on stderr. You need to manually pick out the process that corresponds
>> to the child JVM of the task, e.g. with top(1) or ps(1), and then
>> execute 'kill -SIGQUIT <pid>'.
>>
>> You can use Hadoop's log4j.properties to quickly enable a lot of log
>> info, including stderr - put it in conf on every tasktracker and
>> restart the cluster.
>>
>> --
>> Best regards,
>> Andrzej Bialecki <><
>>  ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
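A follow-up note for anyone chasing this later: the thread-dump procedure Andrzej describes can be scripted. A minimal sketch, assuming the child task JVM's command line contains TaskTracker$Child (the awk pattern is a guess, adjust to what ps shows on your boxes):

```shell
# SIGQUIT (signal 3) makes a HotSpot JVM dump all thread stacks to stderr
# without exiting, which is why 'kill -3 <pid>' and 'kill -QUIT <pid>' are
# equivalent. Locate the first matching child JVM and ask it for a dump.
PID=$(ps -eo pid,args | awk '/TaskTracker\$Child/ && !/awk/ { print $1; exit }')
[ -n "$PID" ] && kill -QUIT "$PID"
```

The dump lands in the task's stderr log, which is where the Pattern$Curly stack above came from.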
