Thank you guys for the hints and help. I managed to find the root of the
problem. When the reducer got into the freezing state I dumped the core, and
here are the results:
at java.util.regex.Pattern$Curly.match1(Pattern.java:4250)
at java.util.regex.Pattern$Curly.match(Pattern.java:4199)
at java.util.regex.Pattern$Single.match(Pattern.java:3314)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
at java.util.regex.Pattern$Curly.match1(Pattern.java:4250)
at java.util.regex.Pattern$Curly.match(Pattern.java:4199)
at java.util.regex.Pattern$Single.match(Pattern.java:3314)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4235)
at java.util.regex.Pattern$Curly.match(Pattern.java:4197)
at java.util.regex.Pattern$Start.match(Pattern.java:3019)
at java.util.regex.Matcher.search(Matcher.java:1092)
at java.util.regex.Matcher.find(Matcher.java:528)
at org.apache.nutch.urlfilter.regex.RegexURLFilter$Rule.match(RegexURLFilter.java:86)
at org.apache.nutch.urlfilter.api.RegexURLFilterBase.filter(RegexURLFilterBase.java:116)
- locked <0x00002aaafcc20468> (a org.apache.nutch.urlfilter.regex.RegexURLFilter)
at org.apache.nutch.net.URLFilters.filter(URLFilters.java:82)
at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:120)
at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:97)
at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:263)
at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:277)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1211)
The reducers freeze because the output format applies the URL filters at this
stage. I added log statements to the output format class in order to catch the
URLs that cause the problem. With 4,500,000 URLs spread over 9 machines it was
a real pain to track them down, but here are 3 of the troublemaking URLs:
http://www.modelpower.com/
http://www.discountedboots.com/
http://www.foreverwomen.com/site/724586/
Then I ran a local crawl on these URLs with some logging at
RegexURLFilter.java:86, and I caught the regex (-.*(/.+?)/.*?\1/.*?\1/)
taking more than 10 minutes. The problem is that the JavaScript parser
produces bogus links like this:
http://www.discountedboots.com/<SELECT%20%20NAME%3D%22EDIT_BROWSE%22<http://www.discountedboots.com/%3cSELECT%20%20NAME%3D%22EDIT_BROWSE%22>>
………
These links are very, very long and have lots of slashes in them. They are
created from scripts like this:
drawBrowseMenu('<SELECT%20%20NAME%3D%22EDIT_BROWSE%22>........
makeDropBox('<SELECT%20%20NAME%3D%22EDIT_BROWSE%22>........
And this cause held on every page I was able to catch.
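If I read the rule right, the leading "-" in regex-urlfilter.txt is the deny action and the actual pattern is .*(/.+?)/.*?\1/.*?\1/, i.e. "reject a URL whose path repeats some segment three times" (a crawler-trap heuristic). On a bogus link with dozens of slashes, the lazy .*? parts plus the \1 backreferences give the engine combinatorially many ways to split the path, which is the classic catastrophic-backtracking picture. A quick sketch of what the pattern part matches (class name is mine, not Nutch's):

```java
import java.util.regex.Pattern;

public class TrapRuleDemo {
    public static void main(String[] args) {
        // Pattern part of the deny rule: a segment captured by (/.+?)
        // must appear three times in the path, e.g. /a/b/a/b/a/b/.
        Pattern trap = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");

        System.out.println(trap.matcher("http://host/a/b/a/b/a/b/").find()); // true
        System.out.println(trap.matcher("http://host/a/b/c/").find());       // false
    }
}
```

On a short, sane path this is cheap; on the bogus <SELECT...> links every extra slash multiplies the splits the engine has to try before it can fail.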
I am not sure what it is about (-.*(/.+?)/.*?\1/.*?\1/) that causes this long
delay or infinite loop! At the least, I guess the js-parser needs to be fixed
to ignore these things. Alternatively, we could add a timer thread in
RegexURLFilter.java so that when filtering takes more than 200 ms, the URL is
rejected and we move on.
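To sketch the timer idea (all names here are mine, not actual Nutch code): java.util.regex does not notice Thread.interrupt() on its own, so the usual trick is to wrap the input in a CharSequence that checks the interrupt flag on each charAt(). A watchdog can then really abort a runaway match, and the URL is rejected instead of freezing the reducer:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Pattern;

public class TimedUrlFilter {

    // Input wrapper that aborts an in-progress match once the matching
    // thread is interrupted (plain Matcher.find() would keep going).
    static final class InterruptibleCharSequence implements CharSequence {
        private final CharSequence inner;
        InterruptibleCharSequence(CharSequence inner) { this.inner = inner; }
        @Override public char charAt(int index) {
            if (Thread.currentThread().isInterrupted()) {
                throw new RuntimeException("regex match timed out");
            }
            return inner.charAt(index);
        }
        @Override public int length() { return inner.length(); }
        @Override public CharSequence subSequence(int s, int e) {
            return new InterruptibleCharSequence(inner.subSequence(s, e));
        }
        @Override public String toString() { return inner.toString(); }
    }

    private static final ExecutorService POOL =
        Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true); // don't keep the JVM alive for filter threads
            return t;
        });

    /** Runs pattern.matcher(url).find() with a time budget; on timeout the
     *  match is aborted and the URL is treated as rejected (false). */
    public static boolean findWithTimeout(Pattern pattern, String url, long timeoutMs) {
        Callable<Boolean> task =
            () -> pattern.matcher(new InterruptibleCharSequence(url)).find();
        Future<Boolean> f = POOL.submit(task);
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true); // interrupt -> charAt() throws -> match aborts
            return false;
        } catch (Exception e) {
            return false;   // interrupted / match error: reject the URL
        }
    }
}
```

The 200 ms budget would probably want to be configurable, and a thread pool amortizes thread creation over the millions of URLs.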
What do you guys think?
Thanks. Mike
On 10/18/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
Mike Smith wrote:
> I am in the same state again, and the same reduce jobs keep failing on
> different machines. I cannot get the dump using kill -3 <pid>; it does
> not make the thread quit. Also, I tried to place some logging into
> FetcherOutputFormat, but because of this bug:
> https://issues.apache.org/jira/browse/HADOOP-406
> logging is not possible in the child threads. Do you have any idea why
> the reducers don't catch the QUIT signal? I am running the latest
> version from SVN; otherwise I could log some key/value and URL
> filtering information at the reduce stage.
SIGQUIT should not make the JVM quit, it should produce a thread dump on
stderr. You need to manually pick up the process that corresponds to the
child JVM of the task, e.g. with top(1) or ps(1), and then execute 'kill
-SIGQUIT <pid>'.
You can use Hadoop's log4j.properties to quickly enable a lot of log
info, including stderr - put it in conf on every tasktracker and restart
the cluster.
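For anyone following along, a minimal log4j.properties in that spirit might look like this (log4j 1.x syntax; exact logger names vary by Hadoop version), sending everything to stderr where the tasktracker captures it:

```
log4j.rootLogger=DEBUG,console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d %-5p %c - %m%n
```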
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general