Thank you guys for the hints and help. I managed to find the root of the
problem. When the reducer got into the frozen state, I took a thread dump,
and here are the results:



  at java.util.regex.Pattern$Curly.match1(Pattern.java:4250)
  at java.util.regex.Pattern$Curly.match(Pattern.java:4199)
  at java.util.regex.Pattern$Single.match(Pattern.java:3314)
  at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
  at java.util.regex.Pattern$Curly.match1(Pattern.java:4250)
  at java.util.regex.Pattern$Curly.match(Pattern.java:4199)
  at java.util.regex.Pattern$Single.match(Pattern.java:3314)
  at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
  at java.util.regex.Pattern$Curly.match0(Pattern.java:4235)
  at java.util.regex.Pattern$Curly.match(Pattern.java:4197)
  at java.util.regex.Pattern$Start.match(Pattern.java:3019)
  at java.util.regex.Matcher.search(Matcher.java:1092)
  at java.util.regex.Matcher.find(Matcher.java:528)
  at org.apache.nutch.urlfilter.regex.RegexURLFilter$Rule.match(RegexURLFilter.java:86)
  at org.apache.nutch.urlfilter.api.RegexURLFilterBase.filter(RegexURLFilterBase.java:116)
  - locked <0x00002aaafcc20468> (a org.apache.nutch.urlfilter.regex.RegexURLFilter)
  at org.apache.nutch.net.URLFilters.filter(URLFilters.java:82)
  at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:120)
  at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:97)
  at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:263)
  at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39)
  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:277)
  at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1211)





The reducers freeze because the output format applies the URL filters at
this stage. I added logging to the output format classes in order to catch
the URLs that cause the problem. With 4,500,000 URLs across 9 machines it
was a real pain to track them down, but here are 3 of the troublemaking
URLs:


http://www.modelpower.com/
http://www.discountedboots.com/
http://www.foreverwomen.com/site/724586/
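
By the way, for re-checking candidate URLs locally without another cluster
round, a small driver along these lines should work. This is just a sketch:
I am assuming URLFilters is built from a Configuration and exposes
filter(String), which is the call that shows in the trace above.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilters;
import org.apache.nutch.util.NutchConfiguration;

/** Feeds URLs through the same filter chain the reducer uses and times each. */
public class FilterCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        URLFilters filters = new URLFilters(conf);
        for (String url : args) {
            long start = System.currentTimeMillis();
            String result = filters.filter(url);   // null means the URL is rejected
            long ms = System.currentTimeMillis() - start;
            System.out.println((result == null ? "REJECTED " : "PASSED   ")
                + url + "  (" + ms + " ms)");
        }
    }
}

Run it with conf/ on the classpath; any URL that does not come back promptly
is a culprit.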





Then I ran a local crawl using these URLs with some logging added at
RegexURLFilter.java:86, and I could see that the regex
(-.*(/.+?)/.*?\1/.*?\1/) takes more than 10 minutes on some inputs. The
problem is that the JavaScript parser extracts bogus links like this:



http://www.discountedboots.com/<SELECT%20%20NAME%3D%22EDIT_BROWSE%22
………



These links are very long and have lots of slashes in them. They are
created from script calls like this:



drawBrowseMenu('<SELECT%20%20NAME%3D%22EDIT_BROWSE%22>........

makeDropBox('<SELECT%20%20NAME%3D%22EDIT_BROWSE%22>........



The same cause holds for every page I could catch.
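
To confirm that the pattern alone reproduces the stall, outside Nutch and
Hadoop, here is a standalone timing sketch against plain java.util.regex.
The synthetic URLs are mine: just long paths with many slashes and no path
segment repeated three times.

import java.util.regex.Pattern;

public class RegexBlowup {

    // The deny rule from regex-urlfilter.txt, minus the leading '-'
    // (the '-' is the rule sign meaning "reject", not part of the pattern).
    private static final Pattern RULE = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");

    public static void main(String[] args) {
        // Every "pN" segment appears exactly twice, so the pattern keeps
        // finding two-fold repeats but never the three-fold repeat it needs,
        // which forces heavy backtracking before each failure.
        for (int pairs = 40; pairs <= 160; pairs *= 2) {
            StringBuilder url = new StringBuilder("http://example.com");
            for (int i = 0; i < pairs; i++) {
                url.append("/p").append(i).append("/q").append(i)
                   .append("/p").append(i).append("/r").append(i);
            }
            long start = System.currentTimeMillis();
            boolean found = RULE.matcher(url).find();
            long elapsed = System.currentTimeMillis() - start;
            System.out.println(pairs + " pairs, " + url.length()
                + " chars: found=" + found + ", " + elapsed + " ms");
        }
    }
}

The exact numbers will vary by machine, but the time should grow
super-linearly as the URL gets longer, and the script-generated links above
are far longer than anything this sketch builds.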



I am not sure exactly what is wrong with (-.*(/.+?)/.*?\1/.*?\1/), but it
looks like catastrophic backtracking: three unbounded quantifiers plus two
backreferences mean that on a long URL with many slashes, and no path
segment repeated three times, the engine has to try a combinatorial number
of split points before it can fail, so it is an enormous delay rather than
a true infinite loop. At the least, I guess the js-parser needs to be fixed
to ignore these things. Or we could have a timer thread in
RegexURLFilter.java so that when filtering a URL takes more than 200 ms,
the URL is rejected and matching is aborted.
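
As a variation on the timer-thread idea, here is a minimal sketch that
bounds the match without an extra thread: java.util.regex reads its input
only through CharSequence.charAt(), so wrapping the URL in a CharSequence
that checks a deadline aborts a runaway match. The names are mine, not
existing Nutch code.

import java.util.regex.Pattern;

public class TimedMatch {

    static class MatchTimeoutException extends RuntimeException {
        MatchTimeoutException(String msg) { super(msg); }
    }

    /** CharSequence wrapper whose charAt() enforces a deadline. */
    static class TimedCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadline;   // in System.currentTimeMillis() terms

        static TimedCharSequence wrap(CharSequence s, long timeoutMillis) {
            return new TimedCharSequence(s, System.currentTimeMillis() + timeoutMillis);
        }

        private TimedCharSequence(CharSequence inner, long deadline) {
            this.inner = inner;
            this.deadline = deadline;
        }

        public char charAt(int index) {
            if (System.currentTimeMillis() > deadline) {
                throw new MatchTimeoutException("regex exceeded time budget");
            }
            return inner.charAt(index);
        }

        public int length() { return inner.length(); }

        public CharSequence subSequence(int start, int end) {
            return new TimedCharSequence(inner.subSequence(start, end), deadline);
        }

        public String toString() { return inner.toString(); }
    }

    /** find() with a budget; a URL whose match runs over is rejected. */
    static boolean findWithTimeout(Pattern p, String url, long timeoutMillis) {
        try {
            return p.matcher(TimedCharSequence.wrap(url, timeoutMillis)).find();
        } catch (MatchTimeoutException e) {
            return false;   // treat pathological URLs as filtered out
        }
    }
}

Since the dump shows filter() holding a lock on the RegexURLFilter
instance, bounding each match would also keep one bad URL from blocking
every other thread waiting on that lock.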



What do you guys think?



Thanks. Mike










On 10/18/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

Mike Smith wrote:
> I am in the same state again, and the same reduce jobs keep failing on
> different machines. I cannot get the dump using kill -3 pid; it does not
> make the thread quit. Also, I tried to place some logging into
> FetcherOutputFormat, but because of this bug:
> https://issues.apache.org/jira/browse/HADOOP-406
>
> logging is not possible in the child threads. Do you have any idea why
> the reducers don't catch the QUIT signal from the cache? I am running the
> latest version from SVN; otherwise I could log some key/value and URL
> filtering information at the reduce stage.

SIGQUIT should not make the JVM quit; it should produce a thread dump on
stderr. You need to manually pick out the process that corresponds to the
child JVM of the task, e.g. with top(1) or ps(1), and then execute 'kill
-SIGQUIT <pid>'.

You can use Hadoop's log4j.properties to quickly enable a lot of log
info, including stderr - put it in conf on every tasktracker and restart
the cluster.

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


