This was where ours was freezing as well. I don't know why the regex 
causes it, other than that it is greedy. I will try to take a look at the 
JS parser when I get some time.
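
For what it's worth, I think the two backreferences are the expensive 
part: on a long, slash-heavy string that never quite matches, the engine 
ends up trying every combination of group start, group length and 
positions for the two \1 occurrences before it can give up. Here is a 
throwaway demo of that growth; it is not Nutch code, the input is 
synthetic, and exact times will vary with the JDK:

import java.util.regex.Pattern;

// Throwaway demo (not Nutch code): why -.*(/.+?)/.*?\1/.*?\1/ can take
// so long on long, slash-heavy strings that never quite match. The two
// backreferences force the engine to try every combination of group
// start, group length and \1 positions before it can report "no match".
public class SlowRegexDemo {
  public static void main(String[] args) {
    // Same expression as the urlfilter rule, minus the leading '-'
    // (that is the rule file's "exclude" marker, not part of the regex).
    Pattern p = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");

    // Synthetic slash-heavy "URLs" with no repeated segment, loosely
    // modelled on the bogus script-derived links quoted below.
    for (int segments = 8; segments <= 20; segments += 4) {
      StringBuilder sb = new StringBuilder("http://example.com");
      for (int i = 0; i < segments; i++) {
        sb.append("/seg").append(i);
      }
      String url = sb.toString();

      long start = System.currentTimeMillis();
      boolean matched = p.matcher(url).find();
      long elapsed = System.currentTimeMillis() - start;
      // Bump the segment count with caution: the times climb steeply,
      // and the huge script-derived links run for many minutes.
      System.out.println(segments + " segments: matched=" + matched
          + " in " + elapsed + " ms");
    }
  }
}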

Dennis

Mike Smith wrote:
> Thank you guys for the hints and help. I managed to find the root of the
> problem. When the reducer got into the freezing state I dumped the core,
> and here are the results:
>
>   at java.util.regex.Pattern$Curly.match1(Pattern.java:4250)
>   at java.util.regex.Pattern$Curly.match(Pattern.java:4199)
>   at java.util.regex.Pattern$Single.match(Pattern.java:3314)
>   at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
>   at java.util.regex.Pattern$Curly.match1(Pattern.java:4250)
>   at java.util.regex.Pattern$Curly.match(Pattern.java:4199)
>   at java.util.regex.Pattern$Single.match(Pattern.java:3314)
>   at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
>   at java.util.regex.Pattern$Curly.match0(Pattern.java:4235)
>   at java.util.regex.Pattern$Curly.match(Pattern.java:4197)
>   at java.util.regex.Pattern$Start.match(Pattern.java:3019)
>   at java.util.regex.Matcher.search(Matcher.java:1092)
>   at java.util.regex.Matcher.find(Matcher.java:528)
>   at org.apache.nutch.urlfilter.regex.RegexURLFilter$Rule.match(RegexURLFilter.java:86)
>   at org.apache.nutch.urlfilter.api.RegexURLFilterBase.filter(RegexURLFilterBase.java:116)
>   - locked <0x00002aaafcc20468> (a org.apache.nutch.urlfilter.regex.RegexURLFilter)
>   at org.apache.nutch.net.URLFilters.filter(URLFilters.java:82)
>   at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:120)
>   at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:97)
>   at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:263)
>   at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:277)
>   at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1211)
>
> The reducers freeze because the output format applies the URL filters at
> this stage. I added log statements to the output format class in order to
> catch the URLs that cause the problem. With 4,500,000 URLs spread over 9
> machines it was a real pain to catch them, but here are 3 of the
> trouble-making URLs:
>
>
> http://www.modelpower.com/
> http://www.discountedboots.com/
> http://www.foreverwomen.com/site/724586/
>
> Then I tried a local crawl using these URLs and put some logging at
> RegexURLFilter.java:86. I could see that the regex (-.*(/.+?)/.*?\1/.*?\1/)
> takes more than 10 min on them. The problem is that the JavaScript parser
> produces some bogus links like this:
>
> http://www.discountedboots.com/<SELECT%20%20NAME%3D%22EDIT_BROWSE%22<http://www.discountedboots.com/%3cSELECT%20%20NAME%3D%22EDIT_BROWSE%22>>
>  
>
> ………
>
> These links are very, very long and have lots of / characters in them.
> They are created from scripts like this:
>
> drawBrowseMenu('<SELECT%20%20NAME%3D%22EDIT_BROWSE%22>........
>
> makeDropBox('<SELECT%20%20NAME%3D%22EDIT_BROWSE%22>........
>
> And this was the case for all of the pages I could catch.
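
One cheap way to make the link extractor ignore these, just as an 
untested sketch: drop any candidate link that still contains markup 
characters (or their percent-encoded forms) or that is absurdly long, 
before it ever reaches the URL filters. The class and method names and 
the 512-character cap below are made up, not existing Nutch code:

// Sketch only: a guard a link extractor could run on script-derived
// candidate URLs before handing them to the URL filters. The names and
// the 512-character cap are arbitrary choices for illustration.
public class OutlinkSanityCheck {

  private static final int MAX_URL_LENGTH = 512;   // arbitrary cap

  public static boolean isPlausibleOutlink(String url) {
    if (url == null || url.length() > MAX_URL_LENGTH) {
      return false;                                // absurdly long "URL"
    }
    // Raw markup leaking out of JavaScript arguments is not a real link,
    // e.g. "<SELECT%20%20NAME%3D%22EDIT_BROWSE%22".
    if (url.indexOf('<') >= 0 || url.indexOf('>') >= 0
        || url.indexOf('"') >= 0 || url.indexOf(' ') >= 0) {
      return false;
    }
    // The same markup, but percent-encoded.
    String upper = url.toUpperCase();
    return upper.indexOf("%3C") < 0 && upper.indexOf("%3E") < 0
        && upper.indexOf("%22") < 0;
  }

  public static void main(String[] args) {
    System.out.println(isPlausibleOutlink(
        "http://www.modelpower.com/"));            // true
    System.out.println(isPlausibleOutlink(
        "http://www.discountedboots.com/<SELECT%20%20NAME%3D%22EDIT_BROWSE%22"));  // false
  }
}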
>
> I am not sure what it is about (-.*(/.+?)/.*?\1/.*?\1/) that causes this
> long delay or infinite loop! At the least, I guess the js-parser needs to
> be fixed so that it ignores these things. Or we could have a timer thread
> in RegexURLFilter.java so that when the filtering takes more than 200ms
> it rejects the URL and exits.
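
And the 200ms cut-off might not even need a separate timer thread: the 
matcher reads its input through the CharSequence it is given, so a 
wrapper whose charAt() checks a deadline and throws will make a runaway 
match abort itself. A rough sketch of that approach; the class names are 
mine, nothing here exists in Nutch, and wiring it into RegexURLFilter is 
left out:

import java.util.regex.Pattern;

// Sketch of the 200 ms cut-off without a dedicated timer thread:
// java.util.regex reads its input through the CharSequence it is handed,
// so a wrapper whose charAt() enforces a deadline makes a runaway match
// abort itself. None of these names exist in Nutch.
public class TimedRegexMatch {

  /** Thrown from charAt() once the deadline has passed. */
  static class MatchTimeoutException extends RuntimeException {}

  /** A CharSequence that refuses to be read after a deadline. */
  static class DeadlineCharSequence implements CharSequence {
    private final CharSequence delegate;
    private final long deadline;              // absolute time, in ms

    DeadlineCharSequence(CharSequence delegate, long deadline) {
      this.delegate = delegate;
      this.deadline = deadline;
    }

    public char charAt(int index) {
      if (System.currentTimeMillis() > deadline) {
        throw new MatchTimeoutException();
      }
      return delegate.charAt(index);
    }

    public int length() {
      return delegate.length();
    }

    public CharSequence subSequence(int start, int end) {
      return new DeadlineCharSequence(delegate.subSequence(start, end), deadline);
    }
  }

  // Runs find() against the URL, giving up after timeoutMillis. A timeout
  // is reported as "no match" here; the filter itself would probably log
  // it and reject the URL outright, as suggested above.
  public static boolean findWithTimeout(Pattern pattern, String url, long timeoutMillis) {
    CharSequence guarded =
        new DeadlineCharSequence(url, System.currentTimeMillis() + timeoutMillis);
    try {
      return pattern.matcher(guarded).find();
    } catch (MatchTimeoutException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    Pattern p = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");
    String bogus = "http://example.com/a0/a1/a2/a3/a4/a5/a6/a7/a8/a9/b0/b1/b2";
    System.out.println(findWithTimeout(p, bogus, 200));
  }
}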
>
> What do you guys think?
>
> Thanks. Mike
>
> On 10/18/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>>
>> Mike Smith wrote:
>> > I am in the same state again, and the same reduce jobs keep failing on
>> > different machines. I cannot get the dump using kill -3 pid; it does
>> > not make the thread quit. Also, I tried to place some logging into
>> > FetcherOutputFormat, but because of this bug:
>> > https://issues.apache.org/jira/browse/HADOOP-406
>> > logging is not possible in the child threads. Do you have any idea why
>> > the reducers don't catch the QUIT signal from the cache? I am running
>> > the latest version from SVN; otherwise I could log some key/value and
>> > URL filtering information at the reduce stage.
>>
>> SIGQUIT should not make the JVM quit; it should produce a thread dump on
>> stderr. You need to manually pick out the process that corresponds to the
>> child JVM of the task, e.g. with top(1) or ps(1), and then execute 'kill
>> -SIGQUIT <pid>'.
>>
>> You can use Hadoop's log4j.properties to quickly enable a lot of log
>> info, including stderr - put it in conf on every tasktracker and restart
>> the cluster.
>>
>> -- 
>> Best regards,
>> Andrzej Bialecki     <><
>> ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>>
>>
>>
>
