[ https://issues.apache.org/jira/browse/NUTCH-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938945#comment-13938945 ]

ysc commented on NUTCH-1739:
----------------------------

Thanks, [~wastl-nagel], you are right, I just saw it now.
In the map method of org.apache.nutch.parse.ParseSegment.java:
{code:java}
    ParseResult parseResult = null;
    try {
      if (parseUtil == null) 
        parseUtil = new ParseUtil(getConf());
      parseResult = parseUtil.parse(content);
    } catch (Exception e) {
      LOG.warn("Error parsing: " + key + ": " + StringUtils.stringifyException(e));
      return;
    }
{code}
But this still does not limit the size of the thread pool. It can spawn a large
number of threads, leading to high memory usage and frequent GC; worse, it can
cause an OutOfMemoryError.
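To make the difference concrete, here is a small standalone sketch (not Nutch code; the class and task names are made up for illustration) showing why an unbounded cached pool is dangerous under blocked tasks, while a fixed pool stays bounded:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

public class PoolGrowthDemo {
    public static void main(String[] args) throws Exception {
        int tasks = 50;
        CountDownLatch release = new CountDownLatch(1);

        // A cached pool creates a new thread whenever no idle worker is
        // available; with 50 blocked tasks it grows to 50 threads.
        ThreadPoolExecutor cached =
            (ThreadPoolExecutor) Executors.newCachedThreadPool();
        for (int i = 0; i < tasks; i++) {
            cached.submit(() -> {
                try { release.await(); } catch (InterruptedException e) { }
            });
        }
        System.out.println("cached pool threads: " + cached.getPoolSize());

        // A fixed pool queues the excess tasks instead of spawning threads.
        ThreadPoolExecutor fixed =
            (ThreadPoolExecutor) Executors.newFixedThreadPool(4);
        for (int i = 0; i < tasks; i++) {
            fixed.submit(() -> {
                try { release.await(); } catch (InterruptedException e) { }
            });
        }
        System.out.println("fixed pool threads: " + fixed.getPoolSize());

        release.countDown();
        cached.shutdown();
        fixed.shutdown();
    }
}
```

With one such pool created per map() call and never shut down, the thread count multiplies accordingly until thread creation fails.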

> ExecutorService field in ParseUtil.java not be right use and cause memory leak
> ------------------------------------------------------------------------------
>
>                 Key: NUTCH-1739
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1739
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.6, 2.1, 1.7, 2.2, 1.8, 2.2.1
>         Environment: JDK32, runtime/local
>            Reporter: ysc
>            Priority: Critical
>         Attachments: nutch1.7.patch, nutch2.2.1.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> ########################Problem########################
> java.lang.Exception: java.lang.OutOfMemoryError: unable to create new native thread
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
> Caused by: java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:640)
>         at java.util.concurrent.ThreadPoolExecutor.addThread(ThreadPoolExecutor.java:681)
>         at java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
>         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:655)
>         at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92)
>         at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159)
>         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
>         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
>         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
>         at java.lang.Thread.run(Thread.java:662)
> ########################Analysis########################
> My server uses a 32-bit JDK. At first I thought not enough memory was
> specified. The test {{java -Xmx2600m -version}} passed, so I knew the server
> can use at most about 2.6G of heap. I then added the line
> {{NUTCH_HEAPSIZE=2000}} to the bin/nutch script, but that did not solve the
> problem.
> Then I checked the source code to see where so many threads were being
> produced, and found the code
> {code:java}
>  parseResult = new ParseUtil(getConf()).parse(content); 
> {code}
> at line 97, in the map method of the java source file
> org.apache.nutch.parse.ParseSegment.java.
> Continuing into the constructor of ParseUtil: it instantiates a cached
> thread pool with no limit on the pool size, see the code:
> {code:java}
> executorService = Executors.newCachedThreadPool(new ThreadFactoryBuilder()
>       .setNameFormat("parse-%d").setDaemon(true).build());
> {code}
> From the above analysis: every call to the map method instantiates a new
> cached thread pool and never closes it. So the ExecutorService field in
> ParseUtil.java is not used correctly and causes a memory leak.
> ########################Solution########################
> Every map method uses one shared FixedThreadPool whose size can be
> configured in nutch-site.xml; see the patch files for details.
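The shared, bounded pool described above can be sketched roughly as follows. This is only an illustration, not the attached patch: the class name, the lazy double-checked initialization, and passing the size as a plain parameter (instead of reading it from nutch-site.xml) are all assumptions here.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SharedParsePool {
    // One bounded pool shared by all callers, created lazily.
    private static volatile ExecutorService pool;

    // In the real fix the size would come from nutch-site.xml; here it
    // is simply a parameter.
    static ExecutorService get(int poolSize) {
        if (pool == null) {
            synchronized (SharedParsePool.class) {
                if (pool == null) {
                    pool = Executors.newFixedThreadPool(poolSize);
                }
            }
        }
        return pool;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService a = get(4);
        ExecutorService b = get(4);
        // Repeated callers (e.g. successive map() calls) see the same
        // bounded executor instead of creating a fresh unbounded one.
        System.out.println("same instance: " + (a == b));
        Future<Integer> f = a.submit(() -> 42);
        System.out.println("result: " + f.get());
        a.shutdown();
    }
}
```

Because the pool is shared and bounded, the thread count no longer grows with the number of map() invocations, and a single shutdown suffices at the end of the job.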



--
This message was sent by Atlassian JIRA
(v6.2#6252)
