[ https://issues.apache.org/jira/browse/NUTCH-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938892#comment-13938892 ]
ysc edited comment on NUTCH-1739 at 3/18/14 8:04 AM:
-----------------------------------------------------

This patch was produced against Nutch 1.7. You can use it as a reference when patching other 1.x versions.

was (Author: yangshangchuan):
This patch was produced against Nutch 1.7. You can use it as a reference when patching other versions.

> ExecutorService field in ParseUtil.java is not used correctly and causes a memory leak
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1739
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1739
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.6, 2.1, 1.7, 2.2, 1.8, 2.2.1
>         Environment: JDK32, runtime/local
>            Reporter: ysc
>            Priority: Critical
>         Attachments: nutch1.7.patch, nutch2.2.1.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> ########################Problem########################
> java.lang.Exception: java.lang.OutOfMemoryError: unable to create new native thread
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
> Caused by: java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:640)
>         at java.util.concurrent.ThreadPoolExecutor.addThread(ThreadPoolExecutor.java:681)
>         at java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
>         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:655)
>         at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92)
>         at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159)
>         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
>         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
>         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
>         at java.lang.Thread.run(Thread.java:662)
> ########################Analysis########################
> My server uses a 32-bit JDK. At first I thought not enough memory had been specified. The test {{java -Xmx2600m -version}} succeeded, so I knew the maximum heap my server can use is 2.6G. I therefore added the line {{NUTCH_HEAPSIZE=2000}} to the bin/nutch script, but that did not solve the problem.
> Then I checked the source code to see where so many threads were being created. I found the following code at line 97, in the map method of org.apache.nutch.parse.ParseSegment:
> {code:java}
> parseResult = new ParseUtil(getConf()).parse(content);
> {code}
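> To see why this line is a problem, here is a minimal standalone sketch, independent of Nutch (the class name PoolPerRecordLeak is made up for illustration): it mimics creating a fresh cached thread pool per record, which, as shown next, is what the ParseUtil constructor does, and never shutting it down.
> {code:java}
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
>
> public class PoolPerRecordLeak {
>   public static void main(String[] args) {
>     for (int record = 0; record < 1000000; record++) {
>       // Mimics "new ParseUtil(getConf())": a brand-new pool per record.
>       ExecutorService pool = Executors.newCachedThreadPool();
>       pool.submit(new Runnable() {
>         public void run() { /* pretend to parse one record */ }
>       });
>       // No pool.shutdown() here: each idle worker lingers ~60 seconds
>       // (the default keep-alive), so native threads pile up far faster
>       // than they are reclaimed.
>       if (record % 1000 == 0) {
>         System.out.println("live threads (approximate): " + Thread.activeCount());
>       }
>     }
>   }
> }
> {code}
> On a 32-bit JVM the per-thread stack quickly exhausts the limited address space, producing exactly the "unable to create new native thread" error above.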
> Continuing into ParseUtil: its constructor instantiates a cached thread pool with no limit on the pool size; see the code:
> {code:java}
> executorService = Executors.newCachedThreadPool(new ThreadFactoryBuilder()
>     .setNameFormat("parse-%d").setDaemon(true).build());
> {code}
> From the above analysis, every call to the map method instantiates a new CachedThreadPool and never closes it. So the ExecutorService field in ParseUtil.java is not used correctly and causes a memory leak.
> ########################Solution########################
> Have every map invocation share a single FixedThreadPool whose size can be configured in nutch-site.xml; see the patch file for details, and the illustrative sketch below.
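> For readers without the patch at hand, a rough sketch of the idea follows. It is only an illustration: the field layout and the property name "parse.threadpool.size" are assumptions, not necessarily what the attached patch files use.
> {code:java}
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import com.google.common.util.concurrent.ThreadFactoryBuilder;
> import org.apache.hadoop.conf.Configuration;
>
> public class ParseUtil {
>   // One bounded pool shared by all ParseUtil instances,
>   // instead of one unbounded pool per instance.
>   private static volatile ExecutorService executorService;
>
>   public ParseUtil(Configuration conf) {
>     if (executorService == null) {
>       synchronized (ParseUtil.class) {
>         if (executorService == null) {
>           // "parse.threadpool.size" is a hypothetical key for this sketch;
>           // the real property name is defined in the attached patches.
>           int size = conf.getInt("parse.threadpool.size", 10);
>           executorService = Executors.newFixedThreadPool(size,
>               new ThreadFactoryBuilder().setNameFormat("parse-%d")
>                   .setDaemon(true).build());
>         }
>       }
>     }
>   }
> }
> {code}
> Because the pool is bounded and created only once, the thread count stays at the configured size no matter how many records the map method processes.

-- This message was sent by Atlassian JIRA (v6.2#6252)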