[jira] [Commented] (NUTCH-1738) Expose number of URLs generated per batch in GeneratorJob
[ https://issues.apache.org/jira/browse/NUTCH-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13939912#comment-13939912 ] Hudson commented on NUTCH-1738: --- SUCCESS: Integrated in Nutch-nutchgora #958 (See [https://builds.apache.org/job/Nutch-nutchgora/958/]) NUTCH-1738 Expose number of URLs generated per batch in GeneratorJob (lewismc: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1579072) * /nutch/branches/2.x/CHANGES.txt * /nutch/branches/2.x/src/java/org/apache/nutch/crawl/GeneratorJob.java * /nutch/branches/2.x/src/java/org/apache/nutch/crawl/GeneratorReducer.java > Expose number of URLs generated per batch in GeneratorJob > - > > Key: NUTCH-1738 > URL: https://issues.apache.org/jira/browse/NUTCH-1738 > Project: Nutch > Issue Type: Bug > Components: generator > Affects Versions: 2.2.1 > Reporter: Lewis John McGibbney > Assignee: Talat UYARER > Fix For: 2.3 > > Attachments: NUTCH-1738.patch > > > GeneratorJob contains one trivial line of logging > {code:title=GeneratorJob.java|borderStyle=solid} > LOG.info("GeneratorJob: generated batch id: " + batchId); > {code} > I propose to improve this logging by exposing how many URLs are contained > within the generated batch. Something like > {code:title=GeneratorJob.java|borderStyle=solid} > LOG.info("GeneratorJob: generated batch id: " + batchId + " containing " + > $numOfURLs + " URLs"); > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
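The committed change reports a per-batch URL count alongside the batch id. A stdlib-only sketch of the underlying pattern (in the real job the count would come from a Hadoop MapReduce counter incremented in the reducer; the class, `emit` helper, and batch id below are illustrative stand-ins, not actual GeneratorJob/GeneratorReducer code):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative stand-in for the counter a reducer would increment once per
// URL written to the batch; Hadoop's Counters API plays this role for real.
public class GenerateCount {
    static final AtomicLong generated = new AtomicLong();

    // Stand-in for the reduce step: each URL emitted to the batch bumps the counter.
    static void emit(String url) {
        generated.incrementAndGet();
    }

    // Builds the improved log line proposed in the issue.
    static String logLine(String batchId) {
        return "GeneratorJob: generated batch id: " + batchId
                + " containing " + generated.get() + " URLs";
    }

    public static void main(String[] args) {
        for (String url : List.of("http://a.example/", "http://b.example/")) {
            emit(url);
        }
        System.out.println(logLine("1395129000-12345"));
    }
}
```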
[jira] [Resolved] (NUTCH-1738) Expose number of URLs generated per batch in GeneratorJob
[ https://issues.apache.org/jira/browse/NUTCH-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1738. - Resolution: Fixed Committed @revision 1579072 in 2.x HEAD Thank you [~talat]
[jira] [Updated] (NUTCH-1738) Expose number of URLs generated per batch in GeneratorJob
[ https://issues.apache.org/jira/browse/NUTCH-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1738: Patch Info: Patch Available
[jira] [Commented] (NUTCH-1738) Expose number of URLs generated per batch in GeneratorJob
[ https://issues.apache.org/jira/browse/NUTCH-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13939873#comment-13939873 ] Lewis John McGibbney commented on NUTCH-1738: - Assigned to you [~talat] for Karma.
[jira] [Updated] (NUTCH-1738) Expose number of URLs generated per batch in GeneratorJob
[ https://issues.apache.org/jira/browse/NUTCH-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1738: Assignee: Talat UYARER
[jira] [Updated] (NUTCH-1738) Expose number of URLs generated per batch in GeneratorJob
[ https://issues.apache.org/jira/browse/NUTCH-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1738: Assignee: (was: Lewis John McGibbney)
[jira] [Commented] (NUTCH-1478) Parse-metatags and index-metadata plugin for Nutch 2.x series
[ https://issues.apache.org/jira/browse/NUTCH-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13939553#comment-13939553 ] Shanaka Jayasundera commented on NUTCH-1478: Hi all, I've downloaded the latest code from the 2.x branch and tried to index metadata to Solr, but the Solr query results are not showing the metadata. parsechecker, however, is working fine. Do I need any additional configuration to get metadata in the Solr query results? $ ./bin/nutch parsechecker http://nutch.apache.org/ fetching: http://nutch.apache.org/ parsing: http://nutch.apache.org/ contentType: text/html signature: b2bb805dcd51f12784190d58d619f0bc - Url --- http://nutch.apache.org/ - Metadata - meta_forrest-version : 0.10-dev meta_generator : Apache Forrest meta_forrest-skin-name : nutch_rs_ : meta_content-type : text/html; charset=UTF-8 The command I'm using to crawl and index is bin/crawl urls/seed.txt TestCrawl3.1 http://localhost:8983/solr/ 2 I've not made many configuration changes; I've configured nutch-site.xml and gora.properties to use HBase and Gora. I'd appreciate it if anyone can help me identify the missing configuration. Thanks in advance. > Parse-metatags and index-metadata plugin for Nutch 2.x series > -- > > Key: NUTCH-1478 > URL: https://issues.apache.org/jira/browse/NUTCH-1478 > Project: Nutch > Issue Type: Improvement > Components: parser > Affects Versions: 2.1 > Reporter: kiran > Fix For: 2.3 > > Attachments: NUTCH-1478-parse-v2.patch, NUTCH-1478v3.patch, > NUTCH-1478v4.patch, NUTCH-1478v5.1.patch, NUTCH-1478v5.patch, > NUTCH-1478v6.patch, Nutch1478.patch, Nutch1478.zip, > metadata_parseChecker_sites.png > > > I have ported the parse-metatags and index-metadata plugins to the Nutch 2.x series. > This will take multiple values of the same tag and index them in Solr, as I patched > before (https://issues.apache.org/jira/browse/NUTCH-1467). 
> The usage is the same as described here > (http://wiki.apache.org/nutch/IndexMetatags), but one change is that there is > no need to give the 'metatag' keyword before metatag names. For example, my > configuration looks like this > (https://github.com/salvager/NutchDev/blob/master/runtime/local/conf/nutch-site.xml) > > This is only the first version and does not include the JUnit test. I will > upload a new version soon. > This will parse the tags and index them in Solr. Make sure you create the > fields listed under 'index.parse.md' in nutch-site.xml in Solr's schema.xml. > Please let me know if you have any suggestions. > This is supported by DLA (Digital Library and Archives) of Virginia Tech.
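For the Solr question above: the indexing side is driven by the 'index.parse.md' property. A minimal sketch of how such a comma-separated key list could select parse metadata for indexing, under the assumption (from the issue text) that the property lists metadata keys that must also exist as fields in Solr's schema.xml; the property value, keys, and helper class here are illustrative, not the plugin's actual code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the configuration handshake: only metadata keys named in the
// index.parse.md-style property make it into the indexed document.
public class MetadataFilter {
    static Map<String, String> selectForIndexing(String indexParseMd,
                                                 Map<String, String> parseMeta) {
        Map<String, String> doc = new LinkedHashMap<>();
        for (String key : indexParseMd.split(",")) {
            String k = key.trim().toLowerCase();
            if (parseMeta.containsKey(k)) {
                // Each selected key must also be declared as a field in Solr's schema.xml.
                doc.put(k, parseMeta.get(k));
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        // Made-up parse metadata resembling the parsechecker output above.
        Map<String, String> parseMeta = Map.of(
                "meta_generator", "Apache Forrest",
                "meta_content-type", "text/html; charset=UTF-8");
        System.out.println(selectForIndexing("meta_generator, meta_keywords", parseMeta));
    }
}
```

If a key is configured but never appears in Solr results, the usual suspects are the schema.xml field being missing or the key casing not matching.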
[jira] [Commented] (NUTCH-1733) parse-html to support HTML5 charset definitions
[ https://issues.apache.org/jira/browse/NUTCH-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13939540#comment-13939540 ] John Lafitte commented on NUTCH-1733: - It might just be specific to my files or configuration, but when using this patch it does seem to remove the BOM, yet I get what looks like an extra space at the beginning of the content. > parse-html to support HTML5 charset definitions > --- > > Key: NUTCH-1733 > URL: https://issues.apache.org/jira/browse/NUTCH-1733 > Project: Nutch > Issue Type: Bug > Components: parser > Affects Versions: 1.8, 2.2.1 > Reporter: Sebastian Nagel > Fix For: 2.3, 1.9 > > Attachments: NUTCH-1733-trunk.patch, charset_bom_html5.html, > charset_html5.html > > > HTML5 allows specifying the character encoding of a page via > * {{<meta charset="...">}} > * a Unicode Byte Order Mark (BOM) > These are allowed in addition to the earlier HTTP/http-equiv Content-Type; see > [[1|http://www.w3.org/TR/2011/WD-html5-diff-20110405/#character-encoding]]. > Parse-html ignores both the meta charset and the BOM, and falls back to the default > encoding (cp1252). Parse-tika sets the encoding appropriately.
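The two HTML5 signals named in the description can be sniffed with the JDK alone. This is not the attached patch, just the detection logic in miniature; note that when the BOM is later stripped, all three bytes should be dropped rather than replaced with whitespace, which may be related to the extra leading space reported above:

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal HTML5 charset detection: a UTF-8 BOM wins, otherwise look for
// the <meta charset=...> form. Real parsers also handle UTF-16 BOMs and
// the older http-equiv Content-Type declaration.
public class CharsetSniffer {
    private static final Pattern META_CHARSET =
            Pattern.compile("<meta\\s+charset=[\"']?([^\"'\\s/>]+)", Pattern.CASE_INSENSITIVE);

    static String sniff(byte[] content) {
        // UTF-8 BOM: EF BB BF at the very start of the document.
        if (content.length >= 3 && (content[0] & 0xFF) == 0xEF
                && (content[1] & 0xFF) == 0xBB && (content[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        // HTML5 meta charset element, e.g. <meta charset=utf-8>.
        Matcher m = META_CHARSET.matcher(new String(content, StandardCharsets.US_ASCII));
        return m.find() ? m.group(1).toUpperCase() : null;
    }

    public static void main(String[] args) {
        byte[] page = "<html><head><meta charset=utf-8></head></html>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(page));  // UTF-8
    }
}
```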
[jira] [Created] (NUTCH-1740) BatchId parameter is not set in DbUpdaterJob
Alparslan Avcı created NUTCH-1740: - Summary: BatchId parameter is not set in DbUpdaterJob Key: NUTCH-1740 URL: https://issues.apache.org/jira/browse/NUTCH-1740 Project: Nutch Issue Type: Bug Affects Versions: 2.2.1 Reporter: Alparslan Avcı Priority: Minor Attachments: NUTCH-1556-batchId.patch BatchId is not set in DbUpdaterJob because batchId is written to the configuration after currentJob has been created.
[jira] [Updated] (NUTCH-1740) BatchId parameter is not set in DbUpdaterJob
[ https://issues.apache.org/jira/browse/NUTCH-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alparslan Avcı updated NUTCH-1740: -- Attachment: NUTCH-1556-batchId.patch This is fixed for 2.x in NUTCH-1556. Uploading the related patch to this issue.
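The bug works because Hadoop's Job takes a snapshot of its Configuration when it is constructed, so a property set on the original configuration afterwards never reaches the job. A pure-Java analogue of the ordering problem (FakeJob and the map stand in for Hadoop's Job and Configuration; this is a model of the bug, not the DbUpdaterJob code):

```java
import java.util.HashMap;
import java.util.Map;

// Models NUTCH-1740: the job copies its configuration at construction time,
// so batchId must be set BEFORE the job object is created.
public class BatchIdOrdering {
    static class FakeJob {
        final Map<String, String> conf;
        FakeJob(Map<String, String> conf) {
            this.conf = new HashMap<>(conf);  // snapshot, like Hadoop's Job
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();

        // Buggy order: create the job first, set batchId second.
        FakeJob buggy = new FakeJob(conf);
        conf.put("batchId", "1395129000-12345");
        System.out.println("buggy job sees batchId: " + buggy.conf.get("batchId"));  // null

        // Fixed order (what the patch does): set batchId, then create the job.
        FakeJob fixed = new FakeJob(conf);
        System.out.println("fixed job sees batchId: " + fixed.conf.get("batchId"));  // 1395129000-12345
    }
}
```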
[jira] [Commented] (NUTCH-1739) ExecutorService field in ParseUtil.java not be right use and cause memory leak
[ https://issues.apache.org/jira/browse/NUTCH-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938962#comment-13938962 ] Alparslan Avcı commented on NUTCH-1739: --- Hi [~yangshangchuan], and thanks for the patch! IMHO, a FixedThreadPool is not needed in this case. As you can see in the source code of _Executors.java_, the _newCachedThreadPool()_ method is implemented as follows: {code:java} public static ExecutorService newCachedThreadPool(ThreadFactory threadFactory) { return new ThreadPoolExecutor(0, Integer.MAX_VALUE, 60L, TimeUnit.SECONDS, new SynchronousQueue<Runnable>(), threadFactory); } {code} The keepAliveTime parameter is given as 60 seconds, meaning that idle threads wait 60 seconds for new tasks before terminating. So threads are created as needed and killed when they fall idle. In our experience, we have parsed ten million webpages and never hit a problem using a CachedThreadPool. Another point is that choosing a fixed thread-pool size is a hard problem when the set of crawled webpages is very large. 
> ExecutorService field in ParseUtil.java not be right use and cause memory leak > -- > > Key: NUTCH-1739 > URL: https://issues.apache.org/jira/browse/NUTCH-1739 > Project: Nutch > Issue Type: Bug > Components: parser > Affects Versions: 1.6, 2.1, 1.7, 2.2, 1.8, 2.2.1 > Environment: JDK32, runtime/local > Reporter: ysc > Priority: Critical > Attachments: nutch1.7.patch, nutch2.2.1.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > Problem > java.lang.Exception: java.lang.OutOfMemoryError: unable to create new native thread > at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354) > Caused by: java.lang.OutOfMemoryError: unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:640) > at java.util.concurrent.ThreadPoolExecutor.addThread(ThreadPoolExecutor.java:681) > at java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727) > at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:655) > at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92) > at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159) > at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93) > at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97) > at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44) > at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) > at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366) > at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > at java.util.concurrent.FutureTask.run(FutureTask.java:138) > at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) > at java.lang.Thread.run(Thread.java:662) > Analysis > My server uses a 32-bit JDK. At first I thought not enough memory had been specified. Since {{java -Xmx2600m -version}} succeeded, I knew my server could use at most 2.6G. So I added the line {{NUTCH_HEAPSIZE=2000}} to the bin/nutch script, but that did not solve the problem. > Then I checked the source code to see where so many threads were being produced. I found the code > {code:java} > parseResult = new ParseUtil(getConf()).parse(content); > {code} > at line 97 of org.apache.nutch.parse.ParseSegment.java's map method. > Next, the constructor of ParseUtil instantiates a CachedThreadPool with no limit on the pool size; see the code: > {code:java} > executorService = Executors.newCachedThreadPool(new ThreadFactoryBuilder() > .setNameFormat("parse-%d").setDaemon(true).build()); > {code} > From the above analysis, each invocation of the map method instantiates a CachedThreadPool and never closes it. So the ExecutorService field in ParseUtil.java is not used correctly and causes a memory leak. > Solution > Each map method uses a shared FixedThreadPool whose size can be configured in nutch-site.xml; for details see the patch file.
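The two parameters Alparslan's comment relies on can be checked directly against the JDK: a cached pool is unbounded in thread count but retires idle threads after 60 seconds. A minimal verification:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Confirms the parameters quoted from Executors.java: maximumPoolSize is
// Integer.MAX_VALUE (no bound on thread creation) and keepAliveTime is 60s
// (idle threads are reclaimed, so a quiet pool shrinks back to zero threads).
public class CachedPoolParams {
    public static void main(String[] args) {
        ThreadPoolExecutor pool =
                (ThreadPoolExecutor) Executors.newCachedThreadPool();
        System.out.println("max threads:  " + pool.getMaximumPoolSize());               // 2147483647
        System.out.println("keep-alive s: " + pool.getKeepAliveTime(TimeUnit.SECONDS)); // 60
        pool.shutdown();
    }
}
```

The 60-second reclaim only helps if the pool itself is shared; creating a fresh cached pool per map call (the bug above) still leaks, because each pool's threads must go idle for 60 s before dying while new pools keep spawning new threads.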
[jira] [Comment Edited] (NUTCH-1738) Expose number of URLs generated per batch in GeneratorJob
[ https://issues.apache.org/jira/browse/NUTCH-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938960#comment-13938960 ] Talat UYARER edited comment on NUTCH-1738 at 3/18/14 8:29 AM: -- Hi [~lewismc], I attached a patch for this information. Can you review it ? was (Author: talat): Hi [~lewis] , I attached a patch for this information. Can you review it ?
[jira] [Updated] (NUTCH-1738) Expose number of URLs generated per batch in GeneratorJob
[ https://issues.apache.org/jira/browse/NUTCH-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Talat UYARER updated NUTCH-1738: Attachment: NUTCH-1738.patch Hi [~lewis] , I attached a patch for this information. Can you review it ?
[jira] [Commented] (NUTCH-1739) ExecutorService field in ParseUtil.java not be right use and cause memory leak
[ https://issues.apache.org/jira/browse/NUTCH-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938945#comment-13938945 ] ysc commented on NUTCH-1739: Thanks, [~wastl-nagel], you are right; I just saw it. In the map method of org.apache.nutch.parse.ParseSegment.java: {code:java} ParseResult parseResult = null; try { if (parseUtil == null) parseUtil = new ParseUtil(getConf()); parseResult = parseUtil.parse(content); } catch (Exception e) { LOG.warn("Error parsing: " + key + ": " + StringUtils.stringifyException(e)); return; } {code} But this still does not limit the thread pool size. It may produce lots of threads, resulting in high memory usage and frequent GC; worse, it can cause an OutOfMemoryError.
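The proposed solution (one shared, bounded pool instead of a fresh unbounded CachedThreadPool per ParseUtil) can be sketched with the JDK alone. The property name "parser.threads" and the class below are illustrative assumptions, not the names used in the attached patches:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

// Sketch of the fix: a single JVM-wide fixed-size pool shared by every
// ParseUtil, so thread count stays bounded no matter how many pages are parsed.
public class SharedParsePool {
    // In the real patch the size would come from nutch-site.xml; here a
    // system property with a default stands in for that configuration.
    private static final int POOL_SIZE = Integer.getInteger("parser.threads", 4);

    private static final ExecutorService POOL =
            Executors.newFixedThreadPool(POOL_SIZE);

    static ExecutorService get() {
        return POOL;  // every caller shares this one bounded executor
    }

    public static void main(String[] args) {
        ThreadPoolExecutor p = (ThreadPoolExecutor) get();
        for (int i = 0; i < 100; i++) {
            p.submit(() -> { });
        }
        p.shutdown();
        // Never more than POOL_SIZE threads, no matter how many tasks arrive.
        System.out.println("largest pool size: " + p.getLargestPoolSize());
    }
}
```

A fixed pool trades the cached pool's elasticity for a hard upper bound; excess tasks queue instead of spawning native threads, which is exactly what prevents the OutOfMemoryError in the stack trace above.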
[jira] [Comment Edited] (NUTCH-1739) ExecutorService field in ParseUtil.java not be right use and cause memory leak
[ https://issues.apache.org/jira/browse/NUTCH-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938892#comment-13938892 ] ysc edited comment on NUTCH-1739 at 3/18/14 8:04 AM: - This patch was produced against Nutch 1.7. You can use it as a reference to patch other 1.x versions. was (Author: yangshangchuan): This patch is produced in the environment of nutch1.7. You can reference this patch to patch other version.
[jira] [Updated] (NUTCH-1739) ExecutorService field in ParseUtil.java not be right use and cause memory leak
[ https://issues.apache.org/jira/browse/NUTCH-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ysc updated NUTCH-1739: --- Attachment: nutch2.2.1.patch This patch was produced against Nutch 2.2.1. You can use it as a reference to patch other 2.x versions.
[jira] [Commented] (NUTCH-1739) ExecutorService field in ParseUtil.java not be right use and cause memory leak
[ https://issues.apache.org/jira/browse/NUTCH-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938941#comment-13938941 ]

ysc commented on NUTCH-1739:
----------------------------

Thanks, [~alparslan.avci], you are right. Nutch 2.1 does not initialize a new thread pool for each map method's output, but it still does not limit the thread pool size. This can produce a large number of threads, resulting in high memory usage and frequent GC; in the worst case it causes an OutOfMemoryError. I will add a patch for 2.x.

> ExecutorService field in ParseUtil.java not be right use and cause memory leak
> ------------------------------------------------------------------------------
>
>                 Key: NUTCH-1739
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1739
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.6, 2.1, 1.7, 2.2, 1.8, 2.2.1
>         Environment: JDK32, runtime/local
>            Reporter: ysc
>            Priority: Critical
>         Attachments: nutch1.7.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Problem
> {noformat}
> java.lang.Exception: java.lang.OutOfMemoryError: unable to create new native thread
> 	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
> Caused by: java.lang.OutOfMemoryError: unable to create new native thread
> 	at java.lang.Thread.start0(Native Method)
> 	at java.lang.Thread.start(Thread.java:640)
> 	at java.util.concurrent.ThreadPoolExecutor.addThread(ThreadPoolExecutor.java:681)
> 	at java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
> 	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:655)
> 	at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92)
> 	at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159)
> 	at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
> 	at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
> 	at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
> 	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> 	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
> 	at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
> 	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
> 	at java.lang.Thread.run(Thread.java:662)
> {noformat}
> Analysis
> My server runs a 32-bit JDK. At first I thought not enough memory had been specified. Since {{java -Xmx2600m -version}} succeeded, I knew the server could use at most 2.6 GB, so I added {{NUTCH_HEAPSIZE=2000}} to the bin/nutch script, but that did not solve the problem.
> Then I checked the source code to see where so many threads were being created. I found this code
> {code:java}
> parseResult = new ParseUtil(getConf()).parse(content);
> {code}
> at line 97 of org.apache.nutch.parse.ParseSegment, in its map method.
> In turn, the constructor of ParseUtil instantiates a cached thread pool with no limit on the pool size:
> {code:java}
> executorService = Executors.newCachedThreadPool(new ThreadFactoryBuilder()
>     .setNameFormat("parse-%d").setDaemon(true).build());
> {code}
> From this analysis: each invocation of the map method instantiates a CachedThreadPool and never shuts it down. The ExecutorService field in ParseUtil.java is therefore used incorrectly and causes a memory leak.
> Solution
> Each map method should use a shared FixedThreadPool whose size can be configured in nutch-site.xml; see the patch file for details.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
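The fix described in the report — replacing the per-instance unbounded CachedThreadPool with one shared, bounded pool of daemon threads — can be sketched as follows. This is an illustrative sketch only, not the attached patch: the class name `SharedParsePool`, the accessor `get`, and the fallback `DEFAULT_POOL_SIZE` are hypothetical, and the real patch reads the pool size from nutch-site.xml.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

public final class SharedParsePool {
    // Hypothetical fallback; the actual patch takes the size from nutch-site.xml.
    private static final int DEFAULT_POOL_SIZE = 8;

    private static volatile ExecutorService instance;

    private SharedParsePool() {}

    /** Lazily creates one bounded daemon-thread pool shared by all callers. */
    public static ExecutorService get(int poolSize) {
        if (instance == null) {
            synchronized (SharedParsePool.class) {
                if (instance == null) {
                    final AtomicInteger counter = new AtomicInteger();
                    // Daemon threads named "parse-N", mirroring the original
                    // ThreadFactoryBuilder setup without the Guava dependency.
                    ThreadFactory daemonFactory = r -> {
                        Thread t = new Thread(r, "parse-" + counter.getAndIncrement());
                        t.setDaemon(true);
                        return t;
                    };
                    // A fixed-size pool bounds native thread creation, unlike
                    // Executors.newCachedThreadPool(), which grows without limit.
                    instance = Executors.newFixedThreadPool(
                            poolSize > 0 ? poolSize : DEFAULT_POOL_SIZE, daemonFactory);
                }
            }
        }
        return instance;
    }
}
```

Because every map invocation receives the same ExecutorService, thread count stays bounded by the configured pool size regardless of how many times `new ParseUtil(...)` is constructed, which is exactly the failure mode in the stack trace above.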
[jira] [Updated] (NUTCH-1739) ExecutorService field in ParseUtil.java not be right use and cause memory leak
[ https://issues.apache.org/jira/browse/NUTCH-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ysc updated NUTCH-1739:
-----------------------
    Affects Version/s: 2.1
                       2.2
                       2.2.1
[jira] [Comment Edited] (NUTCH-1739) ExecutorService field in ParseUtil.java not be right use and cause memory leak
[ https://issues.apache.org/jira/browse/NUTCH-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938930#comment-13938930 ]

Sebastian Nagel edited comment on NUTCH-1739 at 3/18/14 7:43 AM:
----------------------------------------------------------------

Thanks, [~yangshangchuan]. But isn't this fixed with NUTCH-1640 (contained in 1.8, which was just released)?

was (Author: wastl-nagel):
Thanks, [~yangshangchuan]. But isn't this fixed with NUTCH-1640 (contained in 1.8 which was just released).
[jira] [Commented] (NUTCH-1739) ExecutorService field in ParseUtil.java not be right use and cause memory leak
[ https://issues.apache.org/jira/browse/NUTCH-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938930#comment-13938930 ]

Sebastian Nagel commented on NUTCH-1739:
----------------------------------------

Thanks, [~yangshangchuan]. But isn't this fixed with NUTCH-1640 (contained in 1.8, which was just released)?
[jira] [Updated] (NUTCH-1739) ExecutorService field in ParseUtil.java not be right use and cause memory leak
[ https://issues.apache.org/jira/browse/NUTCH-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ysc updated NUTCH-1739:
-----------------------
    Affects Version/s:     (was: 2.2.1)
                           (was: 2.2)
                           (was: 2.1)
[jira] [Comment Edited] (NUTCH-1739) ExecutorService field in ParseUtil.java not be right use and cause memory leak
[ https://issues.apache.org/jira/browse/NUTCH-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938922#comment-13938922 ]

Alparslan Avcı edited comment on NUTCH-1739 at 3/18/14 7:32 AM:
----------------------------------------------------------------

It seems there is no problem for 2.x, since all ParseUtil objects are initialized once per job; the thread pool is therefore shared across uses of ParseUtil within the same job.

was (Author: alparslan.avci):
It seems there is no problem for 2.x since all of the ParseUtil objects initialized once for a job. So, thread pool is shared for uses of ParseUtil in the same job.
[jira] [Commented] (NUTCH-1739) ExecutorService field in ParseUtil.java not be right use and cause memory leak
[ https://issues.apache.org/jira/browse/NUTCH-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938922#comment-13938922 ]

Alparslan Avcı commented on NUTCH-1739:
---------------------------------------

It seems there is no problem for 2.x, since all ParseUtil objects are initialized once per job; the thread pool is therefore shared across uses of ParseUtil within the same job.