[jira] [Commented] (NUTCH-1738) Expose number of URLs generated per batch in GeneratorJob

2014-03-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13939912#comment-13939912
 ] 

Hudson commented on NUTCH-1738:
---

SUCCESS: Integrated in Nutch-nutchgora #958 (See 
[https://builds.apache.org/job/Nutch-nutchgora/958/])
NUTCH-1738 Expose number of URLs generated per batch in GeneratorJob (lewismc: 
http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1579072)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/src/java/org/apache/nutch/crawl/GeneratorJob.java
* /nutch/branches/2.x/src/java/org/apache/nutch/crawl/GeneratorReducer.java


> Expose number of URLs generated per batch in GeneratorJob
> -
>
> Key: NUTCH-1738
> URL: https://issues.apache.org/jira/browse/NUTCH-1738
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 2.2.1
>Reporter: Lewis John McGibbney
>Assignee: Talat UYARER
> Fix For: 2.3
>
> Attachments: NUTCH-1738.patch
>
>
> GeneratorJob contains one trivial line of logging
> {code:title=GeneratorJob.java|borderStyle=solid}
> LOG.info("GeneratorJob: generated batch id: " + batchId);
> {code}
> I propose to improve this logging by exposing how many URLs are contained 
> within the generated batch. Something like
> {code:title=GeneratorJob.java|borderStyle=solid}
> LOG.info("GeneratorJob: generated batch id: " + batchId + " containing " + 
> $numOfURLs + " URLs");
> {code}
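For illustration, here is a minimal, self-contained sketch of the proposed log line. This is not the actual Nutch/Hadoop code; the batch id and URLs are made up, and the list size stands in for the value a reducer-side counter would supply:

```java
import java.util.List;

// Simplified sketch: build the proposed log message, with the list size
// standing in for the URL count a reducer-side counter would provide.
public class GeneratorLogSketch {
    static String logLine(String batchId, List<String> urls) {
        long numOfURLs = urls.size();
        return "GeneratorJob: generated batch id: " + batchId
                + " containing " + numOfURLs + " URLs";
    }

    public static void main(String[] args) {
        List<String> batch = List.of("http://nutch.apache.org/",
                                     "http://example.com/");
        System.out.println(logLine("1395129000-31337", batch));
    }
}
```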



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (NUTCH-1738) Expose number of URLs generated per batch in GeneratorJob

2014-03-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1738.
-

Resolution: Fixed

Committed @revision 1579072 in 2.x HEAD
Thank you [~talat]

> Expose number of URLs generated per batch in GeneratorJob
> -
>
> Key: NUTCH-1738
> URL: https://issues.apache.org/jira/browse/NUTCH-1738
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 2.2.1
>Reporter: Lewis John McGibbney
>Assignee: Talat UYARER
> Fix For: 2.3
>
> Attachments: NUTCH-1738.patch
>
>





[jira] [Updated] (NUTCH-1738) Expose number of URLs generated per batch in GeneratorJob

2014-03-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1738:


Patch Info: Patch Available

> Expose number of URLs generated per batch in GeneratorJob
> -
>
> Key: NUTCH-1738
> URL: https://issues.apache.org/jira/browse/NUTCH-1738
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 2.2.1
>Reporter: Lewis John McGibbney
>Assignee: Talat UYARER
> Fix For: 2.3
>
> Attachments: NUTCH-1738.patch
>
>





[jira] [Commented] (NUTCH-1738) Expose number of URLs generated per batch in GeneratorJob

2014-03-18 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13939873#comment-13939873
 ] 

Lewis John McGibbney commented on NUTCH-1738:
-

Assigned to you [~talat] for Karma.

> Expose number of URLs generated per batch in GeneratorJob
> -
>
> Key: NUTCH-1738
> URL: https://issues.apache.org/jira/browse/NUTCH-1738
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 2.2.1
>Reporter: Lewis John McGibbney
>Assignee: Talat UYARER
> Fix For: 2.3
>
> Attachments: NUTCH-1738.patch
>
>





[jira] [Updated] (NUTCH-1738) Expose number of URLs generated per batch in GeneratorJob

2014-03-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1738:


Assignee: Talat UYARER

> Expose number of URLs generated per batch in GeneratorJob
> -
>
> Key: NUTCH-1738
> URL: https://issues.apache.org/jira/browse/NUTCH-1738
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 2.2.1
>Reporter: Lewis John McGibbney
>Assignee: Talat UYARER
> Fix For: 2.3
>
> Attachments: NUTCH-1738.patch
>
>





[jira] [Updated] (NUTCH-1738) Expose number of URLs generated per batch in GeneratorJob

2014-03-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1738:


Assignee: (was: Lewis John McGibbney)

> Expose number of URLs generated per batch in GeneratorJob
> -
>
> Key: NUTCH-1738
> URL: https://issues.apache.org/jira/browse/NUTCH-1738
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 2.2.1
>Reporter: Lewis John McGibbney
> Fix For: 2.3
>
> Attachments: NUTCH-1738.patch
>
>





[jira] [Commented] (NUTCH-1478) Parse-metatags and index-metadata plugin for Nutch 2.x series

2014-03-18 Thread Shanaka Jayasundera (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13939553#comment-13939553
 ] 

Shanaka Jayasundera commented on NUTCH-1478:


Hi All,

I've downloaded the latest code from the 2.x branch and tried to index metadata 
to Solr, but the Solr query results are not showing the metadata.

However, parsechecker works fine. Do I need any additional configuration to get 
the metadata into the Solr query results?

$ ./bin/nutch parsechecker http://nutch.apache.org/
fetching: http://nutch.apache.org/
parsing: http://nutch.apache.org/
contentType: text/html
signature: b2bb805dcd51f12784190d58d619f0bc
-
Url
---
http://nutch.apache.org/
-
Metadata
-
meta_forrest-version :  0.10-dev
meta_generator :Apache Forrest
meta_forrest-skin-name :nutch_rs_ : �
meta_content-type : text/html; charset=UTF-8

The command I'm using to crawl and index is:
bin/crawl urls/seed.txt TestCrawl3.1 http://localhost:8983/solr/ 2

I haven't made many configuration changes; I've configured nutch-site.xml and 
gora.properties to use HBase and Gora.

I'd appreciate it if anyone can help me identify the missing configuration.
Thanks in advance.

> Parse-metatags and index-metadata plugin for Nutch 2.x series 
> --
>
> Key: NUTCH-1478
> URL: https://issues.apache.org/jira/browse/NUTCH-1478
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 2.1
>Reporter: kiran
> Fix For: 2.3
>
> Attachments: NUTCH-1478-parse-v2.patch, NUTCH-1478v3.patch, 
> NUTCH-1478v4.patch, NUTCH-1478v5.1.patch, NUTCH-1478v5.patch, 
> NUTCH-1478v6.patch, Nutch1478.patch, Nutch1478.zip, 
> metadata_parseChecker_sites.png
>
>
> I have ported the parse-metatags and index-metadata plugins to the Nutch 2.x 
> series. This will take multiple values of the same tag and index them in 
> Solr, as I patched before 
> (https://issues.apache.org/jira/browse/NUTCH-1467).
> The usage is the same as described here 
> (http://wiki.apache.org/nutch/IndexMetatags), but one change is that there is 
> no need to put the 'metatag' keyword before metatag names. For example, my 
> configuration looks like this 
> (https://github.com/salvager/NutchDev/blob/master/runtime/local/conf/nutch-site.xml)
> This is only the first version and does not include the JUnit test. I will 
> upload a new version soon.
> This will parse the tags and index them in Solr. Make sure you also create 
> the fields listed in 'index.parse.md' (in nutch-site.xml) in Solr's 
> schema.xml.
> Please let me know if you have any suggestions.
> This is supported by DLA (Digital Library and Archives) of Virginia Tech.
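As a hedged sketch of the kind of nutch-site.xml configuration the description refers to: the property names below (`metatags.names`, `index.parse.md`) are the ones these plugins use, but the tag values are illustrative only, and each indexed field also needs a matching field declared in Solr's schema.xml.

```xml
<!-- Illustrative values, not taken from the issue. -->
<!-- Metatags to extract during parsing: -->
<property>
  <name>metatags.names</name>
  <value>description,keywords</value>
</property>
<!-- Parse metadata to pass through to the indexer: -->
<property>
  <name>index.parse.md</name>
  <value>description,keywords</value>
</property>
```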





[jira] [Commented] (NUTCH-1733) parse-html to support HTML5 charset definitions

2014-03-18 Thread John Lafitte (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13939540#comment-13939540
 ] 

John Lafitte commented on NUTCH-1733:
-

It might just be specific to my files or configuration, but when using this 
patch it does seem to remove the BOM; however, I get what looks like an extra 
space at the beginning of the content.

> parse-html to support HTML5 charset definitions
> ---
>
> Key: NUTCH-1733
> URL: https://issues.apache.org/jira/browse/NUTCH-1733
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.8, 2.2.1
>Reporter: Sebastian Nagel
> Fix For: 2.3, 1.9
>
> Attachments: NUTCH-1733-trunk.patch, charset_bom_html5.html, 
> charset_html5.html
>
>
> HTML5 allows specifying the character encoding of a page via
> * {{<meta charset="...">}}
> * a Unicode Byte Order Mark (BOM)
> These are allowed in addition to the previous HTTP / http-equiv Content-Type 
> mechanisms, see 
> [1|http://www.w3.org/TR/2011/WD-html5-diff-20110405/#character-encoding].
> Parse-html ignores both the meta charset and the BOM and falls back to the 
> default encoding (cp1252). Parse-tika sets the encoding appropriately.
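To make the BOM case concrete, here is a minimal sketch of BOM-based charset detection. This is only an illustration, not the parse-html or parse-tika implementation:

```java
import java.nio.charset.StandardCharsets;

// Sketch: recognize a Unicode BOM at the start of fetched bytes.
// Illustration only; not the actual parse-html / Tika code.
public class BomSniffer {
    static String detectBomCharset(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF)
            return "UTF-8";
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF)
            return "UTF-16BE";
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE)
            return "UTF-16LE";
        return null; // no BOM: fall back to http-equiv / meta charset / default
    }

    public static void main(String[] args) {
        byte[] page = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF,
                       '<', 'h', 't', 'm', 'l', '>'};
        System.out.println(detectBomCharset(page));
        System.out.println(
            detectBomCharset("<html>".getBytes(StandardCharsets.US_ASCII)));
    }
}
```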





[jira] [Created] (NUTCH-1740) BatchId parameter is not set in DbUpdaterJob

2014-03-18 Thread JIRA
Alparslan Avcı created NUTCH-1740:
-

 Summary: BatchId parameter is not set in DbUpdaterJob
 Key: NUTCH-1740
 URL: https://issues.apache.org/jira/browse/NUTCH-1740
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.2.1
Reporter: Alparslan Avcı
Priority: Minor
 Attachments: NUTCH-1556-batchId.patch

BatchId is not set in DbUpdaterJob because batchId is set on the configuration 
after currentJob has been created.
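A minimal model (plain Java, not the Hadoop API) of why the ordering matters, assuming the job snapshots its configuration at creation time, as Hadoop's Job does:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal model: like Hadoop's Job, this "job" copies the configuration
// when it is created, so later conf changes are invisible to it.
public class BatchIdOrdering {
    static class Conf extends HashMap<String, String> {}
    static class Job {
        final Map<String, String> snapshot;
        Job(Conf conf) { snapshot = new HashMap<>(conf); }
    }

    public static void main(String[] args) {
        Conf conf = new Conf();
        Job broken = new Job(conf);       // job created first...
        conf.put("batchId", "1234");      // ...batchId set afterwards: too late
        System.out.println(broken.snapshot.containsKey("batchId"));

        Conf conf2 = new Conf();
        conf2.put("batchId", "1234");     // set before creating the job
        Job fixed = new Job(conf2);
        System.out.println(fixed.snapshot.get("batchId"));
    }
}
```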





[jira] [Updated] (NUTCH-1740) BatchId parameter is not set in DbUpdaterJob

2014-03-18 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alparslan Avcı updated NUTCH-1740:
--

Attachment: NUTCH-1556-batchId.patch

This is fixed for 2.x in NUTCH-1556. Uploading the related patch to this issue.

> BatchId parameter is not set in DbUpdaterJob
> 
>
> Key: NUTCH-1740
> URL: https://issues.apache.org/jira/browse/NUTCH-1740
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.2.1
>Reporter: Alparslan Avcı
>Priority: Minor
> Attachments: NUTCH-1556-batchId.patch
>
>





[jira] [Commented] (NUTCH-1739) ExecutorService field in ParseUtil.java not be right use and cause memory leak

2014-03-18 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938962#comment-13938962
 ] 

Alparslan Avcı commented on NUTCH-1739:
---

Hi [~yangshangchuan], and thanks for the patch!

IMHO, a FixedThreadPool is not needed in this case. As you can see in the 
source code of _Executors.java_, the _newCachedThreadPool()_ method is 
implemented as follows:
{code:java}
public static ExecutorService newCachedThreadPool(ThreadFactory threadFactory) {
    return new ThreadPoolExecutor(0, Integer.MAX_VALUE,
                                  60L, TimeUnit.SECONDS,
                                  new SynchronousQueue<Runnable>(),
                                  threadFactory);
}
{code}

The keepAliveTime parameter is 60 seconds, meaning that idle threads wait 60 
seconds for new tasks before terminating. So threads are created as needed and 
killed when they go idle. In our experience, we have parsed tens of millions of 
webpages and never hit a problem using a CachedThreadPool. Another point is 
that choosing a fixed pool size is hard when the number of crawled webpages is 
very large.
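The difference between the two pool types can be checked directly against the JDK (plain java.util.concurrent, no Nutch code involved):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Inspect the parameters of the two Executors factory methods discussed above.
public class PoolParams {
    public static void main(String[] args) {
        ThreadPoolExecutor cached =
                (ThreadPoolExecutor) Executors.newCachedThreadPool();
        ThreadPoolExecutor fixed =
                (ThreadPoolExecutor) Executors.newFixedThreadPool(4);

        // Cached: starts empty, can grow without bound, idles out after 60s.
        System.out.println(cached.getCorePoolSize());
        System.out.println(cached.getMaximumPoolSize() == Integer.MAX_VALUE);
        System.out.println(cached.getKeepAliveTime(TimeUnit.SECONDS));

        // Fixed: exactly the requested number of threads.
        System.out.println(fixed.getCorePoolSize());

        cached.shutdown();
        fixed.shutdown();
    }
}
```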

> ExecutorService field in ParseUtil.java not be right use and cause memory leak
> --
>
> Key: NUTCH-1739
> URL: https://issues.apache.org/jira/browse/NUTCH-1739
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.6, 2.1, 1.7, 2.2, 1.8, 2.2.1
> Environment: JDK32, runtime/local
>Reporter: ysc
>Priority: Critical
> Attachments: nutch1.7.patch, nutch2.2.1.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Problem
> java.lang.Exception: java.lang.OutOfMemoryError: unable to create new native 
> thread
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
> Caused by: java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:640)
> at 
> java.util.concurrent.ThreadPoolExecutor.addThread(ThreadPoolExecutor.java:681)
> at 
> java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:655)
> at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92)
> at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159)
> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
> at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
> at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
> at java.lang.Thread.run(Thread.java:662)
> Analysis
> My server uses a 32-bit JDK. At first I thought I had not specified enough 
> memory. The test {{java -Xmx2600m -version}} passed, so I knew the maximum 
> memory my server can use is 2.6G. I therefore added the line 
> {{NUTCH_HEAPSIZE=2000}} to the bin/nutch script, but that did not solve the 
> problem.
> Then I checked the source code to see where so many threads were being 
> produced. I found the code
> {code:java}
>  parseResult = new ParseUtil(getConf()).parse(content); 
> {code}
> at line 97, in the map method of org.apache.nutch.parse.ParseSegment.java.
> Next, the constructor of ParseUtil instantiates a CachedThreadPool with no 
> limit on the pool size; see the code:
> {code:java}
> executorService = Executors.newCachedThreadPool(new ThreadFactoryBuilder()
>   .setNameFormat("parse-%d").setDaemon(true).build());
> {code}
> From the above analysis, each invocation of the map method instantiates a 
> CachedThreadPool and never shuts it down. So the ExecutorService field in 
> ParseUtil.java is not used correctly and causes a memory leak.
> Solution
> Each map method should use a shared FixedThreadPool whose size can be 
> configured in nutch-site.xml; see the patch file for details.

[jira] [Comment Edited] (NUTCH-1738) Expose number of URLs generated per batch in GeneratorJob

2014-03-18 Thread Talat UYARER (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938960#comment-13938960
 ] 

Talat UYARER edited comment on NUTCH-1738 at 3/18/14 8:29 AM:
--

Hi [~lewismc],

I attached a patch for this. Can you review it?


was (Author: talat):
Hi [~lewis],

I attached a patch for this. Can you review it?

> Expose number of URLs generated per batch in GeneratorJob
> -
>
> Key: NUTCH-1738
> URL: https://issues.apache.org/jira/browse/NUTCH-1738
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 2.2.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 2.3
>
> Attachments: NUTCH-1738.patch
>
>





[jira] [Updated] (NUTCH-1738) Expose number of URLs generated per batch in GeneratorJob

2014-03-18 Thread Talat UYARER (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Talat UYARER updated NUTCH-1738:


Attachment: NUTCH-1738.patch

Hi [~lewis],

I attached a patch for this. Can you review it?

> Expose number of URLs generated per batch in GeneratorJob
> -
>
> Key: NUTCH-1738
> URL: https://issues.apache.org/jira/browse/NUTCH-1738
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 2.2.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 2.3
>
> Attachments: NUTCH-1738.patch
>
>





[jira] [Commented] (NUTCH-1739) ExecutorService field in ParseUtil.java not be right use and cause memory leak

2014-03-18 Thread ysc (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938945#comment-13938945
 ] 

ysc commented on NUTCH-1739:


Thanks, [~wastl-nagel], you are right; I only just saw it.
In the map method of org.apache.nutch.parse.ParseSegment.java:
{code:java}
ParseResult parseResult = null;
try {
  if (parseUtil == null)
    parseUtil = new ParseUtil(getConf());
  parseResult = parseUtil.parse(content);
} catch (Exception e) {
  LOG.warn("Error parsing: " + key + ": "
      + StringUtils.stringifyException(e));
  return;
}
{code}
But this still does not limit the size of the thread pool. It may produce lots 
of threads, resulting in high memory usage and frequent GC; worse, it can 
cause an OutOfMemoryError.
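The bounded-pool idea can be sketched as one shared FixedThreadPool reused by all parse tasks, instead of a new unbounded CachedThreadPool per call. This is an illustration, not the patch itself; the pool size (8) and the task body are arbitrary stand-ins for a value read from nutch-site.xml and the real parse work:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: one shared, bounded pool for all parse tasks, so the number of
// worker threads stays fixed no matter how many pages are submitted.
public class SharedParsePool {
    // Size 8 is an arbitrary stand-in for a configurable value.
    private static final ExecutorService POOL = Executors.newFixedThreadPool(8);

    static String parse(String url) {
        return "parsed:" + url; // placeholder for the real parse work
    }

    public static void main(String[] args) throws Exception {
        List<Future<String>> results = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            final String url = "http://example.com/page" + i;
            results.add(POOL.submit((Callable<String>) () -> parse(url)));
        }
        // 100 tasks were queued, but at most 8 threads ever run.
        System.out.println(results.size());
        System.out.println(results.get(0).get());
        POOL.shutdown();
    }
}
```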

> ExecutorService field in ParseUtil.java not be right use and cause memory leak
> --
>
> Key: NUTCH-1739
> URL: https://issues.apache.org/jira/browse/NUTCH-1739
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.6, 2.1, 1.7, 2.2, 1.8, 2.2.1
> Environment: JDK32, runtime/local
>Reporter: ysc
>Priority: Critical
> Attachments: nutch1.7.patch, nutch2.2.1.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>





[jira] [Comment Edited] (NUTCH-1739) ExecutorService field in ParseUtil.java not be right use and cause memory leak

2014-03-18 Thread ysc (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938892#comment-13938892
 ] 

ysc edited comment on NUTCH-1739 at 3/18/14 8:04 AM:
-

This patch was produced against Nutch 1.7. You can use it as a reference to 
patch other 1.x versions.


was (Author: yangshangchuan):
This patch is produced in the environment of nutch1.7. You can reference this 
patch to patch other version.

> ExecutorService field in ParseUtil.java not be right use and cause memory leak
> --
>
> Key: NUTCH-1739
> URL: https://issues.apache.org/jira/browse/NUTCH-1739
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.6, 2.1, 1.7, 2.2, 1.8, 2.2.1
> Environment: JDK32, runtime/local
>Reporter: ysc
>Priority: Critical
> Attachments: nutch1.7.patch, nutch2.2.1.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>





[jira] [Updated] (NUTCH-1739) ExecutorService field in ParseUtil.java not be right use and cause memory leak

2014-03-18 Thread ysc (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ysc updated NUTCH-1739:
---

Attachment: nutch2.2.1.patch

This patch was produced against Nutch 2.2.1. You can use it as a reference to 
patch other 2.x versions.

> ExecutorService field in ParseUtil.java not be right use and cause memory leak
> --
>
> Key: NUTCH-1739
> URL: https://issues.apache.org/jira/browse/NUTCH-1739
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.6, 2.1, 1.7, 2.2, 1.8, 2.2.1
> Environment: JDK32, runtime/local
>Reporter: ysc
>Priority: Critical
> Attachments: nutch1.7.patch, nutch2.2.1.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (NUTCH-1739) ExecutorService field in ParseUtil.java not be right use and cause memory leak

2014-03-18 Thread ysc (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938941#comment-13938941
 ] 

ysc commented on NUTCH-1739:


Thanks, [~alparslan.avci], you are right. Nutch 2.1 no longer creates a new 
thread pool for every map invocation, but it still does not limit the pool 
size. This can spawn a large number of threads, leading to high memory usage 
and frequent GC, and in the worst case an OutOfMemoryError. I will add a 
patch for 2.x.
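The unbounded-pool risk described above can be seen with a small stand-alone demo (this is illustrative code, not Nutch code): a cached pool spawns one worker thread per concurrent task, while a fixed pool caps the worker count regardless of load.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

// Stand-alone demo: compare how many worker threads a cached pool and a
// fixed pool create when many tasks are submitted at once.
public class PoolGrowthDemo {
    // Submit `tasks` blocking tasks, report how many worker threads exist,
    // then release the tasks and shut the pool down.
    public static int workersFor(ThreadPoolExecutor pool, int tasks) {
        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < tasks; i++) {
            pool.submit(() -> {
                try {
                    release.await(); // keep this worker busy
                } catch (InterruptedException ignored) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        int workers = pool.getPoolSize();
        release.countDown();
        pool.shutdown();
        return workers;
    }

    public static void main(String[] args) {
        ThreadPoolExecutor cached = (ThreadPoolExecutor) Executors.newCachedThreadPool();
        ThreadPoolExecutor fixed = (ThreadPoolExecutor) Executors.newFixedThreadPool(2);
        System.out.println("cached pool workers: " + workersFor(cached, 20)); // grows with load
        System.out.println("fixed pool workers:  " + workersFor(fixed, 20));  // stays at 2
    }
}
```

With one such unbounded pool per map invocation, the thread count multiplies until the JVM can no longer create native threads, which matches the OutOfMemoryError reported in the issue.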

> ExecutorService field in ParseUtil.java not be right use and cause memory leak
> --
>
> Key: NUTCH-1739
> URL: https://issues.apache.org/jira/browse/NUTCH-1739
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.6, 2.1, 1.7, 2.2, 1.8, 2.2.1
> Environment: JDK32, runtime/local
>Reporter: ysc
>Priority: Critical
> Attachments: nutch1.7.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Problem
> java.lang.Exception: java.lang.OutOfMemoryError: unable to create new native 
> thread
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
> Caused by: java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:640)
> at 
> java.util.concurrent.ThreadPoolExecutor.addThread(ThreadPoolExecutor.java:681)
> at 
> java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:655)
> at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92)
> at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159)
> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
> at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
> at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
> at java.lang.Thread.run(Thread.java:662)
> Analysis
> My server runs a 32-bit JDK. At first I thought too little memory had been 
> specified. {{java -Xmx2600m -version}} succeeded, so I knew the server could 
> use at most about 2.6 GB of heap, and I added {{NUTCH_HEAPSIZE=2000}} to the 
> bin/nutch script, but that did not solve the problem.
> I then checked the source code to find out where so many threads were being 
> created, and found this code at line 97, in the map method of 
> org.apache.nutch.parse.ParseSegment:
> {code:java}
> parseResult = new ParseUtil(getConf()).parse(content);
> {code}
> The constructor of ParseUtil instantiates a cached thread pool with no upper 
> limit on its size:
> {code:java}
> executorService = Executors.newCachedThreadPool(new ThreadFactoryBuilder()
>   .setNameFormat("parse-%d").setDaemon(true).build());
> {code}
> This shows that every invocation of the map method instantiates a new 
> CachedThreadPool and never shuts it down. The ExecutorService field in 
> ParseUtil.java is therefore not used correctly and causes a memory leak.
> Solution
> Have all map invocations share a single FixedThreadPool whose size can be 
> configured in nutch-site.xml; see the patch file for details.
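The proposed solution can be sketched roughly as follows. This is a minimal illustration, not the actual patch: the class name, the default size, and the idea that the size is passed in from a nutch-site.xml property are assumptions for the sketch.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Rough sketch of the proposed fix: all map invocations share one bounded
// pool instead of each creating its own unbounded CachedThreadPool.
public class SharedParsePool {
    private static final int DEFAULT_POOL_SIZE = 8; // assumed default
    private static ExecutorService pool;

    // Lazily create a single fixed-size pool; `size` would come from a
    // nutch-site.xml property (property name left unspecified here).
    public static synchronized ExecutorService get(int size) {
        if (pool == null) {
            pool = Executors.newFixedThreadPool(size > 0 ? size : DEFAULT_POOL_SIZE);
        }
        return pool;
    }

    // Shut the shared pool down once the job is finished.
    public static synchronized void shutdown() {
        if (pool != null) {
            pool.shutdown();
            try {
                pool.awaitTermination(10, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

Because every caller receives the same bounded pool, the number of parser threads stays capped for the lifetime of the job instead of growing with each map invocation.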





[jira] [Updated] (NUTCH-1739) ExecutorService field in ParseUtil.java not be right use and cause memory leak

2014-03-18 Thread ysc (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ysc updated NUTCH-1739:
---

Affects Version/s: 2.1
   2.2
   2.2.1

> ExecutorService field in ParseUtil.java not be right use and cause memory leak
> --
>
> Key: NUTCH-1739
> URL: https://issues.apache.org/jira/browse/NUTCH-1739
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.6, 2.1, 1.7, 2.2, 1.8, 2.2.1
> Environment: JDK32, runtime/local
>Reporter: ysc
>Priority: Critical
> Attachments: nutch1.7.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>





[jira] [Comment Edited] (NUTCH-1739) ExecutorService field in ParseUtil.java not be right use and cause memory leak

2014-03-18 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938930#comment-13938930
 ] 

Sebastian Nagel edited comment on NUTCH-1739 at 3/18/14 7:43 AM:
-

Thanks, [~yangshangchuan]. But isn't this fixed with NUTCH-1640 (contained in 
1.8 which was just released)?


was (Author: wastl-nagel):
Thanks, [~yangshangchuan]. But isn't this fixed with NUTCH-1640 (contained in 
1.8 which was just released).

> ExecutorService field in ParseUtil.java not be right use and cause memory leak
> --
>
> Key: NUTCH-1739
> URL: https://issues.apache.org/jira/browse/NUTCH-1739
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.6, 1.7, 1.8
> Environment: JDK32, runtime/local
>Reporter: ysc
>Priority: Critical
> Attachments: nutch1.7.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>





[jira] [Commented] (NUTCH-1739) ExecutorService field in ParseUtil.java not be right use and cause memory leak

2014-03-18 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938930#comment-13938930
 ] 

Sebastian Nagel commented on NUTCH-1739:


Thanks, [~yangshangchuan]. But isn't this fixed with NUTCH-1640 (contained in 
1.8 which was just released).

> ExecutorService field in ParseUtil.java not be right use and cause memory leak
> --
>
> Key: NUTCH-1739
> URL: https://issues.apache.org/jira/browse/NUTCH-1739
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.6, 1.7, 1.8
> Environment: JDK32, runtime/local
>Reporter: ysc
>Priority: Critical
> Attachments: nutch1.7.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>





[jira] [Updated] (NUTCH-1739) ExecutorService field in ParseUtil.java not be right use and cause memory leak

2014-03-18 Thread ysc (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ysc updated NUTCH-1739:
---

Affects Version/s: (was: 2.2.1)
   (was: 2.2)
   (was: 2.1)

> ExecutorService field in ParseUtil.java not be right use and cause memory leak
> --
>
> Key: NUTCH-1739
> URL: https://issues.apache.org/jira/browse/NUTCH-1739
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.6, 1.7, 1.8
> Environment: JDK32, runtime/local
>Reporter: ysc
>Priority: Critical
> Attachments: nutch1.7.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>





[jira] [Comment Edited] (NUTCH-1739) ExecutorService field in ParseUtil.java not be right use and cause memory leak

2014-03-18 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938922#comment-13938922
 ] 

Alparslan Avcı edited comment on NUTCH-1739 at 3/18/14 7:32 AM:


It seems there is no problem for 2.x since all of the ParseUtil objects are 
initialized once for a job. So, thread pool is shared for uses of ParseUtil in 
the same job.


was (Author: alparslan.avci):
It seems there is no problem for 2.x since all of the ParseUtil objects 
initialized once for a job. So, thread pool is shared for uses of ParseUtil in 
the same job.

> ExecutorService field in ParseUtil.java not be right use and cause memory leak
> --
>
> Key: NUTCH-1739
> URL: https://issues.apache.org/jira/browse/NUTCH-1739
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.6, 2.1, 1.7, 2.2, 1.8, 2.2.1
> Environment: JDK32, runtime/local
>Reporter: ysc
>Priority: Critical
> Attachments: nutch1.7.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>





[jira] [Commented] (NUTCH-1739) ExecutorService field in ParseUtil.java not be right use and cause memory leak

2014-03-18 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938922#comment-13938922
 ] 

Alparslan Avcı commented on NUTCH-1739:
---

It seems there is no problem for 2.x since all of the ParseUtil objects 
initialized once for a job. So, thread pool is shared for uses of ParseUtil in 
the same job.

> ExecutorService field in ParseUtil.java not be right use and cause memory leak
> --
>
> Key: NUTCH-1739
> URL: https://issues.apache.org/jira/browse/NUTCH-1739
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.6, 2.1, 1.7, 2.2, 1.8, 2.2.1
> Environment: JDK32, runtime/local
>Reporter: ysc
>Priority: Critical
> Attachments: nutch1.7.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>


