[jira] [Commented] (NUTCH-809) Parse-metatags plugin

2011-07-07 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061177#comment-13061177
 ] 

Markus Jelsma commented on NUTCH-809:
-

Why don't we include this plugin?

 Parse-metatags plugin
 -

 Key: NUTCH-809
 URL: https://issues.apache.org/jira/browse/NUTCH-809
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-809.patch


 h2. Parse-metatags plugin
 The parse-metatags plugin consists of an HTMLParserFilter which takes as 
 a parameter a list of metatag names, with '*' as the default value. The values are 
 separated by ';'.
 In order to extract the values of the metatags description and keywords, you 
 must specify in nutch-site.xml:
 {code:xml}
 <property>
   <name>metatags.names</name>
   <value>description;keywords</value>
 </property>
 {code}
 The MetatagIndexer uses the output of the parsing above to create two fields, 
 'keywords' and 'description'. Note that keywords is multivalued.
 The query-basic plugin is used to include these fields in the search, e.g. in 
 nutch-site.xml:
 {code:xml}
 <property>
   <name>query.basic.description.boost</name>
   <value>2.0</value>
 </property>
 <property>
   <name>query.basic.keywords.boost</name>
   <value>2.0</value>
 </property>
 {code}
 This code has been developed by DigitalPebble Ltd and offered to the 
 community by ANT.com
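 
 For illustration only, here is a minimal, self-contained Java sketch of the idea behind the 
 indexing step: copying the metatag values collected at parse time into index fields, with 
 keywords kept multivalued. The class and variable names are hypothetical; this is not the 
 plugin source.
 {code:java}
 import java.util.*;
 
 // Hypothetical sketch (not the plugin code): copy configured metatag values
 // from a parse-metadata map into index fields.
 public class MetatagSketch {
   public static void main(String[] args) {
     // values an HTML parser filter could have collected, keyed by tag name
     Map<String, List<String>> parseMeta = new HashMap<>();
     parseMeta.put("description", Arrays.asList("A page about Nutch plugins"));
     parseMeta.put("keywords", Arrays.asList("nutch", "crawler", "plugins"));
 
     // tags requested via metatags.names, split on ';' as described above
     String[] wanted = "description;keywords".split(";");
 
     Map<String, List<String>> doc = new LinkedHashMap<>();
     for (String tag : wanted) {
       List<String> values = parseMeta.get(tag);
       if (values != null) {
         // 'keywords' stays multivalued; 'description' usually has one value
         doc.computeIfAbsent(tag, k -> new ArrayList<>()).addAll(values);
       }
     }
     System.out.println(doc); // {description=[...], keywords=[nutch, crawler, plugins]}
   }
 }
 {code}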

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2011-07-07 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-797:


Fix Version/s: 2.0
   1.4

Back on radar: has this ever been committed at all?

 parse-tika is not properly constructing URLs when the target begins with a ?
 --

 Key: NUTCH-797
 URL: https://issues.apache.org/jira/browse/NUTCH-797
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1
 Environment: Win 7, Java(TM) SE Runtime Environment (build 
 1.6.0_16-b01)
 Also repro's on RHEL and java 1.4.2
Reporter: Robert Hohman
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: 1.4, 2.0

 Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch


 This is my first bug and patch on nutch, so apologies if I have not provided 
 enough detail.
 In crawling the page at 
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are 
 links in the page that look like this:
 <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a 
 href="?co=0&sk=0&p=3&pi=1">3</a>
 in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as 
 getOutlinks looks for links, it comes across this link, and constructs a new 
 url with a base URL class built from 
 "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0", and a 
 target of ?co=0&sk=0&p=2&pi=1
 The URL class, per RFC 3986 at 
 http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines 
 how to merge these two, and per the RFC, the URL class merges these to: 
 http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
 because the RFC explicitly states that the rightmost url segment (the 
 Search.aspx in this case) should be ripped off before combining.
 While this is compliant with the RFC, it means the URLs which are created for 
 the next round of fetching are incorrect.  Modern browsers seem to handle 
 this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure 
 exception or handling of what is a poorly formed url on accenture's part.
 I have fixed this by modifying DOMContentUtils to look for the case where a ? 
 begins the target, and then pulling the rightmost component out of the base 
 and inserting it into the target before the ?, so the target in this example 
 becomes:
 Search.aspx?co=0&sk=0&p=2&pi=1
 The URL class then properly constructs the new url as:
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
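 For illustration, a minimal, self-contained sketch of this behaviour and of the proposed 
 workaround using java.net.URL; the helper name is hypothetical and this is not the attached 
 patch.
 {code:java}
 import java.net.MalformedURLException;
 import java.net.URL;
 
 // Sketch of the pure-query case: resolving "?..." against the base drops the
 // rightmost path segment, so the workaround re-prepends it before resolving.
 public class PureQuerySketch {
   static URL resolve(URL base, String target) throws MalformedURLException {
     if (target.startsWith("?")) {
       String path = base.getPath();                            // /Careers/ASPX/Search.aspx
       String last = path.substring(path.lastIndexOf('/') + 1); // Search.aspx
       if (!last.isEmpty()) {
         return new URL(base, last + target);                   // keep Search.aspx
       }
     }
     return new URL(base, target);                              // default resolution
   }
 
   public static void main(String[] args) throws Exception {
     URL base = new URL("http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0");
     // default resolution loses Search.aspx, as described above
     System.out.println(new URL(base, "?co=0&sk=0&p=2&pi=1"));
     // workaround keeps it
     System.out.println(resolve(base, "?co=0&sk=0&p=2&pi=1"));
   }
 }
 {code}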
 If it is agreed that this solution works, I believe the other html parsers in 
 nutch would need to be modified in a similar way.
 Can I get feedback on this proposed solution?  Specifically I'm worried about 
 unforeseen side effects.
 Much thanks
 Here is the patch info:
 Index: 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
 ===
 --- 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
(revision 916362)
 +++ 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
(working copy)
 @@ -299,6 +299,50 @@
  return false;
}

 +  private URL fixURL(URL base, String target) throws MalformedURLException
 +  {
 +   // handle params that are embedded into the base url - move them to target
 +   // so URL class constructs the new url class properly
 +   if (base.toString().indexOf(';') > 0)
 +  return fixEmbeddedParams(base, target);
 +
 +   // handle the case that there is a target that is a pure query.
 +   // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble
 +   // URLs but I've seen this in numerous places, for example at
 +   // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
 +   // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by default
 +   // URL constructs the base+target combo as
 +   // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, incorrectly
 +   // dropping the Search.aspx target
 +   //
 +   // Browsers handle these just fine, they must have an exception similar to this
 +   if (target.startsWith("?"))
 +   {
 +   return fixPureQueryTargets(base, target);
 +   }
 +
 +   return new URL(base, target);
 +  }
 +
 +  private URL fixPureQueryTargets(URL base, String target) throws MalformedURLException
 +  {
 + if (!target.startsWith("?"))
 + return new URL(base, target);
 +
 + String basePath = base.getPath();
 + String baseRightMost = "";
 + int baseRightMostIdx = basePath.lastIndexOf("/");
 + if (baseRightMostIdx != -1)
 + {
 + baseRightMost = 

[jira] [Updated] (NUTCH-925) plugins stored in weakhashmap lead memory leak

2011-07-07 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-925:


Fix Version/s: 1.4

This has been fixed for 2.0 in NUTCH-844 but not in 1.x.

 plugins stored in weakhashmap lead memory leak
 --

 Key: NUTCH-925
 URL: https://issues.apache.org/jira/browse/NUTCH-925
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.2
Reporter: congliu
 Fix For: 1.4


 I suffer a serious memory leak using Nutch 1.2 through a very deep crawl. I get 
 an error like this:
 Exception in thread Thread-113544 java.lang.OutOfMemoryError: PermGen space
   at java.lang.Throwable.getStackTraceElement(Native Method)
   at java.lang.Throwable.getOurStackTrace(Throwable.java:591)
   at java.lang.Throwable.printStackTrace(Throwable.java:510)
   at 
 org.apache.log4j.spi.ThrowableInformation.getThrowableStrRep(ThrowableInformation.java:76)
   at 
 org.apache.log4j.spi.LoggingEvent.getThrowableStrRep(LoggingEvent.java:407)
   at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:305)
   at 
 org.apache.log4j.DailyRollingFileAppender.subAppend(DailyRollingFileAppender.java:359)
   at org.apache.log4j.WriterAppender.append(WriterAppender.java:160)
   at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
   at 
 org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
   at org.apache.log4j.Category.callAppenders(Category.java:206)
   at org.apache.log4j.Category.forcedLog(Category.java:391)
   at org.apache.log4j.Category.log(Category.java:856)
   at org.slf4j.impl.Log4jLoggerAdapter.log(Log4jLoggerAdapter.java:509)
   at 
 org.apache.commons.logging.impl.SLF4JLocationAwareLog.warn(SLF4JLocationAwareLog.java:173)
   at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:256)
 Exception in thread main java.io.IOException: Job failed!
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
   at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1107)
   at org.apache.nutch.crawl.Crawl.main(Crawl.java:133)
 I guess the plugin repository cache leads to the memory leak.
 As you know, plugins are stored in a WeakHashMap<conf, plugins>, and a new 
 classloader is created whenever plugins are needed.
 Usually a WeakHashMap entry can be GC'd, but classes and classloaders are stored in 
 PermGen, NOT the heap, and GC cannot reclaim PermGen, so java.lang.OutOfMemoryError: 
 PermGen space occurs... Have any Nutch issues covered this problem, 
 or is there any solution? 
 NUTCH-356 may help?
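 For illustration only, a small self-contained sketch (not Nutch code; names are hypothetical) 
 of a WeakHashMap-keyed cache like the plugin repository's: dropping the key lets the entry be 
 collected, but classes already loaded by a per-entry classloader stay in PermGen for as long 
 as that loader remains reachable.
 {code:java}
 import java.util.Map;
 import java.util.WeakHashMap;
 
 // Illustrative only: a cache shaped like WeakHashMap<Configuration, PluginRepository>.
 // The map entry can be collected once the key is unreachable, but that alone
 // does not free classes/classloaders still referenced elsewhere (they live in PermGen).
 public class WeakCacheSketch {
   static final Map<Object, Object> CACHE = new WeakHashMap<>();
 
   public static void main(String[] args) throws InterruptedException {
     Object conf = new Object();        // stands in for a Configuration key
     CACHE.put(conf, new Object());     // stands in for a PluginRepository value
     System.out.println("cached entries: " + CACHE.size()); // 1
 
     conf = null;                       // key becomes weakly reachable
     System.gc();                       // only a hint; entry may now be cleared
     Thread.sleep(100);
     System.out.println("after gc: " + CACHE.size()); // typically 0
   }
 }
 {code}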

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-783) IndexerChecker Utilty

2011-07-07 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061222#comment-13061222
 ] 

Markus Jelsma commented on NUTCH-783:
-

Hey, this code is not compatible with the Nutch API in 1.4

{code}
+  List<String> values = doc.getFieldValues(fname);
+  if (values != null) {
+    for (String value : values) {
+      int minText = Math.min(100, value.length());
+      System.out.println(fname + " :\t" + value.substring(0, minText));
+    }
+  }
{code}

changed to

{code}
  List<Object> values = Arrays.asList(doc.getFieldValue(fname));
  if (values != null) {
    for (Object value : values) {
      String str = value.toString();
      int minText = Math.min(100, str.length());
      System.out.println(fname + " :\t" + str.substring(0, minText));
    }
  }
{code}


It works now. I think it's nice to have in 1.4 and 2.0. 

 IndexerChecker Utilty
 -

 Key: NUTCH-783
 URL: https://issues.apache.org/jira/browse/NUTCH-783
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0

 Attachments: NUTCH-783.patch


 This patch contains a new utility which allows checking the configuration of 
 the indexing filters. The IndexerChecker reads and parses a URL and runs the 
 indexers on it. It displays the fields obtained and the first 
 100 characters of their values.
 Can be used e.g. ./nutch org.apache.nutch.indexer.IndexerChecker 
 http://www.lemonde.fr/

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Issue Comment Edited] (NUTCH-783) IndexerChecker Utilty

2011-07-07 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061222#comment-13061222
 ] 

Markus Jelsma edited comment on NUTCH-783 at 7/7/11 11:43 AM:
--

Yes, this code is not compatible with the Nutch API in 1.4

{code}
+  List<String> values = doc.getFieldValues(fname);
+  if (values != null) {
+    for (String value : values) {
+      int minText = Math.min(100, value.length());
+      System.out.println(fname + " :\t" + value.substring(0, minText));
+    }
+  }
{code}

changed to

{code}
  List<Object> values = Arrays.asList(doc.getFieldValue(fname));
  if (values != null) {
    for (Object value : values) {
      String str = value.toString();
      int minText = Math.min(100, str.length());
      System.out.println(fname + " :\t" + str.substring(0, minText));
    }
  }
{code}


It works now. I think it's nice to have in 1.4 and 2.0. 

  was (Author: markus17):
Hey, this code is not compatible with the Nutch API in 1.4

{code}
+  List<String> values = doc.getFieldValues(fname);
+  if (values != null) {
+    for (String value : values) {
+      int minText = Math.min(100, value.length());
+      System.out.println(fname + " :\t" + value.substring(0, minText));
+    }
+  }
{code}

changed to

{code}
  List<Object> values = Arrays.asList(doc.getFieldValue(fname));
  if (values != null) {
    for (Object value : values) {
      String str = value.toString();
      int minText = Math.min(100, str.length());
      System.out.println(fname + " :\t" + str.substring(0, minText));
    }
  }
{code}


It works now. I think it's nice to have in 1.4 and 2.0. 
  
 IndexerChecker Utilty
 -

 Key: NUTCH-783
 URL: https://issues.apache.org/jira/browse/NUTCH-783
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0

 Attachments: NUTCH-783.patch


 This patch contains a new utility which allows checking the configuration of 
 the indexing filters. The IndexerChecker reads and parses a URL and runs the 
 indexers on it. It displays the fields obtained and the first 
 100 characters of their values.
 Can be used e.g. ./nutch org.apache.nutch.indexer.IndexerChecker 
 http://www.lemonde.fr/

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




Upgrade libs to support Hadoop 0.20.203 and 0.21

2011-07-07 Thread Markus Jelsma
Hi,

To support Hadoop > 0.20 in Nutch we should upgrade our Ivy configuration 
for Hadoop. Newer versions need Jackson and Avro. We can include Avro 
and Jackson as both are available under the ASL 2.0.

Thoughts?

Cheers,

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


[jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2011-07-07 Thread Robert Hohman (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061280#comment-13061280
 ] 

Robert Hohman commented on NUTCH-797:
-

Hi Markus - I am not sure if the committers committed it. I thought they were 
going to. 

We have moved off of nutch and so I am a little out of touch with what the 
latest is. 

If you have any other questions let me know. 






 parse-tika is not properly constructing URLs when the target begins with a ?
 --

 Key: NUTCH-797
 URL: https://issues.apache.org/jira/browse/NUTCH-797
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1
 Environment: Win 7, Java(TM) SE Runtime Environment (build 
 1.6.0_16-b01)
 Also repro's on RHEL and java 1.4.2
Reporter: Robert Hohman
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: 1.4, 2.0

 Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch


 This is my first bug and patch on nutch, so apologies if I have not provided 
 enough detail.
 In crawling the page at 
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are 
 links in the page that look like this:
 <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a 
 href="?co=0&sk=0&p=3&pi=1">3</a>
 in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as 
 getOutlinks looks for links, it comes across this link, and constructs a new 
 url with a base URL class built from 
 "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0", and a 
 target of ?co=0&sk=0&p=2&pi=1
 The URL class, per RFC 3986 at 
 http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines 
 how to merge these two, and per the RFC, the URL class merges these to: 
 http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
 because the RFC explicitly states that the rightmost url segment (the 
 Search.aspx in this case) should be ripped off before combining.
 While this is compliant with the RFC, it means the URLs which are created for 
 the next round of fetching are incorrect.  Modern browsers seem to handle 
 this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure 
 exception or handling of what is a poorly formed url on accenture's part.
 I have fixed this by modifying DOMContentUtils to look for the case where a ? 
 begins the target, and then pulling the rightmost component out of the base 
 and inserting it into the target before the ?, so the target in this example 
 becomes:
 Search.aspx?co=0&sk=0&p=2&pi=1
 The URL class then properly constructs the new url as:
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
 If it is agreed that this solution works, I believe the other html parsers in 
 nutch would need to be modified in a similar way.
 Can I get feedback on this proposed solution?  Specifically I'm worried about 
 unforeseen side effects.
 Much thanks
 Here is the patch info:
 Index: 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
 ===
 --- 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
(revision 916362)
 +++ 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
(working copy)
 @@ -299,6 +299,50 @@
  return false;
}

 +  private URL fixURL(URL base, String target) throws MalformedURLException
 +  {
 +   // handle params that are embedded into the base url - move them to target
 +   // so URL class constructs the new url class properly
 +   if (base.toString().indexOf(';') > 0)
 +  return fixEmbeddedParams(base, target);
 +
 +   // handle the case that there is a target that is a pure query.
 +   // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble
 +   // URLs but I've seen this in numerous places, for example at
 +   // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
 +   // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by default
 +   // URL constructs the base+target combo as
 +   // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, incorrectly
 +   // dropping the Search.aspx target
 +   //
 +   // Browsers handle these just fine, they must have an exception similar to this
 +   if (target.startsWith("?"))
 +   {
 +   return fixPureQueryTargets(base, target);
 +   }
 +
 +   return new URL(base, target);
 +  }
 +
 +  private URL fixPureQueryTargets(URL base, String target) throws MalformedURLException
 +  {
 + if (!target.startsWith("?"))
 + return new URL(base, target);
 +
 + String basePath = base.getPath();
 + 

[jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2011-07-07 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061283#comment-13061283
 ] 

Markus Jelsma commented on NUTCH-797:
-

We'll look into it. Thanks for reporting.

 parse-tika is not properly constructing URLs when the target begins with a ?
 --

 Key: NUTCH-797
 URL: https://issues.apache.org/jira/browse/NUTCH-797
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1
 Environment: Win 7, Java(TM) SE Runtime Environment (build 
 1.6.0_16-b01)
 Also repro's on RHEL and java 1.4.2
Reporter: Robert Hohman
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: 1.4, 2.0

 Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch


 This is my first bug and patch on nutch, so apologies if I have not provided 
 enough detail.
 In crawling the page at 
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are 
 links in the page that look like this:
 <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a 
 href="?co=0&sk=0&p=3&pi=1">3</a>
 in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as 
 getOutlinks looks for links, it comes across this link, and constructs a new 
 url with a base URL class built from 
 "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0", and a 
 target of ?co=0&sk=0&p=2&pi=1
 The URL class, per RFC 3986 at 
 http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines 
 how to merge these two, and per the RFC, the URL class merges these to: 
 http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
 because the RFC explicitly states that the rightmost url segment (the 
 Search.aspx in this case) should be ripped off before combining.
 While this is compliant with the RFC, it means the URLs which are created for 
 the next round of fetching are incorrect.  Modern browsers seem to handle 
 this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure 
 exception or handling of what is a poorly formed url on accenture's part.
 I have fixed this by modifying DOMContentUtils to look for the case where a ? 
 begins the target, and then pulling the rightmost component out of the base 
 and inserting it into the target before the ?, so the target in this example 
 becomes:
 Search.aspx?co=0&sk=0&p=2&pi=1
 The URL class then properly constructs the new url as:
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
 If it is agreed that this solution works, I believe the other html parsers in 
 nutch would need to be modified in a similar way.
 Can I get feedback on this proposed solution?  Specifically I'm worried about 
 unforeseen side effects.
 Much thanks
 Here is the patch info:
 Index: 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
 ===
 --- 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
(revision 916362)
 +++ 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
(working copy)
 @@ -299,6 +299,50 @@
  return false;
}

 +  private URL fixURL(URL base, String target) throws MalformedURLException
 +  {
 +   // handle params that are embedded into the base url - move them to target
 +   // so URL class constructs the new url class properly
 +   if (base.toString().indexOf(';') > 0)
 +  return fixEmbeddedParams(base, target);
 +
 +   // handle the case that there is a target that is a pure query.
 +   // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble
 +   // URLs but I've seen this in numerous places, for example at
 +   // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
 +   // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by default
 +   // URL constructs the base+target combo as
 +   // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, incorrectly
 +   // dropping the Search.aspx target
 +   //
 +   // Browsers handle these just fine, they must have an exception similar to this
 +   if (target.startsWith("?"))
 +   {
 +   return fixPureQueryTargets(base, target);
 +   }
 +
 +   return new URL(base, target);
 +  }
 +
 +  private URL fixPureQueryTargets(URL base, String target) throws MalformedURLException
 +  {
 + if (!target.startsWith("?"))
 + return new URL(base, target);
 +
 + String basePath = base.getPath();
 + String baseRightMost = "";
 + int baseRightMostIdx = basePath.lastIndexOf("/");
 + if (baseRightMostIdx != -1)
 + {
 + baseRightMost = 

[jira] [Commented] (NUTCH-925) plugins stored in weakhashmap lead memory leak

2011-07-07 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061291#comment-13061291
 ] 

Julien Nioche commented on NUTCH-925:
-

This should have been part of the batch of issues which I backported from 2.0 
to 1.x. I can't see it in the list of modifications for 1.3 though, so it is possible that 
it slipped through the net.

 plugins stored in weakhashmap lead memory leak
 --

 Key: NUTCH-925
 URL: https://issues.apache.org/jira/browse/NUTCH-925
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.2
Reporter: congliu
 Fix For: 1.4


 I suffer a serious memory leak using Nutch 1.2 through a very deep crawl. I get 
 an error like this:
 Exception in thread Thread-113544 java.lang.OutOfMemoryError: PermGen space
   at java.lang.Throwable.getStackTraceElement(Native Method)
   at java.lang.Throwable.getOurStackTrace(Throwable.java:591)
   at java.lang.Throwable.printStackTrace(Throwable.java:510)
   at 
 org.apache.log4j.spi.ThrowableInformation.getThrowableStrRep(ThrowableInformation.java:76)
   at 
 org.apache.log4j.spi.LoggingEvent.getThrowableStrRep(LoggingEvent.java:407)
   at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:305)
   at 
 org.apache.log4j.DailyRollingFileAppender.subAppend(DailyRollingFileAppender.java:359)
   at org.apache.log4j.WriterAppender.append(WriterAppender.java:160)
   at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
   at 
 org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
   at org.apache.log4j.Category.callAppenders(Category.java:206)
   at org.apache.log4j.Category.forcedLog(Category.java:391)
   at org.apache.log4j.Category.log(Category.java:856)
   at org.slf4j.impl.Log4jLoggerAdapter.log(Log4jLoggerAdapter.java:509)
   at 
 org.apache.commons.logging.impl.SLF4JLocationAwareLog.warn(SLF4JLocationAwareLog.java:173)
   at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:256)
 Exception in thread main java.io.IOException: Job failed!
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
   at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1107)
   at org.apache.nutch.crawl.Crawl.main(Crawl.java:133)
 I guess the plugin repository cache leads to the memory leak.
 As you know, plugins are stored in a WeakHashMap<conf, plugins>, and a new 
 classloader is created whenever plugins are needed.
 Usually a WeakHashMap entry can be GC'd, but classes and classloaders are stored in 
 PermGen, NOT the heap, and GC cannot reclaim PermGen, so java.lang.OutOfMemoryError: 
 PermGen space occurs... Have any Nutch issues covered this problem, 
 or is there any solution? 
 NUTCH-356 may help?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Resolved] (NUTCH-925) plugins stored in weakhashmap lead memory leak

2011-07-07 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-925.
-

Resolution: Duplicate

I've checked the PluginRepository diff of NUTCH-844 and compared it with 1.4. It's 
in!

 plugins stored in weakhashmap lead memory leak
 --

 Key: NUTCH-925
 URL: https://issues.apache.org/jira/browse/NUTCH-925
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.2
Reporter: congliu
 Fix For: 1.4


 I suffer a serious memory leak using Nutch 1.2 through a very deep crawl. I get 
 an error like this:
 Exception in thread Thread-113544 java.lang.OutOfMemoryError: PermGen space
   at java.lang.Throwable.getStackTraceElement(Native Method)
   at java.lang.Throwable.getOurStackTrace(Throwable.java:591)
   at java.lang.Throwable.printStackTrace(Throwable.java:510)
   at 
 org.apache.log4j.spi.ThrowableInformation.getThrowableStrRep(ThrowableInformation.java:76)
   at 
 org.apache.log4j.spi.LoggingEvent.getThrowableStrRep(LoggingEvent.java:407)
   at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:305)
   at 
 org.apache.log4j.DailyRollingFileAppender.subAppend(DailyRollingFileAppender.java:359)
   at org.apache.log4j.WriterAppender.append(WriterAppender.java:160)
   at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
   at 
 org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
   at org.apache.log4j.Category.callAppenders(Category.java:206)
   at org.apache.log4j.Category.forcedLog(Category.java:391)
   at org.apache.log4j.Category.log(Category.java:856)
   at org.slf4j.impl.Log4jLoggerAdapter.log(Log4jLoggerAdapter.java:509)
   at 
 org.apache.commons.logging.impl.SLF4JLocationAwareLog.warn(SLF4JLocationAwareLog.java:173)
   at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:256)
 Exception in thread main java.io.IOException: Job failed!
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
   at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1107)
   at org.apache.nutch.crawl.Crawl.main(Crawl.java:133)
 I guess the plugin repository cache leads to the memory leak.
 As you know, plugins are stored in a WeakHashMap<conf, plugins>, and a new 
 classloader is created whenever plugins are needed.
 Usually a WeakHashMap entry can be GC'd, but classes and classloaders are stored in 
 PermGen, NOT the heap, and GC cannot reclaim PermGen, so java.lang.OutOfMemoryError: 
 PermGen space occurs... Have any Nutch issues covered this problem, 
 or is there any solution? 
 NUTCH-356 may help?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[Nutch Wiki] Trivial Update of Archive and Legacy by LewisJohnMcgibbney

2011-07-07 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The Archive and Legacy page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/Archive%20and%20Legacy?action=diff&rev1=12&rev2=13

  == Archive and Legacy ==
  
  This section includes all Pre Nutch 1.3 material
+ <<TableOfContents(3)>>
  
  === Reference Section ===
   * [[http://frutch.free.fr/|Frutch Wiki]] -- French Nutch Wiki


Rebuilding site

2011-07-07 Thread lewis john mcgibbney
Hi,

As I am back home I propose to rebuild the site to point the current tutorial
link to the new 1.3 tutorial on the wiki. I would also like to formally make
my first commit by adding my name to the list of committers before I
progress with other bits and pieces.

Julien, I picked up from some of your activity (these are not your
exact words) a suggestion to make a clear definition of Nutch explicit on
the website. Can we get some ideas out in the open which summarize the
project and the points we wish to get across to prospective users, visitors
and generally anyone else who stumbles across the site?

If we can agree on some documentation then I will be happy to make the
changes. Should I open a ticket for this?

Thank you

-- 
*Lewis*


Re: Rebuilding site

2011-07-07 Thread Julien Nioche
Hi Lewis,


 As I am back home I propose to rebuild the site to point the current
 tutorial link to the new 1.3 tutorial on the wiki. I would also like to
 formally make my first commit by adding my name to the list of committers
 before I progress with other bits and pieces.


Good idea!

See also https://issues.apache.org/jira/browse/NUTCH-914 for a list of
things that we need to do on the website (long overdue)

If you could also update Otis' status from committer to former, this would
be great

Thanks for all your hard work and enthusiasm

Jul

-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


Re: Rebuilding site

2011-07-07 Thread lewis john mcgibbney
Thanks Julien, I didn't even see this ticket. I'm on it.

One further question: it would be interesting to find out why people are
still subscribing to the nutch-user@ list. I am aware that this was the old list
when Nutch was a sub-project of Lucene. There is a strong tendency for
people to cross-post when asking reasonably simple questions on the user
list; my assumption is that they think this will increase
their chances of getting a reply as it goes out to a larger audience...
which is not the case.

It's a pretty minor suggestion (if it has any substance at all), but would it
be possible to close the old nutch-user@ list and refer people to the
current user@ list?

On Thu, Jul 7, 2011 at 5:38 PM, Julien Nioche lists.digitalpeb...@gmail.com
 wrote:

 Hi Lewis,


 As I am back home I propose to rebuild the site to point the current
 tutorial link to the new 1.3 tutorial on the wiki. I would also like to
 formally make my first commit by adding my name to the list of committers
 before I progress with other bits and pieces.


 Good idea!

 See also https://issues.apache.org/jira/browse/NUTCH-914 for a list of
 things that we need to do on the website (long overdue)

 If you could also update Otis' status from committer to former, this would
 be great

 Thanks for all your hard work and enthusiasm

 Jul

 --
 *
 *Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com




-- 
*Lewis*


Build failed in Jenkins: Nutch-trunk #1539

2011-07-07 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-trunk/1539/

--
[...truncated 985 lines...]
A src/plugin/subcollection/src/java/org/apache/nutch/collection
A 
src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
A 
src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
A 
src/plugin/subcollection/src/java/org/apache/nutch/collection/package.html
A src/plugin/subcollection/src/java/org/apache/nutch/indexer
A 
src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection
A 
src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
A src/plugin/subcollection/README.txt
A src/plugin/subcollection/plugin.xml
A src/plugin/subcollection/build.xml
A src/plugin/index-more
A src/plugin/index-more/ivy.xml
A src/plugin/index-more/src
A src/plugin/index-more/src/test
A src/plugin/index-more/src/test/org
A src/plugin/index-more/src/test/org/apache
A src/plugin/index-more/src/test/org/apache/nutch
A src/plugin/index-more/src/test/org/apache/nutch/indexer
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more
A 
src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
A src/plugin/index-more/src/java
A src/plugin/index-more/src/java/org
A src/plugin/index-more/src/java/org/apache
A src/plugin/index-more/src/java/org/apache/nutch
A src/plugin/index-more/src/java/org/apache/nutch/indexer
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more
A 
src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
A 
src/plugin/index-more/src/java/org/apache/nutch/indexer/more/package.html
A src/plugin/index-more/plugin.xml
A src/plugin/index-more/build.xml
AU src/plugin/plugin.dtd
A src/plugin/parse-ext
A src/plugin/parse-ext/ivy.xml
A src/plugin/parse-ext/src
A src/plugin/parse-ext/src/test
A src/plugin/parse-ext/src/test/org
A src/plugin/parse-ext/src/test/org/apache
A src/plugin/parse-ext/src/test/org/apache/nutch
A src/plugin/parse-ext/src/test/org/apache/nutch/parse
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext
A 
src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
A src/plugin/parse-ext/src/java
A src/plugin/parse-ext/src/java/org
A src/plugin/parse-ext/src/java/org/apache
A src/plugin/parse-ext/src/java/org/apache/nutch
A src/plugin/parse-ext/src/java/org/apache/nutch/parse
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext
A 
src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java
A src/plugin/parse-ext/plugin.xml
A src/plugin/parse-ext/build.xml
A src/plugin/parse-ext/command
A src/plugin/urlnormalizer-pass
A src/plugin/urlnormalizer-pass/ivy.xml
A src/plugin/urlnormalizer-pass/src
A src/plugin/urlnormalizer-pass/src/test
A src/plugin/urlnormalizer-pass/src/test/org
A src/plugin/urlnormalizer-pass/src/test/org/apache
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net
A 
src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer
A 
src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass
AU
src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass/TestPassURLNormalizer.java
A src/plugin/urlnormalizer-pass/src/java
A src/plugin/urlnormalizer-pass/src/java/org
A src/plugin/urlnormalizer-pass/src/java/org/apache
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net
A 
src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer
A 
src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass
AU
src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass/PassURLNormalizer.java
AU src/plugin/urlnormalizer-pass/plugin.xml
AU src/plugin/urlnormalizer-pass/build.xml
A src/plugin/parse-html
A src/plugin/parse-html/ivy.xml
A src/plugin/parse-html/lib
A src/plugin/parse-html/lib/tagsoup.LICENSE.txt
A src/plugin/parse-html/src
A src/plugin/parse-html/src/test
A src/plugin/parse-html/src/test/org
A src/plugin/parse-html/src/test/org/apache
A src/plugin/parse-html/src/test/org/apache/nutch
A src/plugin/parse-html/src/test/org/apache/nutch/parse
A