Nutch 0.9 not loading plugins (sorry very long)

2006-11-08 Thread zzcgiacomini

Hi everybody,
Sorry to come back to this issue with such a long mail, but I really
cannot get my plugin loaded.
I have read and applied the suggestions given in various previous
postings on this list, but I still have not gotten any results.

Basically, I have used part of the code written for the recommended
plugin example from the Nutch wiki and kept only the Parse extension.
I have ported it to Nutch 0.9 and run the inject/generate/fetch cycle.
The plugin is compiled and correctly installed in the
$NUTCH_HOME/plugins/parse-rec directory.


My problem is that my plugin never seems to be executed, even though
it appears to be correctly registered.
Another problem is that I cannot get the plugin system to produce any
logs unless I invoke it directly (see below).
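
One possible reason for the missing log lines, not confirmed anywhere in this thread, is plain level filtering: every message in the plugin is written with LOG.debug, so nothing appears if the effective logger level is INFO or higher. The sketch below only illustrates that mechanism. It uses java.util.logging as a self-contained stand-in for the commons-logging setup Nutch actually uses, and the logger name and messages are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Stand-in demo: java.util.logging instead of commons-logging/log4j (assumption).
class DebugLevelDemo {

    /** Logs one debug-level and one info-level message; returns what got through. */
    static List<String> emit(Level threshold) {
        final List<String> seen = new ArrayList<String>();
        // Hypothetical logger name, one per threshold so the runs do not interfere.
        Logger log = Logger.getLogger("demo.parse.rec." + threshold.getName());
        log.setUseParentHandlers(false);   // keep the console quiet
        log.setLevel(threshold);
        Handler capture = new Handler() {
            @Override public void publish(LogRecord r) { seen.add(r.getMessage()); }
            @Override public void flush() { }
            @Override public void close() { }
        };
        log.addHandler(capture);
        log.fine("filter() called");   // FINE plays the role of DEBUG here
        log.info("plugin loaded");
        log.removeHandler(capture);
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("INFO threshold: " + emit(Level.INFO));
        System.out.println("FINE threshold: " + emit(Level.FINE));
    }
}
```

With the threshold at INFO only the info-level message is captured; lowering it to FINE lets the debug-level message through as well. Whether the same applies to the logging configuration used in this crawl would have to be checked there.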


I include here all my code, config, etc., hoping someone can point out my
mistakes or misunderstandings.


-Corrado

I took the code from the latest nightly, at revision 472436, and
put my plugin code in
trunk/src/plugin/parse-rec/src/java/org/apache/nutch/parse/rec/RecParseFilter.java


Here are the code and config files:
__ RecParseFilter.java __

package org.apache.nutch.parse.rec;

// JDK imports
import java.util.Enumeration;
import java.util.Properties;
import java.util.logging.Logger;

// Nutch imports
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.protocol.Content;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import org.apache.nutch.util.NutchConfiguration;
import org.apache.hadoop.conf.Configuration;

import org.w3c.dom.DocumentFragment;

public class RecParseFilter implements HtmlParseFilter {

 /** Configuration  */
 private Configuration conf;

 public static final Log LOG = LogFactory.getLog(RecParseFilter.class);

 /** The Recommended meta data attribute name */
 public static final String META_RECOMMENDED_NAME = "Recommended";

 /** Scan the HTML document looking for a recommended meta tag.  */
 public Parse filter(Content content, Parse parse, HTMLMetaTags metaTags,
                     DocumentFragment doc) {

   LOG.debug("RecParseFilter::filter() ---");
   // Try to find the document's recommended term
   String recommendation = null;
   Properties generalMetaTags = metaTags.getGeneralTags();
   String title = parse.getData().getTitle();
   LOG.debug("RecParseFilter::filter() - Document Title : " + title);

   for (Enumeration tagNames = generalMetaTags.propertyNames();
        tagNames.hasMoreElements(); ) {
     if (tagNames.nextElement().equals("recommended")) {
       recommendation = generalMetaTags.getProperty("recommended");
       LOG.debug("RecParseFilter::filter() - Found a Recommendation for "
                 + recommendation);
     }
   }

   if (recommendation == null) {
     LOG.debug("RecParseFilter::filter() - No Recommendation");
   } else {
     LOG.debug("RecParseFilter::filter() - Adding Recommendation for "
               + recommendation);
     // Store the value under the "Recommended" key in the content metadata
     parse.getData().getContentMeta().set(META_RECOMMENDED_NAME,
                                          recommendation);
   }
   LOG.debug("RecParseFilter::filter() --");
   return parse;
 }

 public Configuration getConf() {
   LOG.debug("RecParseFilter::getConf() --");
   LOG.debug("RecParseFilter::getConf() --");
   return this.conf;
 }

 public void setConf(Configuration conf) {
   LOG.debug("RecParseFilter::setConf() --");
   LOG.debug("RecParseFilter::setConf() --");
   this.conf = conf;
 }
}
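
The lookup that filter() performs on the general meta tags can be tried in isolation. The following is a minimal sketch, assuming a plain java.util.Properties object as a stand-in for what metaTags.getGeneralTags() returns; the class and method names are hypothetical. It shows that getProperty already returns null for a missing key, so the enumeration loop is not strictly needed, and that the key comparison is case-sensitive.

```java
import java.util.Properties;

// Stand-alone sketch; Properties stands in for metaTags.getGeneralTags() (assumption).
class RecommendedLookup {

    /** Meta tag name as it is looked up in the filter code. */
    static final String META_NAME = "recommended";

    /** Returns the recommended value, or null when the tag is absent. */
    static String findRecommendation(Properties generalMetaTags) {
        return generalMetaTags.getProperty(META_NAME);
    }

    public static void main(String[] args) {
        Properties lower = new Properties();
        lower.setProperty("recommended", "plugins");

        Properties capitalized = new Properties();
        capitalized.setProperty("Recommended", "plugins");

        System.out.println(findRecommendation(lower));            // key matches exactly
        System.out.println(findRecommendation(capitalized));      // different case: no match
        System.out.println(findRecommendation(new Properties())); // tag absent
    }
}
```

A null result here corresponds to the branch in filter() that logs that no recommendation was found.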


_plugin.xml___

<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="parse-rec"
   name="Recommended Parser/Filter"
   version="0.0.1"
   provider-name="nutch.org">

   <runtime>
      <!-- As defined in build.xml this plugin will end up bundled as recommended.jar -->
      <library name="parse-rec.jar">
         <export name="*"/>
      </library>
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>

   <!-- The RecommendedParser extends the HtmlParseFilter to grab the
        contents of any recommended meta tags -->
   <extension id="org.apache.nutch.parse.rec.RecParseFilter"
              name="Recommended Parser"
              point="org.apache.nutch.parse.HtmlParseFilter">
      <implementation id="RecParseFilter"
                      class="org.apache.nutch.parse.rec.RecParseFilter">
         <parameter name="contentType" value="text/html"/>
         <parameter name="pathSuffix" value=""/>
      </implementation>
   </extension>
</plugin>


I have added this property to nutch-site.xml:

___nutch-site.xml__
 <property>
   <name>plugin.includes</name>
   <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|js|rec)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
 </property>
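
A quick way to sanity-check a plugin.includes value is to test plugin ids against it with java.util.regex. The sketch below assumes that the value is treated as a regular expression that must match the whole plugin id; that is an assumption about Nutch's behavior, not something stated in this mail. The class and method names are hypothetical, and the pattern is the value above written as plain alternation.

```java
import java.util.regex.Pattern;

// Stand-alone check; full-match semantics for plugin ids is an assumption.
class PluginIncludesCheck {

    static final String INCLUDES =
        "nutch-extensionpoints|protocol-http|urlfilter-regex"
        + "|parse-(text|html|js|rec)|index-basic|query-(basic|site|url)"
        + "|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)";

    /** True when the whole plugin id matches the includes expression. */
    static boolean isIncluded(String includes, String pluginId) {
        return Pattern.compile(includes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String[] ids = { "parse-rec", "parse-html", "parse-pdf", "nutch-extensionpoints" };
        for (String id : ids) {
            System.out.println(id + " included: " + isIncluded(INCLUDES, id));
        }
    }
}
```

Under that assumption, parse-rec is selected by this value, so the expression itself would not explain a plugin that is registered but never called.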


 

Re: Nutch 0.9 not loading plugins (sorry very long)

2006-11-08 Thread zzcgiacomini
Sorry, in my previous posting the output of nutch readseg -get was
wrong. Here is the actual output:


-Corrado

SegmentReader: get 'http://testmachine.test.net/index.html'
Content::
Version: 2
url: http://testmachine.test.net/index.html
base: http://testmachine.test.net/index.html
contentType: text/html
metadata: Content-Length=345 Connection=close 
ETag=2f4ac-159-421166c12a140 nutch.segment.name=20061108113703 
nutch.crawl.score=1.0 Recommended=plugins 
nutch.content.digest=82e307c71d7476ce729a8e6d3b0de50a 
Accept-Ranges=bytes Server=Apache/2.2.0 (Fedora) Content-Type=text/html; 
charset=UTF-8 date=Wed, 08 Nov 2006 10:37:57 GMT Last-Modified=Tue, 31 
Oct 2006 07:34:53 GMT

Content:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"
"http://www.w3.org/TR/html4/frameset.dtd">

<HTML>
<HEAD>
<TITLE>
PLUG-IN TEST
</TITLE>
</HEAD>
<meta name="recommended" content="plugins">
<A HREF="http://testmachine.test.net/omniORB/index.html">omniORB</A>
<BR>
<A HREF="http://testmachine.test.net/nutch/index.html">Nutch</A>
</HTML>

Crawl Generate::
Version: 4
Status: 1 (DB_unfetched)
Fetch time: Wed Nov 08 11:36:31 CET 2006
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: null
Metadata: null

Crawl Fetch::
Version: 4
Status: 5 (fetch_success)
Fetch time: Wed Nov 08 11:37:58 CET 2006
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: 82e307c71d7476ce729a8e6d3b0de50a
Metadata: null

Crawl Parse::
Version: 4
Status: 4 (linked)
Fetch time: Wed Nov 08 11:38:05 CET 2006
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.5
Signature: null
Metadata: null

ParseData::
Version: 5
Status: success(1,0)
Title: PLUG-IN TEST
Outlinks: 2
 outlink: toUrl: http://testmachine.test.net/omniORB/index.html anchor: omniORB
 outlink: toUrl: http://testmachine.test.net/nutch/index.html anchor: Nutch
Content Metadata: Connection=close Content-Length=345 
nutch.crawl.score=1.0 nutch.segment.name=20061108113703 
ETag=2f4ac-159-421166c12a140 Recommended=plugins 
nutch.content.digest=82e307c71d7476ce729a8e6d3b0de50a 
Accept-Ranges=bytes Content-Type=text/html; charset=UTF-8 
Server=Apache/2.2.0 (Fedora) Last-Modified=Tue, 31 Oct 2006 07:34:53 GMT 
date=Wed, 08 Nov 2006 10:37:57 GMT

Parse Metadata: OriginalCharEncoding=UTF-8 CharEncodingForConversion=UTF-8

ParseText::
PLUG-IN TEST omniORB Nutch