Hi Eyal,

 Did you also add an alias at the bottom of parse-plugins.xml that maps
parse-exe to the actual extension id (the implementation id declared in
your plugin.xml)? I'm guessing that's your problem. The existing aliases
at the bottom of parse-plugins.xml show the format.
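
 For reference, here is a rough sketch of what that alias might look like.
I'm assuming the extension-id should match the implementation id from your
plugin.xml (org.apache.nutch.parse.exe.ExeParse); adjust it if yours
differs:

```xml
<aliases>
  <!-- keep the aliases already in the file; add one for the new plugin -->
  <alias name="parse-exe"
         extension-id="org.apache.nutch.parse.exe.ExeParse" />
</aliases>
```

 It may also be worth double-checking the class attribute in your
plugin.xml: it says org.apache.nutch.parse.exe.ExeParse, but your source
file is ExeParser.java.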

 Let me know if you still need more help and we'll go from there.

Thanks,
  Chris



On 10/17/07 6:53 AM, "eyal edri" <[EMAIL PROTECTED]> wrote:

> Hi all,
> 
> I'm trying to write a new plugin that will download content of type
> x-dosexec (EXE files).
> I've followed the "write your own plugin" tutorial in the wiki and done the
> following (some of these steps are not mentioned in the tutorial):
> 
>    1. Created a new dir under $NUTCH_HOME/src/plugins/parse-exe
>    2. Created new $NUTCH_HOME/src/plugins/parse-exe/plugin.xml [displayed
>    below]
>    3. Created new $NUTCH_HOME/src/plugins/parse-exe/build.xml [displayed
>    below]
>    4. Wrote the Java code in
>    $NUTCH_HOME/src/plugin/parse-exe/src/java/org/apache/nutch/parse/exe/ExeParser.java
>    5. Added "nutch-extensionpoints" & "parse-exe" to the 'plugins-include'
>    property in $NUTCH_HOME/conf/nutch-site.xml
>    6. Added code to $NUTCH_HOME/conf/parse-plugins.xml [written below]
>    7. Added code to $NUTCH_HOME/src/plugins/build.xml [written
>    below]
>    8. Copied $NUTCH_HOME/build/plugins/parse-exe/parse-exe.jar to
>    $NUTCH_HOME/plugins/parse-exe
>    9. Ran ant (build successful)
> 
> The log shows that Nutch identifies the plugin:
> 
> 2007-10-17 15:15:55,657 INFO  plugin.PluginRepository - Registered Plugins:
> 2007-10-17 15:15:55,657 INFO  plugin.PluginRepository -         the nutch
> core extension points (nutch-extensionpoints)
> 2007-10-17 15:15:55,657 INFO  plugin.PluginRepository -         Html Parse
> Plug-in (parse-html)
> 2007-10-17 15:15:55,657 INFO  plugin.PluginRepository -         Exe Parse
> Plug-in (parse-exe)
> 
> But when the fetcher encounters an x-dosexec file, it throws an exception:
> 
> 2007-10-17 15:17:16,146 WARN  parse.ParseUtil - No suitable parser found
> when trying to parse content http://www.foo.com/yyy/foo.exe of type
> application/x-dosexec
> 2007-10-17 15:17:16,146 WARN  fetcher.Fetcher - Error parsing:
> http://www.foo.com/yyy/foo.exe: failed(2,200):
> org.apache.nutch.parse.ParseException: parser not found for
> contentType=application/x-dosexec url=http://www.foo.com/yyy/movie30.exe
> 
> (Sorry, but the URL has been masked for security reasons.)
> 
> Am I missing something?
> 
> Thanks!
> 
> 
> 
> [$NUTCH_HOME/src/plugins/build.xml]
> 
> <ant dir="parse-exe" target="deploy"/>
> 
> [parse-plugins.xml]
> 
>  <mimeType name="application/x-dosexec">
>                 <plugin id="parse-exe" />
>   </mimeType>
> 
> 
> [plugin.xml] // copied and changed from parse-pdf
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <plugin
>    id="parse-exe"
>    name="Exe Parse Plug-in"
>    version="1.0.0"
>    provider-name="nutch.org">
> 
>    <runtime>
>       <library name="parse-exe.jar">
>          <export name="*"/>
>       </library>
>    </runtime>
> 
>    <requires>
>       <import plugin="nutch-extensionpoints"/>
>       <import plugin="lib-log4j"/>
>    </requires>
> 
>    <extension id="org.apache.nutch.parse.exe"
>               name="ExeParse"
>               point="org.apache.nutch.parse.Parser">
> 
>       <implementation id="org.apache.nutch.parse.exe.ExeParse"
>                       class="org.apache.nutch.parse.exe.ExeParse">
>         <parameter name="contentType" value="application/x-dosexec"/>
>         <parameter name="pathSuffix"  value=""/>
>       </implementation>
>    </extension>
> 
> </plugin>
> 
> ------------------------------------------------------------------------------
> -----------------------------------
> 
> [build.xml]
> 
> <?xml version="1.0"?>
> 
> <project name="parse-exe" default="jar-core">
> 
>   <import file="../build-plugin.xml"/>
> 
> </project>
> 
> ------------------------------------------------------------------------
> [ExeParser.java]
> 
> public class ExeParser implements Parser {
>   public static final Log LOG = LogFactory.getLog("org.apache.nutch.parse.exe");
>   private Configuration conf;
> 
>   public Parse getParse(Content content) {
> 
>     try {
> 
>       byte[] raw = content.getContent();
> 
>       // enter here my code ( i will replace this with real code)
>       LOG.info ("EDRI:: you have reached the parse-exe plugin!");
>       System.out.println("EDRI:: system.out.print... parse-exe");
> 
> 
> 
> 
>       String contentLength = content.getMetadata().get(Response.CONTENT_LENGTH);
>       if (contentLength != null
>           && raw.length != Integer.parseInt(contentLength)) {
>         return new ParseStatus(ParseStatus.FAILED,
>             ParseStatus.FAILED_TRUNCATED,
>             "Content truncated at " + raw.length
>             + " bytes. Parser can't handle incomplete exe file.")
>             .getEmptyParse(getConf());
>       }
> 
>     } catch (Exception e) { // run time exception
>         if (LOG.isWarnEnabled()) {
>           LOG.warn("General exception in EXE parser: "+e.getMessage());
>           e.printStackTrace(LogUtil.getWarnStream(LOG));
>         }
>         return new ParseStatus(ParseStatus.FAILED,
>               "Can't be handled as exe document. " +
> e).getEmptyParse(getConf());
>       }
> 
>     /// i'm not sure what to return here if i only need to d/l the file
> 
>     ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS, "",null,
> null, null);
>     parseData.setConf(this.conf);
>     return new ParseImpl("", parseData);
>   }
> 
>   public void setConf(Configuration conf) {
>     this.conf = conf;
>   }
> 
>   public Configuration getConf() {
>     return this.conf;
>   }
> }

______________________________________________
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

