Good day, all,

I wanted to index some values that I had to derive from fields in the NutchDocument, so I started on an indexing plugin. Then I realized I would need either more than one plugin or a generalized one. I chose to generalize and wrote a plugin that uses custom POJOs to process and inject whatever the Nutch user wants, based on properties in NUTCH_CONF_DIR/nutch-site.xml. So far I've tested it with

one POJO that uses jsoup to extract values from the page based on a CSS selector specified in nutch-site.xml,

another POJO that takes a regex from nutch-site.xml and applies it to the URL to determine how "deep" the URL directory structure goes for the document,

and a third toy POJO that takes multiple arguments from nutch-site.xml and returns their product. That last test was just to be sure the plugin would handle more than two arguments in the property value.
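To make the URL-depth case concrete, here is a minimal sketch of what such a POJO might look like. It is not the code from the plugin: the class name UrlDepthPojo, the depth() method, and the choice to count regex matches in the URL path are all assumptions made for illustration. The only detail taken from the description above is that the POJO receives its regex as a String.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical POJO: counts how often a configured regex matches in the URL path. */
class UrlDepthPojo {
    private final Pattern separator;

    // The String argument stands in for the value read from nutch-site.xml (assumed).
    UrlDepthPojo(String regex) {
        this.separator = Pattern.compile(regex);
    }

    /** Returns the number of regex matches in the path portion of the URL. */
    int depth(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return 0;
        }
        Matcher m = separator.matcher(path);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        UrlDepthPojo pojo = new UrlDepthPojo("/");
        // Path is /docs/guide/intro.html, which contains three "/" matches.
        System.out.println(pojo.depth("https://example.org/docs/guide/intro.html")); // prints 3
    }
}
```

With "/" as the regex, a URL with no path yields 0, so the site root and deeper pages get distinct values.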

There's an optional boolean in the config to set whether to overwrite an existing field, or (by default) add to it. Finally, I hacked a naming convention and the way the plugin uses the setConf() call so the plugin will accept configuration for multiple different POJOs to set multiple fields in the NutchDocument. I didn't see any examples of a plugin running more than once for each document quite that way, so I'm not sure if this conforms to whatever canonical approach might exist.
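The overwrite-or-add choice could be sketched roughly as follows. This is only an illustration under stated assumptions: a plain Map of field name to value list stands in for NutchDocument, and FieldWriter and write() are hypothetical names, not part of Nutch or of the plugin.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; the Map is a stand-in for a NutchDocument. */
class FieldWriter {
    /** Adds value to the field, clearing existing values first when overwrite is true. */
    static void write(Map<String, List<Object>> doc, String field, Object value, boolean overwrite) {
        if (overwrite) {
            doc.remove(field);
        }
        doc.computeIfAbsent(field, k -> new ArrayList<>()).add(value);
    }

    public static void main(String[] args) {
        Map<String, List<Object>> doc = new HashMap<>();
        write(doc, "depth", 2, false);   // default behaviour: add
        write(doc, "depth", 3, false);
        System.out.println(doc.get("depth")); // prints [2, 3]
        write(doc, "depth", 5, true);    // overwrite replaces earlier values
        System.out.println(doc.get("depth")); // prints [5]
    }
}
```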

I think of this plugin as a way to extend the reach of the plugin architecture's flexibility out to POJO-land :) for anyone who can't or won't, for whatever reason, write a plugin of their own. The POJOs must have a constructor that accepts a String, but they don't operate on NutchDocument, CrawlDatum, or any other Nutch object. If the plugin had to pass all of that to a POJO through reflection, I think it would just be a clever way to waste time, since the work could be done in the plugin itself. For some subset of indexing requirements, though, I think this could be useful to a wider set of users. Still, I'm not a wider set of users, so I'm asking here.
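For readers unfamiliar with the String-constructor requirement, a minimal sketch of loading a class by name and calling its single-String constructor through reflection might look like this. PojoLoader and load() are hypothetical names, and java.lang.StringBuilder is used as the target only because it ships with the JDK and has a public (String) constructor; the plugin's real loading code may differ.

```java
import java.lang.reflect.Constructor;

/** Hypothetical loader: instantiates a class through its public (String) constructor. */
class PojoLoader {
    static Object load(String className, String arg) {
        try {
            Class<?> clazz = Class.forName(className);
            // Fails with NoSuchMethodException if the class lacks a public (String) constructor.
            Constructor<?> ctor = clazz.getConstructor(String.class);
            return ctor.newInstance(arg);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        // The second argument stands in for a property value from nutch-site.xml (assumed).
        Object pojo = load("java.lang.StringBuilder", "value from nutch-site.xml");
        System.out.println(pojo.getClass().getName() + " -> " + pojo);
    }
}
```

Because the POJO only ever sees a String, it needs no Nutch classes on its compile path, which is the point of keeping NutchDocument and CrawlDatum out of its interface.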

NUTCH-585 has a lot of discussion about a concern similar to what the jsoup example enables, and Solr itself includes the URLClassifyProcessor, which addresses the same type of task that the regex example shows. Just from those examples, this plugin could enable some altered version of those capabilities. So, is there any interest in this kind of generalized plugin? I've only built and tested with the 1.19 branch and main branch code so far, and only with a Solr 9.2.1 cloud install, because that's what I'm running. If it seems worthwhile to others, I'll beef up the documentation and write JUnit cases.

 Thanks, stay safe, stay healthy,

 Joe
