Indexing arbitrary fields

Joe Gilvary Thu, 07 Mar 2024 17:08:56 -0800

Good day, all,

I wanted to index some values that I had to derive from fields in the NutchDocument. I started on an indexing plugin. Then I realized I would need more than one, or I could generalize the plugin. I went with the generalizing and wrote a plugin that will use custom POJOs to process & inject whatever the Nutch user wants, based on properties in NUTCH_CONF_DIR/nutch-site.xml. I've tested it so far with

one POJO that uses jsoup to extract values from the page based on a CSS selector specified in nutch-site.xml,

another POJO that takes a regex from nutch-site.xml and applies it to the URL to determine how "deep" the URL directory structure goes for the document,

and a third toy POJO to take multiple arguments from nutch-site.xml and return their product. That last test was just to be sure the plug-in would handle more than two arguments in the property value.
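
To make the second example concrete, here is a minimal sketch of what a URL-depth POJO along those lines might look like. The class name UrlDepthPojo, the depth() method, and the sample regex are all hypothetical; the only detail taken from the description is that the POJO receives its configuration as a single String, which here is the regex.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical POJO: the class and method names are illustrative assumptions,
// not the plugin's actual contract.
class UrlDepthPojo {
    private final Pattern segment;

    // The String argument stands for the regex read from nutch-site.xml.
    UrlDepthPojo(String regex) {
        this.segment = Pattern.compile(regex);
    }

    // Counts how many times the regex matches in the path part of the URL.
    // URI.create throws IllegalArgumentException for a malformed URL.
    int depth(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return 0; // opaque URIs such as mailto: have no path
        }
        Matcher m = segment.matcher(path);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }
}
```

With the regex `/[^/]+(?=/)`, which matches only path segments that are followed by another slash, the URL https://example.com/docs/guides/intro.html yields a depth of 2 (docs and guides), and https://example.com/ yields 0.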

There's an optional boolean in the config to set whether to overwrite an existing field, or (by default) add to it. Finally, I hacked a naming convention and the way the plugin uses the setConf() call so the plugin will accept configuration for multiple different POJOs to set multiple fields in the NutchDocument. I didn't see any examples of a plugin running more than once for each document quite that way, so I'm not sure if this conforms to whatever canonical approach might exist.
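
The overwrite-versus-add behavior can be sketched roughly as follows. This is only an illustration under stated assumptions: FieldWriter and its map of multi-valued fields are a stand-in for the document, not the NutchDocument API, and the method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a document with multi-valued fields.
class FieldWriter {
    private final Map<String, List<Object>> fields = new LinkedHashMap<>();

    // overwrite == false (the default described above) appends to the field;
    // overwrite == true discards any existing values first.
    void put(String name, Object value, boolean overwrite) {
        if (overwrite) {
            fields.remove(name);
        }
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Returns the current values of a field, or an empty list if it is unset.
    List<Object> get(String name) {
        return fields.getOrDefault(name, List.of());
    }
}
```

Writing the same field twice without overwrite leaves both values in place; a later write with overwrite set leaves only the new value.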

I think of this plugin as a way to extend the reach of the plugin architecture's flexibility out to POJO-land :) for anyone who can't/won't for whatever reason write a plugin of their own. The POJOs have to accept a String in a constructor, but they don't work on NutchDocument or CrawlDatum or anything. I think if the plugin wants to pass all that to a POJO for reflection, it's a clever way to waste time when the work could be done in the plugin itself. For some subset of indexing requirements, I think this could be useful to a wider set of users. Still, I'm not a wider set of users, so I'm asking here.
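
As a rough illustration of the String-constructor requirement, a plugin could instantiate a configured class by reflection along these lines. PojoLoader and instantiate() are hypothetical names, and this sketch assumes the POJO class and its String constructor are public; it is not taken from the plugin's code.

```java
import java.lang.reflect.Constructor;

// Hypothetical helper: loads a class by name and calls its (String) constructor.
class PojoLoader {
    static Object instantiate(String className, String arg) {
        try {
            Class<?> cls = Class.forName(className);
            // getConstructor only finds public constructors.
            Constructor<?> ctor = cls.getConstructor(String.class);
            return ctor.newInstance(arg);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                "Cannot create " + className + " from a String argument", e);
        }
    }
}
```

Any class with a public String constructor works with this helper; for example, PojoLoader.instantiate("java.math.BigInteger", "42") returns a BigInteger equal to 42, while a class name that cannot be found results in an IllegalArgumentException.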

NUTCH-585 has a lot of discussion about a concern similar to what this jsoup example enables, and Solr itself includes the URLClassifyProcessor that addresses the same type of task that the regex example shows, so is there any interest in this kind of generalized plugin? Just from those examples, it could enable some altered version of those capabilities. I've only built and tested with the 1.19 branch and main branch code so far, and only with a Solr 9.2.1 cloud install, 'cause that's what I'm running, but if it seems worthwhile to others, I'll beef up the documentation and write JUnit cases.


 Thanks, stay safe, stay healthy,

 Joe
