Hi Joe,

Thanks for describing your work in detail. It provides a great utility which I think could be of immense value. Please feel free to create a JIRA ticket which can be used as the basis for linking to the prior similar examples you referenced. A WIP pull request would be ideal.

Thanks
lewismc

On 2024/03/08 01:06:18 Joe Gilvary wrote:
> Good day, all,
>
> I wanted to index some values that I had to derive from fields in the
> NutchDocument. I started on an indexing plugin. Then I realized I would
> need more than one, or I could generalize the plugin. I went with the
> generalizing and wrote a plugin that will use custom POJOs to process &
> inject whatever the Nutch user wants, based on properties in
> NUTCH_CONF_DIR/nutch-site.xml. I've tested it so far with
>
> one POJO that uses jsoup to extract values from the page based on a CSS
> selector specified in nutch-site.xml,
>
> another POJO that takes a regex from nutch-site.xml and applies it to
> the URL to determine how "deep" the URL directory structure goes for the
> document,
>
> and a third toy POJO to take multiple arguments from nutch-site.xml and
> return their product. That last test was just to be sure the plug-in
> would handle more than two arguments in the property value.
>
> There's an optional boolean in the config to set whether to overwrite an
> existing field, or (by default) add to it. Finally, I hacked a naming
> convention and the way the plugin uses the setConf() call so the plugin
> will accept configuration for multiple different POJOs to set multiple
> fields in the NutchDocument. I didn't see any examples of a plugin
> running more than once for each document quite that way, so I'm not sure
> if this conforms to whatever canonical approach might exist.
>
> I think of this plugin as a way to extend the reach of the plugin
> architecture's flexibility out to POJO-land :) for anyone who
> can't/won't for whatever reason write a plugin of their own. The POJOs
> have to accept a String in a constructor, but they don't work on
> NutchDocument or CrawlDatum or anything. I think if the plugin wants to
> pass all that to a POJO for reflection, it's a clever way to waste time
> when the work could be done in the plugin itself. For some subset of
> indexing requirements, I think this could be useful to a wider set of
> users. Still, I'm not a wider set of users, so I'm asking here.
>
> NUTCH-585 has a lot of discussion about a concern similar to what this
> jsoup example enables and Solr itself includes the
> URLClassifierProcessor that addresses the same type of task that the
> regex example shows, so is there any interest in this kind of
> generalized plugin? Just from those examples, it could enable some
> altered version of those capabilities. I've only built and tested with
> the 1.19 branch and main branch code so far, and only with a Solr 9.2.1
> cloud install, 'cause that's what I'm running, but if it seems
> worthwhile to others, I'll beef up the documentation and write JUnit cases.
>
> Thanks, stay safe, stay healthy,
>
> Joe
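
For readers following the thread, here is a minimal sketch of what the regex-based URL depth POJO described in the quoted message might look like, plus one way a plugin could instantiate it by class name. Everything named here is hypothetical: the classes UrlDepthPojo and PojoLoader, the process method, and the reading of the constructor String as the regex from nutch-site.xml are assumptions made for illustration, not details taken from Joe's plugin. It uses only the JDK and no Nutch classes.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical POJO: counts how often a configured regex matches the URL path. */
class UrlDepthPojo {
    private final Pattern segment;

    // Assumption: the single String argument is the regex from nutch-site.xml.
    public UrlDepthPojo(String regex) {
        this.segment = Pattern.compile(regex);
    }

    // Assumption: the plugin passes the document URL in and indexes the returned String.
    public String process(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return "0"; // opaque URIs such as mailto: have no path
        }
        Matcher m = segment.matcher(path);
        int depth = 0;
        while (m.find()) {
            depth++;
        }
        return Integer.toString(depth);
    }
}

/** Hypothetical helper: builds a POJO from a class name and one String argument. */
final class PojoLoader {
    private PojoLoader() {}

    static Object load(String className, String arg) {
        try {
            // Looks up the (String) constructor the quoted message says POJOs must have.
            return Class.forName(className)
                    .getDeclaredConstructor(String.class)
                    .newInstance(arg);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot create POJO " + className, e);
        }
    }
}
```

With the regex "/[^/]+", calling process on "https://example.org/a/b/c.html" counts three path segments, and a bare "https://example.org/" counts none. Whether the real plugin separates the configuration argument from the per-document input this way is not stated in the thread.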