Hi Joe,
Thanks for describing your work in detail. The plugin sounds like a great
utility, and I think it could be of immense value.
Please feel free to create a JIRA ticket; it can serve as the place to link
the similar prior work you referenced.
A WIP pull request would be ideal.
Thanks
lewismc
On 2024/03/08 01:06:18 Joe Gilvary wrote:
> Good day, all,
>
> I wanted to index some values that I had to derive from fields in the
> NutchDocument. I started on an indexing plugin. Then I realized I would
> need more than one, or I could generalize the plugin. I went with the
> generalizing and wrote a plugin that will use custom POJOs to process &
> inject whatever the Nutch user wants, based on properties in
> NUTCH_CONF_DIR/nutch-site.xml. So far I've tested it with:
>
> one POJO that uses jsoup to extract values from the page based on a CSS
> selector specified in nutch-site.xml,
>
> another POJO that takes a regex from nutch-site.xml and applies it to
> the URL to determine how "deep" the URL directory structure goes for the
> document,
>
> and a third toy POJO to take multiple arguments from nutch-site.xml and
> return their product. That last test was just to be sure the plugin
> would handle more than two arguments in the property value.
>
> There's an optional boolean in the config to set whether to overwrite an
> existing field, or (by default) add to it. Finally, I hacked a naming
> convention and the way the plugin uses the setConf() call so the plugin
> will accept configuration for multiple different POJOs to set multiple
> fields in the NutchDocument. I didn't see any examples of a plugin
> running more than once for each document quite that way, so I'm not sure
> if this conforms to whatever canonical approach might exist.
>
> I think of this plugin as a way to extend the reach of the plugin
> architecture's flexibility out to POJO-land :) for anyone who
> can't/won't for whatever reason write a plugin of their own. The POJOs
> have to accept a String in a constructor, but they don't work on
> NutchDocument or CrawlDatum or anything. I think if the plugin wants to
> pass all that to a POJO for reflection, it's a clever way to waste time
> when the work could be done in the plugin itself. For some subset of
> indexing requirements, I think this could be useful to a wider set of
> users. Still, I'm not a wider set of users, so I'm asking here.
>
> NUTCH-585 has a lot of discussion about a concern similar to what this
> jsoup example enables and Solr itself includes the
> URLClassifierProcessor that addresses the same type of task that the
> regex example shows, so is there any interest in this kind of
> generalized plugin? Just from those examples, it could enable some
> altered version of those capabilities. I've only built and tested with
> the 1.19 branch and main branch code so far, and only with a Solr 9.2.1
> cloud install, 'cause that's what I'm running, but if it seems
> worthwhile to others, I'll beef up the documentation and write JUnit cases.
>
> Thanks, stay safe, stay healthy,
>
> Joe
>
>
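For anyone following the thread, here is a minimal sketch of how the
regex-based URL-depth POJO described in the quoted message might look. It is
an illustration under assumptions, not the plugin's actual code: the names
PojoPluginSketch, FieldValueProducer, UrlDepthProducer and load are all
hypothetical, and "depth" is assumed to mean the number of times the
configured regex matches in the URL path. The only detail taken from the
message is that the POJO accepts a String in its constructor and does not
touch NutchDocument or CrawlDatum.

```java
import java.lang.reflect.Constructor;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; every name here is hypothetical. */
class PojoPluginSketch {

    /** Assumed contract: a POJO turns one input string into one field value. */
    interface FieldValueProducer {
        String produce(String input);
    }

    /** Counts how often a configured regex matches in the path of a URL. */
    static class UrlDepthProducer implements FieldValueProducer {
        private final Pattern segment;

        // Single String constructor, as the message says the POJOs require.
        public UrlDepthProducer(String regex) {
            this.segment = Pattern.compile(regex);
        }

        @Override
        public String produce(String url) {
            // Drop "scheme://host" so that only the path is measured.
            String path = url.replaceFirst("^[a-zA-Z][a-zA-Z0-9+.-]*://[^/]*", "");
            Matcher m = segment.matcher(path);
            int depth = 0;
            while (m.find()) {
                depth++;
            }
            return Integer.toString(depth);
        }
    }

    /**
     * Instantiates a POJO by class name through its String constructor.
     * In the plugin, className and arg would presumably come from
     * nutch-site.xml; here they are plain parameters.
     */
    static FieldValueProducer load(String className, String arg) {
        try {
            Class<?> cls = Class.forName(className);
            Constructor<?> ctor = cls.getConstructor(String.class);
            return (FieldValueProducer) ctor.newInstance(arg);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot create POJO " + className, e);
        }
    }
}
```

With the regex "/[^/]+", calling load(...) on the UrlDepthProducer class name
and then produce(...) on a URL whose path is /a/b/c.html counts three path
segments. The plugin side stays generic because it only needs the class name
and the argument string; whether the real plugin splits multiple arguments
before or after instantiation is not specified in the message.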