Re: Indexing arbitrary fields

Lewis John McGibbney Fri, 08 Mar 2024 09:54:03 -0800

Hi Joe,
Thanks for describing your work in detail. It sounds like a great utility 
which I think could be of immense value.
Please feel free to create a JIRA ticket, which can then serve as the place 
to link the prior, similar examples you referenced.
A WIP pull request would be ideal.
Thanks
lewismc


On 2024/03/08 01:06:18 Joe Gilvary wrote:
> Good day, all,
> 
> I wanted to index some values that I had to derive from fields in the 
> NutchDocument. I started on an indexing plugin. Then I realized I would 
> need more than one, or I could generalize the plugin. I went with the 
> generalizing and wrote a plugin that will use custom POJOs to process & 
> inject whatever the Nutch user wants, based on properties in 
> NUTCH_CONF_DIR/nutch-site.xml. I've tested it so far with:
> 
> one POJO that uses jsoup to extract values from the page based on a CSS 
> selector specified in nutch-site.xml,
> 
> another POJO that takes a regex from nutch-site.xml and applies it to 
> the URL to determine how "deep" the URL directory structure goes for the 
> document,
> 
> and a third toy POJO to take multiple arguments from nutch-site.xml and 
> return their product. That last test was just to be sure the plug-in 
> would handle more than two arguments in the property value.
> 
> There's an optional boolean in the config to set whether to overwrite an 
> existing field, or (by default) add to it. Finally, I hacked a naming 
> convention and the way the plugin uses the setConf() call so the plugin 
> will accept configuration for multiple different POJOs to set multiple 
> fields in the NutchDocument. I didn't see any examples of a plugin 
> running more than once for each document quite that way, so I'm not sure 
> if this conforms to whatever canonical approach might exist.
> 
> I think of this plugin as a way to extend the reach of the plugin 
> architecture's flexibility out to POJO-land :) for anyone who 
> can't/won't for whatever reason write a plugin of their own. The POJOs 
> have to accept a String in a constructor, but they don't work on 
> NutchDocument or CrawlDatum or anything. I think if the plugin wants to 
> pass all that to a POJO for reflection, it's a clever way to waste time 
> when the work could be done in the plugin itself. For some subset of 
> indexing requirements, I think this could be useful to a wider set of 
> users. Still, I'm not a wider set of users, so I'm asking here.
> 
> NUTCH-585 has a lot of discussion about a concern similar to what this 
> jsoup example enables and Solr itself includes the 
> URLClassifierProcessor that addresses the same type of task that the 
> regex example shows, so is there any interest in this kind of 
> generalized plugin? Just from those examples, it could enable some 
> altered version of those capabilities. I've only built and tested with 
> the 1.19 branch and main branch code so far, and only with a Solr 9.2.1 
> cloud install, 'cause that's what I'm running, but if it seems 
> worthwhile to others, I'll beef up the documentation and write JUnit cases.
> 
>   Thanks, stay safe, stay healthy,
> 
>   Joe
> 
>
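
For readers of the archive, here is a minimal sketch of what a POJO like the
URL-depth example described above might look like. It is not Joe's code: the
class name UrlDepthPojo, the apply() method, and the choice to count regex
matches in the URL path are all assumptions made for illustration. The only
detail taken from the message is that the POJO receives a String (here, the
regex) in its constructor and does not touch NutchDocument or CrawlDatum.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical POJO: counts how many times a regex matches the path of a URL.
 * With a regex such as "/[^/]+" each match is one path segment, so the count
 * approximates how "deep" the URL sits in the directory structure.
 */
class UrlDepthPojo {
    private final Pattern segment;

    /** The single String argument is assumed to come from nutch-site.xml. */
    UrlDepthPojo(String regex) {
        this.segment = Pattern.compile(regex);
    }

    /** Returns the match count as a String, ready to be added as a field value. */
    String apply(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return "0"; // opaque URIs (e.g. mailto:) have no path
        }
        Matcher m = segment.matcher(path);
        int depth = 0;
        while (m.find()) {
            depth++;
        }
        return Integer.toString(depth);
    }
}
```

With the regex "/[^/]+", calling apply("https://example.com/docs/api/index.html")
returns "3", and apply("https://example.com/") returns "0". How the plugin
would pass the URL in and map the result to a field name is left open here,
since the message does not describe it.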
