Re: Indexing arbitrary fields

2024-03-08 Thread Lewis John McGibbney

Hi Joe,
Thanks for describing your work in detail. It provides a great utility which I 
think could be of immense value.
Please feel free to create a JIRA ticket which can be used as the basis for 
linking to the prior similar examples you referenced.
A WIP pull request would be ideal.
Thanks
lewismc

On 2024/03/08 01:06:18 Joe Gilvary wrote:
> Good day, all,
> 
> I wanted to index some values that I had to derive from fields in the 
> NutchDocument. I started on an indexing plugin. Then I realized I would 
> need more than one, or I could generalize the plugin. I went with the 
> generalizing and wrote a plugin that will use custom POJOs to process & 
> inject whatever the Nutch user wants, based on properties in 
> NUTCH_CONF_DIR/nutch-site.xml. I've tested it so far with
> 
> one POJO that uses jsoup to extract values from the page based on a CSS 
> selector specified in nutch-site.xml,
> 
> another POJO that takes a regex from nutch-site.xml and applies it to 
> the URL to determine how "deep" the URL directory structure goes for the 
> document,
> 
> and a third toy POJO to take multiple arguments from nutch-site.xml and 
> return their product. That last test was just to be sure the plug-in 
> would handle more than two arguments in the property value.
> 
> There's an optional boolean in the config to set whether to overwrite an 
> existing field, or (by default) add to it. Finally, I hacked a naming 
> convention and the way the plugin uses the setConf() call so the plugin 
> will accept configuration for multiple different POJOs to set multiple 
> fields in the NutchDocument. I didn't see any examples of a plugin 
> running more than once for each document quite that way, so I'm not sure 
> if this conforms to whatever canonical approach might exist.
> 
> I think of this plugin as a way to extend the reach of the plugin 
> architecture's flexibility out to POJO-land :) for anyone who 
> can't/won't for whatever reason write a plugin of their own. The POJOs 
> have to accept a String in a constructor, but they don't work on 
> NutchDocument or CrawlDatum or anything. I think if the plugin wants to 
> pass all that to a POJO for reflection, it's a clever way to waste time 
> when the work could be done in the plugin itself. For some subset of 
> indexing requirements, I think this could be useful to a wider set of 
> users. Still, I'm not a wider set of users, so I'm asking here.
> 
> NUTCH-585 has a lot of discussion about a concern similar to what this 
> jsoup example enables and Solr itself includes the 
> URLClassifierProcessor that addresses the same type of task that the 
> regex example shows, so is there any interest in this kind of 
> generalized plugin? Just from those examples, it could enable some 
> altered version of those capabilities. I've only built and tested with 
> the 1.19 branch and main branch code so far, and only with a Solr 9.2.1 
> cloud install, 'cause that's what I'm running, but if it seems 
> worthwhile to others, I'll beef up the documentation and write JUnit cases.
> 
>   Thanks, stay safe, stay healthy,
> 
>   Joe
> 
>

Indexing arbitrary fields

2024-03-07 Thread Joe Gilvary


Good day, all,

I wanted to index some values that I had to derive from fields in the 
NutchDocument. I started on an indexing plugin. Then I realized I would 
need more than one, or I could generalize the plugin. I chose to 
generalize and wrote a plugin that uses custom POJOs to process & 
inject whatever the Nutch user wants, based on properties in 
NUTCH_CONF_DIR/nutch-site.xml. I've tested it so far with


one POJO that uses jsoup to extract values from the page based on a CSS 
selector specified in nutch-site.xml,


another POJO that takes a regex from nutch-site.xml and applies it to 
the URL to determine how "deep" the URL directory structure goes for the 
document,


and a third toy POJO to take multiple arguments from nutch-site.xml and 
return their product. That last test was just to be sure the plugin 
would handle more than two arguments in the property value.
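
As a rough illustration of the regex-based POJO described above (a sketch only, not the code from the actual plugin): the class name UrlDepthPojo, the method name depth(), and the sample regex are all hypothetical. The one thing it takes from the description is that the POJO receives its configuration as a single String in the constructor.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical POJO: counts the directory segments in a URL path. */
final class UrlDepthPojo {
    private final Pattern segmentPattern;

    // The single String argument stands in for the regex read from nutch-site.xml.
    UrlDepthPojo(String regex) {
        this.segmentPattern = Pattern.compile(regex);
    }

    /** Returns how many times the configured regex matches in the URL's path. */
    int depth(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return 0;
        }
        Matcher m = segmentPattern.matcher(path);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }
}
```

With the assumed regex "/[^/]+(?=/)" (a path segment that is followed by another slash), new UrlDepthPojo("/[^/]+(?=/)").depth("https://example.org/a/b/c.html") counts the two directories a and b and ignores the file name.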


There's an optional boolean in the config to set whether to overwrite an 
existing field, or (by default) add to it. Finally, I hacked a naming 
convention and the way the plugin uses the setConf() call so the plugin 
will accept configuration for multiple different POJOs to set multiple 
fields in the NutchDocument. I didn't see any examples of a plugin 
running more than once for each document quite that way, so I'm not sure 
if this conforms to whatever canonical approach might exist.
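
The overwrite-or-add choice can be sketched roughly as follows. To keep the example self-contained it uses a plain Map as a stand-in for the document's fields instead of NutchDocument, and the names FieldWriter and setField are hypothetical, so this shows the idea only, not the plugin's implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class FieldWriter {
    /**
     * Writes a value to a field in a stand-in document (field name -> values).
     * With overwrite set, existing values are dropped first; otherwise the
     * new value is appended to whatever is already there (the default case).
     */
    static void setField(Map<String, List<String>> doc, String name,
                         String value, boolean overwrite) {
        if (overwrite) {
            doc.remove(name);
        }
        doc.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }
}
```

Calling setField twice on the same field with overwrite set to false leaves two values in the field; a later call with overwrite set to true leaves only the newest one.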


I think of this plugin as a way to extend the reach of the plugin 
architecture's flexibility out to POJO-land :) for anyone who 
can't/won't for whatever reason write a plugin of their own. The POJOs 
have to accept a String in a constructor, but they don't work on 
NutchDocument or CrawlDatum or anything. I think if the plugin wants to 
pass all that to a POJO for reflection, it's a clever way to waste time 
when the work could be done in the plugin itself. For some subset of 
indexing requirements, I think this could be useful to a wider set of 
users. Still, I'm not a wider set of users, so I'm asking here.
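
For anyone wondering what "accept a String in a constructor" might look like on the loading side, here is a minimal sketch using standard Java reflection. The names PojoLoader and instantiate are hypothetical, and how the real plugin reads the class name and argument out of nutch-site.xml is not shown here.

```java
import java.lang.reflect.Constructor;

final class PojoLoader {
    /**
     * Loads a class by name and calls its public single-String constructor.
     * Both strings are assumed to come from configuration.
     */
    static Object instantiate(String className, String arg) {
        try {
            Class<?> clazz = Class.forName(className);
            Constructor<?> ctor = clazz.getConstructor(String.class);
            return ctor.newInstance(arg);
        } catch (ReflectiveOperationException e) {
            // Covers a missing class, a missing constructor, or a failing constructor.
            throw new IllegalArgumentException(
                "Cannot create " + className + " from a String argument", e);
        }
    }
}
```

Any class with a public constructor taking one String works with this loader, which is why the POJO itself needs no dependency on Nutch classes.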


NUTCH-585 has a lot of discussion about a concern similar to what this 
jsoup example enables and Solr itself includes the 
URLClassifierProcessor that addresses the same type of task that the 
regex example shows, so is there any interest in this kind of 
generalized plugin? Just from those examples, it could enable some 
altered version of those capabilities. I've only built and tested with 
the 1.19 branch and main branch code so far, and only with a Solr 9.2.1 
cloud install, 'cause that's what I'm running, but if it seems 
worthwhile to others, I'll beef up the documentation and write JUnit cases.


 Thanks, stay safe, stay healthy,

 Joe
