Indexing arbitrary fields

2024-03-07 Thread Joe Gilvary

Good day, all,

I wanted to index some values that I had to derive from fields in the 
NutchDocument. I started on an indexing plugin. Then I realized I would 
need more than one, or I could generalize the plugin. I chose to 
generalize and wrote a plugin that uses custom POJOs to process & 
inject whatever the Nutch user wants, based on properties in 
NUTCH_CONF_DIR/nutch-site.xml. I've tested it so far with:

- one POJO that uses jsoup to extract values from the page based on a CSS 
  selector specified in nutch-site.xml,

- another POJO that takes a regex from nutch-site.xml and applies it to 
  the URL to determine how "deep" the URL directory structure goes for the 
  document,

- and a third toy POJO that takes multiple arguments from nutch-site.xml and 
  returns their product. That last test was just to be sure the plugin 
  would handle more than two arguments in the property value.
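
For readers who want something concrete, the URL-depth idea could look 
roughly like the sketch below. It is only an illustration, not the 
plugin's code: the class name UrlDepthPojo, the depth() method, and the 
sample regex are all assumptions. The only detail taken from the message 
is that the POJO receives its configuration as a String in the constructor.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical POJO, named here for illustration only.
class UrlDepthPojo {
    private final Pattern segment;

    // The regex arrives as a plain String, as it would from a property value.
    UrlDepthPojo(String regex) {
        this.segment = Pattern.compile(regex);
    }

    // Counts non-overlapping matches of the regex in the URL's path.
    int depth(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return 0;
        }
        Matcher m = segment.matcher(path);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Assumed regex: a path segment that is followed by another slash,
        // i.e. a directory rather than the final file name.
        UrlDepthPojo pojo = new UrlDepthPojo("/[^/]+(?=/)");
        System.out.println(pojo.depth("https://example.com/docs/api/index.html"));
    }
}
```

With the assumed regex, only directory segments are counted, so the file 
name at the end of the path does not add to the depth.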


There's an optional boolean in the config to set whether to overwrite an 
existing field, or (by default) add to it. Finally, I hacked a naming 
convention and the way the plugin uses the setConf() call so the plugin 
will accept configuration for multiple different POJOs to set multiple 
fields in the NutchDocument. I didn't see any examples of a plugin 
running more than once for each document quite that way, so I'm not sure 
if this conforms to whatever canonical approach might exist.
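
The overwrite-versus-add behaviour can be pictured with a small 
stand-alone sketch. FieldStore below is a hypothetical stand-in for a 
multi-valued document, not Nutch's NutchDocument, and the method names 
are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a document with multi-valued fields.
class FieldStore {
    private final Map<String, List<String>> fields = new LinkedHashMap<>();

    // overwrite == false appends to any existing values (the default
    // described in the message); overwrite == true discards them first.
    void put(String name, String value, boolean overwrite) {
        List<String> values = fields.computeIfAbsent(name, k -> new ArrayList<>());
        if (overwrite) {
            values.clear();
        }
        values.add(value);
    }

    // Returns a read-only view of the field's values, empty if the field is unset.
    List<String> get(String name) {
        List<String> values = fields.get(name);
        if (values == null) {
            return Collections.emptyList();
        }
        return Collections.unmodifiableList(values);
    }

    public static void main(String[] args) {
        FieldStore doc = new FieldStore();
        doc.put("depth", "2", false);
        doc.put("depth", "3", false);  // appended to the existing value
        doc.put("depth", "4", true);   // replaces everything stored so far
        System.out.println(doc.get("depth"));
    }
}
```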


I think of this plugin as a way to extend the reach of the plugin 
architecture's flexibility out to POJO-land :) for anyone who 
can't/won't for whatever reason write a plugin of their own. The POJOs 
have to accept a String in a constructor, but they don't operate on 
NutchDocument, CrawlDatum, or any other Nutch types. If the plugin had to 
pass all of that to a POJO via reflection, I think it would be a clever 
way to waste time when the work could be done in the plugin itself. For some subset of 
indexing requirements, I think this could be useful to a wider set of 
users. Still, I'm not a wider set of users, so I'm asking here.
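
The String-constructor requirement is what lets a class be created from 
nothing more than configuration text. A minimal sketch of that mechanism, 
using only standard Java reflection, might look like this. PojoLoader and 
instantiate() are hypothetical names, and java.math.BigDecimal is used 
purely as a convenient JDK class that happens to have a public String 
constructor.

```java
import java.lang.reflect.Constructor;

// Hypothetical helper: creates an object from a class name and one String argument.
class PojoLoader {
    static Object instantiate(String className, String argument) {
        try {
            Class<?> type = Class.forName(className);
            // Requires a public constructor that takes exactly one String.
            Constructor<?> ctor = type.getConstructor(String.class);
            return ctor.newInstance(argument);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                "Cannot create " + className + " from a String argument", e);
        }
    }

    public static void main(String[] args) {
        // Both values would come from configuration in a real setup.
        Object pojo = instantiate("java.math.BigDecimal", "6.5");
        System.out.println(pojo.getClass().getName() + " -> " + pojo);
    }
}
```

A class without a public single-String constructor, or a class name that 
cannot be found, ends up in the catch branch and is reported as an 
IllegalArgumentException.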


NUTCH-585 has a lot of discussion about a concern similar to what this 
jsoup example enables, and Solr itself includes the 
URLClassifyProcessor, which addresses the same type of task that the 
regex example shows. Just from those examples, the plugin could enable 
some altered version of those capabilities. So, is there any interest in 
this kind of generalized plugin? I've only built and tested with 
the 1.19 branch and main branch code so far, and only with a Solr 9.2.1 
cloud install, 'cause that's what I'm running, but if it seems 
worthwhile to others, I'll beef up the documentation and write JUnit cases.


 Thanks, stay safe, stay healthy,

 Joe



[DISCUSS] Release Nutch 1.20

2024-03-07 Thread lewis john mcgibbney

Hi dev@,
As of today, 51 issues have been addressed in the 1.20 development drive.
https://issues.apache.org/jira/projects/NUTCH/versions/12352190
I would like to push a release soon and ship it to the user community.
Any objections?
Thank you
lewismc