After thinking about this some more, my overall goal is a status index that contains most of the information about the crawl except the parsed metadata and content. I'd like to write this to an OpenSearch index so that we can perform analysis on the crawl.
If that's the case, maybe the cleanest path forward would be to add a "statusOnly" flag to the IndexingJob and just index all the status bits. I'm still coming up to speed on what's available from the crawldb vs. the segments etc., so some of this may be off base, but something like:

* target url
* fetched url (in case of redirects?)
* target host
* fetched host (in case of redirects?)
* fetched timestamp(?)
* response code
* fetch status
* content-length
* digest
* truncated
* truncated reason
* content type
* parse status (bonus points for the parse exception's stacktrace or parse timeout/crash info)
* inlinks (optional?)
* outlinks (would require successful parse)
* user injected crawl metadata ... as we now allow for regular indexing via configuration
* user selected parse metadata ... as we now allow for regular indexing via configuration
* whether the file was successfully indexed or not (how do we get this info?)

Does this seem like a reasonable path forward? Are there other status bits we should be pulling? Would others find this generally useful, or should I just copy/paste the IndexingJob into a personal addon repo and modify that?
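To make the proposal a bit more concrete, here is a rough sketch of the flat document a status-only run might emit per URL. Everything in it is an assumption for illustration: `StatusDocSketch`, the `build` signature, and all field names are hypothetical and do not exist in Nutch, and it does not show where the values would come from (crawldb vs. segments). It covers only the scalar status bits; inlinks, outlinks, the configurable metadata, and the indexing outcome are left out.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: StatusDocSketch and every field name below are
// made up for illustration and are not part of Nutch.
final class StatusDocSketch {

    // Returns the host part of an absolute URL, or null if it has none.
    static String hostOf(String url) {
        return URI.create(url).getHost();
    }

    // Collects the status bits into a flat, insertion-ordered map that an
    // index writer could serialize as one document per URL.
    static Map<String, Object> build(String targetUrl, String fetchedUrl,
            long fetchedAtMillis, int responseCode, String fetchStatus,
            long contentLength, String digest, String truncatedReason,
            String contentType, String parseStatus) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("target_url", targetUrl);
        doc.put("fetched_url", fetchedUrl);
        doc.put("target_host", hostOf(targetUrl));
        doc.put("fetched_host", hostOf(fetchedUrl));
        doc.put("fetched_timestamp", fetchedAtMillis);
        doc.put("response_code", responseCode);
        doc.put("fetch_status", fetchStatus);
        doc.put("content_length", contentLength);
        doc.put("digest", digest);
        // A null reason means the content was not truncated.
        doc.put("truncated", truncatedReason != null);
        if (truncatedReason != null) {
            doc.put("truncated_reason", truncatedReason);
        }
        doc.put("content_type", contentType);
        doc.put("parse_status", parseStatus);
        // Deliberately no parsed metadata or content fields.
        return doc;
    }
}
```

Keeping the target and fetched url/host as separate fields is what would let us analyze redirects in the status index. The status strings passed in are placeholders; the real values would be whatever the fetch and parse steps actually record.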