After thinking about this some more, my overall goal is a status index that 
contains lots of information about the crawl, but not the parsed metadata or 
content. I'd like to write this to an OpenSearch index so that we can perform 
analysis on the crawl.

If that's the case, maybe the cleanest path forward would be to add a 
"statusOnly" flag to the IndexingJob and just index all the status bits. I'm 
still coming up to speed on what's available from the crawldb vs. the segments 
etc., so some of this may be off base, but something like:

target url
fetched url (in case of redirects?)
target host
fetched host (in case of redirects?)
fetched timestamp(?)
response code
fetch status
content-length
digest
truncated
truncated reason
content type
parse status (bonus points for the parse exception's stack trace or parse 
timeout/crash info)
inlinks (optional ?)
outlinks (would require successful parse)
user injected crawl metadata ... as we now allow for regular indexing via 
configuration
user selected parse metadata ... as we now allow for regular indexing via 
configuration
whether the file was successfully indexed or not (how do we get this info?)
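
To make the list above a bit more concrete, here is a rough sketch of what one
status document might look like before it is sent to OpenSearch. It is plain
Python rather than Nutch code, and every name in it (build_status_doc and all
of the field keys) is a placeholder I made up for illustration, not an existing
Nutch or OpenSearch identifier. It only covers a subset of the fields.

```python
from urllib.parse import urlparse


def build_status_doc(target_url, fetched_url, response_code, fetch_status,
                     fetched_timestamp=None, content_length=None, digest=None,
                     truncated=False, truncated_reason=None, content_type=None,
                     parse_status=None, outlinks=None, crawl_metadata=None):
    """Assemble one status-only document (hypothetical field names).

    Hosts are derived from the URLs so that redirects across hosts
    show up as a difference between target_host and fetched_host.
    """
    doc = {
        "target_url": target_url,
        "fetched_url": fetched_url,
        "target_host": urlparse(target_url).hostname,
        "fetched_host": urlparse(fetched_url).hostname,
        "fetched_timestamp": fetched_timestamp,
        "response_code": response_code,
        "fetch_status": fetch_status,
        "content_length": content_length,
        "digest": digest,
        "truncated": truncated,
        "truncated_reason": truncated_reason,
        "content_type": content_type,
        "parse_status": parse_status,
        "outlinks": outlinks,
        "crawl_metadata": crawl_metadata,
    }
    # Leave out fields we have no value for, so a failed fetch or parse
    # produces a smaller document instead of one full of nulls.
    return {key: value for key, value in doc.items() if value is not None}


example = build_status_doc(
    "http://example.com/a",
    "https://www.example.com/a",
    response_code=200,
    fetch_status="success",
    content_length=1234,
)
```

Dropping the empty fields is just one option; writing explicit nulls might make
"missing" queries easier on the analysis side.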

Does this seem like a reasonable path forward? 
Are there other status bits we should be pulling?

Would others find this generally useful, or should I just copy/paste the 
IndexingJob into a personal add-on repo and modify that? 
