I opened https://issues.apache.org/jira/projects/NUTCH/issues/NUTCH-3026 for further discussion.

On 2023/11/03 13:31:26 Tim Allison wrote:
> After thinking about this some more... my overall goal is a status index that
> contains lots of information about the crawl except the parsed metadata +
> content. I'd like to write this to an OpenSearch index so that we can
> perform analysis on the crawl.
>
> If that's the case, maybe the cleanest path forward would be to add a
> "statusOnly" flag to the IndexingJob and just index all the status bits. I'm
> still coming up to speed on what's available from the crawldb vs the segments
> etc so some of this may be off base, but something like:
>
> target url
> fetched url (in case of redirects?)
> target host
> fetched host (in case of redirects?)
> fetched timestamp(?)
> response code
> fetch status
> content-length
> digest
> truncated
> truncated reason
> content type
> parse status (bonus points for parse exception's stacktrace or parse
>   timeout/crash info)
> inlinks (optional ?)
> outlinks (would require successful parse)
> user injected crawl metadata ... as we now allow for regular indexing via
>   configuration
> user selected parse metadata ... as we now allow for regular indexing via
>   configuration
> whether the file was successfully indexed or not (how do we get this info?)
>
> Does this seem like a reasonable path forward?
> Are there other status bits we should be pulling?
>
> Would others find this generally useful, or should I just copy/paste the
> IndexingJob into a personal addon repo and modify that?
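
To make the field list above a bit more concrete, here is a rough sketch of what a single status-only document might look like before it is sent to the index. This is only an illustration under assumptions: the function name `build_status_doc`, every field name, and the sample values are hypothetical and are not taken from Nutch or from the proposed "statusOnly" flag. It uses plain Python with no Nutch or OpenSearch calls.

```python
import json
from urllib.parse import urlsplit


def build_status_doc(target_url, fetched_url=None, fetch_time=None,
                     response_code=None, fetch_status=None,
                     content_length=None, digest=None, truncated=False,
                     truncated_reason=None, content_type=None,
                     parse_status=None, crawl_metadata=None):
    """Assemble one status-only document (all field names are hypothetical)."""
    # Assumption: without a redirect, the fetched URL equals the target URL.
    fetched_url = fetched_url or target_url
    doc = {
        "target_url": target_url,
        "fetched_url": fetched_url,
        "target_host": urlsplit(target_url).hostname,
        "fetched_host": urlsplit(fetched_url).hostname,
        "fetch_time": fetch_time,
        "response_code": response_code,
        "fetch_status": fetch_status,
        "content_length": content_length,
        "digest": digest,
        "truncated": truncated,
        "truncated_reason": truncated_reason,
        "content_type": content_type,
        "parse_status": parse_status,
        "crawl_metadata": crawl_metadata,
    }
    # Drop fields with no value so the document carries only known status bits.
    return {key: value for key, value in doc.items() if value is not None}


# Placeholder values for a redirected fetch; none of these come from a real crawl.
doc = build_status_doc(
    "http://example.com/a",
    fetched_url="https://www.example.com/a",
    response_code=200,
    fetch_status="fetch_success",
    content_type="text/html",
)
print(json.dumps(doc, indent=2))
```

Parsed content and full parse metadata are deliberately left out, which matches the stated goal of a status index. Inlinks, outlinks, and the "was it indexed" flag are omitted from the sketch because where they would come from is still an open question in the message above.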