Re: truncation, parsing and indexing?

Tim Allison Fri, 17 Nov 2023 09:03:43 -0800

I opened: https://issues.apache.org/jira/projects/NUTCH/issues/NUTCH-3026 for 
further discussion.


On 2023/11/03 13:31:26 Tim Allison wrote:
> After thinking about this some more... my overall goal is a status index that 
> contains lots of information about the crawl except the parsed metadata + 
> content.  I'd like to write this to an OpenSearch index so that we can 
> perform analysis on the crawl.
> 
> If that's the case, maybe the cleanest path forward would be to add a 
> "statusOnly" flag to the IndexingJob and just index all the status bits. I'm 
> still coming up to speed on what's available from the crawldb vs the segments 
> etc., so some of this may be off base, but something like:
> 
> target url
> fetched url (in case of redirects?)
> target host
> fetched host (in case of redirects?)
> fetched timestamp(?)
> response code
> fetch status
> content-length
> digest
> truncated
> truncated reason
> content type
> parse status (bonus points for parse exception's stacktrace or parse 
> timeout/crash info)
> inlinks (optional ?)
> outlinks (would require successful parse)
> user injected crawl metadata ... as we now allow for regular indexing via 
> configuration
> user selected parse metadata ... as we now allow for regular indexing via 
> configuration
> whether the file was successfully indexed or not (how do we get this info?)
> 
> Does this seem like a reasonable path forward? 
> Are there other status bits we should be pulling?
> 
> Would others find this generally useful, or should I just copy/paste the 
> IndexingJob into a personal addon repo and modify that? 
>
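
To make the field list above concrete, here is a rough sketch of what one status-only document might look like when serialized for the OpenSearch bulk API. Everything named here is an assumption for illustration: the `CrawlStatus` class, its field names, the `crawl-status` index name, and the choice of a SHA-256 hash of the target URL as the document id. None of it is taken from Nutch or NUTCH-3026, and it covers only a subset of the fields listed.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class CrawlStatus:
    """One status-only record per URL. All field names are hypothetical."""
    target_url: str
    fetched_url: Optional[str] = None       # differs from target_url after a redirect
    response_code: Optional[int] = None
    fetch_status: Optional[str] = None
    content_length: Optional[int] = None
    digest: Optional[str] = None
    truncated: bool = False
    truncated_reason: Optional[str] = None
    content_type: Optional[str] = None
    parse_status: Optional[str] = None
    outlinks: List[str] = field(default_factory=list)  # empty unless the parse succeeded


def to_bulk_lines(record: CrawlStatus, index_name: str = "crawl-status") -> str:
    """Return the two NDJSON lines of a bulk 'index' action for one record."""
    # Hash the URL so the id has a fixed length regardless of URL size.
    doc_id = hashlib.sha256(record.target_url.encode("utf-8")).hexdigest()
    action = {"index": {"_index": index_name, "_id": doc_id}}
    # Drop unset fields so "unknown" and "empty" stay distinguishable in analysis.
    doc = {k: v for k, v in asdict(record).items() if v is not None}
    return json.dumps(action) + "\n" + json.dumps(doc) + "\n"


if __name__ == "__main__":
    example = CrawlStatus(
        target_url="https://example.org/report.pdf",
        response_code=200,
        truncated=True,
        truncated_reason="length",
    )
    print(to_bulk_lines(example), end="")
```

Leaving out unset fields rather than writing nulls is a design choice, not a requirement: it keeps "no parse was attempted" separate from "parse produced nothing" when querying the index later.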
