Re: truncation, parsing and indexing?

Tim Allison Wed, 18 Oct 2023 09:16:58 -0700

One workaround for ignoring parse exceptions (at least in the Tika parser):
https://github.com/tballison/nutch/tree/ignore-parse-exception


Proposed fix for truncation checking:
https://github.com/tballison/nutch/tree/okhttp-truncated

On 2023/10/18 14:28:45 Tim Allison wrote:
> I'm trying to configure Nutch to index pages/files that are truncated (in
> addition to the successful non-truncated files).
> 
> I'm using the okhttp protocol, because I don't think the http protocol
> stores truncation information.
> 
> I'm using parse-tika, and "parser.skip.truncated" is left at its default
> of true.
> 
> The particular PDF that I'm experimenting with is returned chunked with gzip
> compression.  There is no Content-Length header in the response.
> 
> For this PDF, okhttp correctly marks it as truncated, but the file is then
> sent to parse-tika, which throws a parse exception. As a result, the file is
> not sent to the index.
> 
> If I understand correctly, ParseSegment is checking for truncation, but it
> requires a Content-Length header to work. In my case, there is no
> Content-Length header, so it assumes the file is not truncated.
> 
> Should I open a ticket to have ParseSegment also check for okhttp's header (
> http.content.truncated=true)?
> 
> Is there a way to index files even if they are truncated or if there is a
> parse exception?
> 
> If indexing is a bridge too far, what's the most efficient way to dump a
> list of urls that are truncated and/or had a parse exception?
> 
> Thank you!
> 
> Best,
> 
>       Tim
>
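
For illustration only, here is a minimal sketch of the kind of check asked
about in the quoted message: trust an explicit truncation flag first, and fall
back to a Content-Length comparison only when the flag is absent. This is not
the code in either branch linked above, and it is not Nutch's ParseSegment
implementation. The class name TruncationCheck, the method isTruncated, and the
use of a plain Map in place of Nutch's metadata class are all assumptions made
to keep the example self-contained. The flag name is the one quoted in the
message.

```java
import java.util.Map;

/** Hypothetical helper for illustration; not part of Nutch. */
final class TruncationCheck {

    // Flag name as quoted in the message; assumed to be set by the okhttp protocol plugin.
    static final String TRUNCATED_FLAG = "http.content.truncated";
    static final String CONTENT_LENGTH = "Content-Length";

    /**
     * Returns true if the content should be treated as truncated.
     * metadata: response headers plus protocol-level flags (assumed layout).
     * bytesFetched: number of bytes actually stored for the document.
     */
    static boolean isTruncated(Map<String, String> metadata, int bytesFetched) {
        // 1. An explicit flag from the protocol plugin wins.
        //    Boolean.parseBoolean(null) is false, so a missing flag falls through.
        if (Boolean.parseBoolean(metadata.get(TRUNCATED_FLAG))) {
            return true;
        }
        // 2. Otherwise compare against Content-Length, if the server sent one.
        String declared = metadata.get(CONTENT_LENGTH);
        if (declared == null) {
            return false; // chunked responses often have no length to compare against
        }
        try {
            return bytesFetched < Long.parseLong(declared.trim());
        } catch (NumberFormatException e) {
            return false; // unparseable header: treat truncation as unknown
        }
    }

    public static void main(String[] args) {
        // Chunked response with no Content-Length, but flagged by the protocol layer.
        System.out.println(isTruncated(Map.of(TRUNCATED_FLAG, "true"), 65536));
        // No flag and no Content-Length: the length-only check cannot detect truncation.
        System.out.println(isTruncated(Map.of(), 65536));
    }
}
```

The second call in main mirrors the case described in the message: without a
Content-Length header, a length-only check has nothing to compare against, so
only the explicit flag can report the truncation.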
