One workaround to ignore parse exceptions (at least in the Tika parser):

Proposed fix for truncation checking:

On 2023/10/18 14:28:45 Tim Allison wrote:
> I'm trying to configure Nutch to index pages/files that are truncated (in
> addition to the successful non-truncated files).
> I'm using the okhttp protocol, because I don't think the http protocol
> stores truncation information.
> I'm using parse-tika, and "parser.skip.truncated" is left at its
> default of true.
> The particular PDF that I'm experimenting with is returned chunked with gz
> compression.  There is no length header in the response.
> For this PDF, okhttp correctly marks it as truncated, but then the file is
> sent to parse-tika, which throws a parse exception. The file is then not
> sent to the index.
> If I understand correctly, ParseSegment is checking for truncation, but it
> requires a Content-Length header to work. In my case, there is no
> Content-Length header, so it assumes the file is not truncated.
> Should I open a ticket to have ParseSegment also check for okhttp's header (
> http.content.truncated=true)?
> Is there a way to index files even if they are truncated or if there is a
> parse exception?
> If indexing is a bridge too far, what's the most efficient way to dump a
> list of urls that are truncated and/or had a parse exception?
> Thank you!
> Best,
>       Tim
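
To make the truncation question concrete, here is a minimal sketch of a check that honors an explicit truncation flag first and only then falls back to comparing against Content-Length. It is not the actual ParseSegment code or the proposed patch: the class name TruncationCheck, the method isTruncated, and the use of a plain Map in place of Nutch's metadata classes are all assumptions made for illustration. Only the key "http.content.truncated" and the Content-Length header come from the thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the truncation check discussed in the thread.
// A plain Map replaces Nutch's metadata object so the sketch is self-contained.
public class TruncationCheck {
    // Key mentioned in the thread as being set by the okhttp protocol plugin.
    static final String TRUNCATED_FLAG = "http.content.truncated";
    static final String CONTENT_LENGTH = "Content-Length";

    static boolean isTruncated(Map<String, String> metadata, long actualLength) {
        // 1. Trust an explicit flag from the protocol layer, if present.
        if ("true".equalsIgnoreCase(metadata.get(TRUNCATED_FLAG))) {
            return true;
        }
        // 2. Otherwise fall back to the declared length. Chunked responses
        //    carry no Content-Length, so this step alone cannot detect them.
        //    (Simplification: header lookup here is case-sensitive.)
        String declared = metadata.get(CONTENT_LENGTH);
        if (declared == null) {
            return false;
        }
        try {
            return actualLength < Long.parseLong(declared.trim());
        } catch (NumberFormatException e) {
            // Unparseable header: treat as "unknown", i.e. not truncated.
            return false;
        }
    }

    public static void main(String[] args) {
        Map<String, String> chunked = new HashMap<>();
        chunked.put(TRUNCATED_FLAG, "true");           // no Content-Length at all
        System.out.println(isTruncated(chunked, 65536)); // true

        Map<String, String> shortBody = new HashMap<>();
        shortBody.put(CONTENT_LENGTH, "1000");
        System.out.println(isTruncated(shortBody, 400)); // true

        System.out.println(isTruncated(new HashMap<>(), 400)); // false
    }
}
```

The first case in main mirrors the situation described above: a chunked response with no Content-Length is only recognized as truncated because the flag is consulted before the length comparison.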
