[
https://issues.apache.org/jira/browse/NUTCH-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-2548.
------------------------------------
Resolution: Fixed
Fix Version/s: 2.4
Thanks, [~rustyx]! Patch applied and PR merged (sorry, I applied the patch
first and missed the PR).
> Compressed content skipped. Content of size 78 was truncated to 74
> ------------------------------------------------------------------
>
> Key: NUTCH-2548
> URL: https://issues.apache.org/jira/browse/NUTCH-2548
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 2.4
> Reporter: Rustam
> Priority: Major
> Fix For: 2.4
>
> Attachments: nutch-content-truncated.patch
>
>
> Gzip- or deflate-compressed content fails to parse, with a message like:
> {{WARN parse.ParserJob - https://rustyx.org/temp/index%20bbb skipped.
> Content of size 78 was truncated to 74}}
> The root cause is that the original (compressed) Content-Length is stored in
> the headers, while the content is stored uncompressed. Consequently, the
> Content-Length doesn't match the stored content size.
> See the attached patch, which fixes the issue by removing Content-Length from
> the headers if it holds the compressed size.