Tim Allison created TIKA-4059:
---------------------------------

             Summary: Consider parsing common gzipped formats like we do with package files
                 Key: TIKA-4059
                 URL: https://issues.apache.org/jira/browse/TIKA-4059
             Project: Tika
          Issue Type: Improvement
            Reporter: Tim Allison

For docx and other zip-based formats, we have a zip detector, and we parse those container files as a single file.

A handful of file formats are often gzipped: tgz, svgz and warc files. Users currently get the content of these files as an attachment to the main gzipped file with /rmeta or the -J option in tika-app.

This issue proposes adding a simple gzip container detector so that these file formats are treated as a single file.
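
To make the proposal concrete, here is a minimal sketch of what such a detector might do, using only the JDK. It is not Tika code: the class name GzipSniffer, the 512-byte peek size, and the returned labels ("tgz", "svgz", "warc.gz", "gzip") are all assumptions for illustration, and the SVG check in particular is deliberately crude. The idea is to confirm the gzip magic bytes, inflate only the first block of the stream, and look for a known inner signature.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch: peek inside a gzip stream to classify the inner format. */
final class GzipSniffer {
    // Assumed peek size; the tar "ustar" magic sits at offset 257, so 512 is enough.
    static final int HEAD = 512;

    /** Returns an illustrative label, or null if the data is not gzip at all. */
    static String detect(byte[] data) throws IOException {
        // gzip streams start with the two magic bytes 0x1f 0x8b.
        if (data.length < 2 || (data[0] & 0xff) != 0x1f || (data[1] & 0xff) != 0x8b) {
            return null;
        }
        // Inflate at most HEAD bytes; the rest of the stream is left untouched.
        byte[] head = new byte[HEAD];
        int n = 0;
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            int r;
            while (n < HEAD && (r = in.read(head, n, HEAD - n)) > 0) {
                n += r;
            }
        }
        if (hasAt(head, n, 0, "WARC/")) {
            return "warc.gz";   // WARC records begin with a "WARC/<version>" line
        }
        if (hasAt(head, n, 257, "ustar")) {
            return "tgz";       // POSIX tar header magic
        }
        // Crude assumption: an "<svg" tag within the first block means SVG.
        if (new String(head, 0, n, StandardCharsets.UTF_8).contains("<svg")) {
            return "svgz";
        }
        return "gzip";          // plain gzip with an unrecognised payload
    }

    /** True if the ASCII marker occurs at the given offset within the first len bytes. */
    static boolean hasAt(byte[] buf, int len, int off, String ascii) {
        byte[] m = ascii.getBytes(StandardCharsets.US_ASCII);
        if (off + m.length > len) {
            return false;
        }
        for (int i = 0; i < m.length; i++) {
            if (buf[off + i] != m[i]) {
                return false;
            }
        }
        return true;
    }

    /** Helper for building sample input: gzip-compresses a byte array. */
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(raw);
        }
        return bos.toByteArray();
    }
}
```

A real detector would work on a mark/reset-capable stream rather than a byte array and would return media types rather than these placeholder labels. The sketch only shows that the inner format can be recognised from a small decompressed prefix.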