[ https://issues.apache.org/jira/browse/TIKA-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728020#comment-17728020 ]
Tim Allison commented on TIKA-4059:
-----------------------------------

Are there any other formats that are typically gzipped?

> Consider parsing common gzipped formats like we do with package files
> ---------------------------------------------------------------------
>
>                 Key: TIKA-4059
>                 URL: https://issues.apache.org/jira/browse/TIKA-4059
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
>
> For docx and zip-based formats, we have a zip detector and we parse those
> container files as a single file. There are a handful of file formats that
> are often gzipped: tgz, svgz and warc files.
> Users currently get the content of these files as an attachment to the main
> gzipped file with /rmeta or the -J option in tika-app.
> This issue proposes adding a simple gzip container detector to treat these
> file formats as a single file.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
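
For illustration only, here is a rough sketch of what a "simple gzip container detector" for the formats named in the issue could look like. It is not Tika's detector API and not the implementation proposed in TIKA-4059. The class name GzipContainerSniffer, the Inner enum, and the 512-byte sniff window are all assumptions made for this example. It uses only the JDK (Java 11+ for readNBytes): check the gzip magic bytes, decompress the first block, and look for the tar, WARC, or SVG signature inside.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Illustrative sketch only; hypothetical names, not Tika's detector API. */
final class GzipContainerSniffer {

    /** Hypothetical labels for this sketch; real code would return media types. */
    enum Inner { TAR, SVG, WARC, UNKNOWN, NOT_GZIP }

    // Assumed window size: large enough to reach the tar magic at offset 257.
    private static final int SNIFF_LENGTH = 512;

    /** Classifies the payload of a gzip stream by peeking at its first bytes. */
    static Inner sniff(byte[] data) {
        // gzip streams start with the two magic bytes 0x1F 0x8B.
        if (data.length < 2 || (data[0] & 0xFF) != 0x1F || (data[1] & 0xFF) != 0x8B) {
            return Inner.NOT_GZIP;
        }
        byte[] head;
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            // Decompress only the start of the stream, not the whole file.
            head = in.readNBytes(SNIFF_LENGTH);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // POSIX tar headers carry "ustar" at offset 257 of the first block.
        if (matchesAt(head, 257, "ustar")) {
            return Inner.TAR;
        }
        // WARC records begin with a version line such as "WARC/1.0".
        if (matchesAt(head, 0, "WARC/")) {
            return Inner.WARC;
        }
        // Crude SVG check: look for an <svg tag within the sniff window.
        String text = new String(head, StandardCharsets.UTF_8);
        if (text.contains("<svg")) {
            return Inner.SVG;
        }
        return Inner.UNKNOWN;
    }

    /** Returns true if buf contains the ASCII bytes of magic at offset. */
    private static boolean matchesAt(byte[] buf, int offset, String magic) {
        byte[] m = magic.getBytes(StandardCharsets.US_ASCII);
        if (buf.length < offset + m.length) {
            return false;
        }
        for (int i = 0; i < m.length; i++) {
            if (buf[offset + i] != m[i]) {
                return false;
            }
        }
        return true;
    }

    /** Helper for trying the sketch: gzips a byte array in memory. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }
}
```

With this sketch, sniffing the gzipped bytes of a record that starts with "WARC/1.0" yields Inner.WARC, and input that lacks the gzip magic yields Inner.NOT_GZIP. A real detector would also need to handle pre-POSIX tar archives without the "ustar" marker and SVG files whose root element appears after a long prolog, which this window-based check can miss.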