Tim Allison created TIKA-4059:
---------------------------------

             Summary: Consider parsing common gzipped formats like we do with package files
                 Key: TIKA-4059
                 URL: https://issues.apache.org/jira/browse/TIKA-4059
             Project: Tika
          Issue Type: Improvement
            Reporter: Tim Allison


For docx and zip-based formats, we have a zip detector and we parse those 
container files as a single file.  There are a handful of file formats that are 
often gzipped: tgz, svgz and warc files.

Users currently get the content of these files as an attachment embedded in the 
outer gzipped file when using the /rmeta endpoint in tika-server or the -J 
option in tika-app.

This issue proposes adding a simple gzip container detector to treat these file 
formats as a single file.
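
As a rough illustration of what such a detector might do, here is a minimal sketch in plain Java. It is not Tika's detector API: the class name GzipContainerSniffer, the Inner enum and the 512-byte prefix size are all hypothetical choices made for this example. It checks the gzip magic bytes, inflates only the start of the stream, and looks for markers of the three formats mentioned above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch, not part of Tika: peeks inside a gzip stream to guess
// which well-known format it wraps.
public class GzipContainerSniffer {

    public enum Inner { NOT_GZIP, TAR, SVG, WARC, UNKNOWN }

    // Assumption: one tar header block (512 bytes) is enough for all checks.
    private static final int PREFIX_LENGTH = 512;

    public static Inner sniff(byte[] data) throws IOException {
        // gzip streams start with the magic bytes 0x1f 0x8b.
        if (data.length < 2 || (data[0] & 0xff) != 0x1f || (data[1] & 0xff) != 0x8b) {
            return Inner.NOT_GZIP;
        }
        byte[] prefix;
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            // Inflate only the first bytes; the full payload is never read.
            prefix = in.readNBytes(PREFIX_LENGTH);
        }
        // ISO-8859-1 maps every byte to one char, so offsets are preserved.
        String text = new String(prefix, StandardCharsets.ISO_8859_1);

        // WARC records begin with a version line such as "WARC/1.0".
        if (text.startsWith("WARC/")) {
            return Inner.WARC;
        }
        // POSIX tar headers carry the magic "ustar" at offset 257.
        // Older tar variants without this magic are not recognized here.
        if (text.startsWith("ustar", 257)) {
            return Inner.TAR;
        }
        // Crude heuristic: an <svg element somewhere in the prefix.
        if (text.contains("<svg")) {
            return Inner.SVG;
        }
        return Inner.UNKNOWN;
    }

    // Helper for trying the sketch out: gzip a byte array in memory.
    public static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(raw);
        }
        return bos.toByteArray();
    }
}
```

A real detector would presumably map these results to media types and hand the stream to the matching parser, so that the file is treated as a single document rather than as a gzip file with an attachment.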



