[ https://issues.apache.org/jira/browse/TIKA-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728031#comment-17728031 ]
Tim Allison commented on TIKA-4059:
-----------------------------------

{{unzip}} doesn't like that file either:

{noformat}
Archive:  sample.siard
   creating: content/
   creating: content/schema0/
   creating: content/schema0/table0/
  inflating: content/schema0/table0/table0.xml
  error:  invalid compressed data to inflate
  inflating: content/schema0/table0/table0.xsd
  error:  invalid compressed data to inflate
   creating: content/schema0/table0/lob2/
  inflating: content/schema0/table0/lob2/record0.txt
  error:  invalid compressed data to inflate
  inflating: content/schema0/table0/lob2/record1.txt
  error:  invalid compressed data to inflate
  inflating: content/schema0/table0/lob2/record3.txt
  error:  invalid compressed data to inflate
   creating: content/schema0/table0/lob5/
  inflating: content/schema0/table0/lob5/record0.txt
  error:  invalid compressed data to inflate
  inflating: content/schema0/table0/lob5/record2.txt
  error:  invalid compressed data to inflate
  inflating: content/schema0/table0/lob5/record3.txt
  error:  invalid compressed data to inflate
   creating: content/schema0/table0/lob6/
  inflating: content/schema0/table0/lob6/record0.xml
  error:  invalid compressed data to inflate
  inflating: content/schema0/table0/lob6/record1.xml
  error:  invalid compressed data to inflate
  inflating: content/schema0/table0/lob6/record2.xml
  error:  invalid compressed data to inflate
   creating: content/schema0/table0/lob9/
  inflating: content/schema0/table0/lob9/record0.bin
  error:  invalid compressed data to inflate
  inflating: content/schema0/table0/lob9/record1.bin
  error:  invalid compressed data to inflate
  inflating: content/schema0/table0/lob9/record2.bin
{noformat}

> Consider parsing common gzipped formats like we do with package files
> ---------------------------------------------------------------------
>
>                 Key: TIKA-4059
>                 URL: https://issues.apache.org/jira/browse/TIKA-4059
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
>
> For docx and zip-based formats, we have a zip detector and we parse those
> container files as a single file. There are a handful of file formats that
> are often gzipped: tgz, svgz and warc files.
> Users currently get the content of these files as an attachment to the main
> gzipped file with /rmeta or the -J option in tika-app.
> This issue proposes adding a simple gzip container detector to treat these
> file formats as a single file.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
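
For illustration only, the gzip container detection proposed in the issue description could be sketched roughly as follows. This is not Tika code: the class {{GzipContainerSniffer}}, its method names, the returned labels and the 1024-byte peek size are all hypothetical, and only the JDK's {{java.util.zip}} classes are used. The idea is to check the gzip magic bytes, decompress a small prefix, and look for a marker of the inner format (the {{ustar}} magic at offset 257 for tar, a leading {{WARC/}} version line for warc, an {{<svg}} tag for svg).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper (not part of Tika): guesses what a gzip stream wraps. */
final class GzipContainerSniffer {

    // Number of decompressed bytes to inspect (an assumption);
    // it only needs to reach past the tar magic at offset 257.
    private static final int PEEK = 1024;

    /** Returns "tgz", "warc.gz", "svgz", "gz" (unrecognized payload) or "not-gzip". */
    static String sniff(byte[] data) {
        // A gzip member starts with the two bytes 0x1f 0x8b.
        if (data.length < 2 || (data[0] & 0xff) != 0x1f || (data[1] & 0xff) != 0x8b) {
            return "not-gzip";
        }
        byte[] buf = new byte[PEEK];
        int n = 0;
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            int r;
            // Fill the buffer until PEEK bytes are read or the stream ends.
            while (n < PEEK && (r = in.read(buf, n, PEEK - n)) != -1) {
                n += r;
            }
        } catch (IOException e) {
            // Corrupt or truncated gzip data ends up here.
            throw new UncheckedIOException(e);
        }
        // tar: the "ustar" magic sits at offset 257 of the first header block.
        if (n >= 262 && "ustar".equals(new String(buf, 257, 5, StandardCharsets.US_ASCII))) {
            return "tgz";
        }
        String text = new String(buf, 0, n, StandardCharsets.ISO_8859_1);
        // warc: each record begins with a version line such as "WARC/1.0".
        if (text.startsWith("WARC/")) {
            return "warc.gz";
        }
        // svg: crude check for the root element within the peeked prefix.
        if (text.contains("<svg")) {
            return "svgz";
        }
        return "gz";
    }

    /** Test helper: gzip-compresses a byte array in memory. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }
}
```

A caller would pass the raw bytes of the file to {{sniff}} and treat a recognized label as the type of the whole file rather than reporting the payload as an attachment. A real detector would presumably work on a marked stream instead of a byte array and defer to the existing detectors for the inner content.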