[ 
https://issues.apache.org/jira/browse/TIKA-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728031#comment-17728031
 ] 

Tim Allison commented on TIKA-4059:
-----------------------------------

{{unzip}} doesn't like that file either: 
{noformat}
Archive:  sample.siard
   creating: content/
   creating: content/schema0/
   creating: content/schema0/table0/
  inflating: content/schema0/table0/table0.xml  
  error:  invalid compressed data to inflate
  inflating: content/schema0/table0/table0.xsd  
  error:  invalid compressed data to inflate
   creating: content/schema0/table0/lob2/
  inflating: content/schema0/table0/lob2/record0.txt  
  error:  invalid compressed data to inflate
  inflating: content/schema0/table0/lob2/record1.txt  
  error:  invalid compressed data to inflate
  inflating: content/schema0/table0/lob2/record3.txt  
  error:  invalid compressed data to inflate
   creating: content/schema0/table0/lob5/
  inflating: content/schema0/table0/lob5/record0.txt  
  error:  invalid compressed data to inflate
  inflating: content/schema0/table0/lob5/record2.txt  
  error:  invalid compressed data to inflate
  inflating: content/schema0/table0/lob5/record3.txt  
  error:  invalid compressed data to inflate
   creating: content/schema0/table0/lob6/
  inflating: content/schema0/table0/lob6/record0.xml  
  error:  invalid compressed data to inflate
  inflating: content/schema0/table0/lob6/record1.xml  
  error:  invalid compressed data to inflate
  inflating: content/schema0/table0/lob6/record2.xml  
  error:  invalid compressed data to inflate
   creating: content/schema0/table0/lob9/
  inflating: content/schema0/table0/lob9/record0.bin  
  error:  invalid compressed data to inflate
  inflating: content/schema0/table0/lob9/record1.bin  
  error:  invalid compressed data to inflate
  inflating: content/schema0/table0/lob9/record2.bin  
{noformat}

> Consider parsing common gzipped formats like we do with package files
> ---------------------------------------------------------------------
>
>                 Key: TIKA-4059
>                 URL: https://issues.apache.org/jira/browse/TIKA-4059
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
>
> For docx and zip-based formats, we have a zip detector and we parse those 
> container files as a single file.  There are a handful of file formats that 
> are often gzipped: tgz, svgz and warc files.
> Users currently get the content of these files as an attachment to the main 
> gzipped file with /rmeta or the -J option in tika-app.
> This issue proposes adding a simple gzip container detector to treat these 
> file formats as a single file.
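
A rough sketch of what such a gzip container check could look like, using only the JDK. This is not Tika's detector API: the class name {{GzipSniffer}}, the 512-byte peek size, and the returned labels are all hypothetical, and the heuristics (the "ustar" magic at offset 257 for tar, a leading "WARC/" for warc, a "<svg" tag near the start for svgz) are simplifying assumptions rather than a complete detector.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Tika: peeks inside a gzip stream and
// guesses which "single file" format it wraps.
class GzipSniffer {
    // Assumption: 512 decompressed bytes are enough (one tar header block).
    private static final int PEEK = 512;

    static String sniff(byte[] data) {
        // gzip streams start with the magic bytes 0x1f 0x8b
        if (data.length < 2 || (data[0] & 0xff) != 0x1f || (data[1] & 0xff) != 0x8b) {
            return "not-gzip";
        }
        byte[] buf = new byte[PEEK];
        int n = 0;
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            int r;
            // Read up to PEEK decompressed bytes; stop at end of stream.
            while (n < PEEK && (r = in.read(buf, n, PEEK - n)) > 0) {
                n += r;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // ISO-8859-1 maps bytes to chars one-to-one, so offsets are preserved.
        String head = new String(buf, 0, n, StandardCharsets.ISO_8859_1);
        if (n >= 262 && head.startsWith("ustar", 257)) {
            return "tgz";      // tar header magic at offset 257
        }
        if (head.startsWith("WARC/")) {
            return "warc.gz";  // WARC records begin with a version line
        }
        if (head.contains("<svg")) {
            return "svgz";     // crude check for an SVG root element
        }
        return "gzip";         // plain gzip; the payload stays an attachment
    }

    // Convenience for trying the sniffer: gzip a byte array in memory.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }
}
```

Under these assumptions, {{GzipSniffer.sniff(GzipSniffer.gzip("WARC/1.0\r\n".getBytes(StandardCharsets.US_ASCII)))}} returns "warc.gz", while a gzipped payload that matches none of the checks falls through to "gzip" and would keep today's behavior.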



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
