[ 
https://issues.apache.org/jira/browse/TIKA-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4048.
-------------------------------
    Fix Version/s: 2.8.1
       Resolution: Fixed

This could actually be a pretty big change: Tika now parses all streams in a 
multistream compression format by default.  Thank you [~g...@rhobard.com]!
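
For readers unfamiliar with multistream compression, here is a minimal sketch 
of why reading only the first stream drops later content. It is not Tika's 
code. It assumes the WARC.GZ was written as one gzip member per record, and 
the record payloads and the helper names first_stream_only and all_streams 
are hypothetical, chosen only for illustration.

```python
import gzip
import zlib

# Hypothetical stand-ins for two WARC records, each compressed as its own
# gzip member and then concatenated into one file (assumption: this mirrors
# a per-record-compressed WARC.GZ).
records = [b"record-1: text/html\n", b"record-2: image/jpeg\n"]
multistream = b"".join(gzip.compress(r) for r in records)


def first_stream_only(data: bytes) -> bytes:
    """Decompress only the first gzip member and ignore whatever follows."""
    d = zlib.decompressobj(wbits=31)  # 31 = expect a gzip header and trailer
    return d.decompress(data) + d.flush()


def all_streams(data: bytes) -> bytes:
    """Decompress every gzip member in turn and concatenate the results."""
    out = []
    while data:
        d = zlib.decompressobj(wbits=31)
        out.append(d.decompress(data) + d.flush())
        data = d.unused_data  # bytes left over after this member ends
    return b"".join(out)


if __name__ == "__main__":
    print(first_stream_only(multistream))  # only the first record
    print(all_streams(multistream))        # both records
```

A reader that stops after the first member sees only the first record, which 
would match the symptom in this issue if the JPEG sits in a later member.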

> Gzipped WARC not identifying all assets
> ---------------------------------------
>
>                 Key: TIKA-4048
>                 URL: https://issues.apache.org/jira/browse/TIKA-4048
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Gregory Lepore
>            Priority: Minor
>             Fix For: 2.8.1
>
>         Attachments: Screenshot 2023-05-30 at 3.49.19 PM.png, Screenshot 
> 2023-05-30 at 3.50.41 PM.png, rec-20230518121844489398-5335604b8b23.warc, 
> rec-20230518121844489398-5335604b8b23.warc.gz, 
> rec-20230518121844489398-5335604b8b23.warc.gz.json, 
> rec-20230518121844489398-5335604b8b23.warc.json
>
>
> The WARC parser works for non-GZipped WARC files, but for GZipped WARC files 
> it appears that not all embedded files are being identified.
>  
> Processing a WARC.GZ file should return the same JSON output as the plain 
> WARC file, with the addition of the GZ file metadata. However, in the 
> attached JSON outputs, the JPEG present in the plain WARC file is not 
> represented in the WARC.GZ.json file.
>  
> Additionally, the warc: metadata is not being returned for all files, 
> although this may be by design. 
>  
> Attached are two JSON files, one for the GZipped WARC file and one for the 
> plain WARC file, along with the two original files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
