[ https://issues.apache.org/jira/browse/TIKA-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison resolved TIKA-4048.
-------------------------------
    Fix Version/s: 2.8.1
       Resolution: Fixed

This could actually be a pretty big change to now parse all streams in a multistream compression format by default.

Thank you [~g...@rhobard.com]!

> Gzipped WARC not identifying all assets
> ---------------------------------------
>
>                 Key: TIKA-4048
>                 URL: https://issues.apache.org/jira/browse/TIKA-4048
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Gregory Lepore
>            Priority: Minor
>             Fix For: 2.8.1
>
>         Attachments: Screenshot 2023-05-30 at 3.49.19 PM.png, Screenshot
> 2023-05-30 at 3.50.41 PM.png, rec-20230518121844489398-5335604b8b23.warc,
> rec-20230518121844489398-5335604b8b23.warc.gz,
> rec-20230518121844489398-5335604b8b23.warc.gz.json,
> rec-20230518121844489398-5335604b8b23.warc.json
>
>
> The WARC parser works for non GZipped WARC files, but for GZipped WARC files
> it appears not all embedded files are being identified.
>
> Processing a WARC.GZ file should return identical JSON output as the plain
> WARC file, with the addition of the GZ file metadata. However, in the
> attached JSON outputs, the JPEG present in the plain WARC file is not
> represented in the WARC.GZ.json file.
>
> Additionally, the warc: metadata is not being returned for all files,
> although this may be by design.
>
> Attached are two JSON files, one for the GZipped WARC file and one for the
> plain WARC file. And the two original files.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
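
For readers unfamiliar with multistream gzip, here is a minimal Python sketch of why a reader that stops after the first stream would miss later records. It is not Tika's code, and it assumes (the issue does not say so explicitly) that the attached .warc.gz was written as a series of concatenated gzip members, one per WARC record. The function names and the sample record contents are hypothetical.

```python
import gzip
import zlib


def make_multimember(records):
    """Build a multistream gzip payload: one gzip member per record."""
    return b"".join(gzip.compress(record) for record in records)


def read_first_member(data):
    """Decompress only the first gzip member and ignore whatever follows."""
    # wbits=31 tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(wbits=31)
    return decompressor.decompress(data) + decompressor.flush()


def read_all_members(data):
    """Decompress every gzip member in turn and return one entry per member."""
    members = []
    while data:
        decompressor = zlib.decompressobj(wbits=31)
        members.append(decompressor.decompress(data) + decompressor.flush())
        # Bytes after the end of this member belong to the next one.
        data = decompressor.unused_data
    return members


if __name__ == "__main__":
    # Placeholder payloads standing in for two WARC records.
    records = [b"record one: HTML page\n", b"record two: JPEG image\n"]
    payload = make_multimember(records)
    print(len(read_all_members(payload)), "members found")
    print("first-member-only reader sees:", read_first_member(payload))
```

A reader built like `read_first_member` returns only the first record, which would match the symptom reported here (the JPEG missing from the WARC.GZ output), whereas `read_all_members` recovers every record.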