[ https://issues.apache.org/jira/browse/TIKA-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17727975#comment-17727975 ]
Tim Allison commented on TIKA-4048:
-----------------------------------

I'm wondering if we should switch the default to "on" and let users configure the CompressorParser via tika-config.xml to turn it off for legacy behavior.

Looking through commons-compress' source code it looks like there are several compressors that allow reading multiple records.

> Gzipped WARC not identifying all assets
> ---------------------------------------
>
> Key: TIKA-4048
> URL: https://issues.apache.org/jira/browse/TIKA-4048
> Project: Tika
> Issue Type: Bug
> Reporter: Gregory Lepore
> Priority: Minor
> Attachments: Screenshot 2023-05-30 at 3.49.19 PM.png, Screenshot
> 2023-05-30 at 3.50.41 PM.png, rec-20230518121844489398-5335604b8b23.warc,
> rec-20230518121844489398-5335604b8b23.warc.gz,
> rec-20230518121844489398-5335604b8b23.warc.gz.json,
> rec-20230518121844489398-5335604b8b23.warc.json
>
>
> The WARC parser works for non GZipped WARC files, but for GZipped WARC files
> it appears not all embedded files are being identified.
>
> Processing a WARC.GZ file should return identical JSON output as the plain
> WARC file, with the addition of the GZ file metadata. However, in the
> attached JSON outputs, the JPEG present in the plain WARC file is not
> represented in the WARC.GZ.json file.
>
> Additionally, the warc: metadata is not being returned for all files,
> although this may be by design.
>
> Attached are two JSON files, one for the GZipped WARC file and one for the
> plain WARC file. And the two original files.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
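
For readers following the thread, here is a minimal sketch of the behavior being discussed. It is not Tika or commons-compress code. It uses only Python's standard library and assumes the .warc.gz was written as one gzip member per record, so that a reader which stops after the first member sees only the first record. The record contents and the helper names `read_first_member` and `read_all_members` are hypothetical.

```python
import gzip
import zlib

# Two stand-in "records", each compressed as its own gzip member and then
# concatenated, which is the layout assumed for the .warc.gz in this issue.
records = [b"record-1\n", b"record-2\n"]
blob = b"".join(gzip.compress(r) for r in records)


def read_first_member(data: bytes) -> bytes:
    """Decompress only the first gzip member (the 'legacy' behavior)."""
    d = zlib.decompressobj(wbits=31)  # wbits=31 selects the gzip container
    # Decompression stops at the end of the first member; the remaining
    # bytes are left untouched in d.unused_data.
    return d.decompress(data) + d.flush()


def read_all_members(data: bytes) -> bytes:
    """Decompress every concatenated gzip member in turn."""
    out = []
    while data:
        d = zlib.decompressobj(wbits=31)
        out.append(d.decompress(data) + d.flush())
        data = d.unused_data  # bytes belonging to the next member, if any
    return b"".join(out)


first_only = read_first_member(blob)
everything = read_all_members(blob)
```

Under these assumptions, `first_only` holds just the first record while `everything` holds both, which mirrors the difference between the plain WARC output and the WARC.GZ output described in the issue. Switching the default to "on", as proposed above, would correspond to the second function.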