[ https://issues.apache.org/jira/browse/TIKA-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17727953#comment-17727953 ]
Tim Allison commented on TIKA-4048:
-----------------------------------

Thank you [~snagel], that's exactly it! With commons-compress, I get 2288 bytes, and with Java's GZIPInputStream, I get 6773 bytes.

{noformat}
try (InputStream is = new GzipCompressorInputStream(Files.newInputStream(p))) {
    byte[] bytes = IOUtils.toByteArray(is);
    System.out.println("length: " + bytes.length);
}
{noformat}

We can add WARC detection to our gzip detector as a workaround. I'm wondering if we should also add tgz detection to the gzip detector... separate issue.

Is this something that commons-compress should fix, or is this a unique feature of WARCs?

> Gzipped WARC not identifying all assets
> ---------------------------------------
>
>                 Key: TIKA-4048
>                 URL: https://issues.apache.org/jira/browse/TIKA-4048
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Gregory Lepore
>            Priority: Minor
>         Attachments: Screenshot 2023-05-30 at 3.49.19 PM.png,
>                      Screenshot 2023-05-30 at 3.50.41 PM.png,
>                      rec-20230518121844489398-5335604b8b23.warc,
>                      rec-20230518121844489398-5335604b8b23.warc.gz,
>                      rec-20230518121844489398-5335604b8b23.warc.gz.json,
>                      rec-20230518121844489398-5335604b8b23.warc.json
>
>
> The WARC parser works for non-gzipped WARC files, but for gzipped WARC files
> it appears that not all embedded files are being identified.
>
> Processing a WARC.GZ file should return the same JSON output as the plain
> WARC file, with the addition of the GZ file metadata. However, in the
> attached JSON outputs, the JPEG present in the plain WARC file is not
> represented in the WARC.GZ.json file.
>
> Additionally, the warc: metadata is not being returned for all files,
> although this may be by design.
>
> Attached are two JSON files, one for the gzipped WARC file and one for the
> plain WARC file, along with the two original files.
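
For reference, the length difference reported in the comment is what one would expect if the .warc.gz consists of several concatenated gzip members (WARC files are commonly compressed one record per member) and the reader stops after the first member. As far as I know, commons-compress's GzipCompressorInputStream does that by default and only continues across members when constructed with its second argument, decompressConcatenated, set to true. The following is a minimal JDK-only sketch of the effect, not the Tika or commons-compress code: the class name, helper names, and sample strings are made up for illustration, and the "first member only" reader is simulated with a raw Inflater that skips the fixed 10-byte header written by GZIPOutputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.Inflater;

public class ConcatenatedGzipDemo {

    /** Compresses the text into one self-contained gzip member. */
    static byte[] gzipMember(String text) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    /** Joins two byte arrays, the way per-record gzip members are appended in a file. */
    static byte[] concat(byte[] a, byte[] b) {
        byte[] out = new byte[a.length + b.length];
        System.arraycopy(a, 0, out, 0, a.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    /** Decompressed length as seen by java.util.zip.GZIPInputStream, which reads past member boundaries. */
    static int lengthAllMembers(byte[] gz) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            byte[] buf = new byte[4096];
            int total = 0;
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
            return total;
        }
    }

    /**
     * Decompressed length of the first member only. This simulates a reader that
     * stops at the first member boundary. Assumption: the input was produced by
     * GZIPOutputStream, whose header is always 10 bytes with no optional fields.
     */
    static int lengthFirstMemberOnly(byte[] gz) throws DataFormatException {
        final int headerLength = 10;
        Inflater inflater = new Inflater(true); // true = raw deflate data, no wrapper
        try {
            inflater.setInput(gz, headerLength, gz.length - headerLength);
            byte[] buf = new byte[4096];
            int total = 0;
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated or unexpected input; give up rather than spin
                }
                total += n;
            }
            return total;
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] twoMembers = concat(gzipMember("first record\n"), gzipMember("second record\n"));
        System.out.println("first member only: " + lengthFirstMemberOnly(twoMembers));
        System.out.println("all members:       " + lengthAllMembers(twoMembers));
    }
}
```

If the same explanation holds for the attached file, passing true as the second constructor argument of GzipCompressorInputStream in the snippet from the comment should bring its length in line with the GZIPInputStream result; I have not verified that against the attachment.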