[ https://issues.apache.org/jira/browse/TIKA-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17727667#comment-17727667 ]
Tim Allison commented on TIKA-4048:
-----------------------------------

I can reproduce this minimally without Tika with the following. There is some kind of interplay between commons-compress' GzipCompressorInputStream and the WarcReader that is problematic. When I swap in Java's native GZIPInputStream, the problem goes away.

{noformat}
    @Test
    public void testCommonsCompress() throws Exception {
        Path p = Paths.get("rec-20230518121844489398-5335604b8b23.warc.gz");
        int i = 0;
        try (InputStream is = new GzipCompressorInputStream(Files.newInputStream(p))) {
            WarcReader warcReader = new WarcReader(is);
            for (WarcRecord record : warcReader) {
                System.out.println("processing : " + i++ + " : " + record.id() + " :: " + record.type());
            }
        }
        assertEquals(1, i);
    }
{noformat}

vs

{noformat}
    public void testOneRMeta() throws Exception {
        Path p = Paths.get("rec-20230518121844489398-5335604b8b23.warc.gz");
        int i = 0;
        try (InputStream is = new GZIPInputStream(Files.newInputStream(p))) {
            WarcReader warcReader = new WarcReader(is);
            for (WarcRecord record : warcReader) {
                System.out.println("processing : " + i++ + " : " + record.id() + " :: " + record.type());
            }
        }
        assertEquals(4, i);
    }
{noformat}

> Gzipped WARC not identifying all assets
> ---------------------------------------
>
>                 Key: TIKA-4048
>                 URL: https://issues.apache.org/jira/browse/TIKA-4048
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Gregory Lepore
>            Priority: Minor
>         Attachments: rec-20230518121844489398-5335604b8b23.warc,
> rec-20230518121844489398-5335604b8b23.warc.gz,
> rec-20230518121844489398-5335604b8b23.warc.gz.json,
> rec-20230518121844489398-5335604b8b23.warc.json
>
>
> The WARC parser works for non GZipped WARC files, but for GZipped WARC files
> it appears not all embedded files are being identified.
>
> Processing a WARC.GZ file should return identical JSON output as the plain
> WARC file, with the addition of the GZ file metadata.
> However, in the attached JSON outputs, the JPEG present in the plain WARC
> file is not represented in the WARC.GZ.json file.
>
> Additionally, the warc: metadata is not being returned for all files,
> although this may be by design.
>
> Attached are two JSON files, one for the GZipped WARC file and one for the
> plain WARC file. And the two original files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
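
A possible explanation for the 1-versus-4 record counts in the two tests, not confirmed anywhere in this thread: gzipped WARC files are commonly written as one gzip member per record, and commons-compress' GzipCompressorInputStream stops after the first member unless it is constructed with decompressConcatenated set to true, whereas java.util.zip.GZIPInputStream reads through concatenated members. The sketch below uses only the JDK to show the second half of that claim on a synthetic two-member stream. The class name, helper methods, and sample strings are hypothetical and are not part of Tika, jwarc, or commons-compress; whether the attached file is actually multi-member is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class for illustration only.
class ConcatenatedGzipDemo {

    // Compress one string as a complete, standalone gzip member.
    static byte[] gzipMember(String text) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    // Append the members back to back, the way a per-record .warc.gz is assumed to be laid out.
    static byte[] concat(byte[]... members) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        for (byte[] m : members) {
            bos.write(m, 0, m.length);
        }
        return bos.toByteArray();
    }

    // Drain a stream into a UTF-8 string.
    static String readAll(InputStream is) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = is.read(buf)) != -1) {
            bos.write(buf, 0, n);
        }
        return new String(bos.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = concat(gzipMember("record one\n"), gzipMember("record two\n"));
        // GZIPInputStream continues into the second member after the first trailer.
        try (InputStream is = new GZIPInputStream(new ByteArrayInputStream(data))) {
            System.out.print(readAll(is));
        }
    }
}
```

With commons-compress, the analogous experiment would be to construct the stream as new GzipCompressorInputStream(Files.newInputStream(p), true) in testCommonsCompress and check whether all four records are then returned. That has not been run here.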