[jira] [Commented] (TIKA-4048) Gzipped WARC not identifying all assets

Tim Allison (Jira) Tue, 30 May 2023 12:27:27 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17727667#comment-17727667
 ]


Tim Allison commented on TIKA-4048:
-----------------------------------

I can reproduce this minimally, without Tika, using the two tests below.  
There is a problematic interaction between commons-compress' 
GzipCompressorInputStream and the WarcReader: only 1 record is read.  When I 
swap in Java's native GZIPInputStream, all 4 records are read and the problem 
goes away.

{noformat}
    @Test
    public void testCommonsCompress() throws Exception {
        Path p = Paths.get("rec-20230518121844489398-5335604b8b23.warc.gz");
        int i = 0;
        // commons-compress gzip stream wrapped by the WarcReader
        try (InputStream is = new GzipCompressorInputStream(Files.newInputStream(p))) {
            WarcReader warcReader = new WarcReader(is);

            for (WarcRecord record : warcReader) {
                System.out.println("processing : " + i++ + " : " + record.id() + " :: " + record.type());
            }
        }
        // only 1 record is seen with this stream
        assertEquals(1, i);
    }
{noformat}

vs

{noformat}
    @Test
    public void testOneRMeta() throws Exception {
        Path p = Paths.get("rec-20230518121844489398-5335604b8b23.warc.gz");
        int i = 0;
        // same file, but read through java.util.zip.GZIPInputStream
        try (InputStream is = new GZIPInputStream(Files.newInputStream(p))) {
            WarcReader warcReader = new WarcReader(is);

            for (WarcRecord record : warcReader) {
                System.out.println("processing : " + i++ + " : " + record.id() + " :: " + record.type());
            }
        }
        // all 4 records are seen with this stream
        assertEquals(4, i);
    }
{noformat}
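
One possible explanation, which this comment does not confirm: a .warc.gz 
file is commonly written as a series of concatenated gzip members, one per 
record.  Java's GZIPInputStream keeps reading across member boundaries, 
whereas commons-compress' GzipCompressorInputStream by default stops after the 
first member unless it is constructed with decompressConcatenated set to true, 
i.e. new GzipCompressorInputStream(in, true).  The sketch below uses only the 
JDK and builds a small multi-member gzip in memory to show the 
GZIPInputStream side of that behavior.  The class name, the helper names, and 
the "record-N" payloads are made up for illustration; they are not taken from 
Tika or from the attached file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of Tika or the attached test case.
public class MultiMemberGzipDemo {

    /** Compresses one string as a single, complete gzip member. */
    static byte[] gzipMember(String text) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    /** Concatenates several gzip members back to back (assumed .warc.gz-style layout). */
    static byte[] concatMembers(String... parts) throws IOException {
        ByteArrayOutputStream all = new ByteArrayOutputStream();
        for (String part : parts) {
            all.write(gzipMember(part));
        }
        return all.toByteArray();
    }

    /** Decompresses with GZIPInputStream, which continues into following members. */
    static String readAll(byte[] gzipped) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = concatMembers("record-1\n", "record-2\n", "record-3\n");
        // Prints all three lines: the content of every member is returned.
        System.out.print(readAll(data));
    }
}
```

If the attached file does turn out to be multi-member, passing true as the 
second constructor argument in testCommonsCompress would be the first thing to 
try; that is a suggestion to verify, not a confirmed fix.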

> Gzipped WARC not identifying all assets
> ---------------------------------------
>
>                 Key: TIKA-4048
>                 URL: https://issues.apache.org/jira/browse/TIKA-4048
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Gregory Lepore
>            Priority: Minor
>         Attachments: rec-20230518121844489398-5335604b8b23.warc, 
> rec-20230518121844489398-5335604b8b23.warc.gz, 
> rec-20230518121844489398-5335604b8b23.warc.gz.json, 
> rec-20230518121844489398-5335604b8b23.warc.json
>
>
> The WARC parser works for non-gzipped WARC files, but for gzipped WARC files 
> it appears that not all embedded files are being identified.
>  
> Processing a WARC.GZ file should return the same JSON output as the plain 
> WARC file, with the addition of the GZ file metadata. However, in the 
> attached JSON outputs, the JPEG present in the plain WARC file is not 
> represented in the WARC.GZ.json file.
>  
> Additionally, the warc: metadata is not being returned for all files, 
> although this may be by design. 
>  
> Attached are two JSON files, one for the gzipped WARC file and one for the 
> plain WARC file, along with the two original files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
