[jira] [Commented] (TIKA-4048) Gzipped WARC not identifying all assets

Tim Allison (Jira) Wed, 31 May 2023 06:47:26 -0700


    [ https://issues.apache.org/jira/browse/TIKA-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17727975#comment-17727975 ]


Tim Allison commented on TIKA-4048:
-----------------------------------

I'm wondering if we should switch the default to "on" and let users configure 
the CompressorParser via tika-config.xml to turn it off for legacy behavior.

Looking through the commons-compress source code, it looks like several 
compressors support reading multiple records from a single stream.
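
As a rough illustration of why a reader that stops after the first record 
could miss later assets: a .warc.gz is commonly written as several gzip 
members concatenated back to back. The sketch below uses only Python's 
standard zlib and gzip modules, not Tika or commons-compress, and the helper 
names and sample records are hypothetical. It is not a statement of how the 
CompressorParser is implemented.

```python
import gzip
import zlib

GZIP_WBITS = 16 + zlib.MAX_WBITS  # tell zlib to expect a gzip header/trailer


def read_first_member(data: bytes) -> bytes:
    """Decompress only the first gzip member; later members are ignored."""
    d = zlib.decompressobj(GZIP_WBITS)
    return d.decompress(data)


def read_all_members(data: bytes) -> list:
    """Decompress every concatenated gzip member, one output per member."""
    members = []
    while data:
        d = zlib.decompressobj(GZIP_WBITS)
        members.append(d.decompress(data) + d.flush())
        # Bytes after the end of this member belong to the next one.
        data = d.unused_data
    return members


# Two hypothetical "records", each compressed as its own gzip member.
records = [b"record 1: html page\n", b"record 2: jpeg image\n"]
blob = b"".join(gzip.compress(r) for r in records)

first_only = read_first_member(blob)
all_members = read_all_members(blob)
```

Here first_only contains just the first record, while all_members recovers 
both. The "on"/"off" choice discussed above is presumably analogous to 
choosing between these two functions.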

> Gzipped WARC not identifying all assets
> ---------------------------------------
>
>                 Key: TIKA-4048
>                 URL: https://issues.apache.org/jira/browse/TIKA-4048
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Gregory Lepore
>            Priority: Minor
>         Attachments: Screenshot 2023-05-30 at 3.49.19 PM.png, Screenshot 
> 2023-05-30 at 3.50.41 PM.png, rec-20230518121844489398-5335604b8b23.warc, 
> rec-20230518121844489398-5335604b8b23.warc.gz, 
> rec-20230518121844489398-5335604b8b23.warc.gz.json, 
> rec-20230518121844489398-5335604b8b23.warc.json
>
>
> The WARC parser works for non-gzipped WARC files, but for gzipped WARC files 
> it appears that not all embedded files are being identified.
>  
> Processing a WARC.GZ file should return the same JSON output as the plain 
> WARC file, with the addition of the GZ file metadata. However, in the 
> attached JSON outputs, the JPEG present in the plain WARC file is not 
> represented in the WARC.GZ.json file.
>  
> Additionally, the warc: metadata is not being returned for all files, 
> although this may be by design. 
>  
> Attached are two JSON files, one for the gzipped WARC file and one for the 
> plain WARC file, along with the two original files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
