[ 
https://issues.apache.org/jira/browse/TIKA-4004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17725977#comment-17725977
 ] 

Tim Allison edited comment on TIKA-4004 at 5/24/23 10:22 PM:
-------------------------------------------------------------

 [^000000.warc] is the result of gunzipping the output of {{curl -r 52967301-53010202 https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-06/segments/1674764499831.97/warc/CC-MAIN-20230130232547-20230131022547-00296.warc.gz -o 000000.warc.gz}}

So, yes, Common Crawl (CC) fetched some different bytes than I'm currently 
getting when I refetch from the source sites. 
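
For anyone who wants to reproduce the attachment without curl, here is a rough Python sketch of the same two steps (ranged fetch, then gunzip). The URL and byte offsets are copied from the command above; the function names and the output filename handling are my own and purely illustrative. It assumes the byte range lines up with gzip member boundaries in the .warc.gz, which is what makes the slice decompressible on its own.

```python
import gzip
import urllib.request

# URL and offsets copied from the curl command in the comment.
WARC_URL = (
    "https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-06/segments/"
    "1674764499831.97/warc/CC-MAIN-20230130232547-20230131022547-00296.warc.gz"
)
START, END = 52967301, 53010202  # inclusive, same meaning as `curl -r START-END`


def range_header(start: int, end: int) -> str:
    """Build the HTTP Range header value for an inclusive byte span."""
    return f"bytes={start}-{end}"


def fetch_range(url: str, start: int, end: int) -> bytes:
    """Fetch only the requested bytes. Needs network access."""
    req = urllib.request.Request(url, headers={"Range": range_header(start, end)})
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def gunzip_slice(data: bytes) -> bytes:
    """Decompress one or more concatenated gzip members."""
    return gzip.decompress(data)


def main() -> None:
    """Hypothetical driver: writes 000000.warc to the current directory."""
    compressed = fetch_range(WARC_URL, START, END)
    with open("000000.warc", "wb") as out:
        out.write(gunzip_slice(compressed))
```

Calling {{main()}} should leave a 000000.warc comparable to the attachment, provided the file on data.commoncrawl.org has not changed.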


was (Author: talli...@mitre.org):
 [^000000.warc] is the result of {{curl -r 52967301-53010202 
https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-06/segments/1674764499831.97/warc/CC-MAIN-20230130232547-20230131022547-00296.warc.gz
 -o 000000.warc.gz}}

So, y, CC fetched some different bytes than I'm currently refetching from the 
source sites. 

> font/otf application/vnd.ms-opentype
> ------------------------------------
>
>                 Key: TIKA-4004
>                 URL: https://issues.apache.org/jira/browse/TIKA-4004
>             Project: Tika
>          Issue Type: Sub-task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: 000000.warc, aller-bold.eot, aller-light.eot, 
> fleurons.eot, index.html_id=45_and_type=eot, index.html_id=67_and_type=eot, 
> index.html_id=75_and_type=eot, index.html_id=77_and_type=eot, 
> index.html_id=80_and_type=eot, index.html_id=83_and_type=eot, 
> index.html_id=84_and_type=eot
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
