[ https://issues.apache.org/jira/browse/TIKA-4004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17725977#comment-17725977 ]
Tim Allison edited comment on TIKA-4004 at 5/24/23 10:22 PM:
-------------------------------------------------------------

[^000000.warc] is the result of gunzipping the output of

{{curl -r 52967301-53010202 https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-06/segments/1674764499831.97/warc/CC-MAIN-20230130232547-20230131022547-00296.warc.gz -o 000000.warc.gz}}

So, yes, CC fetched some different bytes than I'm currently refetching from the source sites.

was (Author: talli...@mitre.org):
[^000000.warc] is the result of

{{curl -r 52967301-53010202 https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-06/segments/1674764499831.97/warc/CC-MAIN-20230130232547-20230131022547-00296.warc.gz -o 000000.warc.gz}}

So, yes, CC fetched some different bytes than I'm currently refetching from the source sites.

> font/otf application/vnd.ms-opentype
> ------------------------------------
>
>                 Key: TIKA-4004
>                 URL: https://issues.apache.org/jira/browse/TIKA-4004
>             Project: Tika
>          Issue Type: Sub-task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: 000000.warc, aller-bold.eot, aller-light.eot,
> fleurons.eot, index.html_id=45_and_type=eot, index.html_id=67_and_type=eot,
> index.html_id=75_and_type=eot, index.html_id=77_and_type=eot,
> index.html_id=80_and_type=eot, index.html_id=83_and_type=eot,
> index.html_id=84_and_type=eot
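
For anyone who wants to repeat the fetch-and-gunzip step from the comment without curl, here is a rough Python sketch. It is not part of Tika or of the original comment. The function names (range_header, gunzip_record, fetch_record) are made up for illustration, and it assumes the byte range covers exactly one gzip member, which is how Common Crawl packs individual WARC records.

```python
import gzip
import urllib.request


def range_header(start: int, end: int) -> dict:
    """Build an HTTP Range header; both offsets are inclusive, like curl -r."""
    return {"Range": f"bytes={start}-{end}"}


def gunzip_record(compressed: bytes) -> bytes:
    """Decompress one gzip member (assumed to hold a single WARC record)."""
    return gzip.decompress(compressed)


def fetch_record(url: str, start: int, end: int) -> bytes:
    """Fetch the byte range from url and return the gunzipped bytes.

    Hypothetical helper: it performs a network request and has no retry
    or error handling.
    """
    request = urllib.request.Request(url, headers=range_header(start, end))
    with urllib.request.urlopen(request) as response:
        return gunzip_record(response.read())


# Usage, with the URL and offsets taken from the comment above:
#   warc = fetch_record(
#       "https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-06/segments/"
#       "1674764499831.97/warc/"
#       "CC-MAIN-20230130232547-20230131022547-00296.warc.gz",
#       52967301, 53010202)
#   open("000000.warc", "wb").write(warc)
```

The decompressed bytes should start with the WARC record headers, followed by the HTTP response that CC stored, which is what you would compare against a fresh fetch from the source site.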