[ https://issues.apache.org/jira/browse/TIKA-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149012#comment-15149012 ]
Chris A. Mattmann commented on TIKA-1856:
-----------------------------------------

Hey Nick

It's possible they were truncated from Nutch crawls and content limits. See http://github.com/chrismattmann/trec-dd-polar/ for a description of the dataset.

> Error while parsing an ogg file
> -------------------------------
>
>                 Key: TIKA-1856
>                 URL: https://issues.apache.org/jira/browse/TIKA-1856
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, parser
>    Affects Versions: 1.12
>         Environment: python
>            Reporter: Yash Tanna
>              Labels: newbie, tika
>         Attachments:
>   1B7A7AE8FE999D22E2A677EFDA38982C8957CF77BEF33717777E48852F7D67A7,
>   1DE811ACAB8432D526EFE9D941E5EFE58F3C89F1AAB6CB7152091961DD854431,
>   4600B9FF184F6AB71AA0CF6873E580FB0A31D75CE1218998057E9A185A5FFBB2,
>   5E5892EA6C2B4A07BE998403A04127C7924E5539DB3EB0D27B9BD34D11A1575B,
>   CA3065B754E6CE79E4BF128464F4A202B0F2CF0336FBE73FA33F13776CD01CE8,
>   F036789D92EE18032556D9D0ECAC75073CED52226E1833001E379740E23E183D,
>   F33BFE4B1AF562D40E5B9D9F5D4B34EA6734F8F3A06F99535F100F957958D9BA,
>   F47F833BFD4A7E55C128DD76DB3666EEFFD0F5EDA24BF3EEEE1D6F2427BA092D,
>   FA9D1D2B8D0FB50CFE306FA6024EC48BD771562878B9B70D38D106DF4E61147A
>
>
> Unable to detect a malformed ogg file.
> The error thrown was
> Exception in thread "main" java.io.IOException: Asked to read 4335 bytes from 0 but hit EoF at 780
>     at org.gagravarr.ogg.IOUtils.readFully(IOUtils.java:39)
>     at org.gagravarr.ogg.IOUtils.readFully(IOUtils.java:31)
>     at org.gagravarr.ogg.OggPage.<init>(OggPage.java:82)
>     at org.gagravarr.ogg.OggPacketReader.getNextPacket(OggPacketReader.java:116)
>     at org.gagravarr.tika.OggDetector.detect(OggDetector.java:97)
>     at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
>     at org.apache.tika.cli.TikaCLI$10.process(TikaCLI.java:291)
>     at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477)
>     at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
> [xdatadeploy@xdata upload]$

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)