[ https://issues.apache.org/jira/browse/TIKA-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison updated TIKA-1515:
------------------------------
    Attachment: 081247.unk.xls

This file comes from govdocs1 and demonstrates the code page issue.

> Old XLS 3 parsing is not working on some documents
> --------------------------------------------------
>
>                 Key: TIKA-1515
>                 URL: https://issues.apache.org/jira/browse/TIKA-1515
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: 081247.unk.xls
>
>
> Thanks to [~gagravarr], we now have mime type detection for excel.sheet.4 and excel.sheet.3, and we have parsing for excel.sheet.4. It looks like there are two issues with excel.sheet.3 parsing on most excel.sheet.3 files in govdocs1.
> The predominant issue (169 out of 173) appears to stem from a bad/missing code page parse:
> {noformat}
> Caused by: java.lang.IllegalArgumentException: Unsupported codepage requested
> 	at org.apache.poi.hssf.record.OldStringRecord.getString(OldStringRecord.java:83)
> 	at org.apache.poi.hssf.record.OldLabelRecord.getValue(OldLabelRecord.java:82)
> 	at org.apache.poi.hssf.extractor.OldExcelExtractor.getText(OldExcelExtractor.java:159)
> 	at org.apache.tika.parser.microsoft.OldExcelParser.parse(OldExcelParser.java:82)
> 	at org.apache.tika.parser.microsoft.OldExcelParser.parse(OldExcelParser.java:76)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
> 	... 41 more
> Caused by: java.io.UnsupportedEncodingException: Codepage number may not be -32767
> 	at org.apache.poi.util.CodePageUtil.codepageToEncoding(CodePageUtil.java:275)
> 	at org.apache.poi.util.CodePageUtil.codepageToEncoding(CodePageUtil.java:253)
> 	at org.apache.poi.util.CodePageUtil.getStringFromCodePage(CodePageUtil.java:231)
> 	at org.apache.poi.util.CodePageUtil.getStringFromCodePage(CodePageUtil.java:219)
> 	at org.apache.poi.hssf.record.OldStringRecord.getString(OldStringRecord.java:81)
> 	... 46 more
> {noformat}
> The second issue only affects 4 documents.
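
The -32767 in the trace looks like a 16-bit code page field that was read as a signed value. Below is a minimal Java sketch of that arithmetic, not POI's actual code: the class CodepageSketch and both helpers are hypothetical names. The mapping of 0x8000 to Apple Roman and 0x8001 to Windows CP-1252 is an assumption based on publicly available BIFF2/BIFF3 format documentation, and should be checked against the attached file before relying on it.

```java
public class CodepageSketch {
    // Reinterpret a signed 16-bit value as an unsigned one (0..65535).
    static int toUnsigned16(short raw) {
        return raw & 0xFFFF;
    }

    // Hypothetical lookup, NOT part of POI's API. The two special values
    // below are an assumption taken from BIFF format documentation.
    static String guessEncoding(short raw) {
        int codepage = toUnsigned16(raw);
        switch (codepage) {
            case 0x8000: return "MacRoman";      // assumption: Apple Roman
            case 0x8001: return "windows-1252";  // assumption: Windows Latin 1
            default:     return "cp" + codepage; // placeholder naming only
        }
    }

    public static void main(String[] args) {
        short raw = (short) -32767; // the value reported in the stack trace
        int unsigned = toUnsigned16(raw);
        System.out.println(raw + " -> " + unsigned
                + " (0x" + Integer.toHexString(unsigned) + ")");
        System.out.println(guessEncoding(raw));
    }
}
```

Running main prints "-32767 -> 32769 (0x8001)" followed by "windows-1252". If the assumption holds, the 169 failing files may simply declare a legacy code page value that the lookup does not yet recognise, rather than containing a corrupt record.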
-- This message was sent by Atlassian JIRA (v6.3.4#6332)