Tim Allison created TIKA-1515:
---------------------------------

             Summary: Old XLS 3 parsing is not working
                 Key: TIKA-1515
                 URL: https://issues.apache.org/jira/browse/TIKA-1515
             Project: Tika
          Issue Type: Bug
            Reporter: Tim Allison
            Priority: Minor

Thanks to [~gagravarr], we now have mime type identification for excel.sheet.4 and excel.sheet.3, and we have parsing for excel.sheet.4. It looks like there are two issues with excel.sheet.3 parsing that affect most of the excel.sheet.3 files in govdocs1.

The predominant issue (169 out of 173) appears to stem from a bad/missing code page parse:

{noformat}
Caused by: java.lang.IllegalArgumentException: Unsupported codepage requested
	at org.apache.poi.hssf.record.OldStringRecord.getString(OldStringRecord.java:83)
	at org.apache.poi.hssf.record.OldLabelRecord.getValue(OldLabelRecord.java:82)
	at org.apache.poi.hssf.extractor.OldExcelExtractor.getText(OldExcelExtractor.java:159)
	at org.apache.tika.parser.microsoft.OldExcelParser.parse(OldExcelParser.java:82)
	at org.apache.tika.parser.microsoft.OldExcelParser.parse(OldExcelParser.java:76)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
	... 41 more
Caused by: java.io.UnsupportedEncodingException: Codepage number may not be -32767
	at org.apache.poi.util.CodePageUtil.codepageToEncoding(CodePageUtil.java:275)
	at org.apache.poi.util.CodePageUtil.codepageToEncoding(CodePageUtil.java:253)
	at org.apache.poi.util.CodePageUtil.getStringFromCodePage(CodePageUtil.java:231)
	at org.apache.poi.util.CodePageUtil.getStringFromCodePage(CodePageUtil.java:219)
	at org.apache.poi.hssf.record.OldStringRecord.getString(OldStringRecord.java:81)
	... 46 more
{noformat}

The second issue only affects 4 documents.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
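
A note on the "-32767" in the trace above, offered as an assumption rather than a confirmed diagnosis: a codepage identifier stored in a 16-bit field shows up as -32767 when the bytes 0x8001 (32769 unsigned) are read as a signed short. The sketch below only illustrates that arithmetic. The class and method names are hypothetical and this is not POI code. Whether 32769 maps to a usable encoding for these files would still need to be checked.

```java
public class CodepageSignDemo {

    /** Reinterprets a signed 16-bit value as an unsigned value in 0..65535. */
    static int toUnsigned16(short raw) {
        return raw & 0xFFFF;
    }

    public static void main(String[] args) {
        // Assumption: the codepage field holds the 16-bit pattern 0x8001.
        short raw = (short) 0x8001;

        // Read as a signed short, this is the value from the exception message.
        System.out.println("signed:   " + raw);

        // Masked to 16 bits, the same pattern is a positive codepage number.
        System.out.println("unsigned: " + toUnsigned16(raw));
    }
}
```

If this reading is right, masking the value to 16 bits before the codepage lookup would at least avoid the negative-number rejection; it would not by itself guarantee that the resulting codepage is one the lookup knows about.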