[ 
https://issues.apache.org/jira/browse/TIKA-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1515:
------------------------------
    Attachment: 081247.unk.xls

This file comes from govdocs1 and demonstrates the code page issue.
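
For context on the -32767 in the quoted trace: that value is what the 16-bit pattern 0x8001 becomes when read as a signed short. Read as unsigned it is 32769, which, as far as I understand the old BIFF2/BIFF3 format documentation, denotes Windows-1252. The sketch below only illustrates that reinterpretation; the class and method names are hypothetical and this is not the POI code or a proposed patch.

```java
public class CodepageSketch {

    /** Reinterpret a signed 16-bit record value as unsigned (0..65535). */
    static int toUnsigned(short raw) {
        return raw & 0xFFFF;
    }

    /**
     * Hypothetical normalization: map the old-format value 32769 to 1252.
     * Assumption: 32769 (0x8001) means Windows-1252 in BIFF2/BIFF3 files.
     * Every other value is passed through unchanged.
     */
    static int normalizeCodepage(short raw) {
        int codepage = toUnsigned(raw);
        return codepage == 32769 ? 1252 : codepage;
    }

    public static void main(String[] args) {
        short raw = (short) 0x8001;                   // the same bits as -32767
        System.out.println(raw);                      // signed view
        System.out.println(toUnsigned(raw));          // unsigned view
        System.out.println(normalizeCodepage(raw));   // normalized codepage
    }
}
```

If that reading is right, a failure at the signed value points to the code page field being read as signed, or to 32769 not being a recognized value, rather than to a corrupt file.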

> Old XLS 3 parsing is not working on some documents
> --------------------------------------------------
>
>                 Key: TIKA-1515
>                 URL: https://issues.apache.org/jira/browse/TIKA-1515
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: 081247.unk.xls
>
>
> Thanks to [~gagravarr], we now have mime type id for excel.sheet.4 and 
> excel.sheet.3, and we have parsing for excel.sheet.4.  It looks like there 
> are two issues with excel.sheet.3 parsing on most excel.sheet.3 files in 
> govdocs1.
> The predominant issue (169 out of 173) appears to stem from a bad/missing 
> code page parse:
> {noformat}
> Caused by: java.lang.IllegalArgumentException: Unsupported codepage requested
>       at org.apache.poi.hssf.record.OldStringRecord.getString(OldStringRecord.java:83)
>       at org.apache.poi.hssf.record.OldLabelRecord.getValue(OldLabelRecord.java:82)
>       at org.apache.poi.hssf.extractor.OldExcelExtractor.getText(OldExcelExtractor.java:159)
>       at org.apache.tika.parser.microsoft.OldExcelParser.parse(OldExcelParser.java:82)
>       at org.apache.tika.parser.microsoft.OldExcelParser.parse(OldExcelParser.java:76)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
>       ... 41 more
> Caused by: java.io.UnsupportedEncodingException: Codepage number may not be -32767
>       at org.apache.poi.util.CodePageUtil.codepageToEncoding(CodePageUtil.java:275)
>       at org.apache.poi.util.CodePageUtil.codepageToEncoding(CodePageUtil.java:253)
>       at org.apache.poi.util.CodePageUtil.getStringFromCodePage(CodePageUtil.java:231)
>       at org.apache.poi.util.CodePageUtil.getStringFromCodePage(CodePageUtil.java:219)
>       at org.apache.poi.hssf.record.OldStringRecord.getString(OldStringRecord.java:81)
>       ... 46 more
> {noformat}
> The second issue only affects 4 documents.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
