[ https://issues.apache.org/jira/browse/TIKA-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1515:
------------------------------
    Description: 
Thanks to [~gagravarr], we now have mime type id for excel.sheet.4 and 
excel.sheet.3, and we have parsing for excel.sheet.4.  It looks like there are 
two issues with excel.sheet.3 parsing on most excel.sheet.3 files in govdocs1.

The predominant issue (169 out of 175 files) appears to stem from a bad/missing 
code page parse:
{noformat}
Caused by: java.lang.IllegalArgumentException: Unsupported codepage requested
        at org.apache.poi.hssf.record.OldStringRecord.getString(OldStringRecord.java:83)
        at org.apache.poi.hssf.record.OldLabelRecord.getValue(OldLabelRecord.java:82)
        at org.apache.poi.hssf.extractor.OldExcelExtractor.getText(OldExcelExtractor.java:159)
        at org.apache.tika.parser.microsoft.OldExcelParser.parse(OldExcelParser.java:82)
        at org.apache.tika.parser.microsoft.OldExcelParser.parse(OldExcelParser.java:76)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
        ... 41 more
Caused by: java.io.UnsupportedEncodingException: Codepage number may not be -32767
        at org.apache.poi.util.CodePageUtil.codepageToEncoding(CodePageUtil.java:275)
        at org.apache.poi.util.CodePageUtil.codepageToEncoding(CodePageUtil.java:253)
        at org.apache.poi.util.CodePageUtil.getStringFromCodePage(CodePageUtil.java:231)
        at org.apache.poi.util.CodePageUtil.getStringFromCodePage(CodePageUtil.java:219)
        at org.apache.poi.hssf.record.OldStringRecord.getString(OldStringRecord.java:81)
        ... 46 more
{noformat}
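
One possible reading of the -32767 value, which the trace itself does not confirm: a 16-bit codepage field read as a signed short. The bit pattern 0x8001 is -32767 when signed and 32769 when unsigned, and to my knowledge old BIFF files use 32769 to mean Windows Latin 1 (CP1252). The sketch below only illustrates that arithmetic. The class CodepageSketch and its methods are hypothetical and are not POI's CodePageUtil API.

```java
import java.nio.charset.Charset;

class CodepageSketch {

    // Reinterpret a 16-bit value that was read as a signed short as unsigned.
    static int toUnsignedCodepage(short raw) {
        return raw & 0xFFFF;
    }

    // Hypothetical lookup for illustration only. The 32769 -> windows-1252
    // mapping is an assumption about old BIFF files, not taken from the issue.
    static Charset charsetFor(int codepage) {
        switch (codepage) {
            case 1252:
            case 32769: // 0x8001
                return Charset.forName("windows-1252");
            default:
                throw new IllegalArgumentException("Unsupported codepage: " + codepage);
        }
    }

    public static void main(String[] args) {
        short raw = (short) 0x8001;          // same bits as the failing value
        System.out.println(raw);             // -32767
        int unsigned = toUnsignedCodepage(raw);
        System.out.println(unsigned);        // 32769
        System.out.println(charsetFor(unsigned).name());
    }
}
```

If this reading is right, masking with 0xFFFF before the lookup would avoid the negative value. Whether that is the appropriate fix in POI is a separate question.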

The second issue only affects 4 documents.

  was:
Thanks to [~gagravarr], we now have mime type id for excel.sheet.4 and 
excel.sheet.3, and we have parsing for excel.sheet.4.  It looks like there's 
are two issues with excel.sheet.3 parsing on most excel.sheet.3 files in 
govdocs1.

The predominant issue (169 out of 175) appears to stem from a bad/missing code 
page parse:
{noformat}
Caused by: java.lang.IllegalArgumentException: Unsupported codepage requested
        at org.apache.poi.hssf.record.OldStringRecord.getString(OldStringRecord.java:83)
        at org.apache.poi.hssf.record.OldLabelRecord.getValue(OldLabelRecord.java:82)
        at org.apache.poi.hssf.extractor.OldExcelExtractor.getText(OldExcelExtractor.java:159)
        at org.apache.tika.parser.microsoft.OldExcelParser.parse(OldExcelParser.java:82)
        at org.apache.tika.parser.microsoft.OldExcelParser.parse(OldExcelParser.java:76)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
        ... 41 more
Caused by: java.io.UnsupportedEncodingException: Codepage number may not be -32767
        at org.apache.poi.util.CodePageUtil.codepageToEncoding(CodePageUtil.java:275)
        at org.apache.poi.util.CodePageUtil.codepageToEncoding(CodePageUtil.java:253)
        at org.apache.poi.util.CodePageUtil.getStringFromCodePage(CodePageUtil.java:231)
        at org.apache.poi.util.CodePageUtil.getStringFromCodePage(CodePageUtil.java:219)
        at org.apache.poi.hssf.record.OldStringRecord.getString(OldStringRecord.java:81)
        ... 46 more
{noformat}

The second issue only affects 4 documents.


> Old XLS 3 parsing is not working on some documents
> --------------------------------------------------
>
>                 Key: TIKA-1515
>                 URL: https://issues.apache.org/jira/browse/TIKA-1515
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: 081247.unk.xls
>
>
> Thanks to [~gagravarr], we now have mime type id for excel.sheet.4 and 
> excel.sheet.3, and we have parsing for excel.sheet.4.  It looks like there 
> are two issues with excel.sheet.3 parsing on most excel.sheet.3 files in 
> govdocs1.
> The predominant issue (169 out of 175 files) appears to stem from a 
> bad/missing code page parse:
> {noformat}
> Caused by: java.lang.IllegalArgumentException: Unsupported codepage requested
>       at org.apache.poi.hssf.record.OldStringRecord.getString(OldStringRecord.java:83)
>       at org.apache.poi.hssf.record.OldLabelRecord.getValue(OldLabelRecord.java:82)
>       at org.apache.poi.hssf.extractor.OldExcelExtractor.getText(OldExcelExtractor.java:159)
>       at org.apache.tika.parser.microsoft.OldExcelParser.parse(OldExcelParser.java:82)
>       at org.apache.tika.parser.microsoft.OldExcelParser.parse(OldExcelParser.java:76)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
>       ... 41 more
> Caused by: java.io.UnsupportedEncodingException: Codepage number may not be -32767
>       at org.apache.poi.util.CodePageUtil.codepageToEncoding(CodePageUtil.java:275)
>       at org.apache.poi.util.CodePageUtil.codepageToEncoding(CodePageUtil.java:253)
>       at org.apache.poi.util.CodePageUtil.getStringFromCodePage(CodePageUtil.java:231)
>       at org.apache.poi.util.CodePageUtil.getStringFromCodePage(CodePageUtil.java:219)
>       at org.apache.poi.hssf.record.OldStringRecord.getString(OldStringRecord.java:81)
>       ... 46 more
> {noformat}
> The second issue only affects 4 documents.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
