[
https://issues.apache.org/jira/browse/TIKA-1020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13585618#comment-13585618
]
art rodkin edited comment on TIKA-1020 at 2/25/13 4:42 AM:
-----------------------------------------------------------
Nick,
Could you please explain the intended goal of parsing an Excel file?
Conceptually, an Excel file is a collection of records in which the cells of
each row typically hold the values for the fields defined by the headers in the
first row. Essentially it is a table, no? I am not sure an Excel file always
makes sense as just a collection of non-empty cells. With the current
implementation it is impossible to correctly extract a list of records from a
2010 Excel file, although this works with Excel 97. Any feedback would be
greatly appreciated.
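
To make the concern concrete, here is a minimal Java sketch (not Tika code; the class and method names are hypothetical) of what happens when a consumer pairs header names with cell values purely by position. It uses the row from the issue description whose middle cell is empty and is therefore not reported.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RecordMisalignment {
    /** Pairs header names with cell values by position only (hypothetical helper). */
    public static Map<String, String> toRecord(List<String> headers, List<String> cells) {
        Map<String, String> record = new LinkedHashMap<String, String>();
        for (int i = 0; i < cells.size() && i < headers.size(); i++) {
            record.put(headers.get(i), cells.get(i));
        }
        return record;
    }

    public static void main(String[] args) {
        List<String> headers = Arrays.asList("A", "B", "C");
        // The sheet row is 4, (empty), 6, but only the non-empty cells arrive.
        List<String> cells = Arrays.asList("4", "6");
        // The value 6 lands under header B instead of C.
        System.out.println(toRecord(headers, cells)); // prints {A=4, B=6}
    }
}
```

Without some indication of which column each cell came from, the consumer has no way to tell that 6 belongs to column C.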
was (Author: arodkin):
Nick,
Could you please explain the intended goal of parsing an excel file?
Conceptually an excel file is a collection of records where typically the cells
in each row are the values for the fields defined in the headers in the first
row. I am not sure an excel file always makes sense as just a collection of
non-empty cells. So - with the current implementation it is impossible to
correctly extract a list of records from a 2010 excel file. Works with excel 97
BTW. Any feedback would be greatly appreciated.
> Excel 2010 parser missing cell values are not reported resulting in missing
> columns values
> ------------------------------------------------------------------------------------------
>
> Key: TIKA-1020
> URL: https://issues.apache.org/jira/browse/TIKA-1020
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.2
> Environment: java 1.6 & 1.7
> Reporter: Neil Blue
> Labels: newbie, patch
>
> When parsing an Excel 2010 table, if a worksheet has a missing value, then it
> is not reported to the SAX handler. As a result, a missing value can leave the
> column data misaligned.
> For example given the table:
> {code:title=Bar.java|borderStyle=solid}
> A B C
> 1 2 3
> 4   6
> 7 8 9
> {code}
> the SAX handler reports the elements
> {code:title=Bar.java|borderStyle=solid}
> <tr><td>A</td><td>B</td><td>C</td></tr>
> <tr><td>1</td><td>2</td><td>3</td></tr>
> <tr><td>4</td><td>6</td></tr>
> <tr><td>7</td><td>8</td><td>9</td></tr>
> {code}
> As a result, the handler can detect that the third row has incomplete cell
> values, but it is ambiguous which columns have missing data.
> As a possible fix, the Excel 2010 XML data contains the cell reference
> value, which could be passed to the SAX handler as an attribute.
> {code:title=Bar.java|borderStyle=solid}
> *** XSSFExcelExtractorDecorator.java 2012-11-08 10:51:55.881207100 +0000
> --- XSSFExcelExtractorDecorator.java.1 2012-11-08 10:59:02.972223700 +0000
> ***************
> *** 200,206 ****
>
> public void cell(String cellRef, String formattedValue) {
> try {
> ! xhtml.startElement("td");
>
> // Main cell contents
> xhtml.characters(formattedValue);
> --- 200,208 ----
>
> public void cell(String cellRef, String formattedValue) {
> try {
> ! AttributesImpl attributes = new AttributesImpl();
> ! attributes.addAttribute(null, "cellRef", "cellRef", null, cellRef) ;
> ! xhtml.startElement("td",attributes);
>
> // Main cell contents
> xhtml.characters(formattedValue);
> {code}
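
If a `cellRef` attribute were emitted as in the patch above, a downstream handler could use it to restore the column positions. The following is a minimal sketch using only the JDK SAX classes. The class `CellRefRowCollector` and its methods are hypothetical and not part of Tika, and the sketch assumes that cells arrive as `td` elements inside `tr` elements and that the attribute's qualified name is `cellRef`.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class CellRefRowCollector extends DefaultHandler {
    private final List<List<String>> rows = new ArrayList<List<String>>();
    private List<String> currentRow;
    private StringBuilder text;

    /** Zero-based column index of an A1-style reference: "A1" -> 0, "C3" -> 2. */
    public static int columnIndex(String cellRef) {
        int col = 0;
        for (int i = 0; i < cellRef.length(); i++) {
            char ch = Character.toUpperCase(cellRef.charAt(i));
            if (ch < 'A' || ch > 'Z') {
                break; // reached the row digits
            }
            col = col * 26 + (ch - 'A' + 1);
        }
        return col - 1;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("tr".equals(localName)) {
            currentRow = new ArrayList<String>();
        } else if ("td".equals(localName) && currentRow != null) {
            // Assumption: attribute name as proposed in the patch.
            String ref = atts.getValue("cellRef");
            if (ref != null) {
                // Insert empty strings for cells that were skipped before this one.
                int target = columnIndex(ref);
                while (currentRow.size() < target) {
                    currentRow.add("");
                }
            }
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("td".equals(localName) && text != null) {
            currentRow.add(text.toString());
            text = null;
        } else if ("tr".equals(localName) && currentRow != null) {
            rows.add(currentRow);
            currentRow = null;
        }
    }

    public List<List<String>> getRows() {
        return rows;
    }
}
```

This only fills gaps that occur before a reported cell. Missing cells at the end of a row would still need padding to the header width by the caller. When building the attribute, passing `""` for the namespace URI and `"CDATA"` for the type instead of `null` may be the safer choice, since `AttributesImpl` lookups by URI compare against the stored value.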
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira