Neil Blue created TIKA-1020:
-------------------------------
Summary: Excel 2010 parser missing cell values are not reported
resulting in missing columns values
Key: TIKA-1020
URL: https://issues.apache.org/jira/browse/TIKA-1020
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.2
Environment: java 1.6 & 1.7
Reporter: Neil Blue
When parsing an Excel 2010 table, if a worksheet has a missing cell value, that
cell is not reported to the SAX handler. As a result, a missing value can lead to
misaligned column data.
For example given the table:
{code:title=Bar.java|borderStyle=solid}
A B C
1 2 3
4 6
7 8 9
{code}
the SAX handler receives the following elements:
{code:title=Bar.java|borderStyle=solid}
<tr><td>A</td><td>B</td><td>C</td></tr>
<tr><td>1</td><td>2</td><td>3</td></tr>
<tr><td>4</td><td>6</td></tr>
<tr><td>7</td><td>8</td><td>9</td></tr>
{code}
As a result, the handler can detect that the third row has incomplete cell
values, but it is ambiguous which column is missing data.
As a possible fix, the Excel 2010 XML data contains the cell reference for each
cell, which could be passed to the SAX handler as an attribute on the td element.
{code:title=Bar.java|borderStyle=solid}
*** XSSFExcelExtractorDecorator.java 2012-11-08 10:51:55.881207100 +0000
--- XSSFExcelExtractorDecorator.java.1 2012-11-08 10:59:02.972223700 +0000
***************
*** 200,206 ****
public void cell(String cellRef, String formattedValue) {
try {
! xhtml.startElement("td");
// Main cell contents
xhtml.characters(formattedValue);
--- 200,208 ----
public void cell(String cellRef, String formattedValue) {
try {
! AttributesImpl attributes = new AttributesImpl();
! attributes.addAttribute(null, "cellRef", "cellRef", null, cellRef);
! xhtml.startElement("td",attributes);
// Main cell contents
xhtml.characters(formattedValue);
{code}
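To illustrate how a consumer might use such an attribute, here is a minimal sketch of a SAX handler that pads skipped columns with empty strings. It is not part of the patch: the class name CellRefRows and its helper methods are hypothetical, it assumes the attribute is named cellRef as in the diff above, and it parses a plain, non-namespaced XHTML string rather than real Tika output.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical consumer: rebuilds rows and fills column gaps using the cellRef attribute.
public class CellRefRows extends DefaultHandler {
    private final List<List<String>> rows = new ArrayList<List<String>>();
    private List<String> row;    // row currently being built, null outside <tr>
    private StringBuilder text;  // text of the current <td>, null outside <td>

    // Converts the column letters of an A1-style reference to a zero-based index ("C3" -> 2).
    public static int columnIndex(String cellRef) {
        int col = 0;
        for (int i = 0; i < cellRef.length(); i++) {
            char c = Character.toUpperCase(cellRef.charAt(i));
            if (c < 'A' || c > 'Z') {
                break; // reached the row number
            }
            col = col * 26 + (c - 'A' + 1);
        }
        if (col == 0) {
            throw new IllegalArgumentException("No column letters in: " + cellRef);
        }
        return col - 1;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("tr".equals(qName)) {
            row = new ArrayList<String>();
        } else if ("td".equals(qName) && row != null) {
            String ref = atts.getValue("cellRef"); // assumed attribute name
            if (ref != null) {
                int col = columnIndex(ref);
                while (row.size() < col) {
                    row.add(""); // placeholder for a cell the parser never reported
                }
            }
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("td".equals(qName) && text != null) {
            row.add(text.toString());
            text = null;
        } else if ("tr".equals(qName) && row != null) {
            rows.add(row);
            row = null;
        }
    }

    // Parses an XHTML fragment and returns the padded rows.
    public static List<List<String>> parse(String xhtml) {
        CellRefRows handler = new CellRefRows();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return handler.rows;
    }

    public static void main(String[] args) {
        // Assumes, for illustration only, that the two cells in the third row are A3 and C3.
        System.out.println(parse(
                "<table><tr><td cellRef=\"A3\">4</td><td cellRef=\"C3\">6</td></tr></table>"));
    }
}
```

With the attribute present, the consumer no longer has to guess which column was skipped. Cells without a cellRef attribute are simply appended in order, so the sketch also accepts the current output unchanged.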