[
https://issues.apache.org/jira/browse/TIKA-189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666449#action_12666449
]
Uwe Schindler commented on TIKA-189:
------------------------------------
Hi Georger,
Sorry, this was my fault (I only tested with a local OpenOffice file). The
problem in this issue is not whitespace; even the XHTML output looks wrong:
<table>
<tbody>
<tr> <td>NameEmailSanta [email protected]</td>
</tr>
<tr> <td>Tooth [email protected]</td>
</tr>
</tbody>
As you can see, it is not a whitespace problem: it seems that the ExcelExtractor
forgets to insert a new TD element between cells. I will investigate this a bit,
but I am not sure whether it is a POI bug or an error in the table cell loop.
The other small bug I found, in whitespace handling, is tracked as a new issue: TIKA-190
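
To illustrate the expected structure, here is a minimal sketch of a row loop that opens and closes one TD element per cell. It is not the actual ExcelExtractor code: the class and method names (CellRowSketch, rowToXhtml, escape) are hypothetical, and it uses only plain JDK strings rather than Tika's or POI's APIs.

```java
import java.util.Arrays;
import java.util.List;

public class CellRowSketch {

    // Hypothetical helper: escapes the characters that are special in XHTML text.
    public static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Emits one <td> element per cell, so adjacent cell values
    // are never merged into a single text run.
    public static String rowToXhtml(List<String> cells) {
        StringBuilder out = new StringBuilder("<tr>");
        for (String cell : cells) {
            out.append("<td>").append(escape(cell)).append("</td>");
        }
        return out.append("</tr>").toString();
    }

    public static void main(String[] args) {
        // Prints: <tr><td>Name</td><td>Email</td></tr>
        System.out.println(rowToXhtml(Arrays.asList("Name", "Email")));
    }
}
```

With the header row from the attached sheet, "Name" and "Email" end up in separate TD elements instead of being emitted as "NameEmail".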
> Text extraction from Excel files juxtaposes cells
> -------------------------------------------------
>
> Key: TIKA-189
> URL: https://issues.apache.org/jira/browse/TIKA-189
> Project: Tika
> Issue Type: Bug
> Components: general
> Affects Versions: 0.3
> Environment: Tika revision is svn-20090116, platform is Windows XP
> Pro SP3, JDK version is 1.6.0_06.
> Reporter: Georger Araújo
> Priority: Minor
> Attachments: no_cell_separators_when_extracted.zip, TIKA-189.patch
>
>
> I plan on using Tika to extract text from Excel (both .xls and .xlsx) files
> for indexing. However, I found that Tika juxtaposes cells on output. The example
> worksheets are in the attached .zip file.
> I took the time to run Apache POI and it does not have this bug, i.e. cells
> are properly separated.
> When I run
> --begin--
> java -jar tika-0.3-SNAPSHOT-standalone.jar --text
> no_cell_separators_when_extracted.xls
> --end--
> I get the following output:
> --begin--
> Plan1
> NameEmailSanta [email protected]
> Tooth [email protected]
> --end--
> The same thing happens with an .xlsx file:
> --begin--
> java -jar tika-0.3-SNAPSHOT-standalone.jar --text
> no_cell_separators_when_extracted.xlsx
> --end--
> The output is:
> --begin--
> [Content_Types].xml
> _rels/.rels
> xl/_rels/workbook.xml.rels
> xl/workbook.xml
> xl/theme/theme1.xml
> xl/worksheets/_rels/sheet1.xml.rels
> xl/worksheets/sheet2.xml
> xl/worksheets/sheet3.xml
> xl/sharedStrings.xml
> NameEmailSanta [email protected] [email protected]
> xl/styles.xml
> xl/worksheets/sheet1.xml
> 012345
> docProps/core.xml
> GeorgerGeorger2009-01-17T15:29:04Z2009-01-17T15:30:56Z
> docProps/app.xml
> Microsoft Excel0falsePlanilhas3Plan1Plan2Plan3falsefalsefalse12.0000
> --end--
> Also note that the values from docProps/app.xml have been juxtaposed as well.
> This way, after indexing these files using the output from Tika, a search
> engine will only find "Fairy" when substring matching is used, because "Tooth
> Fairy" becomes "Tooth [email protected]". This is suboptimal and wrong.
> Thanks for your attention. Best regards,
> Georger