Text extraction from Excel files juxtaposes cells
-------------------------------------------------
Key: TIKA-189
URL: https://issues.apache.org/jira/browse/TIKA-189
Project: Tika
Issue Type: Bug
Components: general
Affects Versions: 0.3
Environment: Tika revision is svn-20090116, platform is Windows XP Pro
SP3, JDK version is 1.6.0_06.
Reporter: Georger Rommel Ferreira de Araújo
Priority: Minor
I plan to use Tika to extract text from Excel files (both .xls and .xlsx) for
indexing. However, I found that Tika juxtaposes cells in its output: the text
of adjacent cells is emitted with no separator between them. The example
worksheets are in the attached .zip file.
I also ran Apache POI directly, and it does not have this bug, i.e. cells are
properly separated.
When I run
--begin--
java -jar tika-0.3-SNAPSHOT-standalone.jar --text no_cell_separators_when_extracted.xls
--end--
I get the following output:
--begin--
Plan1
NameEmailSanta [email protected]
Tooth [email protected]
--end--
The same thing happens with a .xlsx file:
--begin--
java -jar tika-0.3-SNAPSHOT-standalone.jar --text no_cell_separators_when_extracted.xlsx
--end--
The output is:
--begin--
[Content_Types].xml
_rels/.rels
xl/_rels/workbook.xml.rels
xl/workbook.xml
xl/theme/theme1.xml
xl/worksheets/_rels/sheet1.xml.rels
xl/worksheets/sheet2.xml
xl/worksheets/sheet3.xml
xl/sharedStrings.xml
NameEmailSanta [email protected] [email protected]
xl/styles.xml
xl/worksheets/sheet1.xml
012345
docProps/core.xml
GeorgerGeorger2009-01-17T15:29:04Z2009-01-17T15:30:56Z
docProps/app.xml
Microsoft Excel0falsePlanilhas3Plan1Plan2Plan3falsefalsefalse12.0000
--end--
Also note that the values from docProps/app.xml have been juxtaposed as well.
As a result, once these files are indexed from Tika's output, a search engine
will only find "Fairy" when substring matching is used, because "Tooth
Fairy" becomes "Tooth [email protected]". Whole-word searches on cell values
therefore fail, which makes the extracted text unsuitable for indexing.
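To illustrate the indexing problem, here is a minimal sketch in plain Java. It
does not use Tika or POI; the class name, the helper methods and the e-mail
address are hypothetical placeholders, and the whitespace tokenizer only
approximates what a real search engine would do.

```java
import java.util.Arrays;
import java.util.List;

public class CellSeparatorDemo {
    // Joins cell values with the given separator. An empty separator
    // reproduces the juxtaposed output described in this report.
    static String joinCells(List<String> cells, String separator) {
        return String.join(separator, cells);
    }

    // Splits text on runs of whitespace, roughly as a simple
    // indexing tokenizer would.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // Hypothetical row: the address is a placeholder, not the data
        // from the attached worksheets.
        List<String> row = Arrays.asList("Tooth Fairy", "tooth.fairy@example.com");

        // No separator: "Fairy" is fused with the next cell.
        System.out.println(tokenize(joinCells(row, "")));
        // prints [Tooth, Fairytooth.fairy@example.com]

        // Tab separator: "Fairy" survives as its own token.
        System.out.println(tokenize(joinCells(row, "\t")));
        // prints [Tooth, Fairy, tooth.fairy@example.com]
    }
}
```

With any whitespace separator between cells (a tab is used here only as an
example), the last word of one cell no longer merges with the first word of
the next, so an exact-term query for "Fairy" would match.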
Thanks for your attention. Best regards,
Georger