Text extraction from Excel files juxtaposes cells
-------------------------------------------------

                 Key: TIKA-189
                 URL: https://issues.apache.org/jira/browse/TIKA-189
             Project: Tika
          Issue Type: Bug
          Components: general
    Affects Versions: 0.3
         Environment: Tika revision is svn-20090116, platform is Windows XP Pro 
SP3, JDK version is 1.6.0_06.
            Reporter: Georger Rommel Ferreira de Araújo
            Priority: Minor


I plan to use Tika to extract text from Excel files (both .xls and .xlsx) for
indexing. However, I found that Tika juxtaposes cells in its output: adjacent
cell values are written with no separator between them. The example worksheets
are in the attached .zip file.
I also ran Apache POI directly, and it does not have this bug, i.e. cells are
properly separated.

When I run

--begin--
java -jar tika-0.3-SNAPSHOT-standalone.jar --text 
no_cell_separators_when_extracted.xls
--end--

I get the following output:

--begin--
Plan1
    NameEmailSanta [email protected]
    Tooth [email protected]
--end--

The same thing happens with a .xlsx file:

--begin--
java -jar tika-0.3-SNAPSHOT-standalone.jar --text 
no_cell_separators_when_extracted.xlsx
--end--

The output is:

--begin--
[Content_Types].xml



_rels/.rels



xl/_rels/workbook.xml.rels



xl/workbook.xml



xl/theme/theme1.xml



xl/worksheets/_rels/sheet1.xml.rels



xl/worksheets/sheet2.xml



xl/worksheets/sheet3.xml



xl/sharedStrings.xml
NameEmailSanta [email protected] [email protected]


xl/styles.xml



xl/worksheets/sheet1.xml
012345


docProps/core.xml
GeorgerGeorger2009-01-17T15:29:04Z2009-01-17T15:30:56Z


docProps/app.xml
Microsoft Excel0falsePlanilhas3Plan1Plan2Plan3falsefalsefalse12.0000
--end--

Also note that the values from docProps/app.xml have been juxtaposed as well.

As a result, once these files are indexed from Tika's output, a search engine
will only find "Fairy" when substring matching is used, because "Tooth Fairy"
is fused with the contents of the next cell and becomes
"Tooth [email protected]". The extracted text is therefore wrong, not merely
suboptimal.
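
To illustrate the indexing problem described above, here is a minimal Python
sketch. It is not Tika or POI code. The rows, the placeholder e-mail addresses,
and the helper names extract() and tokens() are all hypothetical, and a plain
whitespace tokenizer stands in for whatever analyzer the search engine
actually uses.

```python
# Hypothetical rows standing in for the attached worksheet; the names and
# e-mail addresses are placeholders, not the real cell values.
rows = [
    ["Name", "Email"],
    ["Santa Claus", "santa.claus@example.com"],
    ["Tooth Fairy", "tooth.fairy@example.com"],
]


def extract(rows, cell_separator):
    """Join the cells of each row with cell_separator, one row per line."""
    return "\n".join(cell_separator.join(row) for row in rows)


def tokens(text):
    """Whitespace tokenizer, a simple stand-in for a search engine's analyzer."""
    return set(text.split())


juxtaposed = tokens(extract(rows, ""))    # mimics the output reported here
separated = tokens(extract(rows, "\t"))   # cells kept apart by a tab

print("Fairy" in juxtaposed)  # False: "Fairy" is fused with the next cell
print("Fairy" in separated)   # True
```

With any whitespace character between cells, "Fairy" survives as its own
token, so an exact term query would match it.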

Thanks for your attention. Best regards,

Georger

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
