[jira] Issue Comment Edited: (TIKA-189) Text extraction from Excel files juxtaposes cells

Uwe Schindler (JIRA) Thu, 22 Jan 2009 09:16:24 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666216#action_12666216
 ]


thetaphi edited comment on TIKA-189 at 1/22/09 9:14 AM:
-------------------------------------------------------------

TIKA-188 (and before that TIKA-171) resolves this for XLS files. XLSX files are 
not supported by Tika at the moment, so they are handled as XML files and only 
the text parts are extracted (without extra whitespace).
XLS files are output using <table> with <td> tags. TIKA-188 adds an extension 
to the XHTMLContentHandler (see 
http://svn.apache.org/viewvc/lucene/tika/trunk/src/main/java/org/apache/tika/sax/XHTMLContentHandler.java?p2=%2Flucene%2Ftika%2Ftrunk%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Ftika%2Fsax%2FXHTMLContentHandler.java&p1=%2Flucene%2Ftika%2Ftrunk%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Ftika%2Fsax%2FXHTMLContentHandler.java&r1=734844&r2=734843&view=diff&pathrev=734844
 ) that automatically inserts newline (\n) and tab (\t) characters as SAX 
ignorableWhitespace events. Any handler that listens to both characters() and 
ignorableWhitespace() gets a usable text-only stream. 
TextContentHandler does this.
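
To illustrate the point about listening to both callbacks, here is a minimal 
sketch that uses only the JDK's SAX classes, not Tika itself. The class name 
TextCollector, the helper emitRow, the sample cell values, and the exact 
placement of the tab and newline events are assumptions made for this example; 
the actual event sequence produced by XHTMLContentHandler may differ.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Collects character data AND ignorable whitespace into one text stream. */
public class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    // Without this override the separators between cells would be dropped,
    // because DefaultHandler ignores ignorable whitespace by default.
    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    public String getText() {
        return text.toString();
    }

    // Hypothetical helper: replays the kind of events a table row might
    // produce (assumed here: a tab before each cell, a newline after each row).
    static void emitRow(ContentHandler h, String... cells) throws SAXException {
        AttributesImpl none = new AttributesImpl();
        h.startElement("", "tr", "tr", none);
        for (String cell : cells) {
            h.ignorableWhitespace(new char[] {'\t'}, 0, 1);
            h.startElement("", "td", "td", none);
            h.characters(cell.toCharArray(), 0, cell.length());
            h.endElement("", "td", "td");
        }
        h.endElement("", "tr", "tr");
        h.ignorableWhitespace(new char[] {'\n'}, 0, 1);
    }

    // Feeds two sample rows (placeholder values) through the collector.
    static String demo() throws SAXException {
        TextCollector collector = new TextCollector();
        emitRow(collector, "Name", "Email");
        emitRow(collector, "Tooth Fairy", "tooth@example.invalid");
        return collector.getText();
    }

    public static void main(String[] args) throws SAXException {
        System.out.print(demo());
    }
}
```

A handler that overrides only characters() would receive the same cell text 
but no separators, which is the juxtaposition described in this issue.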

> Text extraction from Excel files juxtaposes cells
> -------------------------------------------------
>
>                 Key: TIKA-189
>                 URL: https://issues.apache.org/jira/browse/TIKA-189
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 0.3
>         Environment: Tika revision is svn-20090116, platform is Windows XP 
> Pro SP3, JDK version is 1.6.0_06.
>            Reporter: Georger Rommel Ferreira de Araújo
>            Priority: Minor
>         Attachments: no_cell_separators_when_extracted.zip
>
>
> I plan on using Tika to extract text from Excel (both .xls and .xlsx) files 
> for indexing. However, I found that Tika juxtaposes cells on output. The 
> example worksheets are in the attached .zip file.
> I took the time to run Apache POI and it does not have this bug, i.e. cells 
> are properly separated.
> When I run
> --begin--
> java -jar tika-0.3-SNAPSHOT-standalone.jar --text 
> no_cell_separators_when_extracted.xls
> --end--
> I get the following output:
> --begin--
> Plan1
>     NameEmailSanta [email protected]
>     Tooth [email protected]
> --end--
> Same thing with a .xlsx file:
> --begin--
> java -jar tika-0.3-SNAPSHOT-standalone.jar --text 
> no_cell_separators_when_extracted.xlsx
> --end--
> The output is:
> --begin--
> [Content_Types].xml
> _rels/.rels
> xl/_rels/workbook.xml.rels
> xl/workbook.xml
> xl/theme/theme1.xml
> xl/worksheets/_rels/sheet1.xml.rels
> xl/worksheets/sheet2.xml
> xl/worksheets/sheet3.xml
> xl/sharedStrings.xml
> NameEmailSanta [email protected] [email protected]
> xl/styles.xml
> xl/worksheets/sheet1.xml
> 012345
> docProps/core.xml
> GeorgerGeorger2009-01-17T15:29:04Z2009-01-17T15:30:56Z
> docProps/app.xml
> Microsoft Excel0falsePlanilhas3Plan1Plan2Plan3falsefalsefalse12.0000
> --end--
> Also note that the values from docProps/app.xml have been juxtaposed as well.
> This way, after indexing these files using the output from Tika, a search 
> engine will only find "Fairy" when substring matching is used, because "Tooth 
> Fairy" becomes "Tooth [email protected]". This is suboptimal and wrong.
> Thanks for your attention. Best regards,
> Georger

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
