Re: [jira] Commented: (TIKA-105) Excel parser implementation based on POI's Event API

Jukka Zitting Wed, 26 Dec 2007 13:16:49 -0800

Hi,

On Dec 26, 2007 9:38 PM, Niall Pemberton <[EMAIL PROTECTED]> wrote:
> On Dec 26, 2007 7:19 PM, Keith R. Bennett <[EMAIL PROTECTED]> wrote:
> > When you say it includes the sheet name, you mean the name of each sheet
> > (tab) in the Excel file, right? Does it come out as bare text, or is it
> > encoded in a way that can be parsed (e.g. "{[Sheet: MySheet1]}")?  Or is
> > this configurable?
>
> Just plain text and not configurable ATM.


Having to use yet another parser on Tika output is something that we
should IMHO avoid as much as possible. A more reasonable way to make
the sheet structure available to clients that need it would be to use
the features of the XHTML output serialization.

How about something like this:

    <div class="sheet">
        <h1 class="sheet-title">....</h1>
        <p>...</p>
    </div>

or, if one wants to match Excel's screen representation more closely
(IMHO not a goal for Tika):

    <div class="sheet">
        <table>...</table>
        <p class="sheet-title">....</p>
    </div>

A client that needs the sheet content as structured data can then use
XPath queries like //div[@class='sheet'] or //*[@class='sheet-title']
to selectively extract the content of entire sheets or just their
titles.
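
As a rough illustration of the client side, here is a minimal sketch
that runs such a query with the JDK's javax.xml.xpath API. The class
name SheetTitles, the sample sheet names and the sample markup are all
made up for the example, and the input is assumed to be well-formed
XHTML without a namespace declaration:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical client code, not part of Tika.
public class SheetTitles {

    // Returns the text of every element that carries class="sheet-title".
    public static List<String> titles(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Matching on the class attribute only, so the query works whether
        // the title is serialized as <h1> or as <p>.
        NodeList nodes = (NodeList) xpath.evaluate(
                "//*[@class='sheet-title']", doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Invented sample output with two sheets.
        String xhtml = "<body>"
                + "<div class=\"sheet\"><h1 class=\"sheet-title\">Budget</h1>"
                + "<p>1 2 3</p></div>"
                + "<div class=\"sheet\"><h1 class=\"sheet-title\">Forecast</h1>"
                + "<p>4 5 6</p></div>"
                + "</body>";
        System.out.println(titles(xhtml)); // prints [Budget, Forecast]
    }
}
```

Replacing the expression with one that selects the class="sheet"
elements would return whole sheets instead of just their titles.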

> > We have a need to read Excel files with more structure than the usual
> > unstructured text document.  At minimum, it would be great to be able to
> > know where one sheet ends and the next begins.  Is this something
> > that would be appropriate to support, or does that go beyond the generic
> > unstructured text parsing mission of Tika?
>
> I'll leave that for the Tika devs to comment on.

One of the stated goals for Tika is to support not only unstructured
but also structured text extraction. This goal was discussed at the
search roundtable in Amsterdam (see the followup thread at
http://markmail.org/message/ggihw2cns53t6ayl) and implemented on the
Parser API level by making the parsers output XHTML SAX events instead
of character streams (see TIKA-53).
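
To make that concrete, here is a minimal sketch of how a parser could
emit the sheet markup proposed above as SAX events. It is not the
actual Tika or TIKA-105 code: the class SheetEvents and its helper
methods are invented for the example, the XHTML namespace is left out
for brevity, and the JDK's identity TransformerHandler is used only to
turn the events back into text for inspection:

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical sketch, not the Tika implementation.
public class SheetEvents {

    // Opens an element, optionally with a class attribute.
    static void start(ContentHandler h, String name, String cssClass)
            throws SAXException {
        AttributesImpl attrs = new AttributesImpl();
        if (cssClass != null) {
            attrs.addAttribute("", "class", "class", "CDATA", cssClass);
        }
        h.startElement("", name, name, attrs);
    }

    // Sends a string as a single characters() event.
    static void text(ContentHandler h, String s) throws SAXException {
        h.characters(s.toCharArray(), 0, s.length());
    }

    // Emits one sheet: a div with a title heading and a text paragraph.
    static void emitSheet(ContentHandler h, String title, String content)
            throws SAXException {
        start(h, "div", "sheet");
        start(h, "h1", "sheet-title");
        text(h, title);
        h.endElement("", "h1", "h1");
        start(h, "p", null);
        text(h, content);
        h.endElement("", "p", "p");
        h.endElement("", "div", "div");
    }

    // Serializes the events for one sheet into a string.
    public static String render(String title, String content) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer()
                .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));
        handler.startDocument();
        emitSheet(handler, title, content);
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render("Budget", "1 2 3"));
    }
}
```

Since emitSheet only depends on ContentHandler, a client can pass in
any handler it likes, be it a serializer, a DOM builder, or a handler
that just collects the text and ignores the structure.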

Note however that the goal here is not to make Tika replace the native
Parser APIs, just produce structured enough output to satisfy the
needs of typical Tika clients.

I think Keith's need to distinguish sheet boundaries is within the
scope of Tika, but if one for example wants to find out detailed cell
formatting information they should instead be looking at the
underlying POI APIs.

BR,

Jukka Zitting
