Hi,

On Thu, Mar 20, 2008 at 4:11 AM, Niall Pemberton
<[EMAIL PROTECTED]> wrote:
> On Wed, Mar 19, 2008 at 4:58 PM, Jukka Zitting <[EMAIL PROTECTED]> wrote:
>  >  I was looking at implementing link extraction for Excel files, and
>  >  found out that the link information is only available at the end of
>  >  the file as a special "cell X links to URI Y" record. The parser could
>
>  It's probably academic, but I believe they come at the end of each
>  sheet, rather than at the end of the file.

You're right, good point!

PDF parsing can typically be streamed one page at a time, i.e. you
need to parse a whole page before you can render its output. We might
want to consider doing the same for Excel sheets:

How about if the streaming Excel parser maintained a sparse in-memory
table of the contents of the sheet that is currently being parsed and
would only generate the respective SAX events once the sheet has been
parsed? Since we can focus on only the information that's relevant to
Tika clients, the memory requirements should be moderate even for huge
sheets (i.e. much less than the file size even for a single-sheet
file). This should satisfy the low memory footprint requirements
reasonably well while allowing us to generate more accurate output.
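
To make the idea concrete, here is a rough sketch of such a per-sheet
buffer. It is only an illustration under my own assumptions: the
SheetBuffer class and its addText/addLink/flush methods are hypothetical
names, not existing Tika or POI API, and the XHTML table output is just
one possible shape for the events. The record listener would call
addText for each cell record, addLink for each hyperlink record, and
flush when the sheet ends.

```java
import java.util.TreeMap;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

/** Hypothetical per-sheet buffer; not part of Tika or POI. */
class SheetBuffer {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** One buffered cell: its text and, optionally, the URI it links to. */
    private static final class Cell {
        String text = "";
        String link; // stays null unless a hyperlink record names this cell
    }

    // Sparse table: only cells that were actually seen are stored,
    // kept sorted by row index and then by column index.
    private final TreeMap<Integer, TreeMap<Integer, Cell>> rows = new TreeMap<>();

    private Cell cell(int row, int col) {
        return rows.computeIfAbsent(row, r -> new TreeMap<>())
                   .computeIfAbsent(col, c -> new Cell());
    }

    /** Called for each cell record as it streams past. */
    void addText(int row, int col, String text) {
        cell(row, col).text = text;
    }

    /** Called for each hyperlink record found at the end of the sheet. */
    void addLink(int row, int col, String uri) {
        cell(row, col).link = uri;
    }

    /** Called at end of sheet: emit the buffered cells, then forget them. */
    void flush(ContentHandler out) throws SAXException {
        AttributesImpl none = new AttributesImpl();
        out.startElement(XHTML, "table", "table", none);
        for (TreeMap<Integer, Cell> row : rows.values()) {
            out.startElement(XHTML, "tr", "tr", none);
            for (Cell c : row.values()) {
                out.startElement(XHTML, "td", "td", none);
                if (c.link != null) {
                    AttributesImpl href = new AttributesImpl();
                    href.addAttribute("", "href", "href", "CDATA", c.link);
                    out.startElement(XHTML, "a", "a", href);
                }
                out.characters(c.text.toCharArray(), 0, c.text.length());
                if (c.link != null) {
                    out.endElement(XHTML, "a", "a");
                }
                out.endElement(XHTML, "td", "td");
            }
            out.endElement(XHTML, "tr", "tr");
        }
        out.endElement(XHTML, "table", "table");
        rows.clear(); // memory is bounded by one sheet at a time
    }
}
```

Since only the cell text and link targets are kept, the buffer holds
nothing about formatting, formulas or styles, which is where the memory
savings would come from.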

>  I didn't think link support was in the latest POI release and was only
>  added a few weeks ago:
>  
> http://svn.apache.org/viewvc/poi/trunk/src/java/org/apache/poi/hssf/record/HyperlinkRecord.java
>
>  Not trying to make any point, just wondering whether I got this wrong
>  or you found another way or you tried the latest POI from svn?

I'm using POI trunk.

>  I think a low-memory-footprint parser still has value, despite this
>  drawback - I'm pretty sure that where I work lack of hyperlink support
>  is not an issue. Is there not room for two implementations in Tika?

There certainly is; my main concerns are just the duplicated maintenance
effort and the added configuration complexity.

Would the above sheet-by-sheet streaming option work for your
requirements? Alternatively, we could avoid much duplication by making
the sheet-by-sheet feature a configurable mode of the normal streaming
Excel parser instead of using a separate parser class.

BR,

Jukka Zitting
