On Thu, Mar 20, 2008 at 2:56 AM, Jukka Zitting <[EMAIL PROTECTED]> wrote:
> Hi,
>
>
>  On Thu, Mar 20, 2008 at 4:11 AM, Niall Pemberton
>  <[EMAIL PROTECTED]> wrote:
>  > On Wed, Mar 19, 2008 at 4:58 PM, Jukka Zitting <[EMAIL PROTECTED]> wrote:
>
>  >  >  I was looking at implementing link extraction for Excel files, and
>  >  >  found out that the link information is only available at the end of
>  >  >  the file as a special "cell X links to URI Y" record. The parser could
>  >
>  >  It's probably academic, but I believe they come at the end of each
>  >  sheet, rather than file.
>
>  You're right, good point!
>
>  PDF parsing can typically be streamed one page at a time, i.e. you
>  need to parse a whole page to be able to render the output, and this
>  is something we might want to consider doing also for Excel sheets:
>
>  How about if the streaming Excel parser maintained a sparse in-memory
>  table of the contents of the sheet that is currently being parsed and
>  would only generate the respective SAX events once the sheet has been
>  parsed? Since we can focus on only the information that's relevant to
>  Tika clients, the memory requirements should be moderate even for huge
>  sheets (i.e. much less than the file size even for a single-sheet
>  file). This should satisfy the low memory footprint requirements
>  reasonably well while allowing us to generate more accurate output.
>
>
>  >  I didn't think link support was in the latest POI release and was only
>  >  added a few weeks ago:
>  >  http://svn.apache.org/viewvc/poi/trunk/src/java/org/apache/poi/hssf/record/HyperlinkRecord.java
>  >
>  >  Not trying to make any point, just wondering whether I got this wrong
>  >  or you found another way or you tried the latest POI from svn?
>
>  I'm using POI trunk.
>
>
>  >  I think a low-memory-footprint parser still has value, despite this
>  >  drawback - I'm pretty sure that where I work lack of hyperlink support
>  >  is not an issue. Is there not room for two implementations in Tika?
>
>  There certainly is, my main concerns are just the duplicate maintenance
>  effort and the added configuration complexity.
>
>  Would the above sheet-by-sheet streaming option work for your
>  requirements?

Sounds good to me. I'll put a patch together.
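
For reference, here is a rough sketch of the sheet-by-sheet buffering idea
described above. It is only an illustration under stated assumptions, not the
actual patch: the class name SheetBuffer and its methods are hypothetical and
are not part of Tika or POI, and plain strings stand in for the SAX events a
real parser would send to a ContentHandler (no escaping is done). Cells are
kept in a sparse sorted map while the sheet is read, hyperlinks are attached
when they arrive at the end of the sheet, and everything is emitted and
discarded once the sheet is complete.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical per-sheet buffer for illustration; not a Tika or POI class. */
final class SheetBuffer {
    // Sparse storage: row -> (column -> text). Only non-empty cells use memory.
    private final TreeMap<Integer, TreeMap<Integer, String>> cells = new TreeMap<>();
    // Hyperlinks show up after the cell records, keyed the same way.
    private final TreeMap<Integer, TreeMap<Integer, String>> links = new TreeMap<>();

    void addCell(int row, int col, String text) {
        cells.computeIfAbsent(row, r -> new TreeMap<>()).put(col, text);
    }

    void addLink(int row, int col, String uri) {
        links.computeIfAbsent(row, r -> new TreeMap<>()).put(col, uri);
    }

    /**
     * Called once the whole sheet has been read. Emits the buffered cells in
     * row/column order (strings stand in for SAX events), then clears the
     * buffer so the next sheet starts empty. A link on a cell that has no
     * text is dropped in this sketch.
     */
    List<String> flush() {
        List<String> events = new ArrayList<>();
        for (Map.Entry<Integer, TreeMap<Integer, String>> row : cells.entrySet()) {
            events.add("<tr>");
            TreeMap<Integer, String> rowLinks =
                    links.getOrDefault(row.getKey(), new TreeMap<>());
            for (Map.Entry<Integer, String> cell : row.getValue().entrySet()) {
                String uri = rowLinks.get(cell.getKey());
                events.add(uri == null
                        ? "<td>" + cell.getValue() + "</td>"
                        : "<td><a href=\"" + uri + "\">" + cell.getValue() + "</a></td>");
            }
            events.add("</tr>");
        }
        cells.clear();
        links.clear();
        return events;
    }
}
```

Memory use is bounded by the number of non-empty cells in the current sheet
rather than by the file size, since the buffer is cleared after each flush.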

Niall

> Alternatively, we could avoid much duplication by making
>  the sheet-by-sheet feature a configurable mode of the normal streaming
>  Excel parser instead of using a separate parser class.
>
>  BR,
>
>  Jukka Zitting
>
