On Thu, Mar 20, 2008 at 2:56 AM, Jukka Zitting <[EMAIL PROTECTED]> wrote:
> Hi,
>
> On Thu, Mar 20, 2008 at 4:11 AM, Niall Pemberton
> <[EMAIL PROTECTED]> wrote:
> > On Wed, Mar 19, 2008 at 4:58 PM, Jukka Zitting <[EMAIL PROTECTED]> wrote:
> > > I was looking at implementing link extraction for Excel files, and
> > > found out that the link information is only available at the end of
> > > the file as a special "cell X links to URI Y" record. The parser could
> >
> > It's probably academic, but I believe they come at the end of each
> > sheet, rather than file.
>
> You're right, good point!
>
> PDF parsing can typically be streamed one page at a time, i.e. you
> need to parse a whole page to be able to render the output, and this
> is something we might want to consider doing also for Excel sheets:
>
> How about if the streaming Excel parser maintained a sparse in-memory
> table of the contents of the sheet that is currently being parsed and
> would only generate the respective SAX events once the sheet has been
> parsed? Since we can focus on only the information that's relevant to
> Tika clients, the memory requirements should be moderate even for huge
> sheets (i.e. much less than the file size even for a single-sheet
> file). This should satisfy the low memory footprint requirements
> reasonably well while allowing us to generate more accurate output.
>
> > I didn't think link support was in the latest POI release and was only
> > added a few weeks ago:
> >
> > http://svn.apache.org/viewvc/poi/trunk/src/java/org/apache/poi/hssf/record/HyperlinkRecord.java
> >
> > Not trying to make any point, just wondering whether I got this wrong
> > or you found another way or you tried the latest POI from svn?
>
> I'm using POI trunk.
>
> > I think a low-memory-footprint parser still has value, despite this
> > drawback - I'm pretty sure that where I work lack of hyperlink support
> > is not an issue. Is there not room for two implementations in Tika?
>
> There certainly is, my main concerns are just the duplicate maintenance
> effort and the added configuration complexity.
>
> Would the above sheet-by-sheet streaming option work for your
> requirements?

Sounds good to me. I'll put a patch together.

Niall

> Alternatively, we could avoid much duplication by making
> the sheet-by-sheet feature a configurable mode of the normal streaming
> Excel parser instead of using a separate parser class.
>
> BR,
>
> Jukka Zitting
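
For illustration only, the sheet-by-sheet idea from the thread (buffer the current sheet sparsely, attach the link records that arrive at the end of the sheet, then emit SAX events) might look roughly like the sketch below. The class name SheetBuffer, its methods, and the table/tr/td/a element names are hypothetical and not taken from Tika or POI; only the org.xml.sax types are real API. Feeding the buffer from POI records is left out.

```java
import java.util.Map;
import java.util.TreeMap;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

/** Hypothetical sparse buffer for the sheet that is currently being parsed. */
class SheetBuffer {
    // row -> (column -> text); only non-empty cells are stored
    private final TreeMap<Integer, TreeMap<Integer, String>> cells = new TreeMap<>();
    // row -> (column -> URI); filled from link records seen after the cells
    private final TreeMap<Integer, TreeMap<Integer, String>> links = new TreeMap<>();

    void addCell(int row, int col, String text) {
        cells.computeIfAbsent(row, r -> new TreeMap<>()).put(col, text);
    }

    void addLink(int row, int col, String uri) {
        links.computeIfAbsent(row, r -> new TreeMap<>()).put(col, uri);
    }

    /**
     * Called at the end of a sheet: emits the buffered cells as SAX events
     * in row/column order, then clears the buffer for the next sheet.
     * Links that point at cells without text are ignored in this sketch.
     */
    void flush(ContentHandler handler) throws SAXException {
        AttributesImpl none = new AttributesImpl();
        handler.startElement("", "table", "table", none);
        for (Map.Entry<Integer, TreeMap<Integer, String>> row : cells.entrySet()) {
            handler.startElement("", "tr", "tr", none);
            Map<Integer, String> rowLinks =
                    links.getOrDefault(row.getKey(), new TreeMap<>());
            for (Map.Entry<Integer, String> cell : row.getValue().entrySet()) {
                handler.startElement("", "td", "td", none);
                String uri = rowLinks.get(cell.getKey());
                if (uri != null) {
                    AttributesImpl href = new AttributesImpl();
                    href.addAttribute("", "href", "href", "CDATA", uri);
                    handler.startElement("", "a", "a", href);
                }
                char[] text = cell.getValue().toCharArray();
                handler.characters(text, 0, text.length);
                if (uri != null) {
                    handler.endElement("", "a", "a");
                }
                handler.endElement("", "td", "td");
            }
            handler.endElement("", "tr", "tr");
        }
        handler.endElement("", "table", "table");
        cells.clear();
        links.clear();
    }
}
```

In this sketch memory use grows with the number of non-empty cells in one sheet rather than with the file size, which is the trade-off described above. A streaming parser would call addCell and addLink as records arrive and flush when it sees the end of the sheet.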