On Thu, Mar 20, 2008 at 5:05 PM, Niall Pemberton <[EMAIL PROTECTED]> wrote:
>
> On Thu, Mar 20, 2008 at 2:56 AM, Jukka Zitting <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > On Thu, Mar 20, 2008 at 4:11 AM, Niall Pemberton
> > <[EMAIL PROTECTED]> wrote:
> > > On Wed, Mar 19, 2008 at 4:58 PM, Jukka Zitting <[EMAIL PROTECTED]> wrote:
> > > > I was looking at implementing link extraction for Excel files, and
> > > > found out that the link information is only available at the end of
> > > > the file as a special "cell X links to URI Y" record. The parser could
> > >
> > > It's probably academic, but I believe they come at the end of each
> > > sheet, rather than file.
> >
> > You're right, good point!
> >
> > PDF parsing can typically be streamed one page at a time, i.e. you
> > need to parse a whole page to be able to render the output, and this
> > is something we might want to consider doing also for Excel sheets:
> >
> > How about if the streaming Excel parser maintained a sparse in-memory
> > table of the contents of the sheet that is currently being parsed and
> > would only generate the respective SAX events once the sheet has been
> > parsed? Since we can focus on only the information that's relevant to
> > Tika clients, the memory requirements should be moderate even for huge
> > sheets (i.e. much less than the file size even for a single-sheet
> > file). This should satisfy the low memory footprint requirements
> > reasonably well while allowing us to generate more accurate output.
> >
> > > I didn't think link support was in the latest POI release and was only
> > > added a few weeks ago:
> > > http://svn.apache.org/viewvc/poi/trunk/src/java/org/apache/poi/hssf/record/HyperlinkRecord.java
> > >
> > > Not trying to make any point, just wondering whether I got this wrong
> > > or you found another way or you tried the latest POI from svn?
> >
> > I'm using POI trunk.
> >
> > > I think a low-memory-footprint parser still has value, despite this
> > > drawback - I'm pretty sure that where I work lack of hyperlink support
> > > is not an issue. Is there not room for two implementations in Tika?
> >
> > There certainly is, my main concerns are just the duplicate maintenance
> > effort and the added configuration complexity.
> >
> > Would the above sheet-by-sheet streaming option work for your
> > requirements?
>
> Sounds good to me. I'll put a patch together.

I've created a JIRA ticket and attached a patch:

https://issues.apache.org/jira/browse/TIKA-132

Suggestions are welcome. If you don't like how it resolves this, I can work up another patch.

Niall

> Niall
>
> >
> > Alternatively, we could avoid much duplication by making
> > the sheet-by-sheet feature a configurable mode of the normal streaming
> > Excel parser instead of using a separate parser class.
> >
> > BR,
> >
> > Jukka Zitting
> >
>
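
For illustration only, the sheet-by-sheet buffering discussed in this thread could be sketched roughly as below. This is not the code attached to TIKA-132; the class and method names (SheetBuffer, addCell, addLink, flush) are hypothetical, and plain strings stand in for the SAX events a real parser would send to a ContentHandler. The sketch assumes cell records arrive first and the "cell X links to URI Y" records arrive at the end of each sheet. It does no XML escaping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch: none of these names come from Tika or POI.
final class SheetBuffer {

    /** One buffered cell: its text plus an optional link target. */
    static final class Cell {
        final String text;
        String link; // stays null unless a link record for this cell arrives

        Cell(String text) {
            this.text = text;
        }
    }

    // Sparse table: only non-empty cells are stored, ordered by row, then column.
    private final TreeMap<Integer, TreeMap<Integer, Cell>> rows = new TreeMap<>();

    /** Called for each cell record as the sheet is streamed. */
    void addCell(int row, int col, String text) {
        rows.computeIfAbsent(row, r -> new TreeMap<>()).put(col, new Cell(text));
    }

    /** Called for each link record; links to cells that were never buffered are ignored. */
    void addLink(int row, int col, String uri) {
        TreeMap<Integer, Cell> cells = rows.get(row);
        Cell cell = (cells == null) ? null : cells.get(col);
        if (cell != null) {
            cell.link = uri;
        }
    }

    /** Called at the end of a sheet: renders the buffered rows, then clears the buffer. */
    List<String> flush() {
        List<String> out = new ArrayList<>();
        for (TreeMap<Integer, Cell> cells : rows.values()) {
            StringBuilder sb = new StringBuilder("<tr>");
            for (Cell c : cells.values()) {
                sb.append("<td>");
                if (c.link != null) {
                    sb.append("<a href=\"").append(c.link).append("\">")
                      .append(c.text).append("</a>");
                } else {
                    sb.append(c.text);
                }
                sb.append("</td>");
            }
            out.add(sb.append("</tr>").toString());
        }
        rows.clear(); // memory is held for one sheet at a time only
        return out;
    }
}
```

Because the buffer is cleared on every flush, peak memory depends on the largest single sheet rather than on the whole workbook, which is the trade-off proposed above.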