[ 
https://issues.apache.org/jira/browse/TIKA-132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582432#action_12582432
 ] 

Jukka Zitting commented on TIKA-132:
------------------------------------

I'm now done streamlining the class. Most notably, I extracted the 
TikaExcelCell class into the Cell interface, with the implementation 
classes TextCell and NumberCell and the LinkedCell decorator. These classes 
have no dependencies on Excel parsing, so other parser implementations could 
also use them for similar page-by-page rendering. I'll follow up with another 
issue to generalize the Cell classes.
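
For readers who have not looked at the patch, the shape of the Cell 
abstraction can be sketched roughly as follows. This is only an illustration 
and not the committed code: the render(StringBuilder) signature, the 
CellSketch wrapper class and the number formatting rule are assumptions made 
to keep the example self-contained, and no HTML escaping is done.

```java
// Illustrative sketch only; names other than Cell, TextCell, NumberCell and
// LinkedCell are hypothetical, and so is the render(StringBuilder) signature.
final class CellSketch {

    /** A cell is anything that knows how to render its own content. */
    interface Cell {
        void render(StringBuilder out);
    }

    /** Cell holding plain text. */
    static final class TextCell implements Cell {
        private final String text;

        TextCell(String text) {
            this.text = text;
        }

        public void render(StringBuilder out) {
            out.append(text);
        }
    }

    /** Cell holding a numeric value. */
    static final class NumberCell implements Cell {
        private final double number;

        NumberCell(double number) {
            this.number = number;
        }

        public void render(StringBuilder out) {
            // Assumed formatting rule: whole numbers print without a fraction.
            if (number == Math.rint(number)) {
                out.append((long) number);
            } else {
                out.append(number);
            }
        }
    }

    /** Decorator that wraps any other cell in a hyperlink. */
    static final class LinkedCell implements Cell {
        private final Cell cell;
        private final String link;

        LinkedCell(Cell cell, String link) {
            this.cell = cell;
            this.link = link;
        }

        public void render(StringBuilder out) {
            out.append("<a href=\"").append(link).append("\">");
            cell.render(out); // delegate to the wrapped cell
            out.append("</a>");
        }
    }

    /** Convenience helper that renders a single cell to a string. */
    static String render(Cell cell) {
        StringBuilder out = new StringBuilder();
        cell.render(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Cell linked = new LinkedCell(new TextCell("Tika"), "http://example.com/");
        System.out.println(render(linked));
        System.out.println(render(new NumberCell(42.5)));
    }
}
```

Because LinkedCell only wraps another Cell, a hyperlink can be attached to a 
text or number cell after the cell itself has been created, which is what the 
late hyperlink records require.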

I'll leave this issue open until POI releases the next version with hyperlink 
support.

> Refactor Excel extractor to parse per sheet and add hyperlink support
> ---------------------------------------------------------------------
>
>                 Key: TIKA-132
>                 URL: https://issues.apache.org/jira/browse/TIKA-132
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.1-incubating
>            Reporter: Niall Pemberton
>            Priority: Minor
>         Attachments: TIKA-132-ExcelExtractor-refactor-v2.patch
>
>
> In the Excel record stream, hyperlink records come at the end of the sheet, 
> after the cell value records. This is a problem for the current streaming 
> implementation of the Excel parser, since it means the hyperlink cannot be 
> output when a cell is being processed.
> Jukka suggested the following on the mailing list:
> "How about if the streaming Excel parser maintained a sparse in-memory table 
> of the contents of the sheet that is currently being parsed and would only 
> generate the respective SAX events once the sheet has been parsed? Since we 
> can focus on only the information that's relevant to Tika clients, the memory 
> requirements should be moderate even for huge sheets (i.e. much less than the 
> file size even for a single-sheet file). This should satisfy the low memory 
> footprint requirements reasonably well while allowing us to generate more 
> accurate output."
> See here: http://tika.markmail.org/message/ac3kgujkcrgqyb4i
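
The buffering idea quoted above can be sketched along these lines. It is a 
minimal illustration under stated assumptions, not the parser's actual code: 
SheetBufferSketch and its addCell, addHyperlink and endSheet methods are 
hypothetical names, cells are reduced to plain strings, and the output is a 
string instead of SAX events.

```java
import java.util.TreeMap;

// Illustrative sketch only; all names here are hypothetical.
final class SheetBufferSketch {

    // Sparse table: row -> (column -> cell text). Only non-empty cells are
    // stored, so memory use follows the content rather than the sheet size.
    private final TreeMap<Integer, TreeMap<Integer, String>> rows = new TreeMap<>();

    /** Called for each cell value record, in whatever order records arrive. */
    void addCell(int row, int column, String text) {
        rows.computeIfAbsent(row, r -> new TreeMap<>()).put(column, text);
    }

    /** Called for each hyperlink record, which arrives after the cell values. */
    void addHyperlink(int row, int column, String url) {
        TreeMap<Integer, String> cells = rows.get(row);
        if (cells != null && cells.containsKey(column)) {
            String text = cells.get(column);
            cells.put(column, "<a href=\"" + url + "\">" + text + "</a>");
        }
    }

    /** Called at the end of the sheet: emit everything in order, then reset. */
    String endSheet() {
        StringBuilder out = new StringBuilder("<table>");
        for (TreeMap<Integer, String> cells : rows.values()) {
            out.append("<tr>");
            for (String text : cells.values()) {
                out.append("<td>").append(text).append("</td>");
            }
            out.append("</tr>");
        }
        out.append("</table>");
        rows.clear(); // release the buffered sheet before the next one starts
        return out.toString();
    }

    public static void main(String[] args) {
        SheetBufferSketch sheet = new SheetBufferSketch();
        sheet.addCell(0, 0, "Name");
        sheet.addCell(0, 1, "Tika");
        sheet.addHyperlink(0, 1, "http://example.com/");
        System.out.println(sheet.endSheet());
    }
}
```

The sorted maps keep rows and columns in order even if records arrive out of 
order, and clearing the table at the end of each sheet means at most one 
sheet's worth of cells is held in memory at a time.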

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
