[ 
https://issues.apache.org/jira/browse/TIKA-132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niall Pemberton updated TIKA-132:
---------------------------------

    Attachment:     (was: TIKA-132-ExcelExtractor-refactor-v1.patch)

> Refactor Excel extractor to parse per sheet and add hyperlink support
> ---------------------------------------------------------------------
>
>                 Key: TIKA-132
>                 URL: https://issues.apache.org/jira/browse/TIKA-132
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.1-incubating
>            Reporter: Niall Pemberton
>            Priority: Minor
>         Attachments: TIKA-132-ExcelExtractor-refactor-v2.patch
>
>
> In the Excel record stream, hyperlink records come at the end of the sheet, 
> after the cell value records. This is a problem for the current streaming 
> implementation of the Excel parser, since the hyperlink is not yet known 
> when a cell is being processed and so cannot be output with it.
> Jukka suggested the following on the mailing list:
> "How about if the streaming Excel parser maintained a sparse in-memory table 
> of the contents of the sheet that is currently being parsed and would only 
> generate the respective SAX events once the sheet has been parsed? Since we 
> can focus on only the information that's relevant to Tika clients, the memory 
> requirements should be moderate even for huge sheets (i.e. much less than the 
> file size even for a single-sheet file). This should satisfy the low memory 
> footprint requirements reasonably well while allowing us to generate more 
> accurate output."
> See here: http://tika.markmail.org/message/ac3kgujkcrgqyb4i
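
The buffering idea quoted above could look roughly like the sketch below. It
is only an illustration under stated assumptions, not the code in the attached
patch: the class name SparseSheetBuffer and its methods are hypothetical, and
plain strings stand in for the SAX events a real parser would send to a
ContentHandler. Cell text is stored as records stream in, hyperlinks are
attached when their records arrive at the end of the sheet, and output is
generated only when the sheet is complete.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical per-sheet buffer; names are illustrative, not Tika API. */
class SparseSheetBuffer {

    /** Content kept for one cell: its text and, optionally, a link target. */
    private static final class Cell {
        String text = "";
        String href; // stays null unless a hyperlink record names this cell
    }

    // Sparse table: row -> (column -> cell). Only populated cells use memory,
    // and TreeMap keeps rows and columns in ascending order for output.
    private final Map<Integer, Map<Integer, Cell>> rows = new TreeMap<>();

    private Cell cell(int row, int col) {
        return rows.computeIfAbsent(row, r -> new TreeMap<>())
                   .computeIfAbsent(col, c -> new Cell());
    }

    /** Called while cell value records are streamed. */
    void addText(int row, int col, String text) {
        cell(row, col).text = text;
    }

    /** Called when a hyperlink record arrives after the cell values. */
    void addHyperlink(int row, int col, String href) {
        cell(row, col).href = href;
    }

    /** Called at end of sheet: emits the buffered cells, then clears them. */
    List<String> flush() {
        List<String> events = new ArrayList<>();
        for (Map<Integer, Cell> row : rows.values()) {
            events.add("<tr>");
            for (Cell c : row.values()) {
                // Strings are placeholders for SAX events; no escaping is done.
                events.add(c.href == null
                        ? "<td>" + c.text + "</td>"
                        : "<td><a href=\"" + c.href + "\">" + c.text + "</a></td>");
            }
            events.add("</tr>");
        }
        rows.clear(); // release memory before the next sheet is parsed
        return events;
    }
}
```

Because the buffer is cleared after every sheet, peak memory in this sketch
is bounded by the largest single sheet's populated cells rather than by the
whole workbook.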

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
