Sam H created TIKA-1840:
---------------------------

             Summary: No way to link slide notes to slide in PPT output.
                 Key: TIKA-1840
                 URL: https://issues.apache.org/jira/browse/TIKA-1840
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.11
            Reporter: Sam H


I'm integrating Apache Tika into my project, and I want to extract text 
from PowerPoint slides, in both PPT and PPTX formats.

I've noticed that for PPT files, the slide notes are all aggregated at the 
end of the XML output, and there is no way to identify which note belongs to 
which slide.

I began looking at the code and found the following:

{code}
// TODO Find the Notes for this slide and extract inline
{code}
in 
[HSLFExtractor.java|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java]
 on line 140.

I would like to implement this part and contribute the change.
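
To make the requested output shape concrete, here is a rough sketch of what emitting notes inline could look like. It is only an illustration under stated assumptions: the Slide class, the render method, and the "slide-content" and "slide-notes" class names are hypothetical stand-ins, not the actual POI or Tika types or the markup HSLFExtractor produces.

```java
import java.util.Arrays;
import java.util.List;

public class InlineNotesSketch {

    /** Hypothetical stand-in for a parsed slide; not a POI or Tika type. */
    static final class Slide {
        final String text;
        final String notes; // null when the slide has no notes

        Slide(String text, String notes) {
            this.text = text;
            this.notes = notes;
        }
    }

    /**
     * Emits one div per slide and nests that slide's notes inside it,
     * so each note stays attached to its slide.
     * Escaping of markup characters is omitted to keep the sketch short.
     */
    static String render(List<Slide> slides) {
        StringBuilder out = new StringBuilder();
        for (Slide slide : slides) {
            out.append("<div class=\"slide-content\">");
            out.append("<p>").append(slide.text).append("</p>");
            if (slide.notes != null && !slide.notes.isEmpty()) {
                out.append("<div class=\"slide-notes\"><p>")
                   .append(slide.notes)
                   .append("</p></div>");
            }
            out.append("</div>\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Slide> deck = Arrays.asList(
                new Slide("Agenda", "Keep this under two minutes."),
                new Slide("Results", null));
        System.out.print(render(deck));
    }
}
```

With this layout a consumer can find the notes for a slide by looking inside that slide's own element, instead of matching an aggregated block at the end of the document by position.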




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
