[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217407#comment-14217407 ]
Lewis John McGibbney commented on TIKA-1445:
--------------------------------------------

OK, so in Any23, take the following example, which focuses on a *single document extraction* (0). For any given document, when we run the extraction (1) we:
* from all registered extractors, filter the extractors by MimeType (2)
* from all matching extractors for the given MimeType, create the extractor (3)
* loop through the matching extractors and actually run (4) each extractor on the local document source, as an InputStream (5) for instance.

We also have Extraction Context and Extraction Reporting layers within Any23 which may be of use to Tika. To be honest, I find the report and context objects extremely useful for obtaining metrics from extraction... maybe we could do the same for Tika?

There are some improvements which could be made to SingleDocumentExtraction within Any23; however, that conversation is not relevant here. Hopefully this high-level overview of the chaining extraction algorithm within Any23 is of some value to this conversation.
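
To make the three steps above concrete, here is a rough, self-contained sketch of the same filter / create / run chain. It is not the Any23 code: the `Extractor`, `ExtractorFactory` and `Report` types, and the `run` and `factory` helpers, are hypothetical names made up for illustration, and the document is held as a byte array purely to keep the example small.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ChainedExtractionSketch {

    /** Hypothetical extractor: reads the document stream and returns a short result. */
    interface Extractor {
        String extract(InputStream in) throws IOException;
    }

    /** Hypothetical factory: declares its supported MIME types and creates an extractor. */
    interface ExtractorFactory {
        String name();
        Set<String> supportedMimeTypes();
        Extractor create();
    }

    /** Hypothetical report object: per-extractor results and errors, in run order. */
    static final class Report {
        final Map<String, String> results = new LinkedHashMap<>();
        final Map<String, String> errors = new LinkedHashMap<>();
    }

    static Report run(String mimeType, byte[] document, List<ExtractorFactory> registry) {
        // Step 1: from all registered extractors, keep those matching the MIME type.
        List<ExtractorFactory> matching = new ArrayList<>();
        for (ExtractorFactory f : registry) {
            if (f.supportedMimeTypes().contains(mimeType)) {
                matching.add(f);
            }
        }
        Report report = new Report();
        for (ExtractorFactory f : matching) {
            // Step 2: create the extractor from its factory.
            Extractor extractor = f.create();
            // Step 3: run it on a fresh InputStream over the local document source.
            try (InputStream in = new ByteArrayInputStream(document)) {
                report.results.put(f.name(), extractor.extract(in));
            } catch (IOException | RuntimeException e) {
                // One failing extractor is recorded and does not stop the chain.
                report.errors.put(f.name(), String.valueOf(e.getMessage()));
            }
        }
        return report;
    }

    /** Convenience helper for the sketch; it hands back the same extractor on every create(). */
    static ExtractorFactory factory(String name, Set<String> types, Extractor extractor) {
        return new ExtractorFactory() {
            @Override public String name() { return name; }
            @Override public Set<String> supportedMimeTypes() { return types; }
            @Override public Extractor create() { return extractor; }
        };
    }

    public static void main(String[] args) {
        List<ExtractorFactory> registry = List.of(
            factory("size", Set.of("image/png", "image/jpeg"),
                    in -> "bytes=" + in.readAllBytes().length),
            factory("html-only", Set.of("text/html"), in -> "html"));
        Report report = run("image/png", new byte[] {1, 2, 3}, registry);
        System.out.println("results: " + report.results);
        System.out.println("errors:  " + report.errors);
    }
}
```

The point of the `Report` object is the one made above about metrics: each extractor's outcome is recorded separately, so a caller can see which extractors ran for a given MimeType and which failed. Something similar might let Tika run an OCR parser and a metadata parser over the same image and merge what each returns.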

(0) https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java
(1) https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java#L205
(2) https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java#L223
(3) https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java#L252
(4) https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java#L440
(5) https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java#L465

> Figure out how to add Image metadata extraction to Tesseract parser
> -------------------------------------------------------------------
>
>                 Key: TIKA-1445
>                 URL: https://issues.apache.org/jira/browse/TIKA-1445
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 1.8
>
>         Attachments: TIKA-1445.Mattmann.101214.patch.txt,
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt,
> TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types,
> consider how to add back in the metadata extraction capabilities by the other
> Image parsers.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)