[ https://issues.apache.org/jira/browse/TIKA-482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Staffan Olsson updated TIKA-482:
--------------------------------

    Description:
When I added support for more image metadata in TIKA-472, I realized the current design had some restrictions:
* I could not access the typed getters from Metadata Extractor, such as getDate (to format an ISO date) and getStringArray (for keywords).
* The handler function was called one field at a time, which prevents logic where one field depends on the value of another (there are, for example, record versions and fields that specify encoding).

See attached patch. It refactors TiffExtractor to MetadataExtractorExtractor.

The patch also includes the date fix, see https://issues.apache.org/jira/browse/TIKA-451#action_12898794

We can later add more Extractors using other libraries, and map to parsers based on format. For example, we already use ImageIO in ImageParser, so maybe there should be an ImageIOExtractor.

  was:
When I added support for more image metadata in TIKA-472, I realized the current design had some restrictions:
* I could not access the typed getters from Metadata Extractor, such as getDate (to format an ISO date) and getStringArray (for keywords).
* The handler function was called one field at a time, which prevents logic where one field depends on the value of another (there are, for example, record versions and fields that specify encoding).

See attached patch. It refactors TiffExtractor to MetadataExtractorExtractor.

The patch also includes the date fix, see https://issues.apache.org/jira/browse/TIKA-451#action_12898794

We can later add more Extractors using other libraries, and map to parsers based on format. For example, we already use ImageIO in ImageParser, so maybe there should be an ImageIOExtractor.

To support more image formats we could investigate XMP, for example using http://www.pkg.dk/projects/XMP-Utilities-for-Java-XMPUtil4J/. I noticed that we already have jempbox in Tika, so a JempboxExtractor in the image package would probably be the best approach to reading XMP. I'll make a separate ticket for this.


> Refactor image and jpeg parsers for access to MetadataExtractor API
> -------------------------------------------------------------------
>
>                 Key: TIKA-482
>                 URL: https://issues.apache.org/jira/browse/TIKA-482
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Staffan Olsson
>         Attachments: TIKA-451-DublinCore_and_TIKA-482.patch
>
>
> When I added support for more image metadata in TIKA-472, I realized
> the current design had some restrictions:
> * I could not access the typed getters from Metadata Extractor, such
> as getDate (to format an ISO date) and getStringArray (for keywords).
> * The handler function was called one field at a time, which prevents
> logic where one field depends on the value of another (there are, for
> example, record versions and fields that specify encoding).
> See attached patch. It refactors TiffExtractor to MetadataExtractorExtractor.
> The patch also includes the date fix, see
> https://issues.apache.org/jira/browse/TIKA-451#action_12898794
> We can later add more Extractors using other libraries, and map to parsers
> based on format. For example, we already use ImageIO in ImageParser, so maybe
> there should be an ImageIOExtractor.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
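
The second restriction in the description (one field depending on the value of another) can be sketched in plain Java. This is only an illustration under stated assumptions: RawDirectory, handleSingleField, handleDirectory and the "Encoding" and "Caption" field names are hypothetical and are not taken from the attached patch, from Tika, or from the Metadata Extractor library. It does not cover the first restriction (typed getters).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class DirectoryHandlingSketch {

    /** Hypothetical stand-in for a parsed metadata directory: raw bytes keyed by field name. */
    public static final class RawDirectory {
        private final Map<String, byte[]> fields = new LinkedHashMap<>();

        public RawDirectory put(String name, byte[] value) {
            fields.put(name, value);
            return this;
        }

        public byte[] get(String name) {
            return fields.get(name);
        }
    }

    /** Per-field style: the value is seen in isolation, so a fixed charset must be assumed. */
    public static String handleSingleField(byte[] value) {
        return new String(value, StandardCharsets.ISO_8859_1);
    }

    /** Whole-directory style: the (hypothetical) "Encoding" field decides how "Caption" is decoded. */
    public static String handleDirectory(RawDirectory dir) {
        byte[] encoding = dir.get("Encoding");
        Charset charset = (encoding == null)
                ? StandardCharsets.ISO_8859_1
                : Charset.forName(new String(encoding, StandardCharsets.US_ASCII));
        return new String(dir.get("Caption"), charset);
    }

    public static void main(String[] args) {
        RawDirectory dir = new RawDirectory()
                .put("Encoding", "UTF-8".getBytes(StandardCharsets.US_ASCII))
                .put("Caption", "Malm\u00f6".getBytes(StandardCharsets.UTF_8));

        // The per-field handler cannot know about "Encoding" and may mis-decode the caption.
        System.out.println(handleSingleField(dir.get("Caption")));
        // The directory-level handler can consult "Encoding" before decoding.
        System.out.println(handleDirectory(dir));
    }
}
```

Passing the whole directory to the handler is what makes the cross-field lookup possible; a per-field callback would need extra state to achieve the same result.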