Tim Allison created TIKA-1814:
---------------------------------

             Summary: Add a standalone XMPScannerParser
                 Key: TIKA-1814
                 URL: https://issues.apache.org/jira/browse/TIKA-1814
             Project: Tika
          Issue Type: Improvement
            Reporter: Tim Allison
            Priority: Minor


Several parsers make use of XMP data and normalize it via Dublin Core (dc) or 
other standards into our Metadata object.  We currently either rely on 
dependencies to make sense of multiple XMP packets within a file (PDFBox for 
PDFParser) or grab only the first packet (TiffParser via JempboxExtractor and 
XMPPacketScanner).  Which other parsers are processing XMP?

It might be useful to extract all XMP packets from a file and store their raw 
bytes as Base64-encoded Strings in the Metadata object.  Advanced users would 
then have access to the raw XMP streams.
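
As a rough illustration of the idea, here is a minimal sketch of scanning a 
byte array for every XMP packet and Base64-encoding each one.  It is not an 
existing Tika class: the names XmpPacketSketch and extractBase64Packets are 
hypothetical, and it assumes packets are delimited by the usual 
"<?xpacket begin=" header and "<?xpacket end=...?>" trailer in a single-byte 
compatible encoding (UTF-16 packets would need extra handling).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

// Hypothetical helper, not part of Tika: collects every XMP packet in a buffer.
final class XmpPacketSketch {
    // Assumed packet delimiters (ASCII/UTF-8 form only).
    private static final byte[] BEGIN = "<?xpacket begin=".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] END = "<?xpacket end=".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] CLOSE = "?>".getBytes(StandardCharsets.US_ASCII);

    // Naive byte-pattern search; returns -1 when the pattern is not found.
    static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    // Returns each complete packet (header through trailer) as a Base64 String,
    // in the order the packets appear in the data.
    static List<String> extractBase64Packets(byte[] data) {
        List<String> packets = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = indexOf(data, BEGIN, pos);
            if (start < 0) {
                break;
            }
            int trailer = indexOf(data, END, start + BEGIN.length);
            if (trailer < 0) {
                break; // truncated packet: stop rather than guess
            }
            int close = indexOf(data, CLOSE, trailer + END.length);
            if (close < 0) {
                break;
            }
            int stop = close + CLOSE.length;
            byte[] raw = Arrays.copyOfRange(data, start, stop);
            packets.add(Base64.getEncoder().encodeToString(raw));
            pos = stop;
        }
        return packets;
    }
}
```

A parser built on this could add each returned String to the Metadata object 
under a multi-valued key of its choosing; the key name is left open here.  A 
real implementation would more likely scan an InputStream than hold the whole 
file in memory.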

For Tika 1.x, nothing would call this parser unless users explicitly configured 
it.  For Tika 2.x, once we get the configurable combo parsers set up, a user 
could configure a combo/additive parser, e.g., a PDFParser that combines our 
current PDFParser with this new XMPScannerParser.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
