#640: WebSubmit (plugin wsm_pdftk_plugin) - loss of metadata while it's being
processed
-------------------------------------------+-----------------
Reporter: jpcorral | Owner:
Type: defect | Status: new
Priority: major | Milestone:
Component: WebSubmit | Version:
Keywords: WebSubmit plugin pdf metadata |
-------------------------------------------+-----------------
This bug appears when the metadata of a PDF contains more than one
"custom key" with the same name. A "custom key" is any key other than
"InfoKey" or "InfoValue".
Take the PDF of these e-proceedings,
http://cdsweb.cern.ch/record/1090859, as an example. It has a long table
of contents. When the plugin is used to extract the metadata:
{{{
from invenio.websubmit_file_metadata_plugins import wsm_pdftk_plugin
# The second parameter is mandatory but it is never used
wsm_pdftk_plugin.read_metadata_local('/path/to/the/file/care-conf-06-049.pdf', 0)
}}}
The following information is returned:
{{{
{'BookmarkLevel': '1',
'BookmarkPageNumber': '297',
'BookmarkTitle': 'Session 8b_McIntyre_slides4.pdf',
'CreationDate': "D:20070312163531+01'00'",
'Creator': 'Adobe Acrobat 7.0',
'ModDate': "D:20070314164628+01'00'",
'NumberOfPages': '305',
'PdfID0': '35ef8d4d0af11db8788011242e3266',
'PdfID1': '2c54896bd24311db8788011242e3266',
'Producer': 'Mac OS X 10.4.8 Quartz PDFContext'}
}}}
But if the command
{{{pdftk /path/to/the/file/care-conf-06-049.pdf dump_data | less}}}
is used, all of the metadata is extracted.
These "custom keys" are stored as keys of a dictionary (line 98 of the
plugin), so the value of each key is overwritten whenever a new line with
the same key appears.
In this example, only the last entry of the PDF's table of contents is
retrieved:
{{{
{'BookmarkLevel': '1',
'BookmarkPageNumber': '297',
'BookmarkTitle': 'Session 8b_McIntyre_slides4.pdf',
[...]
}
}}}
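The overwriting described above can be reproduced outside the plugin. The
following is only a minimal sketch, not the plugin's actual code: the
function names (parse_overwriting, parse_collecting) and the sample dump
text are made up for illustration, and the "Key: value" line format is
assumed from the output shown in this ticket. It contrasts a plain
dictionary assignment with one possible workaround, collecting repeated
custom keys into lists.

```python
# Hypothetical sample in the "Key: value" style of a pdftk data dump.
# The bookmark entries are invented for illustration only.
SAMPLE_DUMP = """\
InfoKey: Creator
InfoValue: Adobe Acrobat 7.0
BookmarkTitle: First entry
BookmarkLevel: 1
BookmarkPageNumber: 1
BookmarkTitle: Second entry
BookmarkLevel: 1
BookmarkPageNumber: 297
NumberOfPages: 305
"""


def _pairs(dump):
    """Yield (key, value) for each 'Key: value' line, skipping other lines."""
    for line in dump.splitlines():
        key, sep, value = line.partition(': ')
        if sep:
            yield key.strip(), value.strip()


def parse_overwriting(dump):
    """Store custom keys in a plain dict: a repeated key keeps only
    the value from its last occurrence (the behaviour reported here)."""
    metadata = {}
    info_key = None
    for key, value in _pairs(dump):
        if key == 'InfoKey':
            info_key = value
        elif key == 'InfoValue' and info_key is not None:
            metadata[info_key] = value
            info_key = None
        else:
            metadata[key] = value  # silently replaces any earlier value
    return metadata


def parse_collecting(dump):
    """Same parsing, but every custom key maps to a list of all its values."""
    metadata = {}
    info_key = None
    for key, value in _pairs(dump):
        if key == 'InfoKey':
            info_key = value
        elif key == 'InfoValue' and info_key is not None:
            metadata[info_key] = value
            info_key = None
        else:
            metadata.setdefault(key, []).append(value)  # keep every value
    return metadata


if __name__ == '__main__':
    print(parse_overwriting(SAMPLE_DUMP)['BookmarkTitle'])
    print(parse_collecting(SAMPLE_DUMP)['BookmarkTitle'])
```

Returning lists changes the type of the values, so callers that expect a
string for each custom key would need to be adapted; whether that is
acceptable for the plugin is an open question.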
--
Ticket URL: <http://invenio-software.org/ticket/640>
Invenio <http://invenio-software.org>