Using PDFBox 2.0.2:
I am trying to set dc:publisher to "Çâmára Münícìpål de Matelâñdia" in
the metadata of my PDF; however, the diacritical characters get
mangled when I read the PDF back.
My writing code looks like:
    PDDocument doc = ...
    PDDocumentCatalog catalog = ...
    PDMetadata metadataStream = Optional.ofNullable(catalog.getMetadata())
            .orElseGet(() -> new PDMetadata(doc));

    // parse any existing XMP packet, falling back to an empty one
    XMPMetadata xmpMetadata = null;
    try (COSInputStream is = metadataStream.createInputStream()) {
        xmpMetadata = new DomXmpParser().parse(is);
    } catch (XmpParsingException e) {
        LOG.warn(e);
        xmpMetadata = XMPMetadata.createXMPMetadata();
    }

    DublinCoreSchema dcMetadata = xmpMetadata.createAndAddDublinCoreSchema();
    dcMetadata.addPublisher("Çâmára Münícìpål de Matelâñdia");
    // attach the metadata stream to the catalog
    catalog.setMetadata(metadataStream);

    // serialize the XMP packet and store its bytes in the metadata stream
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    XmpSerializer serializer = new XmpSerializer();
    serializer.serialize(xmpMetadata, baos, false);
    metadataStream.importXMPMetadata(baos.toByteArray());
My reading code looks like:
    PDDocument doc = PDDocument.load(is);
    PDDocumentCatalog catalog = doc.getDocumentCatalog();
    PDMetadata metadata = catalog.getMetadata();
    // dump the raw XMP packet to a file for inspection
    try (InputStream metadataIs = metadata.createInputStream()) {
        Files.copy(metadataIs, Paths.get("/tmp/metadata.xml"));
    }
However, in the output XML I see this:
    <dc:publisher>
      <rdf:Bag>
        <rdf:li>??m?ra M?n?c?p?l de Matel??dia</rdf:li>
      </rdf:Bag>
    </dc:publisher>
So I guess something is going wrong with the character encoding somewhere.
Am I doing something incorrectly (perhaps I need to specify UTF-8, my
character set, somewhere), or is this a bug in PDFBox?
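
For what it is worth, the pattern of question marks looks like what plain Java produces when a string is encoded with a charset that cannot represent the accented characters. The sketch below is only an illustration of that effect and does not use PDFBox at all; the class name EncodingCheck and the method roundTrip are hypothetical names of mine, and I am only assuming that some step in the write path behaves like the US-ASCII case.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of PDFBox): shows how an encode/decode
// round trip behaves for the publisher string under different charsets.
final class EncodingCheck {

    // Encode the text to bytes with the given charset, then decode the
    // bytes again with the same charset.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // The publisher string, written with Unicode escapes so the result
        // does not depend on the encoding of this source file.
        String publisher =
            "\u00C7\u00E2m\u00E1ra M\u00FCn\u00EDc\u00ECp\u00E5l de Matel\u00E2\u00F1dia";

        // UTF-8 can represent every character, so the text survives.
        System.out.println(roundTrip(publisher, StandardCharsets.UTF_8).equals(publisher));

        // US-ASCII cannot represent the accented characters; getBytes
        // substitutes '?' for each of them.
        System.out.println(roundTrip(publisher, StandardCharsets.US_ASCII));
    }
}
```

The US-ASCII round trip gives the same string I see in the XML, which is why I suspect an encoding step rather than, say, a font or viewer problem. It does not tell me which step is responsible.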
Cheers Adam.
--
Adam Retter
skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk