Thank you Maruan,

Apologies for the noise. I have now resolved this. I had simplified my
code for the examples I gave in the email. The issue is not with
PDFBox; rather, a 3rd-party library that was processing the string
"Çâmára Münícìpål de Matelâñdia" before it reached PDFBox was mangling
it.
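
For readers who hit the same symptom, here is a minimal sketch of one way this kind of mangling can arise. It assumes the characters were lost by encoding the string to a charset that cannot represent them; the thread does not say what the 3rd-party library actually did, so this is an illustration only. The class `EncodingCheck` and the helper `roundTrip` are hypothetical names, not part of PDFBox.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of PDFBox or of the original code.
final class EncodingCheck {

    // Encodes the string to bytes with the given charset, then decodes it back.
    // String.getBytes(Charset) substitutes the charset's replacement byte
    // (for US-ASCII, '?') for every character the charset cannot represent.
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        // "Çâmára Münícìpål de Matelâñdia", written with Unicode escapes so
        // that the encoding of this source file cannot interfere.
        String publisher =
            "\u00C7\u00E2m\u00E1ra M\u00FCn\u00EDc\u00ECp\u00E5l de Matel\u00E2\u00F1dia";

        // Assumed failure mode: a charset that lacks the accented characters.
        String mangled = roundTrip(publisher, StandardCharsets.US_ASCII);
        System.out.println(mangled);

        // UTF-8 can represent every character, so the string survives intact.
        String intact = roundTrip(publisher, StandardCharsets.UTF_8);
        System.out.println(intact.equals(publisher));
    }
}
```

With US-ASCII the round trip yields "??m?ra M?n?c?p?l de Matel??dia", the same pattern as in the XML quoted below. Running a check like this at each hand-off between libraries is one way to find the step that loses the characters.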

Thanks again.

Adam.

On 19 July 2016 at 08:13, Maruan Sahyoun <[email protected]> wrote:
> Hi,
>
>> On 18.07.2016 at 14:15, Adam Retter <[email protected]> wrote:
>>
>> Using pdf-box-2.0.2:
>>
>> I am trying to set dc:publisher to "Çâmára Münícìpål de Matelâñdia" in
>> the metadata of my PDF; however, my diacritical characters seem to get
>> mangled when I try to read the PDF back.
>>
>> My writing code looks like:
>>
>> PDDocument doc = ...
>> PDDocumentCatalog catalog = ...
>>
>> PDMetadata metadataStream = Optional.ofNullable(catalog.getMetadata())
>>  .orElseGet(() -> new PDMetadata(doc));
>> XMPMetadata xmpMetadata = null;
>> try(COSInputStream is = metadataStream.createInputStream()) {
>>  xmpMetadata = new DomXmpParser().parse(is);
>> } catch(XmpParsingException e) {
>>  LOG.warn(e);
>>  xmpMetadata = XMPMetadata.createXMPMetadata();
>> }
>> DublinCoreSchema dcMetadata = xmpMetadata.createAndAddDublinCoreSchema();
>> dcMetadata.addPublisher("Çâmára Münícìpål de Matelâñdia");
>> ByteArrayOutputStream baos = new ByteArrayOutputStream();
>> XmpSerializer serializer = new XmpSerializer();
>> serializer.serialize(xmpMetadata, baos, false);
>> metadataStream.importXMPMetadata(baos.toByteArray());
>> // setMetadata takes the PDMetadata stream, not the XMPMetadata object
>> catalog.setMetadata(metadataStream);
>>
>>
>> My reading code looks like:
>>
>> PDDocument doc = PDDocument.load(is);
>> PDDocumentCatalog catalog = doc.getDocumentCatalog();
>> PDMetadata metadata = catalog.getMetadata();
>> try(InputStream metadataIs = metadata.createInputStream()) {
>>   Files.copy(metadataIs, Paths.get("/tmp/metadata.xml"));
>> }
>>
>>
>> However in the output XML I am seeing this:
>>
>> <dc:publisher>
>>    <rdf:Bag>
>>        <rdf:li>??m?ra M?n?c?p?l de Matel??dia</rdf:li>
>>    </rdf:Bag>
>> </dc:publisher>
>>
>>
>
> I've tested various ways of saving the file (yours, serializing to a
> FileOutputStream, …) and all of them work when viewing the content in a
> browser or a text editor.
>
>
> <dc:publisher>
>    <rdf:Bag>
>        <rdf:li>Çâmára Münícìpål de Matelâñdia</rdf:li>
>    </rdf:Bag>
> </dc:publisher>
>
> Where do you see that string?
>
> BR
> Maruan
>
>
>
>> So I guess something is up with the character encoding somewhere? Am I
>> doing something incorrectly (perhaps I need to specify UTF-8, my
>> character set, somewhere), or is this a bug in pdf-box?
>>
>> Cheers Adam.
>>
>>
>>
>>
>>
>> --
>> Adam Retter
>>
>> skype: adam.retter
>> tweet: adamretter
>> http://www.adamretter.org.uk
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>



-- 
Adam Retter

skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk
