[jira] [Comment Edited] (PDFBOX-3323) Cannot set destination meta data in PDFMergerUtility

Alexander Kriegisch (JIRA) Fri, 22 Apr 2016 15:09:56 -0700

    [ https://issues.apache.org/jira/browse/PDFBOX-3323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253877#comment-15253877 ]


Alexander Kriegisch edited comment on PDFBOX-3323 at 4/22/16 10:08 PM:
-----------------------------------------------------------------------

Okay, the solution is more complex than I thought because before the merge I do 
not have a PDDocument and need to create a COSStream for the XMP meta data. 
Furthermore, it is non-trivial to set the creator property for XMP. I had to 
look into the XMPBox source code in order to find out how to do that. Maybe you 
want to publish this as an example if you find it useful and comprehensive. I 
think it is important to return something to the community, especially because 
Tilman supported me so well.

{code}
package de.scrum_master.pdf_tools;

import org.apache.pdfbox.cos.COSStream;
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.pdmodel.common.PDMetadata;
import org.apache.xmpbox.XMPMetadata;
import org.apache.xmpbox.schema.DublinCoreSchema;
import org.apache.xmpbox.schema.PDFAIdentificationSchema;
import org.apache.xmpbox.schema.XMPBasicSchema;
import org.apache.xmpbox.type.AgentNameType;
import org.apache.xmpbox.type.BadFieldValueException;
import org.apache.xmpbox.xml.XmpSerializer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import javax.xml.transform.TransformerException;
import java.io.*;
import java.util.Calendar;
import java.util.List;

public class PDFMerger {
  private final static Logger logger = LoggerFactory.getLogger(PDFMerger.class);

  /**
   * Modified {@link ByteArrayOutputStream} whose
   * {@link StateExposingByteArrayOutputStream#toByteArray()} method directly returns its
   * internal byte buffer in order to avoid in-memory copies during PDF merge.
   * <p></p>
   * Please use carefully! The returned array can be longer than {@link #size()};
   * only the first {@code size()} bytes are valid.
   */
  private static class StateExposingByteArrayOutputStream extends ByteArrayOutputStream {
    @Override
    public synchronized byte[] toByteArray() {
      return buf;
    }
  }

  /**
   * Creates a compound PDF document from a list of input documents
   * <p></p>
   * The merged document is PDF/A-1b compliant, provided the source documents are as well.
   * It contains document properties title, creator and subject, currently hard-coded.
   *
   * @param sources list of source PDF document streams
   * @return compound PDF document as a readable stream
   * @throws IOException if anything goes wrong during PDF merge
   */
  public InputStream merge(final List<InputStream> sources) throws IOException {
    String title = "My title";
    String creator = "Alexander Kriegisch";
    String subject = "Subject with umlauts ÄÖÜ";

    try (
      ByteArrayOutputStream mergedPDFOutputStream = new StateExposingByteArrayOutputStream();
      COSStream cosStream = new COSStream()
    ) {
      PDFMergerUtility pdfMerger = createPDFMergerUtility(sources, mergedPDFOutputStream);

      // PDF and XMP properties must be identical, otherwise document is not PDF/A compliant
      PDDocumentInformation pdfDocumentInfo = createPDFDocumentInfo(title, creator, subject);
      PDMetadata xmpMetadata = createXMPMetadata(cosStream, title, creator, subject);
      pdfMerger.setDestinationDocumentInformation(pdfDocumentInfo);
      pdfMerger.setDestinationMetadata(xmpMetadata);

      logger.trace("Merging {} source documents into one PDF", sources.size());
      pdfMerger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
      logger.trace("PDF merge successful, size = {} bytes", mergedPDFOutputStream.size());

      return new ByteArrayInputStream(
        mergedPDFOutputStream.toByteArray(), 0, mergedPDFOutputStream.size());

    } catch (BadFieldValueException | TransformerException e) {
      throw new IOException("PDF merge problem", e);
    } finally {
      for (InputStream source : sources) {
        try {
          source.close();
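        } catch (IOException e) {
          // Best effort: a failed close must not mask the merge result
          logger.trace("Could not close source stream", e);
        }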
      }
    }
  }

  private PDFMergerUtility createPDFMergerUtility(
    List<InputStream> sources,
    ByteArrayOutputStream mergedPDFOutputStream
  ) {
    logger.trace("Initialising PDF merge utility");
    PDFMergerUtility pdfMerger = new PDFMergerUtility();
    pdfMerger.addSources(sources);
    pdfMerger.setDestinationStream(mergedPDFOutputStream);
    return pdfMerger;
  }

  private PDDocumentInformation createPDFDocumentInfo(
    String title, String creator, String subject
  ) {
    logger.trace("Setting document info (title, creator, subject) for merged PDF");
    PDDocumentInformation documentInformation = new PDDocumentInformation();
    documentInformation.setTitle(title);
    documentInformation.setCreator(creator);
    documentInformation.setSubject(subject);
    return documentInformation;
  }

  private PDMetadata createXMPMetadata(
    COSStream cosStream,
    String title, String creator, String subject
  )
    throws BadFieldValueException, TransformerException, IOException
  {
    logger.trace("Setting XMP metadata (title, creator, subject) for merged PDF");
    XMPMetadata xmpMetadata = XMPMetadata.createXMPMetadata();

    // PDF/A-1b properties
    PDFAIdentificationSchema pdfaSchema = xmpMetadata.createAndAddPFAIdentificationSchema();
    pdfaSchema.setPart(1);
    pdfaSchema.setConformance("B");

    // Dublin Core properties
    DublinCoreSchema dublinCoreSchema = xmpMetadata.createAndAddDublinCoreSchema();
    dublinCoreSchema.setTitle(title);
    dublinCoreSchema.addCreator(creator);
    dublinCoreSchema.setDescription(subject);

    // XMP Basic properties
    XMPBasicSchema basicSchema = xmpMetadata.createAndAddXMPBasicSchema();
    Calendar creationDate = Calendar.getInstance();
    basicSchema.setCreateDate(creationDate);
    basicSchema.setModifyDate(creationDate);
    basicSchema.setMetadataDate(creationDate);
    basicSchema.setCreatorToolProperty(
      (AgentNameType) xmpMetadata
        .getTypeMapping()
        .instanciateSimpleField(
          basicSchema.getClass(), null, basicSchema.getPrefix(), XMPBasicSchema.CREATORTOOL, creator)
    );

    // Create and return XMP data structure in XML format
    try (
      ByteArrayOutputStream xmpOutputStream = new StateExposingByteArrayOutputStream();
      OutputStream cosXMPStream = cosStream.createOutputStream()
    ) {
      new XmpSerializer().serialize(xmpMetadata, xmpOutputStream, true);
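      // The exposed buffer can be longer than the serialised XMP, so write only the valid bytes
      cosXMPStream.write(xmpOutputStream.toByteArray(), 0, xmpOutputStream.size());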
      return new PDMetadata(cosStream);
    }
  }
}
{code}
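
A note on the "Please use carefully!" warning: the following JDK-only sketch shows why every consumer of the exposed buffer should also pass {{size()}} as the length. The class and method names ({{ExposedBufferDemo}}, {{ExposingStream}}, {{asInputStream}}, {{readValidBytes}}) are made up for illustration and are not part of PDFBox or of the class above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class ExposedBufferDemo {
  /** Same idea as StateExposingByteArrayOutputStream: no defensive copy. */
  static class ExposingStream extends ByteArrayOutputStream {
    @Override
    public synchronized byte[] toByteArray() {
      return buf; // internal buffer; anything beyond size() is unused capacity
    }
  }

  /** Wraps only the valid region [0, size()) of the exposed buffer. */
  static ByteArrayInputStream asInputStream(ByteArrayOutputStream out) {
    return new ByteArrayInputStream(out.toByteArray(), 0, out.size());
  }

  /** Drains the stream into a new array, just to inspect what a reader would see. */
  static byte[] readValidBytes(ByteArrayInputStream in) {
    ByteArrayOutputStream copy = new ByteArrayOutputStream();
    byte[] chunk = new byte[64];
    int n;
    while ((n = in.read(chunk, 0, chunk.length)) != -1) {
      copy.write(chunk, 0, n);
    }
    return copy.toByteArray();
  }

  public static void main(String[] args) {
    ExposingStream out = new ExposingStream();
    byte[] payload = "<x/>".getBytes(StandardCharsets.UTF_8);
    out.write(payload, 0, payload.length);

    System.out.println("valid bytes:   " + out.size());
    System.out.println("buffer length: " + out.toByteArray().length);
    System.out.println("bounded read:  "
      + new String(readValidBytes(asInputStream(out)), StandardCharsets.UTF_8));
  }
}
```

Writing or wrapping the array without the {{0, size()}} bounds would also pass along the unused tail of the buffer, which is what the no-copy shortcut trades for speed.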

*Edit:* BTW, [~tilman], as far as I am concerned you can put the functionality into 
release 2.0.1 because it works for me, even though it is a bit hard to 
implement it in a PDF/A compliant way. It would be nice if at some time in the 
future I could just set PDF properties, say "please save as PDF/A-1b", and have the 
corresponding XMP added automatically.


> Cannot set destination meta data in PDFMergerUtility
> ----------------------------------------------------
>
>                 Key: PDFBOX-3323
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3323
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 1.8.9, 2.0.0
>            Reporter: Alexander Kriegisch
>            Assignee: Tilman Hausherr
>              Labels: merge, metadata
>             Fix For: 2.0.1, 2.1.0
>
>
> When merging multiple PDFs into one compound document via 
> {{PDFMergerUtility}}, meta data like title, author, subject cannot be set but 
> seem to be taken from one of the input documents. This is usually not the 
> desired behaviour because as a user I have no direct influence on the meta 
> data. As a user I would like to explicitly set or at least overwrite certain 
> meta data for the destination document. Currently I can only set the 
> destination stream or file name, but not the meta data.


