[ https://issues.apache.org/jira/browse/PDFBOX-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andre updated PDFBOX-5528:
--------------------------
    Description: 
We need to support PDF/UA compliant documents to some extent. I noticed that when we take a PDF/UA compliant PDF document and flatten it via PDAcroForm#flatten, the resulting output is not PDF/UA compliant anymore.

After a little bit of research, the problem is that PDFBox creates Do operators with paths representing the appearance of the form fields.
According to the PDF/UA standard, such paths need to be enclosed in marked content sections (BMC ... EMC, BDC ... EMC; see attached images).

By copying some code from PDAcroForm#flatten and adding contentStream.beginMarkedContent and contentStream.endMarkedContent myself, I can work around the problem, but that is less than ideal; it would be great if this could be included in PDFBox.

{code:java}
final var dict = new COSDictionary();
dict.setLong(COSName.MCID, mcid);
dict.setItem(COSName.BBOX, bBox);
dict.setItem(COSName.TYPE, COSName.BACKGROUND);
final var propList = PDPropertyList.create(dict);
contentStream.beginMarkedContent(COSName.ARTIFACT, propList);
contentStream.saveGraphicsState();
// see https://stackoverflow.com/a/54091766/1729265 for an explanation
// of the steps required
// this will transform the appearance stream form object into the rectangle of the
// annotation bbox and map the coordinate systems
final var transformationMatrix = pdfbox_resolveTransformationMatrix(form, annotation, appearanceStream);
contentStream.transform(transformationMatrix);
contentStream.drawForm(fieldObject);
contentStream.restoreGraphicsState();
contentStream.endMarkedContent();
{code}

  was:
We need to support PDF/UA compliant documents to some extent. I noticed that when we take a PDF/UA compliant PDF document and flatten it via PDAcroForm#flatten, the resulting output is not PDF/UA compliant anymore.

After a little bit of research, the problem is that PDFBox creates Do operators with paths representing the appearance of the form fields.
According to the PDF/UA standard, such paths need to be enclosed in marked content sections (BMC ... EMC, BDC ... EMC; see attached images).

By copying some code from PDAcroForm#flatten and adding contentStream.beginMarkedContent and contentStream.endMarkedContent myself, I can work around the problem, but that is less than ideal; it would be great if this could be included in PDFBox.

<pre>
final var dict = new COSDictionary();
dict.setLong(COSName.MCID, mcid);
dict.setItem(COSName.BBOX, bBox);
dict.setItem(COSName.TYPE, COSName.BACKGROUND);
final var propList = PDPropertyList.create(dict);
contentStream.beginMarkedContent(COSName.ARTIFACT, propList);
contentStream.saveGraphicsState();
// see https://stackoverflow.com/a/54091766/1729265 for an explanation
// of the steps required
// this will transform the appearance stream form object into the rectangle of the
// annotation bbox and map the coordinate systems
final var transformationMatrix = pdfbox_resolveTransformationMatrix(form, annotation, appearanceStream);
contentStream.transform(transformationMatrix);
contentStream.drawForm(fieldObject);
contentStream.restoreGraphicsState();
contentStream.endMarkedContent();
</pre>


> PDF/UA: Add marked content sections when flattening acro forms
> --------------------------------------------------------------
>
>                 Key: PDFBOX-5528
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5528
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: AcroForm
>            Reporter: Andre
>            Priority: Minor
>         Attachments: correct.png, wrong.png
>
>
> We need to support PDF/UA compliant documents to some extent. I noticed that
> when we take a PDF/UA compliant PDF document and flatten it via
> PDAcroForm#flatten, the resulting output is not PDF/UA compliant anymore.
> After a little bit of research, the problem is that PDFBox creates Do
> operators with paths representing the appearance of the form fields.
> According to the PDF/UA standard, such paths need to be enclosed in marked
> content sections (BMC ... EMC, BDC ... EMC; see attached images).
> By copying some code from PDAcroForm#flatten and adding
> contentStream.beginMarkedContent and contentStream.endMarkedContent myself, I
> can work around the problem, but that is less than ideal; it would be great if
> this could be included in PDFBox.
>
> {code:java}
> final var dict = new COSDictionary();
> dict.setLong(COSName.MCID, mcid);
> dict.setItem(COSName.BBOX, bBox);
> dict.setItem(COSName.TYPE, COSName.BACKGROUND);
> final var propList = PDPropertyList.create(dict);
> contentStream.beginMarkedContent(COSName.ARTIFACT, propList);
> contentStream.saveGraphicsState();
> // see https://stackoverflow.com/a/54091766/1729265 for an explanation
> // of the steps required
> // this will transform the appearance stream form object into the rectangle of the
> // annotation bbox and map the coordinate systems
> final var transformationMatrix =
>     pdfbox_resolveTransformationMatrix(form, annotation, appearanceStream);
> contentStream.transform(transformationMatrix);
> contentStream.drawForm(fieldObject);
> contentStream.restoreGraphicsState();
> contentStream.endMarkedContent();
>
> {code}
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
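
For illustration, the content-stream operator sequence that the workaround in the description aims to produce can be sketched in plain Java, without PDFBox. This is a minimal sketch under stated assumptions, not PDFBox code: the class name MarkedContentSketch, the method wrapAsArtifact, and the XObject resource name Fm0 are hypothetical, and the property list is reduced to /Type /Background (the MCID and BBox entries from the snippet are left out).

```java
import java.math.BigDecimal;

/** Hypothetical helper: emits the operators for one flattened field appearance. */
final class MarkedContentSketch {

    /**
     * Wraps the invocation of a form XObject in a BDC ... EMC marked-content
     * section tagged as an artifact.
     *
     * @param xObjectName resource name of the form XObject (assumed, e.g. "Fm0")
     * @param matrix      the six operands a b c d e f of the cm operator
     */
    static String wrapAsArtifact(String xObjectName, double[] matrix) {
        if (matrix.length != 6) {
            throw new IllegalArgumentException("cm expects six matrix values");
        }
        StringBuilder sb = new StringBuilder();
        // Begin marked content with an inline property dictionary.
        sb.append("/Artifact <</Type /Background>> BDC\n");
        // Save the graphics state so the transformation stays local.
        sb.append("q\n");
        for (double value : matrix) {
            sb.append(number(value)).append(' ');
        }
        sb.append("cm\n");
        // Paint the appearance stream as a form XObject.
        sb.append('/').append(xObjectName).append(" Do\n");
        // Restore the graphics state, then close the marked-content section.
        sb.append("Q\n");
        sb.append("EMC\n");
        return sb.toString();
    }

    /** Formats a number without exponent notation or trailing zeros. */
    private static String number(double value) {
        return new BigDecimal(String.valueOf(value)).stripTrailingZeros().toPlainString();
    }

    public static void main(String[] args) {
        System.out.print(wrapAsArtifact("Fm0", new double[] {1, 0, 0, 1, 72, 700}));
    }
}
```

Running main prints the six lines "/Artifact <</Type /Background>> BDC", "q", "1 0 0 1 72 700 cm", "/Fm0 Do", "Q" and "EMC", which mirrors the order of the beginMarkedContent, saveGraphicsState, transform, drawForm, restoreGraphicsState and endMarkedContent calls in the snippet.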