[jira] [Comment Edited] (PDFBOX-4007) Merged documents don't retain tags

Dave Hill (JIRA) Wed, 10 Jan 2018 06:59:48 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-4007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320386#comment-16320386
 ]


Dave Hill edited comment on PDFBOX-4007 at 1/10/18 2:58 PM:
------------------------------------------------------------

Tilman, I am responding here to your comment on PDFBOX-3999. Yes, this issue has 
been quite the stumbling block for our project, but I have not succeeded in 
correcting it, which is why I haven't supplied new code.

"I haven't worked much on the merge myself because I haven't understood 
everything about tagged PDFs." I'm sure I don't have anything to teach you, but 
I will share what I understand. I will upload "FourFontsTagged.pdf", which 
represents our best understanding of tagging. This file is human-readable text 
with comments. The important highlights are as follows:

Root object 6 0 contains the entry /MarkInfo <</Marked true>>, which turns 
tagging "on", and the entry /StructTreeRoot 18 0 R, which points to the start 
of the tagging structure.

Object 18 0 holds two references that both lead to the same array of tag 
objects. We don't know why there are two. The first is /K 41 0 R, which points 
to an object containing the array /K [23 0 R 22 0 R 21 0 R 20 0 R]; the second 
is /ParentTree 43 0 R, which points to 44 0, which in turn contains another 
copy of the array [23 0 R 22 0 R 21 0 R 20 0 R].

There are four examples of tags in this document, one for each object in the 
array. The first object, 23 0, looks like this:

{noformat}
23 0 obj        %actual tag object
<<
 /K 99                     %the ID# here matches the MCID ID# for the text to be tagged
 /C /Heading#201           %we don't know what this does
 /P 41 0 R                 %parent id of the /K
 /S /Hello#20in#20Italics  %tag title
 /Pg 9 0 R                 %the page object containing the tagged text
>>
endobj
{noformat}
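
As an aside on the names in that object: in PDF syntax, a `#` followed by two hex digits inside a name is an escape for that byte, so /Hello#20in#20Italics reads as "Hello in Italics" and /Heading#201 reads as "Heading 1". Below is a minimal Java sketch of that decoding. The class `PdfNameDecoder` is a hypothetical helper written for illustration, not part of PDFBox, and it assumes the raw name contains only ASCII characters outside the escapes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not a PDFBox class): decodes #xx escapes in a PDF name.
final class PdfNameDecoder {
    private PdfNameDecoder() {}

    static String decode(String rawName) {
        // Accept the name as written in the file, with or without the leading slash.
        String s = rawName.startsWith("/") ? rawName.substring(1) : rawName;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            boolean isEscape = c == '#'
                    && i + 2 < s.length()
                    && Character.digit(s.charAt(i + 1), 16) >= 0
                    && Character.digit(s.charAt(i + 2), 16) >= 0;
            if (isEscape) {
                // "#20" becomes the single byte 0x20 (a space).
                out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 3;
            } else {
                // Assumption: everything outside an escape is plain ASCII.
                out.write(c);
                i++;
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Calling `PdfNameDecoder.decode("/Hello#20in#20Italics")` on the /S value above gives the tag title as a reader would see it.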

When I created this document I set /K to 99 so that it stands out next to the 
matching tagged text. In the page's content stream, the tagged text is:

{noformat}
        /P <</MCID 99>>
        BDC
        0 50 Td
        /TimesItalic 48 Tf
        (Hello Times)Tj
        EMC
{noformat}

The /MCID defined with this text is how tag object 23 0 ties to this text. This 
document was created after reading
https://www.adobe.com/technology/pdfs/presentations/KingPDFTutorial.pdf
starting at page 86, where it discusses tagging.
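
To make the MCID link concrete, here is a rough Java sketch that pulls the MCID numbers out of a decoded content stream so they can be compared with the /K values of the tag objects. `McidScanner` is a hypothetical helper, not a PDFBox class. It assumes the stream is already decompressed text and that the property dictionary is written inline before BDC, as in the example above; property lists referenced by name through the page resources are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a PDFBox class): lists MCIDs found inline in a content stream.
final class McidScanner {
    // Matches an inline property dictionary followed by BDC, e.g. "<</MCID 99>> BDC".
    private static final Pattern MARKED_CONTENT =
            Pattern.compile("<<[^>]*?/MCID\\s+(\\d+)[^>]*?>>\\s*BDC");

    private McidScanner() {}

    static List<Integer> findMcids(String contentStream) {
        List<Integer> ids = new ArrayList<>();
        Matcher m = MARKED_CONTENT.matcher(contentStream);
        while (m.find()) {
            ids.add(Integer.parseInt(m.group(1)));
        }
        return ids;
    }
}
```

Run against the content stream fragment shown above, `McidScanner.findMcids(...)` returns a list containing 99, which is the same number stored in /K of object 23 0.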

I'm going to take one more swing at this code. I will almost certainly post 
questions here about the internal workings of PDFBox related to this issue.



> Merged documents don't retain tags
> ----------------------------------
>
>                 Key: PDFBOX-4007
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4007
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 2.0.8
>            Reporter: Dave Hill
>            Priority: Minor
>              Labels: StructureTree, merge
>         Attachments: HelloWorldTagged.pdf, PDFMergeUtility-2.patch, 
> PDFMergeUtility.patch, Tagged+GeneralForbearance-Merged.pdf, Tagged.pdf
>
>
> Certain combinations of documents don't retain tags when merged. The document 
> [^Tagged.pdf] is just a basic one-word PDF created and tagged with Pro DC. If 
> you try to merge this with the government [General Forbearance 
> form|https://studentloans.gov/myDirectLoan/downloadForm.action?searchType=library&shortName=general&localeCode=en-us],
>  the output crashes DC when you try to view the tags. If you use a flattened 
> version of the General Forbearance form, then the tags are just munged.
> {code}
>     public static void main(String[] args) throws Exception {
>         PDFMergerUtility pdfMergerUtility = new PDFMergerUtility();
>         PDDocument src = PDDocument.load(new File("Tagged.pdf"));
>         PDDocument dest = PDDocument.load(new File("GeneralForbearance.pdf"));
>         pdfMergerUtility.appendDocument(dest, src);
>         src.close();
>         dest.save(new File("BrokenTags.pdf"));
>         dest.close();
>     }
> {code}
> The included patch appears to make tagging more reliable, but I'm still 
> relying heavily on cloning, which can apparently cause other issues. The 
> documents I get out with this code seem to present correctly in Adobe readers 
> for all combinations of documents that I tested against.
> My patch is made and tested against yesterday's production head, and it 
> includes my changes from 
> [PDFBOX-3999|https://issues.apache.org/jira/browse/PDFBOX-3999] since it is 
> in the exact same place in the code.
> This is a blocker for Section 508 compliance of merged documents, but I 
> guessed it to be more of a minor issue in the overall scheme of things; 
> please correct me if I am mistaken.
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

