[jira] [Updated] (SOLR-10934) create a link+anchor checker for the ref-guide PDF using PDFBox

Hoss Man (JIRA) Mon, 30 Oct 2017 17:38:37 -0700

     [ 
https://issues.apache.org/jira/browse/SOLR-10934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Hoss Man updated SOLR-10934:
----------------------------
    Attachment: SOLR-10934.patch



Ok, I'm attaching a really rough and dirty patch that includes:

* A quick and dirty CheckPDFLinksAndAnchors inspired by the SO post mentioned 
and the original PrintURLs.java demo from pdfbox
* a build.xml 'nocommit' target to run it against our PDF
* some "broken" changes to our ref-guide content to deliberately introduce a few 
errors...
*# anchor duplicated in multiple source pages
*# links to each of the different dup anchors
*# link to an anchor that doesn't exist in the specified source doc, but does 
exist in a different doc
*# links to a source doc that doesn't exist
*# links to an anchor that doesn't exist (in a source doc that does)

The results aren't promising...

# FAIL: the dup anchors cause asciidoctor to print a WARNING (even w/o any link 
checking) that i'd forgotten about, but as far as i can tell from my 
exploration of the {{PDDocumentCatalog}} that duplicated information is lost in 
the underlying PDF (or if it does make it into the PDF, PDFBox loses it when 
parsing the PDF, because the "Catalog" is just a Map)
# FAIL: the PDF Annotations to each of the dup links both wind up mapping to 
the page with the first occurrence -- again: either because the catalog in the 
file can only track one location for a given anchor, or because that's just how 
PDFBox deals with the precedence of dup dict keys when reading the file
# FAIL: if an anchor doesn't exist in the specified source {{\*.adoc}} file, 
but does exist somewhere else in the final PDF, then that's where asciidoctor 
points the generated link -- there's nothing weird about it i can detect from 
PDFBox
# GOOD: links to a source {{\*.adoc}} file that doesn't actually exist on disk 
are fairly easy to detect -- asciidoctor's default behavior is to assume that 
these are links to other docs that will be converted separately, so they show 
up as "relative URIs" which we can treat as a failure (ie: if a link in a PDF 
is to a non-absolute URI, it must be a content error)
# GOOD: links to an anchor that doesn't exist are likewise easy to identify: 
the "annotation" is preserved but has no destination, which we can treat as a 
failure.
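
For illustration only (this is not the code in the attached patch), the two 
"GOOD" checks above boil down to something like the sketch below. It uses 
plain Java with no PDFBox dependency; the class name, enum, and method are 
hypothetical. The assumption is that a caller has already pulled the URI 
string (if any) and the "has a page destination" flag out of each link 
annotation, e.g. via PDFBox's {{PDAnnotationLink}} and {{PDActionURI}}.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper: decides whether one PDF link target should fail the build. */
final class PdfLinkTargetCheck {

    enum Result { OK, RELATIVE_URI, MALFORMED_URI, NO_DESTINATION }

    /**
     * @param uri                the link's URI string, or null if the link is
     *                           an internal (in-document) link
     * @param hasPageDestination true if an internal link resolved to a page
     */
    static Result classify(String uri, boolean hasPageDestination) {
        if (uri != null) {
            try {
                // A URI with no scheme is "relative": asciidoctor assumed the
                // target doc would be converted separately, so for a single
                // PDF it must be a content error.
                return new URI(uri).isAbsolute() ? Result.OK : Result.RELATIVE_URI;
            } catch (URISyntaxException e) {
                return Result.MALFORMED_URI;
            }
        }
        // Internal link: the annotation survives even when its anchor does
        // not exist, but then it has no destination.
        return hasPageDestination ? Result.OK : Result.NO_DESTINATION;
    }

    public static void main(String[] args) {
        System.out.println(classify("nocommit_bogus_page.pdf#nocommit_bogus_x2", false));
        System.out.println(classify(null, false));
        System.out.println(classify("https://issues.apache.org/jira/browse/SOLR-10934", false));
    }
}
```

Note that this only covers the last two cases in the list; the three "FAIL" 
cases produce links that would classify as OK here, which is exactly the 
problem.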

The important bits of the output w/this patch are included below...

{noformat}
-build-raw-pdf:
[asciidoctor:convert] Render SolrRefGuide-all.adoc from 
/home/hossman/lucene/dev/solr/build/solr-ref-guide/content/pdf to 
/home/hossman/lucene/dev/solr/build/solr-ref-guide/pdf-tmp with backend=pdf
[asciidoctor:convert] asciidoctor: ERROR: about-this-guide.adoc: line 1: 
invalid part, must have at least one section (e.g., chapter, appendix, etc.)
[asciidoctor:convert] asciidoctor: ERROR: solr-glossary.adoc: line 1: invalid 
part, must have at least one section (e.g., chapter, appendix, etc.)
[asciidoctor:convert] asciidoctor: WARNING: errata.adoc: line 30: id assigned 
to section already in use: nocommit_dup_anchor_name
[asciidoctor:convert] asciidoctor: ERROR: SolrRefGuide-all.adoc: line 37: 
invalid part, must have at least one section (e.g., chapter, appendix, etc.)
     [move] Moving 1 file to 
/home/hossman/lucene/dev/solr/build/solr-ref-guide/pdf-tmp
...
nocommit:
     [java] Page 753:'Link to bogus page @ anchor that does not exist'=> BOGUS 
URI: nocommit_bogus_page.pdf#nocommit_bogus_x2
     [java] Page 753:'Link to about @ anchor that does not exist' => link with 
no page dest

{noformat}

----

All in all these results are disappointing.

The "Single Page" output behavior of asciidoctor, combined with the "bugs" in 
asciidoctor's handling of duplicated anchors in page includes, combined with the 
underlying structure of the PDF, make it really hard to find the same types of 
failures we can find when parsing the jekyll generated pages using our 
white-box knowledge of "there must be no dup anchors across all pages".
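
To make that white-box rule concrete, here is a minimal sketch of a 
"no dup anchors across all pages" check over per-page anchor lists. It is 
not the actual CheckLinksAndAnchors.java code; the class name, method, and 
the page names in the usage example are hypothetical, and it assumes the 
anchor ids have already been extracted from each generated page.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: finds anchor ids declared more than once across pages. */
final class DuplicateAnchorCheck {

    /**
     * @param anchorsByPage page name -> anchor ids declared on that page
     * @return anchor id -> pages declaring it, only for ids declared 2+ times
     */
    static Map<String, List<String>> findDuplicates(Map<String, List<String>> anchorsByPage) {
        Map<String, List<String>> pagesByAnchor = new TreeMap<>();
        for (Map.Entry<String, List<String>> page : anchorsByPage.entrySet()) {
            for (String anchor : page.getValue()) {
                // Keep every declaring page, unlike a plain anchor -> page
                // map, which would silently keep a single location.
                pagesByAnchor.computeIfAbsent(anchor, k -> new ArrayList<>())
                             .add(page.getKey());
            }
        }
        // Drop anchors that were declared exactly once.
        pagesByAnchor.values().removeIf(pages -> pages.size() < 2);
        return pagesByAnchor;
    }

    public static void main(String[] args) {
        Map<String, List<String>> anchorsByPage = new LinkedHashMap<>();
        anchorsByPage.put("page-a.html", List.of("intro", "nocommit_dup_anchor_name"));
        anchorsByPage.put("page-b.html", List.of("nocommit_dup_anchor_name"));
        findDuplicates(anchorsByPage).forEach((anchor, pages) ->
            System.out.println("dup anchor " + anchor + " in " + pages));
    }
}
```

The point of the multimap shape is that the duplicate is still visible at 
check time; once everything is collapsed into one PDF with a single entry 
per anchor name, that information appears to be gone.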


> create a link+anchor checker for the ref-guide PDF using PDFBox
> ---------------------------------------------------------------
>
>                 Key: SOLR-10934
>                 URL: https://issues.apache.org/jira/browse/SOLR-10934
>             Project: Solr
>          Issue Type: Sub-task
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: documentation
>            Reporter: Hoss Man
>         Attachments: SOLR-10934.patch
>
>
> We currently have CheckLinksAndAnchors.java which is automatically run 
> against the ref-guide HTML as part of the build to use JSoup to find bad 
> links/anchors that asciidoctor doesn't complain about -- but not everyone 
> does/can build the HTML version of the ref-guide since it requires 
> manually installing jekyll.
> The PDF build only requires things installed by ivy (via JRuby) and we 
> already have some PDFBox based code in ReducePDFSize.java that operates on 
> this PDF every time it's run -- so if we can find a way to do similar checks 
> using the PDFBox API we could catch these broken links faster.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
