Hello,

When I run the CLI from the latest build of the Tika application jar
with the -h option on testAnnotations.pdf (from the parsers' test
documents folder, added in TIKA-738), the output contains two "<p>"
start tags but three "</p>" end tags.  Attempting to open this file in
the GUI also causes it to crash with an NPE, the same one described in
TIKA-778.  I see in issue PDFBOX-1143 that the code introduced for
TIKA-738 will go away once that PDFBox issue is resolved, but perhaps
in the meantime PDF2XHTML.java should be modified so that the number
of "</p>" end tags matches the number of "<p>" start tags:  should one
of the 'handler.endElement("p");' lines be removed from the endPage
method?

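In case it helps with reproducing the mismatch, here is a rough sketch
of how the tags could be counted in the CLI output.  The class and
method names are made up for illustration and are not part of Tika,
and the sample string only mimics the imbalance rather than being the
actual output for testAnnotations.pdf:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: counts <p> start and end tags
// in a string of (X)HTML so that an imbalance is easy to spot.
class ParagraphTagCount {

    // Counts non-overlapping matches of the regex in the text.
    static int count(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Start tags: "<p>" or "<p attr=...>". Tags such as "<pre>" and the
    // self-closing form "<p/>" are deliberately not counted.
    static int openCount(String xhtml) {
        return count("<p(\\s[^>]*)?>", xhtml);
    }

    // End tags: "</p>".
    static int closeCount(String xhtml) {
        return count("</p\\s*>", xhtml);
    }

    public static void main(String[] args) {
        // Made-up sample with one stray end tag.
        String sample = "<div class=\"page\"><p>a</p><p>b</p></p></div>";
        System.out.println(openCount(sample) + " start, "
                + closeCount(sample) + " end");
    }
}
```

A regex count like this only shows that the totals differ; it does not
say which endElement call is the extra one.
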
Thanks,
John Mastarone
