[jira] [Updated] (TIKA-742) PDF2XHTML fails to insert <p> nor space around page marker

Michael McCandless (Updated) (JIRA) Tue, 04 Oct 2011 03:36:58 -0700

     [ 
https://issues.apache.org/jira/browse/TIKA-742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Michael McCandless updated TIKA-742:
------------------------------------

    Attachment: TIKA-742.patch

Patch.
                
> PDF2XHTML fails to insert <p> nor space around page marker
> ----------------------------------------------------------
>
>                 Key: TIKA-742
>                 URL: https://issues.apache.org/jira/browse/TIKA-742
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 1.0
>
>         Attachments: 000086.pdf, TIKA-742.patch
>
>
> I have a test document (unfortunately not committable) whose page
> numbers are rendered with no separator (neither <p> nor a space) before
> the next word.  So I have words like:
>   * 1Massachusetts
>   * 2Course
>   * 3also
>   * 4The
> But when I ran the ExtractText -html command-line from PDFBox, I
> could see that <p> is inserted after these page numbers (spookily,
> without closing the previous <p>; I opened PDFBOX-1130 for that).
> So I made a simple change to Tika's PDF2XHTML: it now overrides
> writeStart/EndParagraph and calls handler.start/EndElement("p"), i.e. it
> passes the paragraph structure that PDFBox detects through to the
> resulting XHTML handler. This fixes the issue (I now see the page
> number as a separate paragraph, rendered with a newline in "text" mode
> from TikaCLI).
> Note that this test document is the same document from PDFBOX-1129
> (there are some quote characters that are not extracted correctly).
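
The change described in the quoted report can be illustrated with a minimal, self-contained sketch. This is not the attached patch: StripperStandIn is a hypothetical stand-in for the PDFBox text stripper, and the hook names writeParagraphStart/writeParagraphEnd are placeholders for the "writeStart/EndParagraph" hooks mentioned above, not verified PDFBox signatures. Only the JDK's SAX classes are used, and the recording handler exists purely to make the output visible.

```java
import java.io.IOException;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class ParagraphSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Hypothetical stand-in for the text stripper: it "detects" paragraphs
    // and calls the start/end hooks around each one. By default the hooks
    // do nothing, so paragraph boundaries are lost.
    static abstract class StripperStandIn {
        protected void writeParagraphStart() throws IOException { }
        protected void writeParagraphEnd() throws IOException { }
        protected abstract void writeString(String text) throws IOException;

        final void process(String[] paragraphs) throws IOException {
            for (String text : paragraphs) {
                writeParagraphStart();
                writeString(text);
                writeParagraphEnd();
            }
        }
    }

    // Overrides the hooks so each detected paragraph becomes a <p> element
    // on the SAX handler, which is the idea described in the report.
    static class XhtmlForwardingStripper extends StripperStandIn {
        private final ContentHandler handler;

        XhtmlForwardingStripper(ContentHandler handler) {
            this.handler = handler;
        }

        @Override
        protected void writeParagraphStart() throws IOException {
            try {
                handler.startElement(XHTML, "p", "p", new AttributesImpl());
            } catch (SAXException e) {
                throw new IOException("Unable to start a paragraph", e);
            }
        }

        @Override
        protected void writeParagraphEnd() throws IOException {
            try {
                handler.endElement(XHTML, "p", "p");
            } catch (SAXException e) {
                throw new IOException("Unable to end a paragraph", e);
            }
        }

        @Override
        protected void writeString(String text) throws IOException {
            try {
                handler.characters(text.toCharArray(), 0, text.length());
            } catch (SAXException e) {
                throw new IOException("Unable to write text", e);
            }
        }
    }

    // Runs the sketch against a handler that records events as markup.
    static String render(String[] paragraphs) throws IOException {
        final StringBuilder out = new StringBuilder();
        ContentHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                out.append('<').append(qName).append('>');
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                out.append("</").append(qName).append(">\n");
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        new XhtmlForwardingStripper(recorder).process(paragraphs);
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // The page number and the following text arrive as separate paragraphs.
        System.out.print(render(new String[] {"1", "Massachusetts"}));
    }
}
```

With the hooks forwarded, the page number "1" and the word "Massachusetts" land in separate <p> elements instead of running together as "1Massachusetts". Wrapping SAXException in IOException is a choice made for this sketch so the hooks can keep an IOException-only signature.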

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
