Extracting PDF text without soft line breaks

Clément Doumouro via user Mon, 12 Feb 2024 01:46:47 -0800

Hi all,

I need to extract PDF text without soft line breaks in order to process PDF
content as part of an NLP pipeline (NER). Soft line breaks that appear as hard
line breaks in the text content are responsible for most of my NER model
errors.


When PDF text is extracted with line breaks, it's impossible for me to
post-process the content and distinguish soft line breaks from hard/real
line breaks, so I would like to avoid post-processing extracted text and
rather handle line breaks differently when extracting.

After adding a few breakpoints in the source code, I have the feeling that my
problem would be solved by overriding
AbstractPDF2XHTML/PDFTextStripper.handleLineSeparation in a subclass (let's
call it PDF2XHTMLWithoutSoftBreaks) so that it does not write the
lineSeparator when a new line is detected and we're not starting a new
paragraph. I would then have to use my new PDF2XHTMLWithoutSoftBreaks inside
a new PDFParserWithoutSoftBreaks and configure Tika to use that parser for
PDFs.

Doing so sounds very heavy and would require rewriting a lot more code than
just the handleLineSeparation method, since that method is actually private.
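
To make the behavior I am after concrete, here is a minimal, self-contained
sketch of the rule described above. It is a toy model only and does not use
any PDFBox or Tika classes: SoftBreakJoiner, ExtractedLine and the
startsParagraph flag are hypothetical names, and it assumes the extractor
already knows, for each line, whether that line starts a new paragraph.

```java
import java.util.List;

// Toy model, not PDFBox or Tika code: all names here are hypothetical and
// exist only to illustrate the intended line-break handling.
final class SoftBreakJoiner {

    /** One extracted line, plus whether the extractor detected a paragraph start. */
    static final class ExtractedLine {
        final String text;
        final boolean startsParagraph;

        ExtractedLine(String text, boolean startsParagraph) {
            this.text = text;
            this.startsParagraph = startsParagraph;
        }
    }

    /** Joins lines, keeping a newline only where a new paragraph starts. */
    static String join(List<ExtractedLine> lines) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.size(); i++) {
            ExtractedLine line = lines.get(i);
            if (i > 0) {
                // Hard break at a paragraph boundary; a soft break becomes a space.
                out.append(line.startsParagraph ? "\n" : " ");
            }
            out.append(line.text);
        }
        return out.toString();
    }
}
```

The point of the sketch is that the decision is made while the paragraph
information is still available, which is exactly what is lost once the text
has been written out with uniform line separators.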

I wanted to ask whether there are any alternative approaches that do not
involve post-processing (since, as mentioned above, the soft line break
information is lost during extraction and cannot be recovered precisely
afterwards)?

Thank you for your help!
Best,

Clément Doumouro
Machine Learning Engineer