Hi,

Can you share a non-confidential file and explain what you did, what you got, and what you want instead? I also fail to grasp the first sentence. Do you want soft line breaks, or no line breaks at all?

Tilman

On 12.02.2024 10:46, Clément Doumouro via user wrote:
Hi all,

I need to extract PDF text without soft line breaks in order to process PDF content as part of an NLP pipeline (NER). Soft line breaks that appear as hard line breaks in the text content are responsible for most of my NER model errors.

When PDF text is extracted with line breaks, it's impossible for me to post-process the content and distinguish soft line breaks from hard/real line breaks, so I would like to avoid post-processing extracted text and rather handle line breaks differently when extracting.

After adding a few breakpoints in the source code, I have the feeling that my problem would be solved by overriding AbstractPDF2XHTML/PDFTextStripper.handleLineSeparation in a subclass (let's call it PDF2XHTMLWithoutSoftBreaks) so that it does not write the lineSeparator when a new line is detected and we're not starting a new paragraph. Then I would have to use my new PDF2XHTMLWithoutSoftBreaks inside a new PDFParserWithoutSoftBreaks and then configure Tika to use that parser for PDFs.

Doing so sounds very heavy and would require rewriting a lot more code than just the handleLineSeparation method, since that method is actually private.
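
To make the idea concrete, here is a minimal, self-contained sketch of the behaviour I am after. It does not use the Tika or PDFBox API: LineWriter, LineWriterWithoutSoftBreaks and their methods are hypothetical stand-ins, and the sketch assumes the separator hook can be overridden, which is exactly what the private handleLineSeparation prevents in the real classes.

```java
import java.util.List;

public class SoftBreakSketch {

    /** Hypothetical stand-in for a text stripper; NOT the PDFBox/Tika class. */
    static class LineWriter {
        protected final StringBuilder out = new StringBuilder();

        // Hook called between two lines of the same paragraph (soft break).
        protected void writeLineSeparator() {
            out.append('\n');
        }

        // Hook called between two paragraphs (hard break).
        protected void writeParagraphSeparator() {
            out.append('\n');
        }

        // Each inner list holds the layout lines of one paragraph.
        public String render(List<List<String>> paragraphs) {
            out.setLength(0);
            for (int p = 0; p < paragraphs.size(); p++) {
                if (p > 0) {
                    writeParagraphSeparator();
                }
                List<String> lines = paragraphs.get(p);
                for (int i = 0; i < lines.size(); i++) {
                    if (i > 0) {
                        writeLineSeparator();
                    }
                    out.append(lines.get(i));
                }
            }
            return out.toString();
        }
    }

    /** Variant that joins soft-broken lines with a space instead of a newline. */
    static class LineWriterWithoutSoftBreaks extends LineWriter {
        @Override
        protected void writeLineSeparator() {
            out.append(' ');
        }
    }

    public static void main(String[] args) {
        List<List<String>> paragraphs = List.of(
                List.of("Soft breaks come", "from the page layout."),
                List.of("Hard breaks end", "a paragraph."));
        System.out.println(new LineWriterWithoutSoftBreaks().render(paragraphs));
    }
}
```

With the default LineWriter, both kinds of break come out as a newline and can no longer be told apart. With the subclass, only paragraph boundaries remain as newlines, which is the output I would like to get.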

I wanted to ask whether there are any alternative approaches that do not involve post-processing (since, as mentioned above, the soft line break information is lost during extraction and cannot be recovered precisely afterwards).

Thank you for your help!
Best,

Clément Doumouro
Machine Learning Engineer