Hi,
Can you share a non-confidential file and explain what you did, what you
got, and what you want instead? I also fail to grasp the first sentence. Do
you want soft line breaks, or no line breaks at all?
Tilman
On 12.02.2024 10:46, Clément Doumouro via user wrote:
Hi all,
I need to extract PDF text without soft line breaks in order to
process PDF content as part of an NLP pipeline (NER). Soft line breaks
appearing as hard line breaks in the text content are responsible for
most of my NER model errors.
When PDF text is extracted with line breaks, it's impossible for me to
post-process the content and distinguish soft line breaks from
hard/real line breaks, so I would like to avoid post-processing
extracted text and rather handle line breaks differently when extracting.
After adding a few breakpoints in the source code, I have the feeling that
my problem would be solved by overriding
AbstractPDF2XHTML/PDFTextStripper.handleLineSeparation in a subclass
(let's call it PDF2XHTMLWithoutSoftBreaks) so that it does not write the
lineSeparator when a new line is detected and we're not starting a new
paragraph. I would then have to use my new PDF2XHTMLWithoutSoftBreaks
inside a new PDFParserWithoutSoftBreaks and configure Tika to use that
parser for PDFs.
Doing so sounds very heavy and would require rewriting a lot more code
than just the handleLineSeparation method, since that method is actually
private.
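To make the intended behaviour concrete, here is a minimal, self-contained
sketch of the decision described above. It does not use the Tika or PDFBox
APIs. The class SoftBreakJoiner, the Line type and the startsParagraph flag
are all hypothetical names, and the sketch assumes the extractor already
knows, for each line, whether it starts a new paragraph.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: none of these names exist in Tika or PDFBox.
class SoftBreakJoiner {

    // One extracted line plus the layout decision made for it during extraction.
    static final class Line {
        final String text;
        // Assumption: this flag comes from the extractor's own paragraph detection.
        final boolean startsParagraph;

        Line(String text, boolean startsParagraph) {
            this.text = text;
            this.startsParagraph = startsParagraph;
        }
    }

    // Joins lines, writing "\n" only before a line that starts a paragraph
    // and a single space for every other (soft) line break.
    static String join(List<Line> lines) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.size(); i++) {
            Line line = lines.get(i);
            if (i > 0) {
                out.append(line.startsParagraph ? "\n" : " ");
            }
            out.append(line.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Line> lines = Arrays.asList(
            new Line("Soft line breaks appear", true),
            new Line("as hard line breaks.", false),
            new Line("A new paragraph.", true));
        System.out.println(join(lines));
    }
}
```

The sketch replaces every soft break with a space and does not try to
rejoin words hyphenated across lines, which a real implementation would
probably need to handle.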
I wanted to ask whether there are any alternative approaches that do not
involve post-processing (since, as mentioned above, we lose the soft line
break information during extraction and can't recover it precisely
afterwards)?
Thank you for your help!
Best,
Clément Doumouro
Machine Learning Engineer
+1 301-244-8803 | ICIJ.org <https://icij.org/> | [email protected]
PGP key
<https://keys.openpgp.org/vks/v1/by-fingerprint/DFA5082713A7D671F384489886AB2EB14650FD50>
1730 Rhode Island Ave NW, Suite 317 | Washington D.C. 20036 | United
States