you might simply insert a space wherever you find two consecutive upper case letters which are painted in boldface font.
Hmm, doing so will make extraction much slower. And how can I tell whether text is painted in bold?

Best regards,
Hesham


--------------------------------------

Hello there,


There is one more thing I noticed: the phrase "A Worldly" on the same line of the PDF was also extracted as "AWorldly", without the space.
You can check it in this file:
http://www.4shared.com/file/186430363/628fea7f/Enter-sample2.html


The phrase "A Worldly" occurs in the title section of the article and
is painted using a boldface font.

To my knowledge, PDFBox is not very sophisticated and uses the same
word separation detection algorithm for normal, italic and boldface
fonts alike. However, as this issue demonstrates, it might be justified
to tweak some threshold values etc. in a font-dependent manner.

In the meantime, to overcome this particular problem, you might
simply insert a space wherever you find two consecutive upper case
letters which are painted in boldface font.
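
A minimal sketch of that workaround follows, in plain Java with no PDFBox
calls. The class BoldCapsSpacer and its methods looksBold and insertSpaces
are hypothetical names made up for illustration. It assumes you already
have, for each extracted character, the name of the font it was painted
with (as far as I know, PDFBox exposes the font through
TextPosition.getFont() while stripping text). It also assumes that a bold
font can be recognised by the word "bold" in its name, which is only a
heuristic and will not hold for every PDF.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper, not part of PDFBox: names and signatures are illustrative.
class BoldCapsSpacer {

    // Heuristic (assumption): the weight is encoded in the font name,
    // e.g. "Times-Bold" or "ABCDEF+Arial,BoldItalic".
    static boolean looksBold(String fontName) {
        return fontName != null
                && fontName.toLowerCase(Locale.ROOT).contains("bold");
    }

    // text.charAt(i) is assumed to have been painted with fontNames[i].
    // Inserts a space between two adjacent uppercase letters when both are bold.
    static String insertSpaces(String text, String[] fontNames) {
        if (text.length() != fontNames.length) {
            throw new IllegalArgumentException("one font name per character expected");
        }
        StringBuilder out = new StringBuilder(text.length() + 8);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (i > 0) {
                char prev = text.charAt(i - 1);
                boolean bothUpper = Character.isUpperCase(prev) && Character.isUpperCase(c);
                boolean bothBold = looksBold(fontNames[i - 1]) && looksBold(fontNames[i]);
                if (bothUpper && bothBold) {
                    out.append(' ');
                }
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "AWorldly";
        String[] fonts = new String[text.length()];
        Arrays.fill(fonts, "Times-Bold"); // pretend the whole title is bold
        System.out.println(insertSpaces(text, fonts)); // prints "A Worldly"
    }
}
```

Note that this rule also splits bold acronyms ("PDF" would become "P D F"),
so it is probably only safe when restricted to the lines or pages where
the problem actually shows up.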


VR
