Hello there,

> There is another notice ... A phrase "A Worldly" in the same line in the PDF
> was extracted also as "AWorldly" without space !!
> You can check it in this file :
> http://www.4shared.com/file/186430363/628fea7f/Enter-sample2.html
The phrase "A Worldly" occurs in the title section of the article and is painted in a boldface font. To my knowledge, PDFBox is not very sophisticated here: it uses the same word-separation detection algorithm for normal, italic, and boldface fonts alike. As this issue demonstrates, it might be justified to tune some of the threshold values in a font-dependent manner.

In the meantime, to work around this particular problem, you could simply insert a space wherever you find two consecutive upper-case letters that are both painted in a boldface font.

VR
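The workaround could be sketched roughly as follows. This is only a minimal illustration in plain Java, not PDFBox code: the class `BoldCapsSpacer`, the method `insertSpaces`, and the `bold` array are hypothetical names, and it assumes you have already collected, for each extracted character, whether it was painted in a boldface font (for example by inspecting the font of each text position during extraction).

```java
// Hypothetical helper, not part of PDFBox.
final class BoldCapsSpacer {
    private BoldCapsSpacer() {}

    /**
     * Inserts a space between two adjacent upper-case letters
     * when both of them were painted in boldface.
     *
     * @param text extracted characters
     * @param bold bold[i] is true if text.charAt(i) was painted in boldface
     *             (assumed to be gathered by the caller during extraction)
     */
    static String insertSpaces(String text, boolean[] bold) {
        if (text.length() != bold.length) {
            throw new IllegalArgumentException("text and bold must have the same length");
        }
        StringBuilder out = new StringBuilder(text.length() + 8);
        for (int i = 0; i < text.length(); i++) {
            char current = text.charAt(i);
            if (i > 0) {
                char previous = text.charAt(i - 1);
                boolean bothBold = bold[i - 1] && bold[i];
                boolean bothUpper = Character.isUpperCase(previous)
                        && Character.isUpperCase(current);
                if (bothBold && bothUpper) {
                    out.append(' '); // likely a missed word boundary
                }
            }
            out.append(current);
        }
        return out.toString();
    }
}
```

With all eight characters of "AWorldly" flagged as bold, `insertSpaces` yields "A Worldly". Note that the rule is crude: a boldface acronym such as "PDF" would be split as well, so it is probably best restricted to the regions (e.g. titles) where the problem actually shows up.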

