I tried running your code but couldn't: it was written for an older version of PDFBox (probably 1.8), it has a syntax error, and the parameters are missing, so I doubt it ever ran in that form. I then ran ExtractText on PDFBox 1.8, and yes, many blanks are missing. So please use the current version, 2.0.8. With 2.0.8 I found one occurrence where the blank was missing ("Wewould"), but Adobe Reader has the same problem.
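
A likely reason a blank can be missing in both PDFBox and Adobe Reader (an assumption, not verified for this particular file): PDFs often contain no explicit space characters, so an extractor has to infer blanks from the horizontal gap between neighbouring glyphs. If the PDF places two words very close together, any such heuristic will merge them. The toy sketch below only illustrates that idea; it is not PDFBox's actual algorithm, and the Glyph type, the layout and join helpers, and the 0.3 gap factor are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class GapSpacingSketch {

    /** Hypothetical glyph record: the character, its left edge and its width (page units). */
    public static final class Glyph {
        final char ch;
        final double x;
        final double width;

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Places the characters of text left to right with a fixed glyph width.
     * Blanks are not stored as glyphs; they only advance the position by
     * spaceWidth, which mimics a PDF that positions words without space characters.
     */
    public static List<Glyph> layout(String text, double glyphWidth, double spaceWidth) {
        List<Glyph> glyphs = new ArrayList<>();
        double x = 0.0;
        for (char ch : text.toCharArray()) {
            if (ch == ' ') {
                x += spaceWidth;
            } else {
                glyphs.add(new Glyph(ch, x, glyphWidth));
                x += glyphWidth;
            }
        }
        return glyphs;
    }

    /**
     * Rebuilds the text, inserting a blank whenever the gap to the previous
     * glyph is larger than gapFactor times that glyph's width.
     */
    public static String join(List<Glyph> glyphs, double gapFactor) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width);
                if (gap > gapFactor * prev.width) {
                    sb.append(' ');
                }
            }
            sb.append(g.ch);
            prev = g;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Wide word gap: the blank is recovered.
        System.out.println(join(layout("We would", 5.0, 3.0), 0.3));
        // Narrow word gap: the gap stays below the threshold and the words merge.
        System.out.println(join(layout("We would", 5.0, 1.0), 0.3));
    }
}
```

In this sketch the first call yields "We would" and the second yields "Wewould", because a gap of 1.0 does not exceed the threshold of 0.3 * 5.0 = 1.5. When the gap in the file itself is that small, changing the extraction library does not help.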

Tilman


On 25.01.2018 at 04:22, Laxmi Narayan wrote:

Hi Team,

I have a problem extracting text from a PDF. When we extract the text, words merge together. Can you suggest what we should do about this?

I have attached the PDF file from which I am extracting the text, and I am using the code below to extract it.

Please help me as soon as possible.

privatestatic string GetTextByArea_Orgnal(PDDocument doc, int x, int y, int w, int h)
{
    PDFTextStripperByArea stripper = new PDFTextStripperByArea("UTF-8");
    stripper.setLineSeparator(" ");
    stripper.setDropThreshold(3);
    stripper.setWordSeparator(" ");
    stripper.setParagraphStart("<p>");
    stripper.setParagraphEnd("</p>");
    stripper.setIndentThreshold(1);
    stripper.setSortByPosition(true);

    //==================
    //==================
    Dimension d = new Dimension(w, h);
    Rectangle rect = new Rectangle(new Point(x, y), d);
    stripper.addRegion("class1", rect);

    java.util.List allPages = doc.getDocumentCatalog().getAllPages();
    PDPage firstPage = (PDPage)allPages.get(0);

    //// overlay the region with a cyan rectangle to check if I got the coordinates and dimensions right
    PDPageContentStream contentStream = new PDPageContentStream(doc, firstPage, true, true);
    contentStream.setNonStrokingColor(Color.CYAN);
    contentStream.fillRect(x, y, w, h);
    contentStream.close();

    ////=============
    stripper.extractRegions(firstPage);
    return stripper.getTextForRegion("class1");
}

Thanks,

Laxmi Narayan


