I tried running your code but couldn't: it was written for an older
version of PDFBox (probably 1.8), it has a syntax error, and the
parameters are missing, so I doubt it ever ran in that form. I then ran
ExtractText on PDFBox 1.8 and yes, many blanks are missing. So please
use the current version, 2.0.8. I found one occurrence where the blank
was missing ("Wewould"), but Adobe Reader has the same problem.
Tilman
On 25.01.2018 at 04:22, Laxmi Narayan wrote:
Hi Team,
I have a problem extracting text from a PDF. When we extract the text,
words merge together. Can you suggest what we should do about this?
I have attached the PDF file from which I am extracting the text, and
I am using the code below to extract it.
Please help me as soon as possible.
private static string GetTextByArea_Orgnal(PDDocument doc, int x, int y, int w, int h)
{
    PDFTextStripperByArea stripper = new PDFTextStripperByArea("UTF-8");
    stripper.setLineSeparator(" ");
    stripper.setDropThreshold(3);
    stripper.setWordSeparator(" ");
    stripper.setParagraphStart("<p>");
    stripper.setParagraphEnd("</p>");
    stripper.setIndentThreshold(1);
    stripper.setSortByPosition(true);
    //==================
    //==================
    Dimension d = new Dimension(w, h);
    Rectangle rect = new Rectangle(new Point(x, y), d);
    stripper.addRegion("class1", rect);
    java.util.List allPages = doc.getDocumentCatalog().getAllPages();
    PDPage firstPage = (PDPage)allPages.get(0);
    //// overlay the region with a cyan rectangle to check if I got the coordinates and dimensions right
    PDPageContentStream contentStream = new PDPageContentStream(doc, firstPage, true, true);
    contentStream.setNonStrokingColor(Color.CYAN);
    contentStream.fillRect(x, y, w, h);
    contentStream.close();
    ////=============
    stripper.extractRegions(firstPage);
    return stripper.getTextForRegion("class1");
}
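A side note on the overlay check in the code above, offered as a sketch and not as part of the original post: as far as I know, PDFTextStripperByArea matches regions against text positions measured from the top-left corner of the page (y grows downward), while drawing through PDPageContentStream uses PDF user space with the origin at the bottom-left. If so, passing the same x, y, w, h to both would put the cyan rectangle somewhere other than the extracted region. The class name RegionCoordinates, the helper toPdfSpace and the page height of 792 points are hypothetical, and the sketch assumes an unrotated page whose crop box starts at (0, 0).

```java
import java.awt.geom.Rectangle2D;

public class RegionCoordinates {

    /**
     * Converts a region given in top-left-origin coordinates (assumed to be
     * what the text stripper uses) into bottom-left-origin PDF user space
     * (what content stream drawing uses). Only the y value changes.
     */
    public static Rectangle2D.Double toPdfSpace(Rectangle2D region, double pageHeight) {
        // The lower edge of the region, measured from the bottom of the page.
        double pdfY = pageHeight - region.getY() - region.getHeight();
        return new Rectangle2D.Double(region.getX(), pdfY,
                region.getWidth(), region.getHeight());
    }

    public static void main(String[] args) {
        // Hypothetical region: 100 pt below the top edge of a 792 pt high page.
        Rectangle2D region = new Rectangle2D.Double(50, 100, 200, 30);
        Rectangle2D.Double overlay = toPdfSpace(region, 792);
        System.out.println(overlay.getY()); // prints 662.0
    }
}
```

The converted rectangle would then be the one to draw as the overlay, while the unconverted one goes to addRegion.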
Thanks,
Laxmi Narayan