Here's what I get with the 2.0 version:
1/435 CÂNDIDO FELIX LOPESABEL DIAS LOPES 27-09-1964
FRANCISCA MARIA DIAS
This is mostly correct: "CÂNDIDO FELIX LOPES" sits on a higher line than
"ABEL DIAS LOPES". The only problem is the missing line break between the
two names, which can possibly be set with an option.
Assuming you want to extract all this to fill a database, you could also
try the unsorted output. The only problem there is getting the correct
count per page.
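
If the goal is a database, one way to sanity-check the per-page count is to look for lines that start a record. The following is only a minimal sketch: it assumes each record starts with a line shaped like "1/435 ... 27-09-1964" (as in the output above) and that "count" means the number of such lines. The class name RecordLineParser and its methods are made up for illustration and are not part of PDFBox.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RecordLineParser {
    // Assumed layout of a record's first line: "<n>/<total> <names> <dd-MM-yyyy>".
    private static final Pattern RECORD =
        Pattern.compile("^(\\d+)/(\\d+)\\s+(.+?)\\s+(\\d{2}-\\d{2}-\\d{4})$");

    /** Returns {index, total, names, date}, or null if the line does not start a record. */
    public static String[] parse(String line) {
        Matcher m = RECORD.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    /** Counts the lines in the extracted page text that look like the start of a record. */
    public static int countRecords(String pageText) {
        int count = 0;
        for (String line : pageText.split("\\R")) {
            if (parse(line) != null) {
                count++;
            }
        }
        return count;
    }
}
```

Note that the names column comes back as one string; since the two names are fused in the extracted text, this sketch does not try to split them.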
Tilman
On 16.01.2016 at 12:52, Diogo Ribeiro wrote:
Hi guys,
I'm using PDFBox 1.8.10 to extract some text from a PDF (see attachment).
The output lines are not correctly sorted.
Got:
1/435 S LOPES CÂNDIDO FELIX LOPESABEL DIA 27-09-1964
FRANCISCA MARIA DIAS
Was expecting:
1/435 ABEL DIAS LOPES CÂNDIDO FELIX LOPES 27-09-1964
FRANCISCA MARIA DIAS
My simple code:
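import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper; // package name in PDFBox 1.8.x

PDDocument pdf = PDDocument.load(new File(FILE_PATH));
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(1);
stripper.setEndPage(1);
// Order text by position on the page rather than by content-stream order.
stripper.setSortByPosition(true);
String plainText = stripper.getText(pdf);
System.out.println(plainText);
pdf.close();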
Thanks in advance.