Re: Extract text from PDF, wrong sort order

Tilman Hausherr Sat, 16 Jan 2016 05:10:58 -0800

Here's what I get with the 2.0 version:


1/435 CÂNDIDO FELIX LOPESABEL DIAS LOPES 27-09-1964
FRANCISCA MARIA DIAS

This is mostly correct: "CÂNDIDO FELIX LOPES" is on a higher line than "ABEL DIAS LOPES". The only problem is the missing line break; this can possibly be set with an option.

Assuming you want to extract all this to fill a database, you could also try the non-sorted output, i.e. setSortByPosition(false). The only problem then is getting the correct record count per page.
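To illustrate the per-page count, here is a minimal sketch. It assumes each record starts on a line of the form "index/total names dd-MM-yyyy", as in the sample output above. RecordCounter and recordIndexes are hypothetical names, not part of PDFBox, and the pattern is a guess based on that single sample line, so it may need adjusting for the real file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: finds record start lines in the
// text extracted for one page.
final class RecordCounter {

    // Assumed layout (from the sample line): "<index>/<total> <names> <dd-MM-yyyy>".
    private static final Pattern RECORD_START =
            Pattern.compile("^(\\d+)/(\\d+)\\s+(.*?)\\s+(\\d{2}-\\d{2}-\\d{4})\\s*$");

    private RecordCounter() {
    }

    /** Returns the index number of every record that starts in the given page text. */
    static List<Integer> recordIndexes(String pageText) {
        List<Integer> indexes = new ArrayList<>();
        // \R matches any line break, so the line separator used by the stripper does not matter.
        for (String line : pageText.split("\\R")) {
            Matcher m = RECORD_START.matcher(line.trim());
            if (m.matches()) {
                indexes.add(Integer.parseInt(m.group(1)));
            }
        }
        return indexes;
    }
}
```

Calling RecordCounter.recordIndexes(plainText).size() on the text of a single page gives the number of records found on that page. Continuation lines such as "FRANCISCA MARIA DIAS" are skipped because they do not match the pattern.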



Tilman

On 16.01.2016 at 12:52, Diogo Ribeiro wrote:

Hi guys,

I'm using PDFBox 1.8.10 to extract some text from a PDF (see attachment).

The output lines are not correctly sorted.

Got:

1/435 S LOPES CÂNDIDO FELIX LOPESABEL DIA 27-09-1964
FRANCISCA MARIA DIAS

Was expecting:

1/435 ABEL DIAS LOPES CÂNDIDO FELIX LOPES 27-09-1964
FRANCISCA MARIA DIAS

My simple code:

        // Load the document (PDFBox 1.8.x API).
        PDDocument pdf = PDDocument.load(new File(FILE_PATH));

        PDFTextStripper stripper = new PDFTextStripper();

        // Extract only the first page, sorted by position.
        stripper.setStartPage(1);
        stripper.setEndPage(1);
        stripper.setSortByPosition(true);

        String plainText = stripper.getText(pdf);
        pdf.close();

        System.out.println(plainText);


Thanks in advance.


