Hi,
> We've tried to extract text from a PDF.
> When we extract Korean text from a PDF file, the order of the text
> is broken, while English is extracted correctly.
> This does not mean that the Korean text is missing; it is extracted fine,
> but the sequence is wrong.
> The problem occurs when
> 1. the PDF file contains a table (chart), or
> 2. the characters differ in size from one another.
>
> When we extract a PDF that contains a table, the text in the lowest row
> shows up at the beginning and the text in the highest row at the end.
>
> e.g. | 가 | 나 |   (in the table)
>      | 다 | 라 |
>   -> 다라
>      가나          (extracted)
>
> And when the PDF has multiple text sizes and fonts,
> the text in the smallest and simplest font is extracted at the beginning, and
> the text in the largest and less simple font is extracted at the end.
>
> Please check whether this is a bug in extracting Korean.
>
> public static void extractStringfromPDF() throws IOException {
>     // Let the user pick the PDF (JavaFX FileChooser).
>     final FileChooser filechooser = new FileChooser();
>     File file = filechooser.showOpenDialog(null);
>     // try-with-resources closes the document even if extraction fails.
>     try (PDDocument document = PDDocument.load(file)) {
>         PDFTextStripper pdfStripper = new PDFTextStripper();
>         String text = pdfStripper.getText(document);
>
>         // Append the extracted text to a ".txt" file next to the input.
>         File txtFile = new File(file.getPath() + ".txt");
>         try (FileWriter fw = new FileWriter(txtFile, true)) {
>             fw.write(text);
>         }
>         System.out.println(text);
>     } catch (Exception e) {
>         e.printStackTrace();
>     }
> }
> The code above is what we used in our program.
Please try the setSortByPosition option
https://pdfbox.apache.org/docs/2.0.12/javadocs/org/apache/pdfbox/text/PDFTextStripper.html#setSortByPosition-boolean-
as this returns the text in "visual" order rather than in the order the text
objects appear in the PDF. Depending on the input
PDF, this might give you a better result.
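
To illustrate the difference between the two orders, here is a toy sketch in
plain Java. It is only an illustration under assumptions: the Chunk class, its
coordinates and the helper names are made up for this example, and this is not
how PDFBox implements the option internally. It uses the table from your report,
stored bottom row first, and shows what sorting by position changes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class VisualOrderSketch {

    /** Hypothetical positioned text fragment; a stand-in, not a PDFBox class. */
    static final class Chunk {
        final String text;
        final double x;        // distance from the left edge of the page
        final double yFromTop; // distance from the top edge of the page

        Chunk(String text, double x, double yFromTop) {
            this.text = text;
            this.x = x;
            this.yFromTop = yFromTop;
        }
    }

    /** Returns a copy sorted top to bottom, then left to right. */
    static List<Chunk> sortByPosition(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        Comparator<Chunk> byRow = Comparator.comparingDouble(c -> c.yFromTop);
        sorted.sort(byRow.thenComparingDouble(c -> c.x));
        return sorted;
    }

    /** Concatenates chunks in list order, starting a new line when the row changes. */
    static String join(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        Chunk previous = null;
        for (Chunk c : chunks) {
            if (previous != null && c.yFromTop != previous.yFromTop) {
                sb.append('\n');
            }
            sb.append(c.text);
            previous = c;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The table from the report, with the bottom row stored first in the file.
        List<Chunk> stored = Arrays.asList(
                new Chunk("다", 0, 20), new Chunk("라", 10, 20),
                new Chunk("가", 0, 10), new Chunk("나", 10, 10));

        System.out.println(join(stored));                 // stored order: 다라 then 가나
        System.out.println(join(sortByPosition(stored))); // visual order: 가나 then 다라
    }
}
```

In your own code, the equivalent step would be to call
pdfStripper.setSortByPosition(true) before pdfStripper.getText(document).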
Maruan