Key Hutu created PDFBOX-5613:
--------------------------------
Summary: Incorrect paragraph split
Key: PDFBOX-5613
URL: https://issues.apache.org/jira/browse/PDFBOX-5613
Project: PDFBox
Issue Type: Improvement
Components: Parsing, Text extraction
Affects Versions: 2.0.1
Reporter: Key Hutu
Attachments: Daily Report.pdf
When I use PDFBox to extract paragraph text, the paragraphs in the result are split incorrectly.
<code>
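// File: PDFParagraphTextStripper.java
import java.io.IOException;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFParagraphTextStripper extends PDFTextStripper {
    public PDFParagraphTextStripper() throws IOException {
        // Join the lines of a paragraph with a space instead of a line break.
        this.setLineSeparator(" ");
        // Emit exactly one line break at the end of each paragraph.
        this.setParagraphStart("");
        this.setParagraphEnd(this.LINE_SEPARATOR);
        // No extra markers around pages.
        this.setPageStart("");
        this.setPageEnd("");
        // Separate articles with line breaks.
        this.setArticleStart(this.LINE_SEPARATOR);
        this.setArticleEnd(this.LINE_SEPARATOR);
    }
}

// File: PdfParser.java
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfParser {
    private static final String dataPath =
            "D:\\IdeaProject\\PdfParser\\PdfParser\\data\\";

    public static void main(String[] args) {
        String fileName = "Daily Report.pdf";
        try {
            extract_pdfbox(dataPath + fileName);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static void extract_pdfbox(String filePath) throws Exception {
        File file = new File(filePath);
        // try-with-resources closes the document even if extraction fails.
        try (PDDocument document = PDDocument.load(file)) {
            PDFTextStripper pdfTextStripper = new PDFParagraphTextStripper();
            String text = pdfTextStripper.getText(document);
            System.out.println(text);
        }
    }
}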
</code>
<output>
Daily Report 1) which language is your text in? - English
2) some examples of sentences containing
addresses you'd want to pick up - Data are
contarct documents, it contains addresses in
different formates(of different
countries),some are comma saperated, some
are new line saperated etc 3) perhaps
examples of mistakes - currently en model
of SpaCy is even not able to tag entities
clearly 4) Are you training your own model
or are you using a model as is? - tried as it is
but very poor in results to need to know a
generic approach to train own model. any
</output>
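The output above suggests that each numbered item ("1)", "2)", ...) was meant to be its own paragraph, although the report does not say so explicitly. As a possible workaround under that assumption, the extracted text could be re-split at the item markers after extraction. The sketch below uses only the JDK and does not touch PDFBox; the class name NumberedItemSplitter and its split method are hypothetical helpers introduced for illustration, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical post-processing helper; assumes paragraphs start with "N) ".
public class NumberedItemSplitter {
    // Whitespace that is directly followed by an item marker such as "2) ".
    private static final Pattern ITEM_BOUNDARY =
            Pattern.compile("\\s+(?=\\d+\\)\\s)");

    /** Splits extracted text into one entry per numbered item. */
    public static List<String> split(String text) {
        // Collapse line breaks first so wrapped lines of one item are rejoined.
        String flattened = text.replaceAll("\\s+", " ").trim();
        List<String> items = new ArrayList<>();
        for (String part : ITEM_BOUNDARY.split(flattened)) {
            if (!part.isEmpty()) {
                items.add(part);
            }
        }
        return items;
    }

    public static void main(String[] args) {
        // Shortened excerpt of the output shown in this report.
        String extracted = "Daily Report 1) which language is your text in? - English\n"
                + "2) some examples of sentences containing\n"
                + "addresses you'd want to pick up";
        for (String item : split(extracted)) {
            System.out.println(item);
        }
    }
}
```

This only helps for documents whose paragraphs really do begin with such markers; any text that happens to contain a number followed by ")" and a space would also be treated as a boundary, so it is not a general fix for the stripper's paragraph detection.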