[ https://issues.apache.org/jira/browse/PDFBOX-5613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Key Hutu updated PDFBOX-5613:
-----------------------------
Description:
When I use PDFBox to extract paragraph text, the paragraph boundaries in the result are incorrect.
<code>
// PDFParagraphTextStripper.java
import java.io.IOException;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFParagraphTextStripper extends PDFTextStripper {
    public PDFParagraphTextStripper() throws IOException {
        // Join the lines within a paragraph with a space and end each paragraph with a line break.
        this.setLineSeparator(" ");
        this.setParagraphStart("");
        this.setParagraphEnd(this.LINE_SEPARATOR);
        this.setPageStart("");
        this.setPageEnd("");
        this.setArticleStart(this.LINE_SEPARATOR);
        this.setArticleEnd(this.LINE_SEPARATOR);
    }
}

// PdfParser.java
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfParser {
    // Trailing separator so that dataPath + fileName forms a valid path.
    private static final String dataPath =
            "D:\\IdeaProject\\PdfParser\\PdfParser\\data\\";

    public static void main(String[] args) {
        String fileName = "Daily Report.pdf";
        try {
            extract_pdfbox(dataPath + fileName);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static void extract_pdfbox(String filePath) throws Exception {
        File file = new File(filePath);
        PDDocument document = PDDocument.load(file);
        PDFTextStripper pdfTextStripper = new PDFParagraphTextStripper();
        String text = pdfTextStripper.getText(document);
        System.out.println(text);
        document.close();
    }
}
</code>
<output>
Daily Report 1) which language is your text in? - English
2) some examples of sentences containing
addresses you'd want to pick up - Data are
contarct documents, it contains addresses in
different formates(of different
countries),some are comma saperated, some
are new line saperated etc 3) perhaps
examples of mistakes - currently en model
of SpaCy is even not able to tag entities
clearly 4) Are you training your own model
or are you using a model as is? - tried as it is
but very poor in results to need to know a
generic approach to train own model. any
</output>
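For comparison, the split I would expect puts "Daily Report" and each numbered item 1) to 4) on its own line. The following is only a rough post-processing sketch in plain Java, not a PDFBox feature and not a fix for the stripper itself. The class name NumberedItemSplitter and the rule "a new item starts at digits followed by ')' after whitespace" are assumptions made for illustration, and the rule would also split on text such as "see 3) below".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class NumberedItemSplitter {
    // Assumed heuristic: an item starts at "<digits>) " when preceded by whitespace.
    // Both lookarounds are zero-width, so the marker itself stays in the item text.
    private static final Pattern ITEM_START =
            Pattern.compile("(?<=\\s)(?=\\d+\\)\\s)");

    public static List<String> split(String extracted) {
        // Collapse every whitespace run, including line breaks, into one space.
        String flat = extracted.replaceAll("\\s+", " ").trim();
        List<String> items = new ArrayList<>();
        for (String part : ITEM_START.split(flat)) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                items.add(trimmed);
            }
        }
        return items;
    }

    public static void main(String[] args) {
        // The text printed by the stripper, copied from the output above.
        String extracted =
                "Daily Report 1) which language is your text in? - English\n"
                + "2) some examples of sentences containing\n"
                + "addresses you'd want to pick up - Data are\n"
                + "contarct documents, it contains addresses in\n"
                + "different formates(of different\n"
                + "countries),some are comma saperated, some\n"
                + "are new line saperated etc 3) perhaps\n"
                + "examples of mistakes - currently en model\n"
                + "of SpaCy is even not able to tag entities\n"
                + "clearly 4) Are you training your own model\n"
                + "or are you using a model as is? - tried as it is\n"
                + "but very poor in results to need to know a\n"
                + "generic approach to train own model. any\n";
        // Print one item per line.
        for (String item : split(extracted)) {
            System.out.println(item);
        }
    }
}
```

This only works for documents that happen to use such numbered markers, so it does not replace correct paragraph detection in the text stripper.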
was:
When I use PDFBox to extract paragraph text, the paragraph boundaries in the result are incorrect.
<code>
public class PDFParagraphTextStripper extends PDFTextStripper {
    public PDFParagraphTextStripper() throws IOException {
        this.setLineSeparator(" ");
        this.setParagraphStart("");
        this.setParagraphEnd(this.LINE_SEPARATOR);
        this.setPageStart("");
        this.setPageEnd("");
        this.setArticleStart(this.LINE_SEPARATOR);
        this.setArticleEnd(this.LINE_SEPARATOR);
    }
}

public class PdfParser {
    private static final String dataPath =
            "D:\\IdeaProject\\PdfParser\\PdfParser\\data\\";

    public static void main(String[] args) {
        String fileName = "Daily Report.pdf";
        try {
            extract_pdfbox(dataPath + fileName);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static void extract_pdfbox(String filePath) throws Exception {
        File file = new File(filePath);
        PDDocument document = PDDocument.load(file);
        PDFTextStripper pdfTextStripper = new PDFParagraphTextStripper();
        String text = pdfTextStripper.getText(document);
        System.out.println(text);
        document.close();
    }
}
</code>
<output>
Daily Report 1) which language is your text in? - English
2) some examples of sentences containing
addresses you'd want to pick up - Data are
contarct documents, it contains addresses in
different formates(of different
countries),some are comma saperated, some
are new line saperated etc 3) perhaps
examples of mistakes - currently en model
of SpaCy is even not able to tag entities
clearly 4) Are you training your own model
or are you using a model as is? - tried as it is
but very poor in results to need to know a
generic approach to train own model. any
</output>
> Incorrect paragraph split
> -------------------------
>
> Key: PDFBOX-5613
> URL: https://issues.apache.org/jira/browse/PDFBOX-5613
> Project: PDFBox
> Issue Type: Improvement
> Components: Parsing, Text extraction
> Affects Versions: 2.0.1
> Reporter: Key Hutu
> Priority: Major
> Attachments: Daily Report.pdf