[jira] [Closed] (PDFBOX-1858) Extracted text does not have spaces
[ https://issues.apache.org/jira/browse/PDFBOX-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vitalie Bureanu closed PDFBOX-1858.
Resolution: Not A Problem
It seems that the problem is inside our own software.
Extracted text does not have spaces
---
Key: PDFBOX-1858
URL: https://issues.apache.org/jira/browse/PDFBOX-1858
Project: PDFBox
Issue Type: Bug
Components: Parsing, Text extraction
Affects Versions: 1.8.3
Environment: Linux 64bit, Java
Reporter: Vitalie Bureanu
Attachments: Screenshot.jpg, test.pdf
Original Estimate: 3h
Remaining Estimate: 3h
Extracted text does not have spaces between some words. To test, please use the string on line 74a... inside the attached test.pdf. It is extracted as:
74a Amount of line73youwant refunded toyou . If Form isattached , checkhere
The result does not look right; the words are glued together. I tried using the PDFTextStripper class, but the result remains the same. For us it is a big problem. Can it be resolved, please?
With respect, Vitalie
-- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (PDFBOX-1858) Extracted text does not have spaces
Vitalie Bureanu created PDFBOX-1858:
Summary: Extracted text does not have spaces
Key: PDFBOX-1858
URL: https://issues.apache.org/jira/browse/PDFBOX-1858
Project: PDFBox
Issue Type: Bug
Components: Parsing, Text extraction
Affects Versions: 1.8.3
Environment: Linux 64bit, Java
Reporter: Vitalie Bureanu
Extracted text does not have spaces between some words. To test, please use the string on line 74a... inside the attached test.pdf. It is extracted as:
74a Amount of line73youwant refunded toyou . If Form isattached , checkhere
The result does not look right; the words are glued together. I tried using the PDFTextStripper class, but the result remains the same. Can it be solved, please?
With respect, Vitalie
[jira] [Updated] (PDFBOX-1858) Extracted text does not have spaces
[ https://issues.apache.org/jira/browse/PDFBOX-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vitalie Bureanu updated PDFBOX-1858:
Attachment: Screenshot.jpg, test.pdf
[jira] [Updated] (PDFBOX-1858) Extracted text does not have spaces
[ https://issues.apache.org/jira/browse/PDFBOX-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vitalie Bureanu updated PDFBOX-1858:
Attachment: (was: Untitled-1.jpg)
[jira] [Updated] (PDFBOX-1858) Extracted text does not have spaces
[ https://issues.apache.org/jira/browse/PDFBOX-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vitalie Bureanu updated PDFBOX-1858:
Attachment: Untitled-1.jpg
[jira] [Updated] (PDFBOX-1858) Extracted text does not have spaces
[ https://issues.apache.org/jira/browse/PDFBOX-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vitalie Bureanu updated PDFBOX-1858:
Description:
Extracted text does not have spaces between some words. To test, please use the string on line 74a... inside the attached test.pdf. It is extracted as:
74a Amount of line73youwant refunded toyou . If Form isattached , checkhere
The result does not look right; the words are glued together. I tried using the PDFTextStripper class, but the result remains the same. For us it is a big problem. Can it be resolved, please?
With respect, Vitalie
was:
Extracted text does not have spaces between some words. To test, please use the string on line 74a... inside the attached test.pdf. It is extracted as:
74a Amount of line73youwant refunded toyou . If Form isattached , checkhere
The result does not look right; the words are glued together. I tried using the PDFTextStripper class, but the result remains the same. Can it be resolved, please?
With respect, Vitalie
[jira] [Updated] (PDFBOX-1858) Extracted text does not have spaces
[ https://issues.apache.org/jira/browse/PDFBOX-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vitalie Bureanu updated PDFBOX-1858:
Description:
Extracted text does not have spaces between some words. To test, please use the string on line 74a... inside the attached test.pdf. It is extracted as:
74a Amount of line73youwant refunded toyou . If Form isattached , checkhere
The result does not look right; the words are glued together. I tried using the PDFTextStripper class, but the result remains the same. Can it be resolved, please?
With respect, Vitalie
was:
Extracted text does not have spaces between some words. To test, please use the string on line 74a... inside the attached test.pdf. It is extracted as:
74a Amount of line73youwant refunded toyou . If Form isattached , checkhere
The result does not look right; the words are glued together. I tried using the PDFTextStripper class, but the result remains the same. Can it be solved, please?
With respect, Vitalie
[jira] [Updated] (PDFBOX-1858) Extracted text does not have spaces
[ https://issues.apache.org/jira/browse/PDFBOX-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vitalie Bureanu updated PDFBOX-1858:
Description:
Extracted text does not have spaces between some words. To test, please use the string on line 74a... inside the attached test.pdf. It is extracted as:
74a Amount of line73youwant refunded toyou . If Form isattached , checkhere
The result does not look right; the words are glued together. I tried using the PDFTextStripper class, but the result still remains the same. Can it be solved, please?
With respect, Vitalie
was:
Extracted text does not have spaces between some words. To test, please use the string on line 74a... inside the attached test.pdf. It is extracted as:
74a Amount of line73youwant refunded toyou . If Form isattached , checkhere
The result does not look right; the words are glued together. I tried using the PDFTextStripper class, but the resultstill remains the same. Can it be solved, please?
With respect, Vitalie
[jira] [Commented] (PDFBOX-1575) PDFTextStripper sometimes adds spaces after a detached words
[ https://issues.apache.org/jira/browse/PDFBOX-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13639065#comment-13639065 ]
Vitalie Bureanu commented on PDFBOX-1575:
Update: I noticed that this bug happens almost always, on different documents. For us, these whitespaces after completely detached words are very problematic... :( I cannot fix it; can somebody help?
PDFTextStripper sometimes adds spaces after a detached words
---
Key: PDFBOX-1575
URL: https://issues.apache.org/jira/browse/PDFBOX-1575
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.8.1
Environment: Linux 64bit
Reporter: Vitalie Bureanu
Labels: pdfbox, space, whitespace
Attachments: example.pdf
Original Estimate: 2h
Remaining Estimate: 2h
Hello dear developers,
I noticed that PDFTextStripper sometimes adds a space after a completely detached word. For example, if you extract the text of the attached file, you will see that PDFTextStripper adds one space after the words Qty and Unit Price, but not after Description and Line Total. I think this is a bug, because there should be no whitespace after the words Qty and Unit Price. Can you please fix it? (see attachment)
Thank you very much, Vitalie
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
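Until the extra spaces are explained, one possible workaround on the caller's side is to strip trailing whitespace from every extracted line. The following is only a minimal sketch under the assumption that trailing whitespace carries no meaning for the application; the class and method names are hypothetical, and it does not change how PDFTextStripper itself works.

```java
final class TrailingSpaceCleaner {

    /** Removes whitespace at the end of a single line (this includes a trailing '\r'). */
    static String stripTrailing(String line) {
        int end = line.length();
        while (end > 0 && Character.isWhitespace(line.charAt(end - 1))) {
            end--;
        }
        return line.substring(0, end);
    }

    /** Applies stripTrailing to every '\n'-separated line of the extracted text. */
    static String stripEachLine(String text) {
        // The -1 limit keeps trailing empty lines so the line count is preserved.
        String[] lines = text.split("\n", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                out.append('\n');
            }
            out.append(stripTrailing(lines[i]));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical extraction result with a space after "Qty" and "Unit Price".
        String extracted = "Qty \nUnit Price \nDescription\nLine Total";
        System.out.println(stripEachLine(extracted));
    }
}
```

This only hides the symptom after extraction; whether the spaces come from the PDF content or from the extraction settings still has to be checked against example.pdf.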
[jira] [Updated] (PDFBOX-1575) PDFTextStripper adds spaces after a detached words
[ https://issues.apache.org/jira/browse/PDFBOX-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vitalie Bureanu updated PDFBOX-1575:
Summary: PDFTextStripper adds spaces after a detached words (was: PDFTextStripper sometimes adds spaces after a detached words)
[jira] [Updated] (PDFBOX-1575) PDFTextStripper sometimes adds spaces after a detached one word
[ https://issues.apache.org/jira/browse/PDFBOX-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vitalie Bureanu updated PDFBOX-1575:
Attachment: example.pdf
[jira] [Updated] (PDFBOX-1575) PDFTextStripper sometimes adds spaces after a detached words
[ https://issues.apache.org/jira/browse/PDFBOX-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vitalie Bureanu updated PDFBOX-1575:
Summary: PDFTextStripper sometimes adds spaces after a detached words (was: PDFTextStripper sometimes adds spaces after a detached one word)
[jira] [Commented] (PDFBOX-1553) Offset of extracted coordinates
[ https://issues.apache.org/jira/browse/PDFBOX-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13617629#comment-13617629 ]
Vitalie Bureanu commented on PDFBOX-1553:
Hello Andreas, I checked it, but I get the same result: the coordinates have an offset.
Offset of extracted coordinates
---
Key: PDFBOX-1553
URL: https://issues.apache.org/jira/browse/PDFBOX-1553
Project: PDFBox
Issue Type: Bug
Affects Versions: 1.8.0
Environment: Linux Ubuntu 64 bit, Java
Reporter: Vitalie Bureanu
Priority: Minor
Labels: offset
Attachments: EnSt10_offset.pdf, EnSt11_offset.pdf, Extracted coordinates of rects.jpg, Parser.java, Selection in Adobe Reader.png
Original Estimate: 24h
Remaining Estimate: 24h
Hello,
Preamble: We are glad to use PDFBox, and I am personally grateful to all the developers who sustain this project. It is good work, guys!
We have one problem. For our application, we extract text from PDFs character by character, with the respective coordinates for each character (see the attached Parser). After this, we group the characters into words. We noticed that for some PDF documents there is a strange offset in the extracted rect coordinates (see the screenshots). The offset seems to be incremental (not sure): at the top left corner of the document the extracted coordinates are close to the real coordinates of the character, but at the bottom right corner the offset is close to 0.5 cm. If I make a selection in Adobe Reader, everything seems OK. I attached two PDF files with the offset to this post.
If you want to see the offset in action, you can use our service at http://pdf2data.cloudforpeople.com/ (please do not consider it advertising).
Please can you test these files and tell me whether it is really a bug? How can we resolve it?
Thanks, Vitalie
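To turn the "seems to be incremental (not sure)" observation into numbers, the offset could be measured at a few reference characters across the page. The sketch below assumes that both the expected and the extracted positions are given in PDF points (the default PDF user space unit is 1/72 inch) and in the same coordinate system; the class name, method names, and sample values are hypothetical and are not taken from the attached Parser.java.

```java
final class OffsetProbe {

    /** Centimetres per PDF point (1 point = 1/72 inch, 1 inch = 2.54 cm). */
    static final double CM_PER_POINT = 2.54 / 72.0;

    /** Distance in cm between an expected and an extracted position, both given in points. */
    static double offsetCm(double expectedX, double expectedY,
                           double extractedX, double extractedY) {
        return Math.hypot(extractedX - expectedX, extractedY - expectedY) * CM_PER_POINT;
    }

    /** True if the offsets never decrease and the last one is larger than the first. */
    static boolean grows(double[] offsets) {
        for (int i = 1; i < offsets.length; i++) {
            if (offsets[i] < offsets[i - 1]) {
                return false;
            }
        }
        return offsets.length > 1 && offsets[offsets.length - 1] > offsets[0];
    }

    public static void main(String[] args) {
        // Hypothetical reference points, ordered from the top left to the bottom right.
        double[] offsets = {
            offsetCm(50, 50, 50.5, 50.5),
            offsetCm(300, 400, 306, 408),
            offsetCm(550, 780, 560, 790)
        };
        for (double offset : offsets) {
            System.out.printf("offset = %.3f cm%n", offset);
        }
        System.out.println("grows with position: " + grows(offsets));
    }
}
```

If the measured offsets grow steadily with the distance from one corner, that would support the "incremental" description; a constant offset would point in a different direction. Neither outcome is claimed here for the attached files.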
[jira] [Commented] (PDFBOX-1542) Whitespaces between words are not created
[ https://issues.apache.org/jira/browse/PDFBOX-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614995#comment-13614995 ]
Vitalie Bureanu commented on PDFBOX-1542:
Thank you very much, Andreas! We will try to use PDFStripper to insert white spaces!
Whitespaces between words are not created
---
Key: PDFBOX-1542
URL: https://issues.apache.org/jira/browse/PDFBOX-1542
Project: PDFBox
Issue Type: Wish
Components: Text extraction
Affects Versions: 1.7.1
Reporter: Vitalie Bureanu
Priority: Minor
Attachments: Parser.java
Original Estimate: 1h
Remaining Estimate: 1h
Hello,
I extract text from PDF files with PDFBox. I noticed that the extraction from some PDF files is not as good as expected. I have a series of PDF invoices from which I try to extract the text with coordinates, and the result is pretty good, but I noticed a very strange thing: when I extract the text, the words are extracted without whitespace between them. Example: if I try to extract Unit Price, the result is UnitPrice. But if I open the invoice in Adobe Reader and copy/paste into Notepad, I get Unit Price with the whitespace! I think the whitespace is not present in the original PDF document, but Adobe Reader somehow inserts whitespace between words when it shows the content of the PDF.
Guys, can you please suggest how I can get the strings with spaces after parsing? See an example invoice here: http://www.cloudforpeople.com/Invoice1.pdf
PS: I want to try version 1.8.0 of PDFBox. How can I download it?
Many thanks, Vitalie
[jira] [Created] (PDFBOX-1553) Offset of extracted coordinates
Vitalie Bureanu created PDFBOX-1553:
Summary: Offset of extracted coordinates
Key: PDFBOX-1553
URL: https://issues.apache.org/jira/browse/PDFBOX-1553
Project: PDFBox
Issue Type: Bug
Affects Versions: 1.8.0
Environment: Linux Ubuntu 64 bit, Java
Reporter: Vitalie Bureanu
Hello,
Preamble: We are glad to use PDFBox, and I am personally grateful to all the developers who sustain this project. It is good work, guys!
We have one problem. For our application, we extract text from PDFs character by character, with the respective coordinates for each character (see the attached Parser). After this, we group the characters into words. We noticed that for some PDF documents there is a strange offset in the extracted coordinates (see the screenshots). The offset is incremental: at the top left corner of the document the extracted coordinates are close to the real coordinates of the character, but at the bottom right corner the offset is close to 0.5 cm. If I make a selection in Adobe Reader, everything seems OK. I attached two PDF files with the offset to this post.
If you want to see the offset in action, you can use our service at http://pdf2data.cloudforpeople.com/ (please do not consider it advertising).
[jira] [Updated] (PDFBOX-1553) Offset of extracted coordinates
[ https://issues.apache.org/jira/browse/PDFBOX-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vitalie Bureanu updated PDFBOX-1553:
Priority: Minor (was: Major)
[jira] [Updated] (PDFBOX-1553) Offset of extracted coordinates
[ https://issues.apache.org/jira/browse/PDFBOX-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vitalie Bureanu updated PDFBOX-1553:
Attachment: Selection in Adobe Reader.png, Extracted coordinates of rects.jpg, Parser.java, EnSt11_offset.pdf, EnSt10_offset.pdf
[jira] [Updated] (PDFBOX-1553) Offset of extracted coordinates
[ https://issues.apache.org/jira/browse/PDFBOX-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vitalie Bureanu updated PDFBOX-1553:
Description:
Hello,
Preamble: We are glad to use PDFBox, and I am personally grateful to all the developers who sustain this project. It is good work, guys!
We have one problem. For our application, we extract text from PDFs character by character, with the respective coordinates for each character (see the attached Parser). After this, we group the characters into words. We noticed that for some PDF documents there is a strange offset in the extracted rect coordinates (see the screenshots). The offset seems to be incremental (not sure): at the top left corner of the document the extracted coordinates are close to the real coordinates of the character, but at the bottom right corner the offset is close to 0.5 cm. If I make a selection in Adobe Reader, everything seems OK. I attached two PDF files with the offset to this post.
If you want to see the offset in action, you can use our service at http://pdf2data.cloudforpeople.com/ (please do not consider it advertising).
Please can you test these files and tell me whether it is really a bug? How can we resolve it?
Thanks, Vitalie
was:
Hello,
Preamble: We are glad to use PDFBox, and I am personally grateful to all the developers who sustain this project. It is good work, guys!
We have one problem. For our application, we extract text from PDFs character by character, with the respective coordinates for each character (see the attached Parser). After this, we group the characters into words. We noticed that for some PDF documents there is a strange offset in the extracted rect coordinates (see the screenshots). The offset seems to be incremental (not sure): at the top left corner of the document the extracted coordinates are close to the real coordinates of the character, but at the bottom right corner the offset is close to 0.5 cm. If I make a selection in Adobe Reader, everything seems OK. I attached two PDF files with the offset to this post.
If you want to see the offset in action, you can use our service at http://pdf2data.cloudforpeople.com/ (please do not consider it advertising).
Please can you test these files and tell me whether it is really a bug?
[jira] [Updated] (PDFBOX-1542) Whitespaces between words are not created
[ https://issues.apache.org/jira/browse/PDFBOX-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vitalie Bureanu updated PDFBOX-1542:
Attachment: Parser.java
Our text extractor (with coordinates for each symbol).
[jira] [Commented] (PDFBOX-1542) Whitespaces between words are not created
[ https://issues.apache.org/jira/browse/PDFBOX-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613609#comment-13613609 ]
Vitalie Bureanu commented on PDFBOX-1542:
Hello Andreas,
Thank you for the prompt reply. I attached the source code of the Parser that we use for text extraction to this issue. We extract the text symbol by symbol, with the respective coordinates for each symbol. When we have extracted all the symbols, the whitespace between them is missing.
Many thanks, Vitalie
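When symbols are extracted one by one with coordinates and the PDF itself contains no space characters, the spaces have to be reconstructed from the horizontal gaps between symbols. The following is a minimal sketch of that idea, assuming the symbols of one text line are already sorted from left to right and that each symbol's left edge and width are known in the same units. The Glyph type, the join method, and the fixed gap threshold are hypothetical; this is not the algorithm used by PDFBox or Adobe Reader.

```java
import java.util.ArrayList;
import java.util.List;

final class GapWordJoiner {

    /** One extracted symbol: its text, its left edge, and its width (same units). */
    static final class Glyph {
        final String text;
        final double x;
        final double width;

        Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Concatenates the glyphs of one line and inserts a single space whenever the
     * gap between the right edge of the previous glyph and the left edge of the
     * current glyph is larger than gapThreshold.
     */
    static String join(List<Glyph> glyphs, double gapThreshold) {
        StringBuilder out = new StringBuilder();
        Glyph previous = null;
        for (Glyph glyph : glyphs) {
            if (previous != null) {
                double gap = glyph.x - (previous.x + previous.width);
                if (gap > gapThreshold) {
                    out.append(' ');
                }
            }
            out.append(glyph.text);
            previous = glyph;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical positions for "Unit Price" without a space character in the data.
        List<Glyph> line = new ArrayList<>();
        line.add(new Glyph("U", 0, 5));
        line.add(new Glyph("n", 5, 5));
        line.add(new Glyph("i", 10, 2));
        line.add(new Glyph("t", 12, 3));
        line.add(new Glyph("P", 20, 5)); // 5 units after the right edge of "t"
        line.add(new Glyph("r", 25, 3));
        line.add(new Glyph("i", 28, 2));
        line.add(new Glyph("c", 30, 4));
        line.add(new Glyph("e", 34, 4));
        System.out.println(join(line, 2.0));
    }
}
```

A fixed threshold is the simplest choice; in practice it would probably need to scale with the font size or with the width of a space in the current font, which is left out of this sketch.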
[jira] [Created] (PDFBOX-1542) Whitespaces between words are not created
Vitalie Bureanu created PDFBOX-1542: --- Summary: Whitespaces between words are not created Key: PDFBOX-1542 URL: https://issues.apache.org/jira/browse/PDFBOX-1542 Project: PDFBox Issue Type: Wish Components: Text extraction Affects Versions: 1.7.1 Reporter: Vitalie Bureanu Priority: Minor Hello, I extract text from PDF files with PDFBox. I noticed that the extraction from some PDF files is not as good as expected. I have a series of PDF invoices from which I try to extract the text with coordinates, and the result is pretty good, but I noticed a very strange thing: the words are extracted without whitespace between them. Example: if I try to extract Total Amount, the result is TotalAmount. But if I open the invoice in Adobe Reader and copy/paste into Notepad, I get Total Amount with the whitespace! I think the whitespace is not present in the original PDF document, but Adobe Reader somehow inserts whitespace between words when it shows the content of the PDF. Guys, can you please suggest how I can get the strings with spaces after parsing? See an example invoice here: http://www.cloudforpeople.com/Invoice1.pdf PS: I want to try version 1.8.0 of PDFBox - how can I download it? Many thanks, Vitalie
[jira] [Updated] (PDFBOX-1542) Whitespaces between words are not created
[ https://issues.apache.org/jira/browse/PDFBOX-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vitalie Bureanu updated PDFBOX-1542: Description: Hello, I extract text from PDF files with PDFBox. I noticed that the extraction from some PDF files is not as good as expected. I have a series of PDF invoices from which I try to extract the text with coordinates, and the result is pretty good, but I noticed a very strange thing: the words are extracted without whitespace between them. Example: if I try to extract Unit Price, the result is UnitPrice. But if I open the invoice in Adobe Reader and copy/paste into Notepad, I get Unit Price with the whitespace! I think the whitespace is not present in the original PDF document, but Adobe Reader somehow inserts whitespace between words when it shows the content of the PDF. Guys, can you please suggest how I can get the strings with spaces after parsing? See an example invoice here: http://www.cloudforpeople.com/Invoice1.pdf PS: I want to try version 1.8.0 of PDFBox - how can I download it? Many thanks, Vitalie was: Hello, I extract text from PDF files with PDFBox. I noticed that the extraction from some PDF files is not as good as expected. I have a series of PDF invoices from which I try to extract the text with coordinates, and the result is pretty good, but I noticed a very strange thing: the words are extracted without whitespace between them. Example: if I try to extract Total Amount, the result is TotalAmount. But if I open the invoice in Adobe Reader and copy/paste into Notepad, I get Total Amount with the whitespace! I think the whitespace is not present in the original PDF document, but Adobe Reader somehow inserts whitespace between words when it shows the content of the PDF. Guys, can you please suggest how I can get the strings with spaces after parsing? See an example invoice here: http://www.cloudforpeople.com/Invoice1.pdf PS: I want to try version 1.8.0 of PDFBox - how can I download it? Many thanks, Vitalie Whitespaces between words are not created - Key: PDFBOX-1542 URL: https://issues.apache.org/jira/browse/PDFBOX-1542 Project: PDFBox Issue Type: Wish Components: Text extraction Affects Versions: 1.7.1 Reporter: Vitalie Bureanu Priority: Minor Original Estimate: 1h Remaining Estimate: 1h