[jira] [Closed] (PDFBOX-1858) Extracted text does not have spaces

2014-01-23 Thread Vitalie Bureanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalie Bureanu closed PDFBOX-1858.
---

Resolution: Not A Problem

It seems that the problem is inside our software.

 Extracted text does not have spaces
 ---

 Key: PDFBOX-1858
 URL: https://issues.apache.org/jira/browse/PDFBOX-1858
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing, Text extraction
Affects Versions: 1.8.3
 Environment: Linux 64bit, Java
Reporter: Vitalie Bureanu
 Attachments: Screenshot.jpg, test.pdf

   Original Estimate: 3h
  Remaining Estimate: 3h

 Extracted text does not have spaces between some words.
 To test, please use the string on line 74a... inside the attached test.pdf.
 It is extracted as: 74a Amount of line73youwant refunded toyou . If 
 Form isattached , checkhere
 The result does not look right; the words are glued together.
 I tried the PDFTextStripper class, but the result still remains the same.
 This is a big problem for us. Can it be resolved, please?
 With respect,
 Vitalie



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (PDFBOX-1858) Extracted text does not have spaces

2014-01-22 Thread Vitalie Bureanu (JIRA)
Vitalie Bureanu created PDFBOX-1858:
---

 Summary: Extracted text does not have spaces
 Key: PDFBOX-1858
 URL: https://issues.apache.org/jira/browse/PDFBOX-1858
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing, Text extraction
Affects Versions: 1.8.3
 Environment: Linux 64bit, Java
Reporter: Vitalie Bureanu


Extracted text does not have spaces between some words.

To test, please use the string on line 74a... inside the attached test.pdf.

It is extracted as: 74a Amount of line73youwant refunded toyou . If 
Form isattached , checkhere

The result does not look right; the words are glued together.

I tried the PDFTextStripper class, but the result still remains the same.

Can it be solved, please?

With respect,
Vitalie





[jira] [Updated] (PDFBOX-1858) Extracted text does not have spaces

2014-01-22 Thread Vitalie Bureanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalie Bureanu updated PDFBOX-1858:


Attachment: Screenshot.jpg
test.pdf



[jira] [Updated] (PDFBOX-1858) Extracted text does not have spaces

2014-01-22 Thread Vitalie Bureanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalie Bureanu updated PDFBOX-1858:


Attachment: (was: Untitled-1.jpg)



[jira] [Updated] (PDFBOX-1858) Extracted text does not have spaces

2014-01-22 Thread Vitalie Bureanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalie Bureanu updated PDFBOX-1858:


Attachment: Untitled-1.jpg



[jira] [Updated] (PDFBOX-1858) Extracted text does not have spaces

2014-01-22 Thread Vitalie Bureanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalie Bureanu updated PDFBOX-1858:


Description: 
Extracted text does not have spaces between some words.

To test, please use the string on line 74a... inside the attached test.pdf.

It is extracted as: 74a Amount of line73youwant refunded toyou . If 
Form isattached , checkhere

The result does not look right; the words are glued together.

I tried the PDFTextStripper class, but the result still remains the same.

This is a big problem for us. Can it be resolved, please?

With respect,
Vitalie

  was:
Extracted text does not have spaces between some words.

To test, please use the string on line 74a... inside the attached test.pdf.

It is extracted as: 74a Amount of line73youwant refunded toyou . If 
Form isattached , checkhere

The result does not look right; the words are glued together.

I tried the PDFTextStripper class, but the result still remains the same.

Can it be resolved, please?

With respect,
Vitalie




[jira] [Updated] (PDFBOX-1858) Extracted text does not have spaces

2014-01-22 Thread Vitalie Bureanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalie Bureanu updated PDFBOX-1858:


Description: 
Extracted text does not have spaces between some words.

To test, please use the string on line 74a... inside the attached test.pdf.

It is extracted as: 74a Amount of line73youwant refunded toyou . If 
Form isattached , checkhere

The result does not look right; the words are glued together.

I tried the PDFTextStripper class, but the result still remains the same.

Can it be resolved, please?

With respect,
Vitalie

  was:
Extracted text does not have spaces between some words.

To test, please use the string on line 74a... inside the attached test.pdf.

It is extracted as: 74a Amount of line73youwant refunded toyou . If 
Form isattached , checkhere

The result does not look right; the words are glued together.

I tried the PDFTextStripper class, but the result still remains the same.

Can it be solved, please?

With respect,
Vitalie




[jira] [Updated] (PDFBOX-1858) Extracted text does not have spaces

2014-01-22 Thread Vitalie Bureanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalie Bureanu updated PDFBOX-1858:


Description: 
Extracted text does not have spaces between some words.

To test, please use the string on line 74a... inside the attached test.pdf.

It is extracted as: 74a Amount of line73youwant refunded toyou . If 
Form isattached , checkhere

The result does not look right; the words are glued together.

I tried the PDFTextStripper class, but the result still remains the same.

Can it be solved, please?

With respect,
Vitalie

  was:
Extracted text does not have spaces between some words.

To test, please use the string on line 74a... inside the attached test.pdf.

It is extracted as: 74a Amount of line73youwant refunded toyou . If 
Form isattached , checkhere

The result does not look right; the words are glued together.

I tried the PDFTextStripper class, but the resultstill remains the same.

Can it be solved, please?

With respect,
Vitalie




[jira] [Commented] (PDFBOX-1575) PDFTextStripper sometimes adds spaces after a detached words

2013-04-23 Thread Vitalie Bureanu (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13639065#comment-13639065
 ] 

Vitalie Bureanu commented on PDFBOX-1575:
-

Update: I noticed that this bug happens almost always, on different documents.
For us, these whitespaces after completely detached words are very 
problematic... :( 
I cannot fix it; can somebody help?


 PDFTextStripper sometimes adds spaces after a detached words
 

 Key: PDFBOX-1575
 URL: https://issues.apache.org/jira/browse/PDFBOX-1575
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.1
 Environment: Linux 64bit
Reporter: Vitalie Bureanu
  Labels: pdfbox, space, whitespace,
 Attachments: example.pdf

   Original Estimate: 2h
  Remaining Estimate: 2h

 Hello dear developers,
 I noticed that PDFTextStripper sometimes adds spaces after completely 
 detached words...
 For example, if you extract text from the attached file, you will see that 
 PDFTextStripper adds one space after the words "Qty" and "Unit Price" but 
 does not add one after "Description" and "Line Total".
 I think this is a bug, because there should be no whitespace after the 
 words "Qty" and "Unit Price".
 Can you please fix it?
 (see attachment)
 Thank you very much,
 Vitalie
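
As a possible workaround for the trailing spaces described above, the extracted string can be post-processed to drop spaces and tabs at the end of each line. This is a minimal sketch in plain Java that does not touch PDFBox; the class name TrailingWhitespace and the sample text are made up for illustration, and it only hides the symptom rather than changing how the text is extracted.

```java
import java.util.regex.Pattern;

final class TrailingWhitespace {
    // Spaces or tabs that sit directly before a line terminator or the end of input.
    private static final Pattern TRAILING =
            Pattern.compile("[ \\t]+$", Pattern.MULTILINE);

    /** Returns the text with trailing spaces and tabs removed from every line. */
    static String strip(String extracted) {
        return TRAILING.matcher(extracted).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical extraction result with a space after "Qty" and "Unit Price".
        String extracted = "Qty \nDescription\nUnit Price \nLine Total";
        // Prints the four labels, one per line, without trailing spaces.
        System.out.println(strip(extracted));
    }
}
```

Spaces inside a label, such as the one in "Unit Price", are left alone because they are not followed by a line end.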

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PDFBOX-1575) PDFTextStripper adds spaces after a detached words

2013-04-23 Thread Vitalie Bureanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalie Bureanu updated PDFBOX-1575:


Summary: PDFTextStripper adds spaces after a detached words  (was: 
PDFTextStripper sometimes adds spaces after a detached words)



[jira] [Updated] (PDFBOX-1575) PDFTextStripper sometimes adds spaces after a detached one word

2013-04-22 Thread Vitalie Bureanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalie Bureanu updated PDFBOX-1575:


Attachment: example.pdf



[jira] [Updated] (PDFBOX-1575) PDFTextStripper sometimes adds spaces after a detached words

2013-04-22 Thread Vitalie Bureanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalie Bureanu updated PDFBOX-1575:


Summary: PDFTextStripper sometimes adds spaces after a detached words  
(was: PDFTextStripper sometimes adds spaces after a detached one word)



[jira] [Commented] (PDFBOX-1553) Offset of extracted coordinates

2013-03-29 Thread Vitalie Bureanu (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13617629#comment-13617629
 ] 

Vitalie Bureanu commented on PDFBOX-1553:
-

Hello Andreas, I checked it, but I get the same result: the coordinates have an offset.

 Offset of extracted coordinates
 ---

 Key: PDFBOX-1553
 URL: https://issues.apache.org/jira/browse/PDFBOX-1553
 Project: PDFBox
  Issue Type: Bug
Affects Versions: 1.8.0
 Environment: Linux Ubuntu 64 bit, Java
Reporter: Vitalie Bureanu
Priority: Minor
  Labels: offset
 Attachments: EnSt10_offset.pdf, EnSt11_offset.pdf, Extracted 
 coordinates of rects.jpg, Parser.java, Selection in Adobe Reader.png

   Original Estimate: 24h
  Remaining Estimate: 24h

 Hello,
 Preamble: We are glad to use PDFBox, and I am personally grateful to all 
 developers who sustain this project. It is good work, guys!
 We have one problem. For our application, we extract text from the PDF char 
 by char, with the respective coordinates for each char. (see attached Parser)
 After this we group the chars into words. We noticed that for some PDF 
 documents the extracted rect coordinates have a strange offset. (see 
 screens)
 The offset seems to be incremental (not sure): at the top left corner of the 
 document the extracted position is close to the real coordinates of the 
 character, but at the bottom right corner the offset is about 0.5 cm.
 If I make a selection in Adobe Reader, everything seems OK.
 I attached two PDF files with the offset to this post.
 If you want to see the offset in action, you can use our service at 
 http://pdf2data.cloudforpeople.com/ (Please do not consider it as advertising)
 Please can you test these files and tell me if it is really a bug?
 How can we resolve it?
 Thanks,
 Vitalie
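
The description above says the offset grows from almost nothing at the top left to about 0.5 cm at the bottom right. One way to tell a constant shift from a scale mismatch is to fit a straight line through a few pairs of extracted and reference coordinates, one axis at a time. The following is only a diagnostic sketch: OffsetFit, its method and the sample numbers are hypothetical, it does not use PDFBox, and the thread does not state the actual cause of the offset.

```java
final class OffsetFit {
    /**
     * Least-squares fit of reference = scale * extracted + shift for one axis.
     * Returns a two-element array {scale, shift}.
     */
    static double[] fit(double[] extracted, double[] reference) {
        if (extracted.length != reference.length || extracted.length < 2) {
            throw new IllegalArgumentException("need at least two matching samples");
        }
        int n = extracted.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += extracted[i];
            sy += reference[i];
            sxx += extracted[i] * extracted[i];
            sxy += extracted[i] * reference[i];
        }
        double denom = n * sxx - sx * sx;
        if (denom == 0.0) {
            throw new IllegalArgumentException("extracted values must not all be equal");
        }
        double scale = (n * sxy - sx * sy) / denom;
        double shift = (sy - scale * sx) / n;
        return new double[] {scale, shift};
    }

    public static void main(String[] args) {
        // Made-up x coordinates: the error grows with the distance from the origin.
        double[] extracted = {0.0, 100.0, 200.0};
        double[] reference = {0.0, 102.0, 204.0};
        double[] result = fit(extracted, reference);
        System.out.println("scale=" + result[0] + " shift=" + result[1]);
    }
}
```

A scale clearly different from 1 with a shift near 0 would point to a scaling mismatch between the two coordinate systems, while a scale near 1 with a non-zero shift would point to a constant translation.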



[jira] [Commented] (PDFBOX-1542) Whitespaces between words are not created

2013-03-27 Thread Vitalie Bureanu (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614995#comment-13614995
 ] 

Vitalie Bureanu commented on PDFBOX-1542:
-

Thank you very much, Andreas! We will try to use PDFTextStripper to insert 
whitespace!

 Whitespaces between words are not created
 -

 Key: PDFBOX-1542
 URL: https://issues.apache.org/jira/browse/PDFBOX-1542
 Project: PDFBox
  Issue Type: Wish
  Components: Text extraction
Affects Versions: 1.7.1
Reporter: Vitalie Bureanu
Priority: Minor
 Attachments: Parser.java

   Original Estimate: 1h
  Remaining Estimate: 1h

 Hello, I extract text from PDF files with PDFBox. I noticed that text 
 extraction from some PDF files is not as good as expected. I have a 
 series of PDF invoices from which I try to extract the text with coordinates, 
 and the result is pretty good, but I noticed a very strange thing: when I 
 extract the text, the words come out without whitespace between them. 
 Example: if I try to extract "Unit Price", the result is "UnitPrice".
 But if I open the invoice in Adobe Reader and copy/paste into 
 Notepad... I get "Unit Price" with whitespace!
 I think the whitespace is not present in the original PDF document... but 
 Adobe Reader somehow inserts whitespace between words when it shows 
 the content of the PDF.
  
 Guys, can you please suggest how I can get the strings with spaces after 
 parsing? 
 See an example invoice here: http://www.cloudforpeople.com/Invoice1.pdf
 PS: I want to try version 1.8.0 of PDFBox - how can I download it?
 Many thanks,
 Vitalie
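
The thread does not say how Adobe Reader or PDFBox decide where a space belongs. One plausible heuristic when grouping individually positioned characters into words is to insert a space whenever the horizontal gap to the previous character exceeds some fraction of that character's width. The sketch below illustrates this under that assumption; Glyph, GapSpacer, the gapFactor parameter and all coordinates are hypothetical and are not taken from the attached Parser.java or from PDFBox.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical glyph record: its text, left edge and width in the same units. */
final class Glyph {
    final String text;
    final float x;
    final float width;

    Glyph(String text, float x, float width) {
        this.text = text;
        this.x = x;
        this.width = width;
    }
}

final class GapSpacer {
    /**
     * Joins the glyphs of one text line, given in left-to-right order, and
     * inserts a space whenever the gap to the previous glyph is larger than
     * gapFactor times the previous glyph's width.
     */
    static String join(List<Glyph> line, float gapFactor) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : line) {
            if (prev != null) {
                float gap = g.x - (prev.x + prev.width);
                if (gap > gapFactor * prev.width) {
                    out.append(' ');
                }
            }
            out.append(g.text);
            prev = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = Arrays.asList(
                new Glyph("U", 0f, 5f), new Glyph("n", 5f, 5f),
                new Glyph("i", 10f, 5f), new Glyph("t", 15f, 5f),
                // "P" starts 3 units after "t" ends, which exceeds 0.3 * 5.
                new Glyph("P", 23f, 5f), new Glyph("r", 28f, 5f),
                new Glyph("i", 33f, 5f), new Glyph("c", 38f, 5f),
                new Glyph("e", 43f, 5f));
        System.out.println(join(line, 0.3f)); // prints "Unit Price"
    }
}
```

The threshold is a trade-off: too small a gapFactor splits words that use wide letter spacing, while too large a value glues neighbouring words together.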



[jira] [Created] (PDFBOX-1553) Offset of extracted coordinates

2013-03-27 Thread Vitalie Bureanu (JIRA)
Vitalie Bureanu created PDFBOX-1553:
---

 Summary: Offset of extracted coordinates
 Key: PDFBOX-1553
 URL: https://issues.apache.org/jira/browse/PDFBOX-1553
 Project: PDFBox
  Issue Type: Bug
Affects Versions: 1.8.0
 Environment: Linux Ubuntu 64 bit, Java
Reporter: Vitalie Bureanu


Hello,

Preamble: We are glad to use PDFBox, and I am personally grateful to all developers 
who sustain this project. It is good work, guys!

We have one problem. For our application, we extract text from the PDF char by 
char, with the respective coordinates for each char. (see attached Parser)
After this we group the chars into words. We noticed that for some PDF 
documents the extracted coordinates have a strange offset. (see screens)

The offset is incremental: at the top left corner of the document the extracted 
position is close to the real coordinates of the character, but at the bottom 
right corner the offset is about 0.5 cm.
If I make a selection in Adobe Reader, everything seems OK.

I attached two PDF files with the offset to this post.
If you want to see the offset in action, you can use our service at 
http://pdf2data.cloudforpeople.com/ (Please do not consider it as advertising)






[jira] [Updated] (PDFBOX-1553) Offset of extracted coordinates

2013-03-27 Thread Vitalie Bureanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalie Bureanu updated PDFBOX-1553:


Priority: Minor  (was: Major)



[jira] [Updated] (PDFBOX-1553) Offset of extracted coordinates

2013-03-27 Thread Vitalie Bureanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalie Bureanu updated PDFBOX-1553:


Attachment: Selection in Adobe Reader.png
Extracted coordinates of rects.jpg
Parser.java
EnSt11_offset.pdf
EnSt10_offset.pdf



[jira] [Updated] (PDFBOX-1553) Offset of extracted coordinates

2013-03-27 Thread Vitalie Bureanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalie Bureanu updated PDFBOX-1553:


Description: 
Hello,

Preamble: We are glad to use PDFBox, and I am personally grateful to all developers 
who sustain this project. It is good work, guys!

We have one problem. For our application, we extract text from the PDF char by 
char, with the respective coordinates for each char. (see attached Parser)
After this we group the chars into words. We noticed that for some PDF 
documents the extracted rect coordinates have a strange offset. (see 
screens)

The offset seems to be incremental (not sure): at the top left corner of the 
document the extracted position is close to the real coordinates of the 
character, but at the bottom right corner the offset is about 0.5 cm.
If I make a selection in Adobe Reader, everything seems OK.

I attached two PDF files with the offset to this post.
If you want to see the offset in action, you can use our service at 
http://pdf2data.cloudforpeople.com/ (Please do not consider it as advertising)

Please can you test these files and tell me if it is really a bug?
How can we resolve it?

Thanks,
Vitalie


  was:
Hello,

Preamble: We are glad to use PDFBox, and I am personally grateful to all developers 
who sustain this project. It is good work, guys!

We have one problem. For our application, we extract text from the PDF char by 
char, with the respective coordinates for each char. (see attached Parser)
After this we group the chars into words. We noticed that for some PDF 
documents the extracted rect coordinates have a strange offset. (see 
screens)

The offset seems to be incremental (not sure): at the top left corner of the 
document the extracted position is close to the real coordinates of the 
character, but at the bottom right corner the offset is about 0.5 cm.
If I make a selection in Adobe Reader, everything seems OK.

I attached two PDF files with the offset to this post.
If you want to see the offset in action, you can use our service at 
http://pdf2data.cloudforpeople.com/ (Please do not consider it as advertising)

Please can you test these files and tell me if it is really a bug?





[jira] [Updated] (PDFBOX-1542) Whitespaces between words are not created

2013-03-26 Thread Vitalie Bureanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalie Bureanu updated PDFBOX-1542:


Attachment: Parser.java

Our text extractor (with coordinates for each symbol).

 Whitespaces between words are not created
 -

 Key: PDFBOX-1542
 URL: https://issues.apache.org/jira/browse/PDFBOX-1542
 Project: PDFBox
  Issue Type: Wish
  Components: Text extraction
Affects Versions: 1.7.1
Reporter: Vitalie Bureanu
Priority: Minor
 Attachments: Parser.java

   Original Estimate: 1h
  Remaining Estimate: 1h

 Hello, I extract text from PDF files with PDFBox. I noticed that text 
 extraction from some PDF files is not as good as expected. I have a series 
 of PDF invoices from which I extract the text with coordinates, and the 
 result is pretty good, but I noticed a very strange thing: the words are 
 extracted without whitespace between them. For example, if I try to extract 
 "Unit Price", the result is "UnitPrice".
 But if I open the invoice in Adobe Reader and copy/paste it into Notepad, I 
 get "Unit Price" with the whitespace!
 I think the whitespace is not present in the original PDF document, but 
 Adobe Reader somehow inserts whitespace between words when it shows the 
 content of the PDF.
  
 Guys, can you please suggest how I can get the strings with spaces after 
 parsing? 
 See an example invoice here: http://www.cloudforpeople.com/Invoice1.pdf
 PS: I want to try version 1.8.0 of PDFBox. How can I download it?
 Many thanks,
 Vitalie



[jira] [Commented] (PDFBOX-1542) Whitespaces between words are not created

2013-03-26 Thread Vitalie Bureanu (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613609#comment-13613609
 ] 

Vitalie Bureanu commented on PDFBOX-1542:
-

Hello Andreas,

Thank you for the prompt reply. I attached the source code of our Parser, which 
we use for text extraction, to this issue. We extract the text symbol by 
symbol, with the respective coordinates for each symbol. When we extract all 
the symbols, the whitespace between them is missing.

Many thanks,
Vitalie
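
When text is extracted symbol by symbol with coordinates, as described above, 
one common way to recover missing spaces is to look at the horizontal gap 
between neighbouring symbols. The following is only a minimal sketch of that 
idea, not the attached Parser.java and not PDFBox code: the class 
SpaceInserter, the holder Glyph and the parameter gapFactor are hypothetical 
names, and the sketch assumes the symbols of one line are already sorted from 
left to right.

```java
import java.util.List;

// Hypothetical helper for illustration only; not part of PDFBox.
class SpaceInserter {

    // One extracted symbol: its text, left edge (x) and width, in the same units.
    static final class Glyph {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    // Joins the glyphs of a single line (assumed sorted by x). A space is
    // inserted whenever the gap to the previous glyph is larger than
    // gapFactor times the average glyph width of the line.
    static String join(List<Glyph> line, float gapFactor) {
        if (line.isEmpty()) {
            return "";
        }
        float totalWidth = 0f;
        for (Glyph g : line) {
            totalWidth += g.width;
        }
        float threshold = gapFactor * (totalWidth / line.size());

        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : line) {
            if (prev != null) {
                // Distance from the right edge of the previous glyph.
                float gap = g.x - (prev.x + prev.width);
                if (gap > threshold) {
                    sb.append(' ');
                }
            }
            sb.append(g.text);
            prev = g;
        }
        return sb.toString();
    }
}
```

The threshold is relative to the average glyph width so that the same 
gapFactor can work for different font sizes. A suitable value is an 
assumption that would need tuning against real invoices.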

 Whitespaces between words are not created
 -

 Key: PDFBOX-1542
 URL: https://issues.apache.org/jira/browse/PDFBOX-1542
 Project: PDFBox
  Issue Type: Wish
  Components: Text extraction
Affects Versions: 1.7.1
Reporter: Vitalie Bureanu
Priority: Minor
 Attachments: Parser.java

   Original Estimate: 1h
  Remaining Estimate: 1h




[jira] [Created] (PDFBOX-1542) Whitespaces between words are not created

2013-03-15 Thread Vitalie Bureanu (JIRA)
Vitalie Bureanu created PDFBOX-1542:
---

 Summary: Whitespaces between words are not created
 Key: PDFBOX-1542
 URL: https://issues.apache.org/jira/browse/PDFBOX-1542
 Project: PDFBox
  Issue Type: Wish
  Components: Text extraction
Affects Versions: 1.7.1
Reporter: Vitalie Bureanu
Priority: Minor


Hello, I extract text from PDF files with PDFBox. I noticed that text 
extraction from some PDF files is not as good as expected. I have a series of 
PDF invoices from which I extract the text with coordinates, and the result is 
pretty good, but I noticed a very strange thing: the words are extracted 
without whitespace between them. For example, if I try to extract "Total 
Amount", the result is "TotalAmount".
But if I open the invoice in Adobe Reader and copy/paste it into Notepad, I 
get "Total Amount" with the whitespace!
I think the whitespace is not present in the original PDF document, but Adobe 
Reader somehow inserts whitespace between words when it shows the content of 
the PDF.
 
Guys, can you please suggest how I can get the strings with spaces after 
parsing? 

See an example invoice here: http://www.cloudforpeople.com/Invoice1.pdf

PS: I want to try version 1.8.0 of PDFBox. How can I download it?

Many thanks,
Vitalie



[jira] [Updated] (PDFBOX-1542) Whitespaces between words are not created

2013-03-15 Thread Vitalie Bureanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalie Bureanu updated PDFBOX-1542:


Description: 
Hello, I extract text from PDF files with PDFBox. I noticed that text 
extraction from some PDF files is not as good as expected. I have a series of 
PDF invoices from which I extract the text with coordinates, and the result is 
pretty good, but I noticed a very strange thing: the words are extracted 
without whitespace between them. For example, if I try to extract "Unit 
Price", the result is "UnitPrice".
But if I open the invoice in Adobe Reader and copy/paste it into Notepad, I 
get "Unit Price" with the whitespace!
I think the whitespace is not present in the original PDF document, but Adobe 
Reader somehow inserts whitespace between words when it shows the content of 
the PDF.
 
Guys, can you please suggest how I can get the strings with spaces after 
parsing? 

See an example invoice here: http://www.cloudforpeople.com/Invoice1.pdf

PS: I want to try version 1.8.0 of PDFBox. How can I download it?

Many thanks,
Vitalie

  was:
Hello, I extract text from PDF files with PDFBox. I noticed that text 
extraction from some PDF files is not as good as expected. I have a series of 
PDF invoices from which I extract the text with coordinates, and the result is 
pretty good, but I noticed a very strange thing: the words are extracted 
without whitespace between them. For example, if I try to extract "Total 
Amount", the result is "TotalAmount".
But if I open the invoice in Adobe Reader and copy/paste it into Notepad, I 
get "Total Amount" with the whitespace!
I think the whitespace is not present in the original PDF document, but Adobe 
Reader somehow inserts whitespace between words when it shows the content of 
the PDF.
 
Guys, can you please suggest how I can get the strings with spaces after 
parsing? 

See an example invoice here: http://www.cloudforpeople.com/Invoice1.pdf

PS: I want to try version 1.8.0 of PDFBox. How can I download it?

Many thanks,
Vitalie


 Whitespaces between words are not created
 -

 Key: PDFBOX-1542
 URL: https://issues.apache.org/jira/browse/PDFBOX-1542
 Project: PDFBox
  Issue Type: Wish
  Components: Text extraction
Affects Versions: 1.7.1
Reporter: Vitalie Bureanu
Priority: Minor
   Original Estimate: 1h
  Remaining Estimate: 1h

