from:"Vitalie Bureanu (JIRA)"

[jira] [Closed] (PDFBOX-1858) Extracted text does not have spaces

2014-01-23 Thread Vitalie Bureanu (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalie Bureanu closed PDFBOX-1858.
---

Resolution: Not A Problem

It seems that the problem is inside our own software.

 Extracted text does not have spaces
 ---

 Key: PDFBOX-1858
 URL: https://issues.apache.org/jira/browse/PDFBOX-1858
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing, Text extraction
Affects Versions: 1.8.3
 Environment: Linux 64bit, Java
Reporter: Vitalie Bureanu
 Attachments: Screenshot.jpg, test.pdf

   Original Estimate: 3h
  Remaining Estimate: 3h

 Extracted text does not have spaces between some words.
 To test, please use the string on line 74a... inside the attached test.pdf.
 It is extracted as: 74a Amount of line73youwant refunded toyou . If 
 Form isattached , checkhere
 The result does not look right; the words are glued together.
 I tried to use the PDFTextStripper class, but the result remains the same.
 For us it is a big problem. Can it be resolved, please?
 With respect,
 Vitalie



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Created] (PDFBOX-1858) Extracted text does not have spaces

2014-01-22 Thread Vitalie Bureanu (JIRA)

Vitalie Bureanu created PDFBOX-1858:
---

 Summary: Extracted text does not have spaces
 Key: PDFBOX-1858
 URL: https://issues.apache.org/jira/browse/PDFBOX-1858
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing, Text extraction
Affects Versions: 1.8.3
 Environment: Linux 64bit, Java
Reporter: Vitalie Bureanu


Extracted text does not have spaces between some words.

To test, please use the string on line 74a... inside the attached test.pdf.

It is extracted as: 74a Amount of line73youwant refunded toyou . If 
Form isattached , checkhere

The result does not look right; the words are glued together.

I tried to use the PDFTextStripper class, but the result remains the same.

Can it be solved, please?

With respect,
Vitalie




[jira] [Updated] (PDFBOX-1858) Extracted text does not have spaces

2014-01-22 Thread Vitalie Bureanu (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalie Bureanu updated PDFBOX-1858:


Attachment: Screenshot.jpg
test.pdf


[jira] [Updated] (PDFBOX-1858) Extracted text does not have spaces

2014-01-22 Thread Vitalie Bureanu (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalie Bureanu updated PDFBOX-1858:


Attachment: (was: Untitled-1.jpg)


[jira] [Updated] (PDFBOX-1858) Extracted text does not have spaces

2014-01-22 Thread Vitalie Bureanu (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalie Bureanu updated PDFBOX-1858:


Attachment: Untitled-1.jpg


[jira] [Updated] (PDFBOX-1858) Extracted text does not have spaces

2014-01-22 Thread Vitalie Bureanu (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalie Bureanu updated PDFBOX-1858:


Description: 
Extracted text does not have spaces between some words.

To test, please use the string on line 74a... inside the attached test.pdf.

It is extracted as: 74a Amount of line73youwant refunded toyou . If 
Form isattached , checkhere

The result does not look right; the words are glued together.

I tried to use the PDFTextStripper class, but the result remains the same.

For us it is a big problem. Can it be resolved, please?

With respect,
Vitalie

  was:
Extracted text does not have spaces between some words.

To test, please use the string on line 74a... inside the attached test.pdf.

It is extracted as: 74a Amount of line73youwant refunded toyou . If 
Form isattached , checkhere

The result does not look right; the words are glued together.

I tried to use the PDFTextStripper class, but the result remains the same.

Can it be resolved, please?

With respect,
Vitalie



[jira] [Updated] (PDFBOX-1858) Extracted text does not have spaces

2014-01-22 Thread Vitalie Bureanu (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalie Bureanu updated PDFBOX-1858:


Description: 
Extracted text does not have spaces between some words.

To test, please use the string on line 74a... inside the attached test.pdf.

It is extracted as: 74a Amount of line73youwant refunded toyou . If 
Form isattached , checkhere

The result does not look right; the words are glued together.

I tried to use the PDFTextStripper class, but the result remains the same.

Can it be resolved, please?

With respect,
Vitalie

  was:
Extracted text does not have spaces between some words.

To test, please use the string on line 74a... inside the attached test.pdf.

It is extracted as: 74a Amount of line73youwant refunded toyou . If 
Form isattached , checkhere

The result does not look right; the words are glued together.

I tried to use the PDFTextStripper class, but the result remains the same.

Can it be solved, please?

With respect,
Vitalie



[jira] [Updated] (PDFBOX-1858) Extracted text does not have spaces

2014-01-22 Thread Vitalie Bureanu (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalie Bureanu updated PDFBOX-1858:


Description: 
Extracted text does not have spaces between some words.

To test, please use the string on line 74a... inside the attached test.pdf.

It is extracted as: 74a Amount of line73youwant refunded toyou . If 
Form isattached , checkhere

The result does not look right; the words are glued together.

I tried to use the PDFTextStripper class, but the result remains the same.

Can it be solved, please?

With respect,
Vitalie

  was:
Extracted text does not have spaces between some words.

To test, please use the string on line 74a... inside the attached test.pdf.

It is extracted as: 74a Amount of line73youwant refunded toyou . If 
Form isattached , checkhere

The result does not look right; the words are glued together.

I tried to use the PDFTextStripper class, but the resultstill remains the same.

Can it be solved, please?

With respect,
Vitalie



[jira] [Commented] (PDFBOX-1575) PDFTextStripper sometimes adds spaces after a detached words

2013-04-23 Thread Vitalie Bureanu (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13639065#comment-13639065
 ] 

Vitalie Bureanu commented on PDFBOX-1575:
-

Update: I noticed that this bug happens almost always, on different documents.
For us, these whitespaces after completely detached words are very 
problematic... :( 
I cannot fix it myself; can somebody help?


 PDFTextStripper sometimes adds spaces after a detached words
 

 Key: PDFBOX-1575
 URL: https://issues.apache.org/jira/browse/PDFBOX-1575
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.1
 Environment: Linux 64bit
Reporter: Vitalie Bureanu
  Labels: pdfbox, space, whitespace,
 Attachments: example.pdf

   Original Estimate: 2h
  Remaining Estimate: 2h

 Hello dear developers,
 I noticed that PDFTextStripper sometimes adds a space after completely 
 detached words...
 For example, if you extract the text from the attached file you will see that 
 PDFTextStripper adds one space after the words Qty and Unit Price but does 
 not add one after Description and Line Total.
 I think this is a bug, because no whitespace should be present after the 
 words Qty and Unit Price.
 Can you please fix it?
 (see attachment)
 Thank you very much,
 Vitalie
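
Until the cause of the extra spaces is known, a caller could clean the extracted text itself. The following is a minimal sketch of such a workaround in plain Java; it does not use or change PDFBox, and the class name `TrailingSpaceCleaner` and the sample string are assumptions made for illustration only.

```java
import java.util.regex.Pattern;

final class TrailingSpaceCleaner {

    // Spaces or tabs that sit directly before a line break or the end of the text.
    private static final Pattern TRAILING = Pattern.compile("[ \\t]+(?=\\r?\\n|$)");

    /** Removes trailing spaces and tabs from every line; inner spaces are kept. */
    static String stripTrailingSpaces(String extracted) {
        return TRAILING.matcher(extracted).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical extraction result with a space after "Qty" and "Unit Price".
        String raw = "Qty \nDescription\nUnit Price \nLine Total";
        System.out.println(stripTrailingSpaces(raw));
    }
}
```

This only hides the symptom for words that end a line; it does not explain why the space is emitted in the first place.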

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (PDFBOX-1575) PDFTextStripper adds spaces after a detached words

2013-04-23 Thread Vitalie Bureanu (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalie Bureanu updated PDFBOX-1575:


Summary: PDFTextStripper adds spaces after a detached words  (was: 
PDFTextStripper sometimes adds spaces after a detached words)


[jira] [Updated] (PDFBOX-1575) PDFTextStripper sometimes adds spaces after a detached one word

2013-04-22 Thread Vitalie Bureanu (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalie Bureanu updated PDFBOX-1575:


Attachment: example.pdf


[jira] [Updated] (PDFBOX-1575) PDFTextStripper sometimes adds spaces after a detached words

2013-04-22 Thread Vitalie Bureanu (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalie Bureanu updated PDFBOX-1575:


Summary: PDFTextStripper sometimes adds spaces after a detached words  
(was: PDFTextStripper sometimes adds spaces after a detached one word)


[jira] [Commented] (PDFBOX-1553) Offset of extracted coordinates

2013-03-29 Thread Vitalie Bureanu (JIRA)

[
https://issues.apache.org/jira/browse/PDFBOX-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617629#comment-13617629
]

Vitalie Bureanu commented on PDFBOX-1553:
-

Hello Andreas, I checked it but I get the same result: the coordinates have an offset.

Offset of extracted coordinates
---

Key: PDFBOX-1553
URL: https://issues.apache.org/jira/browse/PDFBOX-1553
Project: PDFBox
Issue Type: Bug
Affects Versions: 1.8.0
Environment: Linux Ubuntu 64 bit, Java
Reporter: Vitalie Bureanu
Priority: Minor
Labels: offset
Attachments: EnSt10_offset.pdf, EnSt11_offset.pdf, Extracted
coordinates of rects.jpg, Parser.java, Selection in Adobe Reader.png

Original Estimate: 24h
Remaining Estimate: 24h

Hello,
Preamble: We are glad to use PDFBox and I am personally grateful to all
developers who sustain this project. It is good work, guys!
We have one problem. For our application we extract text from the PDF char
by char, with the respective coordinates for each char (see attached Parser).
After this we group the chars into words. We noticed that for some PDF
documents the extracted rect coordinates have a strange offset (see
screens).
The offset seems to be incremental (not sure): at the top left corner of the
document the extracted position is close to the real coordinates of the
character, but at the bottom right corner the offset is close to 0.5 cm.
If I make a selection in Adobe Reader, everything seems OK.
I attached two PDF files showing the offset to this post.
If you want to see the offset in action, you can use our service at
http://pdf2data.cloudforpeople.com/ (please do not consider this advertising).
Can you please test these files and tell me whether it really is a bug?
How can we resolve it?
Thanks,
Vitalie
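
The thread does not identify the cause of the offset. As an illustration only, one hypothetical way an offset can grow from the top left to the bottom right is a uniform scale mismatch between the coordinate system of the extracted values and the one used to draw them. The sketch below assumes such a mismatch; the class name, the scale factor 1.02 and the sample coordinates are all made up. The only fixed fact it relies on is that one PDF point is 1/72 inch.

```java
final class OffsetCheck {

    // 1 inch = 72 PDF points = 2.54 cm.
    static final double POINTS_PER_CM = 72.0 / 2.54;

    static double cmToPoints(double cm) {
        return cm * POINTS_PER_CM;
    }

    /** Offset produced at a coordinate when everything is scaled by a (hypothetical) factor. */
    static double offsetAt(double coordinate, double scale) {
        return coordinate * scale - coordinate;
    }

    public static void main(String[] args) {
        double scale = 1.02; // assumed scale error, not taken from the report
        System.out.println("0.5 cm in points: " + cmToPoints(0.5));
        for (double x : new double[] {0.0, 100.0, 500.0}) {
            // The offset is zero at the origin and grows with the distance from it.
            System.out.println("x=" + x + " offset=" + offsetAt(x, scale));
        }
    }
}
```

Comparing the measured offset at several positions against this linear pattern would be one way to test the assumption; an offset that is constant across the page would point to a translation instead.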


[jira] [Commented] (PDFBOX-1542) Whitespaces between words are not created

2013-03-27 Thread Vitalie Bureanu (JIRA)

[
https://issues.apache.org/jira/browse/PDFBOX-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13614995#comment-13614995
]

Vitalie Bureanu commented on PDFBOX-1542:
-

Thank you very much, Andreas! We will try to use PDFTextStripper to insert
white spaces!

Whitespaces between words are not created
-

Key: PDFBOX-1542
URL: https://issues.apache.org/jira/browse/PDFBOX-1542
Project: PDFBox
Issue Type: Wish
Components: Text extraction
Affects Versions: 1.7.1
Reporter: Vitalie Bureanu
Priority: Minor
Attachments: Parser.java

Original Estimate: 1h
Remaining Estimate: 1h

Hello, I extract text from PDF files with PDFBox. I noticed that text
extraction from some PDF files is not as good as expected. I have a series
of PDF invoices from which I try to extract the text with coordinates, and
the result is pretty good, but I noticed a very strange thing: the words are
extracted without whitespace between them. Example: if I try to extract
Unit Price, the result is UnitPrice.
But if I open the invoice in Adobe Reader and copy/paste into Notepad,
I get Unit Price with the whitespace!
I think the whitespace is not present in the original PDF document, but
Adobe Reader somehow inserts whitespace between words when it shows the
content of the PDF.

Guys, can you please suggest how I can get the strings with spaces after
parsing?
See an example invoice here: http://www.cloudforpeople.com/Invoice1.pdf
PS: I want to try version 1.8.0 of PDFBox. How can I download it?
Many thanks,
Vitalie

[jira] [Created] (PDFBOX-1553) Offset of extracted coordinates

2013-03-27 Thread Vitalie Bureanu (JIRA)

Vitalie Bureanu created PDFBOX-1553:
---

 Summary: Offset of extracted coordinates
 Key: PDFBOX-1553
 URL: https://issues.apache.org/jira/browse/PDFBOX-1553
 Project: PDFBox
  Issue Type: Bug
Affects Versions: 1.8.0
 Environment: Linux Ubuntu 64 bit, Java
Reporter: Vitalie Bureanu


Hello,

Preamble: We are glad to use PDFBox and I am personally grateful to all developers 
who sustain this project. It is good work, guys!

We have one problem. For our application we extract text from the PDF char by 
char, with the respective coordinates for each char (see attached Parser).
After this we group the chars into words. We noticed that for some PDF 
documents the extracted coordinates have a strange offset (see screens).

The offset is incremental: at the top left corner of the document the extracted 
position is close to the real coordinates of the character, but at the bottom 
right corner the offset is close to 0.5 cm.
If I make a selection in Adobe Reader, everything seems OK.

I attached two PDF files showing the offset to this post.
If you want to see the offset in action, you can use our service at 
http://pdf2data.cloudforpeople.com/ (please do not consider this advertising).





[jira] [Updated] (PDFBOX-1553) Offset of extracted coordinates

2013-03-27 Thread Vitalie Bureanu (JIRA)

[
https://issues.apache.org/jira/browse/PDFBOX-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Vitalie Bureanu updated PDFBOX-1553:

Priority: Minor (was: Major)


[jira] [Updated] (PDFBOX-1553) Offset of extracted coordinates

2013-03-27 Thread Vitalie Bureanu (JIRA)

[
https://issues.apache.org/jira/browse/PDFBOX-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Vitalie Bureanu updated PDFBOX-1553:

Attachment: Selection in Adobe Reader.png
Extracted coordinates of rects.jpg
Parser.java
EnSt11_offset.pdf
EnSt10_offset.pdf


[jira] [Updated] (PDFBOX-1553) Offset of extracted coordinates

2013-03-27 Thread Vitalie Bureanu (JIRA)

[
https://issues.apache.org/jira/browse/PDFBOX-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Vitalie Bureanu updated PDFBOX-1553:

Description:
Hello,

Preamble: We are glad to use PDFBox and I am personally grateful to all
developers who sustain this project. It is good work, guys!

We have one problem. For our application we extract text from the PDF char by
char, with the respective coordinates for each char (see attached Parser).
After this we group the chars into words. We noticed that for some PDF
documents the extracted rect coordinates have a strange offset (see
screens).

The offset seems to be incremental (not sure): at the top left corner of the
document the extracted position is close to the real coordinates of the
character, but at the bottom right corner the offset is close to 0.5 cm.
If I make a selection in Adobe Reader, everything seems OK.

I attached two PDF files showing the offset to this post.
If you want to see the offset in action, you can use our service at
http://pdf2data.cloudforpeople.com/ (please do not consider this advertising).

Can you please test these files and tell me whether it really is a bug?
How can we resolve it?

Thanks,
Vitalie

was:
Hello,

Preamble: We are glad to use PDFBox and I am personally grateful to all
developers who sustain this project. It is good work, guys!

Can you please test these files and tell me whether it really is a bug?


[jira] [Updated] (PDFBOX-1542) Whitespaces between words are not created

2013-03-26 Thread Vitalie Bureanu (JIRA)

[
https://issues.apache.org/jira/browse/PDFBOX-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Vitalie Bureanu updated PDFBOX-1542:

Attachment: Parser.java

Our text extractor (with coordinates for each symbol).


[jira] [Commented] (PDFBOX-1542) Whitespaces between words are not created

2013-03-26 Thread Vitalie Bureanu (JIRA)

[
https://issues.apache.org/jira/browse/PDFBOX-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13613609#comment-13613609
]

Vitalie Bureanu commented on PDFBOX-1542:
-

Hello Andreas,

Thank you for the prompt reply. I attached the source code of our Parser,
which we use for text extraction, to this post. We extract symbol by symbol,
with the respective coordinates for each symbol. When we put all the extracted
symbols together, the white spaces between the words are missing.

Many thanks,
Vitalie
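
When text is collected symbol by symbol, spaces that are not stored in the PDF have to be inferred from the gaps between symbols. The following is a minimal sketch of that idea in plain Java, not the code of the attached Parser and not PDFBox's own algorithm. The `Glyph` class, the `gapFactor` threshold and the sample coordinates are hypothetical and would need tuning against real documents.

```java
import java.util.ArrayList;
import java.util.List;

final class GapWordGrouper {

    /** One extracted symbol with its left x coordinate and width (hypothetical fields). */
    static final class Glyph {
        final String text;
        final double x;
        final double width;

        Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /** Builds a test line where every character has the same width. */
    static List<Glyph> fixedWidthGlyphs(String text, double[] xs, double width) {
        List<Glyph> line = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            line.add(new Glyph(String.valueOf(text.charAt(i)), xs[i], width));
        }
        return line;
    }

    /**
     * Joins the glyphs of one line (already sorted left to right) and inserts a
     * space when the gap to the previous glyph exceeds gapFactor times that
     * glyph's width.
     */
    static String joinWithSpaces(List<Glyph> line, double gapFactor) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : line) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width);
                if (gap > gapFactor * prev.width) {
                    out.append(' ');
                }
            }
            out.append(g.text);
            prev = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up positions: a 3-unit gap separates "Unit" from "Price".
        double[] xs = {0, 5, 10, 15, 23, 28, 33, 38, 43};
        List<Glyph> line = fixedWidthGlyphs("UnitPrice", xs, 5.0);
        System.out.println(joinWithSpaces(line, 0.3)); // prints "Unit Price"
    }
}
```

A threshold relative to the glyph width is used so that the same factor can work across font sizes; a fixed absolute gap would be an equally valid choice for documents with a single font size.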


[jira] [Created] (PDFBOX-1542) Whitespaces between words are not created

2013-03-15 Thread Vitalie Bureanu (JIRA)

Vitalie Bureanu created PDFBOX-1542:
---

 Summary: Whitespaces between words are not created
 Key: PDFBOX-1542
 URL: https://issues.apache.org/jira/browse/PDFBOX-1542
 Project: PDFBox
  Issue Type: Wish
  Components: Text extraction
Affects Versions: 1.7.1
Reporter: Vitalie Bureanu
Priority: Minor


Hello, I extract text from PDF files with PDFBox. I noticed that text 
extraction from some PDF files is not as good as expected. I have a series of 
PDF invoices from which I try to extract the text with coordinates, and the 
result is pretty good, but I noticed a very strange thing: the words are 
extracted without whitespace between them. Example: if I try to extract Total 
Amount, the result is TotalAmount.
But if I open the invoice in Adobe Reader and copy/paste into Notepad, 
I get Total Amount with the whitespace!
I think the whitespace is not present in the original PDF document, but 
Adobe Reader somehow inserts whitespace between words when it shows the 
content of the PDF.
 
Guys, can you please suggest how I can get the strings with spaces after 
parsing? 

See an example invoice here: http://www.cloudforpeople.com/Invoice1.pdf

PS: I want to try version 1.8.0 of PDFBox. How can I download it?

Many thanks,
Vitalie


[jira] [Updated] (PDFBOX-1542) Whitespaces between words are not created

2013-03-15 Thread Vitalie Bureanu (JIRA)

[
https://issues.apache.org/jira/browse/PDFBOX-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Vitalie Bureanu updated PDFBOX-1542:

Description:
Hello, I extract text from PDF files with PDFBox. I noticed that text
extraction from some PDF files is not as good as expected. I have a series of
PDF invoices from which I try to extract the text with coordinates, and the
result is pretty good, but I noticed a very strange thing: the words are
extracted without whitespace between them. Example: if I try to extract Unit
Price, the result is UnitPrice.
But if I open the invoice in Adobe Reader and copy/paste into Notepad,
I get Unit Price with the whitespace!
I think the whitespace is not present in the original PDF document, but
Adobe Reader somehow inserts whitespace between words when it shows the
content of the PDF.

Guys, can you please suggest how I can get the strings with spaces after
parsing?

See an example invoice here: http://www.cloudforpeople.com/Invoice1.pdf

PS: I want to try version 1.8.0 of PDFBox. How can I download it?

Many thanks,
Vitalie

was:
Hello, I extract text from PDF files with PDFBox. I noticed that text
extraction from some PDF files is not as good as expected. I have a series of
PDF invoices from which I try to extract the text with coordinates, and the
result is pretty good, but I noticed a very strange thing: the words are
extracted without whitespace between them. Example: if I try to extract Total
Amount, the result is TotalAmount.
But if I open the invoice in Adobe Reader and copy/paste into Notepad,
I get Total Amount with the whitespace!
I think the whitespace is not present in the original PDF document, but
Adobe Reader somehow inserts whitespace between words when it shows the
content of the PDF.

Guys, can you please suggest how I can get the strings with spaces after
parsing?

See an example invoice here: http://www.cloudforpeople.com/Invoice1.pdf

PS: I want to try version 1.8.0 of PDFBox. How can I download it?

Many thanks,
Vitalie

