[jira] [Updated] (PDFBOX-1542) Whitespaces between words are not created

2013-03-26 Thread Vitalie Bureanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalie Bureanu updated PDFBOX-1542:


Attachment: Parser.java

Our text extractor (with coordinates for each symbol).

 Whitespaces between words are not created
 -----------------------------------------

                 Key: PDFBOX-1542
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1542
             Project: PDFBox
          Issue Type: Wish
          Components: Text extraction
    Affects Versions: 1.7.1
            Reporter: Vitalie Bureanu
            Priority: Minor
         Attachments: Parser.java

   Original Estimate: 1h
  Remaining Estimate: 1h

 Hello, I extract text from PDF files with PDFBox. I noticed that the 
 extraction from some PDF files is not as good as expected. I have a 
 series of PDF invoices from which I try to extract the text with coordinates, 
 and the result is pretty good, but I noticed a very strange thing: when I 
 extract the text, the words come out without whitespace between them. 
 Example: if I try to extract "Unit Price", the result is "UnitPrice".
 But if I open the invoice in Adobe Reader and copy/paste into 
 Notepad, I get "Unit Price" with the whitespace!
 I think the whitespace is not present in the original PDF document, but 
 Adobe Reader somehow inserts whitespace between words when it shows 
 the content of the PDF.
  
 Guys, can you please suggest how I can get the strings with spaces after 
 parsing? 
 See an example invoice here: http://www.cloudforpeople.com/Invoice1.pdf
 PS: I want to try version 1.8.0 of PDFBox - how can I download it?
 Many thanks,
 Vitalie

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PDFBOX-1542) Whitespaces between words are not created

2013-03-26 Thread Vitalie Bureanu (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613609#comment-13613609
 ] 

Vitalie Bureanu commented on PDFBOX-1542:
-----------------------------------------

Hello Andreas,

Thank you for the prompt reply. I have attached to this issue the source code of 
our Parser, which we use for text extraction. We extract symbol by symbol, with 
the respective coordinates for each symbol. When we extract all the symbols, the 
whitespace between them is missing.

Many thanks,
Vitalie
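
Archive note: when glyphs are extracted one by one with coordinates, as described above, a common workaround is to rebuild the missing spaces from the horizontal gaps between neighbouring glyphs. The following is only a minimal sketch of that idea, not the code of the attached Parser.java and not PDFBox's own algorithm. The `Glyph` class, the `joinWithSpaces` method and the `gapFactor` parameter are hypothetical names introduced for illustration, and the sketch assumes the glyphs of one line are already sorted left to right.

```java
import java.util.ArrayList;
import java.util.List;

public class GapSpacer {

    /** Hypothetical glyph record: one extracted symbol with its horizontal extent. */
    static final class Glyph {
        final String text;
        final double x;      // left edge, in page units
        final double width;  // glyph width, in page units

        Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins the glyphs of a single line (assumed sorted left to right) and inserts
     * a space wherever the gap to the previous glyph is larger than gapFactor
     * times the average glyph width of that line.
     */
    static String joinWithSpaces(List<Glyph> line, double gapFactor) {
        if (line.isEmpty()) {
            return "";
        }
        double totalWidth = 0;
        for (Glyph g : line) {
            totalWidth += g.width;
        }
        double threshold = gapFactor * totalWidth / line.size();

        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : line) {
            // Gap between the right edge of the previous glyph and the left edge of this one.
            if (prev != null && g.x - (prev.x + prev.width) > threshold) {
                out.append(' ');
            }
            out.append(g.text);
            prev = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "UnitPrice" drawn without a space character: only a wider gap after "Unit".
        List<Glyph> line = new ArrayList<>();
        String word = "UnitPrice";
        double x = 100;
        for (int i = 0; i < word.length(); i++) {
            line.add(new Glyph(String.valueOf(word.charAt(i)), x, 5));
            x += (i == 3) ? 8 : 5; // 3 extra units of gap after the 't'
        }
        System.out.println(joinWithSpaces(line, 0.3)); // prints "Unit Price"
    }
}
```

The threshold is relative to the average glyph width so that the same factor works across font sizes; a suitable value for real invoices would have to be found by experiment.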



[jira] [Updated] (PDFBOX-1552) Uppercase letters are read in lowercase manner

2013-03-26 Thread Hesham (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hesham updated PDFBOX-1552:
---------------------------

Attachment: pdf_with_uppercase_letters.pdf

This is a 1-page sample file for testing.




Overhaul PDFBox site

2013-03-26 Thread Maruan Sahyoun
Hi there,

what do you think about giving the PDFBox website an overhaul similar to 

http://cloudstack.apache.org/
http://ode.apache.org/index.html
http://cordova.apache.org

with a more prominent user guide such as http://ode.apache.org/userguide/
and a cleaner architecture description (together with main classes) for 
developers

to support a faster intro into pdfbox

Kind regards

Maruan Sahyoun



[jira] [Created] (PDFBOX-1552) Uppercase letters are read in lowercase manner

2013-03-26 Thread Hesham (JIRA)
Hesham created PDFBOX-1552:
---------------------------

             Summary: Uppercase letters are read in lowercase manner
                 Key: PDFBOX-1552
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1552
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.7.1
         Environment: Windows XP
            Reporter: Hesham
         Attachments: pdf_with_uppercase_letters.pdf

I have a PDF in which some uppercase letters are read as lowercase when I 
extract its contents with PDFBox. For example:
- The word "Testing" is read as "testing"
- The word "Eve" is read as "eve"
- The word "Deuteronomy" is read as "deuteronomy"

Andreas commented on this: The PDF uses marked content to replace a string 
(section 14.9.4, "Replacement Text", of the PDF spec provides a simple example). 
And yes, PDFBox doesn't support it yet.

Please check the attached 1-page sample PDF.
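
Archive note: the replacement-text mechanism mentioned above can be illustrated with a small model. In PDF, a marked-content sequence may carry an /ActualText entry whose value should be used in place of the text drawn inside that sequence. The sketch below is a simplified illustration under that assumption and is not PDFBox code: the `Event` class and the `extract` method are hypothetical, and a real content stream contains many more operators. It also assumes that the sample file wraps the lowercase string in such a sequence, which the report itself does not show.

```java
import java.util.Arrays;
import java.util.List;

public class ReplacementTextSketch {

    /** Hypothetical, simplified content-stream event; not a PDFBox type. */
    static final class Event {
        enum Kind { BEGIN_MARKED, SHOW_TEXT, END_MARKED }

        final Kind kind;
        // ActualText for BEGIN_MARKED (null if absent), shown string for SHOW_TEXT.
        final String value;

        private Event(Kind kind, String value) {
            this.kind = kind;
            this.value = value;
        }

        static Event begin(String actualText) { return new Event(Kind.BEGIN_MARKED, actualText); }
        static Event show(String shown)       { return new Event(Kind.SHOW_TEXT, shown); }
        static Event end()                    { return new Event(Kind.END_MARKED, null); }
    }

    /**
     * Extracts text, emitting the ActualText of a marked-content sequence once
     * and skipping everything shown inside that sequence.
     */
    static String extract(List<Event> events) {
        StringBuilder out = new StringBuilder();
        int depth = 0;          // current nesting depth of marked-content sequences
        int suppressDepth = 0;  // depth at which a replacement started; 0 means none active
        for (Event e : events) {
            switch (e.kind) {
                case BEGIN_MARKED:
                    depth++;
                    if (suppressDepth == 0 && e.value != null) {
                        out.append(e.value);
                        suppressDepth = depth;
                    }
                    break;
                case SHOW_TEXT:
                    if (suppressDepth == 0) {
                        out.append(e.value);
                    }
                    break;
                case END_MARKED:
                    if (depth > 0) {
                        if (depth == suppressDepth) {
                            suppressDepth = 0;
                        }
                        depth--;
                    }
                    break;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                Event.begin("Testing"),   // sequence with ActualText "Testing"
                Event.show("testing"),    // what the page content actually draws
                Event.end(),
                Event.show(" is fine"));
        System.out.println(extract(events)); // prints "Testing is fine"
    }
}
```

An extractor that ignores the marked-content events would return "testing is fine" for the same input, which matches the behaviour described in this report.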



Re: Overhaul PDFBox site

2013-03-26 Thread Timo Boehme

Hi,

an update to the website with a cleaner grouping of content etc. would 
help to attract people. While 'ode' and 'cordova' are visually nice, I 
would like to keep more navigation options on the start page, as 
in 'cloudstack'.



Best regards,
Timo


--

 Timo Boehme
 OntoChem GmbH
 H.-Damerow-Str. 4
 06120 Halle/Saale
 T: +49 345 4780474
 F: +49 345 4780471
 timo.boe...@ontochem.com

_

 OntoChem GmbH
 Geschäftsführer: Dr. Lutz Weber
 Sitz: Halle / Saale
 Registergericht: Stendal
 Registernummer: HRB 215461
_



Re: Overhaul PDFBox site

2013-03-26 Thread Maruan Sahyoun
Well, the navigation on ode is similar to cloudstack, only hidden behind 
drop-downs. Both use the same CSS framework [1], and the navigation styles can 
even be combined - that should give us enough freedom (and is an implementation 
detail). Both seem to be using the Apache CMS [2].

Maruan Sahyoun

[1] http://twitter.github.com/bootstrap/
[2] https://svn.apache.org/repos/infra/websites/cms/webgui/content/export.json



Re: Overhaul PDFBox site

2013-03-26 Thread Andreas Lehmkuehler

Hi,

On 26.03.2013 17:00, Maruan Sahyoun wrote:

 well - the navigation is similar also hidden behind drop downs on ode
 compared to cloudstack. Both are using the same css framework [1] and the
 navigation can even be combined - that should give us enough freedom (and
 is an implementation detail). Both seem to be using the Apache CMS [2].

I guess we all know that we have to overhaul the content itself. :-)

But first of all we have to decide how to manage the content. We have to use
either svnpubsub or the Apache CMS [1]; the latter is recommended. IMHO
we should use the CMS [2], as it would be more flexible and would make the
content easier to maintain.

As a good starting point I've changed the Maven skin of our site to the
Bootstrap-like Fluido skin [3]. Maybe it is a good idea to freshen up the layout
a little bit in preparation for a possible transition to the CMS.

WDYT? And the more interesting question: any volunteers to handle the transition?

BR
Andreas Lehmkühler

[1] http://www.apache.org/dev/project-site.html
[2] http://www.apache.org/dev/cmsref.html
[3] http://people.apache.org/~lehmi/pdfbox_fluendo/index.html


Maruan Sahyoun

[1] http://twitter.github.com/bootstrap/
[2] https://svn.apache.org/repos/infra/websites/cms/webgui/content/export.json



Re: Overhaul PDFBox site

2013-03-26 Thread Maruan Sahyoun
I would be happy to handle that.

Maruan Sahyoun


Jenkins build is back to normal : PDFBox-trunk #622

2013-03-26 Thread Apache Jenkins Server
See https://builds.apache.org/job/PDFBox-trunk/622/