[jira] [Updated] (PDFBOX-1542) Whitespaces between words are not created
[ https://issues.apache.org/jira/browse/PDFBOX-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vitalie Bureanu updated PDFBOX-1542:

Attachment: Parser.java

Our text extractor (with coordinates for each symbol).

Whitespaces between words are not created

Key: PDFBOX-1542
URL: https://issues.apache.org/jira/browse/PDFBOX-1542
Project: PDFBox
Issue Type: Wish
Components: Text extraction
Affects Versions: 1.7.1
Reporter: Vitalie Bureanu
Priority: Minor
Attachments: Parser.java
Original Estimate: 1h
Remaining Estimate: 1h

Hello,

I extract text from PDF files with PDFBox, and for some files the extraction is not as good as expected. I have a series of PDF invoices from which I extract the text with coordinates. The result is mostly good, but I noticed something strange: the words are extracted without whitespace between them. For example, when I extract "Unit Price", the result is "UnitPrice". But if I open the invoice in Adobe Reader and copy and paste into Notepad, I get "Unit Price" with the whitespace.

I think the whitespace is not present in the original PDF document, and Adobe Reader somehow inserts whitespace between words when it shows the content of the PDF.

Can you please suggest how I can get the strings with spaces after parsing? See an example invoice here: http://www.cloudforpeople.com/Invoice1.pdf

PS: I want to try version 1.8.0 of PDFBox. How can I download it?

Many thanks,
Vitalie

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PDFBOX-1542) Whitespaces between words are not created
[ https://issues.apache.org/jira/browse/PDFBOX-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613609#comment-13613609 ]

Vitalie Bureanu commented on PDFBOX-1542:

Hello Andreas,

Thank you for the prompt reply. I have attached the source code of the parser we use for text extraction. We extract the text symbol by symbol, with the respective coordinates for each symbol. Once all symbols have been extracted, the whitespace between the words is missing.

Many thanks,
Vitalie
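If the PDF really contains no space characters, as the reporter suspects, one common workaround is to infer word breaks from the horizontal gaps between the extracted symbols. The following is only a minimal sketch of that idea, not PDFBox code and not the attached Parser.java: the `GapSpacer` class, the `Glyph` type, and the tolerance value are all hypothetical, and it assumes the symbols of one text line are already sorted left to right.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: rebuild word gaps from glyph positions. Not PDFBox API. */
final class GapSpacer {

    /** One extracted symbol with its horizontal position and width (hypothetical type). */
    static final class Glyph {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins the glyphs of a single line (sorted left to right) and inserts a space
     * whenever the gap to the previous glyph exceeds tolerance * average glyph width.
     */
    static String join(List<Glyph> line, float tolerance) {
        if (line.isEmpty()) {
            return "";
        }
        float totalWidth = 0f;
        for (Glyph g : line) {
            totalWidth += g.width;
        }
        float threshold = tolerance * (totalWidth / line.size());

        StringBuilder out = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : line) {
            if (previous != null) {
                // Distance between the right edge of the previous glyph and this one.
                float gap = g.x - (previous.x + previous.width);
                if (gap > threshold) {
                    out.append(' ');
                }
            }
            out.append(g.text);
            previous = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = new ArrayList<>();
        // "Unit" and "Price" placed with a 4-unit gap and no space glyph in between.
        float x = 0f;
        for (String s : "Unit".split("")) {
            line.add(new Glyph(s, x, 5f));
            x += 5f;
        }
        x += 4f;
        for (String s : "Price".split("")) {
            line.add(new Glyph(s, x, 5f));
            x += 5f;
        }
        System.out.println(join(line, 0.5f));
    }
}
```

The threshold is relative to the average glyph width so that it scales with the font size; a real extractor would probably also need to handle line breaks, rotated text, and kerning, which this sketch ignores.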
[jira] [Updated] (PDFBOX-1552) Uppercase letters are read in lowercase manner
[ https://issues.apache.org/jira/browse/PDFBOX-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hesham updated PDFBOX-1552:

Attachment: pdf_with_uppercase_letters.pdf

This is a 1-page sample file to test.

Uppercase letters are read in lowercase manner

Key: PDFBOX-1552
URL: https://issues.apache.org/jira/browse/PDFBOX-1552
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.7.1
Environment: Windows XP
Reporter: Hesham
Attachments: pdf_with_uppercase_letters.pdf

I have a PDF in which some uppercase letters are read as lowercase when I extract its contents using PDFBox. For example:
- The word "Testing" is read as "testing"
- The word "Eve" is read as "eve"
- The word "Deuteronomy" is read as "deuteronomy"

Andreas commented on this: The PDF uses marked content to replace a string (section 14.9.4, "Replacement Text", of the PDF specification provides a simple example). And yes, PDFBox doesn't support it yet.

Please check this 1-page sample PDF.
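To illustrate what supporting replacement text would mean for an extractor, here is a minimal sketch. It is not PDFBox code: the `ReplacementTextDemo` class and the `Run` type are hypothetical, and it assumes (without having verified the sample file) that the capital letter is drawn with a glyph that decodes to a lowercase character while the enclosing marked-content sequence carries the intended string in its /ActualText entry.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of honoring replacement text during extraction. Not PDFBox API. */
final class ReplacementTextDemo {

    /** A run of shown text, optionally wrapped in marked content carrying /ActualText. */
    static final class Run {
        final String shownText;   // what the glyph codes decode to
        final String actualText;  // replacement text, or null if the run has none

        Run(String shownText, String actualText) {
            this.shownText = shownText;
            this.actualText = actualText;
        }
    }

    /** Concatenates the runs, preferring the replacement text when requested and present. */
    static String extract(List<Run> runs, boolean honorActualText) {
        StringBuilder out = new StringBuilder();
        for (Run run : runs) {
            boolean replace = honorActualText && run.actualText != null;
            out.append(replace ? run.actualText : run.shownText);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Run> word = new ArrayList<>();
        word.add(new Run("t", "T"));       // assumed: glyph decodes to "t", /ActualText is "T"
        word.add(new Run("esting", null)); // plain run without replacement text
        System.out.println(extract(word, false)); // ignores replacement text
        System.out.println(extract(word, true));  // honors replacement text
    }
}
```

An extractor that ignores the replacement text returns the decoded glyph text, which would match the lowercase output described in the report.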
Overhaul PDFBox site
Hi there,

what do you think about giving the PDFBox website an overhaul similar to

http://cloudstack.apache.org/
http://ode.apache.org/index.html
http://cordova.apache.org

with a more prominent user guide, such as http://ode.apache.org/userguide/, and a cleaner architecture description (together with the main classes) to give developers a faster introduction to PDFBox?

Kind regards

Maruan Sahyoun
[jira] [Created] (PDFBOX-1552) Uppercase letters are read in lowercase manner
Hesham created PDFBOX-1552: Summary: Uppercase letters are read in lowercase manner Key: PDFBOX-1552 URL: https://issues.apache.org/jira/browse/PDFBOX-1552 Project: PDFBox Issue Type: Bug Components: Text extraction Affects Versions: 1.7.1 Environment: Windows XP Reporter: Hesham Attachments: pdf_with_uppercase_letters.pdf
Re: Overhaul PDFBox site
Hi,

an update to the website with a cleaner grouping of content etc. would help to attract people. While 'ode' and 'cordova' are visually nice, I would like to keep more navigation possibilities on the start page, as in 'cloudstack'.

Best regards,
Timo

On 26.03.2013 14:03, Maruan Sahyoun wrote:
> Hi there, what do you think about giving the PDFBox website an overhaul similar to [...]

--
Timo Boehme
OntoChem GmbH
H.-Damerow-Str. 4
06120 Halle/Saale
T: +49 345 4780474
F: +49 345 4780471
timo.boe...@ontochem.com

OntoChem GmbH
Managing Director: Dr. Lutz Weber
Registered office: Halle / Saale
Register court: Stendal
Registration number: HRB 215461
Re: Overhaul PDFBox site
Well, the navigation on ode is similar to cloudstack's, just hidden behind drop-downs. Both use the same CSS framework [1], and the two navigation styles can even be combined. That should give us enough freedom (and is an implementation detail). Both seem to be using the Apache CMS [2].

Maruan Sahyoun

[1] http://twitter.github.com/bootstrap/
[2] https://svn.apache.org/repos/infra/websites/cms/webgui/content/export.json

On 26.03.2013 at 15:22, Timo Boehme timo.boe...@ontochem.com wrote:
> Hi, an update to the website with a cleaner grouping of content etc. would help to attract people. [...]
Re: Overhaul PDFBox site
Hi,

On 26.03.2013 17:00, Maruan Sahyoun wrote:
> [...] that should give us enough freedom (and is an implementation detail). [...]

I guess we all know that we have to overhaul the content itself. :-) But first of all we have to decide how to manage the content. We have to use either svnpubsub or the Apache CMS [1]; the latter is recommended. IMHO we should use the CMS [2], as it would be more flexible and makes the content easier to maintain.

As a starting point, I've changed the Maven skin of our site to the Bootstrap-like Fluido skin [3]. Maybe it is a good idea to freshen up the layout a little in preparation for a possible transition to the CMS.

WDYT? And the more interesting question: any volunteer to handle the transition?

BR
Andreas Lehmkühler

[1] http://www.apache.org/dev/project-site.html
[2] http://www.apache.org/dev/cmsref.html
[3] http://people.apache.org/~lehmi/pdfbox_fluendo/index.html
Re: Overhaul PDFBox site
I would be happy to handle that.

Maruan Sahyoun

On 26.03.2013 at 22:35, Andreas Lehmkuehler andr...@lehmi.de wrote:
> [...] any volunteer to handle the transition?
Jenkins build is back to normal : PDFBox-trunk #622
See https://builds.apache.org/job/PDFBox-trunk/622/