Created new version 1.8.9 in JIRA

2014-12-09 Thread Andreas Lehmkühler
Hi,

maybe a little bit too early, but I've created a new 1.8.9 version in JIRA.
Obviously, Tilman is already working on 2 issues that fit into this version.

BR
Andreas Lehmkühler


[jira] [Updated] (PDFBOX-2539) [PATCH] Allow non static FontProvider

2014-12-09 Thread simon steiner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

simon steiner updated PDFBOX-2539:
--
Attachment: (was: fontProvider.patch)

 [PATCH] Allow non static FontProvider
 -

 Key: PDFBOX-2539
 URL: https://issues.apache.org/jira/browse/PDFBOX-2539
 Project: PDFBox
  Issue Type: Bug
  Components: FontBox
Affects Versions: 2.0.0
Reporter: simon steiner
 Attachments: fontProvider.patch


 I would like to use multiple instances of FontProvider in a thread-safe way



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PDFBOX-2539) [PATCH] Allow non static FontProvider

2014-12-09 Thread simon steiner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

simon steiner updated PDFBOX-2539:
--
Attachment: fontProvider.patch

Fix patch

 [PATCH] Allow non static FontProvider
 -

 Key: PDFBOX-2539
 URL: https://issues.apache.org/jira/browse/PDFBOX-2539
 Project: PDFBox
  Issue Type: Bug
  Components: FontBox
Affects Versions: 2.0.0
Reporter: simon steiner
 Attachments: fontProvider.patch


 I would like to use multiple instances of FontProvider in a thread-safe way





Re: preflight mass tests

2014-12-09 Thread Maruan Sahyoun
Hello Tilman,

do you have a rough estimate of what share of the files is, e.g. in Adobe Reader, 
either not displayed, displayed with a dialog, or displayed incorrectly?

Kind regards
 
Maruan Sahyoun

FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahy...@fileaffairs.de
www.fileaffairs.de

Managing Director: Maruan Sahyoun
Commercial Register: AG Düsseldorf, HRB 53837
VAT ID: DE248275827

On 05.12.2014 at 20:45, Tilman Hausherr thaush...@t-online.de wrote:

 Some numbers... it took 4-5 days
 
 total: 231223, failed: 142, percentage failed: 0.06141257472336292
 
 Of these, one can subtract 33 OutOfMemoryErrors that happened near the end 
 of the test. Isolated runs went fine.
 
 about the rest:
 18 are the isSymbol stackoverflow
 9 are the getFontMatrix NPE
 33 are the root must be of type Pages errors
 
 The rest is mostly related to very broken PDF files.
 
 Tilman
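
As a rough cross-check of the figures quoted above, the failure percentage can be recomputed from the two counts. This is only a sketch: the class and method names are made up for illustration and are not part of PDFBox or of the test setup used in this thread.

```java
// Illustrative only: recomputes the failure percentage from the reported counts.
final class FailureRate {
    private FailureRate() {}

    /** Percentage of failed files, e.g. 142 failures out of 231223 files. */
    static double percentFailed(long failed, long total) {
        if (total <= 0) {
            throw new IllegalArgumentException("total must be positive");
        }
        return 100.0 * failed / total;
    }

    public static void main(String[] args) {
        long total = 231223;
        long failed = 142;
        long outOfMemory = 33; // OutOfMemoryErrors that went fine in isolated runs
        System.out.printf("all failures: %.4f%%%n", percentFailed(failed, total));
        System.out.printf("without OOM:  %.4f%%%n",
                percentFailed(failed - outOfMemory, total));
    }
}
```

Leaving out the 33 OutOfMemoryErrors gives 109 failures, of which 60 (18 + 9 + 33) fall under the three named causes.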
 
 
 On 04.12.2014 at 14:55, Maruan Sahyoun wrote:
 Hi Tilman,
 
 that's very good news. I trust a lot of time went into reviewing the test 
 results. Without your and Tim's efforts this achievement wouldn't have been 
 possible.
 
 BR
 
 Maruan
 
 On 03.12.2014 at 21:04, Tilman Hausherr thaush...@t-online.de wrote:
 
 I've now run preflight on half of the govdocs files. Every issue I have 
 opened on preflight is related to that test. The failure rate (exceptions 
 other than the allowed ValidationExceptions) is down from 1% when I 
 started to 0.05% now. Most of the frequent exceptions (e.g. the one with 
 NonTerminalField) have been fixed. What's left now are exceptions related to 
 messy files, and some of the font related issues.
 
 Tilman
 
 On 03.11.2014 at 22:58, Tilman Hausherr wrote:
 On 03.11.2014 at 19:00, Tilman Hausherr wrote:
 It is not looking good, there is at least one NPE issue coming.
 No more NPE after solving the two issues I opened today except 
 PDFBOX-1743.pdf which is a known problem.
 
 Coming up soon: run preflight on the 231227 PDF files from digitalcorpora 
 to see what happens.
 
 Tilman
 
 
 



[jira] [Updated] (PDFBOX-1886) Merge Function strips OCR layer in acrobat

2014-12-09 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-1886:
---
Fix Version/s: 2.1.0

 Merge Function strips OCR layer in acrobat
 --

 Key: PDFBOX-1886
 URL: https://issues.apache.org/jira/browse/PDFBOX-1886
 Project: PDFBox
  Issue Type: Bug
  Components: Utilities
Affects Versions: 1.8.4
Reporter: adam brin
 Fix For: 2.1.0

 Attachments: cover_page4818280580458469287.pdf, page1.pdf, 
 santa-cruz-flats-project-part-2 (1).pdf


 We use the PDFMergerUtility to add cover pages to documents automatically. 
 We're finding that when we do so, it strips the OCR data from the source of 
 the merged files.
 {code}
 PDFMergerUtility merger = new PDFMergerUtility();
 // createTempFile needs a prefix and a suffix; these values are placeholders
 File outputFile = File.createTempFile("merged", ".pdf");
 merger.setDestinationStream(new FileOutputStream(outputFile));
 for (File file : files) {
     merger.addSource(file);
 }
 merger.mergeDocuments();
 return outputFile;
 {code}





[jira] [Updated] (PDFBOX-1878) Tags are not being displayed in Adobe Acrobat Tags panel when merging pdfs

2014-12-09 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-1878:
---
Fix Version/s: 2.1.0

 Tags are not being displayed in Adobe Acrobat Tags panel when merging pdfs
 --

 Key: PDFBOX-1878
 URL: https://issues.apache.org/jira/browse/PDFBOX-1878
 Project: PDFBox
  Issue Type: Bug
  Components: Utilities
Affects Versions: 1.8.3, 1.8.4
 Environment: Windows XP SP3
Reporter: Tiuser Lassei
Priority: Minor
 Fix For: 2.1.0

 Attachments: pdf1.3.pdf, pdf1.4.pdf


 The merged PDF output produced by the PDFMergerUtility does not display the 
 tags correctly in the Tags panel of Adobe Acrobat. (Tested in Acrobat Pro XI 
 trial version). Have not tested in another PDF tool that can display tags 
 (not sure if another tool can do this).
 A single blank entry is shown instead of the actual structure tree of the 
 combined source pdfs.
 Though, it seems the reading order (based on the tag structure) is still 
 preserved (based on the testing of adobe reader's read aloud feature).
 Possibly related to fix on tag merging:
 https://issues.apache.org/jira/browse/PDFBOX-1342
 Although the tag merging logic is wrong in 1.8.2 (as only the first page is 
 tagged which was fixed as indicated in PDFBOX-1342), the tags appear 
 correctly in the Tag panel.
 This bug prevents users from modifying the tag structure in Acrobat as the 
 tag entries are missing.





[jira] [Assigned] (PDFBOX-1874) PDFTextStripper.isParagraphSeparation(...)

2014-12-09 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler reassigned PDFBOX-1874:
--

Assignee: Andreas Lehmkühler

 PDFTextStripper.isParagraphSeparation(...)
 --

 Key: PDFBOX-1874
 URL: https://issues.apache.org/jira/browse/PDFBOX-1874
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.3
 Environment: Eclipse
Reporter: Yuri Burrows
Assignee: Andreas Lehmkühler
Priority: Minor
  Labels: patch

 PDFTextStripper.isParagraphSeparation(...) seems to have an issue with how it 
 finds Y text indentation.
 PROBLEM:
 I believe the issue is due to precision in the following logic:
 float yGap = Math.abs(position.getTextPosition().getYDirAdj()-
 lastPosition.getTextPosition().getYDirAdj());
 float xGap = (position.getTextPosition().getXDirAdj()-
 lastLineStartPosition.getTextPosition().getXDirAdj());
 if(yGap > (getDropThreshold()*maxHeightForLine))
 {
 result = true;
 yGap has a precision to 1000th+, while (getDropThreshold()*maxHeightForLine) 
 has a precision to 100,000th. Resulting in the following comparison (example):
 16.018  16.018005
 which evaluates to True. However 16.018  16.018 would evaluate to False.
 EFFECT OF THE PROBLEM:
 every line in the output is marked as isParagraphStart = true and 
 writeParagraphEnd() ... = true.
 I.E. 
 |||NEW_LINE|||
 |||PARAGRAPH_START|||PDFBox has been designed to represent PDF documents 
 using familiar object-oriented paradigms. The data|||NEW_LINE|||
 contained in a PDF document is a collection of basic object types: arrays, 
 booleans, dictionaries, numbers,|||NEW_LINE|||
 |||PARAGRAPH_END||NEW_LINE|||
 |||PARAGRAPH_START|||strings and binary streams. PDFBox captures these basic 
 object types in the org.pdfbox.cos package (the|||NEW_LINE|||
 COS Model). While it's possible to create any desired interactions with a PDF 
 document using only these|||NEW_LINE|||
 |||PARAGRAPH_END||NEW_LINE|||
 In the source PDF these lines appear as such:
 PDFBox has been designed to represent PDF documents using familiar 
 object-oriented paradigms. The data
 contained in a PDF document is a collection of basic object types: arrays, 
 booleans, dictionaries, numbers,
 strings and binary streams. PDFBox captures these basic object types in the 
 org.pdfbox.cos package (the
 COS Model). While it's possible to create any desired interactions with a PDF 
 document using only these
 MY WORKAROUND:
 NOTE: there is a small performance hit with this workaround.
float yGap = Math.abs(position.getTextPosition().getYDirAdj()
- lastPosition.getTextPosition().getYDirAdj());

DecimalFormat df = new DecimalFormat("#.00");
float yGapTruncated = Float.valueOf(df.format(yGap));

float newYVal = Float.valueOf(df.format(getDropThreshold()
* maxHeightForLine));
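
The workaround above rounds both values through DecimalFormat, which builds a string on every call and depends on the default locale (with a decimal-comma locale, Float.valueOf would fail on the formatted text). A possible alternative, sketched here on the assumption that a small fixed tolerance is acceptable, is to compare the two floats with an epsilon. The class name, the method name, and the epsilon value are illustrative and are not taken from PDFBox.

```java
// Illustrative sketch; not the PDFBox implementation.
final class GapComparison {
    private GapComparison() {}

    /**
     * Returns true only if gap is larger than threshold by more than epsilon,
     * so differences in the last float digits do not flip the result.
     */
    static boolean exceeds(float gap, float threshold, float epsilon) {
        return gap - threshold > epsilon;
    }

    public static void main(String[] args) {
        float epsilon = 0.001f; // assumed tolerance, would need tuning
        System.out.println(exceeds(16.018005f, 16.018f, epsilon));
        System.out.println(exceeds(17.5f, 16.018f, epsilon));
    }
}
```

With a tolerance like this, two gaps that differ only beyond the third decimal place are treated as equal, which is the effect the rounding workaround is after.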





[jira] [Updated] (PDFBOX-1807) TextToPDF strips leading spaces from input file

2014-12-09 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-1807:
---
Fix Version/s: 3.0.0

 TextToPDF strips leading spaces from input file
 ---

 Key: PDFBOX-1807
 URL: https://issues.apache.org/jira/browse/PDFBOX-1807
 Project: PDFBox
  Issue Type: Bug
  Components: Utilities
Affects Versions: 1.8.3
 Environment: Win7 64 bit
Reporter: Mark Mitchell
Priority: Minor
 Fix For: 3.0.0


 When using the TextToPDF utility on a text file that has spaces in the front 
 for formatting purposes, the leading spaces on each line are stripped, 
 so the report in the PDF no longer looks like it did in the text file.  
 Was this the intended result?  Is there a way to turn off the stripping of 
 the spaces?  If not, can it be added?





[jira] [Commented] (PDFBOX-1807) TextToPDF strips leading spaces from input file

2014-12-09 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239282#comment-14239282
 ] 

Andreas Lehmkühler commented on PDFBOX-1807:


TextToPDF is a proof of concept and not a real application. So, don't expect too 
much.

 TextToPDF strips leading spaces from input file
 ---

 Key: PDFBOX-1807
 URL: https://issues.apache.org/jira/browse/PDFBOX-1807
 Project: PDFBox
  Issue Type: Bug
  Components: Utilities
Affects Versions: 1.8.3
 Environment: Win7 64 bit
Reporter: Mark Mitchell
Priority: Minor
 Fix For: 3.0.0




[jira] [Updated] (PDFBOX-1242) Handle non ISO-8859-1 chars with drawString

2014-12-09 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-1242:
---
Fix Version/s: 2.0.0

 Handle non ISO-8859-1 chars with drawString
 ---

 Key: PDFBOX-1242
 URL: https://issues.apache.org/jira/browse/PDFBOX-1242
 Project: PDFBox
  Issue Type: Bug
  Components: Writing
Affects Versions: 1.5.0, 1.6.0
Reporter: Peter Andersen
 Fix For: 2.0.0


 PDPageContentStream.drawString takes a String as its argument and constructs a 
 COSString from the input.
 If the input contains chars above 255, the COSString is prefixed with 0xFE, 0xFF 
 and the bytes are taken from the
 input as UTF-16BE encoded.
 Back in the drawString method, this UTF-16 encoded COSString is appended as 
 ISO-8859-1:
   appendRawCommands( new String( buffer.toByteArray(), "ISO-8859-1"));
  
 The result of this is that a line with UTF-16 chars is shown prefixed with þÿ, 
 and with a double space between the other chars.
 The chars above 255 are shown as the two corresponding ISO-8859-1 characters.
 As a side question to this observation, is there an alternative way to use 
 PDFBox to support UTF-16?
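
The effect described in this report can be reproduced with the JDK alone. The sketch below is not PDFBox code: the class and method names are hypothetical, and it only imitates the two steps the reporter describes (writing 0xFE 0xFF followed by UTF-16BE bytes, then reading those bytes back as ISO-8859-1).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo; imitates the byte handling described in the report.
final class Utf16AsLatin1Demo {
    private Utf16AsLatin1Demo() {}

    /** Encodes text as BOM + UTF-16BE, then decodes the bytes as ISO-8859-1. */
    static String misdecode(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        buffer.write(0xFE); // byte order mark, first byte
        buffer.write(0xFF); // byte order mark, second byte
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16BE);
        buffer.write(utf16, 0, utf16.length);
        // Every byte becomes one character, so each UTF-16 code unit turns into two.
        return new String(buffer.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String result = misdecode("A\u0100");
        for (char c : result.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

The two leading characters are U+00FE and U+00FF (þÿ), and every ASCII character is preceded by a U+0000, which is consistent with the extra spacing the reporter sees.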
  
  
  





[jira] [Assigned] (PDFBOX-1151) StreamCorruptedException on bad PDF with -force

2014-12-09 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler reassigned PDFBOX-1151:
--

Assignee: Andreas Lehmkühler

 StreamCorruptedException on bad PDF with -force
 ---

 Key: PDFBOX-1151
 URL: https://issues.apache.org/jira/browse/PDFBOX-1151
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing
Affects Versions: 1.6.0, 1.8.7, 2.0.0
 Environment: Windows Vista
 Sun JDK 1.6.0_26
Reporter: Stas Shaposhnikov
Assignee: Andreas Lehmkühler
 Attachments: PDFStreamEngine.patch, test.pdf


 I am getting the StreamCorruptedException when trying to parse a possibly 
 invalid PDF document even if the -force option is specified.
 Stack trace:
 java.io.StreamCorruptedException: Error: data is null
   at org.apache.pdfbox.filter.LZWFilter.decode(LZWFilter.java:82)
   at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:301)
   at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
   at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
   at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:105)
   at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:264)
   at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
   at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
   at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
   at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
   at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
   at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:256)
   at org.apache.pdfbox.ExtractText.main(ExtractText.java:76)
   at org.apache.pdfbox.PDFBox.main(PDFBox.java:42)
 My suggestion is to skip bad sub-streams without throwing exceptions in 
 PDFStreamEngine.processSubStream() when forceParsing is true.
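
The suggestion could look roughly like the following pattern. This is a hypothetical sketch and not the attached PDFStreamEngine.patch: SubStream and processAll are invented names standing in for the per-sub-stream work, and only the forceParsing flag comes from the report.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

// Hypothetical sketch; SubStream and processAll are not PDFBox API.
final class LenientProcessing {
    private LenientProcessing() {}

    /** Stand-in for the work done on one sub-stream. */
    interface SubStream {
        void process() throws IOException;
    }

    /**
     * Processes every sub-stream. With forceParsing set, a failing sub-stream
     * is skipped and counted; otherwise the first failure is rethrown.
     */
    static int processAll(List<SubStream> streams, boolean forceParsing) {
        int skipped = 0;
        for (SubStream stream : streams) {
            try {
                stream.process();
            } catch (IOException e) {
                if (!forceParsing) {
                    throw new UncheckedIOException(e);
                }
                skipped++; // bad sub-stream ignored, continue with the next one
            }
        }
        return skipped;
    }
}
```

StreamCorruptedException is a subclass of IOException, so the exception from the stack trace would be covered by this catch block. Returning the number of skipped sub-streams is a design choice of the sketch, so that a caller can still report that the output is incomplete.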





[jira] [Updated] (PDFBOX-785) Spliting a PDF creates unnecessarily large files

2014-12-09 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-785:
--
Fix Version/s: 2.0.0

 Spliting a PDF creates unnecessarily large files
 

 Key: PDFBOX-785
 URL: https://issues.apache.org/jira/browse/PDFBOX-785
 Project: PDFBox
  Issue Type: Bug
  Components: Utilities
Affects Versions: 0.8.0-incubator, 1.1.0, 1.2.1
 Environment: Windows XP, openOffice3.0.0, pdfsam
Reporter: mathieu radiguet
Assignee: Andreas Lehmkühler
 Fix For: 2.0.0

 Attachments: fileSizeIssue.zip


 Using PDFBox 0.8.0 (also tried on 1.1.0 and 1.2.1) to split files results in 
 parts that are bigger than the original.
 The files concerned were made from OpenOffice .odt documents in version 3.0.0 
 using the OpenOffice PDF export and then merging several copies with pdfsam 
 (http://www.pdfsam.org/)
 In the attached Eclipse project the test file size is 10 712 749 bytes for 
 2812 pages, and the result files' sizes after splitting in two at page 2300 are 
 8 812 515 bytes and 10 701 142 bytes.
 Using pdfSplit on the command line, every single result file is 
 bigger than the original. An example is also attached. An error says the 
 original file is corrupted, but we tried it on a file (using pdfsam and 
 without using it) with no error and with a similar result, so I think it's not 
 related. 
 This issue seems similar to: JIRA PDFBOX-28 
 (https://issues.apache.org/jira/browse/PDFBOX-28)





[jira] [Reopened] (PDFBOX-2548) Problems with character extraction (fi ligature)

2014-12-09 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Bösinger reopened PDFBOX-2548:
---

 Problems with character extraction (fi ligature)
 

 Key: PDFBOX-2548
 URL: https://issues.apache.org/jira/browse/PDFBOX-2548
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.7
 Environment: Windows7Professional JavaSE8 EclipseKepler
Reporter: Matthias Bösinger
Priority: Minor
 Attachments: preflight.png, test.pdf, test2.pdf


 I have a PDF document whose font type is OpenType (Garamond OpenType). So the 
 PDFBox text extraction can also extract special characters (for example small 
 capital letters), which caused problems when the underlying font was a 
 simple Type1 font.
 However, the text extraction now causes another type of problem. In my case, 
 when the character sequences fi or fl occur in the text, 
 PDFTextStripper#getText(PDDocument doc) extracts them as single characters: 
 'ﬁ' and 'ﬂ' and sets a space character on their right side.
 (Surprisingly, if I access the list of characters of a page via the 
 charactersByArticle field of PDFTextStripper / via the 
 PDFTextStripper#processText(TextPosition pos) method, the same characters 
 show up as 'normal-single' characters f i / f l).
 My assumption is that the advantage of the underlying OpenType font turns 
 into this particular disadvantage, because the PDFTextStripper recognizes the 
 character sequence f i / f l as special characters ﬁ / ﬂ (which might have to 
 do with the fact that the getText() method calculates things like whitespace 
 characters by distances / positional placements).
 Background: The given document is a dictionary text with very densely printed 
 text.
 See this link for code and output:
 http://stackoverflow.com/questions/27333499/problems-with-extracting-opentypefont-text-using-pdfbox
 My question: is there anything I can do to avoid this problem?
 Thanks in advance ...
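
One possible post-processing step, which is not proposed anywhere in this thread, is to expand ligature characters in the extracted string with the JDK's Unicode normalizer. The sketch below assumes the extracted text really contains the single ligature code points U+FB01 and U+FB02; the class and method names are made up for illustration. It does not remove the extra space reported after the ligature.

```java
import java.text.Normalizer;

// Illustrative sketch; operates on already extracted text, not on PDFBox internals.
final class LigatureExpansion {
    private LigatureExpansion() {}

    /**
     * Applies compatibility normalization (NFKC), which replaces ligature
     * code points such as U+FB01 with their letter sequences.
     */
    static String expand(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        System.out.println(expand("\uFB01nd the \uFB02ow"));
    }
}
```

NFKC also rewrites other compatibility characters (for example full-width forms and superscript digits), so it may change more than the ligatures; whether that is acceptable depends on the document.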





[jira] [Updated] (PDFBOX-2548) Problems with character extraction (fi ligature)

2014-12-09 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Bösinger updated PDFBOX-2548:
--
Attachment: (was: test.pdf)

 Problems with character extraction (fi ligature)
 

 Key: PDFBOX-2548
 URL: https://issues.apache.org/jira/browse/PDFBOX-2548
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.7
 Environment: Windows7Professional JavaSE8 EclipseKepler
Reporter: Matthias Bösinger
Priority: Minor
 Attachments: preflight.png




[jira] [Closed] (PDFBOX-2548) Problems with character extraction (fi ligature)

2014-12-09 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Bösinger closed PDFBOX-2548.
-
Resolution: Not a Problem

 Problems with character extraction (fi ligature)
 

 Key: PDFBOX-2548
 URL: https://issues.apache.org/jira/browse/PDFBOX-2548
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.7
 Environment: Windows7Professional JavaSE8 EclipseKepler
Reporter: Matthias Bösinger
Priority: Minor
 Attachments: preflight.png




[jira] [Updated] (PDFBOX-2548) Problems with character extraction (fi ligature)

2014-12-09 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Bösinger updated PDFBOX-2548:
--
Attachment: (was: test2.pdf)

 Problems with character extraction (fi ligature)
 

 Key: PDFBOX-2548
 URL: https://issues.apache.org/jira/browse/PDFBOX-2548
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.7
 Environment: Windows7Professional JavaSE8 EclipseKepler
Reporter: Matthias Bösinger
Priority: Minor
 Attachments: preflight.png




[jira] [Created] (PDFBOX-2551) Wrong barcode printing for embedded font

2014-12-09 Thread Andriy (JIRA)
Andriy created PDFBOX-2551:
--

 Summary: Wrong barcode printing for embedded font
 Key: PDFBOX-2551
 URL: https://issues.apache.org/jira/browse/PDFBOX-2551
 Project: PDFBox
  Issue Type: Bug
  Components: Rendering
Affects Versions: 1.8.7
Reporter: Andriy
 Fix For: 1.8.8
 Attachments: barcode_printing_problem.pdf

Couldn't print a file with an embedded Code 128 font.  Code for printing:

PDDocument document = load(new FileInputStream("barcode_printing_problem.pdf"));
PrinterJob printJob = getPrinterJob();
printJob.setPrintService(getPrinter(MY_PRINTER));
document.silentPrint(printJob);





[jira] [Updated] (PDFBOX-2551) Wrong barcode printing for embedded font

2014-12-09 Thread Andriy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andriy updated PDFBOX-2551:
---
Attachment: barcode_printing_problem.pdf

Input pdf file

 Wrong barcode printing for embedded font
 

 Key: PDFBOX-2551
 URL: https://issues.apache.org/jira/browse/PDFBOX-2551
 Project: PDFBox
  Issue Type: Bug
  Components: Rendering
Affects Versions: 1.8.7
Reporter: Andriy
 Fix For: 1.8.8

 Attachments: barcode_printing_problem.pdf




[jira] [Updated] (PDFBOX-2551) Wrong barcode printing for embedded font

2014-12-09 Thread Andriy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andriy updated PDFBOX-2551:
---
Attachment: print_result.pdf

after printing

 Wrong barcode printing for embedded font
 

 Key: PDFBOX-2551
 URL: https://issues.apache.org/jira/browse/PDFBOX-2551
 Project: PDFBox
  Issue Type: Bug
  Components: Rendering
Affects Versions: 1.8.7
Reporter: Andriy
 Fix For: 1.8.8

 Attachments: barcode_printing_problem.pdf, print_result.pdf




[jira] [Comment Edited] (PDFBOX-2551) Wrong barcode printing for embedded font

2014-12-09 Thread Andriy (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239493#comment-14239493
 ] 

Andriy edited comment on PDFBOX-2551 at 12/9/14 2:59 PM:
-

after printing: print_result.pdf


was (Author: andriy.brez):
after printing

 Wrong barcode printing for embedded font
 

 Key: PDFBOX-2551
 URL: https://issues.apache.org/jira/browse/PDFBOX-2551
 Project: PDFBox
  Issue Type: Bug
  Components: Rendering
Affects Versions: 1.8.7
Reporter: Andriy
 Fix For: 1.8.8

 Attachments: barcode_printing_problem.pdf, print_result.pdf




[jira] [Comment Edited] (PDFBOX-2551) Wrong barcode printing for embedded font

2014-12-09 Thread Andriy (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239492#comment-14239492
 ] 

Andriy edited comment on PDFBOX-2551 at 12/9/14 2:59 PM:
-

Input pdf file barcode_printing_problem.pdf


was (Author: andriy.brez):
Input pdf file

 Wrong barcode printing for embedded font
 

 Key: PDFBOX-2551
 URL: https://issues.apache.org/jira/browse/PDFBOX-2551
 Project: PDFBox
  Issue Type: Bug
  Components: Rendering
Affects Versions: 1.8.7
Reporter: Andriy
 Fix For: 1.8.8

 Attachments: barcode_printing_problem.pdf, print_result.pdf




[jira] [Commented] (PDFBOX-2551) Wrong barcode printing for embedded font

2014-12-09 Thread Andriy (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239497#comment-14239497
 ] 

Andriy commented on PDFBOX-2551:


Could this issue depend on the text encoding?

 Wrong barcode printing for embedded font
 

 Key: PDFBOX-2551
 URL: https://issues.apache.org/jira/browse/PDFBOX-2551
 Project: PDFBox
  Issue Type: Bug
  Components: Rendering
Affects Versions: 1.8.7
Reporter: Andriy
 Fix For: 1.8.8

 Attachments: barcode_printing_problem.pdf, print_result.pdf




RE: preflight mass tests

2014-12-09 Thread Allison, Timothy B.
Tilman,
  This is fantastic!  If you send me an example of the code you used to call 
preflight (#parse() or  #parse(Format format)???), I'd like to run it within 
tika-batch to see what our batch performance is.
  Ideally, once we can turn our public vm on, it would be fun to run these 
tests there.
  

 Best,

Tim

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Friday, December 05, 2014 2:45 PM
To: dev@pdfbox.apache.org
Subject: Re: preflight mass tests

Some numbers... it took 4-5 days

total: 231223, failed: 142, percentage failed: 0.06141257472336292

Of these, one can subtract 33 OutOfMemoryErrors that happened near the 
end of the test. Isolated runs went fine.

about the rest:
18 are the isSymbol stackoverflow
9 are the getFontMatrix NPE
33 are the root must be of type Pages errors

The rest is mostly related to very broken PDF files.
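
As a quick cross-check of the numbers above, the failure rate can be recomputed with and without the 33 OutOfMemoryErrors. This is only a sketch: the class and method names are made up for illustration, and the counts are copied from the message.

```java
// Illustrative only: FailureRate and percentage() are made-up names, not PDFBox code.
class FailureRate
{
    // Returns failed / total expressed as a percentage.
    static double percentage(int failed, int total)
    {
        return failed * 100.0 / total;
    }

    public static void main(String[] args)
    {
        int total = 231223;   // files processed
        int failed = 142;     // failures other than ValidationException
        int oom = 33;         // OutOfMemoryErrors that went fine in isolated runs

        System.out.println("all failures (%): " + percentage(failed, total));
        System.out.println("without OOM (%):  " + percentage(failed - oom, total));
    }
}
```

Leaving out the out-of-memory cases gives 109 failures, which is just under 0.05% of the files.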

Tilman


On 04.12.2014 at 14:55, Maruan Sahyoun wrote:
 Hi Tilman,

 that's very good news. I trust a lot of time went into reviewing the test 
 results. Without your and Tim's efforts this achievement wouldn't have been 
 possible.

 BR

 Maruan

 On 03.12.2014 at 21:04, Tilman Hausherr thaush...@t-online.de wrote:

 I've now run preflight on half of the govdocs files. Every issue I have 
 opened on preflight is related to that test. The failure rate (exceptions 
 other than the allowed ValidationExceptions) is down from 1% when I 
 started to 0.05% now. Most of the frequent exceptions (e.g. the one with 
 NonTerminalField) have been fixed. What's left now are exceptions related to 
 messy files, and some of the font-related issues.

 Tilman

 On 03.11.2014 at 22:58, Tilman Hausherr wrote:
 On 03.11.2014 at 19:00, Tilman Hausherr wrote:
 It is not looking good, there is at least one NPE issue coming.
 No more NPE after solving the two issues I opened today except 
 PDFBOX-1743.pdf which is a known problem.

 Coming up soon: run preflight on the 231227 PDF files from digitalcorpora 
 to see what happens.

 Tilman





[jira] [Updated] (PDFBOX-2551) Wrong barcode printing for embedded font

2014-12-09 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-2551:
---
Fix Version/s: (was: 1.8.8)

 Wrong barcode printing for embedded font
 

 Key: PDFBOX-2551
 URL: https://issues.apache.org/jira/browse/PDFBOX-2551
 Project: PDFBox
  Issue Type: Bug
  Components: Rendering
Affects Versions: 1.8.7
Reporter: Andriy
 Attachments: barcode_printing_problem.pdf, print_result.pdf


 Couldn't print a file with the embedded font Code 128. Code for printing:
 PDDocument document = PDDocument.load(new 
 FileInputStream("barcode_printing_problem.pdf"));
 PrinterJob printJob = PrinterJob.getPrinterJob();
 printJob.setPrintService(getPrinter(MY_PRINTER));
 document.silentPrint(printJob);





Re: preflight mass tests

2014-12-09 Thread Tilman Hausherr
Here's the code... it assumes that all PDFs are flat in one single 
directory. Libraries needed: preflight-app, jai_imageio, 
levigo_jbig2-imageio-1.6.1.jar. I have run it only with the trunk, not 
with 1.8, because we didn't fix all problems there.

Tilman

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FilenameFilter;
import java.io.PrintWriter;
import org.apache.pdfbox.preflight.PreflightDocument;
import org.apache.pdfbox.preflight.exception.ValidationException;
import org.apache.pdfbox.preflight.parser.PreflightParser;

/**
 *
 * @author Tilman Hausherr
 */
public class PreflightTest
{
    public static void main(String[] args) throws FileNotFoundException
    {
        File dir;
        if (args.length > 0)
        {
            dir = new File(args[0]);
        }
        else
        {
            dir = new File("k:\\dc");
        }

        int total = 0;
        int failed = 0;
        File[] dirList = dir.listFiles(new FilenameFilter()
        {
            @Override
            public boolean accept(File dir, String name)
            {
                // use this to start at a certain file
                if (name.compareTo("00.pdf") <= 0)
                {
                    return false;
                }
                return name.toLowerCase().endsWith(".pdf");
            }
        });
        for (File pdf : dirList)
        {
            ++total;
            System.out.println(pdf.getName());
            // just test that it doesn't crash
            try
            {
                new File(pdf.getName() + "-exception.txt").delete();
                PreflightParser parser = new PreflightParser(pdf);
                parser.parse();
                try (PreflightDocument preflightDocument = parser.getPreflightDocument())
                {
                    preflightDocument.validate();
                    preflightDocument.getResult();
                }
                parser.clearResources();
            }
            catch (ValidationException e)
            {
                // validation failures are allowed and not counted
            }
            catch (Throwable e)
            {
                ++failed;
                // keep the stack trace next to the run for later review
                try (PrintWriter pw = new PrintWriter(new File(pdf.getName() + "-exception.txt")))
                {
                    e.printStackTrace(pw);
                }
                System.out.flush();
                System.err.flush();
                System.err.print(pdf.getName() + " preflight fail: ");
                e.printStackTrace();
                System.out.flush();
                System.err.flush();
            }
            System.out.println("total: " + total + ", failed: " + failed
                    + ", percentage failed: " + (((float) failed) / total * 100.0) + "%");
        }
    }
}



svn: E185004: Unexpected end of svndiff input

2014-12-09 Thread Andreas Lehmkuehler

Hi,

I've got the following error when I try to commit the PDFBox release candidate 
to the dist repo [1]


svn: E185004: Unexpected end of svndiff input

The issue seems to be related to big files only, as I was able to commit the 
smaller files < 1Mb.


The email notification doesn't work either.

Can someone please have a look, as I'm in the middle of the release process.

Thanks in advance
Andreas Lehmkühler

[1] https://dist.apache.org/repos/dist/dev/pdfbox/1.8.8/


[jira] [Commented] (PDFBOX-1886) Merge Function strips OCR layer in acrobat

2014-12-09 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239855#comment-14239855
 ] 

Tilman Hausherr commented on PDFBOX-1886:
-

Why do you think that the OCR is missing? I can copy & paste text from the 
santa-cruz-flats-project-part-2 (1).pdf file.

 Merge Function strips OCR layer in acrobat
 --

 Key: PDFBOX-1886
 URL: https://issues.apache.org/jira/browse/PDFBOX-1886
 Project: PDFBox
  Issue Type: Bug
  Components: Utilities
Affects Versions: 1.8.4
Reporter: adam brin
 Fix For: 2.1.0

 Attachments: cover_page4818280580458469287.pdf, page1.pdf, 
 santa-cruz-flats-project-part-2 (1).pdf


 We use the PDFMergerUtility to add cover pages to documents automatically. 
 We're finding that when we do so, it strips the OCR data from the source of 
 the merged files.
 {code}
 PDFMergerUtility merger = new PDFMergerUtility();
 File outputFile = File.createTempFile();
 merger.setDestinationStream(new FileOutputStream(outputFile));
 for (File file : files) {
 merger.addSource(file);
 }
 merger.mergeDocuments();
 return outputFile;
 {code}





[jira] [Commented] (PDFBOX-2505) ArrayIndexOutOfBoundsException in PDColor constructor

2014-12-09 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239866#comment-14239866
 ] 

ASF subversion and git services commented on PDFBOX-2505:
-

Commit 1644156 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1644156 ]

PDFBOX-2505: fix parameter validation from previous commit

 ArrayIndexOutOfBoundsException in PDColor constructor
 -

 Key: PDFBOX-2505
 URL: https://issues.apache.org/jira/browse/PDFBOX-2505
 Project: PDFBox
  Issue Type: Bug
  Components: Rendering
Affects Versions: 2.0.0
Reporter: Tilman Hausherr
 Fix For: 2.0.0

 Attachments: PDFBOX-2505-032618-p96.pdf


 {code}
 Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1
 at java.util.ArrayList.elementData(Unknown Source)
 at java.util.ArrayList.get(Unknown Source)
 at org.apache.pdfbox.cos.COSArray.get(COSArray.java:210)
 at org.apache.pdfbox.pdmodel.graphics.color.PDColor.<init>(PDColor.java:54)
 at org.apache.pdfbox.contentstream.operator.color.SetColor.process(SetColor.java:41)
 at org.apache.pdfbox.contentstream.operator.color.SetNonStrokingDeviceCMYKColor.process(SetNonStrokingDeviceCMYKColor.java:38)
 {code}
 The attached file has a k without arguments.
 This is only in 2.0, not in 1.8. In 1.8 SetNonStrokingCMYKColor initializes 
 the array with size 4 (ok, it will crash if there are 5 arguments), in 2.0 
 SetNonStrokingDeviceCMYKColor / SetColor take what is there.
 Two possible solutions in SetColor:
 1) initialize components with the initial colors of the colorspace
 2) initialize components with empty array
 Both solutions get rid of the exception. Solution 2 is used in another 
 constructor.
 Which one is better? (I'd prefer solution 1 because it has the correct array 
 size and would also change the other constructor)
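
Solution 1 could look roughly like the following standalone sketch. It does not use the real PDFBox classes: ColorOperands and withDefaults are hypothetical names, and the caller is assumed to pass in the colour space's initial colour (for DeviceCMYK that is 0 0 0 1).

```java
import java.util.Arrays;

// Hypothetical helper, not PDFBox API: fits the operands of a colour
// operator to the component count of the colour space.
class ColorOperands
{
    static float[] withDefaults(float[] operands, float[] initialColor)
    {
        // start from the initial colour so the array always has the right size
        float[] components = Arrays.copyOf(initialColor, initialColor.length);
        // overwrite with as many operands as were supplied; surplus ones are ignored
        int n = Math.min(operands.length, components.length);
        System.arraycopy(operands, 0, components, 0, n);
        return components;
    }

    public static void main(String[] args)
    {
        float[] cmykInitial = {0f, 0f, 0f, 1f};   // DeviceCMYK initial colour (black)

        // a "k" operator without arguments, as in the attached file
        System.out.println(Arrays.toString(withDefaults(new float[0], cmykInitial)));

        // five operands: the surplus one is dropped instead of causing a crash
        float[] tooMany = {1f, 0f, 0f, 0f, 0.5f};
        System.out.println(Arrays.toString(withDefaults(tooMany, cmykInitial)));
    }
}
```

Whether silently ignoring surplus operands is the right behaviour is a separate design question; the sketch only shows that the array size stays correct in both cases.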





Re: preflight mass tests

2014-12-09 Thread Tilman Hausherr

Since you replied to the list, I'll answer here too:
I don't know, I didn't try to display the files that failed.

Tilman

On 09.12.2014 at 10:59, Maruan Sahyoun wrote:

Hello Tilman,

do you have a rough estimate of what proportion of the files is, e.g. in Adobe Reader, 
either not displayed, displayed with a dialog, or displayed incorrectly?

Kind regards
  
Maruan Sahyoun


FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahy...@fileaffairs.de
www.fileaffairs.de

Managing Director: Maruan Sahyoun
Commercial register: AG Düsseldorf, HRB 53837
VAT ID: DE248275827


Build failed in Jenkins: PDFBox-trunk » PDFBox parent #1511

2014-12-09 Thread Apache Jenkins Server
See 
https://builds.apache.org/job/PDFBox-trunk/org.apache.pdfbox$pdfbox-parent/1511/

--
maven3-agent.jar already up to date
maven3-interceptor.jar already up to date
maven3-interceptor-commons.jar already up to date
===[JENKINS REMOTING CAPACITY]===   channel started
Executing Maven:  -B -f 
/home/jenkins/jenkins-slave/workspace/PDFBox-trunk/trunk/pom.xml 
-Dmaven.repo.local=/home/jenkins/jenkins-slave/maven-repositories/1 clean 
deploy -Ppedantic
[INFO] Scanning for projects...
[INFO] 
[INFO] Reactor Build Order:
[INFO] 
[INFO] PDFBox parent
[INFO] Apache FontBox
[INFO] Apache XmpBox
[INFO] Apache PDFBox
[INFO] Apache Preflight
[INFO] Apache Preflight application
[INFO] Apache PDFBox tools
[INFO] Apache PDFBox application
[INFO] Apache PDFBox examples
[INFO] PDFBox reactor
[INFO] 
[INFO] 
[INFO] Building PDFBox parent 2.0.0-SNAPSHOT
[INFO] 
[INFO] 
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ pdfbox-parent ---
[TASKS] Scanning folder 
'https://builds.apache.org/job/PDFBox-trunk/org.apache.pdfbox$pdfbox-parent/ws/'
 for files matching the pattern '**/*.java' - excludes: 
[TASKS] Found 0 files to scan for tasks
Found 0 open tasks.
[TASKS] Computing warning deltas based on reference build #1510
log4j:WARN No appenders could be found for logger 
(org.apache.commons.beanutils.converters.BooleanConverter).
log4j:WARN Please initialize the log4j system properly.
[INFO] 
[INFO] --- maven-remote-resources-plugin:1.5:process (default) @ pdfbox-parent 
---
[INFO] 
[INFO] --- maven-site-plugin:3.3:attach-descriptor (attach-descriptor) @ 
pdfbox-parent ---
[INFO] 
[INFO] --- apache-rat-plugin:0.10:check (default) @ pdfbox-parent ---
[INFO] 51 implicit excludes (use -debug for more details).
[INFO] Exclude: release.properties
[INFO] 1 resources included (use -debug for more details)
[INFO] Rat check: Summary of files. Unapproved: 0 unknown: 0 generated: 0 
approved: 1 licence.
[INFO] 
[INFO] --- maven-install-plugin:2.5.1:install (default-install) @ pdfbox-parent 
---
[INFO] Installing 
https://builds.apache.org/job/PDFBox-trunk/org.apache.pdfbox$pdfbox-parent/ws/pom.xml
 to 
/home/jenkins/jenkins-slave/maven-repositories/1/org/apache/pdfbox/pdfbox-parent/2.0.0-SNAPSHOT/pdfbox-parent-2.0.0-SNAPSHOT.pom
[INFO] 
[INFO] --- maven-deploy-plugin:2.8.1:deploy (default-deploy) @ pdfbox-parent ---
Downloading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-parent/2.0.0-SNAPSHOT/maven-metadata.xml
[WARNING] Could not transfer metadata 
org.apache.pdfbox:pdfbox-parent:2.0.0-SNAPSHOT/maven-metadata.xml from/to 
apache.snapshots.https 
(https://repository.apache.org/content/repositories/snapshots): 
repository.apache.org


Build failed in Jenkins: PDFBox-trunk #1511

2014-12-09 Thread Apache Jenkins Server
See https://builds.apache.org/job/PDFBox-trunk/1511/changes

Changes:

[tilman] PDFBOX-2505: fix parameter validation from previous commit

--
Started by an SCM change
Building remotely on ubuntu-4 (docker Ubuntu ubuntu4 ubuntu) in workspace 
https://builds.apache.org/job/PDFBox-trunk/ws/
Cleaning up https://builds.apache.org/job/PDFBox-trunk/ws/trunk
Deleting https://builds.apache.org/job/PDFBox-trunk/ws/trunk/target
Deleting https://builds.apache.org/job/PDFBox-trunk/ws/trunk/app/target
Deleting 
https://builds.apache.org/job/PDFBox-trunk/ws/trunk/preflight-app/target
Deleting https://builds.apache.org/job/PDFBox-trunk/ws/trunk/examples/target
Deleting https://builds.apache.org/job/PDFBox-trunk/ws/trunk/parent/target
Deleting https://builds.apache.org/job/PDFBox-trunk/ws/trunk/xmpbox/target
Deleting https://builds.apache.org/job/PDFBox-trunk/ws/trunk/preflight/target
Deleting https://builds.apache.org/job/PDFBox-trunk/ws/trunk/pdfbox/target
Deleting https://builds.apache.org/job/PDFBox-trunk/ws/trunk/tools/target
Deleting https://builds.apache.org/job/PDFBox-trunk/ws/trunk/fontbox/target
Updating http://svn.apache.org/repos/asf/pdfbox/trunk at revision 
'2014-12-09T20:58:15.682 +'
U 
pdfbox/src/main/java/org/apache/pdfbox/contentstream/operator/state/SetLineWidth.java
At revision 1644175
Parsing POMs
maven3-agent.jar already up to date
maven3-interceptor.jar already up to date
maven3-interceptor-commons.jar already up to date
[trunk] $ /home/jenkins/tools/java/jdk1.6.0_20-32-unlimited-security/bin/java 
-Xmx1g -XX:MaxPermSize=300m -cp 
/home/jenkins/jenkins-slave/maven3-agent.jar:/home/jenkins/jenkins-slave/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.0.5/boot/plexus-classworlds-2.4.jar
 org.jvnet.hudson.maven3.agent.Maven3Main 
/home/jenkins/jenkins-slave/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.0.5
 /home/jenkins/jenkins-slave/slave.jar 
/home/jenkins/jenkins-slave/maven3-interceptor.jar 
/home/jenkins/jenkins-slave/maven3-interceptor-commons.jar 37561
===[JENKINS REMOTING CAPACITY]===   channel started
Executing Maven:  -B -f 
https://builds.apache.org/job/PDFBox-trunk/ws/trunk/pom.xml 
-Dmaven.repo.local=/home/jenkins/jenkins-slave/maven-repositories/1 clean 
deploy -Ppedantic
[INFO] Scanning for projects...
[INFO] 
[INFO] Reactor Build Order:
[INFO] 
[INFO] PDFBox parent
[INFO] Apache FontBox
[INFO] Apache XmpBox
[INFO] Apache PDFBox
[INFO] Apache Preflight
[INFO] Apache Preflight application
[INFO] Apache PDFBox tools
[INFO] Apache PDFBox application
[INFO] Apache PDFBox examples
[INFO] PDFBox reactor
[INFO] 
[INFO] 
[INFO] Building PDFBox parent 2.0.0-SNAPSHOT
[INFO] 
[INFO] 
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ pdfbox-parent ---
[TASKS] Scanning folder 
'https://builds.apache.org/job/PDFBox-trunk/ws/trunk/parent' for files 
matching the pattern '**/*.java' - excludes: 
[TASKS] Found 0 files to scan for tasks
Found 0 open tasks.
[TASKS] Computing warning deltas based on reference build #1510
log4j:WARN No appenders could be found for logger 
(org.apache.commons.beanutils.converters.BooleanConverter).
log4j:WARN Please initialize the log4j system properly.
[INFO] 
[INFO] --- maven-remote-resources-plugin:1.5:process (default) @ pdfbox-parent 
---
[INFO] 
[INFO] --- maven-site-plugin:3.3:attach-descriptor (attach-descriptor) @ 
pdfbox-parent ---
[INFO] 
[INFO] --- apache-rat-plugin:0.10:check (default) @ pdfbox-parent ---
[INFO] 51 implicit excludes (use -debug for more details).
[INFO] Exclude: release.properties
[INFO] 1 resources included (use -debug for more details)
[INFO] Rat check: Summary of files. Unapproved: 0 unknown: 0 generated: 0 
approved: 1 licence.
[INFO] 
[INFO] --- maven-install-plugin:2.5.1:install (default-install) @ pdfbox-parent 
---
[INFO] Installing 
https://builds.apache.org/job/PDFBox-trunk/ws/trunk/parent/pom.xml to 
/home/jenkins/jenkins-slave/maven-repositories/1/org/apache/pdfbox/pdfbox-parent/2.0.0-SNAPSHOT/pdfbox-parent-2.0.0-SNAPSHOT.pom
[INFO] 
[INFO] --- maven-deploy-plugin:2.8.1:deploy (default-deploy) @ pdfbox-parent ---
Downloading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-parent/2.0.0-SNAPSHOT/maven-metadata.xml
[WARNING] Could not transfer metadata 
org.apache.pdfbox:pdfbox-parent:2.0.0-SNAPSHOT/maven-metadata.xml from/to 
apache.snapshots.https 
(https://repository.apache.org/content/repositories/snapshots): 
repository.apache.org
[INFO] 
[INFO] Reactor Summary:
[INFO] 
[INFO] PDFBox parent . FAILURE [1:00.832s]
[INFO] Apache FontBox 

[jira] [Assigned] (PDFBOX-1242) Handle non ISO-8859-1 chars with drawString

2014-12-09 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson reassigned PDFBOX-1242:
---

Assignee: John Hewson

 Handle non ISO-8859-1 chars with drawString
 ---

 Key: PDFBOX-1242
 URL: https://issues.apache.org/jira/browse/PDFBOX-1242
 Project: PDFBox
  Issue Type: Bug
  Components: Writing
Affects Versions: 1.5.0, 1.6.0
Reporter: Peter Andersen
Assignee: John Hewson
 Fix For: 2.0.0


 PDPageContentStream.drawString takes a String as argument and constructs a 
 COSString from the input.
 If the input contains chars above 255, the COSString is prefixed with 0xFE, 0xFF 
 and the bytes are taken from the
 input encoded as UTF-16BE.
 Back in the drawString method, this UTF-16 encoded COSString is appended as 
 ISO-8859-1:
   appendRawCommands( new String( buffer.toByteArray(), "ISO-8859-1"));
  
 The result is that a line with UTF-16 chars is shown prefixed with þÿ, 
 and with a double space between the other chars.
 The chars above 255 are shown as the two corresponding ISO-8859-1 characters.
 As a side question to this observation, is there an alternative way to use 
 PDFBox that supports UTF-16?
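
The effect described above can be reproduced without PDFBox. The following is a minimal sketch using only the JDK; Utf16AsLatin1 and misdecode are made-up names, and the byte handling is a simplified stand-in for what the description attributes to COSString and drawString.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Standalone illustration; Utf16AsLatin1 and misdecode() are made-up names.
class Utf16AsLatin1
{
    // Writes BOM + UTF-16BE bytes, then (wrongly) reads them back as ISO-8859-1.
    static String misdecode(String text)
    {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        buffer.write(0xFE);
        buffer.write(0xFF);
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16BE);
        buffer.write(utf16, 0, utf16.length);
        return new String(buffer.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args)
    {
        String wrong = misdecode("A\u03B1");   // 'A' followed by Greek alpha (U+03B1)
        for (char c : wrong.toCharArray())
        {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

The result starts with U+00FE and U+00FF (þÿ), the ASCII letter is preceded by a U+0000 character, and the alpha is split into the two characters U+0003 and U+00B1, which matches the symptoms in the report.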
  
  
  





[jira] [Commented] (PDFBOX-1242) Handle non ISO-8859-1 chars with drawString

2014-12-09 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240161#comment-14240161
 ] 

John Hewson commented on PDFBOX-1242:
-

Yes, this is a bug in PDFBox, but it's one we know about already. What does the 
code you've posted do?

Please use file attachments to post code, JIRA uses markup so your code is 
unusable when posted in this manner. You can delete the previous comment and 
attach it as a file with More > Attach Files.

 Handle non ISO-8859-1 chars with drawString
 ---

 Key: PDFBOX-1242
 URL: https://issues.apache.org/jira/browse/PDFBOX-1242
 Project: PDFBox
  Issue Type: Bug
  Components: Writing
Affects Versions: 1.5.0, 1.6.0
Reporter: Peter Andersen
Assignee: John Hewson
 Fix For: 2.0.0


 PDPageContentStream.drawString takes a String as argument and constructs a 
 COSString from the input.
 If the input contains chars above 255, the COSString is prefixed with 0xFE, 0xFF 
 and the bytes are taken from the
 input encoded as UTF-16BE.
 Back in the drawString method, this UTF-16 encoded COSString is appended as 
 ISO-8859-1:
   appendRawCommands( new String( buffer.toByteArray(), "ISO-8859-1"));
  
 The result is that a line with UTF-16 chars is shown prefixed with þÿ, 
 and with a double space between the other chars.
 The chars above 255 are shown as the two corresponding ISO-8859-1 characters.
 As a side question to this observation, is there an alternative way to use 
 PDFBox that supports UTF-16?
  
  
  





[jira] [Comment Edited] (PDFBOX-1242) Handle non ISO-8859-1 chars with drawString

2014-12-09 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240161#comment-14240161
 ] 

John Hewson edited comment on PDFBOX-1242 at 12/9/14 10:21 PM:
---

Yes, this is a bug in PDFBox, but it's one we know about already. What does the 
code you've posted do?

Please use file attachments to post code, JIRA uses markup so your code is 
unusable when posted in this manner. You can delete the previous comment and 
attach it as a file with More > Attach Files.


was (Author: jahewson):
Yes, this is a bug in PDFBox, but it's one we know about already. What does the 
code you've posted do?

Please use file attachments to post code, JIRA uses markup so your code is 
unusable when posted In this manner. You can delete the previous comment an 
attach it as a fils with More > Attach Files.

 Handle non ISO-8859-1 chars with drawString
 ---

 Key: PDFBOX-1242
 URL: https://issues.apache.org/jira/browse/PDFBOX-1242
 Project: PDFBox
  Issue Type: Bug
  Components: Writing
Affects Versions: 1.5.0, 1.6.0
Reporter: Peter Andersen
Assignee: John Hewson
 Fix For: 2.0.0


 PDPageContentStream.drawString takes a String as argument and constructs a 
 COSString from the input.
 If the input contains chars above 255, the COSString is prefixed with 0xFE, 0xFF 
 and the bytes are taken from the
 input encoded as UTF-16BE.
 Back in the drawString method, this UTF-16 encoded COSString is appended as 
 ISO-8859-1:
   appendRawCommands( new String( buffer.toByteArray(), "ISO-8859-1"));
  
 The result is that a line with UTF-16 chars is shown prefixed with þÿ, 
 and with a double space between the other chars.
 The chars above 255 are shown as the two corresponding ISO-8859-1 characters.
 As a side question to this observation, is there an alternative way to use 
 PDFBox that supports UTF-16?
  
  
  





[jira] [Commented] (PDFBOX-1242) Handle non ISO-8859-1 chars with drawString

2014-12-09 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240184#comment-14240184
 ] 

John Hewson commented on PDFBOX-1242:
-

This is not a duplicate of PDFBOX-922, however both need to be fixed before 
Unicode text will work. 

 Handle non ISO-8859-1 chars with drawString
 ---

 Key: PDFBOX-1242
 URL: https://issues.apache.org/jira/browse/PDFBOX-1242
 Project: PDFBox
  Issue Type: Bug
  Components: Writing
Affects Versions: 1.5.0, 1.6.0
Reporter: Peter Andersen
Assignee: John Hewson
 Fix For: 2.0.0


 PDPageContentStream.drawString takes a String as argument and constructs a 
 COSString from the input.
 If the input contains chars above 255, the COSString is prefixed with 0xFE, 0xFF 
 and the bytes are taken from the
 input encoded as UTF-16BE.
 Back in the drawString method, this UTF-16 encoded COSString is appended as 
 ISO-8859-1:
   appendRawCommands( new String( buffer.toByteArray(), "ISO-8859-1"));
  
 The result is that a line with UTF-16 chars is shown prefixed with þÿ, 
 and with a double space between the other chars.
 The chars above 255 are shown as the two corresponding ISO-8859-1 characters.
 As a side question to this observation, is there an alternative way to use 
 PDFBox that supports UTF-16?
  
  
  





[jira] [Comment Edited] (PDFBOX-922) True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)

2014-12-09 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031695#comment-14031695
 ] 

John Hewson edited comment on PDFBOX-922 at 12/9/14 11:09 PM:
--

{quote}
drawString() in PDPageContentStream just writes the text into PDF as any 
COSString would choose to represent it. This is not the right thing to do. When 
the font is a CID keyed font, every glyph is 16 bit wide by definition, and 
COSString won't necessarily notice and write it correctly.
{quote}

Not quite: every CID can be up to 16-bits wide, but many (or for < 256 glyphs, 
all) will fit inside 8 bits. The byte-width of a string is controlled by 
-whether or not it starts with a BOM, not which font it uses- the current 
font's CMap but is always 16-bits with TTF.

Therefore, drawString() must know what font is currently being drawn, and ask 
that font to encode the String to whatever byte sequence it takes to draw those 
glyphs. So, PDFont must be added to the drawString() API, and PDFont ought to 
have a method for public byte[] encode(String).

drawString() is only valid after setFont() has been called, so it doesn't need 
adding to the API, we can just use the current font. PDFont#encode is a good 
idea, yes.
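
The idea of letting the current font do the encoding could be sketched as follows. All types here are hypothetical and exist only for this illustration; they are not PDFBox classes. The two-byte encoder simply writes each char as a big-endian 16-bit code, whereas a real font would first map characters to its own codes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical types for illustration only; none of these exist in PDFBox.
interface FontEncoder
{
    byte[] encode(String text);
}

// One byte per character, as a simple 8-bit font would need.
class SingleByteEncoder implements FontEncoder
{
    public byte[] encode(String text)
    {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }
}

// Two bytes per char; simplified stand-in for a font with 16-bit codes.
class TwoByteEncoder implements FontEncoder
{
    public byte[] encode(String text)
    {
        return text.getBytes(StandardCharsets.UTF_16BE);
    }
}

// Mimics a content stream: drawString() relies on the font set earlier.
class TextWriter
{
    private FontEncoder currentFont;

    void setFont(FontEncoder font)
    {
        currentFont = font;
    }

    byte[] drawString(String text)
    {
        if (currentFont == null)
        {
            throw new IllegalStateException("setFont() must be called before drawString()");
        }
        return currentFont.encode(text);
    }
}

class EncodeSketch
{
    public static void main(String[] args)
    {
        TextWriter writer = new TextWriter();
        writer.setFont(new SingleByteEncoder());
        System.out.println(writer.drawString("AB").length + " bytes with the 8-bit font");
        writer.setFont(new TwoByteEncoder());
        System.out.println(writer.drawString("AB").length + " bytes with the 16-bit font");
    }
}
```

With this shape, the drawString() signature stays unchanged and the byte width follows from whichever font was set last.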

{quote}
PDFont needs a clearly specified API which performs java String to 
font-specific encoding transformation.
{quote}

Yes, as above.

{quote}
Observe that there are no methods in PDFont called decode(), and I have a hard 
time figuring out what any one of these methods actually do, because everything 
seems to be called encode or lookup. It seems that the encode(byte[], int, 
int) performs decoding, so it should be renamed such.
{quote}

Yes, I don't know if anybody knows what those methods are actually doing, 
including the original author.

{quote}
In general I'd recommend pushing the encode/decode job down to the font layer. 
Provide just two methods: byte[] encode(String) and String decode(byte[]). 
Their job is to convert between the byte sequences required by that font and 
java Strings, and they handle full runs of text, not just single characters. 
They will then use single- or multibyte encodings as the font requires without 
the higher level having to do crazy stuff like processEncodedText() currently 
does in PDFStreamEngine.
{quote}

processEncodedText() is indeed crazy and needs fixing, but what you propose 
won't work because the 16-bit string encoding is not set by the font, it's set 
on a per-string basis by having that string start with a BOM.

{quote}
There are unfortunately very many ways to encode text in PDF, and especially if 
text needs to be decodable from the byte stream generated by other programs, 
the full complexity must be faced and implemented. These are to be solved in a 
case-by-case basis in the PDFont hierarchy. The PDFont highest class methods 
for encode and decode should be defined as abstract to reflect the fact that 
encoding depends on the particular subtype of the font.
{quote}

Yes, though as far as decoding the correct text is concerned all you have to do 
is make sure that the ToUnicode map is built correctly - you can put any old 
garbage in the actual strings (and many PDFs do). 

{quote}
It may be that for some of these fonts the implementation is same because the 
actual mechanics can be handled by varying the Encoding instance, though.
{quote}

Maybe, though the Encoding class is for Type1 fonts (and equivalent, e.g. 
Type1C) only.


was (Author: jahewson):
{quote}
drawString() in PDPageContentStream just writes the text into PDF as any 
COSString would choose to represent it. This is not the right thing to do. When 
the font is a CID keyed font, every glyph is 16 bit wide by definition, and 
COSString won't necessarily notice and write it correctly.
{quote}

Not quite: every CID can be up to 16-bits wide, but many (or for < 256 glyphs, 
all) will fit inside 8 bits. The byte-width of a string is controlled by 
whether or not it starts with a BOM, not which font it uses.

Therefore, drawString() must know what font is currently being drawn, and ask 
that font to encode the String to whatever byte sequence it takes to draw those 
glyphs. So, PDFont must be added to the drawString() API, and PDFont ought to 
have a method for public byte[] encode(String).

drawString() is only valid after setFont() has been called, so it doesn't need 
adding to the API, we can just use the current font. PDFont#encode is a good 
idea, yes.

{quote}
PDFont needs a clearly specified API which performs java String to 
font-specific encoding transformation.
{quote}

Yes, as above.

{quote}
Observe that there are no methods in PDFont called decode(), and I have a hard 
time figuring out what any one of these methods actually do, because everything 
seems to be called encode or lookup. It seems that the encode(byte[], int 
int) performs decoding, so it should be 

[jira] [Updated] (PDFBOX-922) True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)

2014-12-09 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-922:
--
Assignee: (was: Andreas Lehmkühler)

 True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)
 

 Key: PDFBOX-922
 URL: https://issues.apache.org/jira/browse/PDFBOX-922
 Project: PDFBox
  Issue Type: New Feature
  Components: Writing
Affects Versions: 1.3.1
 Environment: JDK 1.6 / OS irrelevant, tried against 1.3.1 and 1.2.0
Reporter: Thanos Agelatos
Priority: Blocker
 Fix For: 2.0.0

 Attachments: pdfbox-unicode.diff, pdfbox-unicode2.diff


 PDFBox cannot embed Identity-H or Identity-V type TTF fonts in the PDF it 
 creates, making it impossible to create PDFs in any language apart from 
 English and the ones supported by WinAnsiEncoding. This behaviour is caused 
 by the method PDTrueTypeFont.loadTTF, which has WinAnsiEncoding hardcoded 
 inside, and there are no Identity-H or Identity-V Encoding classes provided 
 (to set afterwards via PDFont.setFont()).
 This excludes the following languages plus many others:
 - Greek
 - Bulgarian
 - Swedish
 - Baltic languages
 - Maltese 
 The PDF created contains garbled characters and/or squares.
 Simple test case:
 {code}
 // Assumes PDFBox 1.x on the classpath, with java.io.ByteArrayInputStream and the
 // PDDocument, PDPage, PDPageContentStream, PDFont and PDTrueTypeFont classes imported.
 PDDocument doc = null;
 try {
     doc = new PDDocument();
     PDPage page = new PDPage();
     doc.addPage(page);
     // extract fonts for fields (extractFont is the reporter's own helper, not shown)
     byte[] arialNorm = extractFont("arial.ttf");
     //byte[] arialBold = extractFont("arialbd.ttf");
     //PDFont font = PDType1Font.HELVETICA;
     PDFont font = PDTrueTypeFont.loadTTF(doc, new ByteArrayInputStream(arialNorm));

     PDPageContentStream contentStream = new PDPageContentStream(doc, page);
     contentStream.beginText();
     contentStream.setFont(font, 12);
     contentStream.moveTextPositionByAmount(100, 700);
     // text here may appear garbled; insert any text in Greek, Bulgarian or Maltese
     contentStream.drawString("Hello world from PDFBox ελληνικά");
     contentStream.endText();
     contentStream.close();
     doc.save("pdfbox.pdf");
     System.out.println("created!");
 } catch (Exception ioe) {
     ioe.printStackTrace();
 } finally {
     if (doc != null) {
         try { doc.close(); } catch (Exception e) {}
     }
 }
 {code}





Re: svn: E185004: Unexpected end of svndiff input

2014-12-09 Thread Andreas Lehmkuehler

I've created a ticket (INFRA-8846) as discussed on HipChat.

BR
Andreas Lehmkühler


Am 09.12.2014 um 19:31 schrieb Andreas Lehmkuehler:

Hi,

I've got the following error when I try to commit the PDFBox release candidate
to the dist repo [1]

svn: E185004: Unexpected end of svndiff input

The issue seems to be related to big files only, as I was able to commit the
smaller files < 1Mb.

The email notification doesn't work either.

Can someone please have a look, as I'm in the middle of the release process.

Thanks in advance
Andreas Lehmkühler

[1] https://dist.apache.org/repos/dist/dev/pdfbox/1.8.8/




[jira] [Commented] (PDFBOX-1242) Handle non ISO-8859-1 chars with drawString

2014-12-09 Thread Glen Peterson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240282#comment-14240282
 ] 

Glen Peterson commented on PDFBOX-1242:
---

If I remember correctly, the PDF file format uses its own very special 14-bit 
character encoding.  If you use anything outside of what the PDF spec calls 
WinAnsi you may have to embed a font that handles those characters in the PDF 
file to ensure readability.

I have not submitted any patches, nor am I likely to any time soon.  What I did 
submit was a very partial work-around.  The mangled code above is now publicly 
available under the Apache 2.0 license on GitHub where it should be much more 
readable.  There is a Unicode to WinAnsi translation table here (I'll explain 
in a moment):
https://github.com/GlenKPeterson/PdfLayoutManager/blob/master/src/main/java/com/planbase/pdf/layoutmanager/PdfLayoutMgr.java#L651

The code that uses that table is here:
https://github.com/GlenKPeterson/PdfLayoutManager/blob/master/src/main/java/com/planbase/pdf/layoutmanager/PdfLayoutMgr.java#L972

High-level overview for each input character

1. The characters up to 127 are the same in UTF-16 and ISO-8859-1, so it leaves 
them unchanged

2. If one of the higher than 127 input UTF-16 characters has an ISO-8859-1 
equivalent, it is converted directly/exactly.

3. If the input character is Cyrillic, there are somewhat standard, Romanized 
transliterations, where you can substitute one or more Roman characters that 
have a similar phonetic sound to the Cyrillic character.  So this lets us 
support an additional set of languages (Russian in particular) without 
embedding any fonts or otherwise dealing with the root issue.

4. If the above rules do not cover the character in question, a bullet is 
written to the output stream, so that the end user can see that there is a 
character there that didn't print.

OK, so I lied.  The while loop at line 1006 doesn't actually work one 
character at a time.  It finds instances of characters that need to be 
substituted.  Then it copies what chunks of raw input it can to the output 
unchanged.  It only drops to a character-by-character algorithm when it finds a 
character that actually needs to be substituted.  This means that any length 
string of modern English characters will pass through unchanged.
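
To make the overview concrete, here is a rough sketch of the four rules plus a 
simplified version of the pass-through shortcut. It is not the PdfLayoutManager 
code: the class name Latin1Fallback is made up, the Cyrillic table is a tiny 
hand-picked subset, and treating U+00A0 to U+00FF as "has an ISO-8859-1 
equivalent" is my assumption. It works on UTF-16 code units and returns a 
String, emitting U+2022 for the bullet and leaving the final byte mapping to 
the caller.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the four fallback rules; not the PdfLayoutManager code. */
final class Latin1Fallback {
    private static final char BULLET = '\u2022';

    // Tiny illustrative subset of a Cyrillic romanization table (assumed values).
    private static final Map<Character, String> CYRILLIC = new HashMap<>();
    static {
        CYRILLIC.put('\u0414', "D");   // CYRILLIC CAPITAL LETTER DE
        CYRILLIC.put('\u0430', "a");   // CYRILLIC SMALL LETTER A
        CYRILLIC.put('\u0434', "d");   // CYRILLIC SMALL LETTER DE
        CYRILLIC.put('\u0436', "zh");  // CYRILLIC SMALL LETTER ZHE
        CYRILLIC.put('\u0448', "sh");  // CYRILLIC SMALL LETTER SHA
    }

    /** Rules 1 and 2: ASCII and the printable ISO-8859-1 range are kept as-is. */
    static boolean passesThrough(char c) {
        return c <= 0x7F || (c >= 0xA0 && c <= 0xFF);
    }

    static String convert(String in) {
        // Shortcut: skip over the leading run that needs no substitution.
        int i = 0;
        while (i < in.length() && passesThrough(in.charAt(i))) {
            i++;
        }
        if (i == in.length()) {
            return in; // nothing to substitute, input passes through unchanged
        }

        StringBuilder out = new StringBuilder(in.length());
        out.append(in, 0, i); // copy the unchanged chunk in one go
        for (; i < in.length(); i++) {
            char c = in.charAt(i);
            if (passesThrough(c)) {
                out.append(c);
            } else {
                String roman = CYRILLIC.get(c);       // rule 3: transliterate
                if (roman != null) {
                    out.append(roman);
                } else {
                    out.append(BULLET);               // rule 4: visible placeholder
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("\u0414\u0430!")); // prints Da!
    }
}
```

Because the shortcut returns the original String when nothing needs replacing, 
plain English input costs one scan and no allocation.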

Most of that is in comments in the code on GitHub, but is probably easier to 
read knowing this overview. I hope that helps.

 Handle non ISO-8859-1 chars with drawString
 ---

 Key: PDFBOX-1242
 URL: https://issues.apache.org/jira/browse/PDFBOX-1242
 Project: PDFBox
  Issue Type: Bug
  Components: Writing
Affects Versions: 1.5.0, 1.6.0
Reporter: Peter Andersen
Assignee: John Hewson
 Fix For: 2.0.0


 PDPageContentStream.drawString takes a String as its argument and constructs 
 a COSString from the input.
 If the input contains chars above 255, the COSString is prefixed with 0xFE, 
 0xFF and the bytes are taken from the input as UTF-16BE encoded.
 Back in the drawString method, this UTF-16 encoded COSString is appended as 
 ISO-8859-1:
   appendRawCommands( new String( buffer.toByteArray(), "ISO-8859-1"));
  
 The result is that a line with UTF-16 chars is shown prefixed with þÿ, 
 and with double spacing between the other chars.
 The chars above 255 are shown as the two corresponding ISO-8859-1 characters.
 As a side question to this observation, is there an alternative way to use 
 PDFBox that supports UTF-16?
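
 The effect described above can be reproduced with the JDK alone. The sketch 
 below does not call PDFBox; encodeLikeCosString is a made-up helper that only 
 mimics the behaviour as reported (BOM plus UTF-16BE when any char is above 
 255), and misdecode reads those bytes back as ISO-8859-1.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** JDK-only illustration of the reported symptom; no PDFBox classes involved. */
final class BomMisdecodeDemo {

    /** Hypothetical helper mimicking the reported COSString behaviour. */
    static byte[] encodeLikeCosString(String text) {
        boolean needsUtf16 = false;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) > 255) {
                needsUtf16 = true;
                break;
            }
        }
        if (!needsUtf16) {
            return text.getBytes(StandardCharsets.ISO_8859_1);
        }
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        buf.write(0xFE); // byte order mark, first byte
        buf.write(0xFF); // byte order mark, second byte
        byte[] body = text.getBytes(StandardCharsets.UTF_16BE); // no BOM added here
        buf.write(body, 0, body.length);
        return buf.toByteArray();
    }

    /** Reads the bytes back one byte per char, as the report says drawString did. */
    static String misdecode(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String shown = misdecode(encodeLikeCosString("A\u03b1"));
        System.out.println(shown.length());                   // prints 6
        System.out.println(shown.startsWith("\u00fe\u00ff")); // prints true
    }
}
```

 For the two-character input "A" plus Greek alpha, the bytes are FE FF 00 41 
 03 B1, so the misread text has six characters: the þÿ prefix, a NUL before 
 the A (the extra spacing), and two unrelated characters in place of the alpha.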
  
  
  


