Created new version 1.8.9 in JIRA
Hi, maybe a little bit too early, but I've created a new 1.8.9 version in JIRA. Obviously Tilman is already working on 2 issues that fit this version. BR Andreas Lehmkühler
[jira] [Updated] (PDFBOX-2539) [PATCH] Allow non static FontProvider
[ https://issues.apache.org/jira/browse/PDFBOX-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] simon steiner updated PDFBOX-2539: -- Attachment: (was: fontProvider.patch) [PATCH] Allow non static FontProvider - Key: PDFBOX-2539 URL: https://issues.apache.org/jira/browse/PDFBOX-2539 Project: PDFBox Issue Type: Bug Components: FontBox Affects Versions: 2.0.0 Reporter: simon steiner Attachments: fontProvider.patch I would like to use multiple instances of FontProvider in a thread-safe way. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PDFBOX-2539) [PATCH] Allow non static FontProvider
[ https://issues.apache.org/jira/browse/PDFBOX-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] simon steiner updated PDFBOX-2539: -- Attachment: fontProvider.patch Fix patch [PATCH] Allow non static FontProvider - Key: PDFBOX-2539 URL: https://issues.apache.org/jira/browse/PDFBOX-2539 Project: PDFBox Issue Type: Bug Components: FontBox Affects Versions: 2.0.0 Reporter: simon steiner Attachments: fontProvider.patch
Re: preflight mass tests
Hello Tilman, do you have a rough estimate of what share of the files is, e.g. in Adobe Reader, either not displayed, displayed with a dialog, or displayed incorrectly? Kind regards, Maruan Sahyoun FileAffairs GmbH Josef-Schappe-Straße 21 40882 Ratingen Tel: +49 (2102) 89497 88 Fax: +49 (2102) 89497 91 sahy...@fileaffairs.de www.fileaffairs.de Managing Director: Maruan Sahyoun Commercial Register: AG Düsseldorf, HRB 53837 VAT ID: DE248275827
On 05.12.2014 at 20:45, Tilman Hausherr thaush...@t-online.de wrote: Some numbers... it took 4-5 days. total: 231223, failed: 142, percentage failed: 0.06141257472336292. Of these, one can subtract 33 OutOfMemoryErrors that happened near the end of the test. Isolated runs went fine. About the rest: 18 are the isSymbol stack overflow, 9 are the getFontMatrix NPE, 33 are the "root must be of type Pages" errors. The rest is mostly related to very broken PDF files. Tilman
On 04.12.2014 at 14:55, Maruan Sahyoun wrote: Hi Tilman, that's very good news. I trust a lot of time went into reviewing the test results. Without your and Tim's efforts this achievement wouldn't have been possible. BR Maruan
On 03.12.2014 at 21:04, Tilman Hausherr thaush...@t-online.de wrote: I've now run preflight on half of the govdocs files. Every issue I have opened on preflight is related to that test. The failure rate (exceptions other than the allowed ValidationExceptions) is down from 1% when I started to 0.05% now. Most of the frequent exceptions (e.g. the one with NonTermimalField) have been fixed. What's left now are exceptions related to messy files, and some of the font-related issues. Tilman
On 03.11.2014 at 22:58, Tilman Hausherr wrote: On 03.11.2014 at 19:00, Tilman Hausherr wrote: It is not looking good, there is at least one NPE issue coming. No more NPEs after solving the two issues I opened today, except PDFBOX-1743.pdf, which is a known problem. Coming up soon: run preflight on the 231227 PDF files from digitalcorpora to see what happens. Tilman
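The figures quoted in this thread can be broken down with a few lines of arithmetic. The following is only an illustrative sketch: the class and method names are made up for this example, and the only inputs are the counts reported above (142 failures out of 231223 files, of which 33 were OutOfMemoryErrors, 18 isSymbol stack overflows, 9 getFontMatrix NPEs and 33 "root must be of type Pages" errors).

```java
/** Illustrative breakdown of the failure counts reported in the thread. */
final class FailureBreakdown {

    /** Share of part in total, expressed as a percentage. */
    static double percent(int part, int total) {
        return 100.0 * part / total;
    }

    /** Failures left over after subtracting the explained categories. */
    static int remaining(int failed, int... explained) {
        int rest = failed;
        for (int n : explained) {
            rest -= n;
        }
        return rest;
    }

    public static void main(String[] args) {
        int total = 231223;      // files processed
        int failed = 142;        // unexpected exceptions
        int outOfMemory = 33;    // went fine when run in isolation

        int withoutOom = remaining(failed, outOfMemory);
        int unexplained = remaining(failed, outOfMemory, 18, 9, 33);

        System.out.println("raw failure rate (%): " + percent(failed, total));
        System.out.println("failures without OOM: " + withoutOom);
        System.out.println("rate without OOM (%): " + percent(withoutOom, total));
        System.out.println("other (mostly broken files): " + unexplained);
    }
}
```

Subtracting the 33 OutOfMemoryErrors leaves 109 failures, and subtracting the three named categories as well leaves 49 cases attributed to very broken files.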
[jira] [Updated] (PDFBOX-1886) Merge Function strips OCR layer in acrobat
[ https://issues.apache.org/jira/browse/PDFBOX-1886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Lehmkühler updated PDFBOX-1886: --- Fix Version/s: 2.1.0 Merge Function strips OCR layer in acrobat -- Key: PDFBOX-1886 URL: https://issues.apache.org/jira/browse/PDFBOX-1886 Project: PDFBox Issue Type: Bug Components: Utilities Affects Versions: 1.8.4 Reporter: adam brin Fix For: 2.1.0 Attachments: cover_page4818280580458469287.pdf, page1.pdf, santa-cruz-flats-project-part-2 (1).pdf
We use the PDFMergerUtility to add cover pages to documents automatically. We're finding that when we do so, it strips the OCR data from the source of the merged files.
{code}
PDFMergerUtility merger = new PDFMergerUtility();
// createTempFile needs a prefix and a suffix; the values here are placeholders
File outputFile = File.createTempFile("merged", ".pdf");
merger.setDestinationStream(new FileOutputStream(outputFile));
for (File file : files) {
    merger.addSource(file);
}
merger.mergeDocuments();
return outputFile;
{code}
[jira] [Updated] (PDFBOX-1878) Tags are not being displayed in Adobe Acrobat Tags panel when merging pdfs
[ https://issues.apache.org/jira/browse/PDFBOX-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Lehmkühler updated PDFBOX-1878: --- Fix Version/s: 2.1.0 Tags are not being displayed in Adobe Acrobat Tags panel when merging pdfs -- Key: PDFBOX-1878 URL: https://issues.apache.org/jira/browse/PDFBOX-1878 Project: PDFBox Issue Type: Bug Components: Utilities Affects Versions: 1.8.3, 1.8.4 Environment: Windows XP SP3 Reporter: Tiuser Lassei Priority: Minor Fix For: 2.1.0 Attachments: pdf1.3.pdf, pdf1.4.pdf The merged PDF output produced by the PDFMergerUtility does not display the tags correctly in the Tags panel of Adobe Acrobat (tested in the Acrobat Pro XI trial version). I have not tested another PDF tool that can display tags (not sure if another tool can do this). A single blank entry is shown instead of the actual structure tree of the combined source PDFs. However, the reading order (based on the tag structure) still seems to be preserved, judging by tests with Adobe Reader's read-aloud feature. Possibly related to the fix on tag merging: https://issues.apache.org/jira/browse/PDFBOX-1342 Although the tag merging logic is wrong in 1.8.2 (only the first page is tagged, which was fixed as indicated in PDFBOX-1342), the tags appear correctly in the Tags panel. This bug prevents users from modifying the tag structure in Acrobat, as the tag entries are missing.
[jira] [Assigned] (PDFBOX-1874) PDFTextStripper.isParagraphSeparation(...)
[ https://issues.apache.org/jira/browse/PDFBOX-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Lehmkühler reassigned PDFBOX-1874: -- Assignee: Andreas Lehmkühler PDFTextStripper.isParagraphSeparation(...) -- Key: PDFBOX-1874 URL: https://issues.apache.org/jira/browse/PDFBOX-1874 Project: PDFBox Issue Type: Bug Components: Text extraction Affects Versions: 1.8.3 Environment: Eclipse Reporter: Yuri Burrows Assignee: Andreas Lehmkühler Priority: Minor Labels: patch
PDFTextStripper.isParagraphSeparation(...) seems to have an issue with how it finds Y text indentation.
PROBLEM: I believe the issue is due to precision in the following logic:
    float yGap = Math.abs(position.getTextPosition().getYDirAdj()
            - lastPosition.getTextPosition().getYDirAdj());
    float xGap = (position.getTextPosition().getXDirAdj()
            - lastLineStartPosition.getTextPosition().getXDirAdj());
    if (yGap > (getDropThreshold() * maxHeightForLine))
    {
        result = true;
    }
yGap has a precision to the 1,000th place or better, while (getDropThreshold()*maxHeightForLine) has a precision to the 100,000th place. This results in comparisons such as 16.018 against 16.018005, which evaluates to true, whereas 16.018 against 16.018 would evaluate to false.
EFFECT OF THE PROBLEM: every line in the output is marked as isParagraphStart = true and writeParagraphEnd() ... = true, i.e.
|||NEW_LINE|||
|||PARAGRAPH_START|||PDFBox has been designed to represent PDF documents using familiar object-oriented paradigms. The data|||NEW_LINE|||
contained in a PDF document is a collection of basic object types: arrays, booleans, dictionaries, numbers,|||NEW_LINE|||
|||PARAGRAPH_END||NEW_LINE|||
|||PARAGRAPH_START|||strings and binary streams. PDFBox captures these basic object types in the org.pdfbox.cos package (the|||NEW_LINE|||
COS Model). While it's possible to create any desired interactions with a PDF document using only these|||NEW_LINE|||
|||PARAGRAPH_END||NEW_LINE|||
In the source PDF these lines appear as such:
PDFBox has been designed to represent PDF documents using familiar object-oriented paradigms. The data contained in a PDF document is a collection of basic object types: arrays, booleans, dictionaries, numbers, strings and binary streams. PDFBox captures these basic object types in the org.pdfbox.cos package (the COS Model). While it's possible to create any desired interactions with a PDF document using only these
MY WORKAROUND (note: there is a small performance hit with this workaround):
    float yGap = Math.abs(position.getTextPosition().getYDirAdj()
            - lastPosition.getTextPosition().getYDirAdj());
    DecimalFormat df = new DecimalFormat("#.00");
    float yGapTruncated = Float.valueOf(df.format(yGap));
    float newYVal = Float.valueOf(df.format(getDropThreshold() * maxHeightForLine));
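As an alternative to rounding both values through DecimalFormat, float noise of this kind is often handled by comparing with a small tolerance. The sketch below is not taken from PDFBox; the class name, the method name and the EPSILON value are assumptions chosen for illustration, and the sample numbers only mimic the magnitude quoted in the report.

```java
/** Sketch: compare a line gap against a threshold with a small tolerance. */
final class GapComparison {

    // Hypothetical tolerance; it would need tuning to the coordinate precision in use.
    static final float EPSILON = 0.001f;

    /** Returns true only if gap exceeds threshold by more than EPSILON. */
    static boolean exceeds(float gap, float threshold) {
        return gap - threshold > EPSILON;
    }

    public static void main(String[] args) {
        float threshold = 16.018f;
        float gap = 16.018005f; // differs from the threshold only in the sixth decimal place

        System.out.println("plain comparison:    " + (gap > threshold));
        System.out.println("tolerant comparison: " + exceeds(gap, threshold));
        System.out.println("clearly larger gap:  " + exceeds(18.5f, threshold));
    }
}
```

Compared with the formatting workaround, this avoids creating strings for every text position, at the cost of having to pick a tolerance that suits the documents being processed.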
[jira] [Updated] (PDFBOX-1807) TextToPDF strips leading spaces from input file
[ https://issues.apache.org/jira/browse/PDFBOX-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Lehmkühler updated PDFBOX-1807: --- Fix Version/s: 3.0.0 TextToPDF strips leading spaces from input file --- Key: PDFBOX-1807 URL: https://issues.apache.org/jira/browse/PDFBOX-1807 Project: PDFBox Issue Type: Bug Components: Utilities Affects Versions: 1.8.3 Environment: Win7 64 bit Reporter: Mark Mitchell Priority: Minor Fix For: 3.0.0 When using the TextToPDF utility on a text file that has spaces at the front of lines for formatting purposes, the leading spaces are stripped, so the report in the PDF no longer looks like the original. Was this the intended result? Is there a way to turn off the stripping of the spaces? If not, can it be added?
[jira] [Commented] (PDFBOX-1807) TextToPDF strips leading spaces from input file
[ https://issues.apache.org/jira/browse/PDFBOX-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239282#comment-14239282 ] Andreas Lehmkühler commented on PDFBOX-1807: TextToPDF is a proof of concept and not a real application. So, don't expect too much. TextToPDF strips leading spaces from input file --- Key: PDFBOX-1807 URL: https://issues.apache.org/jira/browse/PDFBOX-1807 Project: PDFBox Issue Type: Bug Components: Utilities Affects Versions: 1.8.3 Environment: Win7 64 bit Reporter: Mark Mitchell Priority: Minor Fix For: 3.0.0
[jira] [Updated] (PDFBOX-1242) Handle non ISO-8859-1 chars with drawString
[ https://issues.apache.org/jira/browse/PDFBOX-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Lehmkühler updated PDFBOX-1242: --- Fix Version/s: 2.0.0 Handle non ISO-8859-1 chars with drawString --- Key: PDFBOX-1242 URL: https://issues.apache.org/jira/browse/PDFBOX-1242 Project: PDFBox Issue Type: Bug Components: Writing Affects Versions: 1.5.0, 1.6.0 Reporter: Peter Andersen Fix For: 2.0.0 The PDPageContentStream.drawString method takes a String as its argument and constructs a COSString from the input. If the input contains chars above 255, the COSString is prefixed with 0xFE, 0xFF and the bytes are taken from the input encoded as UTF-16BE. Back in the drawString method, this UTF-16 encoded COSString is appended as ISO-8859-1: appendRawCommands( new String( buffer.toByteArray(), "ISO-8859-1")); The result is that a line with UTF-16 chars is shown prefixed with þÿ, and with a double space between the other chars. The chars above 255 are shown as the two corresponding ISO-8859-1 characters. As a side question to this observation, is there an alternative way to use PDFBox that supports UTF-16?
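The effect described in this report can be reproduced with the JDK alone. The sketch below does not call PDFBox; the class and method names are invented for illustration. It encodes a string as UTF-16BE with the 0xFE 0xFF prefix and then decodes the bytes as ISO-8859-1, which is the mismatch the reporter describes.

```java
import java.nio.charset.StandardCharsets;

/** Sketch: what UTF-16BE bytes look like when misread as ISO-8859-1. */
final class Utf16AsLatin1 {

    /** Encodes text as UTF-16BE with a leading byte order mark (0xFE 0xFF). */
    static byte[] utf16WithBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_16BE);
        byte[] out = new byte[body.length + 2];
        out[0] = (byte) 0xFE;
        out[1] = (byte) 0xFF;
        System.arraycopy(body, 0, out, 2, body.length);
        return out;
    }

    /** Misreads the bytes as ISO-8859-1, producing one character per byte. */
    static String misread(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // 'A' is below 256, the euro sign (U+20AC) is above 255.
        String shown = misread(utf16WithBom("A\u20AC"));
        // Print code points so that the invisible NUL characters become visible.
        shown.chars().forEach(c -> System.out.printf("U+%04X ", c));
        System.out.println();
    }
}
```

The two prefix bytes decode to þ (U+00FE) and ÿ (U+00FF), every character below 256 gains a leading NUL, which is consistent with the extra spacing the reporter saw, and each character above 255 turns into two unrelated ISO-8859-1 characters.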
[jira] [Assigned] (PDFBOX-1151) StreamCorruptedException on bad PDF with -force
[ https://issues.apache.org/jira/browse/PDFBOX-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Lehmkühler reassigned PDFBOX-1151: -- Assignee: Andreas Lehmkühler StreamCorruptedException on bad PDF with -force --- Key: PDFBOX-1151 URL: https://issues.apache.org/jira/browse/PDFBOX-1151 Project: PDFBox Issue Type: Bug Components: Parsing Affects Versions: 1.6.0, 1.8.7, 2.0.0 Environment: Windows Vista Sun JDK 1.6.0_26 Reporter: Stas Shaposhnikov Assignee: Andreas Lehmkühler Attachments: PDFStreamEngine.patch, test.pdf
I am getting a StreamCorruptedException when trying to parse a possibly invalid PDF document, even if the -force option is specified. Stack trace:
java.io.StreamCorruptedException: Error: data is null
    at org.apache.pdfbox.filter.LZWFilter.decode(LZWFilter.java:82)
    at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:301)
    at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
    at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:105)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:264)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
    at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:256)
    at org.apache.pdfbox.ExtractText.main(ExtractText.java:76)
    at org.apache.pdfbox.PDFBox.main(PDFBox.java:42)
My suggestion is to skip bad sub-streams in PDFStreamEngine.processSubStream() without throwing exceptions when forceParsing is true.
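The suggestion above amounts to a "log and continue" policy that is only active in force mode. The following sketch shows that pattern in isolation. It is not the attached PDFStreamEngine.patch and does not use any PDFBox classes; the Step interface, the runAll method and the force flag are hypothetical stand-ins for sub-stream processing and the forceParsing setting.

```java
import java.io.IOException;
import java.io.StreamCorruptedException;
import java.util.Arrays;
import java.util.List;

/** Sketch: skip failing units of work when a force flag is set. */
final class LenientProcessing {

    /** Hypothetical stand-in for one unit of work, such as decoding a sub-stream. */
    interface Step {
        void run() throws IOException;
    }

    /**
     * Runs every step in order. When force is true, a failing step is reported
     * and skipped; otherwise the first IOException is rethrown to the caller.
     * Returns the number of steps that were skipped.
     */
    static int runAll(List<Step> steps, boolean force) throws IOException {
        int skipped = 0;
        for (Step step : steps) {
            try {
                step.run();
            } catch (IOException e) {
                if (!force) {
                    throw e;
                }
                skipped++;
                System.err.println("Skipping bad sub-stream: " + e.getMessage());
            }
        }
        return skipped;
    }

    public static void main(String[] args) throws IOException {
        List<Step> steps = Arrays.<Step>asList(
                () -> { },
                () -> { throw new StreamCorruptedException("Error: data is null"); },
                () -> { });
        System.out.println("skipped: " + runAll(steps, true));
    }
}
```

Returning the number of skipped steps keeps the damage visible to the caller, so a forced run can still be told apart from a clean one.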
[jira] [Updated] (PDFBOX-785) Spliting a PDF creates unnecessarily large files
[ https://issues.apache.org/jira/browse/PDFBOX-785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Lehmkühler updated PDFBOX-785: -- Fix Version/s: 2.0.0 Spliting a PDF creates unnecessarily large files Key: PDFBOX-785 URL: https://issues.apache.org/jira/browse/PDFBOX-785 Project: PDFBox Issue Type: Bug Components: Utilities Affects Versions: 0.8.0-incubator, 1.1.0, 1.2.1 Environment: Windows XP, openOffice3.0.0, pdfsam Reporter: mathieu radiguet Assignee: Andreas Lehmkühler Fix For: 2.0.0 Attachments: fileSizeIssue.zip Using PDFBox 0.8.0 (also tried on 1.1.0 and 1.2.1) to split files results in parts that are bigger than the original. The affected files were made from OpenOffice .odt documents in version 3.0.0, using the OpenOffice PDF export and then merging several copies with pdfsam (http://www.pdfsam.org/). In the attached Eclipse project the test file size is 10 712 749 bytes for 2812 pages, and the sizes of the result files after splitting in two at page 2300 are 8 812 515 bytes and 10 701 142 bytes. Using pdfSplit on the command line, every single result file is bigger than the original. An example is also attached. An error says the original file is corrupted, but we tried it on a file (using pdfsam and without using it) with no error and with a similar result, so I think it's not related. This issue seems similar to JIRA PDFBOX-28 (https://issues.apache.org/jira/browse/PDFBOX-28)
[jira] [Reopened] (PDFBOX-2548) Problems with character extraction (fi ligature)
[ https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Bösinger reopened PDFBOX-2548: --- Problems with character extraction (fi ligature) Key: PDFBOX-2548 URL: https://issues.apache.org/jira/browse/PDFBOX-2548 Project: PDFBox Issue Type: Bug Components: Text extraction Affects Versions: 1.8.7 Environment: Windows7Professional JavaSE8 EclipseKepler Reporter: Matthias Bösinger Priority: Minor Attachments: preflight.png, test.pdf, test2.pdf
I have a PDF document whose font type is OpenType (Garamond OpenType). So the PDFBox text extraction can also extract special characters (for example small capital letters), which caused problems when the underlying font was a simple Type1 font. However, the text extraction now causes another type of problem. In my case, when the character sequences fi or fl occur in the text, PDFTextStripper#getText(PDDocument doc) extracts them as single characters, 'ﬁ' and 'ﬂ', and puts a space character on their right side. (Surprisingly, if I access the list of characters of a page via the charactersByArticle field of PDFTextStripper / via the PDFTextStripper#processText(TextPosition pos) method, the same characters show up as 'normal' single characters f i / f l.) My assumption is that the advantage of the underlying OpenType font turns into this particular disadvantage, because the PDFTextStripper recognizes the character sequences f i / f l as the special characters ﬁ / ﬂ (which might have to do with the fact that the getText() method calculates things like whitespace characters from distances / positional placements).
Background: The given document is a dictionary with very densely printed text. See this link for code and output: http://stackoverflow.com/questions/27333499/problems-with-extracting-opentypefont-text-using-pdfbox
My question: is there anything I can do to avoid this problem? Thanks in advance ...
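One general-purpose way to turn extracted ligature characters back into plain letters is Unicode compatibility normalization, which is available in the JDK. This is a sketch of post-processing applied to text after extraction, not a PDFBox feature and not the resolution recorded on this issue; the class and method names are made up, and the sample string is invented. It does not address the extra space character mentioned in the report.

```java
import java.text.Normalizer;

/** Sketch: expand ligature characters in already extracted text. */
final class LigatureNormalizer {

    /** Applies NFKC normalization, which maps U+FB01 to "fi" and U+FB02 to "fl". */
    static String expand(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // "\uFB01" is the fi ligature, "\uFB02" is the fl ligature.
        String extracted = "\uFB01nal \uFB02ow";
        System.out.println(expand(extracted)); // prints "final flow"
    }
}
```

NFKC also rewrites other compatibility characters (for example superscript digits and full-width forms), so it is worth checking whether that is acceptable for the text at hand before applying it wholesale.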
[jira] [Updated] (PDFBOX-2548) Problems with character extraction (fi ligature)
[ https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Bösinger updated PDFBOX-2548: -- Attachment: (was: test.pdf) Problems with character extraction (fi ligature) Key: PDFBOX-2548 URL: https://issues.apache.org/jira/browse/PDFBOX-2548 Project: PDFBox Issue Type: Bug Components: Text extraction Affects Versions: 1.8.7 Environment: Windows7Professional JavaSE8 EclipseKepler Reporter: Matthias Bösinger Priority: Minor Attachments: preflight.png
[jira] [Closed] (PDFBOX-2548) Problems with character extraction (fi ligature)
[ https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Bösinger closed PDFBOX-2548. - Resolution: Not a Problem Problems with character extraction (fi ligature) Key: PDFBOX-2548 URL: https://issues.apache.org/jira/browse/PDFBOX-2548 Project: PDFBox Issue Type: Bug Components: Text extraction Affects Versions: 1.8.7 Environment: Windows7Professional JavaSE8 EclipseKepler Reporter: Matthias Bösinger Priority: Minor Attachments: preflight.png
[jira] [Updated] (PDFBOX-2548) Problems with character extraction (fi ligature)
[ https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Bösinger updated PDFBOX-2548: -- Attachment: (was: test2.pdf) Problems with character extraction (fi ligature) Key: PDFBOX-2548 URL: https://issues.apache.org/jira/browse/PDFBOX-2548 Project: PDFBox Issue Type: Bug Components: Text extraction Affects Versions: 1.8.7 Environment: Windows7Professional JavaSE8 EclipseKepler Reporter: Matthias Bösinger Priority: Minor Attachments: preflight.png
[jira] [Created] (PDFBOX-2551) Wrong barcode printing for embedded font
Andriy created PDFBOX-2551: -- Summary: Wrong barcode printing for embedded font Key: PDFBOX-2551 URL: https://issues.apache.org/jira/browse/PDFBOX-2551 Project: PDFBox Issue Type: Bug Components: Rendering Affects Versions: 1.8.7 Reporter: Andriy Fix For: 1.8.8 Attachments: barcode_printing_problem.pdf
Couldn't print a file with the embedded font Code 128. Code for printing:
PDDocument document = load(new FileInputStream("barcode_printing_problem.pdf"));
PrinterJob printJob = getPrinterJob();
printJob.setPrintService(getPrinter(MY_PRINTER));
document.silentPrint(printJob);
[jira] [Updated] (PDFBOX-2551) Wrong barcode printing for embedded font
[ https://issues.apache.org/jira/browse/PDFBOX-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andriy updated PDFBOX-2551: --- Attachment: barcode_printing_problem.pdf Input pdf file Wrong barcode printing for embedded font Key: PDFBOX-2551 URL: https://issues.apache.org/jira/browse/PDFBOX-2551 Project: PDFBox Issue Type: Bug Components: Rendering Affects Versions: 1.8.7 Reporter: Andriy Fix For: 1.8.8 Attachments: barcode_printing_problem.pdf
[jira] [Updated] (PDFBOX-2551) Wrong barcode printing for embedded font
[ https://issues.apache.org/jira/browse/PDFBOX-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andriy updated PDFBOX-2551: --- Attachment: print_result.pdf after print Wrong barcode printing for embedded font Key: PDFBOX-2551 URL: https://issues.apache.org/jira/browse/PDFBOX-2551 Project: PDFBox Issue Type: Bug Components: Rendering Affects Versions: 1.8.7 Reporter: Andriy Fix For: 1.8.8 Attachments: barcode_printing_problem.pdf, print_result.pdf
[jira] [Comment Edited] (PDFBOX-2551) Wrong barcode printing for embedded font
[ https://issues.apache.org/jira/browse/PDFBOX-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239493#comment-14239493 ] Andriy edited comment on PDFBOX-2551 at 12/9/14 2:59 PM: - after print print_result.pdf was (Author: andriy.brez): after print Wrong barcode printing for embedded font Key: PDFBOX-2551 URL: https://issues.apache.org/jira/browse/PDFBOX-2551 Project: PDFBox Issue Type: Bug Components: Rendering Affects Versions: 1.8.7 Reporter: Andriy Fix For: 1.8.8 Attachments: barcode_printing_problem.pdf, print_result.pdf
[jira] [Comment Edited] (PDFBOX-2551) Wrong barcode printing for embedded font
[ https://issues.apache.org/jira/browse/PDFBOX-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239492#comment-14239492 ] Andriy edited comment on PDFBOX-2551 at 12/9/14 2:59 PM: - Input pdf file barcode_printing_problem.pdf was (Author: andriy.brez): Input pdf file Wrong barcode printing for embedded font Key: PDFBOX-2551 URL: https://issues.apache.org/jira/browse/PDFBOX-2551 Project: PDFBox Issue Type: Bug Components: Rendering Affects Versions: 1.8.7 Reporter: Andriy Fix For: 1.8.8 Attachments: barcode_printing_problem.pdf, print_result.pdf
[jira] [Commented] (PDFBOX-2551) Wrong barcode printing for embedded font
[ https://issues.apache.org/jira/browse/PDFBOX-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239497#comment-14239497 ] Andriy commented on PDFBOX-2551: Could this issue depend on the text encoding? Wrong barcode printing for embedded font Key: PDFBOX-2551 URL: https://issues.apache.org/jira/browse/PDFBOX-2551 Project: PDFBox Issue Type: Bug Components: Rendering Affects Versions: 1.8.7 Reporter: Andriy Fix For: 1.8.8 Attachments: barcode_printing_problem.pdf, print_result.pdf
RE: preflight mass tests
Tilman, This is fantastic! If you send me an example of the code you used to call preflight (#parse() or #parse(Format format)???), I'd like to run it within tika-batch to see what our batch performance is. Ideally, once we can turn our public VM on, it would be fun to run these tests there. Best, Tim
-----Original Message----- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Friday, December 05, 2014 2:45 PM To: dev@pdfbox.apache.org Subject: Re: preflight mass tests
Some numbers... it took 4-5 days. total: 231223, failed: 142, percentage failed: 0.06141257472336292. Of these, one can subtract 33 OutOfMemoryErrors that happened near the end of the test. Isolated runs went fine. About the rest: 18 are the isSymbol stack overflow, 9 are the getFontMatrix NPE, 33 are the "root must be of type Pages" errors. The rest is mostly related to very broken PDF files. Tilman
[jira] [Updated] (PDFBOX-2551) Wrong barcode printing for embedded font
[ https://issues.apache.org/jira/browse/PDFBOX-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Lehmkühler updated PDFBOX-2551: --- Fix Version/s: (was: 1.8.8) Wrong barcode printing for embedded font Key: PDFBOX-2551 URL: https://issues.apache.org/jira/browse/PDFBOX-2551 Project: PDFBox Issue Type: Bug Components: Rendering Affects Versions: 1.8.7 Reporter: Andriy Attachments: barcode_printing_problem.pdf, print_result.pdf
Re: preflight mass tests
Here's the code... it assumes that all PDFs are flat in one single directory.

Libraries needed: preflight-app, jai_imageio, levigo_jbig2-imageio-1.6.1.jar.

I have run it only with the trunk, not with 1.8, because we didn't fix all problems there.

Tilman

    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.FilenameFilter;
    import java.io.PrintWriter;
    import org.apache.pdfbox.preflight.PreflightDocument;
    import org.apache.pdfbox.preflight.exception.ValidationException;
    import org.apache.pdfbox.preflight.parser.PreflightParser;

    /**
     *
     * @author Tilman Hausherr
     */
    public class PreflightTest
    {
        public static void main(String[] args) throws FileNotFoundException
        {
            File dir;
            if (args.length > 0)
            {
                dir = new File(args[0]);
            }
            else
            {
                dir = new File("k:\\dc");
            }
            int total = 0;
            int failed = 0;
            File[] dirList = dir.listFiles(new FilenameFilter()
            {
                @Override
                public boolean accept(File dir, String name)
                {
                    if (name.compareTo("00.pdf") <= 0) // use this to start at a certain file
                    {
                        return false;
                    }
                    return name.toLowerCase().endsWith(".pdf");
                }
            });
            for (File pdf : dirList)
            {
                ++total;
                System.out.println(pdf.getName());
                // just test that it doesn't crash
                try
                {
                    new File(pdf.getName() + "-exception.txt").delete();
                    PreflightParser parser = new PreflightParser(pdf);
                    parser.parse();
                    try (PreflightDocument preflightDocument = parser.getPreflightDocument())
                    {
                        preflightDocument.validate();
                        preflightDocument.getResult();
                    }
                    parser.clearResources();
                }
                catch (ValidationException e)
                {
                    // allowed: validation failures are not counted as crashes
                }
                catch (Throwable e)
                {
                    ++failed;
                    try (PrintWriter pw = new PrintWriter(new File(pdf.getName() + "-exception.txt")))
                    {
                        e.printStackTrace(pw);
                    }
                    System.out.flush();
                    System.err.flush();
                    System.err.print(pdf.getName() + " preflight fail: ");
                    e.printStackTrace();
                    System.out.flush();
                    System.err.flush();
                }
                System.out.println("total: " + total + ", failed: " + failed + ", percentage failed: " + (((float) failed) / total * 100.0) + "%");
            }
        }
    }

On 09.12.2014 at 17:28, Allison, Timothy B. wrote:
Tilman, This is fantastic! If you send me an example of the code you used to call preflight (#parse() or #parse(Format format)?), I'd like to run it within tika-batch to see what our batch performance is. Ideally, once we can turn our public VM on, it would be fun to run these tests there.
Best, Tim
svn: E185004: Unexpected end of svndiff input
Hi,

I've got the following error when I try to commit the PDFBox release candidate to the dist repo [1]:

svn: E185004: Unexpected end of svndiff input

The issue seems to be related to big files only, as I was able to commit the smaller files (< 1 MB). The email notification doesn't work either.

Can someone please have a look, as I'm in the middle of the release process.

Thanks in advance
Andreas Lehmkühler

[1] https://dist.apache.org/repos/dist/dev/pdfbox/1.8.8/
[jira] [Commented] (PDFBOX-1886) Merge Function strips OCR layer in acrobat
[ https://issues.apache.org/jira/browse/PDFBOX-1886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239855#comment-14239855 ]

Tilman Hausherr commented on PDFBOX-1886:
-----------------------------------------
Why do you think that the OCR is missing? I can copy paste text from the santa-cruz-flats-project-part-2 (1).pdf file.

Merge Function strips OCR layer in acrobat
------------------------------------------
Key: PDFBOX-1886
URL: https://issues.apache.org/jira/browse/PDFBOX-1886
Project: PDFBox
Issue Type: Bug
Components: Utilities
Affects Versions: 1.8.4
Reporter: adam brin
Fix For: 2.1.0
Attachments: cover_page4818280580458469287.pdf, page1.pdf, santa-cruz-flats-project-part-2 (1).pdf

We use the PDFMergerUtility to add cover pages to documents automatically. We're finding that when we do so, it strips the OCR data from the source of the merged files.
{code}
PDFMergerUtility merger = new PDFMergerUtility();
File outputFile = File.createTempFile();
merger.setDestinationStream(new FileOutputStream(outputFile));
for (File file : files) {
    merger.addSource(file);
}
merger.mergeDocuments();
return outputFile;
{code}
[jira] [Commented] (PDFBOX-2505) ArrayIndexOutOfBoundsException in PDColor constructor
[ https://issues.apache.org/jira/browse/PDFBOX-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239866#comment-14239866 ]

ASF subversion and git services commented on PDFBOX-2505:
---------------------------------------------------------
Commit 1644156 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1644156 ]

PDFBOX-2505: fix parameter validation from previous commit

ArrayIndexOutOfBoundsException in PDColor constructor
-----------------------------------------------------
Key: PDFBOX-2505
URL: https://issues.apache.org/jira/browse/PDFBOX-2505
Project: PDFBox
Issue Type: Bug
Components: Rendering
Affects Versions: 2.0.0
Reporter: Tilman Hausherr
Fix For: 2.0.0
Attachments: PDFBOX-2505-032618-p96.pdf

{code}
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1
    at java.util.ArrayList.elementData(Unknown Source)
    at java.util.ArrayList.get(Unknown Source)
    at org.apache.pdfbox.cos.COSArray.get(COSArray.java:210)
    at org.apache.pdfbox.pdmodel.graphics.color.PDColor.<init>(PDColor.java:54)
    at org.apache.pdfbox.contentstream.operator.color.SetColor.process(SetColor.java:41)
    at org.apache.pdfbox.contentstream.operator.color.SetNonStrokingDeviceCMYKColor.process(SetNonStrokingDeviceCMYKColor.java:38)
{code}
The attached file has a "k" without arguments. This is only in 2.0, not in 1.8. In 1.8 SetNonStrokingCMYKColor initializes the array with size 4 (ok, it will crash if there are 5 arguments), in 2.0 SetNonStrokingDeviceCMYKColor / SetColor take what is there.

Two possible solutions in SetColor:
1) initialize components with the initial colors of the colorspace
2) initialize components with empty array

Both solutions get rid of the exception. Solution 2 is used in another constructor. Which one is better? (I'd prefer solution 1 because it has the correct array size and would also change the other constructor)
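The two solutions proposed in PDFBOX-2505 can be contrasted in a small standalone sketch. This is not the PDFBox code and not the committed fix: the class, the method names and the plain `List<Float>` operands are hypothetical stand-ins, and copying any operands that are present over the initial colour is an assumption made here for illustration. The initial DeviceCMYK colour of 0 0 0 1 is taken from the PDF specification.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper, not PDFBox API: builds colour components from operands. */
final class ColorComponentsSketch {

    /** Initial colour of DeviceCMYK per the PDF specification: 0 0 0 1 (black). */
    static final float[] DEVICE_CMYK_INITIAL = {0f, 0f, 0f, 1f};

    /**
     * Solution 1: start from the colour space's initial colour, then copy in
     * whatever operands are present (assumption). The result always has the
     * size the colour space expects; surplus operands are ignored.
     */
    static float[] fromInitialColor(List<Float> operands, float[] initial) {
        float[] components = Arrays.copyOf(initial, initial.length);
        int n = Math.min(operands.size(), components.length);
        for (int i = 0; i < n; i++) {
            components[i] = operands.get(i);
        }
        return components;
    }

    /** Solution 2: take what is there; no operands yield an empty array. */
    static float[] fromOperandsOnly(List<Float> operands) {
        float[] components = new float[operands.size()];
        for (int i = 0; i < components.length; i++) {
            components[i] = operands.get(i);
        }
        return components;
    }

    public static void main(String[] args) {
        List<Float> none = Collections.emptyList(); // a "k" operator without arguments
        System.out.println(Arrays.toString(fromInitialColor(none, DEVICE_CMYK_INITIAL))); // [0.0, 0.0, 0.0, 1.0]
        System.out.println(Arrays.toString(fromOperandsOnly(none)));                      // []
    }
}
```

Neither variant indexes into an empty operand list, so both avoid the exception; the first keeps the array length that later code may rely on, which is the argument made in the issue for preferring it.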
Re: preflight mass tests
Since you replied to the list, I'll answer here too: I don't know, I didn't try to display the failures.

Tilman

On 09.12.2014 at 10:59, Maruan Sahyoun wrote:
Hello Tilman, do you have a rough estimate of what share of the files is, e.g. in Adobe Reader, either not displayed, displayed with a dialog, or displayed incorrectly?
Kind regards
Maruan Sahyoun
Build failed in Jenkins: PDFBox-trunk » PDFBox parent #1511
See https://builds.apache.org/job/PDFBox-trunk/org.apache.pdfbox$pdfbox-parent/1511/ -- maven3-agent.jar already up to date maven3-interceptor.jar already up to date maven3-interceptor-commons.jar already up to date ===[JENKINS REMOTING CAPACITY]=== channel started Executing Maven: -B -f /home/jenkins/jenkins-slave/workspace/PDFBox-trunk/trunk/pom.xml -Dmaven.repo.local=/home/jenkins/jenkins-slave/maven-repositories/1 clean deploy -Ppedantic [INFO] Scanning for projects... [INFO] [INFO] Reactor Build Order: [INFO] [INFO] PDFBox parent [INFO] Apache FontBox [INFO] Apache XmpBox [INFO] Apache PDFBox [INFO] Apache Preflight [INFO] Apache Preflight application [INFO] Apache PDFBox tools [INFO] Apache PDFBox application [INFO] Apache PDFBox examples [INFO] PDFBox reactor [INFO] [INFO] [INFO] Building PDFBox parent 2.0.0-SNAPSHOT [INFO] [INFO] [INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ pdfbox-parent --- [TASKS] Scanning folder 'https://builds.apache.org/job/PDFBox-trunk/org.apache.pdfbox$pdfbox-parent/ws/' for files matching the pattern '**/*.java' - excludes: [TASKS] Found 0 files to scan for tasks Found 0 open tasks. [TASKS] Computing warning deltas based on reference build #1510 log4j:WARN No appenders could be found for logger (org.apache.commons.beanutils.converters.BooleanConverter). log4j:WARN Please initialize the log4j system properly. [INFO] [INFO] --- maven-remote-resources-plugin:1.5:process (default) @ pdfbox-parent --- [INFO] [INFO] --- maven-site-plugin:3.3:attach-descriptor (attach-descriptor) @ pdfbox-parent --- [INFO] [INFO] --- apache-rat-plugin:0.10:check (default) @ pdfbox-parent --- [INFO] 51 implicit excludes (use -debug for more details). [INFO] Exclude: release.properties [INFO] 1 resources included (use -debug for more details) [INFO] Rat check: Summary of files. Unapproved: 0 unknown: 0 generated: 0 approved: 1 licence. 
[INFO] [INFO] --- maven-install-plugin:2.5.1:install (default-install) @ pdfbox-parent --- [INFO] Installing https://builds.apache.org/job/PDFBox-trunk/org.apache.pdfbox$pdfbox-parent/ws/pom.xml to /home/jenkins/jenkins-slave/maven-repositories/1/org/apache/pdfbox/pdfbox-parent/2.0.0-SNAPSHOT/pdfbox-parent-2.0.0-SNAPSHOT.pom [INFO] [INFO] --- maven-deploy-plugin:2.8.1:deploy (default-deploy) @ pdfbox-parent --- Downloading: https://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-parent/2.0.0-SNAPSHOT/maven-metadata.xml [WARNING] Could not transfer metadata org.apache.pdfbox:pdfbox-parent:2.0.0-SNAPSHOT/maven-metadata.xml from/to apache.snapshots.https (https://repository.apache.org/content/repositories/snapshots): repository.apache.org
Build failed in Jenkins: PDFBox-trunk #1511
See https://builds.apache.org/job/PDFBox-trunk/1511/changes Changes: [tilman] PDFBOX-2505: fix parameter validation from previous commit -- Started by an SCM change Building remotely on ubuntu-4 (docker Ubuntu ubuntu4 ubuntu) in workspace https://builds.apache.org/job/PDFBox-trunk/ws/ Cleaning up https://builds.apache.org/job/PDFBox-trunk/ws/trunk Deleting https://builds.apache.org/job/PDFBox-trunk/ws/trunk/target Deleting https://builds.apache.org/job/PDFBox-trunk/ws/trunk/app/target Deleting https://builds.apache.org/job/PDFBox-trunk/ws/trunk/preflight-app/target Deleting https://builds.apache.org/job/PDFBox-trunk/ws/trunk/examples/target Deleting https://builds.apache.org/job/PDFBox-trunk/ws/trunk/parent/target Deleting https://builds.apache.org/job/PDFBox-trunk/ws/trunk/xmpbox/target Deleting https://builds.apache.org/job/PDFBox-trunk/ws/trunk/preflight/target Deleting https://builds.apache.org/job/PDFBox-trunk/ws/trunk/pdfbox/target Deleting https://builds.apache.org/job/PDFBox-trunk/ws/trunk/tools/target Deleting https://builds.apache.org/job/PDFBox-trunk/ws/trunk/fontbox/target Updating http://svn.apache.org/repos/asf/pdfbox/trunk at revision '2014-12-09T20:58:15.682 +' U pdfbox/src/main/java/org/apache/pdfbox/contentstream/operator/state/SetLineWidth.java At revision 1644175 Parsing POMs maven3-agent.jar already up to date maven3-interceptor.jar already up to date maven3-interceptor-commons.jar already up to date [trunk] $ /home/jenkins/tools/java/jdk1.6.0_20-32-unlimited-security/bin/java -Xmx1g -XX:MaxPermSize=300m -cp /home/jenkins/jenkins-slave/maven3-agent.jar:/home/jenkins/jenkins-slave/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.0.5/boot/plexus-classworlds-2.4.jar org.jvnet.hudson.maven3.agent.Maven3Main /home/jenkins/jenkins-slave/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.0.5 /home/jenkins/jenkins-slave/slave.jar /home/jenkins/jenkins-slave/maven3-interceptor.jar /home/jenkins/jenkins-slave/maven3-interceptor-commons.jar 37561 
===[JENKINS REMOTING CAPACITY]=== channel started Executing Maven: -B -f https://builds.apache.org/job/PDFBox-trunk/ws/trunk/pom.xml -Dmaven.repo.local=/home/jenkins/jenkins-slave/maven-repositories/1 clean deploy -Ppedantic [INFO] Scanning for projects... [INFO] [INFO] Reactor Build Order: [INFO] [INFO] PDFBox parent [INFO] Apache FontBox [INFO] Apache XmpBox [INFO] Apache PDFBox [INFO] Apache Preflight [INFO] Apache Preflight application [INFO] Apache PDFBox tools [INFO] Apache PDFBox application [INFO] Apache PDFBox examples [INFO] PDFBox reactor [INFO] [INFO] [INFO] Building PDFBox parent 2.0.0-SNAPSHOT [INFO] [INFO] [INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ pdfbox-parent --- [TASKS] Scanning folder 'https://builds.apache.org/job/PDFBox-trunk/ws/trunk/parent' for files matching the pattern '**/*.java' - excludes: [TASKS] Found 0 files to scan for tasks Found 0 open tasks. [TASKS] Computing warning deltas based on reference build #1510 log4j:WARN No appenders could be found for logger (org.apache.commons.beanutils.converters.BooleanConverter). log4j:WARN Please initialize the log4j system properly. [INFO] [INFO] --- maven-remote-resources-plugin:1.5:process (default) @ pdfbox-parent --- [INFO] [INFO] --- maven-site-plugin:3.3:attach-descriptor (attach-descriptor) @ pdfbox-parent --- [INFO] [INFO] --- apache-rat-plugin:0.10:check (default) @ pdfbox-parent --- [INFO] 51 implicit excludes (use -debug for more details). [INFO] Exclude: release.properties [INFO] 1 resources included (use -debug for more details) [INFO] Rat check: Summary of files. Unapproved: 0 unknown: 0 generated: 0 approved: 1 licence. 
[INFO] [INFO] --- maven-install-plugin:2.5.1:install (default-install) @ pdfbox-parent --- [INFO] Installing https://builds.apache.org/job/PDFBox-trunk/ws/trunk/parent/pom.xml to /home/jenkins/jenkins-slave/maven-repositories/1/org/apache/pdfbox/pdfbox-parent/2.0.0-SNAPSHOT/pdfbox-parent-2.0.0-SNAPSHOT.pom [INFO] [INFO] --- maven-deploy-plugin:2.8.1:deploy (default-deploy) @ pdfbox-parent --- Downloading: https://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-parent/2.0.0-SNAPSHOT/maven-metadata.xml [WARNING] Could not transfer metadata org.apache.pdfbox:pdfbox-parent:2.0.0-SNAPSHOT/maven-metadata.xml from/to apache.snapshots.https (https://repository.apache.org/content/repositories/snapshots): repository.apache.org [INFO] [INFO] Reactor Summary: [INFO] [INFO] PDFBox parent . FAILURE [1:00.832s] [INFO] Apache FontBox
[jira] [Assigned] (PDFBOX-1242) Handle non ISO-8859-1 chars with drawString
[ https://issues.apache.org/jira/browse/PDFBOX-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson reassigned PDFBOX-1242:
----------------------------------
Assignee: John Hewson

Handle non ISO-8859-1 chars with drawString
-------------------------------------------
Key: PDFBOX-1242
URL: https://issues.apache.org/jira/browse/PDFBOX-1242
Project: PDFBox
Issue Type: Bug
Components: Writing
Affects Versions: 1.5.0, 1.6.0
Reporter: Peter Andersen
Assignee: John Hewson
Fix For: 2.0.0

PDPageContentStream.drawString takes a String as its argument and constructs a COSString from the input. If the input contains chars above 255, the COSString is prefixed with 0xFE, 0xFF and the bytes are taken from the input as UTF-16BE encoded. Back in the drawString method, this UTF-16 encoded COSString is appended as ISO-8859-1:

appendRawCommands( new String( buffer.toByteArray(), "ISO-8859-1"));

The result is that a line with UTF-16 chars is shown prefixed with þÿ, and with a double space between the other chars. The chars above 255 are shown as the two corresponding ISO-8859-1 characters.

As a side question to this observation: is there an alternative way to use PDFBox to support UTF-16?
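The symptom described in PDFBOX-1242 can be reproduced with the JDK alone. The sketch below does not call PDFBox; it only imitates the two steps the report describes (UTF-16BE bytes behind an 0xFE 0xFF prefix, then decoding those bytes as ISO-8859-1). The class and method names are invented for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Illustrative only: mimics the encode/decode mismatch described in the report. */
final class Utf16AsLatin1Demo {

    /** Encodes text as BOM + UTF-16BE, then wrongly decodes the bytes as ISO-8859-1. */
    static String misdecode(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        buffer.write(0xFE); // byte order mark, first byte
        buffer.write(0xFF); // byte order mark, second byte
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16BE);
        buffer.write(utf16, 0, utf16.length);
        // One character per byte: this is the step that garbles the text.
        return new String(buffer.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String shown = misdecode("A\u20AC"); // "A" followed by the euro sign
        for (char c : shown.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
        // prints: U+00FE U+00FF U+0000 U+0041 U+0020 U+00AC
    }
}
```

U+00FE U+00FF is the "þÿ" prefix from the report, the U+0000 in front of "A" accounts for the extra spacing between ordinary characters, and the euro sign (above 255) turns into two unrelated Latin-1 characters.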
[jira] [Commented] (PDFBOX-1242) Handle non ISO-8859-1 chars with drawString
[ https://issues.apache.org/jira/browse/PDFBOX-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240161#comment-14240161 ]

John Hewson commented on PDFBOX-1242:
-------------------------------------
Yes, this is a bug in PDFBox, but it's one we know about already.

What does the code you've posted do? Please use file attachments to post code, JIRA uses markup so your code is unusable when posted In this manner. You can delete the previous comment an attach it as a fils with More > Attach Files.
[jira] [Comment Edited] (PDFBOX-1242) Handle non ISO-8859-1 chars with drawString
[ https://issues.apache.org/jira/browse/PDFBOX-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240161#comment-14240161 ]

John Hewson edited comment on PDFBOX-1242 at 12/9/14 10:21 PM:
---------------------------------------------------------------
Yes, this is a bug in PDFBox, but it's one we know about already.

What does the code you've posted do? Please use file attachments to post code, JIRA uses markup so your code is unusable when posted in this manner. You can delete the previous comment and attach it as a file with More > Attach Files.

was (Author: jahewson):
Yes, this is a bug in PDFBox, but it's one we know about already.

What does the code you've posted do? Please use file attachments to post code, JIRA uses markup so your code is unusable when posted In this manner. You can delete the previous comment an attach it as a fils with More > Attach Files.
[jira] [Commented] (PDFBOX-1242) Handle non ISO-8859-1 chars with drawString
[ https://issues.apache.org/jira/browse/PDFBOX-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240184#comment-14240184 ]

John Hewson commented on PDFBOX-1242:
-------------------------------------
This is not a duplicate of PDFBOX-922, however both need to be fixed before Unicode text will work.
[jira] [Comment Edited] (PDFBOX-922) True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)
[ https://issues.apache.org/jira/browse/PDFBOX-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031695#comment-14031695 ]

John Hewson edited comment on PDFBOX-922 at 12/9/14 11:09 PM:
--------------------------------------------------------------
{quote}
drawString() in PDPageContentStream just writes the text into PDF as any COSString would choose to represent it. This is not the right thing to do. When the font is a CID keyed font, every glyph is 16 bit wide by definition, and COSString won't necessarily notice and write it correctly.
{quote}

Not quite: every CID can be up to 16-bits wide, but many (or for 256 glyphs, all) will fit inside 8 bits. The byte-width of a string is controlled by -whether or not it starts with a BOM, not which font it uses- the current font's CMap but is always 16-bits with TTF.

Therefore, drawString() must know what font is currently being drawn, and ask that font to encode the String to whatever byte sequence it takes to draw those glyphs. So, PDFont must be added to the drawString() API, and PDFont ought to have a method for public byte[] encode(String).

drawString() is only valid after setFont() has been called, so it doesn't need adding to the API, we can just use the current font. PDFont#encode is a good idea, yes.

{quote}
PDFont needs a clearly specified API which performs java String to font-specific encoding transformation.
{quote}

Yes, as above.

{quote}
Observe that there are no methods in PDFont called decode(), and I have a hard time figuring out what any one of these methods actually do, because everything seems to be called encode or lookup. It seems that the encode(byte[], int int) performs decoding, so it should be renamed such.
{quote}

Yes, I don't know if anybody knows what those methods are actually doing, including the original author.

{quote}
In general I'd recommend pushing the encode/decode job down to the font layer. Provide just two methods: byte[] encode(String) and String decode(byte[]). Their job is to convert between the byte sequences required by that font and java Strings, and they handle full runs of text, not just single characters. They will then use single- or multibyte encodings as the font requires without the higher level having to do crazy stuff like processEncodedText() currently does in PDFStreamEngine.
{quote}

processEncodedText() is indeed crazy and needs fixing, but what you propose won't work because the 16-bit string encoding is not set by the font, it's set on a per-string basis by having that string start with a BOM.

{quote}
There are unfortunately very many ways to encode text in PDF, and especially if text needs to be decodable from the byte stream generated by other programs, the full complexity must be faced and implemented. These are to be solved in a case-by-case basis in the PDFont hierarchy. The PDFont highest class methods for encode and decode should be defined as abstract to reflect the fact that encoding depends on the particular subtype of the font.
{quote}

Yes, though as far as decoding the correct text is concerned all you have to do is make sure that the ToUnicode map is built correctly - you can put any old garbage in the actual strings (and many PDFs do).

{quote}
It may be that for some of these fonts the implementation is same because the actual mechanics can be handled by varying the Encoding instance, though.
{quote}

Maybe, though the Encoding class is for Type1 fonts (and equivalent, e.g. Type1C) only.

was (Author: jahewson):
{quote}
drawString() in PDPageContentStream just writes the text into PDF as any COSString would choose to represent it. This is not the right thing to do. When the font is a CID keyed font, every glyph is 16 bit wide by definition, and COSString won't necessarily notice and write it correctly.
{quote}

Not quite: every CID can be up to 16-bits wide, but many (or for 256 glyphs, all) will fit inside 8 bits. The byte-width of a string is controlled by whether or not it starts with a BOM, not which font it uses.

Therefore, drawString() must know what font is currently being drawn, and ask that font to encode the String to whatever byte sequence it takes to draw those glyphs. So, PDFont must be added to the drawString() API, and PDFont ought to have a method for public byte[] encode(String).

drawString() is only valid after setFont() has been called, so it doesn't need adding to the API, we can just use the current font. PDFont#encode is a good idea, yes.

{quote}
PDFont needs a clearly specified API which performs java String to font-specific encoding transformation.
{quote}

Yes, as above.

{quote}
Observe that there are no methods in PDFont called decode(), and I have a hard time figuring out what any one of these methods actually do, because everything seems to be called encode or lookup. It seems that the encode(byte[], int int) performs decoding, so it should be
[jira] [Updated] (PDFBOX-922) True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)
[ https://issues.apache.org/jira/browse/PDFBOX-922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-922:
-------------------------------------
Assignee: (was: Andreas Lehmkühler)

True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)
--------------------------------------------------------------------
Key: PDFBOX-922
URL: https://issues.apache.org/jira/browse/PDFBOX-922
Project: PDFBox
Issue Type: New Feature
Components: Writing
Affects Versions: 1.3.1
Environment: JDK 1.6 / OS irrelevant, tried against 1.3.1 and 1.2.0
Reporter: Thanos Agelatos
Priority: Blocker
Fix For: 2.0.0
Attachments: pdfbox-unicode.diff, pdfbox-unicode2.diff

PDFBox cannot embed Identity-H or Identity-V type TTF fonts in the PDF it creates, making it impossible to create PDFs in any language apart from English and ones supported in WinAnsiEncoding. This behaviour is caused because method PDTrueTypeFont.loadTTF has hardcoded WinAnsiEncoding inside, and there are no Identity-H or Identity-V Encoding classes provided (to set afterwards via PDFont.setFont() ).

This excludes the following languages plus many others:
- Greek
- Bulgarian
- Swedish
- Baltic languages
- Maltese

The PDF created contains garbled characters and/or squares.

Simple test case:
{code}
PDDocument doc = null;
try {
    doc = new PDDocument();
    PDPage page = new PDPage();
    doc.addPage(page);
    // extract fonts for fields
    byte[] arialNorm = extractFont("arial.ttf");
    //byte[] arialBold = extractFont("arialbd.ttf");
    //PDFont font = PDType1Font.HELVETICA;
    PDFont font = PDTrueTypeFont.loadTTF(doc, new ByteArrayInputStream(arialNorm));
    PDPageContentStream contentStream = new PDPageContentStream(doc, page);
    contentStream.beginText();
    contentStream.setFont(font, 12);
    contentStream.moveTextPositionByAmount(100, 700);
    contentStream.drawString("Hello world from PDFBox ελληνικά"); // text here may appear garbled; insert any text in Greek or Bulgarian or Maltese
    contentStream.endText();
    contentStream.close();
    doc.save("pdfbox.pdf");
    System.out.println(" created!");
} catch (Exception ioe) {
    ioe.printStackTrace();
} finally {
    if (doc != null) {
        try {
            doc.close();
        } catch (Exception e) {}
    }
}
{code}
Re: svn: E185004: Unexpected end of svndiff input
I've created a ticket (INFRA-8846) as discussed on HipChat.

BR
Andreas Lehmkühler

On 09.12.2014 at 19:31, Andreas Lehmkuehler wrote:
Hi, I've got the following error when I try to commit the PDFBox release candidate to the dist repo [1]:

svn: E185004: Unexpected end of svndiff input

The issue seems to be related to big files only, as I was able to commit the smaller files (< 1 MB). The email notification doesn't work either. Can someone please have a look, as I'm in the middle of the release process.

Thanks in advance
Andreas Lehmkühler

[1] https://dist.apache.org/repos/dist/dev/pdfbox/1.8.8/
[jira] [Commented] (PDFBOX-1242) Handle non ISO-8859-1 chars with drawString
[ https://issues.apache.org/jira/browse/PDFBOX-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240282#comment-14240282 ]

Glen Peterson commented on PDFBOX-1242:
---------------------------------------
If I remember correctly, the PDF file format uses its own very special 14-bit character encoding. If you use anything outside of what the PDF spec calls WinAnsi, you may have to embed a font that handles those characters in the PDF file to ensure readability.

I have not submitted any patches, nor am I likely to any time soon. What I did submit was a very partial work-around. The mangled code above is now publicly available under the Apache 2.0 license on GitHub, where it should be much more readable.

There is a Unicode to WinAnsi translation table here (I'll explain in a moment):
https://github.com/GlenKPeterson/PdfLayoutManager/blob/master/src/main/java/com/planbase/pdf/layoutmanager/PdfLayoutMgr.java#L651

The code that uses that table is here:
https://github.com/GlenKPeterson/PdfLayoutManager/blob/master/src/main/java/com/planbase/pdf/layoutmanager/PdfLayoutMgr.java#L972

High-level overview for each input character:
1. The characters up to 127 are the same in UTF-16 and ISO-8859-1, so it leaves them unchanged.
2. If one of the higher than 127 input UTF-16 characters has an ISO-8859-1 equivalent, it is converted directly/exactly.
3. If the input character is Cyrillic, there are somewhat standard, Romanized transliterations, where you can substitute one or more Roman characters that have a similar phonetic sound to the Cyrillic character. So this lets us support an additional set of languages (Russian in particular) without embedding any fonts or otherwise dealing with the root issue.
4. If the above rules do not cover the character in question, a bullet is written to the output stream, so that the end user can see that there is a character there that didn't print.

OK, so I lied. The while loop at line 1006 doesn't actually work one character at a time. It finds instances of characters that need to be substituted. Then it copies what chunks of raw input it can to the output unchanged. It only drops to a character-by-character algorithm when it finds a character that actually needs to be substituted. This means that any length string of modern English characters will pass through unchanged.

Most of that is in comments in the code on GitHub, but is probably easier to read knowing this overview. I hope that helps.
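The four-step substitution outlined in Glen Peterson's comment can be sketched roughly as follows. This is not the PdfLayoutManager code: the class name, the tiny three-letter transliteration table and the plain character-by-character loop are simplifications assumed here for illustration (the comment says the real code copies unchanged chunks in bulk), and treating U+00A0 to U+00FF as "has an ISO-8859-1 equivalent" is likewise an assumption.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: a simplified version of the substitution steps described above. */
final class WinAnsiFallbackSketch {

    private static final char BULLET = '\u2022';

    /** Hypothetical, deliberately tiny transliteration table. */
    private static final Map<Character, String> CYRILLIC = new HashMap<Character, String>();
    static {
        CYRILLIC.put('\u0416', "Zh"); // Cyrillic capital ZHE
        CYRILLIC.put('\u0443', "u");  // Cyrillic small U
        CYRILLIC.put('\u043A', "k");  // Cyrillic small KA
    }

    static String substitute(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c <= 127) {
                out.append(c);               // step 1: ASCII is left unchanged
            } else if (c >= 0xA0 && c <= 0xFF) {
                out.append(c);               // step 2: printable ISO-8859-1 maps exactly
            } else if (CYRILLIC.containsKey(c)) {
                out.append(CYRILLIC.get(c)); // step 3: Romanized transliteration
            } else {
                out.append(BULLET);          // step 4: visible placeholder
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(substitute("Hello"));              // Hello
        System.out.println(substitute("\u0416\u0443\u043A")); // Zhuk
    }
}
```

A character outside all three tables, such as a CJK ideograph, comes out as a single bullet, so the reader can at least see that something was dropped.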