[jira] [Commented] (PDFBOX-1915) Implement shading with Coons and tensor-product patch meshes
[ https://issues.apache.org/jira/browse/PDFBOX-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045578#comment-14045578 ]

Tilman Hausherr commented on PDFBOX-1915:

I see you've been active with refactoring, this is good :-)

Yes, the javadocs should be done too. They don't have to be long, but they should summarize what's being done, and how the class is used by other classes if it is an interface or an abstract class. Nobody likes to do it, but the longer you wait, the more annoying it becomes, so don't wait :-)

Private classes are not required to have a javadoc, but they should have an explanation if it isn't obvious from the code, or at least a hint of what's being done. E.g. getLen = length of a line. isEdgeALine - would like to know.

If you used a Wikipedia article or an online paper as help, include the link where applicable. I couldn't have done types 1, 4, 5 without Wikipedia and one university course resource :-)

- CoordinateColorPair.java: classes that are used only by you don't have to be public. Just leave out the keyword public.
- PatchMeshesShadingContext.java readPatch(): if I remember this correctly, an EOF at that place is a bug in the source file (the flag was read successfully but not the rest), thus LOG.ERROR.
- that decode line you commented out: just remove it
- setLevel: this isn't a setter, it isn't a getter, I suggest you name it calculateLevel or whatever. I assume this is what we discussed here, i.e. you're deciding how far you'll chop the patch into triangles depending on the size of the patch.
- remove classes that have a comment "This class is not used" :-) Just delete them.

I'll run the code tonight and/or this weekend and give more feedback.
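For readers following the calculateLevel suggestion, here is a minimal sketch of how a subdivision level could be derived from the size of a patch. It is not the code under review: the class name, the parameters `pixelsPerSegment` and `maxLevel`, and the doubling rule are all assumptions made for illustration. Only the names getLen and calculateLevel come from the comment above.

```java
// Hypothetical sketch, not taken from the PDFBox sources.
final class PatchLevelSketch {
    private PatchLevelSketch() {
    }

    /** Length of the line from (x1, y1) to (x2, y2), in the spirit of the getLen helper. */
    static double getLen(double x1, double y1, double x2, double y2) {
        return Math.hypot(x2 - x1, y2 - y1);
    }

    /**
     * Chooses how often an edge is halved. Each level doubles the number of
     * segments; subdivision stops once a segment is no longer than
     * pixelsPerSegment (assumed tuning value) or maxLevel is reached.
     */
    static int calculateLevel(double edgeLength, double pixelsPerSegment, int maxLevel) {
        int level = 0;
        double segments = 1;
        while (level < maxLevel && edgeLength / segments > pixelsPerSegment) {
            segments *= 2;
            level++;
        }
        return level;
    }

    public static void main(String[] args) {
        double edge = getLen(0, 0, 30, 40);
        System.out.println("level = " + calculateLevel(edge, 10, 4));
    }
}
```

Naming the method after what it computes, rather than setLevel, also makes it clear that it has no side effect on the object.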
https://www.youtube.com/watch?v=TiqDqd-1pwU
Watch at 29:00: he shows the patch on the top right of the file TENSOR.PDF, and it seems you did the correct implementation :-) The stuff at 22:00 is also about shading, but I feel that this goes beyond the scope of this project. I don't even know if we support knockout transparency groups; we just started with transparency groups a week ago or so.

Implement shading with Coons and tensor-product patch meshes

Key: PDFBOX-1915
URL: https://issues.apache.org/jira/browse/PDFBOX-1915
Project: PDFBox
Issue Type: Improvement
Components: Rendering
Affects Versions: 1.8.5, 1.8.6, 2.0.0
Reporter: Tilman Hausherr
Assignee: Shaola Ren
Labels: graphical, gsoc2014, java, math, shading
Fix For: 2.0.0
Attachments: CIB-coons-vs-tensormesh.pdf, CIB-coonsmesh.pdf, CONICAL.pdf, GWG060_Shading_x1a.pdf, GWG060_Shading_x1a_1.png, HSBWHEEL.pdf, McAfee-ShadingType7.pdf, Shadingtype6week1.pdf, TENSOR.pdf, XYZsweep.pdf, _gwg060_shading_x1a.pdf-1.png, _mcafee-shadingtype7.pdf-1.png, asy-coons-but-really-tensor.pdf, asy-tensor-rainbow.pdf, asy-tensor.pdf, coons-function.pdf, coons-function.ps, coons-nofunction-CMYK.pdf, coons-nofunction-CMYK.ps, coons-nofunction-Duotone.pdf, coons-nofunction-Duotone.ps, coons-nofunction-Gray.pdf, coons-nofunction-Gray.ps, coons-nofunction-RGB.pdf, coons-nofunction-RGB.ps, coons2-function.pdf, coons2-function.ps, coons4-function.ps, crestron-p9.pdf, eci_altona-test-suite-v2_technical_H.pdf, failedTest.rar, lamp_cairo.pdf, lamp_cairo7_0.png, lamp_cairo7_1.png, lamp_cairo7_1.png, lineRasterization.jpg, mcafeeU5.pdf, mcafeeU5_1.png, mcafeeu5.pdf-1.png, pass4FlagTest.rar, patchCases.jpg, patchMap.jpg, shading6ContourTest.rar, shading6Done.rar, shading7.rar, tensor-nofunction-RGB.pdf, tensor-nofunction-RGB.ps, tensor-nofunction-RGB_1.png, tensor4-nofunction.pdf, tensor4-nofunction.ps, tensor4-nofunction_1.png, updateshading6ContourTest.rar

Of the seven shading methods described in the PDF specification, type 6 (Coons patch meshes) and type 7 (Tensor-product patch meshes) haven't been implemented. I have done type 1, 4 and 5, but I don't know the math for type 6 and 7. My math days are decades away.

Knowledge prerequisites:
- Java, although you don't have to be a Java ace, just feel comfortable
- math: you should know what cubic Bézier curves, degenerate Bézier curves, bilinear interpolation, tensor-product, affine transform matrix and Bernstein polynomials are, or be able to learn it
- maven (basic)
- svn (basic)
- an IDE like Netbeans or Eclipse or IntelliJ (basic)
- ideally, you are either a math student who likes to program, or a computer science student who is specializing in graphics.

A first look at PDFBOX: try the command utility here: https://pdfbox.apache.org/commandline/#pdfToImage and use your favorite PDF, or the PDFs mentioned in PDFBOX-615, these have the shading types that are already
[jira] [Commented] (PDFBOX-2158) ExtractText missing most of text in this PDF file, due to font bounding box with minus infinity
[ https://issues.apache.org/jira/browse/PDFBOX-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045582#comment-14045582 ]

John Hewson commented on PDFBOX-2158:

Tilman, shouldn't r1605545 have been a one-line fix?
{code}
if (pdImage.isStencil() && (decode.length != 2 ||
{code}
Could be changed to:
{code}
if (pdImage.isStencil() && (decode.length < 2 ||
{code}

ExtractText missing most of text in this PDF file, due to font bounding box with minus infinity

Key: PDFBOX-2158
URL: https://issues.apache.org/jira/browse/PDFBOX-2158
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.8.5
Environment: Windows x64
Reporter: Joel Hirsh
Attachments: negative.text.box.pdf

Attached PDF file is missing most of the text when processed by the ExtractText example program.

I traced it down to PDFontDescriptorDictionary.getFontBoundingBox() getting a rectangle for COSName.FONT_BBOX that contained a ymin value of minus infinity. That method then creates a PDRectangle which calculates a bounding box with a ymin value of -65,329, and results in an enormous text size, and things go downhill from there. The text cannot be matched up, and most of it ends up being discarded.

I was able to hack a fix by doing a check in the constructor PDRectangle.PDRectangle( COSArray array ) for big negative numbers and setting them to 0. With that change, all the text came through as expected. However, I don't have enough familiarity with the code to understand what a real fix ought to look like.

The PDF file looks fine in other programs such as Acrobat and NitroPDF.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (PDFBOX-2158) ExtractText missing most of text in this PDF file, due to font bounding box with minus infinity
[ https://issues.apache.org/jira/browse/PDFBOX-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045590#comment-14045590 ]

Tilman Hausherr commented on PDFBOX-2158:

The problem with that file wasn't just that the decode array had the wrong size; there were null objects in it. These null objects caused an NPE in toFloat(), so the part that you quote wasn't even reached :-( . My solution is to hope that there are at least two entries, and that the first two entries are floats, and to get them.
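The "take the first two entries if they are usable" idea can be sketched without any PDFBox types. The snippet below is only an illustration under stated assumptions: a plain `java.util.List` stands in for the COSArray, the class and method names are invented here, and the fallback array is supplied by the caller. It is not the code committed in r1605545.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the decode array handling; not PDFBox code.
final class DecodeArraySketch {
    private DecodeArraySketch() {
    }

    /**
     * Returns the first two entries as floats. Falls back to a copy of the
     * given default when the list is missing, too short, or when either of
     * the first two entries is null or not a number.
     */
    static float[] firstTwoOrDefault(List<?> decode, float[] fallback) {
        if (decode == null || decode.size() < 2) {
            return fallback.clone();
        }
        Object first = decode.get(0);
        Object second = decode.get(1);
        if (first instanceof Number && second instanceof Number) {
            return new float[] { ((Number) first).floatValue(), ((Number) second).floatValue() };
        }
        return fallback.clone();
    }

    public static void main(String[] args) {
        // Same shape as the array in the log: usable numbers first, nulls at the end.
        List<Float> decode = Arrays.asList(1.0f, 0.0f, 0.0f, null, null);
        System.out.println(Arrays.toString(firstTwoOrDefault(decode, new float[] { 0f, 1f })));
    }
}
```

Checking the type of each entry before converting it is what avoids the NullPointerException; the length check alone would not have been reached in time.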
[jira] [Commented] (PDFBOX-2158) ExtractText missing most of text in this PDF file, due to font bounding box with minus infinity
[ https://issues.apache.org/jira/browse/PDFBOX-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045591#comment-14045591 ]

Tilman Hausherr commented on PDFBOX-2158:

This is the current log output for that file:

27.06.2014 08:15:07.476 WARN [main] org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader:436 - decode array COSArray{[COSFloat{1.0}, COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, COSNull{}, COSNull{}, COSNull{}, COSNull{}, COSNull{}, COSNull{}, COSNull{}, COSNull{}]} not compatible with color space, using the first two entries
27.06.2014 08:15:07.477 WARN [main] org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader:436 - decode array COSArray{[COSFloat{1.0}, COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, COSNull{}, COSNull{}, COSNull{}, COSNull{}, COSNull{}, COSNull{}, COSNull{}, COSNull{}]} not compatible with color space, using the first two entries
27.06.2014 08:15:08.909 WARN [main] org.apache.fontbox.cff.Type1CharString:338 - rlineTo without initial moveTo in font ArialMT, glyph AE
[jira] [Comment Edited] (PDFBOX-2158) ExtractText missing most of text in this PDF file, due to font bounding box with minus infinity
[ https://issues.apache.org/jira/browse/PDFBOX-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045591#comment-14045591 ]

Tilman Hausherr edited comment on PDFBOX-2158 at 6/27/14 6:15 AM:

This is the current log output for that file:
{code}
27.06.2014 08:15:07.476 WARN [main] org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader:436 - decode array COSArray{[COSFloat{1.0}, COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, COSNull{}, COSNull{}, COSNull{}, COSNull{}, COSNull{}, COSNull{}, COSNull{}, COSNull{}]} not compatible with color space, using the first two entries
27.06.2014 08:15:07.477 WARN [main] org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader:436 - decode array COSArray{[COSFloat{1.0}, COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, COSNull{}, COSNull{}, COSNull{}, COSNull{}, COSNull{}, COSNull{}, COSNull{}, COSNull{}]} not compatible with color space, using the first two entries
27.06.2014 08:15:08.909 WARN [main] org.apache.fontbox.cff.Type1CharString:338 - rlineTo without initial moveTo in font ArialMT, glyph AE
{code}
TIKA-1300
Please look at TIKA-1300 https://issues.apache.org/jira/browse/TIKA-1300, it is about the PDFBox sequential parser vs. the non-sequential parser.
[jira] [Comment Edited] (PDFBOX-2158) ExtractText missing most of text in this PDF file, due to font bounding box with minus infinity
[ https://issues.apache.org/jira/browse/PDFBOX-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045590#comment-14045590 ]

Tilman Hausherr edited comment on PDFBOX-2158 at 6/27/14 6:25 AM:

The problem with that file wasn't just that the decode array had the wrong size; there were null objects in it. These null objects caused an NPE in toFloat(), so the part that you quote wasn't even reached :-( . My solution is to hope that there are at least two entries, and that the first two entries are numbers, and to get them.
[jira] [Commented] (PDFBOX-2158) ExtractText missing most of text in this PDF file, due to font bounding box with minus infinity
[ https://issues.apache.org/jira/browse/PDFBOX-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045608#comment-14045608 ]

John Hewson commented on PDFBOX-2158:

{quote}
The problem with that file wasn't just that the decode array had the wrong size; there were null objects in it. These null objects caused an NPE in toFloat(), so the part that you quote wasn't even reached. My solution is to hope that there are at least two entries, and that the first two entries are numbers, and to get them.
{quote}

Ah ok, I had wondered if I'd missed something.
[jira] [Commented] (PDFBOX-2158) ExtractText missing most of text in this PDF file, due to font bounding box with minus infinity
[ https://issues.apache.org/jira/browse/PDFBOX-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045610#comment-14045610 ]

John Hewson commented on PDFBOX-2158:

Looking at the font problem: as noted, the FontBBox contains -65329, but I see that the Descent is also far too large: 65324. It seems to me that these two numbers are two's complements of what were originally signed numbers; if so, the original values would have been plausible:
{code}
Descent = -(65324 - 65536) = 212
FontBBox = -(-65329 + 65536) = -207
{code}
We could detect this by looking for values with Math.abs(value) > 32767 and apply the equations above. Alternatively, we could just use the values from the font whenever they are available and only fall back to the FontDescriptor if the font is missing.
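The detection rule and the two equations from the comment can be sketched as a single helper. This is only an illustration: the class and method names are invented here, and the FontBBox input is the -65329 quoted in the comment text and the issue description. Whether PDFBox should apply such a correction at all is the open question of the issue.

```java
// Hypothetical helper; it mirrors the two equations from the comment.
final class FontMetricWrapSketch {
    private FontMetricWrapSketch() {
    }

    /**
     * Values within the signed 16-bit range are returned unchanged. Larger
     * magnitudes are treated as wrapped by 65536 and mapped back:
     * positive v becomes -(v - 65536), negative v becomes -(v + 65536).
     */
    static int unwrap16(int value) {
        if (Math.abs(value) <= 32767) {
            return value;
        }
        return value > 0 ? -(value - 65536) : -(value + 65536);
    }

    public static void main(String[] args) {
        System.out.println("Descent  = " + unwrap16(65324));
        System.out.println("FontBBox = " + unwrap16(-65329));
    }
}
```

The alternative mentioned in the comment, reading the metrics from the embedded font and using the FontDescriptor only as a fallback, avoids guessing at what a broken producer did.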
[jira] [Comment Edited] (PDFBOX-2158) ExtractText missing most of text in this PDF file, due to font bounding box with minus infinity
[ https://issues.apache.org/jira/browse/PDFBOX-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045610#comment-14045610 ]

John Hewson edited comment on PDFBOX-2158 at 6/27/14 6:35 AM:

Looking at the font problem: as noted, the FontBBox contains -65329, but I see that the Descent is also far too large: 65324. It seems to me that these two numbers are two's complements of what were originally signed numbers; if so, the original values would have been plausible:
{code}
Descent = -(65324 - 65536) = 212
FontBBox = -(-65329 + 65536) = -207
{code}
We could detect this by looking for values with Math.abs(value) > 32767 and apply the equations above. Alternatively, we could just use the values from the font whenever they are available and only fall back to the FontDescriptor if the font file is missing.
Re: Improving OCR plugin for PDFBox
Hi Dimuthu

That’s great. We should wait until closer to the end of the GSoC period to integrate your work with PDFBox, as ideally we only want to have to do it once. We’ve not included C++ dependencies before, so no, there won’t be a standard way; we’ll have to think something up. We could make it an optional sub-project, and the Tesseract JNI bindings might be better off having their own branch so that they are more like an external dependency - I’ll ask the dev mailing list.

To prepare your code for contribution you’ll need to add the Apache header to each .java file (see any PDFBox .java file for an example) and submit a signed ICLA http://www.apache.org/licenses/icla.pdf to Apache.

Regarding additional functionality, the most useful would be a new command line tool which could write the OCR’d text back into the original PDF file as “invisible text”, which would allow copy and paste and text search to work for that PDF file. A starting point for this would be to try to write the OCR’d text into the original PDF as “visible” text - we can make it invisible later!

-- John

On 19 Jun 2014, at 13:57, DImuthu Upeksha dimuthu.upeks...@gmail.com wrote:

Hi John,

Apart from providing compatibility for platforms like Windows, I think most of the functionality of the OCR plugin is finished (please correct me if I'm wrong). But I would like to contribute to the project further. Do you have anything to add as new functionality? And if you plan to add this to the PDFBox code, how should I prepare my code? Is there any standard way?

Thanks
Dimuthu

--
Regards
W.Dimuthu Upeksha
Undergraduate
Department of Computer Science And Engineering
University of Moratuwa, Sri Lanka
Re: TIKA-1300
Thanks for the pointer - very useful information.

BR
Maruan

On 27.06.2014 at 08:18, Tilman Hausherr thaush...@t-online.de wrote:

Please look at TIKA-1300 https://issues.apache.org/jira/browse/TIKA-1300, it is about the PDFBox sequential parser vs. the non-sequential parser.
[jira] [Commented] (PDFBOX-2162) annotation that highlights a text is not visible in image (converted from the pdf)
[ https://issues.apache.org/jira/browse/PDFBOX-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045802#comment-14045802 ]

Maruan Sahyoun commented on PDFBOX-2162:

The reason the annotations are not showing up is that there is no appearance stream (/AP) defined for them. The appearance defines how the annotation shall be rendered. It is an optional entry, i.e. an annotation doesn’t need to have an appearance defined. What Adobe Reader does is generate an appearance for annotations which do not have one defined. That’s why it works after the file has been saved by Adobe Reader. You can use PDFBox to generate an appearance stream for the annotations.

annotation that highlights a text is not visible in image (converted from the pdf)

Key: PDFBOX-2162
URL: https://issues.apache.org/jira/browse/PDFBOX-2162
Project: PDFBox
Issue Type: Bug
Components: Rendering
Affects Versions: 1.8.6
Environment: Java 1.7 and Eclipse Kepler
Reporter: Julien Savoyet
Fix For: 1.8.6
Attachments: myfile1.pdf, myfile1_re_saved.pdf

Hi, I'm trying to convert to images (png or jpeg) a PDF file in which I've added an annotation on each page through the PDAnnotationTextMarkup object. I've used PDFImageWriter to do this conversion, and the images are correctly generated except that the annotation has disappeared. It seems that there is a mistake, because if, for example, I open my PDF file with Acrobat Reader, re-save it as a new PDF file and then rerun my image conversion script on the new file, the generated images do contain the annotations.
[jira] [Created] (PDFBOX-2165) NPE Exception with barcode ttf font
Tilman Hausherr created PDFBOX-2165:

Summary: NPE Exception with barcode ttf font
Key: PDFBOX-2165
URL: https://issues.apache.org/jira/browse/PDFBOX-2165
Project: PDFBox
Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Tilman Hausherr

Inspired by this complaint http://stackoverflow.com/a/24432822/535646 I tried loading barcode fonts with PDFBox, with the command
{code}
PDTrueTypeFont.loadTTF()
{code}
With the file Code39.ttf that I get here http://www.myfont.de/download.php?winfont=bar-code-39-lesbartype=zip I get this exception:
{code}
Exception in thread "main" java.lang.NullPointerException
	at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.makeFontDescriptor(PDTrueTypeFont.java:328)
	at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:127)
	at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadTTF(PDTrueTypeFont.java:80)
{code}
[jira] [Created] (PDFBOX-2166) AIOOB Exception with barcode ttf font
Tilman Hausherr created PDFBOX-2166:

Summary: AIOOB Exception with barcode ttf font
Key: PDFBOX-2166
URL: https://issues.apache.org/jira/browse/PDFBOX-2166
Project: PDFBox
Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Tilman Hausherr

Inspired by this complaint http://stackoverflow.com/a/24432822/535646 I tried loading barcode fonts with PDFBox, with the command
{code}
PDTrueTypeFont.loadTTF()
{code}
With the file free3of9.ttf that I can get here http://www.barcodesinc.com/free-barcode-font/ I get this exception:
{code}
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 42
	at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.makeFontDescriptor(PDTrueTypeFont.java:337)
	at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:127)
	at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadTTF(PDTrueTypeFont.java:80)
{code}
[jira] [Updated] (PDFBOX-2166) AIOOB Exception with barcode ttf font
[ https://issues.apache.org/jira/browse/PDFBOX-2166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-2166:

Attachment: free3of9.ttf

AIOOB Exception with barcode ttf font

Key: PDFBOX-2166
URL: https://issues.apache.org/jira/browse/PDFBOX-2166
Project: PDFBox
Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Tilman Hausherr
Labels: 42, barcode, font, truetype
Attachments: free3of9.ttf
[jira] [Updated] (PDFBOX-2165) NPE Exception with barcode ttf font
[ https://issues.apache.org/jira/browse/PDFBOX-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-2165:

Attachment: Code39.ttf

NPE Exception with barcode ttf font

Key: PDFBOX-2165
URL: https://issues.apache.org/jira/browse/PDFBOX-2165
Project: PDFBox
Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Tilman Hausherr
Labels: barcode, font, truetype
Attachments: Code39.ttf
[jira] [Assigned] (PDFBOX-2163) inline image with EI in the middle incorrectly parsed
[ https://issues.apache.org/jira/browse/PDFBOX-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr reassigned PDFBOX-2163: --- Assignee: Tilman Hausherr inline image with EI in the middle incorrectly parsed - Key: PDFBOX-2163 URL: https://issues.apache.org/jira/browse/PDFBOX-2163 Project: PDFBox Issue Type: Bug Components: Parsing Reporter: Tilman Hausherr Assignee: Tilman Hausherr This PDF http://digitalcorpora.org/corp/nps/files/govdocs1/876/876636.pdf has an exception because the end of an inline image is improperly detected. The stream looks like this: {code} BI /W 452 /H 169 /BPC 8 /CS /RGB /D [0.0 1.0 0.0 1.0 0.0 1.0] /F [/A85 /Fl] ID .. EI .. ... EI Q {code} The inline images are handled in PDFStreamParser. This is tricky, we look for followup bin data to check that it isn't an EI in the middle, but here it isn't bin data, but ascii85 stuff. We also can't request that there be a LF before the EI, because I remember that I had a PDF at work created by a well known company that doesn't use it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (PDFBOX-2163) inline image with EI in the middle incorrectly parsed
[ https://issues.apache.org/jira/browse/PDFBOX-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14045177#comment-14045177 ] Tilman Hausherr edited comment on PDFBOX-2163 at 6/27/14 6:39 PM: -- And more: http://digitalcorpora.org/corp/nps/files/govdocs1/322/322313.pdf http://digitalcorpora.org/corp/nps/files/govdocs1/258/258126.pdf was (Author: tilman): And another: http://digitalcorpora.org/corp/nps/files/govdocs1/258/258126.pdf I think we should use an alternative parsing strategy for Ascii85 encoded inline images, e.g. assuming that the EI is in a separate line. inline image with EI in the middle incorrectly parsed - Key: PDFBOX-2163 URL: https://issues.apache.org/jira/browse/PDFBOX-2163 Project: PDFBox Issue Type: Bug Components: Parsing Reporter: Tilman Hausherr Assignee: Tilman Hausherr This PDF http://digitalcorpora.org/corp/nps/files/govdocs1/876/876636.pdf has an exception because the end of an inline image is improperly detected. The stream looks like this: {code} BI /W 452 /H 169 /BPC 8 /CS /RGB /D [0.0 1.0 0.0 1.0 0.0 1.0] /F [/A85 /Fl] ID .. EI .. ... EI Q {code} The inline images are handled in PDFStreamParser. This is tricky, we look for followup bin data to check that it isn't an EI in the middle, but here it isn't bin data, but ascii85 stuff. We also can't request that there be a LF before the EI, because I remember that I had a PDF at work created by a well known company that doesn't use it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (PDFBOX-2163) inline image with EI in the middle incorrectly parsed
[ https://issues.apache.org/jira/browse/PDFBOX-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14046255#comment-14046255 ] Tilman Hausherr commented on PDFBOX-2163: - Fixed for the trunk in http://svn.apache.org/r1606177 I'm looking in the output stream to see if there are 70 ascii85 bytes. If yes, then this EI doesn't count. All the files above now render properly, and so do all older files with inline images I kept. inline image with EI in the middle incorrectly parsed - Key: PDFBOX-2163 URL: https://issues.apache.org/jira/browse/PDFBOX-2163 Project: PDFBox Issue Type: Bug Components: Parsing Reporter: Tilman Hausherr Assignee: Tilman Hausherr This PDF http://digitalcorpora.org/corp/nps/files/govdocs1/876/876636.pdf has an exception because the end of an inline image is improperly detected. The stream looks like this: {code} BI /W 452 /H 169 /BPC 8 /CS /RGB /D [0.0 1.0 0.0 1.0 0.0 1.0] /F [/A85 /Fl] ID .. EI .. ... EI Q {code} The inline images are handled in PDFStreamParser. This is tricky, we look for followup bin data to check that it isn't an EI in the middle, but here it isn't bin data, but ascii85 stuff. We also can't request that there be a LF before the EI, because I remember that I had a PDF at work created by a well known company that doesn't use it. -- This message was sent by Atlassian JIRA (v6.2#6252)
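The check described in the comment above can be sketched roughly as follows. This is only one reading of "70 ascii85 bytes" and not the code committed in r1606177: the class name `Ascii85EiCheck`, both method names, and the choice to inspect the bytes collected before the candidate EI are assumptions made for illustration. The digit range '!' to 'u' comes from the ASCII85 encoding itself; the 'z' shorthand is ignored to keep the sketch short.

```java
import java.io.ByteArrayOutputStream;

final class Ascii85EiCheck
{
    // Window size taken from the comment ("70 ascii85 bytes").
    static final int WINDOW = 70;

    /** True if the last WINDOW bytes collected so far are all ASCII85 digits ('!'..'u'). */
    static boolean endsWithAscii85Run(ByteArrayOutputStream imageData)
    {
        byte[] data = imageData.toByteArray();
        if (data.length < WINDOW)
        {
            return false;
        }
        for (int i = data.length - WINDOW; i < data.length; i++)
        {
            int b = data[i] & 0xFF;
            if (b < '!' || b > 'u')
            {
                return false;
            }
        }
        return true;
    }

    /** Hypothetical decision: a candidate EI inside an ASCII85 run does not count. */
    static boolean isRealEndOfImage(ByteArrayOutputStream imageData)
    {
        return !endsWithAscii85Run(imageData);
    }
}
```

A genuine end of ASCII85 data is marked by `~>`, and `~` lies outside the digit range, so under these assumptions a real terminator followed by EI is not rejected.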
[jira] [Updated] (PDFBOX-2163) inline image with EI in the middle incorrectly parsed
[ https://issues.apache.org/jira/browse/PDFBOX-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-2163: Affects Version/s: 2.0.0 1.8.7 1.8.6 inline image with EI in the middle incorrectly parsed - Key: PDFBOX-2163 URL: https://issues.apache.org/jira/browse/PDFBOX-2163 Project: PDFBox Issue Type: Bug Components: Parsing Affects Versions: 1.8.6, 1.8.7, 2.0.0 Reporter: Tilman Hausherr Assignee: Tilman Hausherr This PDF http://digitalcorpora.org/corp/nps/files/govdocs1/876/876636.pdf has an exception because the end of an inline image is improperly detected. The stream looks like this: {code} BI /W 452 /H 169 /BPC 8 /CS /RGB /D [0.0 1.0 0.0 1.0 0.0 1.0] /F [/A85 /Fl] ID .. EI .. ... EI Q {code} The inline images are handled in PDFStreamParser. This is tricky, we look for followup bin data to check that it isn't an EI in the middle, but here it isn't bin data, but ascii85 stuff. We also can't request that there be a LF before the EI, because I remember that I had a PDF at work created by a well known company that doesn't use it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (PDFBOX-2163) inline image with EI in the middle incorrectly parsed
[ https://issues.apache.org/jira/browse/PDFBOX-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-2163: Labels: inline (was: ) inline image with EI in the middle incorrectly parsed - Key: PDFBOX-2163 URL: https://issues.apache.org/jira/browse/PDFBOX-2163 Project: PDFBox Issue Type: Bug Components: Parsing Affects Versions: 1.8.6, 1.8.7, 2.0.0 Reporter: Tilman Hausherr Assignee: Tilman Hausherr Labels: inline Fix For: 1.8.7, 2.0.0 This PDF http://digitalcorpora.org/corp/nps/files/govdocs1/876/876636.pdf has an exception because the end of an inline image is improperly detected. The stream looks like this: {code} BI /W 452 /H 169 /BPC 8 /CS /RGB /D [0.0 1.0 0.0 1.0 0.0 1.0] /F [/A85 /Fl] ID .. EI .. ... EI Q {code} The inline images are handled in PDFStreamParser. This is tricky, we look for followup bin data to check that it isn't an EI in the middle, but here it isn't bin data, but ascii85 stuff. We also can't request that there be a LF before the EI, because I remember that I had a PDF at work created by a well known company that doesn't use it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (PDFBOX-2163) inline image with EI in the middle incorrectly parsed
[ https://issues.apache.org/jira/browse/PDFBOX-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-2163: Fix Version/s: 2.0.0 1.8.7 inline image with EI in the middle incorrectly parsed - Key: PDFBOX-2163 URL: https://issues.apache.org/jira/browse/PDFBOX-2163 Project: PDFBox Issue Type: Bug Components: Parsing Affects Versions: 1.8.6, 1.8.7, 2.0.0 Reporter: Tilman Hausherr Assignee: Tilman Hausherr Labels: inline Fix For: 1.8.7, 2.0.0 This PDF http://digitalcorpora.org/corp/nps/files/govdocs1/876/876636.pdf has an exception because the end of an inline image is improperly detected. The stream looks like this: {code} BI /W 452 /H 169 /BPC 8 /CS /RGB /D [0.0 1.0 0.0 1.0 0.0 1.0] /F [/A85 /Fl] ID .. EI .. ... EI Q {code} The inline images are handled in PDFStreamParser. This is tricky, we look for followup bin data to check that it isn't an EI in the middle, but here it isn't bin data, but ascii85 stuff. We also can't request that there be a LF before the EI, because I remember that I had a PDF at work created by a well known company that doesn't use it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (PDFBOX-2163) inline image with EI in the middle incorrectly parsed
[ https://issues.apache.org/jira/browse/PDFBOX-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14045177#comment-14045177 ] Tilman Hausherr edited comment on PDFBOX-2163 at 6/27/14 6:53 PM: -- And more: http://digitalcorpora.org/corp/nps/files/govdocs1/322/322313.pdf http://digitalcorpora.org/corp/nps/files/govdocs1/258/258126.pdf http://digitalcorpora.org/corp/nps/files/govdocs1/662/662062.pdf was (Author: tilman): And more: http://digitalcorpora.org/corp/nps/files/govdocs1/322/322313.pdf http://digitalcorpora.org/corp/nps/files/govdocs1/258/258126.pdf inline image with EI in the middle incorrectly parsed - Key: PDFBOX-2163 URL: https://issues.apache.org/jira/browse/PDFBOX-2163 Project: PDFBox Issue Type: Bug Components: Parsing Affects Versions: 1.8.6, 1.8.7, 2.0.0 Reporter: Tilman Hausherr Assignee: Tilman Hausherr Labels: inline Fix For: 1.8.7, 2.0.0 This PDF http://digitalcorpora.org/corp/nps/files/govdocs1/876/876636.pdf has an exception because the end of an inline image is improperly detected. The stream looks like this: {code} BI /W 452 /H 169 /BPC 8 /CS /RGB /D [0.0 1.0 0.0 1.0 0.0 1.0] /F [/A85 /Fl] ID .. EI .. ... EI Q {code} The inline images are handled in PDFStreamParser. This is tricky, we look for followup bin data to check that it isn't an EI in the middle, but here it isn't bin data, but ascii85 stuff. We also can't request that there be a LF before the EI, because I remember that I had a PDF at work created by a well known company that doesn't use it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (PDFBOX-2163) inline image with EI in the middle incorrectly parsed
[ https://issues.apache.org/jira/browse/PDFBOX-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14045177#comment-14045177 ] Tilman Hausherr edited comment on PDFBOX-2163 at 6/27/14 7:46 PM: -- And more: http://digitalcorpora.org/corp/nps/files/govdocs1/322/322313.pdf http://digitalcorpora.org/corp/nps/files/govdocs1/258/258126.pdf http://digitalcorpora.org/corp/nps/files/govdocs1/662/662062.pdf http://digitalcorpora.org/corp/nps/files/govdocs1/152/152584.pdf was (Author: tilman): And more: http://digitalcorpora.org/corp/nps/files/govdocs1/322/322313.pdf http://digitalcorpora.org/corp/nps/files/govdocs1/258/258126.pdf http://digitalcorpora.org/corp/nps/files/govdocs1/662/662062.pdf inline image with EI in the middle incorrectly parsed - Key: PDFBOX-2163 URL: https://issues.apache.org/jira/browse/PDFBOX-2163 Project: PDFBox Issue Type: Bug Components: Parsing Affects Versions: 1.8.6, 1.8.7, 2.0.0 Reporter: Tilman Hausherr Assignee: Tilman Hausherr Labels: inline Fix For: 1.8.7, 2.0.0 This PDF http://digitalcorpora.org/corp/nps/files/govdocs1/876/876636.pdf has an exception because the end of an inline image is improperly detected. The stream looks like this: {code} BI /W 452 /H 169 /BPC 8 /CS /RGB /D [0.0 1.0 0.0 1.0 0.0 1.0] /F [/A85 /Fl] ID .. EI .. ... EI Q {code} The inline images are handled in PDFStreamParser. This is tricky, we look for followup bin data to check that it isn't an EI in the middle, but here it isn't bin data, but ascii85 stuff. We also can't request that there be a LF before the EI, because I remember that I had a PDF at work created by a well known company that doesn't use it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (PDFBOX-2163) inline image with EI in the middle incorrectly parsed
[ https://issues.apache.org/jira/browse/PDFBOX-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14045177#comment-14045177 ] Tilman Hausherr edited comment on PDFBOX-2163 at 6/27/14 8:01 PM: -- And more: http://digitalcorpora.org/corp/nps/files/govdocs1/322/322313.pdf http://digitalcorpora.org/corp/nps/files/govdocs1/258/258126.pdf http://digitalcorpora.org/corp/nps/files/govdocs1/662/662062.pdf http://digitalcorpora.org/corp/nps/files/govdocs1/152/152584.pdf http://digitalcorpora.org/corp/nps/files/govdocs1/092/092448.pdf was (Author: tilman): And more: http://digitalcorpora.org/corp/nps/files/govdocs1/322/322313.pdf http://digitalcorpora.org/corp/nps/files/govdocs1/258/258126.pdf http://digitalcorpora.org/corp/nps/files/govdocs1/662/662062.pdf http://digitalcorpora.org/corp/nps/files/govdocs1/152/152584.pdf inline image with EI in the middle incorrectly parsed - Key: PDFBOX-2163 URL: https://issues.apache.org/jira/browse/PDFBOX-2163 Project: PDFBox Issue Type: Bug Components: Parsing Affects Versions: 1.8.6, 1.8.7, 2.0.0 Reporter: Tilman Hausherr Assignee: Tilman Hausherr Labels: inline Fix For: 1.8.7, 2.0.0 This PDF http://digitalcorpora.org/corp/nps/files/govdocs1/876/876636.pdf has an exception because the end of an inline image is improperly detected. The stream looks like this: {code} BI /W 452 /H 169 /BPC 8 /CS /RGB /D [0.0 1.0 0.0 1.0 0.0 1.0] /F [/A85 /Fl] ID .. EI .. ... EI Q {code} The inline images are handled in PDFStreamParser. This is tricky, we look for followup bin data to check that it isn't an EI in the middle, but here it isn't bin data, but ascii85 stuff. We also can't request that there be a LF before the EI, because I remember that I had a PDF at work created by a well known company that doesn't use it. -- This message was sent by Atlassian JIRA (v6.2#6252)
Jenkins build became unstable: PDFBox-trunk » Apache PDFBox #1084
See https://builds.apache.org/job/PDFBox-trunk/org.apache.pdfbox$pdfbox/1084/changes
Jenkins build became unstable: PDFBox-trunk #1084
See https://builds.apache.org/job/PDFBox-trunk/1084/changes
[jira] [Updated] (PDFBOX-2165) NPE with barcode ttf font
[ https://issues.apache.org/jira/browse/PDFBOX-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-2165: Summary: NPE with barcode ttf font (was: NPE Exception with barcode ttf font) NPE with barcode ttf font - Key: PDFBOX-2165 URL: https://issues.apache.org/jira/browse/PDFBOX-2165 Project: PDFBox Issue Type: Bug Affects Versions: 2.0.0 Reporter: Tilman Hausherr Labels: barcode, font, truetype Attachments: Code39.ttf Inspired by this complaint http://stackoverflow.com/a/24432822/535646 I tried loading barcode fonts with PDFBox, with the command {code} PDTrueTypeFont.loadTTF() {code} With the file Code39.ttf that I get here http://www.myfont.de/download.php?winfont=bar-code-39-lesbartype=zip I get this exception: {code} Exception in thread main java.lang.NullPointerException at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.makeFontDescriptor(PDTrueTypeFont.java:328) at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.init(PDTrueTypeFont.java:127) at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadTTF(PDTrueTypeFont.java:80) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (PDFBOX-2166) AIOOBE with barcode ttf font
[ https://issues.apache.org/jira/browse/PDFBOX-2166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-2166: Summary: AIOOBE with barcode ttf font (was: AIOOB Exception with barcode ttf font) AIOOBE with barcode ttf font Key: PDFBOX-2166 URL: https://issues.apache.org/jira/browse/PDFBOX-2166 Project: PDFBox Issue Type: Bug Affects Versions: 2.0.0 Reporter: Tilman Hausherr Labels: 42, barcode, font, truetype Attachments: free3of9.ttf Inspired by this complaint http://stackoverflow.com/a/24432822/535646 I tried loading barcode fonts with PDFBox, with the command {code} PDTrueTypeFont.loadTTF() {code} With the file free3of9.ttf that I can get here http://www.barcodesinc.com/free-barcode-font/ I get this exception: {code} Exception in thread main java.lang.ArrayIndexOutOfBoundsException: 42 at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.makeFontDescriptor(PDTrueTypeFont.java:337) at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.init(PDTrueTypeFont.java:127) at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadTTF(PDTrueTypeFont.java:80) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (PDFBOX-2167) NPE in PDTrueTypeFont.makeFontDescriptor
[ https://issues.apache.org/jira/browse/PDFBOX-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-2167: Attachment: 268554.pdf NPE in PDTrueTypeFont.makeFontDescriptor Key: PDFBOX-2167 URL: https://issues.apache.org/jira/browse/PDFBOX-2167 Project: PDFBox Issue Type: Bug Components: Parsing Affects Versions: 2.0.0 Reporter: Tilman Hausherr Attachments: 268554.pdf I get an NPE with the file from http://digitalcorpora.org/corp/nps/files/govdocs1/268/268554.pdf {code} java.lang.NullPointerException at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.makeFontDescriptor(PDTrueTypeFont.java:292) at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getFontDescriptor(PDTrueTypeFont.java:150) at org.apache.pdfbox.pdmodel.font.PDFont.getFontWidth(PDFont.java:814) IOException for file 268554.pdf at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getFontWidth(PDTrueTypeFont.java:382) at org.apache.pdfbox.pdmodel.font.PDFont.getFontWidth(PDFont.java:312) at org.apache.pdfbox.pdmodel.font.PDFont.getSpaceWidth(PDFont.java:855) at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:328) at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:44) at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:521) at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:267) at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:226) at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:209) at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:174) at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:227) at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:160) at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:109) {code} I first thought it is the same as PDFBOX-2165, but it's a different line number. 
[jira] [Created] (PDFBOX-2167) NPE in PDTrueTypeFont.makeFontDescriptor
Tilman Hausherr created PDFBOX-2167: --- Summary: NPE in PDTrueTypeFont.makeFontDescriptor Key: PDFBOX-2167 URL: https://issues.apache.org/jira/browse/PDFBOX-2167 Project: PDFBox Issue Type: Bug Components: Parsing Affects Versions: 2.0.0 Reporter: Tilman Hausherr Attachments: 268554.pdf I get an NPE with the file from http://digitalcorpora.org/corp/nps/files/govdocs1/268/268554.pdf {code} java.lang.NullPointerException at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.makeFontDescriptor(PDTrueTypeFont.java:292) at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getFontDescriptor(PDTrueTypeFont.java:150) at org.apache.pdfbox.pdmodel.font.PDFont.getFontWidth(PDFont.java:814) IOException for file 268554.pdf at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getFontWidth(PDTrueTypeFont.java:382) at org.apache.pdfbox.pdmodel.font.PDFont.getFontWidth(PDFont.java:312) at org.apache.pdfbox.pdmodel.font.PDFont.getSpaceWidth(PDFont.java:855) at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:328) at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:44) at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:521) at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:267) at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:226) at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:209) at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:174) at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:227) at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:160) at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:109) {code} I first thought it is the same as PDFBOX-2165, but it's a different line number. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (PDFBOX-2163) inline image with EI in the middle incorrectly parsed
[ https://issues.apache.org/jira/browse/PDFBOX-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14045177#comment-14045177 ] Tilman Hausherr edited comment on PDFBOX-2163 at 6/27/14 8:56 PM: -- And more: http://digitalcorpora.org/corp/nps/files/govdocs1/322/322313.pdf http://digitalcorpora.org/corp/nps/files/govdocs1/258/258126.pdf http://digitalcorpora.org/corp/nps/files/govdocs1/662/662062.pdf http://digitalcorpora.org/corp/nps/files/govdocs1/152/152584.pdf http://digitalcorpora.org/corp/nps/files/govdocs1/092/092448.pdf http://digitalcorpora.org/corp/nps/files/govdocs1/248/248066.pdf was (Author: tilman): And more: http://digitalcorpora.org/corp/nps/files/govdocs1/322/322313.pdf http://digitalcorpora.org/corp/nps/files/govdocs1/258/258126.pdf http://digitalcorpora.org/corp/nps/files/govdocs1/662/662062.pdf http://digitalcorpora.org/corp/nps/files/govdocs1/152/152584.pdf http://digitalcorpora.org/corp/nps/files/govdocs1/092/092448.pdf inline image with EI in the middle incorrectly parsed - Key: PDFBOX-2163 URL: https://issues.apache.org/jira/browse/PDFBOX-2163 Project: PDFBox Issue Type: Bug Components: Parsing Affects Versions: 1.8.6, 1.8.7, 2.0.0 Reporter: Tilman Hausherr Assignee: Tilman Hausherr Labels: inline Fix For: 1.8.7, 2.0.0 This PDF http://digitalcorpora.org/corp/nps/files/govdocs1/876/876636.pdf has an exception because the end of an inline image is improperly detected. The stream looks like this: {code} BI /W 452 /H 169 /BPC 8 /CS /RGB /D [0.0 1.0 0.0 1.0 0.0 1.0] /F [/A85 /Fl] ID .. EI .. ... EI Q {code} The inline images are handled in PDFStreamParser. This is tricky, we look for followup bin data to check that it isn't an EI in the middle, but here it isn't bin data, but ascii85 stuff. We also can't request that there be a LF before the EI, because I remember that I had a PDF at work created by a well known company that doesn't use it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (PDFBOX-2159) horizontal line above shaded text when printing on HP1320
[ https://issues.apache.org/jira/browse/PDFBOX-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14046594#comment-14046594 ] John Hewson commented on PDFBOX-2159: - If you have a printer-specific problem you could try the 2.0 trunk and use the constructor: {code} PDFPrinter(PDDocument document, Scaling scaling, Orientation orientation, Paper paper, float dpi) {code} This allows the image to be rasterized before being sent to the printer driver; you'll need to set a dpi, usually 300. horizontal line above shaded text when printing on HP1320 - Key: PDFBOX-2159 URL: https://issues.apache.org/jira/browse/PDFBOX-2159 Project: PDFBox Issue Type: Bug Affects Versions: 1.8.6, 1.8.7 Reporter: Tilman Hausherr Attachments: IMG_20140624_073954101.jpg, PDFBOX-2159-2.pdf, PDFBOX-2159-2.ps, PDFBOX-2159.pdf, PDFBOX-2159.ps This is a follow-up to PDFBOX-2141 and, to some extent, to PDFBOX-485. In the latter, [~vbier] reported weird printing problems related to a specific part of the code that I changed again recently. While the original problem is gone, there is a new one that we discovered in a discussion in PDFBOX-2141 and it appears e.g. on the 4th page of the file pslib-shading.pdf that is in PDFBOX-1942: above the text there is a horizontal line (about 3mm thick) that goes over the whole page. I created a new test file that I am attaching. It has two shaded text lines and two unshaded, each time once with a standard 14 font and with an embedded type 1 font. [~vbier], please tell how many horizontal lines you get and where. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (PDFBOX-1875) Image and some text missing in rendered file
[ https://issues.apache.org/jira/browse/PDFBOX-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Hewson resolved PDFBOX-1875. - Resolution: Fixed Yes that's it, the clipping path wasn't being intersected with the form's BBox. Good fix. Image and some text missing in rendered file Key: PDFBOX-1875 URL: https://issues.apache.org/jira/browse/PDFBOX-1875 Project: PDFBox Issue Type: Bug Components: Rendering Affects Versions: 2.0.0 Reporter: Tilman Hausherr Labels: bbox Fix For: 2.0.0 Attachments: PDFBOX-1861-bbox-bad.png, PDFBOX-1861-bbox-good.png, PDFBOX-1861-bbox.pdf, pdfbox-1861-tracemonkey.pdf-6.png An image and some text are missing on page 6 of the tracemonkey.pdf file of PDFBOX-1861. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (PDFBOX-2104) Implement transparency groups
[ https://issues.apache.org/jira/browse/PDFBOX-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14046645#comment-14046645 ] John Hewson commented on PDFBOX-2104: - I did some syntactic cleaning up of the code in [r1606281|http://svn.apache.org/r1606281]. Implement transparency groups - Key: PDFBOX-2104 URL: https://issues.apache.org/jira/browse/PDFBOX-2104 Project: PDFBox Issue Type: Improvement Components: Rendering Affects Versions: 2.0.0 Reporter: Petr Slaby Assignee: John Hewson Labels: transparency Fix For: 2.0.0 Attachments: 01_MTEXT_CS6.pdf, TransparencyGroups.1.patch, TransparencyGroups.2.patch, TransparencyGroups.3.patch, TransparencyGroups.patch The attached PDF uses transparency groups, blending and soft masks to create the rounded corners and shades behind images. It appears that these features are not implemented in PDFBox. An implementation proposal is attached in the TransparencyGroup.patch. The basic idea is to create a buffered image, draw the transparency group content onto it and then use the result to produce the soft mask or draw the image on the original g2d. Note: I am not the (only) author of the proposed change. It was developed in our company few years ago in sources based on a 1.7.x version of PDFBox, mostly by a guy who already left. Over the years, merging of the work done in PDFBox main stream into our source base has become impossible due to many refactorings and other deep going changes done. Now we would like to go the opposite way - where possible - bring the changes and fixes we have done into PDFBox main stream and start to use it in our installations. -- This message was sent by Atlassian JIRA (v6.2#6252)
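The basic idea from the issue description (draw the group content onto its own buffered image, then draw that image on the original g2d) can be illustrated with plain Java2D. This is a minimal sketch, not the attached patch: the class `GroupBufferSketch`, its methods, the red rectangle standing in for group content, and the single constant alpha are all assumptions, and soft masks and blend modes are left out entirely.

```java
import java.awt.AlphaComposite;
import java.awt.Color;
import java.awt.Composite;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

final class GroupBufferSketch
{
    /** Renders stand-in "group content" into its own ARGB buffer. */
    static BufferedImage renderGroup(int w, int h)
    {
        BufferedImage group = new BufferedImage(w, h, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = group.createGraphics();
        g.setColor(Color.RED);
        g.fillRect(0, 0, w / 2, h); // left half painted, right half stays transparent
        g.dispose();
        return group;
    }

    /** Composites the finished group onto the page graphics with one constant alpha. */
    static void drawGroup(Graphics2D page, BufferedImage group, float alpha)
    {
        Composite old = page.getComposite();
        page.setComposite(AlphaComposite.getInstance(AlphaComposite.SRC_OVER, alpha));
        page.drawImage(group, 0, 0, null);
        page.setComposite(old); // restore so later drawing is unaffected
    }
}
```

Rendering the group in isolation first means its objects are combined with each other before the result as a whole is composited with the backdrop, which is what distinguishes a group from drawing each object directly onto the page.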
Jenkins build is back to stable : PDFBox-trunk » Apache PDFBox #1085
See https://builds.apache.org/job/PDFBox-trunk/org.apache.pdfbox$pdfbox/1085/changes
Jenkins build is back to stable : PDFBox-trunk #1085
See https://builds.apache.org/job/PDFBox-trunk/1085/changes
[jira] [Comment Edited] (PDFBOX-2126) Optimize clipping
[ https://issues.apache.org/jira/browse/PDFBOX-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14046702#comment-14046702 ] John Hewson edited comment on PDFBOX-2126 at 6/28/14 2:42 AM: -- I like what you're doing with this commit but when I looked closer I thought that the handling of clipping paths needed some more significant improvements. I've done some refactoring in [s1606283|http://svn.apache.org/s1606283] to PDGraphicsState, see what you think. I haven't applied any optimisations yet, you're welcome to update your patch again. Feedback appreciated. was (Author: jahewson): I like what you're doing with this commit but when I looked at applying at I thought that the handling of clipping paths needed some more significant improvements. I've done some refactoring in [s1606283|http://svn.apache.org/s1606283] to PDGraphicsState, see what you think. I haven't applied any optimisations yet, you're welcome to update your patch again. Feedback appreciated. Optimize clipping - Key: PDFBOX-2126 URL: https://issues.apache.org/jira/browse/PDFBOX-2126 Project: PDFBox Issue Type: Improvement Components: Rendering Affects Versions: 2.0.0 Reporter: Petr Slaby Attachments: ClipPath.1.patch, ClipPath.patch, example_010.pdf As already stated in a TODO comment in PageDrawer, the call of Graphics2D#setClip() is time and memory consuming. The attached patch optimizes clipping by calling Graphics2D#setClip() only if the clipping path has changed. The effect depends on the document, e.g. the attached one renders in 10.5s without the optimization and in 5.5 seconds in the optimized version. The clipping has to be re-applied whenever the transform in Graphics2D changes. This is not explicitly checked for, the implementation rather depends on the cached value being reset manually. Currently this is only needed at one place when processing annotations (AcroForms). 
Also, the implementation relies upon the clipping path object stored in PDGraphicsState to never change so that a comparison using == can be used. This works fine, but needs a bit of awareness in future changes. To make the design cleaner, the clipping path could be made private to PDGraphicsState and thus really immutable from outside. -- This message was sent by Atlassian JIRA (v6.2#6252)
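The caching scheme described in this issue can be sketched as below. It is an illustration under stated assumptions rather than the contents of ClipPath.patch: the class `ClipCacheSketch`, its method names, and the `setClipCalls` counter (present only to make the effect observable) are hypothetical.

```java
import java.awt.Graphics2D;
import java.awt.Shape;

final class ClipCacheSketch
{
    private Shape lastClip; // the clip object last handed to Graphics2D
    int setClipCalls;       // counter for illustration only

    /** Calls Graphics2D#setClip() only when a different clip object is passed in. */
    void applyClip(Graphics2D g, Shape clip)
    {
        if (clip != lastClip) // identity comparison, so the clip object must never be mutated
        {
            g.setClip(clip);
            lastClip = clip;
            setClipCalls++;
        }
    }

    /** To be called manually whenever the Graphics2D transform changes. */
    void reset()
    {
        lastClip = null;
    }
}
```

Because the comparison uses == rather than equals(), two equal-valued but distinct shapes still trigger a new setClip() call, and a shape mutated in place would be missed. That is why the description suggests keeping the clipping path immutable from outside.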
[jira] [Commented] (PDFBOX-2126) Optimize clipping
[ https://issues.apache.org/jira/browse/PDFBOX-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14046702#comment-14046702 ] John Hewson commented on PDFBOX-2126: - I like what you're doing with this commit but when I looked at applying it I thought that the handling of clipping paths needed some more significant improvements. I've done some refactoring in [s1606283|http://svn.apache.org/s1606283] to PDGraphicsState, see what you think. I haven't applied any optimisations yet, you're welcome to update your patch again. Feedback appreciated. Optimize clipping - Key: PDFBOX-2126 URL: https://issues.apache.org/jira/browse/PDFBOX-2126 Project: PDFBox Issue Type: Improvement Components: Rendering Affects Versions: 2.0.0 Reporter: Petr Slaby Attachments: ClipPath.1.patch, ClipPath.patch, example_010.pdf As already stated in a TODO comment in PageDrawer, the call of Graphics2D#setClip() is time and memory consuming. The attached patch optimizes clipping by calling Graphics2D#setClip() only if the clipping path has changed. The effect depends on the document, e.g. the attached one renders in 10.5s without the optimization and in 5.5 seconds in the optimized version. The clipping has to be re-applied whenever the transform in Graphics2D changes. This is not explicitly checked for, the implementation rather depends on the cached value being reset manually. Currently this is only needed at one place when processing annotations (AcroForms). Also, the implementation relies upon the clipping path object stored in PDGraphicsState to never change so that a comparison using == can be used. This works fine, but needs a bit of awareness in future changes. To make the design cleaner, the clipping path could be made private to PDGraphicsState and thus really immutable from outside. -- This message was sent by Atlassian JIRA (v6.2#6252)