date:20140627


[ 
https://issues.apache.org/jira/browse/PDFBOX-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045578#comment-14045578
 ] 

Tilman Hausherr commented on PDFBOX-1915:
-

I see you've been active with refactoring, this is good :-)

Yes, the javadocs should be done too. They don't have to be long, but they 
should summarize what's being done and, for an interface or an abstract class, 
how it is used by other classes. Nobody likes to do it, but the longer you 
wait, the more annoying it becomes, so don't wait :-)

Private classes are not required to have a javadoc, but they should have an 
explanation if it isn't obvious from the code, or at least a hint of what's 
being done. E.g. getLen = length of a line. isEdgeALine: I would like to know 
what it does.

If you used a Wikipedia article or an online paper as help, include the link 
where applicable. I couldn't have done types 1, 4, 5 without Wikipedia and 
one university course resource :-)

- CoordinateColorPair.java: classes that are used only by you don't have to be 
public. Just leave out the keyword public.
- PatchMeshesShadingContext.java readPatch(): if I remember this correctly, an 
EOF at that place is a bug in the source file (the flag was read successfully 
but not the rest), so it should be logged as an error (LOG.error).
- that decode line you commented out: just remove it
- setLevel: this isn't a setter and it isn't a getter; I suggest you name it 
calculateLevel or whatever. I assume this is what we discussed here, 
i.e. you're deciding how far you'll chop the patch into triangles 
depending on the size of the patch.
- remove classes that have a comment "This class is not used" :-) Just delete 
them.
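
To illustrate the naming and javadoc points in the list above, here is a 
minimal sketch. It is not the code under review: the class name 
PatchSubdivision, the 16-pixel starting threshold and the maximum level of 4 
are all hypothetical and chosen only for illustration.

```java
/**
 * Hypothetical helper that decides how finely a patch is subdivided.
 * Not taken from the PDFBox sources; all thresholds are illustrative.
 */
final class PatchSubdivision
{
    // Assumed upper bound for the subdivision level.
    private static final int MAX_LEVEL = 4;

    private PatchSubdivision()
    {
    }

    /**
     * Calculates the subdivision level from the longest edge of a patch.
     * The level starts at 1 and grows by one each time the edge length
     * exceeds a threshold that starts at 16 device pixels and doubles,
     * capped at MAX_LEVEL.
     *
     * @param longestEdgeInPixels length of the longest patch edge
     * @return a level between 1 and MAX_LEVEL
     */
    static int calculateLevel(double longestEdgeInPixels)
    {
        int level = 1;
        double threshold = 16.0;
        while (level < MAX_LEVEL && longestEdgeInPixels > threshold)
        {
            level++;
            threshold *= 2.0;
        }
        return level;
    }
}
```

The name says that a value is computed rather than stored, and the javadoc 
states the rule in one sentence, which is about as much as is being asked for.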

I'll run the code tonight and/or this weekend and give more feedback.

https://www.youtube.com/watch?v=TiqDqd-1pwU
Watch at 29:00: he shows the patch on the top right of the file TENSOR.PDF, and 
it seems you did the correct implementation :-)

The stuff at 22:00 is also about shading, but I feel that this goes beyond the 
scope of this project. I don't even know if we support knockout transparency 
groups; we just started with transparency groups a week ago or so.

 Implement shading with Coons and tensor-product patch meshes
 

 Key: PDFBOX-1915
 URL: https://issues.apache.org/jira/browse/PDFBOX-1915
 Project: PDFBox
  Issue Type: Improvement
  Components: Rendering
Affects Versions: 1.8.5, 1.8.6, 2.0.0
Reporter: Tilman Hausherr
Assignee: Shaola Ren
  Labels: graphical, gsoc2014, java, math, shading
 Fix For: 2.0.0

 Attachments: CIB-coons-vs-tensormesh.pdf, CIB-coonsmesh.pdf, 
 CONICAL.pdf, GWG060_Shading_x1a.pdf, GWG060_Shading_x1a_1.png, HSBWHEEL.pdf, 
 McAfee-ShadingType7.pdf, Shadingtype6week1.pdf, TENSOR.pdf, XYZsweep.pdf, 
 _gwg060_shading_x1a.pdf-1.png, _mcafee-shadingtype7.pdf-1.png, 
 asy-coons-but-really-tensor.pdf, asy-tensor-rainbow.pdf, asy-tensor.pdf, 
 coons-function.pdf, coons-function.ps, coons-nofunction-CMYK.pdf, 
 coons-nofunction-CMYK.ps, coons-nofunction-Duotone.pdf, 
 coons-nofunction-Duotone.ps, coons-nofunction-Gray.pdf, 
 coons-nofunction-Gray.ps, coons-nofunction-RGB.pdf, coons-nofunction-RGB.ps, 
 coons2-function.pdf, coons2-function.ps, coons4-function.ps, crestron-p9.pdf, 
 eci_altona-test-suite-v2_technical_H.pdf, failedTest.rar, lamp_cairo.pdf, 
 lamp_cairo7_0.png, lamp_cairo7_1.png, lamp_cairo7_1.png, 
 lineRasterization.jpg, mcafeeU5.pdf, mcafeeU5_1.png, mcafeeu5.pdf-1.png, 
 pass4FlagTest.rar, patchCases.jpg, patchMap.jpg, shading6ContourTest.rar, 
 shading6Done.rar, shading7.rar, tensor-nofunction-RGB.pdf, 
 tensor-nofunction-RGB.ps, tensor-nofunction-RGB_1.png, 
 tensor4-nofunction.pdf, tensor4-nofunction.ps, tensor4-nofunction_1.png, 
 updateshading6ContourTest.rar


 Of the seven shading methods described in the PDF specification, type 6 
 (Coons patch meshes) and type 7 (Tensor-product patch meshes) haven't been 
 implemented. I have done type 1, 4 and 5, but I don't know the math for type 
 6 and 7. My math days are decades away.
 Knowledge prerequisites: 
 - Java, although you don't have to be a Java ace, just feel comfortable
 - math: you should know what cubic Bézier curves, Degenerate Bézier 
 curves, bilinear interpolation, tensor-product, affine transform 
 matrix and Bernstein polynomials are, or be able to learn it
 - maven (basic)
 - svn (basic)
 - an IDE like Netbeans or Eclipse or IntelliJ (basic)
 - ideally, you are either a math student who likes to program, or a computer 
 science student who is specializing in graphics.
 A first look at PDFBOX: try the command utility here:
 https://pdfbox.apache.org/commandline/#pdfToImage
 and use your favorite PDF, or the PDFs mentioned in PDFBOX-615, these have 
 the shading types that are already

[jira] [Commented] (PDFBOX-2158) ExtractText missing most of text in this PDF file, due to font bounding box with minus infinity


[ 
https://issues.apache.org/jira/browse/PDFBOX-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045582#comment-14045582
 ] 

John Hewson commented on PDFBOX-2158:
-

Tilman, shouldn't r1605545 have been a one-line fix?

{code}
if (pdImage.isStencil() && (decode.length != 2 ||
{code}

Could be changed to:

{code}
if (pdImage.isStencil() && (decode.length < 2 ||
{code}

 ExtractText missing most of text in this PDF file, due to font bounding box 
 with minus infinity
 ---

 Key: PDFBOX-2158
 URL: https://issues.apache.org/jira/browse/PDFBOX-2158
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.5
 Environment: Windows x64
Reporter: Joel Hirsh
 Attachments: negative.text.box.pdf


 The attached PDF file is missing most of the text when processed by the 
 ExtractText example program.
 I traced it down to PDFontDescriptorDictionary.getFontBoundingBox() getting a 
 rectangle for COSName.FONT_BBOX that contained a ymin value of minus 
 infinity. That method then creates a PDRectangle which calculates a bounding 
 box with a ymin value of -65,329, which results in an enormous text size, and 
 things go downhill from there. The text cannot be matched up, and most of it 
 ends up being discarded.
 I was able to hack a fix by doing a check in the constructor 
 PDRectangle.PDRectangle( COSArray array ) for big negative numbers and 
 setting them to 0. With that change, all the text came through as expected. 
 However, I don't have enough familiarity with the code to understand what a 
 real fix ought to look like.
 The PDF file looks fine in other programs such as Acrobat and NitroPDF.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (PDFBOX-2158) ExtractText missing most of text in this PDF file, due to font bounding box with minus infinity


[ 
https://issues.apache.org/jira/browse/PDFBOX-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045590#comment-14045590
 ] 

Tilman Hausherr commented on PDFBOX-2158:
-

The problem with that file wasn't just that the decode array had the wrong 
size; there were null objects in it. These null objects caused an NPE in 
toFloat(), so the part that you quote wasn't even reached :-( . My solution is 
to hope that there are at least two entries, that the first two entries are 
floats, and to use them.
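
The fallback described above can be sketched in plain Java. This is only an 
illustration under stated assumptions, not the code committed in r1605545: the 
decode array is modelled as a List<Object> in which null stands in for the 
COSNull entries, and DecodeFallback and firstTwoEntries are hypothetical names.

```java
import java.util.List;

/** Hypothetical sketch; the names are not part of PDFBox. */
final class DecodeFallback
{
    private DecodeFallback()
    {
    }

    /**
     * Returns the first two entries of a decode array as floats.
     * Entries after the first two are ignored, so trailing nulls
     * cannot cause a NullPointerException.
     *
     * @param decode the decode array, modelled as a list that may contain nulls
     * @return a two-element array, or null if the first two entries are
     *         missing or are not numbers
     */
    static float[] firstTwoEntries(List<Object> decode)
    {
        if (decode == null || decode.size() < 2)
        {
            return null;
        }
        Object first = decode.get(0);
        Object second = decode.get(1);
        if (!(first instanceof Number) || !(second instanceof Number))
        {
            return null;
        }
        return new float[]
        {
            ((Number) first).floatValue(),
            ((Number) second).floatValue()
        };
    }
}
```

A caller would treat a null result as "no usable decode array" and fall back 
to a default; which default is appropriate is left open here.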


[jira] [Commented] (PDFBOX-2158) ExtractText missing most of text in this PDF file, due to font bounding box with minus infinity


[ 
https://issues.apache.org/jira/browse/PDFBOX-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045591#comment-14045591
 ] 

Tilman Hausherr commented on PDFBOX-2158:
-

This is the current log output for that file:

27.06.2014 08:15:07.476 WARN  [main] 
org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader:436 - decode array 
COSArray{[COSFloat{1.0}, COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, 
COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, COSNull{}, 
COSNull{}, COSNull{}, COSNull{}, COSNull{}, COSNull{}, COSNull{}, COSNull{}]} 
not compatible with color space, using the first two entries
27.06.2014 08:15:07.477 WARN  [main] 
org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader:436 - decode array 
COSArray{[COSFloat{1.0}, COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, 
COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, COSNull{}, 
COSNull{}, COSNull{}, COSNull{}, COSNull{}, COSNull{}, COSNull{}, COSNull{}]} 
not compatible with color space, using the first two entries
27.06.2014 08:15:08.909 WARN  [main] org.apache.fontbox.cff.Type1CharString:338 
- rlineTo without initial moveTo in font ArialMT, glyph AE



[jira] [Comment Edited] (PDFBOX-2158) ExtractText missing most of text in this PDF file, due to font bounding box with minus infinity


[ 
https://issues.apache.org/jira/browse/PDFBOX-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045591#comment-14045591
 ] 

Tilman Hausherr edited comment on PDFBOX-2158 at 6/27/14 6:15 AM:
--

This is the current log output for that file:
{code}
27.06.2014 08:15:07.476 WARN  [main] 
org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader:436 - decode array 
COSArray{[COSFloat{1.0}, COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, 
COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, COSNull{}, 
COSNull{}, COSNull{}, COSNull{}, COSNull{}, COSNull{}, COSNull{}, COSNull{}]} 
not compatible with color space, using the first two entries
27.06.2014 08:15:07.477 WARN  [main] 
org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader:436 - decode array 
COSArray{[COSFloat{1.0}, COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, 
COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, COSFloat{0.0}, COSNull{}, 
COSNull{}, COSNull{}, COSNull{}, COSNull{}, COSNull{}, COSNull{}, COSNull{}]} 
not compatible with color space, using the first two entries
27.06.2014 08:15:08.909 WARN  [main] org.apache.fontbox.cff.Type1CharString:338 
- rlineTo without initial moveTo in font ArialMT, glyph AE
{code}





TIKA-1300

2014-06-27 Thread Tilman Hausherr

Please look at TIKA-1300 
https://issues.apache.org/jira/browse/TIKA-1300, it is about the PDFBox 
sequential parser vs. the non-sequential parser

[jira] [Comment Edited] (PDFBOX-2158) ExtractText missing most of text in this PDF file, due to font bounding box with minus infinity

[
https://issues.apache.org/jira/browse/PDFBOX-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045590#comment-14045590
]

Tilman Hausherr edited comment on PDFBOX-2158 at 6/27/14 6:25 AM:
--

The problem with that file wasn't just that the decode array had the wrong
size; there were null objects in it. These null objects caused an NPE in
toFloat(), so the part that you quote wasn't even reached :-( . My solution is
to hope that there are at least two entries, that the first two entries are
numbers, and to use them.



[jira] [Commented] (PDFBOX-2158) ExtractText missing most of text in this PDF file, due to font bounding box with minus infinity

[
https://issues.apache.org/jira/browse/PDFBOX-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045608#comment-14045608
]

John Hewson commented on PDFBOX-2158:
-

{quote}
The problem with that file wasn't just that the decode array had the wrong
size; there were null objects in it. These null objects caused an NPE in
toFloat(), so the part that you quote wasn't even reached . My solution is
to hope that there are at least two entries, that the first two entries are
numbers, and to use them.
{quote}

Ah ok, I had wondered if I'd missed something.


[jira] [Commented] (PDFBOX-2158) ExtractText missing most of text in this PDF file, due to font bounding box with minus infinity


[ 
https://issues.apache.org/jira/browse/PDFBOX-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045610#comment-14045610
 ] 

John Hewson commented on PDFBOX-2158:
-

Looking at the font problem: as noted, the FontBBox contains -65329, but I see 
that the Descent is also far too large: 65324. It seems to me that these two 
numbers are two's complements of what were originally signed numbers. If so, 
the original values would have been plausible:

{code}
Descent = -(65324 - 65536) = 212
FontBBox = -(-65329 + 65536) = -207
{code}

We could detect this by looking for values with Math.abs(value) > 32767 and 
applying the equations above. Alternatively, we could just use the values from 
the font whenever they are available and only fall back to the FontDescriptor 
if the font is missing.
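
The detection idea can be sketched as follows. This is a minimal illustration 
that applies the two equations exactly as written in this comment to the 
values quoted in the prose (65324 and -65329); WrappedMetric and repair are 
hypothetical names, and nothing here is existing PDFBox code.

```java
/** Hypothetical sketch; the names are not part of PDFBox. */
final class WrappedMetric
{
    // Largest value that fits in a signed 16-bit number.
    private static final int INT16_MAX = 32767;
    // 2^16, the offset used in the equations of the comment.
    private static final int WRAP = 65536;

    private WrappedMetric()
    {
    }

    /**
     * Repairs a font metric that looks like a wrapped 16-bit value.
     * Values with an absolute value of at most 32767 are returned unchanged.
     *
     * @param value the metric as read from the font descriptor
     * @return the repaired value
     */
    static float repair(float value)
    {
        if (value > INT16_MAX)
        {
            // e.g. Descent: -(65324 - 65536)
            return -(value - WRAP);
        }
        if (value < -INT16_MAX)
        {
            // e.g. FontBBox: -(-65329 + 65536)
            return -(value + WRAP);
        }
        return value;
    }
}
```

Whether such a repair is preferable to reading the values from the embedded 
font is exactly the open question raised in the comment.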


[jira] [Comment Edited] (PDFBOX-2158) ExtractText missing most of text in this PDF file, due to font bounding box with minus infinity


[ 
https://issues.apache.org/jira/browse/PDFBOX-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045610#comment-14045610
 ] 

John Hewson edited comment on PDFBOX-2158 at 6/27/14 6:35 AM:
--

Looking at the font problem: as noted, the FontBBox contains -65329, but I see 
that the Descent is also far too large: 65324. It seems to me that these two 
numbers are two's complements of what were originally signed numbers. If so, 
the original values would have been plausible:

{code}
Descent = -(65324 - 65536) = 212
FontBBox = -(-65329 + 65536) = -207
{code}

We could detect this by looking for values with Math.abs(value) > 32767 and 
applying the equations above. Alternatively, we could just use the values from 
the font whenever they are available and only fall back to the FontDescriptor 
if the font file is missing.




Re: Improving OCR plugin for PDFBox

2014-06-27 Thread John Hewson

Hi Dimuthu

That’s great. We should wait until closer to the end of the GSoC period to 
integrate your work with PDFBox, as ideally we only want to have to do it once. 
We’ve not included C++ dependencies before, so no, there won’t be a standard 
way; we’ll have to think something up. We could make it an optional 
sub-project, and the Tesseract JNI bindings might be better off having their 
own branch so that they are more like an external dependency - I’ll ask the dev 
mailing list.

To prepare your code for contribution you’ll need to add the Apache header to 
each .java file (see any PDFBox .java file for an example) and submit a signed 
ICLA http://www.apache.org/licenses/icla.pdf to Apache.

Regarding additional functionality, the most useful would be for a new command 
line tool which could write the OCR’d text back into the original PDF file as 
“invisible text”, which would allow for copy and paste and text search to then 
work for that PDF file. A starting point for this would be to try and write the 
OCR’d text into the original PDF as “visible” text - we can make it invisible 
later!

-- John

On 19 Jun 2014, at 13:57, DImuthu Upeksha dimuthu.upeks...@gmail.com wrote:

 Hi John,
 Except for providing compatibility for platforms like Windows, I think most 
 of the functionality of the OCR plugin is finished (please correct me if I'm 
 wrong). But I would like to contribute to the project further. Do you have 
 anything to add as new functionality? And if you plan to add this to the 
 PDFBox code, how should I prepare my code? Is there a standard way?
 
 Thanks
 Dimuthu
 -- 
 Regards
 W.Dimuthu Upeksha
 Undergraduate
 Department of Computer Science And Engineering
 University of Moratuwa, Sri Lanka

Re: TIKA-1300

2014-06-27 Thread Maruan Sahyoun

thanks for the pointer - very useful information.

BR
Maruan

Am 27.06.2014 um 08:18 schrieb Tilman Hausherr thaush...@t-online.de:

 Please look at TIKA-1300 https://issues.apache.org/jira/browse/TIKA-1300, 
 it is about the PDFBox sequential parser vs. the non-sequential parser

[jira] [Commented] (PDFBOX-2162) annotation that highlights a text is not visible in image (converted from the pdf)

2014-06-27 Thread Maruan Sahyoun (JIRA)

[
https://issues.apache.org/jira/browse/PDFBOX-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045802#comment-14045802
]

Maruan Sahyoun commented on PDFBOX-2162:

The annotations do not show up because no appearance stream (/AP) is defined
for them. The appearance defines how the annotation shall be rendered. It is an
optional entry, i.e. an annotation doesn’t need to have an appearance defined.

What Adobe Reader does is generate an appearance for annotations which do not
have one defined. That’s why it works after the file has been saved by Adobe
Reader.

You can use PDFBox to generate an appearance stream for the annotations.

annotation that highlights a text is not visible in image (converted from the
pdf)
--

Key: PDFBOX-2162
URL: https://issues.apache.org/jira/browse/PDFBOX-2162
Project: PDFBox
Issue Type: Bug
Components: Rendering
Affects Versions: 1.8.6
Environment: Java 1.7 and Eclipse Kepler
Reporter: Julien Savoyet
Fix For: 1.8.6

Attachments: myfile1.pdf, myfile1_re_saved.pdf

Hi, I'm trying to convert to images (png or jpeg) a PDF file in which I've
added an annotation on each page through the PDAnnotationTextMarkup
object.
I've used PDFImageWriter to do this conversion, and the images are
correctly generated except that the annotation has disappeared.
It seems that there is a mistake, because if for example I open my PDF file
with Acrobat Reader, re-save it as a new PDF file, and then rerun my image
conversion script on the new file, the generated images do contain the
annotations.


[jira] [Created] (PDFBOX-2165) NPE Exception with barcode ttf font

Tilman Hausherr created PDFBOX-2165:
---

 Summary: NPE Exception with barcode ttf font
 Key: PDFBOX-2165
 URL: https://issues.apache.org/jira/browse/PDFBOX-2165
 Project: PDFBox
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Tilman Hausherr


Inspired by this complaint
http://stackoverflow.com/a/24432822/535646
I tried loading barcode fonts with PDFBox, with the command
{code}
PDTrueTypeFont.loadTTF()
{code}

With the file Code39.ttf that I get here
http://www.myfont.de/download.php?winfont=bar-code-39-lesbartype=zip
I get this exception:
{code}
Exception in thread "main" java.lang.NullPointerException
    at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.makeFontDescriptor(PDTrueTypeFont.java:328)
    at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:127)
    at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadTTF(PDTrueTypeFont.java:80)
{code}





[jira] [Created] (PDFBOX-2166) AIOOB Exception with barcode ttf font

Tilman Hausherr created PDFBOX-2166:
---

 Summary: AIOOB Exception with barcode ttf font
 Key: PDFBOX-2166
 URL: https://issues.apache.org/jira/browse/PDFBOX-2166
 Project: PDFBox
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Tilman Hausherr


Inspired by this complaint
http://stackoverflow.com/a/24432822/535646
I tried loading barcode fonts with PDFBox, with the command
{code}
PDTrueTypeFont.loadTTF()
{code}

With the file free3of9.ttf that I can get here 
http://www.barcodesinc.com/free-barcode-font/
I get this exception:
{code}
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 42
    at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.makeFontDescriptor(PDTrueTypeFont.java:337)
    at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:127)
    at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadTTF(PDTrueTypeFont.java:80)
{code}





[jira] [Updated] (PDFBOX-2166) AIOOB Exception with barcode ttf font


 [ 
https://issues.apache.org/jira/browse/PDFBOX-2166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2166:


Attachment: free3of9.ttf

 AIOOB Exception with barcode ttf font
 -

 Key: PDFBOX-2166
 URL: https://issues.apache.org/jira/browse/PDFBOX-2166
 Project: PDFBox
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Tilman Hausherr
  Labels: 42, barcode, font, truetype
 Attachments: free3of9.ttf



[jira] [Updated] (PDFBOX-2165) NPE Exception with barcode ttf font


 [ 
https://issues.apache.org/jira/browse/PDFBOX-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2165:


Attachment: Code39.ttf

 NPE Exception with barcode ttf font
 ---

 Key: PDFBOX-2165
 URL: https://issues.apache.org/jira/browse/PDFBOX-2165
 Project: PDFBox
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Tilman Hausherr
  Labels: barcode, font, truetype
 Attachments: Code39.ttf


 Inspired by this complaint
 http://stackoverflow.com/a/24432822/535646
 I tried loading barcode fonts with PDFBox, with the command
 {code}
 PDTrueTypeFont.loadTTF()
 {code}
 With the file Code39.ttf that I get here
 http://www.myfont.de/download.php?winfont=bar-code-39-lesbartype=zip
 I get this exception:
 {code}
 Exception in thread main java.lang.NullPointerException
   at 
 org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.makeFontDescriptor(PDTrueTypeFont.java:328)
   at 
 org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.init(PDTrueTypeFont.java:127)
   at 
 org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadTTF(PDTrueTypeFont.java:80)
 {code}




[jira] [Assigned] (PDFBOX-2163) inline image with EI in the middle incorrectly parsed


 [ 
https://issues.apache.org/jira/browse/PDFBOX-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr reassigned PDFBOX-2163:
---

Assignee: Tilman Hausherr

 inline image with EI in the middle incorrectly parsed
 -

 Key: PDFBOX-2163
 URL: https://issues.apache.org/jira/browse/PDFBOX-2163
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr

 This PDF
 http://digitalcorpora.org/corp/nps/files/govdocs1/876/876636.pdf
 causes an exception because the end of an inline image is detected incorrectly. 
 The stream looks like this:
 {code}
 BI
   /W 452
   /H 169
   /BPC 8
   /CS /RGB
   /D [0.0 1.0 0.0 1.0 0.0 1.0]
   /F [/A85 /Fl]
 ID
 ..
 EI
 ..
 ...
 
 EI Q
 {code}
 Inline images are handled in PDFStreamParser. This is tricky: we look for 
 binary data after a candidate EI to check that it isn't an EI in the middle 
 of the image data, but here what follows isn't binary data, it's ASCII85 
 text. We also can't require an LF before the EI, because I remember a PDF at 
 work, created by a well-known company, that doesn't use one.




[jira] [Comment Edited] (PDFBOX-2163) inline image with EI in the middle incorrectly parsed


[ 
https://issues.apache.org/jira/browse/PDFBOX-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14045177#comment-14045177
 ] 

Tilman Hausherr edited comment on PDFBOX-2163 at 6/27/14 6:39 PM:
--

And more:
http://digitalcorpora.org/corp/nps/files/govdocs1/322/322313.pdf
http://digitalcorpora.org/corp/nps/files/govdocs1/258/258126.pdf



was (Author: tilman):
And another:
http://digitalcorpora.org/corp/nps/files/govdocs1/258/258126.pdf

I think we should use an alternative parsing strategy for Ascii85 encoded 
inline images, e.g. assuming that the EI is in a separate line.


[jira] [Commented] (PDFBOX-2163) inline image with EI in the middle incorrectly parsed


[ 
https://issues.apache.org/jira/browse/PDFBOX-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14046255#comment-14046255
 ] 

Tilman Hausherr commented on PDFBOX-2163:
-

Fixed for the trunk in http://svn.apache.org/r1606177

I now look in the output stream to see whether there are 70 ASCII85 bytes. If 
so, this EI doesn't count as the end of the image. All the files above now 
render properly, and so do all the older files with inline images that I kept.
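
The check described above can be sketched roughly as follows. This is only an illustration, not the code committed in r1606177: the class name, the method names and the exact character test are assumptions, and only the window of 70 bytes comes from the comment.

```java
import java.io.ByteArrayOutputStream;

final class Ascii85Lookbehind {
    // Window size taken from the "70 ascii85 bytes" mentioned in the comment.
    static final int WINDOW = 70;

    // ASCII85 data uses '!' (0x21) through 'u' (0x75), plus 'z' as shorthand
    // for four zero bytes.
    static boolean isAscii85Char(int b) {
        return (b >= '!' && b <= 'u') || b == 'z';
    }

    // True if the last WINDOW bytes collected before a candidate "EI" are all
    // ASCII85 characters, i.e. the candidate is probably inside the image data.
    static boolean precededByAscii85(ByteArrayOutputStream collected) {
        byte[] data = collected.toByteArray();
        if (data.length < WINDOW) {
            return false;
        }
        for (int i = data.length - WINDOW; i < data.length; i++) {
            if (!isAscii85Char(data[i] & 0xFF)) {
                return false;
            }
        }
        return true;
    }
}
```

Note that this sketch treats whitespace as non-ASCII85, so a line break inside the window makes it answer false; whether the real fix behaves the same way is not stated in the comment.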


[jira] [Updated] (PDFBOX-2163) inline image with EI in the middle incorrectly parsed


 [ 
https://issues.apache.org/jira/browse/PDFBOX-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2163:


Affects Version/s: 2.0.0
   1.8.7
   1.8.6


[jira] [Updated] (PDFBOX-2163) inline image with EI in the middle incorrectly parsed


 [ 
https://issues.apache.org/jira/browse/PDFBOX-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2163:


Labels: inline  (was: )


[jira] [Updated] (PDFBOX-2163) inline image with EI in the middle incorrectly parsed


 [ 
https://issues.apache.org/jira/browse/PDFBOX-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2163:


Fix Version/s: 2.0.0
   1.8.7


[jira] [Comment Edited] (PDFBOX-2163) inline image with EI in the middle incorrectly parsed


[ 
https://issues.apache.org/jira/browse/PDFBOX-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14045177#comment-14045177
 ] 

Tilman Hausherr edited comment on PDFBOX-2163 at 6/27/14 6:53 PM:
--

And more:
http://digitalcorpora.org/corp/nps/files/govdocs1/322/322313.pdf
http://digitalcorpora.org/corp/nps/files/govdocs1/258/258126.pdf
http://digitalcorpora.org/corp/nps/files/govdocs1/662/662062.pdf



was (Author: tilman):
And more:
http://digitalcorpora.org/corp/nps/files/govdocs1/322/322313.pdf
http://digitalcorpora.org/corp/nps/files/govdocs1/258/258126.pdf



[jira] [Comment Edited] (PDFBOX-2163) inline image with EI in the middle incorrectly parsed


[ 
https://issues.apache.org/jira/browse/PDFBOX-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14045177#comment-14045177
 ] 

Tilman Hausherr edited comment on PDFBOX-2163 at 6/27/14 7:46 PM:
--

And more:
http://digitalcorpora.org/corp/nps/files/govdocs1/322/322313.pdf
http://digitalcorpora.org/corp/nps/files/govdocs1/258/258126.pdf
http://digitalcorpora.org/corp/nps/files/govdocs1/662/662062.pdf
http://digitalcorpora.org/corp/nps/files/govdocs1/152/152584.pdf



was (Author: tilman):
And more:
http://digitalcorpora.org/corp/nps/files/govdocs1/322/322313.pdf
http://digitalcorpora.org/corp/nps/files/govdocs1/258/258126.pdf
http://digitalcorpora.org/corp/nps/files/govdocs1/662/662062.pdf



[jira] [Comment Edited] (PDFBOX-2163) inline image with EI in the middle incorrectly parsed


[ 
https://issues.apache.org/jira/browse/PDFBOX-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14045177#comment-14045177
 ] 

Tilman Hausherr edited comment on PDFBOX-2163 at 6/27/14 8:01 PM:
--

And more:
http://digitalcorpora.org/corp/nps/files/govdocs1/322/322313.pdf
http://digitalcorpora.org/corp/nps/files/govdocs1/258/258126.pdf
http://digitalcorpora.org/corp/nps/files/govdocs1/662/662062.pdf
http://digitalcorpora.org/corp/nps/files/govdocs1/152/152584.pdf
http://digitalcorpora.org/corp/nps/files/govdocs1/092/092448.pdf



was (Author: tilman):
And more:
http://digitalcorpora.org/corp/nps/files/govdocs1/322/322313.pdf
http://digitalcorpora.org/corp/nps/files/govdocs1/258/258126.pdf
http://digitalcorpora.org/corp/nps/files/govdocs1/662/662062.pdf
http://digitalcorpora.org/corp/nps/files/govdocs1/152/152584.pdf



Jenkins build became unstable: PDFBox-trunk » Apache PDFBox #1084

See 
https://builds.apache.org/job/PDFBox-trunk/org.apache.pdfbox$pdfbox/1084/changes

Jenkins build became unstable: PDFBox-trunk #1084

See https://builds.apache.org/job/PDFBox-trunk/1084/changes

[jira] [Updated] (PDFBOX-2165) NPE with barcode ttf font


 [ 
https://issues.apache.org/jira/browse/PDFBOX-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2165:


Summary: NPE with barcode ttf font  (was: NPE Exception with barcode ttf 
font)


[jira] [Updated] (PDFBOX-2166) AIOOBE with barcode ttf font


 [ 
https://issues.apache.org/jira/browse/PDFBOX-2166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2166:


Summary: AIOOBE with barcode ttf font  (was: AIOOB Exception with barcode 
ttf font)


[jira] [Updated] (PDFBOX-2167) NPE in PDTrueTypeFont.makeFontDescriptor


 [ 
https://issues.apache.org/jira/browse/PDFBOX-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2167:


Attachment: 268554.pdf

 NPE in PDTrueTypeFont.makeFontDescriptor
 

 Key: PDFBOX-2167
 URL: https://issues.apache.org/jira/browse/PDFBOX-2167
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing
Affects Versions: 2.0.0
Reporter: Tilman Hausherr
 Attachments: 268554.pdf


 I get an NPE with the file from
 http://digitalcorpora.org/corp/nps/files/govdocs1/268/268554.pdf
 {code}
 java.lang.NullPointerException
   at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.makeFontDescriptor(PDTrueTypeFont.java:292)
   at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getFontDescriptor(PDTrueTypeFont.java:150)
   at org.apache.pdfbox.pdmodel.font.PDFont.getFontWidth(PDFont.java:814)
 IOException for file 268554.pdf
   at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getFontWidth(PDTrueTypeFont.java:382)
   at org.apache.pdfbox.pdmodel.font.PDFont.getFontWidth(PDFont.java:312)
   at org.apache.pdfbox.pdmodel.font.PDFont.getSpaceWidth(PDFont.java:855)
   at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:328)
   at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:44)
   at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:521)
   at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:267)
   at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:226)
   at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:209)
   at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:174)
   at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:227)
   at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:160)
   at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:109)
 {code}
 I first thought it is the same as PDFBOX-2165, but it's a different line 
 number.




[jira] [Created] (PDFBOX-2167) NPE in PDTrueTypeFont.makeFontDescriptor

Tilman Hausherr created PDFBOX-2167:
---

 Summary: NPE in PDTrueTypeFont.makeFontDescriptor
 Key: PDFBOX-2167
 URL: https://issues.apache.org/jira/browse/PDFBOX-2167
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing
Affects Versions: 2.0.0
Reporter: Tilman Hausherr
 Attachments: 268554.pdf

I get an NPE with the file from
http://digitalcorpora.org/corp/nps/files/govdocs1/268/268554.pdf
{code}
java.lang.NullPointerException
    at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.makeFontDescriptor(PDTrueTypeFont.java:292)
    at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getFontDescriptor(PDTrueTypeFont.java:150)
    at org.apache.pdfbox.pdmodel.font.PDFont.getFontWidth(PDFont.java:814)
IOException for file 268554.pdf
    at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getFontWidth(PDTrueTypeFont.java:382)
    at org.apache.pdfbox.pdmodel.font.PDFont.getFontWidth(PDFont.java:312)
    at org.apache.pdfbox.pdmodel.font.PDFont.getSpaceWidth(PDFont.java:855)
    at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:328)
    at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:44)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:521)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:267)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:226)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:209)
    at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:174)
    at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:227)
    at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:160)
    at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:109)
{code}

I first thought it is the same as PDFBOX-2165, but it's a different line number.




[jira] [Comment Edited] (PDFBOX-2163) inline image with EI in the middle incorrectly parsed


[ 
https://issues.apache.org/jira/browse/PDFBOX-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14045177#comment-14045177
 ] 

Tilman Hausherr edited comment on PDFBOX-2163 at 6/27/14 8:56 PM:
--

And more:
http://digitalcorpora.org/corp/nps/files/govdocs1/322/322313.pdf
http://digitalcorpora.org/corp/nps/files/govdocs1/258/258126.pdf
http://digitalcorpora.org/corp/nps/files/govdocs1/662/662062.pdf
http://digitalcorpora.org/corp/nps/files/govdocs1/152/152584.pdf
http://digitalcorpora.org/corp/nps/files/govdocs1/092/092448.pdf
http://digitalcorpora.org/corp/nps/files/govdocs1/248/248066.pdf



was (Author: tilman):
And more:
http://digitalcorpora.org/corp/nps/files/govdocs1/322/322313.pdf
http://digitalcorpora.org/corp/nps/files/govdocs1/258/258126.pdf
http://digitalcorpora.org/corp/nps/files/govdocs1/662/662062.pdf
http://digitalcorpora.org/corp/nps/files/govdocs1/152/152584.pdf
http://digitalcorpora.org/corp/nps/files/govdocs1/092/092448.pdf



[jira] [Commented] (PDFBOX-2159) horizontal line above shaded text when printing on HP1320


[ 
https://issues.apache.org/jira/browse/PDFBOX-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14046594#comment-14046594
 ] 

John Hewson commented on PDFBOX-2159:
-

If you have a printer-specific problem you could try the 2.0 trunk and use the 
constructor:

{code}
PDFPrinter(PDDocument document, Scaling scaling, Orientation orientation, Paper paper, float dpi)
{code}

This allows the image to be rasterized before it is sent to the printer 
driver. You'll need to set a dpi, usually 300.

 horizontal line above shaded text when printing on HP1320
 -

 Key: PDFBOX-2159
 URL: https://issues.apache.org/jira/browse/PDFBOX-2159
 Project: PDFBox
  Issue Type: Bug
Affects Versions: 1.8.6, 1.8.7
Reporter: Tilman Hausherr
 Attachments: IMG_20140624_073954101.jpg, PDFBOX-2159-2.pdf, 
 PDFBOX-2159-2.ps, PDFBOX-2159.pdf, PDFBOX-2159.ps


 This is a follow-up to PDFBOX-2141 and, to some extent, PDFBOX-485. In the 
 latter, [~vbier] reported weird printing problems related to a specific part 
 of the code that I changed again recently. While the original problem is 
 gone, there is a new one that we discovered in a discussion in PDFBOX-2141. 
 It appears, e.g., on the 4th page of the file pslib-shading.pdf that is in 
 PDFBOX-1942: above the text there is a horizontal line (about 3mm thick) that 
 runs across the whole page.
 I created a new test file that I am attaching. It has two shaded text lines 
 and two unshaded ones, one of each with a standard 14 font and one with an 
 embedded Type 1 font.
 [~vbier], please tell me how many horizontal lines you get and where.




[jira] [Resolved] (PDFBOX-1875) Image and some text missing in rendered file


 [ 
https://issues.apache.org/jira/browse/PDFBOX-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson resolved PDFBOX-1875.
-

Resolution: Fixed

Yes that's it, the clipping path wasn't being intersected with the form's BBox. 
Good fix.

 Image and some text missing in rendered file
 

 Key: PDFBOX-1875
 URL: https://issues.apache.org/jira/browse/PDFBOX-1875
 Project: PDFBox
  Issue Type: Bug
  Components: Rendering
Affects Versions: 2.0.0
Reporter: Tilman Hausherr
  Labels: bbox
 Fix For: 2.0.0

 Attachments: PDFBOX-1861-bbox-bad.png, PDFBOX-1861-bbox-good.png, 
 PDFBOX-1861-bbox.pdf, pdfbox-1861-tracemonkey.pdf-6.png


 An image and some text are missing on page 6 of the tracemonkey.pdf file of 
 PDFBOX-1861.




[jira] [Commented] (PDFBOX-2104) Implement transparency groups

[
https://issues.apache.org/jira/browse/PDFBOX-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14046645#comment-14046645
]

John Hewson commented on PDFBOX-2104:
-

I did some syntactic cleaning up of the code in
[r1606281|http://svn.apache.org/r1606281].

Implement transparency groups
-

Key: PDFBOX-2104
URL: https://issues.apache.org/jira/browse/PDFBOX-2104
Project: PDFBox
Issue Type: Improvement
Components: Rendering
Affects Versions: 2.0.0
Reporter: Petr Slaby
Assignee: John Hewson
Labels: transparency
Fix For: 2.0.0

Attachments: 01_MTEXT_CS6.pdf, TransparencyGroups.1.patch,
TransparencyGroups.2.patch, TransparencyGroups.3.patch,
TransparencyGroups.patch

The attached PDF uses transparency groups, blending and soft masks to create
the rounded corners and shades behind images. It appears that these features
are not implemented in PDFBox. An implementation proposal is attached in
TransparencyGroups.patch. The basic idea is to create a buffered image, draw
the transparency group content onto it and then use the result to produce the
soft mask or draw the image on the original g2d.
Note: I am not the (only) author of the proposed change. It was developed in
our company a few years ago in sources based on a 1.7.x version of PDFBox,
mostly by someone who has since left. Over the years, merging the work done in
the PDFBox mainstream into our source base has become impossible due to many
refactorings and other far-reaching changes. Now we would like to go the
opposite way and, where possible, bring the changes and fixes we have made into
the PDFBox mainstream and start to use it in our installations.
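
The basic idea from the description (render the group into a buffered image, then draw that image onto the original g2d) can be sketched with plain Java2D as below. This is an illustration only and is not taken from the attached patches: the class and method names are hypothetical, the group content is passed in as a callback, and a single constant alpha stands in for the blending and soft-mask handling that a real implementation needs.

```java
import java.awt.AlphaComposite;
import java.awt.Composite;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.util.function.Consumer;

final class TransparencyGroupSketch {
    // Draws 'content' into an offscreen ARGB buffer and composites the
    // finished buffer onto 'target' in one step with the given alpha, so
    // overlapping shapes inside the group do not blend with each other.
    static void drawGroup(Graphics2D target, int width, int height,
                          float groupAlpha, Consumer<Graphics2D> content) {
        BufferedImage buffer =
                new BufferedImage(width, height, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = buffer.createGraphics();
        try {
            content.accept(g);   // the group's own drawing operations
        } finally {
            g.dispose();
        }
        Composite previous = target.getComposite();
        target.setComposite(
                AlphaComposite.getInstance(AlphaComposite.SRC_OVER, groupAlpha));
        target.drawImage(buffer, 0, 0, null);
        target.setComposite(previous);   // leave the caller's state untouched
    }
}
```

The same buffer could instead be read back to build a soft mask; that branch is left out here.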


Jenkins build is back to stable : PDFBox-trunk » Apache PDFBox #1085

See 
https://builds.apache.org/job/PDFBox-trunk/org.apache.pdfbox$pdfbox/1085/changes

Jenkins build is back to stable : PDFBox-trunk #1085

See https://builds.apache.org/job/PDFBox-trunk/1085/changes

[jira] [Comment Edited] (PDFBOX-2126) Optimize clipping

[
https://issues.apache.org/jira/browse/PDFBOX-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14046702#comment-14046702
]

John Hewson edited comment on PDFBOX-2126 at 6/28/14 2:42 AM:
--

I like what you're doing with this commit but when I looked closer I thought
that the handling of clipping paths needed some more significant improvements.
I've done some refactoring in [s1606283|http://svn.apache.org/s1606283] to
PDGraphicsState, see what you think.

I haven't applied any optimisations yet, you're welcome to update your patch
again. Feedback appreciated.

was (Author: jahewson):
I like what you're doing with this commit but when I looked at applying at I
thought that the handling of clipping paths needed some more significant
improvements. I've done some refactoring in
[s1606283|http://svn.apache.org/s1606283] to PDGraphicsState, see what you
think.

I haven't applied any optimisations yet, you're welcome to update your patch
again. Feedback appreciated.

Optimize clipping
-

Key: PDFBOX-2126
URL: https://issues.apache.org/jira/browse/PDFBOX-2126
Project: PDFBox
Issue Type: Improvement
Components: Rendering
Affects Versions: 2.0.0
Reporter: Petr Slaby
Attachments: ClipPath.1.patch, ClipPath.patch, example_010.pdf

As already stated in a TODO comment in PageDrawer, calling
Graphics2D#setClip() is time and memory consuming. The attached patch
optimizes clipping by calling Graphics2D#setClip() only if the clipping path
has changed. The effect depends on the document, e.g. the attached one
renders in 10.5s without the optimization and in 5.5s with it.
The clipping has to be re-applied whenever the transform in Graphics2D
changes. This is not checked for explicitly; instead, the implementation
depends on the cached value being reset manually. Currently this is only
needed in one place, when processing annotations (AcroForms). Also, the
implementation relies on the clipping path object stored in PDGraphicsState
never changing, so that a comparison using == can be used. This works fine,
but needs a bit of awareness in future changes. To make the design cleaner,
the clipping path could be made private to PDGraphicsState and thus
really immutable from outside.
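
The caching idea described above can be sketched as follows. This is not the code from ClipPath.patch: the class name, the reset() method and the call counter are hypothetical and exist only to show the identity comparison and the manual reset.

```java
import java.awt.Graphics2D;
import java.awt.Shape;

final class ClipCache {
    private Shape lastClip;   // clip object most recently passed to setClip()
    int setClipCalls;         // counter for demonstration purposes only

    // Calls Graphics2D#setClip() only when a different clip object arrives.
    // The comparison is by identity (==), which is only safe if a clip
    // object is never modified after it has been handed out.
    void applyClip(Graphics2D g, Shape clip) {
        if (clip != lastClip) {
            g.setClip(clip);
            lastClip = clip;
            setClipCalls++;
        }
    }

    // Must be called by hand whenever the Graphics2D transform changes,
    // because the cached clip is then no longer valid.
    void reset() {
        lastClip = null;
    }
}
```

Because the comparison is by identity, two equal but distinct Shape objects still trigger a new setClip() call. That costs time but is never wrong, whereas mutating a cached Shape in place would silently skip a needed update.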

--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (PDFBOX-2126) Optimize clipping