[jira] [Comment Edited] (PDFBOX-1991) Shading PaintContexts should not depend on the page height

2014-03-19 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940238#comment-13940238
 ] 

Tilman Hausherr edited comment on PDFBOX-1991 at 3/19/14 7:35 AM:
--

The branch where pageHeight is used comes from a code segment that was in 
radialShadingContext. However, that segment was deleted when Andreas committed 
the "Luis Bernardo changes" (PDFBOX-615), so I could have thought about deleting 
it as well. Btw, that code segment is never used for any of my test images; I 
just tested this.

So as a start, I removed that code in rev 1579148.

However...:
The comment says "the shading is used as pattern colorspace in combination with 
a fill-, stroke- or showText-operator". I remember that there are a few images 
(color_gradient.pdf and pslib-shading.pdf) that have never been rendered 
correctly. Both have a text with a shading pattern, one has also a line with a 
shading pattern. I wonder if we will need the height for these pages.


was (Author: tilman):
The branch where pageHeight is used comes from a code segment that was in 
radialShadingContext. However that segment was deleted when Andreas committed 
the "Luis Bernardo changes" (PDFBOX-615) so I could have thought about deleting 
it as well. Btw that code segment is never used for any of my test images, I 
just tested this.

So as a start, I removed that code in rev 1579148.

However...:
The comment says "the shading is used as pattern colorspace in combination with 
a fill-, stroke- or showText-operator". I remember that there are a few images 
that have never been rendered correctly. One is a text with a shading pattern, 
the other one a line with a shading pattern. I wonder if we will need the 
height for these pages.

> Shading PaintContexts should not depend on the page height
> --
>
> Key: PDFBOX-1991
> URL: https://issues.apache.org/jira/browse/PDFBOX-1991
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Rendering
>Affects Versions: 2.0.0
>Reporter: John Hewson
>Priority: Minor
>
> I'd like to remove the page height parameter from PDPattern as soon as 
> possible because of doubts over its safety (i.e. the current stream being 
> processed may be a pattern or a form, not a page). Before I do that we need 
> to remove its only use, which is...
> The page height is passed to all shading PaintContext subclasses but it is 
> only used in GouraudShadingContext. However, all other drawing in PDFBox is 
> done using the native PDF y-axis which is flipped via a call to 
> Graphics2D#scale(1, -1) but the following code in GouraudShadingContext flips 
> the y-axis manually:
> v.point = new Point.Double(v.point.getX(), pageHeight + xform.getTranslateY() 
> - v.point.getY());
> So it seems like this could be removed and the y-axis inversion done 
> elsewhere with either a Matrix, AffineTransform or Graphics2D#scale.
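
To illustrate the suggestion above, here is a minimal sketch (not PDFBox code) 
that compares the manual flip from the quoted line with an equivalent 
java.awt.geom.AffineTransform. The class name, the method names and the page 
height of 792 are assumptions made only for this example.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

public class YFlipSketch {
    // Manual flip in the style of the quoted GouraudShadingContext line.
    // pageHeight and translateY are illustrative parameters.
    static Point2D.Double flipManually(Point2D.Double p, double pageHeight, double translateY) {
        return new Point2D.Double(p.getX(), pageHeight + translateY - p.getY());
    }

    // The same mapping expressed once as a transform: mirror y, then shift up.
    static AffineTransform flipTransform(double pageHeight, double translateY) {
        AffineTransform at = AffineTransform.getTranslateInstance(0, pageHeight + translateY);
        at.scale(1, -1); // applied to points before the translation
        return at;
    }

    public static void main(String[] args) {
        Point2D.Double p = new Point2D.Double(100, 200);
        Point2D.Double manual = flipManually(p, 792, 0);
        Point2D viaTransform = flipTransform(792, 0).transform(p, null);
        System.out.println(manual + " vs " + viaTransform);
    }
}
```

With the flip held in a transform, the shading code itself would no longer 
need to know the page height.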



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: Removing processStream and processSubStream

2014-03-19 Thread Maruan Sahyoun
Hi,

in general I think that this is a valid change. From how I understand the 
rendering in PDF, Form, Text, Image and Pattern maintain their own matrix to map 
to user space, which is then transformed by the CTM to device space, so handling 
them specifically is fine and in line with the spec. I’d suggest that we make 
sure that the different “spaces” are defined properly within the code and refer 
to the PDF spec so that the code is easier to read, if this is not already the 
case. With so many changes it’s a good opportunity to enhance the documentation 
within the source code. Some of the old code enjoys very little documentation.

I wouldn’t remove processStream and processSubStream but deprecate them and 
remove them in the next major release, so as to keep the changes to a 
minimum. There are a number of very important changes in 2.0. The easier we can 
get people to use that version without too many changes to their own code, the 
better.

For 2.0, removing the deprecated stuff of 1.x is fine. Removing non-deprecated 
stuff should be avoided if possible. 
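
As a rough sketch of what such a deprecation could look like in Java (all 
names here are hypothetical stand-ins, not the PDFBox API): the old generic 
method stays, is marked @Deprecated, and delegates to the new specific one. 
The sketch assumes that a synthetic parent object is acceptable for old 
callers, which is exactly the point under discussion in this thread.

```java
public class DeprecationSketch {
    // Hypothetical stand-in for a page object; not the PDFBox PDPage class.
    public static class Page {
        public final String name;
        public Page(String name) { this.name = name; }
    }

    private Page current;

    /** New, specific entry point: the engine knows which parent it is processing. */
    public void processPage(Page page) {
        current = page;
    }

    /**
     * Old, generic entry point kept for source compatibility.
     *
     * @deprecated use {@link #processPage(Page)} instead; this variant has no
     *             real parent and wraps the stream in a placeholder page.
     */
    @Deprecated
    public void processStream(String streamName) {
        // Assumption: a placeholder parent is good enough for legacy callers.
        processPage(new Page(streamName));
    }

    public Page getCurrent() {
        return current;
    }

    public static void main(String[] args) {
        DeprecationSketch engine = new DeprecationSketch();
        engine.processStream("legacy content stream");
        System.out.println(engine.getCurrent().name);
    }
}
```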

For the rendering what might have been missed is taking the UserUnit entry in 
the page dictionary into account which might change the default user space. 
This was introduced in PDF 1.6. A good opportunity to read that entry and make 
sure that we handle it appropriately.
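
For illustration, a minimal sketch of how UserUnit could enter the scale 
calculation, assuming the usual default of 1/72 inch per user space unit. The 
class and method names are made up for this example and are not part of 
PDFBox.

```java
public class UserUnitSketch {
    static final double DEFAULT_UNITS_PER_INCH = 72.0;

    // Device pixels per user space unit for a given resolution.
    // userUnit is the page's UserUnit value (1.0 when the entry is absent).
    static double pixelsPerUserUnit(double dpi, double userUnit) {
        return dpi / DEFAULT_UNITS_PER_INCH * userUnit;
    }

    public static void main(String[] args) {
        System.out.println(pixelsPerUserUnit(72, 1.0));
        System.out.println(pixelsPerUserUnit(300, 2.0));
    }
}
```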

BR
Maruan Sahyoun

On 18.03.2014 at 20:46, John Hewson wrote:

> Hi All
> 
> I’m still working on getting Tiling Patterns to render correctly, and need to 
> make some changes to core PDFBox functionality in order to proceed. My 
> problem is that tiling patterns are defined in their parent stream’s initial 
> coordinate space, rather than the coordinate space defined by the CTM. 
> However, in PDFBox there is no way to access the parent stream, so I can’t 
> find out what its initial matrix is. The manner in which the initial 
> coordinate space is determined is different for pages, forms, and patterns.
> 
> What this means is that the parent stream’s initial coordinate space needs to 
> be passed to processStream and processSubStream in PDFStreamEngine. This will 
> necessarily be a breaking change, and it will affect all downstream 
> subclasses of PDFStreamEngine.
> 
> Because this has to be a breaking change, I propose that we go all the way 
> and make the new API bulletproof, 1) so that we won’t have to introduce 
> breaking changes in the future if we encounter similar issues, 2) so that the 
> caller of the method can’t pass the wrong data in the parameters. We would 
> remove the two generic methods:
> 
> public void processStream(PDResources resources, COSStream cosStream, 
> PDRectangle drawingSize, int rotation)
> public void processSubStream(PDResources resources, COSStream cosStream)
> 
> and replace them with four specific methods:
> 
> public void processPage(PDPage page)
> public void processForm(PDFormXObject form)
> public void processTilingPattern(PDTilingPattern pattern)
> public void processType3Font(PDType3Font font)
> 
> This would mean that the various “process” methods have access to their 
> parent stream, and can read any of its public fields in the future without 
> introducing breaking changes by altering the method’s parameters.
> 
> What do you think?
> 
> -- John
> 
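
The idea in the quoted proposal can be sketched roughly as follows. This is 
not PDFBox code: Page and TilingPattern are hypothetical stand-ins for PDPage 
and PDTilingPattern, and the matrix handling is an assumption based on the 
description above (pattern matrix applied first, then the parent stream's 
initial matrix, with the CTM not involved).

```java
import java.awt.geom.AffineTransform;

public class StreamEngineSketch {
    // Hypothetical stand-ins, not the PDFBox classes.
    public static class Page {
        public final AffineTransform initialMatrix;
        public Page(AffineTransform initialMatrix) { this.initialMatrix = initialMatrix; }
    }

    public static class TilingPattern {
        public final AffineTransform matrix;
        public TilingPattern(AffineTransform matrix) { this.matrix = matrix; }
    }

    // Initial matrix of the stream that is currently being processed.
    private AffineTransform initialMatrix = new AffineTransform();

    public void processPage(Page page) {
        initialMatrix = new AffineTransform(page.initialMatrix);
        // ... the operators of the page content stream would run here ...
    }

    // Returns the matrix mapping pattern space to device space.
    public AffineTransform processTilingPattern(TilingPattern pattern) {
        AffineTransform parent = initialMatrix;
        AffineTransform patternToDevice = new AffineTransform(parent);
        patternToDevice.concatenate(pattern.matrix); // pattern matrix is applied first
        initialMatrix = patternToDevice;             // the pattern stream starts from here
        // ... the operators of the pattern content stream would run here ...
        initialMatrix = parent;                      // restore for the parent stream
        return patternToDevice;
    }

    public static void main(String[] args) {
        StreamEngineSketch engine = new StreamEngineSketch();
        engine.processPage(new Page(AffineTransform.getScaleInstance(2, 2)));
        System.out.println(engine.processTilingPattern(
                new TilingPattern(AffineTransform.getTranslateInstance(10, 0))));
    }
}
```

Because each method receives the object itself, the engine can look up 
whatever it needs from the parent, and the caller has no separate parameters 
to get wrong.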



[jira] [Updated] (PDFBOX-1991) Shading PaintContexts should not depend on the page height

2014-03-19 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-1991:


Labels: shading shadingpattern  (was: )



[jira] [Commented] (PDFBOX-1991) Shading PaintContexts should not depend on the page height

2014-03-19 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940607#comment-13940607
 ] 

John Hewson commented on PDFBOX-1991:
-

Perhaps, though I'd expect if we need such a measurement then it would actually 
be the clipping rectangle (or similar), rather than the page height.



[jira] [Updated] (PDFBOX-1936) text outline with shading pattern is invisible

2014-03-19 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-1936:


Attachment: pslib-shading.pdf-4.png
color_gradient.pdf-1.png

Update on this:
It's not a regression; the text was black last year because at that time, 
pixels that weren't rendered were left black, and today such pixels are left 
transparent.

Nothing is shown because AxialShadingContext is created with the wrong 
AffineTransform, one that has giant scale values. That AffineTransform is 
related to the text rendering (drawString()) and is OK for that, but not for 
the shading. When I replace the wrong AffineTransform with a standard one for 
that dpi (e.g. [4.16, 0, 0][0, 4.16, 2508] for 300dpi) I get perfectly 
shaded text.
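
For reference, a small sketch of how a transform with the row layout quoted 
above can be built with java.awt.geom.AffineTransform, assuming the scale is 
dpi / 72 (300 / 72 is roughly 4.17). The class and method names are 
illustrative, and the translation of 2508 is simply the value from the 
comment, not something derived here.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

public class DpiTransformSketch {
    // Builds a transform whose rows read [s, 0, 0][0, s, translateY], with s = dpi / 72.
    static AffineTransform forDpi(double dpi, double translateY) {
        double s = dpi / 72.0;
        // Constructor argument order is m00, m10, m01, m11, m02, m12.
        return new AffineTransform(s, 0, 0, s, 0, translateY);
    }

    public static void main(String[] args) {
        AffineTransform at = forDpi(300, 2508);
        Point2D device = at.transform(new Point2D.Double(72, 72), null);
        System.out.println(at);
        System.out.println(device);
    }
}
```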

> text outline with shading pattern is invisible
> --
>
> Key: PDFBOX-1936
> URL: https://issues.apache.org/jira/browse/PDFBOX-1936
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.0
>Reporter: Tilman Hausherr
>  Labels: shading, shadingpattern
> Attachments: color_gradient.pdf, color_gradient.pdf-1.png, 
> color_gradient.pdf-1.png, pslib-shading.pdf, pslib-shading.pdf-4.png, 
> pslib-shading.pdf-4.png
>
>
> This is also somewhat of a regression: in PDFBOX-615, the file 
> color_gradient.pdf-1.png had the text rendered, although in black. Currently, 
> the text is invisible.





Re: Removing processStream and processSubStream

2014-03-19 Thread Maruan Sahyoun
John,

On 19.03.2014 at 18:15, John Hewson wrote:

> Maruan
> 
>> From how I understand the rendering in PDF, Form, Text, Image and Pattern 
>> maintain their own matrix to map to user space, which is then transformed by 
>> the CTM to device space, so handling them specifically is fine and in line 
>> with the spec.
> 
> No, that’s not right, what I said was:
> 
>>> My problem is that tiling patterns are defined in their parent stream’s 
>>> initial coordinate space, rather than the
>>> coordinate space defined by the CTM.
> 
> So patterns should *not* be using the CTM, which is what I’m trying to 
> achieve.
> 

I think you misunderstood what I wrote - patterns have their own matrix - so I 
think we are on the same page here. IMHO according to the spec CTM transforms 
from user space to device space. So it’s pattern space -> user space -> device 
space.


>> I’d suggest that we make sure that the different ‚spaces‘ are defined 
>> properly within the code and refer to the PDF spec so that the code is 
>> easier to read if this is not already the case. With so many changes it’s a 
>> good opportunity to enhance the documentation within the source code. Some 
>> of the old code enjoys very little documentation.
> 
> 
> I disagree, in general I don’t think that references to the PDF spec are a 
> good form of documentation (there are some exceptions). References to the 
> spec are meaningless to the reader unless they take the time to look them up 
> in a 700 page PDF document. I would argue that by just linking back to the 
> spec, we have *failed* to document PDFBox, not succeeded.
> 
> References to the PDF spec have another major flaw: they go out-of-date. For 
> example a Pattern Colour Space will always be called “Pattern Colour Space” 
> in future versions of the PDF spec but it may not be described in paragraph 
> 8.6.6.2 or on page 156. The existing code contains many references to the PDF 
> 1.6 and 1.7 specs as well as the ISO PDF32000 spec, which means that I need 
> three 700 page PDF files open at all times in order to look up PDFBox 
> references. With the new version of the PDF spec due this year, this 
> situation is going to get worse.
> 

I didn’t mean to only reference the spec, but to use the same terms as 
described by the spec. Adding references to the spec is an add-on, not a 
replacement.

> I agree that some of the existing code needs more documentation, and I often 
> add documentation to old files which I’m working on. However, my approach is 
> to just paste in a sentence or two from the PDF spec (fair use). That way the 
> reader does not ever need to look at the PDF spec. Because we use the same 
> terminology in PDFBox as in the spec, if someone really wants to look 
> something up, it’s as simple as Ctrl+F, no reference needed, and it’s 
> guaranteed not to go out-of-date.
> 
>> I wouldn’t remove processStream and processSubStream but deprecate them and 
>> remove them in the next major release though as to keep the changes to a 
>> minimum.
> 
> This isn’t possible, as I said it “will necessarily be a breaking change”. 
> This is because in 2.0 PDFStreamEngine needs to know the parent of each 
> stream, but processStream and processSubStream do not provide this 
> information. That’s why I’m discussing this on the mailing list.

I don’t understand why this shouldn’t be possible. It’s more effort, agreed, 
but beneficial.

> 
>> For the rendering what might have been missed is taking the UserUnit entry 
>> in the page dictionary into account which might change the default user 
>> space. This was introduced in PDF 1.6. A good opportunity to read that entry 
>> and make sure that we handle it appropriately.
> 
> Yes, I have this as a “todo” in my working copy. However, if we put the 
> UserUnit in the matrix then we should also put the page Rotation into the 
> matrix, but that’s a significant change.
> 
> -- John



[jira] [Commented] (PDFBOX-1936) text outline with shading pattern is invisible

2014-03-19 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940718#comment-13940718
 ] 

John Hewson commented on PDFBOX-1936:
-

So it just occurred to me that shading patterns are, well, patterns. This means 
that the section of the PDF spec "General Properties of Patterns" applies to 
them, just as it applies to Tiling Patterns. So the matrix for a shading 
pattern also needs to be calculated using its parent stream's initial 
transform, instead of using the CTM. This is what I'm currently working on for 
PDFBOX-1094, and I am discussing the needed changes on the mailing list.



[jira] [Comment Edited] (PDFBOX-1936) text outline with shading pattern is invisible

2014-03-19 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940718#comment-13940718
 ] 

John Hewson edited comment on PDFBOX-1936 at 3/19/14 5:28 PM:
--

So it just occurred to me that shading patterns are, well, patterns. This means 
that the section of the PDF spec "General Properties of Patterns" applies to 
them, just as it applies to Tiling Patterns. So the matrix for a shading 
pattern also needs to be calculated using its parent stream's initial 
transform, instead of using the CTM. This is what I'm currently working on for 
PDFBOX-1094, and I am discussing the needed changes on the mailing list, 
see "Removing processStream and processSubStream".


was (Author: jahewson):
So it just occurred to me that shading patterns are, well, patterns. This means 
that the section of the PDF spec "General Properties of Patterns" applies to 
them, just as it applied to Tiling Patterns. So the matrix for a shading 
pattern also needs to be calculated using it's parent stream's initial 
transform, instead of using the CTM. This is what I'm working on for 
PDFBOX-1094 correctly and am discussing the needed changes on the mailing list.



Re: Removing processStream and processSubStream

2014-03-19 Thread John Hewson
Maruan

> From how I understand the rendering in PDF, Form, Text, Image and Pattern 
> maintain their own matrix to map to user space, which is then transformed by 
> the CTM to device space, so handling them specifically is fine and in line 
> with the spec.

No, that’s not right, what I said was:

>> My problem is that tiling patterns are defined in their parent stream’s 
>> initial coordinate space, rather than the
>> coordinate space defined by the CTM.


So patterns should *not* be using the CTM, which is what I’m trying to achieve.

> I’d suggest that we make sure that the different ‚spaces‘ are defined 
> properly within the code and refer to the PDF spec so that the code is easier 
> to read if this is not already the case. With so many changes it’s a good 
> opportunity to enhance the documentation within the source code. Some of the 
> old code enjoys very little documentation.


I disagree, in general I don’t think that references to the PDF spec are a good 
form of documentation (there are some exceptions). References to the spec are 
meaningless to the reader unless they take the time to look them up in a 700 
page PDF document. I would argue that by just linking back to the spec, we have 
*failed* to document PDFBox, not succeeded.

References to the PDF spec have another major flaw: they go out-of-date. For 
example a Pattern Colour Space will always be called “Pattern Colour Space” in 
future versions of the PDF spec but it may not be described in paragraph 
8.6.6.2 or on page 156. The existing code contains many references to the PDF 
1.6 and 1.7 specs as well as the ISO PDF32000 spec, which means that I need 
three 700 page PDF files open at all times in order to look up PDFBox 
references. With the new version of the PDF spec due this year, this situation 
is going to get worse.

I agree that some of the existing code needs more documentation, and I often 
add documentation to old files which I’m working on. However, my approach is to 
just paste in a sentence or two from the PDF spec (fair use). That way the 
reader does not ever need to look at the PDF spec. Because we use the same 
terminology in PDFBox as in the spec, if someone really wants to look something 
up, it’s as simple as Ctrl+F, no reference needed, and it’s guaranteed not to 
go out-of-date.

> I wouldn’t remove processStream and processSubStream but deprecate them and 
> remove them in the next major release though as to keep the changes to a 
> minimum.

This isn’t possible, as I said it “will necessarily be a breaking change”. This 
is because in 2.0 PDFStreamEngine needs to know the parent of each stream, but 
processStream and processSubStream do not provide this information. That’s why 
I’m discussing this on the mailing list.

> For the rendering what might have been missed is taking the UserUnit entry in 
> the page dictionary into account which might change the default user space. 
> This was introduced in PDF 1.6. A good opportunity to read that entry and 
> make sure that we handle it appropriately.

Yes, I have this as a “todo” in my working copy. However, if we put the 
UserUnit in the matrix then we should also put the page Rotation into the 
matrix, but that’s a significant change.

-- John

[jira] [Comment Edited] (PDFBOX-1936) text outline with shading pattern is invisible

2014-03-19 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940692#comment-13940692
 ] 

Tilman Hausherr edited comment on PDFBOX-1936 at 3/19/14 5:22 PM:
--

Update on this:
It's not a regression; the text was black last year because at that time, 
pixels that weren't rendered were left black, and today such pixels are left 
transparent.

Nothing is shown because AxialShadingContext is created with the wrong 
AffineTransform, one that has giant scale values. That AffineTransform is 
related to the text rendering (drawString()) and is OK for that, but not for 
the shading. When I replace the wrong AffineTransform with a standard one for 
that dpi (e.g. [4.16, 0, 0][0, 4.16, 2508] for 300dpi) I get perfectly 
shaded text.


was (Author: tilman):
Update on this:
It's not a regression; the text was black last year because at that time, 
pixels there weren't rendered were left black, and today such pixels are left 
transparent.

The reason that nothing is shown is because AxialShadingContext is created with 
a wrong AffineTransform, that has giant scale values. That AffineTransform is 
related to the text rendering (drawString()) and is OK for that, but not for 
the shading. When I replace the wrong AffineTransform with a standard one for 
that dpi (e.g.  [4.16, 0, 0][0, 4.16, 2508] for 300dpi) I get a perfectly 
shaded text.



Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-03-19 Thread John Hewson
Hi Dimuthu

This is a good start. One point to address is that a String in Java is encoded 
as UTF-16, so your getUTF8Text() method must be doing something wrong. It 
should perform a UTF-16 conversion internally and be renamed to getText(). You 
can probably do the conversion in Java rather than in C++ (or maybe Tesseract 
can return UTF-16?).
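
A minimal sketch of doing that conversion on the Java side, assuming the 
native call hands back the raw UTF-8 bytes as a byte[] (the class and method 
names below are made up for illustration and are not part of the wrapper):

```java
import java.nio.charset.StandardCharsets;

public class Utf8ToStringSketch {
    // Decodes UTF-8 bytes (e.g. as returned by a native library) into a Java String,
    // which is UTF-16 internally.
    static String decodeUtf8(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] fromNative = {(byte) 0xC3, (byte) 0xA9}; // UTF-8 encoding of U+00E9
        String text = decodeUtf8(fromNative);
        System.out.println(text.length());
    }
}
```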

Cheers

-- John

On 16 Mar 2014, at 06:15, DImuthu Upeksha  wrote:

> Hi John,
> 
> For now I'm using those methods to debug the wrapper. I'll remove
> those methods after I finished testing it.
> 
> I started implementing OCR-plugin [1] for PDFBox. Currently it
> satisfies basic requirements such as getting word+location data [2].
> Please have a look at that and let me know if any changes are
> required.
> 
> [1] https://github.com/DImuthuUpe/OCR-Plugin
> [2] 
> https://github.com/DImuthuUpe/OCR-Plugin/blob/master/src/main/java/org/apache/pdfbox/ocr/OCRConnector.java
> 
> Thanks
> Dimuthu
> 
> On Fri, Mar 14, 2014 at 12:09 AM, John Hewson  wrote:
>> Thanks, I saw your new refactoring too, it's good. Now the following methods 
>> are no longer needed:
>> 
>> public void setImagePath(String path)
>> public void setImage(byte[] imagedata, int width, int height, int bpp,int 
>> bpl)
>> 
>> Cheers
>> 
>> -- John
>> 
>> On 11 Mar 2014, at 22:58, DImuthu Upeksha  wrote:
>> 
>>> Hi John,
>>> Yes. I implemented a new method that accepts the byte stream of the image
>>> as input. We can't send BufferedImage objects directly to the native side,
>>> so I convert the BufferedImage into a byte array and pass that to the
>>> native side, where it is converted back into a compatible format. With
>>> that request we need to pass some metadata about the byte stream: image
>>> width, height, bytes per pixel and bytes per row. I checked it with this
>>> [2] test case and it works fine.
>>> 
>>> [1] 
>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/java/com/apache/pdfbox/ocr/tesseract/TessBaseAPI.java#L74
>>> [2] 
>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/test/java/com/apache/pdfbox/ocr/tesseract/TessByteSteamTest.java
>>> 
>>> Thanks
>>> Dimuthu
>>> 
>>> On Wed, Mar 12, 2014 at 12:40 AM, John Hewson  wrote:
>>>> Hi Dimuthu
>>>> 
>>>> The Tesseract wrapper needs to take its input from a BufferedImage rather 
>>>> than reading a file from disk, so instead of:
>>>> 
>>>> api.setImagePath("test.tif");
>>>> 
>>>> What we need is:
>>>> 
>>>> BufferedImage image = ImageIO.read(new File("test.tif"));
>>>> api.setImagePath(image);
>>>> 
>>>> Because this will let us use the BufferedImage generated by PDFRenderer 
>>>> without round-tripping to the disk.
>>>> 
>>>> -- John
>>>> 
>>>> On 11 Mar 2014, at 11:13, DImuthu Upeksha wrote:
>>>> 
>>>>> Hi John,
>>>>> Thanks for the guidance.
>>>>> I did a small analysis of the accuracy and performance of the new
>>>>> Tesseract wrapper. I used this [1] image as the input image and got the
>>>>> following data [2] after OCR. The first line is the recognised word,
>>>>> followed by the location details (bounding box) of the word. I think
>>>>> these details are pretty much enough for our task. What remains now is
>>>>> converting a PDF file into an image, as you mentioned. I'm working on
>>>>> that these days.
>>>>> 
>>>>> [1] https://www.dropbox.com/s/11wahtonoz08zmn/image4.TIF
>>>>> [2] https://gist.github.com/DImuthuUpe/9491660
>>>>> 
>>>>> Thanks
>>>>> Dimuthu
>>>>> 
>>>>> On Mon, Mar 10, 2014 at 2:30 PM, John Hewson wrote:
>>>>>> Dimuthu,
>>>>>> 
>>>>>>> I finished a basic implementation of the JNI wrapper for Tesseract. Now
>>>>>>> it can be built using maven. Some useful methods that are needed to do
>>>>>>> basic OCR were implemented.
>>>>>> 
>>>>>> Great, it's looking good, nice and clean.
>>>>>> 
>>>>>>> 1. What is the task of processStream method in PDFTextStripper class 
>>>>>>> line 456 : processStream( page.findResources(), content,
>>>>>>> page.findCropBox(), page.findRotation() );
>>>>>> 
>>>>>> A PDF file is made up of pages, each of which contains a "content 
>>>>>> stream". This content stream contains a list of drawing commands such as 
>>>>>> "move to 10,15" or "write the word `foo`", these are called operators. 
>>>>>> The processStream function reads the stream for the current page and 
>>>>>> executes each of the operators. The operators themselves are implemented 
>>>>>> each in their own class which is a subclass of PDFOperator. The 
>>>>>> constructor of PDFStreamEngine creates the operator classes using 
>>>>>> reflection, which is rather odd and I'm not sure why this design was 
>>>>>> chosen. The operators used by PDFTextStripper can be found in 
>>>>>> org/apache/pdfbox/resources/PDFTextStripper.properties
>>>>>> 
>>>>>>> 2. Say I need to extract images and its metadata from a pdf. What is 
>>>>>>> the better approach to do it?
>>>>>> 
>>>>>> You could subclass PDFTextStripper and override the startDocument method 
>>>>>> and use it to create a PDFRenderer and
Re: Removing processStream and processSubStream

2014-03-19 Thread Maruan Sahyoun
as an added note - initially you suggested

public void processTilingPattern(PDTilingPattern pattern) 

but as Patterns in general have their own matrix I think it applies to all 
patterns, that’s why I wrote „… Form, Text, Image and Pattern maintain …“

BR
Maruan

On 19.03.2014 at 18:31, Maruan Sahyoun wrote:

> John,
> 
> On 19.03.2014 at 18:15, John Hewson wrote:
> 
>> Maruan
>> 
>>> From how I understand the rendering in PDF, Form, Text, Image and Pattern 
>>> maintain their own matrix to map to user space, which is then transformed by 
>>> the CTM to device space, so handling them specifically is fine and in line 
>>> with the spec.
>> 
>> No, that’s not right, what I said was:
>> 
>>>> My problem is that tiling patterns are defined in their parent stream’s 
>>>> initial coordinate space, rather than the coordinate space defined by the 
>>>> CTM.
>> 
>> So patterns should *not* be using the CTM, which is what I’m trying to 
>> achieve.
>> 
> 
> I think you misunderstood what I wrote - patterns have their own matrix - so 
> I think we are on the same page here. IMHO according to the spec CTM 
> transforms from user space to device space. So it’s pattern space -> user 
> space -> device space.
> 
> 
>>> I’d suggest that we make sure that the different ‚spaces‘ are defined 
>>> properly within the code and refer to the PDF spec so that the code is 
>>> easier to read if this is not already the case. With so many changes it’s a 
>>> good opportunity to enhance the documentation within the source code. Some 
>>> of the old code enjoys very little documentation.
>> 
>> 
>> I disagree, in general I don’t think that references to the PDF spec are a 
>> good form of documentation (there are some exceptions). References to the 
>> spec are meaningless to the reader unless they take the time to look them up 
>> in a 700 page PDF document. I would argue that by just linking back to the 
>> spec, we have *failed* to document PDFBox, not succeeded.
>> 
>> References to the PDF spec have another major flaw: they go out-of-date. For 
>> example a Pattern Colour Space will always be called “Pattern Colour Space” 
>> in future versions of the PDF spec but it may not be described in paragraph 
>> 8.6.6.2 or on page 156. The existing code contains many references to the 
>> PDF 1.6 and 1.7 specs as well as the ISO PDF32000 spec, which means that I 
>> need three 700 page PDF files open at all times in order to look up PDFBox 
>> references. With the new version of the PDF spec due this year, this 
>> situation is going to get worse.
>> 
> 
> I didn’t mean to only reference the spec, but to use the same terms as 
> described by the spec. Adding references to the spec is an add-on, not a 
> replacement.
> 
>> I agree that some of the existing code needs more documentation, and I often 
>> add documentation to old files which I’m working on. However, my approach is 
>> to just paste in a sentence or two from the PDF spec (fair use). That way 
>> the reader does not ever need to look at the PDF spec. Because we use the 
>> same terminology in PDFBox as in the spec, if someone really wants to look 
>> something up, it’s as simple as Ctrl+F, no reference needed, and it’s 
>> guaranteed not to go out-of-date.
>> 
>>> I wouldn’t remove processStream and processSubStream but deprecate them and 
>>> remove them in the next major release though as to keep the changes to a 
>>> minimum.
>> 
>> This isn’t possible, as I said it "will necessarily be a breaking change”. 
>> This is because in 2.0 PDFStreamEngine needs to know the parent of each 
>> stream, but processStream and processSubStream do not provide this 
>> information. That’s why I’m discussing this on the mailing list.
> 
> I don’t understand why this shouldn’t be possible. It’s more effort, 
> agreed, but beneficial.
> 
>> 
>>> For the rendering what might have been missed is taking the UserUnit entry 
>>> in the page dictionary into account which might change the default user 
>>> space. This was introduced in PDF 1.6. A good opportunity to read that 
>>> entry and make sure that we handle it appropriately.
>> 
>> Yes, I have this as a “todo” in my working copy, however, if we put the 
>> UserUnit in the matrix then we should also put the page Rotation into the 
>> matrix, but that’s a significant change.
>> 
>> -- John
> 



[jira] [Comment Edited] (PDFBOX-1936) text outline with shading pattern is invisible

2014-03-19 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940692#comment-13940692
 ] 

Tilman Hausherr edited comment on PDFBOX-1936 at 3/19/14 5:24 PM:
--

Update on this:
It's not a regression; the text was black last year because at that time, 
pixels that weren't rendered were left black, and today such pixels are left 
transparent.

The reason that nothing is shown is because AxialShadingContext is created with 
a wrong AffineTransform, that has giant scale values. That AffineTransform is 
related to the text rendering (drawString()) and is OK for that, but not for 
the shading. When I replace the wrong AffineTransform with a standard one for 
that dpi (e.g.  [4.16, 0, 0][0, 4.16, 2508] for 300dpi) I get a perfectly 
shaded text.
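
A minimal sketch of where the "standard" values above come from, assuming the PDF default of 72 user space units per inch: 300 / 72 is about 4.1667, which matches the 4.16 in the example matrix. The class and method names are hypothetical and are not part of PDFBox; the translation 2508 is simply copied from the comment.

```java
import java.awt.geom.AffineTransform;

final class DpiTransformSketch
{
    /** Default PDF user space has 72 units per inch. */
    static final double UNITS_PER_INCH = 72.0;

    /** Uniform scale that maps user space units to pixels at the given dpi. */
    static double scaleFor(double dpi)
    {
        return dpi / UNITS_PER_INCH;
    }

    /** Builds [s, 0, 0][0, s, translateY], the layout quoted in the comment. */
    static AffineTransform pageTransform(double dpi, double translateY)
    {
        double s = scaleFor(dpi);
        // Constructor argument order is (m00, m10, m01, m11, m02, m12).
        return new AffineTransform(s, 0, 0, s, 0, translateY);
    }

    public static void main(String[] args)
    {
        AffineTransform at = pageTransform(300, 2508);
        System.out.println(at.getScaleX());
        System.out.println(at.getTranslateY());
    }
}
```

This only reproduces the arithmetic; it does not explain why the text rendering path hands a different transform to the shading context.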

That's of course only a brute force solution that won't find a place in the 
sources. However I wonder why this is so.


was (Author: tilman):
Update on this:
It's not a regression; the text was black last year because at that time, 
pixels that weren't rendered were left black, and today such pixels are left 
transparent.

The reason that nothing is shown is because AxialShadingContext is created with 
a wrong AffineTransform, that has giant scale values. That AffineTransform is 
related to the text rendering (drawString()) and is OK for that, but not for 
the shading. When I replace the wrong AffineTransform with a standard one for 
that dpi (e.g.  [4.16, 0, 0][0, 4.16, 2508] for 300dpi) I get a perfectly 
shaded text.

> text outline with shading pattern is invisible
> --
>
> Key: PDFBOX-1936
> URL: https://issues.apache.org/jira/browse/PDFBOX-1936
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.0
>Reporter: Tilman Hausherr
>  Labels: shading, shadingpattern
> Attachments: color_gradient.pdf, color_gradient.pdf-1.png, 
> color_gradient.pdf-1.png, pslib-shading.pdf, pslib-shading.pdf-4.png, 
> pslib-shading.pdf-4.png
>
>
> This is also somewhat of a regression: in PDFBOX-615, the file 
> color_gradient.pdf-1.png had the text rendered, although in black. Currently, 
> the text is invisible.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (PDFBOX-1991) Shading PaintContexts should not depend on the page height

2014-03-19 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940607#comment-13940607
 ] 

John Hewson edited comment on PDFBOX-1991 at 3/19/14 4:02 PM:
--

Perhaps, though I'd expect if we need such a measurement then it would actually 
be the clipping rectangle (or similar), rather than the page height. Let's 
leave this issue open for now until we're 100% sure that the page height isn't 
needed.


was (Author: jahewson):
Perhaps, though I'd expect if we need such a measurement then it would actually 
be the clipping rectangle (or similar), rather than the page height.

> Shading PaintContexts should not depend on the page height
> --
>
> Key: PDFBOX-1991
> URL: https://issues.apache.org/jira/browse/PDFBOX-1991
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Rendering
>Affects Versions: 2.0.0
>Reporter: John Hewson
>Priority: Minor
>  Labels: shading, shadingpattern
>
> I'd like to remove the page height parameter from PDPattern as soon as 
> possible because of doubts over its safety (i.e. the current stream being 
> processed may be a pattern or a form, not a page). Before I do that we need 
> to remove its only use, which is...
> The page height is passed to all shading PaintContext subclasses but it is 
> only used in GouraudShadingContext. However, all other drawing in PDFBox is 
> done using the native PDF y-axis which is flipped via a call to 
> Graphics2D#scale(1, -1) but the following code in GouraudShadingContext flips 
> the y-axis:
> v.point = new Point.Double(v.point.getX(), pageHeight + xform.getTranslateY() 
> - v.point.getY());
> So it seems like this could be removed and the y-axis inversion done 
> elsewhere with either a Matrix, AffineTransform or Graphics2D#scale.
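
A minimal sketch of doing the y-axis inversion with an AffineTransform instead of per-vertex arithmetic, as the description suggests. The class name and the `height` parameter are hypothetical; which height (page, clipping rectangle, or something else) is appropriate is exactly what this issue leaves open.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

final class YFlipSketch
{
    /** Returns a transform that maps (x, y) to (x, height - y). */
    static AffineTransform flipY(double height)
    {
        AffineTransform at = new AffineTransform();
        // Transforms are applied to points in reverse order of the calls:
        // the scale negates y first, then the translate shifts it by height.
        at.translate(0, height);
        at.scale(1, -1);
        return at;
    }

    public static void main(String[] args)
    {
        Point2D flipped = flipY(100).transform(new Point2D.Double(10, 30), null);
        System.out.println(flipped.getX() + ", " + flipped.getY());
    }
}
```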





Re: Removing processStream and processSubStream

2014-03-19 Thread John Hewson
Yes, one of those regressions is to Tiling Patterns but this refactoring is 
needed to fix it,
that’s why it’s taking so long.

-- John

On 19 Mar 2014, at 10:49, Tilman Hausherr  wrote:

> I don't yet have a complete understanding of the project so I can't comment 
> much on the contents, but my three cents:
> 
> - if the refactoring is based on the promise to solve problem XXX, then if 
> that problem XXX isn't solved after the refactoring, then it shouldn't be 
> done, or should be reversed
> - there are still regressions from the previous refactorings. The longer you 
> wait, the less you'll remember what you did
> 
> Nevertheless, at first sight, the refactorings seem to make sense. I always 
> like the idea of having access to information from above.
> 
> Tilman
> 
> Am 18.03.2014 20:46, schrieb John Hewson:
>> Hi All
>> 
>> I’m still working on getting Tiling Patterns to render correctly, and need 
>> to make some
>> changes to core PDFBox functionality in order to proceed. My problem is that 
>> tiling
>> patterns are defined in their parent stream’s initial coordinate space, 
>> rather than the
>> coordinate space defined by the CTM. However, in PDFBox there is no way to 
>> access
>> the parent stream, so I can’t find out what its initial matrix is. The 
>> manner in which the
>> initial coordinate space is determined is different for pages, forms, and 
>> patterns.
>> 
>> What this means is that the parent stream’s initial coordinate space needs 
>> to be passed
>> to processStream and processSubStream in PDFStreamEngine. This will 
>> necessarily be
>> a breaking change, and it will affect all downstream subclasses of 
>> PDFStreamEngine.
>> 
>> Because this has to be a breaking change, I propose that we go all the way 
>> and make
>> the new API bulletproof, 1) so that we won’t have to introduce breaking 
>> changes in the
>> future if we encounter similar issues, 2) so that the caller of the method 
>> can’t pass the
>> wrong data in the parameters. We would remove the two generic methods:
>> 
>> public void processStream(PDResources resources, COSStream cosStream, 
>> PDRectangle drawingSize, int rotation)
>> public void processSubStream(PDResources resources, COSStream cosStream)
>> 
>> and replace them with four specific methods:
>> 
>> public void processPage(PDPage page)
>> public void processForm(PDFormXObject form)
>> public void processTilingPattern(PDTilingPattern pattern)
>> public void processType3Font(PDType3Font font)
>> 
>> This would mean that the various “process” methods have access to their 
>> parent
>> stream, and can read any of its public fields in the future without 
>> introducing breaking
>> changes by altering the method’s parameters.
>> 
>> What do you think?
>> 
>> -- John
>> 
>> 
> 



Re: Removing processStream and processSubStream

2014-03-19 Thread John Hewson
Yes, this was just mentioned in PDFBOX-1936, it is indeed 
processPattern(PDPattern)
which is in fact needed.

-- John

On 19 Mar 2014, at 10:39, Maruan Sahyoun  wrote:

> as an added note - initially you suggested
> 
> public void processTilingPattern(PDTilingPattern pattern) 
> 
> but as Patterns in general have their own matrix I think it applies to all 
> patterns, that’s why I wrote „… Form, Text, Image and Pattern maintain …“
> 
> BR
> Maruan
> 
> Am 19.03.2014 um 18:31 schrieb Maruan Sahyoun :
> 
>> John,
>> 
>> Am 19.03.2014 um 18:15 schrieb John Hewson :
>> 
>>> Maruan
>>> 
>>>> From how I understand the rendering in PDF Form, Text, Image and Pattern 
>>>> maintain their own matrix to map to user space which is then transformed 
>>>> by the CTM to device space so handling them specifically is fine and 
>>>> inline with the spec.
>>> 
>>> No, that’s not right, what I said was:
>>> 
>>>>> My problem is that tiling patterns are defined in their parent stream’s 
>>>>> initial coordinate space, rather than the
>>>>> coordinate space defined by the CTM.
>>> 
>>> So patterns should *not* be using the CTM, which is what I’m trying to 
>>> achieve.
>>> 
>> 
>> I think you misunderstood what I wrote - patterns have their own matrix - so 
>> I think we are on the same page here. IMHO according to the spec CTM 
>> transforms from user space to device space. So it’s pattern space -> user 
>> space -> device space.
>> 
>> 
>>>> I’d suggest that we make sure that the different ‚spaces‘ are defined 
>>>> properly within the code and refer to the PDF spec so that the code is 
>>>> easier to read if this is not already the case. With so many changes it’s 
>>>> a good opportunity to enhance the documentation within the source code. 
>>>> Some of the old code enjoys very little documentation.
>>> 
>>> 
>>> I disagree, in general I don’t think that references to the PDF spec are a 
>>> good form of documentation (there are some exceptions). References to the 
>>> spec are meaningless to the reader unless they take the time to look them 
>>> up in a 700 page PDF document. I would argue that by just linking back to 
>>> the spec, we have *failed* to document PDFBox, not succeeded.
>>> 
>>> References to the PDF spec have another major flaw: they go out-of-date. 
>>> For example a Pattern Colour Space will always be called “Pattern Colour 
>>> Space” in future versions of the PDF spec but it may not be described in 
>>> paragraph 8.6.6.2 or on page 156. The existing code contains many 
>>> references to the PDF 1.6 and 1.7 specs as well as the ISO PDF32000 spec, 
>>> which means that I need three 700 page PDF files open at all times in order 
>>> to look up PDFBox references. With the new version of the PDF spec due this 
>>> year, this situation is going to get worse.
>>> 
>> 
>> Didn’t mean to only reference to the spec but to use the same terms as 
>> described by the spec. Adding references to the spec is an add-on not a 
>> replacement.
>> 
>>> I agree that some of the existing code needs more documentation, and I 
>>> often add documentation to old files which I’m working on. However, my 
>>> approach is to just paste in a sentence or two from the PDF spec (fair 
>>> use). That way the reader does not ever need to look at the PDF spec. 
>>> Because we use the same terminology in PDFBox as in the spec, if someone 
>>> really wants to look something up, it’s as simple as Ctrl+F, no reference 
>>> needed, and it’s guaranteed not to go out-of-date.
>>> 
>>>> I wouldn’t remove processStream and processSubStream but deprecate them 
>>>> and remove them in the next major release though as to keep the changes to 
>>>> a minimum.
>>> 
>>> This isn’t possible, as I said it "will necessarily be a breaking change”. 
>>> This is because in 2.0 PDFStreamEngine needs to know the parent of each 
>>> stream, but processStream and processSubStream do not provide this 
>>> information. That’s why I’m discussing this on the mailing list.
>> 
>> I don’t understand why this shouldn’t be possible. It’s more effort, 
>> agreed, but beneficial.
>> 
>>> 
>>>> For the rendering what might have been missed is taking the UserUnit entry 
>>>> in the page dictionary into account which might change the default user 
>>>> space. This was introduced in PDF 1.6. A good opportunity to read that 
>>>> entry and make sure that we handle it appropriately.
>>> 
>>> Yes, I have this as a “todo” in my working copy, however, if we put the 
>>> UserUnit in the matrix then we should also put the page Rotation into the 
>>> matrix, but that’s a significant change.
>>> 
>>> -- John
>> 
> 



Re: Removing processStream and processSubStream

2014-03-19 Thread Tilman Hausherr
I don't yet have a complete understanding of the project so I can't 
comment much on the contents, but my three cents:


- if the refactoring is based on the promise to solve problem XXX, then 
if that problem XXX isn't solved after the refactoring, then it 
shouldn't be done, or should be reversed
- there are still regressions from the previous refactorings. The longer 
you wait, the less you'll remember what you did


Nevertheless, at first sight, the refactorings seem to make sense. I 
always like the idea of having access to information from above.


Tilman

Am 18.03.2014 20:46, schrieb John Hewson:

Hi All

I’m still working on getting Tiling Patterns to render correctly, and need to 
make some
changes to core PDFBox functionality in order to proceed. My problem is that 
tiling
patterns are defined in their parent stream’s initial coordinate space, rather 
than the
coordinate space defined by the CTM. However, in PDFBox there is no way to 
access
the parent stream, so I can’t find out what its initial matrix is. The manner 
in which the
initial coordinate space is determined is different for pages, forms, and 
patterns.

What this means is that the parent stream’s initial coordinate space needs to 
be passed
to processStream and processSubStream in PDFStreamEngine. This will necessarily 
be
a breaking change, and it will affect all downstream subclasses of 
PDFStreamEngine.

Because this has to be a breaking change, I propose that we go all the way and 
make
the new API bulletproof, 1) so that we won’t have to introduce breaking changes 
in the
future if we encounter similar issues, 2) so that the caller of the method 
can’t pass the
wrong data in the parameters. We would remove the two generic methods:

public void processStream(PDResources resources, COSStream cosStream, 
PDRectangle drawingSize, int rotation)
public void processSubStream(PDResources resources, COSStream cosStream)

and replace them with four specific methods:

public void processPage(PDPage page)
public void processForm(PDFormXObject form)
public void processTilingPattern(PDTilingPattern pattern)
public void processType3Font(PDType3Font font)

This would mean that the various “process” methods have access to their 
parent
stream, and can read any of its public fields in the future without introducing 
breaking
changes by altering the method’s parameters.

What do you think?

-- John






Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-03-19 Thread DImuthu Upeksha
Hi John,

I'm thinking about how to combine the word + location data that
comes from the Tesseract API into actual sentences. What I get is:
1 The identified word
2 The coordinates of that word's bounding box

So in the end I have a set of words with bounding boxes. To combine
them I'm considering two approaches:

1 Print the data into a PDDocument again and pass it through the
PDFTextStripper of PDFBox. This could reduce the performance of the
overall process.

2 Write the algorithms from scratch. This may need some extra research
work. However, I feel that I could use the same algorithms that PDFBox
uses for this task.

Which is the most feasible and efficient solution? I prefer the second
approach, but it may require more time and testing than the first one.

Thanks
Dimuthu

On Sun, Mar 16, 2014 at 6:45 PM, DImuthu Upeksha
 wrote:
> Hi John,
>
> For now I'm using those methods to debug the wrapper. I'll remove
> those methods after I finished testing it.
>
> I started implementing OCR-plugin [1] for PDFBox. Currently it
> satisfies basic requirements such as getting word+location data [2].
> Please have a look at that and let me know if any changes are
> required.
>
> [1] https://github.com/DImuthuUpe/OCR-Plugin
> [2] 
> https://github.com/DImuthuUpe/OCR-Plugin/blob/master/src/main/java/org/apache/pdfbox/ocr/OCRConnector.java
>
> Thanks
> Dimuthu
>
> On Fri, Mar 14, 2014 at 12:09 AM, John Hewson  wrote:
>> Thanks, I saw your new refactoring too, it's good. Now the following methods 
>> are no longer needed:
>>
>> public void setImagePath(String path)
>> public void setImage(byte[] imagedata, int width, int height, int bpp,int 
>> bpl)
>>
>> Cheers
>>
>> -- John
>>
>> On 11 Mar 2014, at 22:58, DImuthu Upeksha  wrote:
>>
>>> Hi John,
>>> Yes. I implemented a new method to accept byte streams of the image as
>>> an input. We directly can't send BufferedImage objects to native side.
>>> So what I did is converting buffered image into a byte array and
>>> passed it in to native side. At the native side it again converts in
>>> to compatible format. With that request we need to pass some metadata
>>> of byte stream like image width, height, bytes per pixel and bytes per
>>> row. I checked it with this [2] test case and it works fine.
>>>
>>> [1] 
>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/java/com/apache/pdfbox/ocr/tesseract/TessBaseAPI.java#L74
>>> [2] 
>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/test/java/com/apache/pdfbox/ocr/tesseract/TessByteSteamTest.java
>>>
>>> Thanks
>>> Dimuthu
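
A minimal sketch of the conversion described in the quoted message above: turning a BufferedImage into a byte array plus the width, height, bytes-per-pixel and bytes-per-line metadata. The class names are hypothetical and this is not the code in the linked repository. It assumes a 3-byte interleaved layout; the B, G, R channel order produced here may differ from what the native side expects.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferByte;

final class ImageBytesSketch
{
    /** Raw pixel data plus the layout metadata mentioned in the message. */
    static final class RawImage
    {
        final byte[] data;
        final int width;
        final int height;
        final int bytesPerPixel;
        final int bytesPerLine;

        RawImage(byte[] data, int width, int height, int bytesPerPixel, int bytesPerLine)
        {
            this.data = data;
            this.width = width;
            this.height = height;
            this.bytesPerPixel = bytesPerPixel;
            this.bytesPerLine = bytesPerLine;
        }
    }

    static RawImage toRaw(BufferedImage source)
    {
        int w = source.getWidth();
        int h = source.getHeight();
        // Copy into a known layout: 3 interleaved bytes per pixel, no row padding.
        BufferedImage bgr = new BufferedImage(w, h, BufferedImage.TYPE_3BYTE_BGR);
        Graphics2D g = bgr.createGraphics();
        try
        {
            g.drawImage(source, 0, 0, null);
        }
        finally
        {
            g.dispose();
        }
        byte[] data = ((DataBufferByte) bgr.getRaster().getDataBuffer()).getData();
        return new RawImage(data, w, h, 3, w * 3);
    }

    public static void main(String[] args)
    {
        RawImage raw = toRaw(new BufferedImage(4, 2, BufferedImage.TYPE_INT_RGB));
        System.out.println(raw.data.length + " bytes, " + raw.bytesPerLine + " per line");
    }
}
```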
>>>
>>> On Wed, Mar 12, 2014 at 12:40 AM, John Hewson  wrote:
>>>> Hi Dimuthu
>>>> 
>>>> The Tesseract wrapper needs to take its input from a BufferedImage rather 
>>>> than reading a file from disk, so instead of:
>>>> 
>>>> api.setImagePath("test.tif");
>>>> 
>>>> What we need is:
>>>> 
>>>> BufferedImage image = ImageIO.read(new File("test.tif"));
>>>> api.setImagePath(image);
>>>> 
>>>> Because this will let us use the BufferedImage generated by PDFRenderer 
>>>> without round-tripping to the disk.
>>>> 
>>>> -- John
>>>> 
>>>> On 11 Mar 2014, at 11:13, DImuthu Upeksha 
>>>> wrote:
>>>> 
>>>>> Hi John,
>>>>> Thanks for the guidance.
>>>>> I did a small analysis of the accuracy and performance of new
>>>>> Tesseract wrapper. I used this [1] image as the input image and got
>>>>> following data [2] after OCR. First line is the recognised word
>>>>> followed by location details (bounding box) of the word. I think these
>>>>> details are pretty much enough for our task. Now what remaining is
>>>>> converting pdf file into a image as you have mentioned. These days I'm
>>>>> working on it.
>>>>>
>>>>> [1] https://www.dropbox.com/s/11wahtonoz08zmn/image4.TIF
>>>>> [2] https://gist.github.com/DImuthuUpe/9491660
>>>>>
>>>>> Thanks
>>>>> Dimuthu
>>>>>
>>>>> On Mon, Mar 10, 2014 at 2:30 PM, John Hewson  wrote:
>>>>>> Dimuthu,
>>>>>>
>>>>>>> I finished basic implementation of JNI wrapper for Tesseract. Now it 
>>>>>>> can be
>>>>>>> build using maven. Some useful methods that are needed to do basic OCR 
>>>>>>> were
>>>>>>> implemented.
>>>>>>
>>>>>> Great, it's looking good, nice and clean.
>>>>>>
>>>>>>> 1. What is the task of processStream method in PDFTextStripper class 
>>>>>>> line
>>>>>>> 456 : processStream( page.findResources(), content, page.findCropBox(),
>>>>>>> page.findRotation() );
>>>>>>
>>>>>> A PDF file is made up of pages, each of which contains a "content 
>>>>>> stream". This content stream contains a list of drawing commands such as 
>>>>>> "move to 10,15" or "write the word `foo`", these are called operators. 
>>>>>> The processStream function reads the stream for the current page and 
>>>>>> executes each of the operators. The operators themselves are implemented 
>>>>>> each in their own class which is a subclass of PDFOperator. The 
>>>>>> constructor of PDFStreamEngine creates the operator classes using 
>>>>>> reflection, which is rather odd and I'm not sure why this design was 
>>>>>> chosen. The op

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-03-19 Thread John Hewson
Hi Dimuthu

> 1 Print those data into PDDocument again and pass through TextStripper
> of PDFBox. This could reduce the performance of overall process.

This was what I had in mind, but rather than printing the text into the 
PDDocument
you can inject it directly into PDFTextStripper as TextPosition instances. I 
mentioned
something like this a while ago:

> You could subclass PDFTextStripper and override the startDocument method and 
> use it to create a PDFRenderer and store it in a field. Then override the 
> processPage method and use the previously created PDFRenderer to render the 
> current page to a buffered image and perform OCR on the image. Once you have 
> the OCR text + positions, instead of calling processStream you can call 
> processTextPosition once for each character + position.

Let’s see how well it works and then re-evaluate.
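
A minimal sketch of the "once for each character + position" step, assuming the OCR engine only reports one bounding box per word. It splits the word box evenly across its characters, which is a crude approximation for proportional fonts. CharBox is a hypothetical stand-in, not the PDFBox TextPosition class, and nothing here calls PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

final class OcrWordSplitSketch
{
    /** Hypothetical per-character box; a real implementation would build TextPosition objects. */
    static final class CharBox
    {
        final char ch;
        final double x;
        final double y;
        final double width;
        final double height;

        CharBox(char ch, double x, double y, double width, double height)
        {
            this.ch = ch;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Splits a word's bounding box into equal-width boxes, one per character. */
    static List<CharBox> split(String word, double x, double y, double width, double height)
    {
        List<CharBox> boxes = new ArrayList<>();
        if (word.isEmpty())
        {
            return boxes;
        }
        double charWidth = width / word.length();
        for (int i = 0; i < word.length(); i++)
        {
            boxes.add(new CharBox(word.charAt(i), x + i * charWidth, y, charWidth, height));
        }
        return boxes;
    }

    public static void main(String[] args)
    {
        for (CharBox box : split("abcd", 10, 20, 40, 12))
        {
            System.out.println(box.ch + " at x=" + box.x);
        }
    }
}
```

Each resulting box would then be handed to the text stripper in reading order, so that its existing line and word grouping can do the sentence assembly.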

-- John



Re: Removing processStream and processSubStream

2014-03-19 Thread John Hewson
Maruan,

>>> From how I understand the rendering in PDF Form, Text, Image and Pattern 
>>> maintain their own matrix to map to user space which is then transformed by 
>>> the CTM to device space so handling them specifically is fine and inline 
>>> with the spec.
>> 
>> No, that’s not right, what I said was:
>> 
>>>> My problem is that tiling patterns are defined in their parent stream’s 
>>>> initial coordinate space, rather than the
>>>> coordinate space defined by the CTM.
>> 
>> So patterns should *not* be using the CTM, which is what I’m trying to 
>> achieve.
>> 
> 
> I think you misunderstood what I wrote - patterns have their own matrix - so 
> I think we are on the same page here. IMHO according to the spec CTM 
> transforms from user space to device space. So it’s pattern space -> user 
> space -> device space.

Nope, as I said, that’s what PDFBox currently does and it’s wrong. As you say 
the CTM transforms from user space to device space, but it’s not the only way 
to do so, and it is not used by patterns.
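
To make the difference concrete, here is a minimal sketch using java.awt.geom.AffineTransform. The class and method names are hypothetical, and the matrices are made-up values: the parent's initial matrix scales by 2, the content stream has since translated by 100, and the pattern matrix translates by 10.

```java
import java.awt.geom.AffineTransform;

final class PatternSpaceSketch
{
    /**
     * Maps pattern space to device space through the parent stream's
     * initial matrix, ignoring whatever the CTM is at paint time.
     */
    static AffineTransform patternToDevice(AffineTransform parentInitialMatrix,
                                           AffineTransform patternMatrix)
    {
        AffineTransform result = new AffineTransform(parentInitialMatrix);
        // concatenate() applies patternMatrix to a point first, then the parent's matrix.
        result.concatenate(patternMatrix);
        return result;
    }

    public static void main(String[] args)
    {
        AffineTransform initial = AffineTransform.getScaleInstance(2, 2);
        AffineTransform pattern = AffineTransform.getTranslateInstance(10, 0);

        // The CTM at paint time: the initial matrix plus a later translation.
        AffineTransform ctm = new AffineTransform(initial);
        ctm.translate(100, 0);

        AffineTransform viaInitial = patternToDevice(initial, pattern);
        AffineTransform viaCtm = patternToDevice(ctm, pattern);

        System.out.println("via initial matrix: " + viaInitial.getTranslateX());
        System.out.println("via CTM:            " + viaCtm.getTranslateX());
    }
}
```

The two results differ as soon as the content stream has modified the CTM, which is why the parent's initial matrix has to be available.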

> Didn’t mean to only reference to the spec but to use the same terms as 
> described by the spec. Adding references to the spec is an add-on not a 
> replacement.

I don’t see what value this adds, given that the references will just go 
out-of-date when the next spec is released. We already use the same terminology 
as the PDF spec, so Ctrl+F can be used for quick look-ups that won’t go 
out-of-date.

>> This isn’t possible, as I said it "will necessarily be a breaking change”. 
>> This is because in 2.0 PDFStreamEngine needs to know the parent of each 
>> stream, but processStream and processSubStream do not provide this 
>> information. That’s why I’m discussing this on the mailing list.
> 
> I don’t understand why this shouldn’t be possible. It’s more effort, 
> agreed, but beneficial.


What’s not to understand? PDFStreamEngine *needs* to know the parent of each 
stream, and the old methods don’t provide this, passing a null parent will not 
work because we need that information later in order to correctly process the 
stream. If we allowed a null parent to be passed, the result would be silently 
broken rendering - there’s no value in providing a backwards-compatible API if 
it can only produce broken results.
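
A minimal sketch of the idea, with hypothetical stand-in types rather than the real PDFBox classes: when the only entry points are typed, there is no call that can leave the parent unset.

```java
import java.util.Objects;

final class TypedEntryPointsSketch
{
    /** Hypothetical stand-in for a page; not the PDFBox PDPage class. */
    static final class Page
    {
        final String name;
        Page(String name) { this.name = name; }
    }

    /** Hypothetical stand-in for a form XObject. */
    static final class Form
    {
        final String name;
        Form(String name) { this.name = name; }
    }

    /** Hypothetical engine without a generic entry point that could omit the parent. */
    static final class Engine
    {
        private String parent; // set by every entry point before operators run

        String processPage(Page page)
        {
            Objects.requireNonNull(page, "page");
            return process("page " + page.name);
        }

        String processForm(Form form)
        {
            Objects.requireNonNull(form, "form");
            return process("form " + form.name);
        }

        private String process(String newParent)
        {
            String previous = parent;
            parent = newParent;
            try
            {
                // Operators would run here and could consult the parent,
                // for example to look up its initial matrix.
                return "processing operators of " + parent;
            }
            finally
            {
                parent = previous; // restore for nested streams
            }
        }
    }

    public static void main(String[] args)
    {
        Engine engine = new Engine();
        System.out.println(engine.processPage(new Page("1")));
        System.out.println(engine.processForm(new Form("Fm0")));
    }
}
```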

-- John

On 19 Mar 2014, at 10:31, Maruan Sahyoun  wrote:

> John,
> 
> Am 19.03.2014 um 18:15 schrieb John Hewson :
> 
>> Maruan
>> 
>>> From how I understand the rendering in PDF Form, Text, Image and Pattern 
>>> maintain their own matrix to map to user space which is then transformed by 
>>> the CTM to device space so handling them specifically is fine and inline 
>>> with the spec.
>> 
>> No, that’s not right, what I said was:
>> 
>>>> My problem is that tiling patterns are defined in their parent stream’s 
>>>> initial coordinate space, rather than the
>>>> coordinate space defined by the CTM.
>> 
>> So patterns should *not* be using the CTM, which is what I’m trying to 
>> achieve.
>> 
> 
> I think you misunderstood what I wrote - patterns have their own matrix - so 
> I think we are on the same page here. IMHO according to the spec CTM 
> transforms from user space to device space. So it’s pattern space -> user 
> space -> device space.
> 
> 
>>> I’d suggest that we make sure that the different ‚spaces‘ are defined 
>>> properly within the code and refer to the PDF spec so that the code is 
>>> easier to read if this is not already the case. With so many changes it’s a 
>>> good opportunity to enhance the documentation within the source code. Some 
>>> of the old code enjoys very little documentation.
>> 
>> 
>> I disagree, in general I don’t think that references to the PDF spec are a 
>> good form of documentation (there are some exceptions). References to the 
>> spec are meaningless to the reader unless they take the time to look them up 
>> in a 700 page PDF document. I would argue that by just linking back to the 
>> spec, we have *failed* to document PDFBox, not succeeded.
>> 
>> References to the PDF spec have another major flaw: they go out-of-date. For 
>> example a Pattern Colour Space will always be called “Pattern Colour Space” 
>> in future versions of the PDF spec but it may not be described in paragraph 
>> 8.6.6.2 or on page 156. The existing code contains many references to the 
>> PDF 1.6 and 1.7 specs as well as the ISO PDF32000 spec, which means that I 
>> need three 700 page PDF files open at all times in order to look up PDFBox 
>> references. With the new version of the PDF spec due this year, this 
>> situation is going to get worse.
>> 
> 
> Didn’t mean to only reference to the spec but to use the same terms as 
> described by the spec. Adding references to the spec is an add-on not a 
> replacement.
> 
>> I agree that some of the existing code needs more documentation, and I often 
>> add documentation to old files which I’m working on. However, my approach is 
>> to just paste in a sentence or two from the PDF spec (fair use). That way 
>

[jira] [Commented] (PDFBOX-1848) Time Stamp Document Level Sigature

2014-03-19 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940782#comment-13940782
 ] 

John Hewson commented on PDFBOX-1848:
-

Waiting on PDFBOX-1847 to be resolved first.

> Time Stamp Document Level Sigature
> --
>
> Key: PDFBOX-1848
> URL: https://issues.apache.org/jira/browse/PDFBOX-1848
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Signing
>Affects Versions: 2.0.0
>Reporter: vakhtang koroghlishvili
>Assignee: John Hewson
> Fix For: 2.0.0
>
> Attachments: CreateTSASignature.java.patch, 
> TSA-SIG-LOOKS-LIKE-THIS.png
>
>
> We need a TSA document-level signature module too!
> At the moment we sign the document with our certificate. But sometimes we need 
> to sign the document with a TSA too. This is an important part of signing. Sometimes 
> it is very important, for instance when we implement the PAdES 4 
> profile this module will be essential. Without it the Document Secure Store 
> will not work :)
> I'm working on this improvement and will finish it soon. It's almost done. I 
> only need to add some Javadoc, and I might change the architectural design, etc.
> So, please assign this to me :) I will upload a patch as soon as possible :)





[jira] [Commented] (PDFBOX-1975) Improve TestImageIOUtils unit tests to check image resolution and compression

2014-03-19 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940860#comment-13940860
 ] 

Tilman Hausherr commented on PDFBOX-1975:
-

I clarified the javadoc in rev 1579355.

> Improve TestImageIOUtils unit tests to check image resolution and compression
> -
>
> Key: PDFBOX-1975
> URL: https://issues.apache.org/jira/browse/PDFBOX-1975
> Project: PDFBox
>  Issue Type: Task
>  Components: Utilities
>Affects Versions: 2.0.0
>Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Minor
>  Labels: imageio, test, tiff
> Fix For: 2.0.0
>
>
> Because of the problems with recent changes (see PDFBOX-1963), I will improve 
> the unit tests so that image resolution and compression is checked.
> I found out that JPEGs don't have a resolution, BMP had the wrong resolution. 
> The fault wasn't in the java TIFF writer as I thought before, it is in the 
> java PNG writer, which uses the PixelSize values wrongly, i.e. it interprets 
> them as "pixels per mm" instead of "mm per pixel" as per specification. The 
> JPEG writer throws an exception "JFIF APP0 must be first marker after SOI". 
> The BMP writer can set the resolution, but the BMP reader doesn't read it.
> (Some of this might be different depending on the version)
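
A worked conversion for the PNG point in the description, as a minimal sketch with hypothetical names: the "mm per pixel" reading and the "pixels per mm" reading of the same resolution are reciprocals, so confusing them gives a wildly wrong result.

```java
final class PixelSizeSketch
{
    static final double MM_PER_INCH = 25.4;

    /** Pixel size in millimetres for a given resolution ("mm per pixel"). */
    static double mmPerPixel(double dpi)
    {
        return MM_PER_INCH / dpi;
    }

    /** Pixel density for a given resolution ("pixels per mm"). */
    static double pixelsPerMm(double dpi)
    {
        return dpi / MM_PER_INCH;
    }

    public static void main(String[] args)
    {
        System.out.println(mmPerPixel(254));
        System.out.println(pixelsPerMm(254));
    }
}
```

At 254 dpi the two values are roughly 0.1 and 10, a factor of 100 apart, which is the kind of discrepancy a unit test on the written resolution would catch.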





[jira] [Comment Edited] (PDFBOX-1975) Improve TestImageIOUtils unit tests to check image resolution and compression

2014-03-19 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940860#comment-13940860
 ] 

Tilman Hausherr edited comment on PDFBOX-1975 at 3/19/14 8:01 PM:
--

I clarified the javadoc in rev 1579355. I deprecated the confusing method and 
created a better one in rev. 1579369. I removed the deprecated call from the 
unit test in rev 1579372.


was (Author: tilman):
I clarified the javadoc in rev 1579355. I deprecated the confusing method and 
created a better one in rev. 1579369.

> Improve TestImageIOUtils unit tests to check image resolution and compression
> -
>
> Key: PDFBOX-1975
> URL: https://issues.apache.org/jira/browse/PDFBOX-1975
> Project: PDFBox
>  Issue Type: Task
>  Components: Utilities
>Affects Versions: 2.0.0
>Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Minor
>  Labels: imageio, test, tiff
> Fix For: 2.0.0
>
>
> Because of the problems with recent changes (see PDFBOX-1963), I will improve 
> the unit tests so that image resolution and compression is checked.
> I found out that JPEGs don't have a resolution, BMP had the wrong resolution. 
> The fault wasn't in the java TIFF writer as I thought before, it is in the 
> java PNG writer, which uses the PixelSize values wrongly, i.e. it interprets 
> them as "pixels per mm" instead of "mm per pixel" as per specification. The 
> JPEG writer throws an exception "JFIF APP0 must be first marker after SOI". 
> The BMP writer can set the resolution, but the BMP reader doesn't read it.
> (Some of this might be different depending on the version)





[jira] [Comment Edited] (PDFBOX-1975) Improve TestImageIOUtils unit tests to check image resolution and compression

2014-03-19 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940860#comment-13940860
 ] 

Tilman Hausherr edited comment on PDFBOX-1975 at 3/19/14 7:56 PM:
--

I clarified the javadoc in rev 1579355. I deprecated the confusing method and 
created a better one in rev. 1579369.


was (Author: tilman):
I clarified the javadoc in rev 1579355.



[jira] [Commented] (PDFBOX-1975) Improve TestImageIOUtils unit tests to check image resolution and compression

2014-03-19 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940931#comment-13940931
 ] 

John Hewson commented on PDFBOX-1975:
-

When you deprecate methods, make sure to add the @Deprecated annotation to the 
method, as well as using the @deprecated JavaDoc tag, e.g.

{code}
/**
 * @deprecated
 * explanation of why it was deprecated
 */
@Deprecated
static void deprecatedMethod()
{
   ...
}
{code}
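
A slightly fuller sketch of the same pattern, assuming the deprecated method 
is kept only as a thin wrapper around its replacement. All names here are 
hypothetical and are not taken from the PDFBox API.

```java
// Hypothetical names, for illustration only (not the PDFBox API).
final class DeprecationSketch
{
    private DeprecationSketch() {}

    /**
     * Converts a resolution from dots per inch to pixels per millimetre.
     *
     * @param dpi resolution in dots per inch
     * @return resolution in pixels per millimetre
     */
    static double dpiToPixelsPerMm(double dpi)
    {
        return dpi / 25.4;
    }

    /**
     * @deprecated the name does not say which unit is returned;
     * use {@link #dpiToPixelsPerMm(double)} instead.
     */
    @Deprecated
    static double convert(double dpi)
    {
        // Delegate so that old and new callers always get the same result.
        return dpiToPixelsPerMm(dpi);
    }

    @SuppressWarnings("deprecation")
    public static void main(String[] args)
    {
        System.out.println(convert(254) == dpiToPixelsPerMm(254));
    }
}
```

The annotation is what the compiler sees and warns about; the JavaDoc tag is 
where the reason and the replacement are explained to the reader.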



[jira] [Commented] (PDFBOX-1975) Improve TestImageIOUtils unit tests to check image resolution and compression

2014-03-19 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940939#comment-13940939
 ] 

Tilman Hausherr commented on PDFBOX-1975:
-

oops, yes, done in rev 1579386.



Re: Removing processStream and processSubStream

2014-03-19 Thread Maruan Sahyoun
John

On 19.03.2014 at 19:10, John Hewson wrote:

> Maruan,
> 
>>>> From how I understand the rendering in PDF Form, Text, Image and Pattern 
>>>> maintain their own matrix to map to user space which is then transformed 
>>>> by the CTM to device space so handling them specifically is fine and 
>>>> inline with the spec.
>>> 
>>> No, that’s not right, what I said was:
>>> 
>>>>> My problem is that tiling patterns are defined in their parent stream’s 
>>>>> initial coordinate space, rather than the
>>>>> coordinate space defined by the CTM.
>>> 
>>> So patterns should *not* be using the CTM, which is what I’m trying to 
>>> achieve.
>>> 
>> 
>> I think you misunderstood what I wrote - patterns have their own matrix - so 
>> I think we are on the same page here. IMHO according to the spec CTM 
>> transforms from user space to device space. So it’s pattern space -> user 
>> space -> device space.
> 
> Nope, as I said, that’s what PDFBox currently does and it’s wrong. As you say 
> the CTM transforms from user space to device space, but it’s not the only way 
> to do so, and it is not used by patterns.

The processing is defined in the spec, which is a good reference, so there is no 
need to discuss that further. Of course, different people might come to different 
conclusions when reading and interpreting the spec.
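
The disagreement above comes down to which matrix the pattern matrix is 
concatenated with. A minimal sketch using java.awt.geom.AffineTransform, with 
made-up matrices and a hypothetical class name (this is not PDFBox code): in 
John's reading the base is the parent stream's initial transform, while in the 
behaviour he describes as current the base is the CTM in effect when the 
pattern is painted.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

// Hypothetical sketch with made-up matrices (not PDFBox code).
final class PatternSpaceSketch
{
    private PatternSpaceSketch() {}

    // Maps a pattern-space point to device space: the pattern matrix is
    // applied first, then the given base transform.
    static Point2D toDevice(AffineTransform base, AffineTransform patternMatrix, Point2D p)
    {
        AffineTransform t = new AffineTransform(base);
        t.concatenate(patternMatrix);
        return t.transform(p, null);
    }

    public static void main(String[] args)
    {
        // Initial coordinate space of the parent stream (identity for simplicity).
        AffineTransform initial = new AffineTransform();
        // CTM after the content stream has scaled by 2, e.g. "2 0 0 2 0 0 cm".
        AffineTransform ctm = new AffineTransform(initial);
        ctm.scale(2, 2);
        // Assumed pattern matrix: shift the pattern by 10 units in x.
        AffineTransform patternMatrix = AffineTransform.getTranslateInstance(10, 0);
        Point2D p = new Point2D.Double(1, 1);

        System.out.println("via initial space: " + toDevice(initial, patternMatrix, p));
        System.out.println("via current CTM:   " + toDevice(ctm, patternMatrix, p));
    }
}
```

The two calls give different device coordinates as soon as the content stream 
has modified the CTM, which is why the choice of base matrix changes the 
rendered result.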

> 
>> Didn’t mean to only reference to the spec but to use the same terms as 
>> described by the spec. Adding references to the spec is an add-on not a 
>> replacement.
> 
> I don’t see what value this adds, given that the references will just go 
> out-of-date when the next spec is released. We already use the same 
> terminology as the PDF spec, so Ctrl+F can be used for quick look-ups that 
> won’t go out-of-date.

You are not forced to add the information.

> 
>>> This isn’t possible, as I said it "will necessarily be a breaking change”. 
>>> This is because in 2.0 PDFStreamEngine needs to know the parent of each 
>>> stream, but processStream and processSubStream do not provide this 
>>> information. That’s why I’m discussing this on the mailing list.
>> 
>> I don’t understand why this shouldn’t be possible. It’s more effort, 
>> agreed, but beneficial.
> 
> 
> What’s not to understand? PDFStreamEngine *needs* to know the parent of each 
> stream, and the old methods don’t provide this, passing a null parent will 
> not work because we need that information later in order to correctly process 
> the stream. If we allowed a null parent to be passed, the result would be 
> silently broken rendering - there’s no value in providing a 
> backwards-compatible API if it can only produce broken results.

We won’t reach the same conclusion here (nor, I think, on the other 
topics above).

> 
> -- John
> 
> On 19 Mar 2014, at 10:31, Maruan Sahyoun  wrote:
> 
>> John,
>> 
>> On 19.03.2014 at 18:15, John Hewson wrote:
>> 
>>> Maruan
>>> 
>>>> From how I understand the rendering in PDF Form, Text, Image and Pattern 
>>>> maintain their own matrix to map to user space which is then transformed 
>>>> by the CTM to device space so handling them specifically is fine and 
>>>> inline with the spec.
>>> 
>>> No, that’s not right, what I said was:
>>> 
>>>>> My problem is that tiling patterns are defined in their parent stream’s 
>>>>> initial coordinate space, rather than the
>>>>> coordinate space defined by the CTM.
>>> 
>>> So patterns should *not* be using the CTM, which is what I’m trying to 
>>> achieve.
>>> 
>> 
>> I think you misunderstood what I wrote - patterns have their own matrix - so 
>> I think we are on the same page here. IMHO according to the spec CTM 
>> transforms from user space to device space. So it’s pattern space -> user 
>> space -> device space.
>> 
>> 
>>>> I’d suggest that we make sure that the different ‚spaces‘ are defined 
>>>> properly within the code and refer to the PDF spec so that the code is 
>>>> easier to read if this is not already the case. With so many changes it’s 
>>>> a good opportunity to enhance the documentation within the source code. 
>>>> Some of the old code enjoys very little documentation.
>>> 
>>> 
>>> I disagree, in general I don’t think that references to the PDF spec are a 
>>> good form of documentation (there are some exceptions). References to the 
>>> spec are meaningless to the reader unless they take the time to look them 
>>> up in a 700 page PDF document. I would argue that by just linking back to 
>>> the spec, we have *failed* to document PDFBox, not succeeded.
>>> 
>>> References to the PDF spec have another major flaw: they go out-of-date. 
>>> For example a Pattern Colour Space will always be called “Pattern Colour 
>>> Space” in future versions of the PDF spec but it may not be described in 
>>> paragraph 8.6.6.2 or on page 156. The existing code contains many 
>>> references to the PDF 1.6 and 1.7 specs as well as the ISO PDF32000 spec, 
>>> which means that I need three 700 page PDF files open at all times in order 
>>> to look up

[jira] [Commented] (PDFBOX-1975) Improve TestImageIOUtils unit tests to check image resolution and compression

2014-03-19 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13941035#comment-13941035
 ] 

Tilman Hausherr commented on PDFBOX-1975:
-

Removed incorrect javadoc line in rev 1579414.



Re: Removing processStream and processSubStream

2014-03-19 Thread John Hewson
Maruan

With the exception of the documentation issue, these are not subjective matters;
you can’t disagree with an objective truth. Either falsify my claims or concede
that I am correct. We need to reach a technical resolution on this.

-- John

On 19 Mar 2014, at 13:48, Maruan Sahyoun  wrote:

> John
> 
> On 19.03.2014 at 19:10, John Hewson wrote:
> 
>> Maruan,
>> 
>>>>> From how I understand the rendering in PDF Form, Text, Image and Pattern 
>>>>> maintain their own matrix to map to user space which is then transformed 
>>>>> by the CTM to device space so handling them specifically is fine and 
>>>>> inline with the spec.
>>>> 
>>>> No, that’s not right, what I said was:
>>>> 
>>>>>> My problem is that tiling patterns are defined in their parent stream’s 
>>>>>> initial coordinate space, rather than the
>>>>>> coordinate space defined by the CTM.
>>>> 
>>>> So patterns should *not* be using the CTM, which is what I’m trying to 
>>>> achieve.
>>>> 
>>> 
>>> I think you misunderstood what I wrote - patterns have their own matrix - 
>>> so I think we are on the same page here. IMHO according to the spec CTM 
>>> transforms from user space to device space. So it’s pattern space -> user 
>>> space -> device space.
>> 
>> Nope, as I said, that’s what PDFBox currently does and it’s wrong. As you 
>> say the CTM transforms from user space to device space, but it’s not the 
>> only way to do so, and it is not used by patterns.
> 
> The processing is defined in the spec, which is a good reference, so there is no 
> need to discuss that further. Of course, different people might come to different 
> conclusions when reading and interpreting the spec.
> 
>> 
>>> Didn’t mean to only reference to the spec but to use the same terms as 
>>> described by the spec. Adding references to the spec is an add-on not a 
>>> replacement.
>> 
>> I don’t see what value this adds, given that the references will just go 
>> out-of-date when the next spec is released. We already use the same 
>> terminology as the PDF spec, so Ctrl+F can be used for quick look-ups that 
>> won’t go out-of-date.
> 
> You are not forced to add the information.
> 
>> 
>>>> This isn’t possible, as I said it "will necessarily be a breaking change”. 
>>>> This is because in 2.0 PDFStreamEngine needs to know the parent of each 
>>>> stream, but processStream and processSubStream do not provide this 
>>>> information. That’s why I’m discussing this on the mailing list.
>>> 
>>> I don’t understand why this shouldn’t be possible. It’s more effort, 
>>> agreed, but beneficial.
>> 
>> 
>> What’s not to understand? PDFStreamEngine *needs* to know the parent of each 
>> stream, and the old methods don’t provide this, passing a null parent will 
>> not work because we need that information later in order to correctly 
>> process the stream. If we allowed a null parent to be passed, the result 
>> would be silently broken rendering - there’s no value in providing a 
>> backwards-compatible API if it can only produce broken results.
> 
> We won’t reach the same conclusion here (nor, I think, on the other 
> topics above).
> 
>> 
>> -- John
>> 
>> On 19 Mar 2014, at 10:31, Maruan Sahyoun  wrote:
>> 
>>> John,
>>> 
>>> On 19.03.2014 at 18:15, John Hewson wrote:
>>> 
>>>> Maruan
>>>> 
>>>>> From how I understand the rendering in PDF Form, Text, Image and Pattern 
>>>>> maintain their own matrix to map to user space which is then transformed 
>>>>> by the CTM to device space so handling them specifically is fine and 
>>>>> inline with the spec.
>>>> 
>>>> No, that’s not right, what I said was:
>>>> 
>>>>>> My problem is that tiling patterns are defined in their parent stream’s 
>>>>>> initial coordinate space, rather than the
>>>>>> coordinate space defined by the CTM.
>>>> 
>>>> So patterns should *not* be using the CTM, which is what I’m trying to 
>>>> achieve.
>>>> 
>>> 
>>> I think you misunderstood what I wrote - patterns have their own matrix - 
>>> so I think we are on the same page here. IMHO according to the spec CTM 
>>> transforms from user space to device space. So it’s pattern space -> user 
>>> space -> device space.
>>> 
>>> 
>>>>> I’d suggest that we make sure that the different ‚spaces‘ are defined 
>>>>> properly within the code and refer to the PDF spec so that the code is 
>>>>> easier to read if this is not already the case. With so many changes it’s 
>>>>> a good opportunity to enhance the documentation within the source code. 
>>>>> Some of the old code enjoys very little documentation.
>>>> 
>>>> 
>>>> I disagree, in general I don’t think that references to the PDF spec are a 
>>>> good form of documentation (there are some exceptions). References to the 
>>>> spec are meaningless to the reader unless they take the time to look them 
>>>> up in a 700 page PDF document. I would argue that by just linking back to 
>>>> the spec, we have *failed* to document PDFBox, not succeeded.
>>>> 
>>>> References to the PDF spec have another major fl