[jira] [Commented] (PDFBOX-1915) Implement shading with Coons and tensor-product patch meshes

2014-07-07 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054580#comment-14054580
 ] 

Tilman Hausherr commented on PDFBOX-1915:
-

I'll test the new code later today... I wonder what will happen with this PDF 
by Prof. Kerstin Upmeyer that took "almost forever" to render:
http://kupmeyer.com/wp-content/uploads/2012/02/K_UPMEYER_SPRING10.pdf
Apparently, illustrators are using type 6 and 7 patches for what they were 
intended for: shading 3D objects.


> Implement shading with Coons and tensor-product patch meshes
> 
>
> Key: PDFBOX-1915
> URL: https://issues.apache.org/jira/browse/PDFBOX-1915
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Rendering
>Affects Versions: 1.8.5, 1.8.6, 2.0.0
>Reporter: Tilman Hausherr
>Assignee: Shaola Ren
>  Labels: graphical, gsoc2014, java, math, shading
> Fix For: 2.0.0
>
> Attachments: CIB-coons-vs-tensormesh.pdf, CIB-coonsmesh.pdf, 
> CONICAL.pdf, GWG060_Shading_x1a.pdf, GWG060_Shading_x1a_1.png, HSBWHEEL.pdf, 
> McAfee-ShadingType7.pdf, Shadingtype6week1.pdf, TENSOR.pdf, XYZsweep.pdf, 
> _gwg060_shading_x1a.pdf-1.png, _mcafee-shadingtype7.pdf-1.png, 
> asy-coons-but-really-tensor.pdf, asy-tensor-rainbow.pdf, asy-tensor.pdf, 
> coons-function.pdf, coons-function.ps, coons-nofunction-CMYK.pdf, 
> coons-nofunction-CMYK.ps, coons-nofunction-Duotone.pdf, 
> coons-nofunction-Duotone.ps, coons-nofunction-Gray.pdf, 
> coons-nofunction-Gray.ps, coons-nofunction-RGB.pdf, coons-nofunction-RGB.ps, 
> coons2-function.pdf, coons2-function.ps, coons4-function.ps, crestron-p9.pdf, 
> eci_altona-test-suite-v2_technical_H.pdf, example_030.pdf, failedTest.rar, 
> lamp_cairo.pdf, lamp_cairo7_0.png, lamp_cairo7_1.png, lamp_cairo7_1.png, 
> lineRasterization.jpg, mcafeeU5.pdf, mcafeeU5_1.png, mcafeeu5.pdf-1.png, 
> pass4FlagTest.rar, patchCases.jpg, patchMap.jpg, shading6ContourTest.rar, 
> shading6Done.rar, shading7.rar, tensor-nofunction-RGB.pdf, 
> tensor-nofunction-RGB.ps, tensor-nofunction-RGB_1.png, 
> tensor4-nofunction.pdf, tensor4-nofunction.ps, tensor4-nofunction_1.png, 
> updateshading6ContourTest.rar
>
>
> Of the seven shading methods described in the PDF specification, type 6 
> (Coons patch meshes) and type 7 (Tensor-product patch meshes) haven't been 
> implemented. I have done type 1, 4 and 5, but I don't know the math for type 
> 6 and 7. My math days are decades away.
> Knowledge prerequisites: 
> - java, although you don't have to be a java ace, just feel comfortable
> - math: you should know what "cubic Bézier curves", "Degenerate Bézier 
> curves", "bilinear interpolation", "tensor-product", "affine transform 
> matrix" and "Bernstein polynomials" are, or be able to learn it
> - maven (basic)
> - svn (basic)
> - an IDE like Netbeans or Eclipse or IntelliJ (basic)
> - ideally, you are either a math student who likes to program, or a computer 
> science student who is specializing in graphics.
> A first look at PDFBOX: try the command utility here:
> https://pdfbox.apache.org/commandline/#pdfToImage
> and use your favorite PDF, or the PDFs mentioned in PDFBOX-615, these have 
> the shading types that are already implemented.
> Some simple source code to convert to images:
> String filename = "blah.pdf";
> PDDocument document = PDDocument.loadNonSeq(new File(filename), null);
> List<PDPage> pdPages = document.getDocumentCatalog().getAllPages();
> int page = 0;
> for (PDPage pdPage : pdPages)
> {
>     ++page;
>     BufferedImage bim = RenderUtil.convertToImage(pdPage, 
>             BufferedImage.TYPE_BYTE_BINARY, 300);
>     ImageIO.write(bim, "png", new File(filename+page+".png"));
> }
> document.close();
> You are not starting from scratch. The implementation of type 4 and 5 shows 
> you how to read parameters from the PDF and set the graphics. You don't have 
> to learn the complete PDF spec, only 15 pages related to the two shading 
> types, and 6 pages about shading in general. The PDF specification is here:
> http://www.adobe.com/devnet/pdf/pdf_reference.html
> The tricky parts are:
> - decide whether a point(x,y) is inside or outside a patch
> - decide the color of a point within the patch
> To get an idea about the code, look at the classes GouraudTriangle, 
> GouraudShadingContext, Type4ShadingContext and Vertex here
> https://svn.apache.org/viewvc/pdfbox/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/graphics/shading/
> or download the whole project from the repository.
> https://pdfbox.apache.org/downloads.html#scm
> If you want to see the existing code in the debugger with a Gouraud shading, 
> try this file:
> http://asymptote.sourceforge.net/gallery/Gouraud.pdf
> Testing:
> I have attached several example PDFs. To see which one has which shading, 
> open them with an editor li

[jira] [Commented] (PDFBOX-1915) Implement shading with Coons and tensor-product patch meshes

2014-07-07 Thread Shaola Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054527#comment-14054527
 ] 

Shaola Ren commented on PDFBOX-1915:


As I had in mind from the very beginning, I used a hashmap to store all the 
pixel points of an image. There is no memory consumption problem, and type 6 
and 7 shading is much faster now. The hashmap avoids a lot of duplicated 
calculation and needless list traversal, and it also avoids having to write a 
new quadtree structure. I'm sure a similar technique will help the other 
shading types too. I looked at the type 2 function and type 2 shading: there 
is no obvious flaw in the code apart from the one it shares with type 6 and 7 
shading in getRaster(), which I mostly inherited from the earlier code.
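
The caching idea described above might look roughly like the following. This 
is a minimal sketch and not the code from the bitbucket repository: the class 
name PixelColorCache, the ColorFunction callback, the packing of (x, y) into 
one long key, and the lazy filling of the map are all assumptions made for 
illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: remember the color computed for each pixel so that a
// raster loop can look it up instead of recomputing it or scanning a list.
class PixelColorCache
{
    // Hypothetical callback standing in for the expensive per-pixel shading math.
    interface ColorFunction
    {
        int colorAt(int x, int y);
    }

    private final Map<Long, Integer> colors = new HashMap<Long, Integer>();
    private final ColorFunction function;
    private int computations = 0;

    PixelColorCache(ColorFunction function)
    {
        this.function = function;
    }

    // Pack both coordinates into one key; the mask keeps a negative y
    // from overwriting the x half through sign extension.
    private static long key(int x, int y)
    {
        return ((long) x << 32) | (y & 0xFFFFFFFFL);
    }

    // Return the cached value, computing it only on the first request.
    int getColor(int x, int y)
    {
        long k = key(x, y);
        Integer cached = colors.get(k);
        if (cached == null)
        {
            cached = function.colorAt(x, y);
            computations++;
            colors.put(k, cached);
        }
        return cached;
    }

    // Number of times the expensive function was actually invoked.
    int getComputations()
    {
        return computations;
    }
}
```

Asking twice for the same pixel triggers only one computation, which is the 
kind of duplicated work the comment says the hashmap removes.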

If you are willing to close this project earlier than the planned date of 
Aug 19, I think I can finish all the improvements in 2 or 3 more weeks, since 
the math in the other types is simple and obvious. I hope you can consider 
this; thanks in advance.

The repository https://bitbucket.org/xinshu/pdfbox.git is up to date 
(https://bitbucket.org/xinshu/pdfbox).


RE: Regression Testing

2014-07-07 Thread Allison, Timothy B.
John,

   My initial plan for TIKA-1302 is very similar to what Tilman outlined, and 
my understanding/concerns/thoughts were very much in line with what he 
articulated.  The idea is that there should be a small Apache license-able gold 
truth set like both projects now have for specific unit tests (patient-based 
care), but that we should also occasionally take a public-health view and 
compare the outputs of  different versions of our parsers on a large set of 
docs to identify new exceptions or large changes in extracted content/metadata. 

   I'm persuaded by your points about fair use and the importance of "open 
data."  Before proceeding on TIKA-1302, I'd like to get broader feedback on the 
way ahead via legal-discuss or maybe jira's Legal.  Do you mind if I quote your 
arguments?

   Also, I was on my way to requesting a vm from infra for TIKA-1302.  Do you 
see any way that we could share resources so that we're not double-storing 
files on Apache infrastructure?  There may be easy ways to share some eval code 
as well.

  Best,

   Tim

-Original Message-
From: John Hewson [mailto:j...@jahewson.com] 
Sent: Saturday, July 05, 2014 5:01 PM
To: dev@pdfbox.apache.org
Subject: Re: Regression Testing


On 5 Jul 2014, at 13:47, Tilman Hausherr  wrote:

> On 05.07.2014 at 22:12, John Hewson wrote:
>>>>> Copyright is a problem: I'm testing mostly with JIRA attachments that 
>>>>> I've downloaded over the years. While uploading such files to JIRA might 
>>>>> count as fair use, I doubt that this would still be true if they are 
>>>>> included in a distribution. Instead, they should be stored somewhere on 
>>>>> Apache servers where only committers and build software ("Travis", 
>>>>> "Jenkins", ...) can access them. The public PDFs that Maruan mentions 
>>>>> possibly don't have all the problem cases that we solved before. However 
>>>>> I have started working with these files and there are at least 5 recent 
>>>>> issues that deal with them.
>>>> The PDFs won't be in a distribution. They will just happen to be stored in 
>>>> an SVN repo but not our source code repo, in the same way that the website 
>>>> is stored in the "cmssite" branch of SVN or, indeed, as they are on JIRA. 
>>>> The law doesn't distinguish between JIRA and SVN, both are publicly 
>>>> available via HTTP, so using SVN will simply be a continuation of what 
>>>> we're already doing with JIRA.
>>>> 
>>>> The crucial factor is that we're only storing publicly available PDFs, 
>>>> because we have the right to do so, just like Google's cache, and like we 
>>>> currently do with JIRA.
>>> Yes but many PDFs we got aren't really "public". If this svn repo is only 
>>> accessible to committers, and if the publicly available build scripts won't 
>>> break because of this, then it is OK.
>> Any non-public PDFs will not be permitted in our test suite, just as they 
>> shouldn't be on JIRA.
>> 
>>> Note that even if something is "publicly available", it may still be 
>>> copyrighted. Other risks can be that some people upload PDFs that include 
>>> personal data. One really good test PDF was apparently a loan application. 
>>> I remember that the user insisted that 1. it was test data, and 2. that it 
>>> be removed.
>> All Apache development should be in the open, this is a key ASF principle, 
>> having a committers-only test suite is basically a no-no. It's important to 
>> understand that "fair use" allows us to use copyrighted works - this is 
>> expressly permitted, it's the same legal principle as Google's cache. There 
>> is no need to seek permission. This is what we've been doing with JIRA 
>> already for years, so we are already doing this - it's fine.
> 
> The problem is that this has all happened before. A few years ago, many files 
> were deleted, see PDFBOX-391.

That issue is about including files in the source code repo as part of the 
PDFBox distribution, where there is a need to put files under an Apache 2.0 
compatible license. What I'm advocating is keeping a separate public repository 
of test files which are not a part of the PDFBox source, like we currently have 
on JIRA.

-- John


[jira] [Updated] (PDFBOX-1533) When merging certain PDF's several odd looking empty pages occure in the result

2014-07-07 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-1533:


Description: 
Unfortunately I cannot attach an input file for this case as it contains 
confidential customer data, but I'll try to explain the problem in depth so you 
hopefully are able to track it down.

When we merge certain PDFs with the PDFMergerUtility, the result contains 
several empty pages at the end of the document. It seems that only certain 
PDF versions are affected (in particular 1.5 (Acrobat 6.x)). 

I tracked the problem down to the following part in the appendDocument 
method of the PDFMergerUtility:
{code}
//finally append the pages
List<PDPage> pages = srcCatalog.getAllPages();
Iterator<PDPage> pageIter = pages.iterator();
while( pageIter.hasNext() )
{
    PDPage page = pageIter.next();
    PDPage newPage =
        new PDPage( (COSDictionary)cloner.cloneForNewDocument( 
            page.getCOSDictionary() ) );
    newPage.setCropBox( page.findCropBox() );
    newPage.setMediaBox( page.findMediaBox() );
    newPage.setRotation( page.findRotation() );
    destination.addPage( newPage );
}
{code}
The problem is that the call to srcCatalog.getAllPages() returns, for example, 
6 PDPage objects, but for the same input document, the call to 
source.getNumberOfPages() returns only 2. Thus we add 4 odd empty pages to the 
result document.

I hope this description is good enough to figure out the problem. Don't 
hesitate to ask for further details.

  was:
Unfortunately I cannot attach a input file for this case as it contains 
confidential customer data, but I'll try to explain the problem in depth so you 
hopefully are able to track it down.

When we merg certain PDF's with the PDFMergerUtility the result contains 
serveral empty pages at the end of the document. It seems like that only 
certain pdf versions are effected (i.p.: 1.5 (Acrobat 6.x)). 
I tracked the problem down to the the following part in the appendDocument 
method of the PDFMergerUtility:

 //finally append the pages
List pages = srcCatalog.getAllPages();
Iterator pageIter = pages.iterator();
while( pageIter.hasNext() )
{
PDPage page = pageIter.next();
PDPage newPage =
new PDPage( (COSDictionary)cloner.cloneForNewDocument( 
page.getCOSDictionary() ) );
newPage.setCropBox( page.findCropBox() );
newPage.setMediaBox( page.findMediaBox() );
newPage.setRotation( page.findRotation() );
destination.addPage( newPage );
}

The problem is that call to srcCatalog.getAllPages(); returns for expamle 6 
PDPage objects, but for the same input document, the call to 
source.getNumberOfPages() returns only 2. Thus we add 4 odd empty pages to the 
result document.

I hope this description is good enought to figure out the problem. Don't 
hesitate to ask for further details.


> When merging certain PDF's several odd looking empty pages occure in the 
> result
> ---
>
> Key: PDFBOX-1533
> URL: https://issues.apache.org/jira/browse/PDFBOX-1533
> Project: PDFBox
>  Issue Type: Bug
>  Components: Utilities
>Affects Versions: 1.7.1, 1.8.4
>Reporter: Christian Connert
> Attachments: sample_pdf.zip
>
>

Re: Paid PDFBox support

2014-07-07 Thread Leonard Rosenthol
FWIW: It's unclear if such a file (with multiple references from the Pages
tree) is valid.  There is nothing that prevents it, but it's not necessarily
an expected thing.

Leonard

On 7/7/14, 5:05 PM, "Maruan Sahyoun"  wrote:

>the issue is because part1.pdf in PDFBOX-1533 references the same 2 pages
>3 times within the document catalog (/Kids [3 0 R, 3 0 R, 3 0 R]). Could
>you attach a sample pdf to PDFBOX-1533 to verify that your issue has the
>same cause or verify it for yourself?
>
>We are using PDFBox for merging documents ourselves successfully.
>Obviously this file would need some special treatment.
>
>BR
>Maruan
>



Re: Paid PDFBox support

2014-07-07 Thread Maruan Sahyoun
The issue arises because part1.pdf in PDFBOX-1533 references the same 2 pages 3 
times within the page tree referenced from the document catalog 
(/Kids [3 0 R, 3 0 R, 3 0 R]). Could you attach a sample PDF to PDFBOX-1533 to 
verify that your issue has the same cause, or verify it for yourself?

We are using PDFBox for merging documents ourselves successfully. Obviously 
this file would need some special treatment. 
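
To illustrate how repeated /Kids references inflate the page count, here is a 
small sketch. It deliberately uses hypothetical stand-in classes 
(PageTreeSketch, Node) rather than PDFBox types, and the identity-based 
de-duplication is only one conceivable treatment, not something PDFBox is 
known to do. Whether skipping repeated references is the right behaviour is a 
separate question that this sketch does not settle.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a page tree; these are not PDFBox classes.
class PageTreeSketch
{
    static class Node
    {
        final String name;
        final List<Node> kids = new ArrayList<Node>(); // empty for a leaf page

        Node(String name)
        {
            this.name = name;
        }
    }

    // Naive traversal: every kids entry is followed, so a node referenced
    // three times contributes its pages three times.
    static void collectAll(Node node, List<Node> pages)
    {
        if (node.kids.isEmpty())
        {
            pages.add(node);
            return;
        }
        for (Node kid : node.kids)
        {
            collectAll(kid, pages);
        }
    }

    // Identity-based traversal: a node that was already visited is skipped.
    static void collectUnique(Node node, Set<Node> seen, List<Node> pages)
    {
        if (!seen.add(node))
        {
            return;
        }
        if (node.kids.isEmpty())
        {
            pages.add(node);
            return;
        }
        for (Node kid : node.kids)
        {
            collectUnique(kid, seen, pages);
        }
    }

    // A set that compares nodes by reference, not by equals().
    static Set<Node> newIdentitySet()
    {
        return Collections.newSetFromMap(new IdentityHashMap<Node, Boolean>());
    }

    // The root lists the same intermediate node (holding two pages) three times.
    static Node sampleTree()
    {
        Node inner = new Node("3 0 R");
        inner.kids.add(new Node("page 1"));
        inner.kids.add(new Node("page 2"));
        Node root = new Node("root");
        root.kids.add(inner);
        root.kids.add(inner);
        root.kids.add(inner);
        return root;
    }
}
```

On the sample tree the naive traversal yields 6 entries and the identity-based 
one yields 2, the same pair of numbers reported in PDFBOX-1533.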

BR
Maruan

On 07.07.2014 at 11:31, Aleksander Blomskøld wrote:

> Hi,
> 
> We're using PDFBox for PDF validation and PDF merging in a backend
> invoicing system. It's working pretty well for most of the time, but right
> now we're having some unhappy customers because of
> https://issues.apache.org/jira/browse/PDFBOX-1533.
> 
> As it's important for us to have this fixed pretty soon, we're wondering if
> anyone of you would be willing to fix this issue for pay. If so, please
> contact me so we can work out the details.
> 
> 
> Regards,
> 
> Aleksander Blomskøld



Re: Paid PDFBox support

2014-07-07 Thread Tilman Hausherr
I don't do freelancing and I never looked at the merge code so I'm 
hardly your guy, but maybe somebody else will come forward.


This workaround code worked for me with the files in the JIRA issue:

    PDDocument doc1 = PDDocument.loadNonSeq(new File("part1.pdf"), null);
    PDDocument doc2 = PDDocument.loadNonSeq(new File("part2.pdf"), null);

    PDDocument doc3 = new PDDocument();
    doc3.importPage(doc1.getPage(0));
    doc3.importPage(doc1.getPage(1));
    doc3.importPage(doc2.getPage(0));
    doc3.importPage(doc2.getPage(1));
    for (PDPage pdPage : (List<PDPage>) doc3.getDocumentCatalog().getAllPages())
    {
        if (pdPage.getMediaBox() == null)
            pdPage.setMediaBox(PDPage.PAGE_SIZE_A4);
    }
    doc3.save(new File("res.pdf"));
    doc1.close();
    doc2.close();
    doc3.close();


The only mystery (to me) is why I had to set the media box. Without 
that, the page is too small (probably US letter)
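
One possible explanation, offered only as an assumption and not confirmed 
anywhere in this thread: /MediaBox is an inheritable page attribute in the PDF 
specification, so a page may carry no entry of its own and rely on an ancestor 
node in the page tree. If such a page is imported on its own, the inherited 
value could be left behind. The sketch below models this with hypothetical 
classes (not PDFBox types); getMediaBox and findMediaBox only borrow the 
method names seen elsewhere in this digest.

```java
// Hypothetical model of an inheritable page attribute; not PDFBox classes.
class InheritSketch
{
    static class Node
    {
        Node parent;      // null for the root, or for a page detached from its tree
        float[] mediaBox; // null when this node has no entry of its own

        // Own entry only: returns null if the value is merely inherited.
        float[] getMediaBox()
        {
            return mediaBox;
        }

        // Walk up the parents until some node defines the box.
        float[] findMediaBox()
        {
            for (Node n = this; n != null; n = n.parent)
            {
                if (n.mediaBox != null)
                {
                    return n.mediaBox;
                }
            }
            return null;
        }
    }
}
```

In this model a page whose box lives only on its parent resolves it while 
attached, and resolves nothing once the parent link is gone, which would be 
consistent with having to set the media box by hand after importing.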



Tilman

On 07.07.2014 at 11:31, Aleksander Blomskøld wrote:

Hi,

We're using PDFBox for PDF validation and PDF merging in a backend
invoicing system. It's working pretty well for most of the time, but right
now we're having some unhappy customers because of
https://issues.apache.org/jira/browse/PDFBOX-1533.

As it's important for us to have this fixed pretty soon, we're wondering if
anyone of you would be willing to fix this issue for pay. If so, please
contact me so we can work out the details.


Regards,

Aleksander Blomskøld





Re: Custom PDFTextStripper Warning (sometimes)

2014-07-07 Thread -A
John:

Excellent! That fixed it. I appreciate the fast reply. I've been scouring
about for any PDFBox resources I could find and unfortunately have not
found much. If there are any sites or books that go over the API that you
would recommend, then by all means, please do.

Thanks again though!

-Aaron




Re: Custom PDFTextStripper Warning (sometimes)

2014-07-07 Thread John Hewson
Hi Aaron

You’re using the operator classes from the 
“org.apache.pdfbox.util.operator.pagedrawer” package with your custom 
TextStripper; however, these classes are only for use with a PageDrawer. If you 
look at the top entry in the stack trace, 
“org.apache.pdfbox.util.operator.pagedrawer.FillEvenOddRule.process(FillEvenOddRule.java:56)”,
then you’ll see that the code at this line is:

PageDrawer drawer = (PageDrawer)context;

But your context class is a TextStripper (or at least a subclass of it), not a 
PageDrawer. The solution is not to initialise your TextStripper with the 
.properties file which maps PageDrawer operators. Take a look at some of the 
subclasses of TextStripper which are already in PDFBox to see how this is done.
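
The failure mode described above can be reproduced in miniature. The following 
is a sketch with hypothetical classes whose names only echo the stack trace; 
none of them are the PDFBox classes, and the operator table is a plain map 
rather than a .properties file. It uses "f*", the PDF operator for filling a 
path with the even-odd rule.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical miniature of an operator-dispatching engine; not PDFBox code.
class OperatorSketch
{
    interface Operator
    {
        void process(Engine context);
    }

    static class Engine
    {
        final Map<String, Operator> operators = new HashMap<String, Operator>();

        // Dispatch to the registered handler; unregistered operators are ignored.
        void processOperator(String name)
        {
            Operator op = operators.get(name);
            if (op != null)
            {
                op.process(this);
            }
        }
    }

    static class Drawer extends Engine
    {
        int fills = 0;
    }

    static class Stripper extends Engine
    {
    }

    // A drawing operator that assumes its context is a Drawer.
    static class FillOperator implements Operator
    {
        public void process(Engine context)
        {
            Drawer drawer = (Drawer) context; // fails for any other engine type
            drawer.fills++;
        }
    }
}
```

A Stripper that registers FillOperator throws a ClassCastException as soon as 
the content contains "f*", while a Stripper that never registers it simply 
skips the operator. That would also fit the warning showing up only for the 
file with a table, if the table is drawn with fill operators (an assumption; 
the thread does not confirm it).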

-- John

On 7 Jul 2014, at 10:50, -A  wrote:

> Hi everyone; I have a program written that has two PDF function
> requirements:
> 
> 
>   1. It must be able to return all of the text from the file
>   2. It must be able to find red text within the file
> 
> 
> I have two different types of PDF files. One we can call a Job Output File,
> which may or may not have red text in it. The other is a Job Location File
> which contains a table with all of the locations of the Job Output Files.
> Originally I wrote the program with a custom text stripper which simply
> adds a state boolean to track whether it found red in a given file. I then
> created an overridden processTextPosition method that looks like the
> following:
> 
> [I found this method through researching but if there is a better method,
> by all means share]
> 
> @Override
> protected void processTextPosition(TextPosition textPos)
> {
>     try
>     {
>         PDGraphicsState graphicsState = getGraphicsState();
> 
>         // IF the current text contains RED
>         if (graphicsState.getNonStrokingColor().getJavaColor().getRed() == 255)
>         {
>             this.hasRed = true;
>         }
>     }
>     catch (IOException ioe)
>     {
>         ioe.printStackTrace();
>     }
> }
> 
> If I run the program on a Job Output File it works flawlessly. If I run it
> on a Job Location File (which will never have red in it), I get the
> following warning:
> 
> org.apache.pdfbox.util.operator.pagedrawer.FillEvenOddRule process
> WARNING: java.lang.ClassCastException: MyPDFStripper cannot be cast to
> org.apache.pdfbox.pdfviewer.PageDrawer
> java.lang.ClassCastException: MyPDFStripper cannot be cast to
> org.apache.pdfbox.pdfviewer.PageDrawer
> at
> org.apache.pdfbox.util.operator.pagedrawer.FillEvenOddRule.process(FillEvenOddRule.java:56)
> at
> org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
> at
> org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
> at
> org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
> at
> org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
> at MyPDFStripper.containsRed(IncrementalPDFStripper.java:68)
> 
> 
> The program will generate NO warnings if I comment out the method call for
> containsRed when passing it a Job Location File. Knowing this, I could get
> around this warning rather easily by handling this case differently (which
> it would be, but this is what testing is for; right?). But my question to
> all of you is, why am I getting this? Is it because this Job Location File
> has locations in a table that is throwing off the TextStripper? This is the
> only difference between the files (neither contains images) that I can tell.
> 
> 
> Thank you guys for your time!
> Sincerely,
> Aaron



Jenkins build became unstable: PDFBox-trunk #1124

2014-07-07 Thread Apache Jenkins Server
See 



Jenkins build became unstable: PDFBox-trunk » Apache PDFBox #1124

2014-07-07 Thread Apache Jenkins Server
See 




[jira] [Resolved] (PDFBOX-2194) Refactor predictor

2014-07-07 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-2194.
-

Resolution: Fixed

Done in rev 1608530 for the trunk and rev 1608537 for the 1.8 branch.

> Refactor predictor
> --
>
> Key: PDFBOX-2194
> URL: https://issues.apache.org/jira/browse/PDFBOX-2194
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 1.8.6, 1.8.7, 2.0.0
>Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Minor
>  Labels: Predictor
> Fix For: 1.8.7, 2.0.0
>
>
> The predictor class has an unneeded ByteArrayOutputStream. I observed a 
> temporary memory usage of sometimes 1MB. I will remove this and do some minor 
> cleanup.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (PDFBOX-2194) Refactor predictor

2014-07-07 Thread Tilman Hausherr (JIRA)
Tilman Hausherr created PDFBOX-2194:
---

 Summary: Refactor predictor
 Key: PDFBOX-2194
 URL: https://issues.apache.org/jira/browse/PDFBOX-2194
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing
Affects Versions: 1.8.6, 1.8.7, 2.0.0
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
Priority: Minor
 Fix For: 1.8.7, 2.0.0


The predictor class has an unneeded ByteArrayOutputStream. I observed a 
temporary memory usage of sometimes 1MB. I will remove this and do some minor 
cleanup.





Custom PDFTextStripper Warning (sometimes)

2014-07-07 Thread -A
Hi everyone; I have a program written that has two PDF function
requirements:


   1. It must be able to return all of the text from the file
   2. It must be able to find red text within the file


I have two different types of PDF files. One we can call a Job Output File,
which may or may not have red text in it. The other is a Job Location File
which contains a table with all of the locations of the Job Output Files.
Originally I wrote the program with a custom text stripper which simply
adds a state boolean to track whether it found red in a given file. I then
created an overridden processTextPosition method that looks like the
following:

[I found this method through researching but if there is a better method,
by all means share]

@Override
protected void processTextPosition(TextPosition textPos)
{
    try
    {
        PDGraphicsState graphicsState = getGraphicsState();

        // IF the current text contains RED
        if (graphicsState.getNonStrokingColor().getJavaColor().getRed() == 255)
        {
            this.hasRed = true;
        }
    }
    catch (IOException ioe)
    {
        ioe.printStackTrace();
    }
}
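One thing worth noting about the check above: testing only getRed() == 255
also matches white, yellow and magenta, since those colors have a full red
channel too. A minimal sketch of a stricter test, using plain java.awt.Color.
The class name RedCheck and both thresholds are assumptions chosen for
illustration; they are not part of PDFBox.

```java
import java.awt.Color;

final class RedCheck
{
    // Hypothetical thresholds: a strong red channel, weak green and blue channels.
    private static final int MIN_RED = 200;
    private static final int MAX_OTHER = 60;

    /** Returns true only when red clearly dominates the other two channels. */
    static boolean isRed(Color c)
    {
        return c.getRed() >= MIN_RED
            && c.getGreen() <= MAX_OTHER
            && c.getBlue() <= MAX_OTHER;
    }

    public static void main(String[] args)
    {
        System.out.println(isRed(Color.RED));    // pure red
        System.out.println(isRed(Color.WHITE));  // full red channel, but not red
    }
}
```

In the stripper, the Color returned by getJavaColor() could be passed to
isRed() in place of the single-channel comparison.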

If I run the program on a Job Output File it works flawlessly. If I run it
on a Job Location File (which will never have red in it), I get the
following warning:

org.apache.pdfbox.util.operator.pagedrawer.FillEvenOddRule process
WARNING: java.lang.ClassCastException: MyPDFStripper cannot be cast to org.apache.pdfbox.pdfviewer.PageDrawer
java.lang.ClassCastException: MyPDFStripper cannot be cast to org.apache.pdfbox.pdfviewer.PageDrawer
    at org.apache.pdfbox.util.operator.pagedrawer.FillEvenOddRule.process(FillEvenOddRule.java:56)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
    at MyPDFStripper.containsRed(IncrementalPDFStripper.java:68)


The program will generate NO warnings if I comment out the method call for
containsRed when passing it a Job Location File. Knowing this, I could get
around this warning rather easily by handling this case differently (which
it would be, but this is what testing is for; right?). But my question to
all of you is, why am I getting this? Is it because this Job Location File
has locations in a table that is throwing off the TextStripper? This is the
only difference between the files (neither contains images) that I can tell.


Thank you guys for your time!
Sincerely,
Aaron


Re: Improving OCR plugin for PDFBox

2014-07-07 Thread John Hewson
Santosh,

Please don’t e-mail the entire mailing list asking to be unsubscribed, simply 
send an e-mail to:

dev-unsubscr...@pdfbox.apache.org

-- John

On 7 Jul 2014, at 10:39, Santosh Arakeri  wrote:

> Pl dont send me mail.
> 
> 
> On Fri, Jun 27, 2014 at 12:28 PM, John Hewson  wrote:
> 
>> Hi Dimuthu
>> 
>> That’s great. We should wait until closer to the end of the GSoC period to
>> integrate your work with PDFBox, as ideally we only want to have to do it
>> once. We’ve not included C++ dependencies before so no, there won’t be a
>> standard way, we’ll have to think something up. We’ll either make it an
>> optional sub-project and the Tesseract JNI bindings might be better off
>> having their own branch so that they are more like an external dependency -
>> I’ll ask the dev mailing list.
>> 
>> To prepare your code for contribution you’ll need to add the Apache header
>> to each .java file (see any PDFBox .java file for an example) and submit a
>> signed ICLA http://www.apache.org/licenses/icla.pdf to Apache.
>> 
>> Regarding additional functionality, the most useful would be for a new
>> command line tool which could write the OCR’d text back into the original
>> PDF file as “invisible text”, which would allow for copy and paste and text
>> search to then work for that PDF file. A starting point for this would be
>> to try and write the OCR’d text into the original PDF as “visible” text -
>> we can make it invisible later!
>> 
>> -- John
>> 
>> On 19 Jun 2014, at 13:57, DImuthu Upeksha 
>> wrote:
>> 
>>> Hi John,
>>> Except providing compatibility for platforms like windows, I think most
>> of the functionalities of OCR plugin are finished (Please correct me if I'm
>> wrong). But I would like to contribute to project further. Do  you have
>> anything to add as a new functionality? And If you plan to add this to
>> PDFBox code, how should prepare my code? Is there any standard way?
>>> 
>>> Thanks
>>> Dimuthu
>>> --
>>> Regards
>>> W.Dimuthu Upeksha
>>> Undergraduate
>>> Department of Computer Science And Engineering
>>> University of Moratuwa, Sri Lanka
>> 
>> 



Re: Improving OCR plugin for PDFBox

2014-07-07 Thread Santosh Arakeri
Pl dont send me mail.


On Fri, Jun 27, 2014 at 12:28 PM, John Hewson  wrote:

> Hi Dimuthu
>
> That’s great. We should wait until closer to the end of the GSoC period to
> integrate your work with PDFBox, as ideally we only want to have to do it
> once. We’ve not included C++ dependencies before so no, there won’t be a
> standard way, we’ll have to think something up. We’ll either make it an
> optional sub-project and the Tesseract JNI bindings might be better off
> having their own branch so that they are more like an external dependency -
> I’ll ask the dev mailing list.
>
> To prepare your code for contribution you’ll need to add the Apache header
> to each .java file (see any PDFBox .java file for an example) and submit a
> signed ICLA http://www.apache.org/licenses/icla.pdf to Apache.
>
> Regarding additional functionality, the most useful would be for a new
> command line tool which could write the OCR’d text back into the original
> PDF file as “invisible text”, which would allow for copy and paste and text
> search to then work for that PDF file. A starting point for this would be
> to try and write the OCR’d text into the original PDF as “visible” text -
> we can make it invisible later!
>
> -- John
>
> On 19 Jun 2014, at 13:57, DImuthu Upeksha 
> wrote:
>
> > Hi John,
> > Except providing compatibility for platforms like windows, I think most
> of the functionalities of OCR plugin are finished (Please correct me if I'm
> wrong). But I would like to contribute to project further. Do  you have
> anything to add as a new functionality? And If you plan to add this to
> PDFBox code, how should prepare my code? Is there any standard way?
> >
> > Thanks
> > Dimuthu
> > --
> > Regards
> > W.Dimuthu Upeksha
> > Undergraduate
> > Department of Computer Science And Engineering
> > University of Moratuwa, Sri Lanka
>
>


[jira] [Comment Edited] (PDFBOX-1695) Improve pdfbox tests

2014-07-07 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052896#comment-14052896
 ] 

Tilman Hausherr edited comment on PDFBOX-1695 at 7/7/14 5:33 PM:
-

Committed changes (create a diff image in target dir) for the trunk in rev 
1608040 and 1608521. I'm leaving this open for a while, because we're 
discussing something related in the dev list.


was (Author: tilman):
Committed changes (create a diff image in target dir) for the trunk in rev 
1608040. I'm leaving this open for a while, because we're discussing something 
related in the dev list.

> Improve pdfbox tests
> 
>
> Key: PDFBOX-1695
> URL: https://issues.apache.org/jira/browse/PDFBOX-1695
> Project: PDFBox
>  Issue Type: Improvement
>Affects Versions: 1.8.2, 2.0.0
>Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Minor
>  Labels: tdd, test-driven, testing
> Attachments: ccitt4.tif, jbig2test-01.png, jbig2test.pdf
>
>
> I'd like to improve the tests for rendering.
> org/apache/pdfbox/util/TestPDFToImage.java is disabled in pdfbox\pom.xml . 
> This has been disabled since 2009 ?! So I enabled it here.
> The subdir "rendering" is missing in pdfbox\target\test-output for these tests
> When a test fails because the rendered image is not identical, no detailed 
> message appears on the console. It appears only in pdfbox.log and not on the 
> console.
> this is because of the settings in
> pdfbox\src\test\resources\logging.properties
> If this is on purpose, please change the texts in 
> pdfbox\src\test\java\org\apache\pdfbox\util\*.java from
> "One or more failures, see test log for details"
> to
> "One or more failures, see test logfile 'pdfbox.log' for details"
> I wanted to attach a PDF with ccitt g4 compression and its rendering created 
> with the 1.8.2 version, but it doesn't work out, seems that CIB generates 
> files that can be rendered properly with 1.8.2. However I attach the TIFF g4 
> file, and a JBIG2 test file from it. I don't have access to a Xerox 
> WorkCentre (enter jbig2 in google news :-) ) so I used a free service, so 
> there's a watermark.
> It should be included into
> pdfbox\src\test\resources\input\rendering
> I have created the image myself and I give it into the public domain.
> If my suggestion is accepted, it would be nice if people could create files 
> that fail in current versions or have failed in old versions, and release 
> these files to the public domain, so that they can be added to the tests.





[jira] [Commented] (PDFBOX-283) Character encoding/appearance issues when filling forms

2014-07-07 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053793#comment-14053793
 ] 

Tilman Hausherr commented on PDFBOX-283:


Done in rev 1608502 for the 1.8 branch and rev 1608506 for the trunk. Sorry for 
the regression. The code made sense, I looked at the specification too, so I 
committed it. Please test a snapshot version to be sure that all is OK now. If 
possible, please submit a "minimal" code example that fails (e.g. with an 
exception) with the bad code, and doesn't fail with the good code. I would then 
add it as a test.

> Character encoding/appearance issues when filling forms
> ---
>
> Key: PDFBOX-283
> URL: https://issues.apache.org/jira/browse/PDFBOX-283
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
> Attachments: PDAppearance.patch
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1735902
> Originally submitted by scop on 2007-06-12 10:23.
> When filling a text field with non-ASCII characters such as in my surname 
> "Skyttä" and saving the document in a UTF-8 environment, something goes 
> wrong with the appearance of the text.
> The value itself seems to be stored correctly, but when opening the doc, the 
> appearance of "ä" is not that, but rather something which happens when UTF-8 
> is mistakenly treated as ISO-8859-1 (two garbage characters).
> PDAppearance uses the platform default encoding in quite a few places which 
> apparently has potential to mess things up.  In particular, 
> insertGeneratedAppearance() generates a PrintWriter from an OutputStream 
> without specifying the encoding.  In fact, if I hack that to use ISO-8859-1, 
> the appearance of my "ä" case is correct, but that won't obviously work with 
> anything else than chars that are valid ISO-8859-1.
> In which char encoding should the value be written to the appearance stream 
> (at end of insertGeneratedAppearance())?
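
The symptom described above (UTF-8 bytes read back as ISO-8859-1) can be
reproduced with the JDK alone. This is a small sketch for illustration, not
PDFBox code; the class name EncodingDemo and its methods are made up.

```java
import java.nio.charset.StandardCharsets;

final class EncodingDemo
{
    /** Encodes text as UTF-8, then decodes the bytes as ISO-8859-1 (the mistake). */
    static String misread(String text)
    {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    /** Encodes and decodes with the same explicit charset, so the text survives. */
    static String roundTrip(String text)
    {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args)
    {
        String name = "Skytt\u00e4"; // the surname from the report
        // The misread version is one character longer: the single a-umlaut
        // turns into two unrelated characters.
        System.out.println(misread(name).length() - name.length());
        System.out.println(roundTrip(name).equals(name));
    }
}
```

As the report says, an explicit ISO-8859-1 round trip only helps for
characters that exist in ISO-8859-1; anything else is still lost.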





[jira] [Commented] (PDFBOX-283) Character encoding/appearance issues when filling forms

2014-07-07 Thread Marco Primiceri (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053683#comment-14053683
 ] 

Marco Primiceri commented on PDFBOX-283:


Hello [~tilman]

Maruan's patch has solved one of my issues as well (thanks!) but it introduced a 
new bug when filling multi-line text boxes.
Do you mind applying this fix to the multi line conversion method as well?
{code}
--- a/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAppearance.java
+++ b/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAppearance.java
@@ -452,7 +452,7 @@
 while( (currIdx = line.indexOf('\n',lastIdx )) > -1 )
 {
 result.append(line.substring(lastIdx,currIdx));
-result.append(" ) Tj\n0 -13 Td\n(");
+result.append(" > Tj\n0 -13 Td\n<");
 lastIdx = currIdx + 1;
 }
 result.append(line.substring(lastIdx));
{code}
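
To illustrate what the patch is after: once the value is written as a PDF
hexadecimal string (<...>) instead of a literal string ((...)), the separator
emitted between lines has to close and reopen with angle brackets as well.
A rough, self-contained sketch follows. It is not the PDAppearance code; the
class name HexLines, the ISO-8859-1 encoding and the point at which the hex
conversion happens are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;

final class HexLines
{
    /** Hex-encodes the ISO-8859-1 bytes of one line, e.g. "A" becomes "41". */
    static String toHex(String line)
    {
        StringBuilder hex = new StringBuilder();
        for (byte b : line.getBytes(StandardCharsets.ISO_8859_1))
        {
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    /** Builds show-text operators for a multi-line value, one Tj per line. */
    static String showText(String value)
    {
        StringBuilder result = new StringBuilder("<");
        int lastIdx = 0;
        int currIdx;
        while ((currIdx = value.indexOf('\n', lastIdx)) > -1)
        {
            result.append(toHex(value.substring(lastIdx, currIdx)));
            // Close the hex string, move down one line, open the next hex string.
            result.append("> Tj\n0 -13 Td\n<");
            lastIdx = currIdx + 1;
        }
        result.append(toHex(value.substring(lastIdx)));
        result.append("> Tj\n");
        return result.toString();
    }
}
```

With the old ") Tj ... (" separator, the second and later lines would start a
literal string in the middle of hex digits, which is the multi-line breakage
described above.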

Kind Regards,
Marco



[jira] [Commented] (PDFBOX-2107) Make PDFBox XMP library agnostic

2014-07-07 Thread MH (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053651#comment-14053651
 ] 

MH commented on PDFBOX-2107:


It took me a while to figure out why suddenly those 2 methods are "deprecated". 
Would you please add a full Javadoc explanation of HOW they should be replaced? The 
Javadoc/source of org.apache.pdfbox.pdmodel.common.PDMetadata just says

Deprecated

That's all. If you deprecate a method/class you should always give a 
link/explanation why it is deprecated and how it can be replaced.

So, the reason is this ticket PDFBOX-2107, right? And what's the equivalent 
replacement using the new xmpbox.jar?
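
For illustration, this is roughly the shape of Javadoc being asked for. The
class and method names below are hypothetical stand-ins, not the actual
PDMetadata API, and the replacement shown is invented for the example.

```java
import java.nio.charset.StandardCharsets;

final class MetadataHolder
{
    private final byte[] xmp;

    MetadataHolder(byte[] xmp)
    {
        this.xmp = xmp.clone();
    }

    /**
     * Returns the metadata decoded as a string.
     *
     * @deprecated Ties callers to one way of interpreting the metadata. Use
     * {@link #getRawBytes()} and parse the bytes with the XMP library of
     * your choice instead.
     */
    @Deprecated
    String getAsString()
    {
        return new String(xmp, StandardCharsets.UTF_8);
    }

    /** Returns a copy of the raw XMP bytes, leaving parsing to the caller. */
    byte[] getRawBytes()
    {
        return xmp.clone();
    }
}
```

The point is the @deprecated text: it states why the method is going away and
names the replacement, so a reader does not have to search the issue tracker.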

> Make PDFBox XMP library agnostic
> 
>
> Key: PDFBOX-2107
> URL: https://issues.apache.org/jira/browse/PDFBOX-2107
> Project: PDFBox
>  Issue Type: Improvement
>  Components: PDModel
>Reporter: Maruan Sahyoun
>Assignee: Maruan Sahyoun
>Priority: Minor
> Fix For: 2.0.0
>
>
> PDFBox should become agnostic to how XMP metadata has been generated. This 
> will also remove the dependency on Jempbox.
> The benefit will be that Jempbox, Xmpbox as well as other libraries could be 
> used for generating the XMP metadata. PDFBox will only provide methods to get 
> and set the XMP metadata.





Paid PDFBox support

2014-07-07 Thread Aleksander Blomskøld
Hi,

We're using PDFBox for PDF validation and PDF merging in a backend
invoicing system. It's working pretty well for most of the time, but right
now we're having some unhappy customers because of
https://issues.apache.org/jira/browse/PDFBOX-1533.

As it's important for us to have this fixed pretty soon, we're wondering if
anyone of you would be willing to fix this issue for pay. If so, please
contact me so we can work out the details.


Regards,

Aleksander Blomskøld


Re: Regression Testing

2014-07-07 Thread Petr Slabý

Hi,
following is a description of what we are doing in our company.

With our software, we run regression tests after each nightly build and 
sometimes it is a tough fight. If there is a regression, it is not so easy 
to find which commit caused it, because there are potentially many between 
the nightly builds. Then, the decision whether the change is wanted and 
expected is in some cases also difficult (this part might be easier with PDF 
where there is the "golden standard" rendering in Acrobat). If the change is 
expected and the new rendering "better" then one has to commit the new 
reference. This means that the files produced on the nightly build machine 
must be available somehow - it is almost impossible to produce them locally 
as the rendering results are slightly different with different versions of 
java and many other reasons. All this has to be done before the next 
regression test is run to avoid that new regressions are hidden by earlier 
ones. Our complete build with all tests runs several hours...


To improve this workflow, we now use the following schema in addition:
- there is a smaller set of regression tests which runs relatively fast
- these tests are triggered by each commit in formatting and rendering 
related projects
- before running the test itself, the modified project(s) are compiled 
locally, w/o publishing the result to maven

- the reference rendering files are stored in SVN
- if a test finds a regression, it immediately stores the new result as a 
new reference into SVN. This makes sure that a) the test renderings do not 
get lost and b) that each regression exactly points to the commit that has 
caused it - the one that triggered the test. The failed test creates a new 
issue in JIRA with a pointer to SVN to the before and after rendering and a 
bitmap of the differences. The issue is then processed. If we find the 
change to be expected then the issue is simply closed, otherwise we take 
actions to fix the problem. The only annoying thing about this scheme is 
that, after committing the correction, the test runs again and reports a 
regression because it now compares to the faulty version of the rendering.
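
The "store the new result as the reference immediately" step could look
roughly like the following. It is only a sketch under assumptions: it uses
plain files instead of SVN, compares bytes rather than pixels, and the class
and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

final class ReferenceCheck
{
    /**
     * Compares a fresh rendering with the stored reference. On a mismatch the
     * old reference is kept as "before" and the new rendering becomes the
     * reference, so the next run only reports changes made by later commits.
     *
     * @return true if a regression (any difference) was detected
     */
    static boolean checkAndPromote(Path reference, Path rendered, Path before)
        throws IOException
    {
        byte[] expected = Files.readAllBytes(reference);
        byte[] actual = Files.readAllBytes(rendered);
        if (Arrays.equals(expected, actual))
        {
            return false;
        }
        Files.copy(reference, before, StandardCopyOption.REPLACE_EXISTING);
        Files.copy(rendered, reference, StandardCopyOption.REPLACE_EXISTING);
        return true;
    }
}
```

This also shows where the annoyance comes from: once a faulty rendering has
been promoted, the commit that fixes it differs from the reference again and
is reported as a regression too.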


Best regards,
Petr.

-Original message- 
From: John Hewson

Sent: Friday, July 04, 2014 7:39 PM
To: dev@pdfbox.apache.org
Subject: Re: Regression Testing

Hi Tilman

Thanks for your thoughts, I think that your concerns are already covered by 
my original proposal, I’ll try to explain why and how:


Of course I agree with the need for regression tests, however it isn't 
easy: besides the problems of the different JDKs (I use JDK7 Windows 64 
bit), there is the problem that some enhancements create slight changes in 
rendering that are not errors, i.e. both the "before" and the "after" 
files look OK by themselves. This has happened when we changed the text 
rendering recently, and has happened again when the clipping was improved. 
The cause is probably slight changes in color or in boundaries.


If a rendering has changed then the regression test should fail. When a 
failure occurs the developer needs to manually inspect the differences (we 
could generate a visual diff which highlights what changed to make this 
easier) and if ok then they can replace the known-good PNG with the ones 
just rendered. Indeed this will be the basic workflow for working with 
regression tests.
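
A visual diff of the kind mentioned above can be produced with the JDK's
BufferedImage alone. The sketch below is an assumption about how it might be
done, not existing PDFBox test code; the class name ImageDiff and the
red/white marking are chosen for the example.

```java
import java.awt.Color;
import java.awt.image.BufferedImage;

final class ImageDiff
{
    /**
     * Returns an image marking differing pixels in red and equal pixels in
     * white, or null when both images are identical.
     */
    static BufferedImage diff(BufferedImage expected, BufferedImage actual)
    {
        int width = Math.max(expected.getWidth(), actual.getWidth());
        int height = Math.max(expected.getHeight(), actual.getHeight());
        BufferedImage result = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        boolean changed = false;
        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                // Pixels outside either image count as different.
                boolean same = inside(expected, x, y) && inside(actual, x, y)
                    && expected.getRGB(x, y) == actual.getRGB(x, y);
                result.setRGB(x, y, same ? Color.WHITE.getRGB() : Color.RED.getRGB());
                changed |= !same;
            }
        }
        return changed ? result : null;
    }

    private static boolean inside(BufferedImage image, int x, int y)
    {
        return x < image.getWidth() && y < image.getHeight();
    }
}
```

A non-null result could be written next to the failing rendering with
javax.imageio.ImageIO.write(result, "png", file), so the developer only has
to look at the red areas.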


Copyrights is a problem: I'm testing mostly with JIRA attachments that 
I've downloaded over the years. While uploading such files to JIRA might 
count as fair use, I doubt that this would still be true if they are 
included in a distribution. Instead, they should be stored somewhere on 
Apache servers where only committers and build software ("Travis", 
"Jenkins", ...) can access them. The public PDFs that Maruan mentions 
probably don't have all the problem cases that we solved before. However I 
have started working with these files and there are at least 5 recent 
issues that deal with them.


The PDFs won’t be in a distribution. They will just happen to be stored in 
an SVN repo but not our source code repo, in the same way that the website 
is stored in the “cmssite” branch of SVN or indeed, are on JIRA. The law 
doesn’t distinguish between JIRA and SVN, both are publicly available via 
HTTP, so using SVN will simply be a continuation of what we’re already doing 
with JIRA.


The crucial factor is that we’re only storing publicly available PDFs, 
because we have the right to do so, just like Google’s cache, and like we 
currently do with JIRA.


Additionally, the PDFs need to be version controlled otherwise we won’t be 
able to reliably recreate previous builds, so storing the files on a web 
server won’t be practical. Also committers will frequently be updating the 
renderings as bugs are fixed and we’ll need to version-control the rendered 
PNG files for the same reason. Finally, having committers-only files doesn’t 
fit well with the Apache goal of open development and would be u