fop-pdf-image and fonts; as requested
Hi

As requested by Mehdi Houshmand, I'm elaborating on the issue we've been running into with fop-pdf-image. I've asked about aspects of it on the list before, but now have a better understanding of what's going on.

Where input PDFs being used as form XObjects contain embedded subset fonts, I'm seeing many copies of those fonts being embedded in the output document. This creates huge output files with lots of duplicate font data, and in a few cases has even crashed the RIP used by my work's offset press printer. I think they use a Fiery, but I struggle to get any more info than that out of them.

The issue is that fop-pdf-image copies PDFs into fop output PDFs by copying the content stream and resources dictionary verbatim from the page being extracted from the input PDF, translating it from PDFBox into fop PDF structures in the process. This is extremely reliable, ensuring that fop-pdf-image form XObjects don't conflict or interfere with the embedding page or vice versa. Unfortunately it also leads to massive duplication of data, including:

- Fonts, both subsets and fully embedded fonts
- Embedded ICC profiles, if present
- Images re-used across multiple pages or documents

In the case of images, ICC profiles, and fully embedded fonts it'd potentially be relatively easy to coalesce these so that all resources dictionaries refer to the same object. It's a little hacky because fop doesn't give image plugins any "official" way to store data about a rendering run for later reference, but it's easy enough to do by storing a WeakHashMap associating object type and checksum data with a particular rendering run. I haven't implemented coalescing of images and profiles because it's not part of my problem space, but it shouldn't be too hard.

Unfortunately, the above approach doesn't work for our problem, which is duplicated *subset* fonts. There are 20 or 30 copies of Helvetica Regular alone in one of our typical runs, with a mixture of MacRoman, Custom and WinAnsi encodings. They're drawn from the same two or three copies of Helvetica from different sources, but each subset has a different (though largely overlapping) glyph set. Fop-pdf-image correctly but rather sub-optimally copies each subset and references it from the associated Form XObject, creating working output but lots of wasted space and duplication.

We can't just write the font out the first time we see it and adjust all future references to the copy we've already written because, unlike with ICC profiles and repeatedly used images, each copy is different.

I see two possible solutions to this problem. Both have the same pre-requisites:

(1) A mechanism for image plugins to keep plugin-specific data associated with a specific rendering run. A WeakHashMap works for this, though it isn't pretty.

(2) Code in the image plugin to record each use of each font and group usages into compatible groups, so all font references in a group can point to the same font in the output. This code can also collect glyph usage information, producing a map of which glyphs are required by one or more content streams.

(3) A way to create a new embedded font in the output, either by combining input subsets into a single new subset font object or by loading a whole font off the HDD and making a new subset with just the required glyphs from it.

(4) Some way to be notified, at minimum, just before the xref table is going to be written out, so the new font can be written to the output stream. The new font can't be written until we know the last embedded PDF has been written out, because a later PDF might use additional glyphs that must be added to the subset.

(5) [Optional but useful] Smarter font loading where more than just (family, weight, slant) 3-tuples are used to match fonts, so I can use fop's font loading and cache code to see whether there's a whole font available to fop that can be substituted for an embedded subset. For example, I might need to match Myriad Pro Ultrabold Italic SemiCond, a small caps variant face, or similar, with no confusion between different condensed/expanded versions of the same face, different specialist variants, etc. Right now fop's font matching code simply cannot do that, so I can't really create new font subsets as an alternative for (3) and have to try to combine subsets from the input instead.

I have (1) working and I have a prototype of (2) that dumps font usage data for a run, including a glyph usage map. I was trying to avoid (3) for Base14 fonts by just replacing the Resources reference to the font with a base14 font ref, but PDF readers seem to choke on this for reasons I haven't yet determined.

(4) is the big problem. I can't do a proper implementation of (3) without some way to write the produced font out at the end.

For (4) I'd really appreciate advice from the fop community. I need a way for a plugin to hook into output just before the xref table is written, so it can write new objects to the pdf stream. The ob
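To make prerequisites (1) and (2) above a little more concrete, here is a minimal Java sketch of per-run bookkeeping built on a WeakHashMap. It is only an illustration under stated assumptions: the class FontUsageRegistry, its methods, the "run key" object and the grouping rule (base font name plus encoding name) are all hypothetical and are not part of fop or fop-pdf-image. What really makes two subsets "compatible" is left open in the message. The only PDF fact relied on is that subset font names carry a six-letter tag such as "ABCDEF+Helvetica".

```java
import java.util.BitSet;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical helper: neither this class nor its methods exist in fop or fop-pdf-image.
final class FontUsageRegistry {

    // Weak keys: a run's registry becomes collectable once the run object itself is.
    private static final Map<Object, FontUsageRegistry> RUNS =
            Collections.synchronizedMap(new WeakHashMap<Object, FontUsageRegistry>());

    // Group key -> codes of every glyph used by at least one content stream.
    private final Map<String, BitSet> glyphsByGroup = new HashMap<String, BitSet>();

    private FontUsageRegistry() { }

    // runKey stands in for whatever object identifies one rendering run (assumption).
    static FontUsageRegistry forRun(Object runKey) {
        synchronized (RUNS) {
            FontUsageRegistry registry = RUNS.get(runKey);
            if (registry == null) {
                registry = new FontUsageRegistry();
                RUNS.put(runKey, registry);
            }
            return registry;
        }
    }

    // Assumed grouping rule: strip the subset tag, then pair the name with the encoding.
    static String groupKey(String fontName, String encoding) {
        String base = fontName.matches("[A-Z]{6}\\+.+") ? fontName.substring(7) : fontName;
        return base + "|" + encoding;
    }

    // Record that one embedded PDF used these glyph codes from the given font.
    synchronized void recordUse(String fontName, String encoding, int[] glyphCodes) {
        String key = groupKey(fontName, encoding);
        BitSet used = glyphsByGroup.get(key);
        if (used == null) {
            used = new BitSet();
            glyphsByGroup.put(key, used);
        }
        for (int code : glyphCodes) {
            used.set(code);
        }
    }

    // Union of all glyph codes recorded so far for the group (a defensive copy).
    synchronized BitSet glyphsFor(String fontName, String encoding) {
        BitSet used = glyphsByGroup.get(groupKey(fontName, encoding));
        return used == null ? new BitSet() : (BitSet) used.clone();
    }
}
```

With something like this, each embedded PDF would call recordUse while it is being copied, and the merged glyph set would only be read once the last embedded PDF has been written, which is exactly why the hook asked for in (4) is needed.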
Re: Bugzilla #46962 - Deadlock in PropertyCache
Thank you all for your replies. I have just printed the ICLA and will send it anyway, so that it will not be an impediment to applying this or future patches.

@Vincent I will be happy to clarify anything related to the patch. It would be more transparent, though, if there were a comment on the issue or an email to any FOP mailing list, so that I can get feedback.

Alexios Giotis

On Feb 28, 2012, at 7:19 PM, Glenn Adams wrote:

> benson, thanks for that clarification, i see in [1] that though an ICLA is
> not required of a contributor, it is nevertheless desirable to have one
> submitted; so, Alexios, if you wish to submit an ICLA please do so; however,
> given the limited scope of the patch, I would agree that it is not strictly
> required, and the lack of one should not impede applying the patch
>
> glenn
>
> [1] http://www.apache.org/licenses/#clas
>
> On Tue, Feb 28, 2012 at 10:05 AM, Benson Margulies wrote:
> an icla is not required for a patch attached to a bz unless it is of unusual
> size or not coded by the bz submitter.
>
>
> On Feb 28, 2012, at 11:53 AM, Glenn Adams wrote:
>
>> I support committing this patch, however I don't see an ICLA listed at [1]
>> for Alexios. Alexios, if you have not submitted an ICLA [2], please do so.
>>
>> I would be happy to apply the patch (if Mehdi doesn't have the time).
>>
>> [1] http://people.apache.org/committer-index.html#unlistedclas
>> [2] http://www.apache.org/licenses/icla.txt
>
Re: Bugzilla #46962 - Deadlock in PropertyCache
benson, thanks for that clarification, i see in [1] that though an ICLA is not required of a contributor, it is nevertheless desirable to have one submitted; so, Alexios, if you wish to submit an ICLA please do so; however, given the limited scope of the patch, I would agree that it is not strictly required, and the lack of one should not impede applying the patch

glenn

[1] http://www.apache.org/licenses/#clas

On Tue, Feb 28, 2012 at 10:05 AM, Benson Margulies wrote:

> an icla is not required for a patch attached to a bz unless it is of
> unusual size or not coded by the bz submitter.
>
>
> On Feb 28, 2012, at 11:53 AM, Glenn Adams wrote:
>
> I support committing this patch, however I don't see an ICLA listed at [1]
> for Alexios. Alexios, if you have not submitted an ICLA [2], please do so.
>
> I would be happy to apply the patch (if Mehdi doesn't have the time).
>
> [1] http://people.apache.org/committer-index.html#unlistedclas
> [2] http://www.apache.org/licenses/icla.txt
>
>
Re: Bugzilla #46962 - Deadlock in PropertyCache
Hi Guys,

My apologies for the lack of transparency on this issue, but I didn't actually review the changes you made here; in fact, I barely looked at what PropertyCache actually does. I had some free time and added a bunch of unit tests. The reason this hasn't been committed yet is that Vincent said he had some questions about the patch. That's as far as I know; maybe he could give some feedback on the issue.

Let me reiterate my apologies on this; it's not fair that this has been ignored. I'll endeavour to make the process more transparent in future, and I hope this doesn't prevent you or any other contributors from submitting patches.

Mehdi

On 28 February 2012 16:52, Glenn Adams wrote:

> I support committing this patch, however I don't see an ICLA listed at [1]
> for Alexios. Alexios, if you have not submitted an ICLA [2], please do so.
>
> I would be happy to apply the patch (if Mehdi doesn't have the time).
>
> [1] http://people.apache.org/committer-index.html#unlistedclas
> [2] http://www.apache.org/licenses/icla.txt
>
>
> On Tue, Feb 28, 2012 at 6:27 AM, Alexios Giotis wrote:
>>
>> Hi,
>>
>> About 6 months ago, I had a deadlock issue that regularly stopped
>> production servers. While I was opening a bugzilla ticket, I found that this
>> was already reported back in 2009. This issue is still open as it was
>> difficult to reproduce. On that issue, I added:
>>
>> [1] An explanation of why a deadlock is possible.
>> [1] Stacktraces of deadlocked threads from a production server.
>> [2] A small unit test that adds a Thread.sleep() to the PropertyCache to
>> make it always reproducible.
>> [3] A patch solving this issue.
>> [4] Explanations of why the patch rewrites the existing PropertyCache
>> class.
>>
>> This was then reviewed and unit tests were added [5]. On top of this, I
>> have committed the fix in my private branch and it works well on several big
>> production systems. This is as far as I can go before a FOP committer takes
>> it further. I am writing this because:
>>
>> - Deadlocks should be fixed. When they occur, there is no way around them.
>> - The trunk is moving, the patch is aging and it will be more difficult to
>> apply it over time.
>> - It is discouraging for submitting more patches.
>>
>>
>> Alexios Giotis
>>
>>
>>
>>
>> [1] https://issues.apache.org/bugzilla/show_bug.cgi?id=46962#c3
>> [2] https://issues.apache.org/bugzilla/attachment.cgi?id=27342
>> [3] https://issues.apache.org/bugzilla/attachment.cgi?id=27477&action=diff
>> [4] https://issues.apache.org/bugzilla/show_bug.cgi?id=46962#c7
>> [5] https://issues.apache.org/bugzilla/show_bug.cgi?id=46962#c9
>>
>
Re: Bugzilla #46962 - Deadlock in PropertyCache
an icla is not required for a patch attached to a bz unless it is of unusual size or not coded by the bz submitter.

On Feb 28, 2012, at 11:53 AM, Glenn Adams wrote:

I support committing this patch, however I don't see an ICLA listed at [1] for Alexios. Alexios, if you have not submitted an ICLA [2], please do so.

I would be happy to apply the patch (if Mehdi doesn't have the time).

[1] http://people.apache.org/committer-index.html#unlistedclas
[2] http://www.apache.org/licenses/icla.txt

On Tue, Feb 28, 2012 at 6:27 AM, Alexios Giotis wrote:

> Hi,
>
> About 6 months ago, I had a deadlock issue that regularly stopped
> production servers. While I was opening a bugzilla ticket, I found that
> this was already reported back in 2009. This issue is still open as it
> was difficult to reproduce. On that issue, I added:
>
> [1] An explanation of why a deadlock is possible.
> [1] Stacktraces of deadlocked threads from a production server.
> [2] A small unit test that adds a Thread.sleep() to the PropertyCache to
> make it always reproducible.
> [3] A patch solving this issue.
> [4] Explanations of why the patch rewrites the existing PropertyCache
> class.
>
> This was then reviewed and unit tests were added [5]. On top of this, I
> have committed the fix in my private branch and it works well on several
> big production systems. This is as far as I can go before a FOP committer
> takes it further. I am writing this because:
>
> - Deadlocks should be fixed. When they occur, there is no way around them.
> - The trunk is moving, the patch is aging and it will be more difficult to
> apply it over time.
> - It is discouraging for submitting more patches.
>
>
> Alexios Giotis
>
>
>
>
> [1] https://issues.apache.org/bugzilla/show_bug.cgi?id=46962#c3
> [2] https://issues.apache.org/bugzilla/attachment.cgi?id=27342
> [3] https://issues.apache.org/bugzilla/attachment.cgi?id=27477&action=diff
> [4] https://issues.apache.org/bugzilla/show_bug.cgi?id=46962#c7
> [5] https://issues.apache.org/bugzilla/show_bug.cgi?id=46962#c9
>
>
Re: Bugzilla #46962 - Deadlock in PropertyCache
I support committing this patch, however I don't see an ICLA listed at [1] for Alexios. Alexios, if you have not submitted an ICLA [2], please do so.

I would be happy to apply the patch (if Mehdi doesn't have the time).

[1] http://people.apache.org/committer-index.html#unlistedclas
[2] http://www.apache.org/licenses/icla.txt

On Tue, Feb 28, 2012 at 6:27 AM, Alexios Giotis wrote:

> Hi,
>
> About 6 months ago, I had a deadlock issue that regularly stopped
> production servers. While I was opening a bugzilla ticket, I found that
> this was already reported back in 2009. This issue is still open as it
> was difficult to reproduce. On that issue, I added:
>
> [1] An explanation of why a deadlock is possible.
> [1] Stacktraces of deadlocked threads from a production server.
> [2] A small unit test that adds a Thread.sleep() to the PropertyCache to
> make it always reproducible.
> [3] A patch solving this issue.
> [4] Explanations of why the patch rewrites the existing PropertyCache
> class.
>
> This was then reviewed and unit tests were added [5]. On top of this, I
> have committed the fix in my private branch and it works well on several
> big production systems. This is as far as I can go before a FOP committer
> takes it further. I am writing this because:
>
> - Deadlocks should be fixed. When they occur, there is no way around them.
> - The trunk is moving, the patch is aging and it will be more difficult to
> apply it over time.
> - It is discouraging for submitting more patches.
>
>
> Alexios Giotis
>
>
>
>
> [1] https://issues.apache.org/bugzilla/show_bug.cgi?id=46962#c3
> [2] https://issues.apache.org/bugzilla/attachment.cgi?id=27342
> [3] https://issues.apache.org/bugzilla/attachment.cgi?id=27477&action=diff
> [4] https://issues.apache.org/bugzilla/show_bug.cgi?id=46962#c7
> [5] https://issues.apache.org/bugzilla/show_bug.cgi?id=46962#c9
>
>
Re: Implementing PDF Object Streams
Hi Craig,

Just out of curiosity, what issues are you having with the pdf-image-plugin? I spent quite a lot of time with it and submitted a patch to Jeremias (not sure if he's committed it). Maybe we could help you there? We've also got some commits lying around that may help you; we're not happy with them per se, because they sacrifice rendered fidelity for file size.

Let us know what you've done and what you're trying to do in a new thread and I'll let you know if we can help.

Mehdi

On 28 February 2012 00:39, Craig Ringer wrote:

> On 27/02/2012 8:08 PM, Vincent Hennebert wrote:
>>
>> We would like to implement PDF Object Streams as defined in the PDF 1.5
>> Reference. In short, the structure tree would be stored inside a stream
>> to allow for compression in the same way as the page content.
>
> What's the status of object stream support in PDFBox? Is it possible the
> feature is better implemented by adopting a PDFBox based backend?
>
> There's been long term planning talk of moving over to PDFBox as the
> underlying PDF support library. It'd massively simplify work with PDF-in-PDF
> embedding, reduce maintenance work, etc. Is it worth doing major enhancement
> work on fop's pdf library if it may go away in future?
>
> I'm struggling with getting fop and pdfbox to play well together at the
> moment as I work on enhancing fop-pdf-image to merge duplicate font subsets.
> The use of two different pdf libraries makes fop-pdf-image much more complex
> and makes working with fonts a lot harder. I'm sure it's not the only area
> where a pdfbox-based backend might be good.
>
> --
> Craig Ringer
Re: update to site/deploy/fop
Nice work troubleshooting the file date problem. I have had many problems with site deployment since long ago, when I first brought the site to more or less its current state of Forrest-y crunchiness. I'm currently spending some time researching how to convert to the newly blessed method for site deployment using the new Apache CMS system, so hopefully it won't be an issue for too much longer.

Clay

"My religion is simple. My religion is kindness."
- HH The Dalai Lama of Tibet

On Feb 27, 2012, at 12:39 AM, Glenn Adams wrote:

> I've been attempting for a few hours now to successfully update the FOP site
> directory. After a number of attempts I believe I've finally performed an
> update (subject to an upcoming rsync). I noticed that the first time I was
> able to perform a deploy.svn successfully, it only updated two files, two
> newly added files, and did not update any of the other existing files.
>
> I finally determined that the following lines in
> forrest/tools/forrestbot/core/deploy.xml
>
>
>
>
> were failing to copy the changed (modified) files since the last modified
> date on the target directory (work/svn-deploy/forrest-docs) were later than
> the just previously built site directory (build/forrest-docs).
>
> This was because the newly checked out files in the target directory had the
> time of checkout as opposed to the last time of commit on the file, and,
> consequently, the local site directory files, which are rendered (built) by
> forrest prior to the checkout, had older last modified times.
>
> By adding overwrite="true" as follows (along with verbose for a little
> debugging help), I finally got all the modified site files copied, and
> subsequently committed by deploy.svn.
>
>
>
>
> Has anyone else encountered this problem? What is the best way to effect a
> shared fix?
>
> G.
>
>
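The timestamp behaviour Glenn describes can be sketched in a few lines of Java. The deploy.xml lines themselves did not survive in the archive, so this assumes they were an Ant copy task, which by default copies a file only when the source is newer than the destination or the destination does not exist, and copies unconditionally when overwrite="true" is set. The class and method names are made up for illustration and are not Ant's or Forrest's API.

```java
// Hypothetical illustration of an "only copy if newer" rule; not Ant's actual code.
final class CopyDecision {

    private CopyDecision() { }

    // Returns true if the source file should be copied over the destination.
    static boolean shouldCopy(long srcLastModified, long destLastModified,
                              boolean destExists, boolean overwrite) {
        if (overwrite || !destExists) {
            return true; // forced copy, or nothing to compare against
        }
        return srcLastModified > destLastModified; // default: newer source wins
    }
}
```

In the situation above, the freshly checked out target files carry the checkout time, so they look newer than the site files built shortly before; the default rule skips them and only the overwrite flag gets them copied.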
Bugzilla #46962 - Deadlock in PropertyCache
Hi,

About 6 months ago, I had a deadlock issue that regularly stopped production servers. While I was opening a bugzilla ticket, I found that this was already reported back in 2009. This issue is still open as it was difficult to reproduce. On that issue, I added:

[1] An explanation of why a deadlock is possible.
[1] Stacktraces of deadlocked threads from a production server.
[2] A small unit test that adds a Thread.sleep() to the PropertyCache to make it always reproducible.
[3] A patch solving this issue.
[4] Explanations of why the patch rewrites the existing PropertyCache class.

This was then reviewed and unit tests were added [5]. On top of this, I have committed the fix in my private branch and it works well on several big production systems. This is as far as I can go before a FOP committer takes it further. I am writing this because:

- Deadlocks should be fixed. When they occur, there is no way around them.
- The trunk is moving, the patch is aging and it will be more difficult to apply it over time.
- It is discouraging for submitting more patches.

Alexios Giotis

[1] https://issues.apache.org/bugzilla/show_bug.cgi?id=46962#c3
[2] https://issues.apache.org/bugzilla/attachment.cgi?id=27342
[3] https://issues.apache.org/bugzilla/attachment.cgi?id=27477&action=diff
[4] https://issues.apache.org/bugzilla/show_bug.cgi?id=46962#c7
[5] https://issues.apache.org/bugzilla/show_bug.cgi?id=46962#c9
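For readers who have not opened the attachments, the following Java sketch shows one generic way a canonicalizing cache can avoid explicit locks altogether, and with them any lock-ordering deadlock. It is not the patch attached to the bug and not the real PropertyCache: the class name CanonicalCache is hypothetical, it assumes the cache's job is to hand back one shared instance per equal value, and it holds strong references only, so entries are never released.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch, not FOP's PropertyCache and not the patch from the bug report.
final class CanonicalCache<T> {

    // Each cached value maps to itself; the map performs all synchronization.
    private final ConcurrentMap<T, T> entries = new ConcurrentHashMap<T, T>();

    // Returns the canonical instance equal to candidate, registering it if new.
    T fetch(T candidate) {
        if (candidate == null) {
            return null;
        }
        // putIfAbsent is atomic, so no caller ever holds two locks at once.
        T existing = entries.putIfAbsent(candidate, candidate);
        return existing != null ? existing : candidate;
    }
}
```

Two threads fetching equal values at the same time both get the same instance back, whichever one registered it first.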
new Temp_CFF branch
I have created a new branch, Temp_CFF [1], in order to add support for Adobe CFF (Compact Font Format) encoded OpenType/TrueType fonts. CFF encoded fonts use a different, more compact representation for glyph outline data [2][3]; specifically, they use the Adobe Type 2 charstring format [4].

[1] http://mail-archives.apache.org/mod_mbox/xmlgraphics-fop-commits/201202.mbox/%3c20120227204524.5219d2388...@eris.apache.org%3e
[2] http://en.wikipedia.org/wiki/PostScript_fonts#Compact_Font_Format
[3] http://www.adobe.com/content/dam/Adobe/en/devnet/font/pdfs/5176.CFF.pdf
[4] http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/font/pdfs/5177.Type2.pdf
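As a small aside on how the two flavours can be told apart, the Java sketch below looks at the first four bytes of a font file: OpenType fonts with CFF outlines start with the tag 'OTTO', while TrueType-outline fonts start with 0x00010000 (or the older Apple tag 'true'). The class SfntFlavor and its method are hypothetical names for illustration, not code from the Temp_CFF branch, and font collections or other containers are simply reported as unknown.

```java
// Hypothetical sketch; not taken from the Temp_CFF branch.
final class SfntFlavor {

    private static final int TAG_OTTO = 0x4F54544F;     // 'OTTO': CFF outlines
    private static final int VERSION_1_0 = 0x00010000;  // TrueType outlines
    private static final int TAG_TRUE = 0x74727565;     // 'true': Apple TrueType

    private SfntFlavor() { }

    // Classifies a font by the big-endian 32-bit version field at offset 0.
    static String detect(byte[] header) {
        if (header == null || header.length < 4) {
            return "unknown";
        }
        int version = ((header[0] & 0xFF) << 24) | ((header[1] & 0xFF) << 16)
                | ((header[2] & 0xFF) << 8) | (header[3] & 0xFF);
        if (version == TAG_OTTO) {
            return "CFF";
        }
        if (version == VERSION_1_0 || version == TAG_TRUE) {
            return "TrueType";
        }
        return "unknown";
    }
}
```

A font loader could use a check like this to decide whether to parse the glyf table or the CFF table for outline data.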