fop-pdf-image and fonts; as requested

2012-02-28 Thread Craig Ringer
Hi

As requested by Mehdi Houshmand I'm elaborating on the issue we've been
running into with fop-pdf-image. I've asked about aspects of it on the
list before, but now have a better understanding of what's going on.

Where input pdfs being used as form XObjects contain embedded subset
fonts, I'm seeing many copies of those fonts being embedded in the
output document. This creates huge output files with lots of duplicate
font data, and in a few cases has even crashed the RIP used by my work's
offset press printer. I think they use a Fiery, but I struggle to get any
more info than that out of them.

The issue is that fop-pdf-image copies PDFs into fop output PDFs by
copying the content stream and resources dictionary verbatim from the
page being extracted from the input PDF, translating it from PDFBox into
fop PDF structures in the process. This is extremely reliable, ensuring
that fop-pdf-image form XObjects don't conflict with / interfere with
the embedding page or vice versa. Unfortunately it also leads to massive
duplication of data, including:

- Fonts, both subsets and fully embedded fonts
- Embedded ICC profiles, if present
- Images re-used across multiple pages or documents

In the case of images, ICC profiles, and fully embedded fonts it'd
potentially be relatively easy to coalesce these so that all resources
dictionaries refer to the same object. It's a little hacky because fop
doesn't give image plugins any "official" way to store data about a
rendering run for later reference, but it's easy enough to do by storing
a WeakHashMap associating object type and checksum data
with a particular rendering run. I haven't implemented coalescing of
images and profiles because it's not part of my problem space, but it
shouldn't be too hard.
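
A minimal sketch of that coalescing idea, under stated assumptions: the class name, the `runKey` parameter and the `writer` callback are all hypothetical, since fop offers no official per-run hook. `runKey` stands in for whatever object happens to live exactly as long as one rendering run. The checksum helper uses SHA-256, which is only one possible choice.

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Supplier;

/**
 * Hypothetical per-run registry for coalescing identical resources
 * (images, ICC profiles, fully embedded fonts). Not part of fop.
 */
final class RunResourceRegistry {
    // Weak keys: once the run object is unreachable, its entry can be collected.
    private static final Map<Object, Map<String, Object>> RUNS = new WeakHashMap<>();

    private RunResourceRegistry() { }

    /** Hex SHA-256 of the raw resource bytes, used as the identity of a resource. */
    static String sha256(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            return String.format("%064x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /**
     * Returns the object already written for (type, checksum) in this run,
     * or invokes writer once and remembers what it produced.
     */
    static synchronized Object coalesce(Object runKey, String type, String checksum,
                                        Supplier<Object> writer) {
        Map<String, Object> written = RUNS.computeIfAbsent(runKey, k -> new HashMap<>());
        return written.computeIfAbsent(type + ":" + checksum, k -> writer.get());
    }
}
```

One caveat with this design: the cached values must not hold a strong reference back to the run key, otherwise the WeakHashMap entry can never be collected.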

Unfortunately, the above approach doesn't work for our problem, which is
duplicated *subset* fonts. There are 20 or 30 copies of Helvetica
Regular alone in one of our typical runs, with a mixture of MacRoman,
Custom and WinAnsi encodings. They're drawn from the same two or three
copies of Helvetica from different sources, but each subset has a
different (though largely overlapping) glyph set. Fop-pdf-image
correctly but rather sub-optimally copies each subset and references it
from the associated Form XObject, creating working output but lots of
wasted space and duplication. We can't just write the font out the first
time we see it and adjust all future references to the copy we've
already written, because unlike with ICC profiles and repeatedly used
images each copy is different.

I see two possible solutions to this problem. Both have the same
pre-requisites:

(1) A mechanism for image plugins to keep plugin-specific data
associated with a specific rendering run. A WeakHashMap
works for this, though it isn't pretty.

(2) Code in the image plugin to record each use of each font and group
usages up into compatible groups so all font references in the group can
point to the same font in the output. This code can also collect up
glyph usage information, producing a map of which glyphs are required by
one or more content streams.

(3) A way to create a new embedded font in the output, either by
combining input subsets into a single new subset font object or by
loading a whole font off the HDD and making a new subset with just the
required glyphs from it.

(4) Some way to be notified, at minimum, just before the xref table is
going to be written out, so the new font can be written to the output
stream. The new font can't be written until we know the last embedded
PDF has been written out, because a later PDF might use additional
glyphs that must be added to the subset.

(5) [Optional but useful] Smarter font loading where more than just
(family, weight, slant) 3-tuples are used to match fonts, so I can use
fop's font loading and cache code to see whether there's a whole font
available to fop that can be substituted for an embedded subset. For
example, I might need to match Myriad Pro Ultrabold Italic SemiCond, a
small caps variant face, or similar with no confusion between different
condensed/expanded versions of the same face, different specialist
variants, etc. Right now fop's font matching code simply cannot do that,
so I can't really create new font subsets as an alternative for (3) and
have to try to combine subsets from the input instead.
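
As a rough illustration of prerequisite (2), the sketch below records font usages and accumulates a map of the character codes each group needs. Everything in it is hypothetical: the class and method names are made up, and grouping by base font name plus encoding name is a deliberate simplification. Real compatibility checks would have to be stricter, for example telling apart different source copies of Helvetica and comparing custom encodings entry by entry. The only PDF fact relied on is that subset font names carry a prefix of six uppercase letters followed by "+".

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical collector: groups font usages and tracks the codes used per group.
 * Grouping key (base name + encoding) is an assumption, not fop-pdf-image's rule.
 */
final class FontUsageCollector {
    private final Map<String, BitSet> codesByGroup = new LinkedHashMap<>();

    /** Strips a subset tag such as "ABCDEF+" from a PDF font name. */
    static String baseName(String pdfFontName) {
        return pdfFontName.replaceFirst("^[A-Z]{6}\\+", "");
    }

    private static String groupKey(String pdfFontName, String encoding) {
        return baseName(pdfFontName) + "|" + encoding;
    }

    /** Records that a content stream used the given codes with this font. */
    void record(String pdfFontName, String encoding, int... codes) {
        BitSet used = codesByGroup.computeIfAbsent(
                groupKey(pdfFontName, encoding), k -> new BitSet(256));
        for (int code : codes) {
            used.set(code);
        }
    }

    /** Returns a copy of the codes required by one or more content streams. */
    BitSet codesFor(String pdfFontName, String encoding) {
        BitSet used = codesByGroup.get(groupKey(pdfFontName, encoding));
        return used == null ? new BitSet() : (BitSet) used.clone();
    }

    int groupCount() {
        return codesByGroup.size();
    }
}
```

With this grouping, two differently tagged Helvetica subsets that both use WinAnsi encoding fall into one group whose code set is the union of both, while a MacRoman subset of the same face stays separate.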


I have (1) working and I have a prototype of (2) that dumps font usage
data for a run including a glyph usage map. I was trying to avoid (3)
for Base14 fonts by just replacing the Resources reference to the font
with a base14 font ref, but PDF readers seem to choke on this for
reasons I haven't yet determined.

(4) is the big problem. I can't do a proper implementation of (3)
without some way to write the produced font out at the end.

For (4) I'd really appreciate advice from the fop community. I need a
way for a plugin to hook into output just before the xref table is
written, so it can write new objects to the pdf stream. The ob

Re: Bugzilla #46962 - Deadlock in PropertyCache

2012-02-28 Thread Alexios Giotis
Thank you all for your replies. I just printed and will send the ICLA anyway so 
that it will not be an impediment for applying this or future patches.

@Vincent I will be happy to clarify anything related to the patch. But
it would be more transparent if there were a comment on the issue or an email to
one of the FOP mailing lists so that I can get feedback.

Alexios Giotis


On Feb 28, 2012, at 7:19 PM, Glenn Adams wrote:

> benson, thanks for that clarification, i see in [1] that though an ICLA is 
> not required of a contributor, it is nevertheless desirable to have one 
> submitted; so, Alexios, if you wish to submit an ICLA please do so; however, 
> given the limited scope of the patch, I would agree that it is not strictly 
> required, and the lack of one should not impede applying the patch
> 
> glenn
> 
> [1] http://www.apache.org/licenses/#clas
> 
> On Tue, Feb 28, 2012 at 10:05 AM, Benson Margulies  
> wrote:
> an icla is not required for a patch attached to a bz unless it is of unusual 
> size or not coded by the bz submitter.
> 
> 
> On Feb 28, 2012, at 11:53 AM, Glenn Adams  wrote:
> 
>> I support committing this patch, however I don't see an ICLA listed at [1] 
>> for Alexios. Alexios, if you have not submitted an ICLA [2], please do so.
>> 
>> I would be happy to apply the patch (if Mehdi doesn't have the time).
>> 
>> [1] http://people.apache.org/committer-index.html#unlistedclas
>> [2] http://www.apache.org/licenses/icla.txt
> 



Re: Bugzilla #46962 - Deadlock in PropertyCache

2012-02-28 Thread Glenn Adams
benson, thanks for that clarification, i see in [1] that though an ICLA is
not required of a contributor, it is nevertheless desirable to have one
submitted; so, Alexios, if you wish to submit an ICLA please do so;
however, given the limited scope of the patch, I would agree that it is not
strictly required, and the lack of one should not impede applying the patch

glenn

[1] http://www.apache.org/licenses/#clas

On Tue, Feb 28, 2012 at 10:05 AM, Benson Margulies wrote:

> an icla is not required for a patch attached to a bz unless it is of
> unusual size or not coded by the bz submitter.
>
>
> On Feb 28, 2012, at 11:53 AM, Glenn Adams  wrote:
>
> I support committing this patch, however I don't see an ICLA listed at [1]
> for Alexios. Alexios, if you have not submitted an ICLA [2], please do so.
>
> I would be happy to apply the patch (if Mehdi doesn't have the time).
>
> [1] http://people.apache.org/committer-index.html#unlistedclas
> [2] http://www.apache.org/licenses/icla.txt
>
>


Re: Bugzilla #46962 - Deadlock in PropertyCache

2012-02-28 Thread mehdi houshmand
Hi Guys,

My apologies for the lack of transparency on this issue, but I didn't
actually review the changes you made here, in fact, I barely looked at
what PropertyCache actually does. I had some free time, and added a
bunch of unit tests.

The reason this hasn't been committed yet was because Vincent said he
had some questions about the patch. That's as far as I know, maybe he
could give some feedback on the issue.

Let me reiterate my apologies on this; it's not fair that this
has been ignored. I'll endeavour to make the process more transparent
in future, and I hope this doesn't prevent you or any other contributors
from submitting patches.

Mehdi


On 28 February 2012 16:52, Glenn Adams  wrote:
> I support committing this patch, however I don't see an ICLA listed at [1]
> for Alexios. Alexios, if you have not submitted an ICLA [2], please do so.
>
> I would be happy to apply the patch (if Mehdi doesn't have the time).
>
> [1] http://people.apache.org/committer-index.html#unlistedclas
> [2] http://www.apache.org/licenses/icla.txt
>
>
> On Tue, Feb 28, 2012 at 6:27 AM, Alexios Giotis 
> wrote:
>>
>> Hi,
>>
>> About 6 months ago, I had a deadlock issue that regularly stopped
>> production servers. While I was opening a bugzilla ticket, I found that this
>> was already reported back in 2009. This issue is still open as it was
>> difficult to reproduce. On that issue, I added:
>>
>> [1] An explanation of why a deadlock is possible.
>> [1] Stacktraces of deadlocked threads from a production server.
>> [2] A small unit test that adds a Thread.sleep() to the PropertyCache to
>> make it always reproducible.
>> [3] A patch solving this issue.
>> [4] Explanations of why the patch rewrites the existing PropertyCache
>> class.
>>
>> This was then reviewed and unit tests were added [5]. On top of this, I
>> have committed the fix in my private branch and it works well on several big
>> production systems. This is as far as I can go before a FOP committer takes
>> it further. I am writing this because:
>>
>> - Deadlocks should be fixed. When they occur, there is no way around them.
>> - The trunk is moving, the patch is aging and it will be more difficult to
>> apply it over time.
>> - It is discouraging for submitting more patches.
>>
>>
>> Alexios Giotis
>>
>>
>>
>>
>> [1] https://issues.apache.org/bugzilla/show_bug.cgi?id=46962#c3
>> [2] https://issues.apache.org/bugzilla/attachment.cgi?id=27342
>> [3] https://issues.apache.org/bugzilla/attachment.cgi?id=27477&action=diff
>> [4] https://issues.apache.org/bugzilla/show_bug.cgi?id=46962#c7
>> [5] https://issues.apache.org/bugzilla/show_bug.cgi?id=46962#c9
>>
>


Re: Bugzilla #46962 - Deadlock in PropertyCache

2012-02-28 Thread Benson Margulies
an icla is not required for a patch attached to a bz unless it is of
unusual size or not coded by the bz submitter.

On Feb 28, 2012, at 11:53 AM, Glenn Adams  wrote:

I support committing this patch, however I don't see an ICLA listed at [1]
for Alexios. Alexios, if you have not submitted an ICLA [2], please do so.

I would be happy to apply the patch (if Mehdi doesn't have the time).

[1] http://people.apache.org/committer-index.html#unlistedclas
[2] http://www.apache.org/licenses/icla.txt

On Tue, Feb 28, 2012 at 6:27 AM, Alexios Giotis wrote:

> Hi,
>
> About 6 months ago, I had a deadlock issue that regularly stopped
> production servers. While I was opening a bugzilla ticket, I found that
> this was already reported back in 2009. This issue is still open as it
> was difficult to reproduce. On that issue, I added:
>
> [1] An explanation of why a deadlock is possible.
> [1] Stacktraces of deadlocked threads from a production server.
> [2] A small unit test that adds a Thread.sleep() to the PropertyCache to
> make it always reproducible.
> [3] A patch solving this issue.
> [4] Explanations of why the patch rewrites the existing PropertyCache
> class.
>
> This was then reviewed and unit tests were added [5]. On top of this, I
> have committed the fix in my private branch and it works well on several
> big production systems. This is as far as I can go before a FOP committer
> takes it further. I am writing this because:
>
> - Deadlocks should be fixed. When they occur, there is no way around them.
> - The trunk is moving, the patch is aging and it will be more difficult to
> apply it over time.
> - It is discouraging for submitting more patches.
>
>
> Alexios Giotis
>
>
>
>
> [1] https://issues.apache.org/bugzilla/show_bug.cgi?id=46962#c3
> [2] https://issues.apache.org/bugzilla/attachment.cgi?id=27342
> [3] https://issues.apache.org/bugzilla/attachment.cgi?id=27477&action=diff
> [4] https://issues.apache.org/bugzilla/show_bug.cgi?id=46962#c7
> [5] https://issues.apache.org/bugzilla/show_bug.cgi?id=46962#c9
>
>


Re: Bugzilla #46962 - Deadlock in PropertyCache

2012-02-28 Thread Glenn Adams
I support committing this patch, however I don't see an ICLA listed at [1]
for Alexios. Alexios, if you have not submitted an ICLA [2], please do so.

I would be happy to apply the patch (if Mehdi doesn't have the time).

[1] http://people.apache.org/committer-index.html#unlistedclas
[2] http://www.apache.org/licenses/icla.txt

On Tue, Feb 28, 2012 at 6:27 AM, Alexios Giotis wrote:

> Hi,
>
> About 6 months ago, I had a deadlock issue that regularly stopped
> production servers. While I was opening a bugzilla ticket, I found that
> this was already reported back in 2009. This issue is still open as it
> was difficult to reproduce. On that issue, I added:
>
> [1] An explanation of why a deadlock is possible.
> [1] Stacktraces of deadlocked threads from a production server.
> [2] A small unit test that adds a Thread.sleep() to the PropertyCache to
> make it always reproducible.
> [3] A patch solving this issue.
> [4] Explanations of why the patch rewrites the existing PropertyCache
> class.
>
> This was then reviewed and unit tests were added [5]. On top of this, I
> have committed the fix in my private branch and it works well on several
> big production systems. This is as far as I can go before a FOP committer
> takes it further. I am writing this because:
>
> - Deadlocks should be fixed. When they occur, there is no way around them.
> - The trunk is moving, the patch is aging and it will be more difficult to
> apply it over time.
> - It is discouraging for submitting more patches.
>
>
> Alexios Giotis
>
>
>
>
> [1] https://issues.apache.org/bugzilla/show_bug.cgi?id=46962#c3
> [2] https://issues.apache.org/bugzilla/attachment.cgi?id=27342
> [3] https://issues.apache.org/bugzilla/attachment.cgi?id=27477&action=diff
> [4] https://issues.apache.org/bugzilla/show_bug.cgi?id=46962#c7
> [5] https://issues.apache.org/bugzilla/show_bug.cgi?id=46962#c9
>
>


Re: Implementing PDF Object Streams

2012-02-28 Thread mehdi houshmand
Hi Craig,

Just out of curiosity, what issues are you having with the
pdf-image-plugin? I spent quite a lot of time with it and submitted a
patch to Jeremias (not sure if he's committed it). Maybe we could help
you there? We've also got some commits lying around that may help you,
though we're not happy with them per se because they sacrifice rendered
fidelity for file size.

Let us know what you've done and what you're trying to do in a new
thread and I'll let you know if we can help.

Mehdi


On 28 February 2012 00:39, Craig Ringer  wrote:
> On 27/02/2012 8:08 PM, Vincent Hennebert wrote:
>>
>> We would like to implement PDF Object Streams as defined in the PDF 1.5
>> Reference. In short, the structure tree would be stored inside a stream
>> to allow for compression in the same way as the page content.
>
> What's the status of object stream support in PDFBox? Is it possible the
> feature is better implemented by adopting a PDFBox based backend?
>
> There's been long term planning talk of moving over to PDFBox as the
> underlying PDF support library. It'd massively simplify work with PDF-in-PDF
> embedding, reduce maintenance work, etc. Is it worth doing major enhancement
> work on fop's pdf library if it may go away in future?
>
> I'm struggling with getting fop and pdfbox to play well together at the
> moment as I work on enhancing fop-pdf-image to merge duplicate font subsets.
> The use of two different pdf libraries makes fop-pdf-image much more complex
> and makes working with fonts a lot harder. I'm sure it's not the only area
> where a pdfbox-based backend might be good.
>
> --
> Craig Ringer


Re: update to site/deploy/fop

2012-02-28 Thread Clay Leeds
Nice work troubleshooting the file date problem. I have had many problems with
site deployment since long ago when I first brought the site to more or less
its current state of Forrest-y crunchiness.

I'm currently spending some time researching the method for converting to the 
newly blessed method for site deployment using the new Apache CMS system, so 
hopefully it won't be an issue for too much longer. 

Clay

"My religion is simple. My religion is kindness."
- HH The Dalai Lama of Tibet

On Feb 27, 2012, at 12:39 AM, Glenn Adams  wrote:

> I've been attempting for a few hours now to successfully update the FOP site 
> directory. After a number of attempts I believe I've finally performed an 
> update (subject to an upcoming rsync). I noticed that the first time I was 
> able to perform a deploy.svn successfully, it only updated two files, two 
> newly added files, and did not update any of the other existing files.
> 
> I finally determined that the following lines in 
> forrest/tools/forrestbot/core/deploy.xml
> 
> 
>   
> 
> 
> were failing to copy the changed (modified) files since the last modified 
> dates of the files in the target directory (work/svn-deploy/forrest-docs) were 
> later than those in the just previously built site directory (build/forrest-docs).
> 
> This was because the newly checked out files in the target directory had the 
> time of checkout as opposed to the time of the last commit on the file, and, 
> consequently, the local site directory files, which are rendered (built) by 
> forrest prior to the checkout, had older last modified times.
> 
> By adding overwrite="true" as follows (along with verbose for a little 
> debugging help), I finally got all the modified site files copied, and 
> subsequently committed by deploy.svn.
> 
> 
>   
> 
> 
> Has anyone else encountered this problem? What is the best way to effect a 
> shared fix?
> 
> G.
> 
> 
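
The timestamp behaviour described in the quoted message can be illustrated with a small sketch. This is not Ant's or Forrest's code: the class and method names are hypothetical, and the rule "skip the copy when the target is at least as new as the source, unless overwrite is set" is written out by hand only to show why freshly checked out files block the copy.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/**
 * Hypothetical illustration of a "copy only if the source is newer" rule
 * with an overwrite flag. Mimics the described behaviour; not Ant's code.
 */
final class TimestampCopy {
    private TimestampCopy() { }

    /** Returns true if the file was copied, false if it was skipped. */
    static boolean copy(Path source, Path target, boolean overwrite) {
        try {
            if (!overwrite && Files.exists(target)
                    && Files.getLastModifiedTime(target)
                            .compareTo(Files.getLastModifiedTime(source)) >= 0) {
                // Target looks up to date by timestamp alone, even if its content differs.
                return false;
            }
            Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a built file that is older than its checked out counterpart, `copy(built, checkedOut, false)` skips the file, while passing `true` forces the copy regardless of timestamps.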


Bugzilla #46962 - Deadlock in PropertyCache

2012-02-28 Thread Alexios Giotis
Hi,

About 6 months ago, I had a deadlock issue that regularly stopped production 
servers. While I was opening a bugzilla ticket, I found that this was already 
reported back in 2009. This issue is still open as it was difficult to 
reproduce. On that issue, I added:

[1] An explanation of why a deadlock is possible.
[1] Stacktraces of deadlocked threads from a production server.
[2] A small unit test that adds a Thread.sleep() to the PropertyCache to make 
it always reproducible.
[3] A patch solving this issue.
[4] Explanations of why the patch rewrites the existing PropertyCache class.

This was then reviewed and unit tests were added [5]. On top of this, I have 
committed the fix in my private branch and it works well on several big 
production systems. This is as far as I can go before a FOP committer takes it 
further. I am writing this because:

- Deadlocks should be fixed. When they occur, there is no way around them.
- The trunk is moving, the patch is aging and it will be more difficult to 
apply it over time.
- It is discouraging for submitting more patches.


Alexios Giotis




[1] https://issues.apache.org/bugzilla/show_bug.cgi?id=46962#c3
[2] https://issues.apache.org/bugzilla/attachment.cgi?id=27342
[3] https://issues.apache.org/bugzilla/attachment.cgi?id=27477&action=diff
[4] https://issues.apache.org/bugzilla/show_bug.cgi?id=46962#c7
[5] https://issues.apache.org/bugzilla/show_bug.cgi?id=46962#c9



new Temp_CFF branch

2012-02-28 Thread Glenn Adams
I have created a new branch Temp_CFF [1], in order to add support for Adobe
CFF (Compact Font Format) encoded OpenType/TrueType fonts. CFF encoded
fonts use a different, more compact representation for glyph outline
data [2][3]; specifically, they use the Adobe Type 2 charstring format [4].

[1]
http://mail-archives.apache.org/mod_mbox/xmlgraphics-fop-commits/201202.mbox/%3c20120227204524.5219d2388...@eris.apache.org%3e
[2] http://en.wikipedia.org/wiki/PostScript_fonts#Compact_Font_Format
[3] http://www.adobe.com/content/dam/Adobe/en/devnet/font/pdfs/5176.CFF.pdf
[4]
http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/font/pdfs/5177.Type2.pdf