> On 19 Jun 2016, at 08:04, Andreas Lehmkuehler wrote:
>
> On 19.06.2016 at 16:11, Tilman Hausherr wrote:
>> On 19.06.2016 at 08:52, John Hewson wrote:
>>>>> >>JIRA, and attach your code as a patch / diff.
>>>> There is already some code handling those operators, see
>>>> PDFMarkedContentExtractor. It could be moved to a more generic place so that
>>>> we only have to add some filtering.
>>> Yes, that is the proper way to handle this. Operators are handled with an
>>> OperatorProcessor, not by modifying the parser (e.g.
>>> processStreamOperators).
>>> Better yet, we already have the code to handle BMC/EMC. All that is needed is
>>> for PDFRenderer to add a constructor which accepts a list of layer names to
>>> render, which are then passed as part of PageDrawerParameters.
>>
>> The problem is that these two operators influence whether or not all the other
>> tokens in the content stream are used. So the method by C. makes sense to me.
>> The alternative would be to alter every operator processor to check whether
>> it is relevant or not.
>> Or they would have to be extended from some common class that does this check.

The alternative is actually really simple. The parser should not be responsible
for high-level processing such as this. It’s the job of an OperatorProcessor to
handle how operators are processed, and of PDFStreamEngine to handle the actual
work - that’s the core of our subclassing & extensibility model for PDFBox.

So take the view that BMC and EMC don’t affect the tokens, they affect
rendering. We should still process the tokens as normal and have BMC and EMC
set a flag on PageDrawer (or one of its superclasses) which indicates which
layer is currently being processed. The PageDrawer can then decide what to do
with this information - namely check in strokePath, fillPath,
fillAndStrokePath, and drawImage whether or not to suppress rendering. No need
to extend any OperatorProcessors.

I’ve explained how this would be done for PageDrawer, but it might be better to
do all of this in PDFStreamEngine rather than PageDrawer, as then other
subclasses can benefit from this functionality.
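
To make the idea concrete, here is a minimal sketch in plain Java. It is not PDFBox code: the class LayerAwareDrawerSketch, its String-based method signatures, and the hiddenTags set are all hypothetical stand-ins, and the real PageDrawer methods take different arguments. It only shows the shape of the approach: BMC/EMC maintain a stack, and the paint methods consult it before drawing.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for PageDrawer (or PDFStreamEngine); not PDFBox API.
final class LayerAwareDrawerSketch
{
    // Tags of the marked-content sequences that are currently open.
    private final Deque<String> openTags = new ArrayDeque<>();
    // Tags whose content should not be painted (assumed to be known up front).
    private final Set<String> hiddenTags;
    // Records what would have been painted, so the effect is observable.
    private final List<String> painted = new ArrayList<>();

    LayerAwareDrawerSketch(Set<String> hiddenTags)
    {
        this.hiddenTags = hiddenTags;
    }

    // Would be called by the operator processor that handles BMC.
    void beginMarkedContent(String tag)
    {
        openTags.push(tag);
    }

    // Would be called by the operator processor that handles EMC.
    void endMarkedContent()
    {
        if (!openTags.isEmpty())
        {
            openTags.pop();
        }
    }

    // Content is suppressed while any enclosing sequence is hidden.
    private boolean isSuppressed()
    {
        for (String tag : openTags)
        {
            if (hiddenTags.contains(tag))
            {
                return true;
            }
        }
        return false;
    }

    // Simplified paint methods: every one of them goes through the same check.
    void strokePath(String what) { paint("stroke " + what); }
    void fillPath(String what) { paint("fill " + what); }
    void fillAndStrokePath(String what) { paint("fillAndStroke " + what); }
    void drawImage(String what) { paint("image " + what); }

    private void paint(String entry)
    {
        if (!isSuppressed())
        {
            painted.add(entry);
        }
    }

    List<String> painted()
    {
        return painted;
    }

    public static void main(String[] args)
    {
        // Assumption for the demo: "oc3" is the tag of a disabled layer.
        LayerAwareDrawerSketch drawer = new LayerAwareDrawerSketch(Set.of("oc3"));
        drawer.beginMarkedContent("oc2");
        drawer.fillPath("enabled text");
        drawer.endMarkedContent();
        drawer.beginMarkedContent("oc3");
        drawer.fillPath("disabled text");
        drawer.endMarkedContent();
        System.out.println(drawer.painted()); // prints [fill enabled text]
    }
}
```

All tokens are still processed in this sketch; only the painting is skipped, so no individual operator processor has to know about layers.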
>> PDFMarkedContentExtractor is not really helpful. Here's some code to show what
>> it does - it shows the objects that belong to a specific group. The output
>> cannot be used for rendering.
> Maybe there is a misunderstanding. We need to track the current layer and the
> stack of all current layers. C. provided some code doing that and we already
> have some code doing it (I'm talking about the operators in
> org.apache.pdfbox.contentstream.operator.markedcontent). What is missing is
> some sort of filter based on that information.
Exactly, PDFMarkedContentExtractor already contains implementations of the
necessary OperatorProcessors. We just need to move them into separate files,
and as you say, add some sort of filter in PDFStreamEngine / PageDrawer.
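
A rough sketch of what "separate processors plus a filter" could look like, again in plain Java rather than against the real PDFBox classes. Everything here is hypothetical: the Processor interface, the Engine class, the String operands, and the use of Tj as the one representative drawing operator. It assumes layers are marked as "/OC /oc1 BDC ... EMC" and that the property names (oc1, oc2, oc3) have already been resolved to group names as in Tilman's output.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these types are PDFBox API.
final class MarkedContentFilterSketch
{
    // Marker pushed for marked content that is not an optional content group.
    static final String NO_LAYER = "";

    // One small class per operator, registered with the engine by name.
    interface Processor
    {
        void process(Engine engine, List<String> operands);
    }

    static final class BeginMarkedContent implements Processor
    {
        public void process(Engine engine, List<String> operands)
        {
            // Operands are assumed to be [tag, propertyName], e.g. ["OC", "oc1"].
            boolean isLayer = operands.size() == 2 && "OC".equals(operands.get(0));
            String layer = isLayer
                    ? engine.propertyToLayer.getOrDefault(operands.get(1), NO_LAYER)
                    : NO_LAYER;
            engine.layerStack.push(layer);
        }
    }

    static final class EndMarkedContent implements Processor
    {
        public void process(Engine engine, List<String> operands)
        {
            if (!engine.layerStack.isEmpty())
            {
                engine.layerStack.pop();
            }
        }
    }

    static final class ShowText implements Processor
    {
        public void process(Engine engine, List<String> operands)
        {
            // The filter lives in the engine; the processor only asks it.
            if (engine.isVisible())
            {
                engine.shown.add(operands.get(0));
            }
        }
    }

    static final class Engine
    {
        private final Map<String, Processor> processors = new HashMap<>();
        private final Map<String, String> propertyToLayer;
        private final Set<String> layersToRender;
        private final Deque<String> layerStack = new ArrayDeque<>();
        private final List<String> shown = new ArrayList<>();

        Engine(Map<String, String> propertyToLayer, Set<String> layersToRender)
        {
            this.propertyToLayer = propertyToLayer;
            this.layersToRender = layersToRender;
            processors.put("BDC", new BeginMarkedContent());
            processors.put("EMC", new EndMarkedContent());
            processors.put("Tj", new ShowText());
        }

        // Every operator is still dispatched; unknown operators are ignored.
        void processOperator(String operator, List<String> operands)
        {
            Processor processor = processors.get(operator);
            if (processor != null)
            {
                processor.process(this, operands);
            }
        }

        // Visible only if every enclosing layer was requested for rendering.
        boolean isVisible()
        {
            for (String layer : layerStack)
            {
                if (!layer.equals(NO_LAYER) && !layersToRender.contains(layer))
                {
                    return false;
                }
            }
            return true;
        }

        List<String> shown()
        {
            return shown;
        }
    }

    public static void main(String[] args)
    {
        Engine engine = new Engine(
                Map.of("oc1", "background", "oc2", "enabled", "oc3", "disabled"),
                Set.of("background", "enabled"));
        engine.processOperator("BDC", List.of("OC", "oc2"));
        engine.processOperator("Tj", List.of("from an enabled layer"));
        engine.processOperator("EMC", List.of());
        engine.processOperator("BDC", List.of("OC", "oc3"));
        engine.processOperator("Tj", List.of("from a disabled layer"));
        engine.processOperator("EMC", List.of());
        System.out.println(engine.shown()); // prints [from an enabled layer]
    }
}
```

Keeping the stack and the visibility check in the engine means any subclass gets the filter without changes to the individual processors.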
> BR
> Andreas
>>
>>
>> import java.io.File;
>> import java.io.IOException;
>> import java.util.Arrays;
>> import java.util.List;
>> import org.apache.pdfbox.cos.COSName;
>> import org.apache.pdfbox.pdmodel.PDDocument;
>> import org.apache.pdfbox.pdmodel.PDPage;
>> import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDMarkedContent;
>> import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDPropertyList;
>> import org.apache.pdfbox.pdmodel.graphics.optionalcontent.PDOptionalContentGroup;
>> import org.apache.pdfbox.pdmodel.graphics.optionalcontent.PDOptionalContentProperties;
>> import org.apache.pdfbox.text.PDFMarkedContentExtractor;
>>
>> public class ExtractMarkedContent extends PDFMarkedContentExtractor
>> {
>>     public ExtractMarkedContent() throws IOException
>>     {
>>     }
>>
>>     public static void main(String[] args) throws IOException
>>     {
>>         PDDocument doc = PDDocument.load(new File("C......\\PDFBox reactor\\pdfbox\\target\\test-output", "ocg-generation.pdf"));
>>
>>         // List the optional content groups declared in the document catalog.
>>         PDOptionalContentProperties ocp = doc.getDocumentCatalog().getOCProperties();
>>         System.out.println("Group names in document catalog: " + Arrays.toString(ocp.getGroupNames()));
>>         for (String groupName : ocp.getGroupNames())
>>         {
>>             PDOptionalContentGroup group = ocp.getGroup(groupName);
>>             System.out.println(group.getCOSObject());
>>         }
>>
>>         // Collect the marked content of the first page.
>>         ExtractMarkedContent extractMarkedContent = new ExtractMarkedContent();
>>         PDPage page = doc.getPage(0);
>>         System.out.println("Property names in page resources: " + page.getResources().getPropertiesNames());
>>         extractMarkedContent.processPage(page);
>>         List<PDMarkedContent> markedContents = extractMarkedContent.getMarkedContents();
>>
>>         // Print each tag, the group name it resolves to, and its contents.
>>         System.out.println("Extracted contents: ");
>>         for (PDMarkedContent mc : markedContents)
>>         {
>>             PDPropertyList propertyList = page.getResources().getProperties(COSName.getPDFName(mc.getTag()));
>>             String propName = propertyList.getCOSObject().getString(COSName.NAME);
>>             System.out.println(mc.getTag() + " (" + propName + "): " + mc.getContents());
>>         }
>>         doc.close();
>>     }
>> }
>>
>>
>> The output is:
>>
>> Group names in document catalog: [background, enabled, disabled]
>> COSDictionary{(COSName{Type}:COSName{OCG}) (COSName{Name}:COSString{background}) }
>> COSDictionary{(COSName{Type}:COSName{OCG}) (COSName{Name}:COSString{enabled}) }
>> COSDictionary{(COSName{Type}:COSName{OCG}) (COSName{Name}:COSString{disabled}) }
>> Property names in page resources: [COSName{oc1}, COSName{oc2}, COSName{oc3}]
>> Extracted contents:
>> oc1 (background): [P, D, F, , 1, ., 5, :, , O, p, t, i, o, n, a, l, , C, o,
>> n, t, e, n, t, , G, r, o, u, p, s, Y, o, u, , s, h, o, u, l, d, , s, e, e, ,
>> a, , g, r, e, e, n, , t, e, x, t, l, i, n, e, ,, , b, u, t, , n, o, , r, e,
>> d, , t, e, x, t, , l, i, n, e, .]
>> oc2 (enabled): [T, h, i, s, , i, s, , f, r, o, m, , a, n, , e, n, a, b, l,
>> e, d, , l, a, y, e, r, ., , I, f, , y, o, u, , s, e, e, , t, h, i, s, ,, ,
>> t, h, a, t, ', s, , g, o, o, d, .]
>> oc3 (disabled): [T, h, i, s, , i, s, , f, r, o, m, , a, , d, i, s, a, b, l,
>> e, d, , l, a, y, e, r, ., , I, f, , y, o, u, , s, e, e, , t, h, i, s, ,, ,
>> t, h, a, t, ', s, , N, O, T, , g, o, o, d, !]
>>
>>
>>
>>
>
>