Right, but we need to address the confusion and complexity caused by
.properties files, which made PDFBOX-2246 so tricky to figure out.

Let's remove this wart!

-- John
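
A minimal sketch of the kind of overridable mapping method discussed in this thread. All class and method names below (OperatorHandler, SketchStreamEngine, createOperators and so on) are hypothetical stand-ins invented for illustration; they are not PDFBox classes and this is not how PDFBox is implemented. The idea is only that a subclass adds an operator such as "rg" by overriding a method, with no external .properties file involved:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for an operator processor; not a PDFBox type. */
interface OperatorHandler {
    void process(SketchGraphicsState state, float[] operands);
}

/** Hypothetical, heavily reduced graphics state. */
class SketchGraphicsState {
    float[] nonStrokingRgb = {0f, 0f, 0f}; // default: black
}

/** Hypothetical engine: the operator mapping lives in an overridable method. */
class SketchStreamEngine {
    final SketchGraphicsState state = new SketchGraphicsState();
    private Map<String, OperatorHandler> operators; // built lazily on first use

    /** Subclasses override this to add or replace operators. */
    protected Map<String, OperatorHandler> createOperators() {
        return new LinkedHashMap<>();
    }

    /** Dispatches one operator; returns false if it is unregistered and skipped. */
    boolean processOperator(String name, float... operands) {
        if (operators == null) {
            operators = createOperators();
        }
        OperatorHandler handler = operators.get(name);
        if (handler == null) {
            return false; // unknown operators are silently ignored
        }
        handler.process(state, operands);
        return true;
    }
}

/** Registers only a text operator, so "rg" is never processed. */
class SketchTextStripper extends SketchStreamEngine {
    @Override
    protected Map<String, OperatorHandler> createOperators() {
        Map<String, OperatorHandler> ops = super.createOperators();
        ops.put("Tj", (s, o) -> { /* show text; omitted in this sketch */ });
        return ops;
    }
}

/** Adds "rg" purely by subclassing. */
class ColorAwareStripper extends SketchTextStripper {
    @Override
    protected Map<String, OperatorHandler> createOperators() {
        Map<String, OperatorHandler> ops = super.createOperators();
        ops.put("rg", (s, o) -> s.nonStrokingRgb = o.clone());
        return ops;
    }
}

class OperatorMappingDemo {
    public static void main(String[] args) {
        SketchTextStripper plain = new SketchTextStripper();
        ColorAwareStripper colored = new ColorAwareStripper();
        // "1 0 0 rg" sets the non-stroking color to red.
        System.out.println(plain.processOperator("rg", 1f, 0f, 0f));
        System.out.println(colored.processOperator("rg", 1f, 0f, 0f));
    }
}
```

The mapping is built lazily rather than in the constructor so that an overriding method never runs before the subclass has been fully constructed.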

On 29 Jul 2014, at 10:44, Tilman Hausherr <thaush...@t-online.de> wrote:

> Hi,
> 
> At this time, the problem I see and wanted to solve (PDFBOX-2246) exists 
> regardless of whether we use a properties file or initialize directly in the 
> code.
> 
> Tilman
> 
> 
> On 29.07.2014 at 19:41, John Hewson wrote:
>> On 29 Jul 2014, at 03:44, Andreas Lehmkühler <andr...@lehmi.de> wrote:
>> 
>>> Hi,
>>> 
>>> it's not a black and white issue (comments inline)
>>> 
>>>> On 29 July 2014 at 07:44, John Hewson <j...@jahewson.com> wrote:
>>>> 
>>>> 
>>>> Yes, really I should have said subclasses of PDFStreamEngine -  that's 
>>>> where
>>>> the .properties file originates. I'd propose replacing the properties
>>>> mechanism with a simple method containing the mapping which can be 
>>>> overridden
>>>> in subclasses. Ultimately, users expect to be able to subclass the 
>>>> behaviour
>>>> of a class by just subclassing the class.
>>> PDFStreamEngine doesn't configure any operator set itself. The subclasses 
>>> are
>>> supposed to configure their own set of operators depending on the particular
>>> usecase. E.g. to extend the text extraction one has to subclass 
>>> PDFTextStripper
>>> and so on.
>> It’s PDFStreamEngine which implements the .properties mechanism though, via the
>> PDFStreamEngine(Properties properties) constructor.
>> 
>>> E.g. to extend the text extraction one has to subclass PDFTextStripper and 
>>> so on.
>> That’s true, but it’s only half the story. Don’t forget that if you want to
>> truly override the class’s behaviour, the .properties file needs to be copied
>> and pasted elsewhere and modified, and you also have to override which
>> .properties file is passed in the constructor.
>> 
>>>> We've seen a number of incidents of confusion on the mailing list due to 
>>>> the
>>>> current design.
>>> IMHO, most of the confusion is based on the lack of knowledge of the pdf 
>>> spec.
>>> One can't understand how pdfbox works under the hood by simply looking at 
>>> the
>>> code. One has to understand the pdf spec as well, at least the base 
>>> concepts.
>> I’m specifically talking about confusion surrounding how to override
>> operators and .properties files; this has come up before. This entire thread
>> has been caused by PDFBox’s design and *not* the PDF spec.
>> 
>>>> I'd say that to the modern Java developer having non-code runtime binding 
>>>> has
>>>> become an anti-pattern, resulting in brittle code which can't easily be
>>>> navigated in an IDE and which resists automated analysis and exhibits 
>>>> runtime
>>>> failures despite compiling ok. This is one of those cases where the 
>>>> collective
>>>> wisdom has just evolved over the years.
>>> It depends on the given usecase. All solutions have advantages and
>>> disadvantages. E.g. if someone wants to configure the PDFTextStripper 
>>> without
>>> recompiling the code, it is quite handy to keep the configuration in a text
>>> file.
>> Has anybody *ever* wanted to change the operators which PDFTextStripper is
>> processing without recompiling the code? These are internal implementation
>> details that shouldn’t be exposed in the first place - it’s not a 
>> “configuration” at
>> all, especially as 99% of possible changes would just break PDFTextStripper.
>> 
>>> In this case I'm neither pro nor con a text based config, but I tend to agree
>>> with John to have the different configurations in some method within the
>>> subclasses of PDFStreamEngine.
>> As above, this isn’t “configuration” at all, it lacks even a basic use case. 
>> I don’t
>> see any pros which aren’t fabricated for the sake of argument, but the cons 
>> are
>> causing us significant problems right here, right now.
>> 
>>> BR
>>> Andreas Lehmkühler
>>> 
>>>> -- John
>>>> 
>>>>> On 28 Jul 2014, at 13:42, Tilman Hausherr <thaush...@t-online.de> wrote:
>>>>> 
>>>>> I disagree - one doesn't *have* to pass a properties file to PDFTextStripper
>>>>> and PageDrawer. The properties file for PDFTextStripper is optional. The
>>>>> properties parameter was already there before it became an Apache project.
>>>>> 
>>>>> 
>>>>> Tilman
>>>>> 
>>>>> 
>>>>> 
>>>>> On 28.07.2014 at 22:08, John Hewson wrote:
>>>>>> We need to get rid of these .properties files, they’re causing endless
>>>>>> confusion, not to mention that they hide runtime dependencies in text
>>>>>> files.
>>>>>> 
>>>>>> We should make it so that overriding a TextStripper, PageDrawer, etc.
>>>>>> doesn’t require external .properties files. Preflight currently works in
>>>>>> this manner and it’s much clearer.
>>>>>> 
>>>>>> I guess this is a legacy of the “old” ways of Java XML everything.
>>>>>> 
>>>>>> -- John
>>>>>> 
>>>>>>> On 27 Jul 2014, at 10:09, -A <aa...@hrtmn.net> wrote:
>>>>>>> 
>>>>>>> Thank you, that works as promised and removes the warning. I'm still
>>>>>>> hoping
>>>>>>> to find a resource that better explains the pieces of PDFBox and how 
>>>>>>> they
>>>>>>> work together. Unfortunately most posts on the internet are solely how 
>>>>>>> and
>>>>>>> not why.
>>>>>>> 
>>>>>>> Appreciate it!
>>>>>>> 
>>>>>>> -Aaron
>>>>>>> 
>>>>>>> 
>>>>>>> On Sun, Jul 27, 2014 at 8:00 AM, Tilman Hausherr <thaush...@t-online.de>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> That didn't happen to me, but maybe it did happen to you with another
>>>>>>>> file.
>>>>>>>> 
>>>>>>>> Another solution would be to pass your own properties file, and it 
>>>>>>>> should
>>>>>>>> have this content:
>>>>>>>> 
>>>>>>>> =======================
>>>>>>>> # Licensed to the Apache Software Foundation (ASF) under one or more
>>>>>>>> # contributor license agreements.  See the NOTICE file distributed with
>>>>>>>> # this work for additional information regarding copyright ownership.
>>>>>>>> # The ASF licenses this file to You under the Apache License, Version 
>>>>>>>> 2.0
>>>>>>>> # (the "License"); you may not use this file except in compliance with
>>>>>>>> # the License.  You may obtain a copy of the License at
>>>>>>>> #
>>>>>>>> #      http://www.apache.org/licenses/LICENSE-2.0
>>>>>>>> #
>>>>>>>> # Unless required by applicable law or agreed to in writing, software
>>>>>>>> # distributed under the License is distributed on an "AS IS" BASIS,
>>>>>>>> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
>>>>>>>> implied.
>>>>>>>> # See the License for the specific language governing permissions and
>>>>>>>> # limitations under the License.
>>>>>>>> 
>>>>>>>> # This table maps PDF stream operators to concrete OperatorProcessor
>>>>>>>> # subclasses that are used by the PDFStreamEngine class to interpret 
>>>>>>>> the
>>>>>>>> # PDF document. The classes configured here allow the PDFTextStripper
>>>>>>>> # subclass of PDFStreamEngine to extract text content of the document.
>>>>>>>> 
>>>>>>>> BT = org.apache.pdfbox.util.operator.BeginText
>>>>>>>> cm = org.apache.pdfbox.util.operator.Concatenate
>>>>>>>> Do = org.apache.pdfbox.util.operator.Invoke
>>>>>>>> ET = org.apache.pdfbox.util.operator.EndText
>>>>>>>> gs = org.apache.pdfbox.util.operator.SetGraphicsStateParameters
>>>>>>>> q  = org.apache.pdfbox.util.operator.GSave
>>>>>>>> Q  = org.apache.pdfbox.util.operator.GRestore
>>>>>>>> T* = org.apache.pdfbox.util.operator.NextLine
>>>>>>>> Tc = org.apache.pdfbox.util.operator.SetCharSpacing
>>>>>>>> Td = org.apache.pdfbox.util.operator.MoveText
>>>>>>>> TD = org.apache.pdfbox.util.operator.MoveTextSetLeading
>>>>>>>> Tf = org.apache.pdfbox.util.operator.SetTextFont
>>>>>>>> Tj = org.apache.pdfbox.util.operator.ShowText
>>>>>>>> TJ = org.apache.pdfbox.util.operator.ShowTextGlyph
>>>>>>>> TL = org.apache.pdfbox.util.operator.SetTextLeading
>>>>>>>> Tm = org.apache.pdfbox.util.operator.SetMatrix
>>>>>>>> Tr = org.apache.pdfbox.util.operator.SetTextRenderingMode
>>>>>>>> Ts = org.apache.pdfbox.util.operator.SetTextRise
>>>>>>>> Tw = org.apache.pdfbox.util.operator.SetWordSpacing
>>>>>>>> Tz = org.apache.pdfbox.util.operator.SetHorizontalTextScaling
>>>>>>>> w  = org.apache.pdfbox.util.operator.SetLineWidth
>>>>>>>> \' = org.apache.pdfbox.util.operator.MoveAndShow
>>>>>>>> \" = org.apache.pdfbox.util.operator.SetMoveAndShow
>>>>>>>> 
>>>>>>>> CS=org.apache.pdfbox.util.operator.SetStrokingColorSpace
>>>>>>>> cs=org.apache.pdfbox.util.operator.SetNonStrokingColorSpace
>>>>>>>> rg=org.apache.pdfbox.util.operator.SetNonStrokingRGBColor
>>>>>>>> G=org.apache.pdfbox.util.operator.SetStrokingGrayColor
>>>>>>>> g=org.apache.pdfbox.util.operator.SetNonStrokingGrayColor
>>>>>>>> K=org.apache.pdfbox.util.operator.SetStrokingCMYKColor
>>>>>>>> k=org.apache.pdfbox.util.operator.SetNonStrokingCMYKColor
>>>>>>>> RG=org.apache.pdfbox.util.operator.SetStrokingRGBColor
>>>>>>>> rg=org.apache.pdfbox.util.operator.SetNonStrokingRGBColor
>>>>>>>> SC=org.apache.pdfbox.util.operator.SetStrokingColor
>>>>>>>> sc=org.apache.pdfbox.util.operator.SetNonStrokingColor
>>>>>>>> SCN=org.apache.pdfbox.util.operator.SetStrokingColor
>>>>>>>> scn=org.apache.pdfbox.util.operator.SetNonStrokingColor
>>>>>>>> 
>>>>>>>> # The following operators are not relevant to text extraction,
>>>>>>>> # so we can silently ignore them.
>>>>>>>> 
>>>>>>>> b
>>>>>>>> B
>>>>>>>> b*
>>>>>>>> B*
>>>>>>>> BDC
>>>>>>>> BI
>>>>>>>> BMC
>>>>>>>> BX
>>>>>>>> c
>>>>>>>> d
>>>>>>>> d0
>>>>>>>> d1
>>>>>>>> DP
>>>>>>>> EI
>>>>>>>> EMC
>>>>>>>> EX
>>>>>>>> f
>>>>>>>> F
>>>>>>>> f*
>>>>>>>> h
>>>>>>>> i
>>>>>>>> ID
>>>>>>>> j
>>>>>>>> J
>>>>>>>> l
>>>>>>>> m
>>>>>>>> M
>>>>>>>> MP
>>>>>>>> n
>>>>>>>> re
>>>>>>>> ri
>>>>>>>> s
>>>>>>>> S
>>>>>>>> sh
>>>>>>>> v
>>>>>>>> W
>>>>>>>> W*
>>>>>>>> y
>>>>>>>> 
>>>>>>>> =======================
>>>>>>>> 
>>>>>>>> Tilman
>>>>>>>> 
>>>>>>>> On 27.07.2014 at 15:54, -A wrote:
>>>>>>>> 
>>>>>>>> Tilman;
>>>>>>>>> That is somewhat embarrassing. At one point I brought this to the
>>>>>>>>> mailing
>>>>>>>>> list (because of the following warning) and was told to remove that 
>>>>>>>>> line
>>>>>>>>> because the TextStripper wasn't actually a PageDrawer. The 
>>>>>>>>> functionality
>>>>>>>>> still worked after that, however.
>>>>>>>>> 
>>>>>>>>> Is there a way to do this without the warning, perhaps something 
>>>>>>>>> within
>>>>>>>>> PageDrawer?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Thank you,
>>>>>>>>> -Aaron
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> WARNING: java.lang.ClassCastException: IncrementalPDFStripper cannot be cast to org.apache.pdfbox.pdfviewer.PageDrawer
>>>>>>>>> java.lang.ClassCastException: IncrementalPDFStripper cannot be cast to org.apache.pdfbox.pdfviewer.PageDrawer
>>>>>>>>>   at org.apache.pdfbox.util.operator.pagedrawer.AppendRectangleToPath.process(AppendRectangleToPath.java:46)
>>>>>>>>>   at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
>>>>>>>>>   at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>>>>>>>>>   at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>>>>>>>>>   at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>>>>>>>>>   at IncrementalPDFStripper.containsRed(IncrementalPDFStripper.java:90)
>>>>>>>>>   at IncrementalPDFStripper.main(IncrementalPDFStripper.java:56)
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Sun, Jul 27, 2014 at 5:47 AM, Tilman Hausherr 
>>>>>>>>> <thaush...@t-online.de>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> It is even easier than I thought - replace super() with this:
>>>>>>>>>> super(ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties", true));
>>>>>>>>>> 
>>>>>>>>>> Tilman
>>>>>>>>>> 
>>>>>>>>>> On 27.07.2014 at 13:03, Tilman Hausherr wrote:
>>>>>>>>>> 
>>>>>>>>>>   After having written the text below, I tested by including the "rg"
>>>>>>>>>> 
>>>>>>>>>>> operator in the properties list and now it worked. I also tested
>>>>>>>>>>> deleting
>>>>>>>>>>> your println and instead adding this if the text is red:
>>>>>>>>>>> 
>>>>>>>>>>>      System.out.print (textPos.getCharacter());
>>>>>>>>>>> 
>>>>>>>>>>> and so I got this output:
>>>>>>>>>>> 
>>>>>>>>>>> 21_Key .1295 R~Wall Prof LinP 0.003             0.004     0.000 true
>>>>>>>>>>> 
>>>>>>>>>>> which is exactly what is red in the PDF.
>>>>>>>>>>> 
>>>>>>>>>>> Another (probably better) way to do it would be to derive not from
>>>>>>>>>>> PDFTextStripper but from PDFStreamEngine, and construct it with
>>>>>>>>>>> 
>>>>>>>>>>> ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties")
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> see also http://stackoverflow.com/a/9157714/535646
>>>>>>>>>>> 
>>>>>>>>>>> Tilman
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On 27.07.2014 at 12:14, Tilman Hausherr wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi,
>>>>>>>>>>>> Do you still have the code that worked?
>>>>>>>>>>>> 
>>>>>>>>>>>> I'm not the text extraction specialist here, but what I did was to
>>>>>>>>>>>> look
>>>>>>>>>>>> in the uncompressed source of the PDF. The stream has code like 
>>>>>>>>>>>> this:
>>>>>>>>>>>> 
>>>>>>>>>>>> 0 0 0 rg
>>>>>>>>>>>> 0 0.5019 0 rg
>>>>>>>>>>>> 1 0 0 rg
>>>>>>>>>>>> 
>>>>>>>>>>>> The first line sets the color to black, the second to green, the
>>>>>>>>>>>> third to red. And from what I saw, it can't work at all, because the
>>>>>>>>>>>> "rg" operator isn't processed when extracting text, because
>>>>>>>>>>>> PDFTextStripper.properties doesn't contain the "rg" operator. (The
>>>>>>>>>>>> operator is in another list, which is used when rendering.)
>>>>>>>>>>>> 
>>>>>>>>>>>> So that is what puzzles me. I think it can't work at all. But you
>>>>>>>>>>>> said
>>>>>>>>>>>> it did work at a time.
>>>>>>>>>>>> 
>>>>>>>>>>>> Tilman
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On 27.07.2014 at 07:43, Tilman Hausherr wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> Please upload the PDF somewhere and post the URL, PDF files are
>>>>>>>>>>>>> removed
>>>>>>>>>>>>> from the mailing list.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Tilman
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On 27.07.2014 at 02:35, -A wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hello again. I've been trying to figure out this issue that has 
>>>>>>>>>>>>> come
>>>>>>>>>>>>>> up for me and in my research I found someone posting on
>>>>>>>>>>>>>> StackOverflow (
>>>>>>>>>>>>>> http://stackoverflow.com/questions/10844271/how-to-get-
>>>>>>>>>>>>>> font-color-using-pdfbox) a similar issue where they could not 
>>>>>>>>>>>>>> read
>>>>>>>>>>>>>> any colors from a PDF. The user posted the code and someone else
>>>>>>>>>>>>>> took it,
>>>>>>>>>>>>>> ran it, and reported that it worked. The user's approach was
>>>>>>>>>>>>>> different than
>>>>>>>>>>>>>> mine, but alas.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I'm not sure at this point what is going on. I have stepped 
>>>>>>>>>>>>>> through
>>>>>>>>>>>>>> each individual character and checked the PDGraphicsState object,
>>>>>>>>>>>>>> and even
>>>>>>>>>>>>>> when I am looking at an open file with visibly red text 
>>>>>>>>>>>>>> (attached)
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> debugger only reports DeviceGray. If I print out the ColorSpace
>>>>>>>>>>>>>> name
>>>>>>>>>>>>>> from
>>>>>>>>>>>>>> the PDGraphicsState this is what is printed - for every 
>>>>>>>>>>>>>> character.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I would appreciate if someone could perhaps run the attached text
>>>>>>>>>>>>>> stripper with the attached PDF file and report back if it 
>>>>>>>>>>>>>> actually
>>>>>>>>>>>>>> prints
>>>>>>>>>>>>>> true instead of false, as it does for me. Since I saw this
>>>>>>>>>>>>>> occurrence
>>>>>>>>>>>>>> elsewhere I'd like to rule that out - in case an IDE setting of
>>>>>>>>>>>>>> some
>>>>>>>>>>>>>> sort
>>>>>>>>>>>>>> may be causing this?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> It should be noted that I began using PDFBox with 1.8.5 and had
>>>>>>>>>>>>>> this
>>>>>>>>>>>>>> code working fine. Still with 1.8.5 yesterday it was failing.
>>>>>>>>>>>>>> Upgrading to
>>>>>>>>>>>>>> 1.8.6 yielded the same results.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> If this is an actual issue I do not mind attempting to solve it 
>>>>>>>>>>>>>> if
>>>>>>>>>>>>>> someone may have a general idea where to point me as to prevent
>>>>>>>>>>>>>> needless
>>>>>>>>>>>>>> meddling with graphics state objects. Or, if this should be
>>>>>>>>>>>>>> reported
>>>>>>>>>>>>>> I can
>>>>>>>>>>>>>> do that as well.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> -Aaron
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> *Previous Message:*
>>>>>>>>>>>>>> I've attached an updated stripper file with the only addition 
>>>>>>>>>>>>>> being
>>>>>>>>>>>>>> a
>>>>>>>>>>>>>> main function to test the class specifically.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> When run with the PDF I have also attached, it indeed does not
>>>>>>>>>>>>>> recognize the red text.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> At this point it seems that this issue is solely dependent on
>>>>>>>>>>>>>> PDFBox.
>>>>>>>>>>>>>> I'll stay tuned for some insight hopefully. If any other
>>>>>>>>>>>>>> information
>>>>>>>>>>>>>> is
>>>>>>>>>>>>>> needed, let me know!
> 
