+1 for removing the .properties file if the new mechanism is easier to understand and handle. The discussion so far doesn’t provide proof of that, or any information about it.
What would a replacement look like?

OTOH, if it’s a documentation issue, we could also add some more information to the javadocs to explain the dependencies.

We could add a register/unregister method to allow adding/removing custom operator handling, or provide a service discovery mechanism. This way we would still have the old flexibility.

BR
Maruan

On 29.07.2014 at 21:48, John Hewson <j...@jahewson.com> wrote:

> Right, but we need to address the confusion and complexity that has been
> caused by .properties files, which made PDFBOX-2246 so tricky to figure out.
>
> Let's remove this wart!
>
> -- John
>
> On 29 Jul 2014, at 10:44, Tilman Hausherr <thaush...@t-online.de> wrote:
>
>> Hi,
>>
>> At this time, the problem I see and wanted to solve (PDFBOX-2246) exists
>> regardless of whether we use a properties file or initialize directly in
>> the code.
>>
>> Tilman
>>
>>
>> On 29.07.2014 at 19:41, John Hewson wrote:
>>> On 29 Jul 2014, at 03:44, Andreas Lehmkühler <andr...@lehmi.de> wrote:
>>>
>>>> Hi,
>>>>
>>>> it's not a black and white issue (comments inline)
>>>>
>>>>> John Hewson <j...@jahewson.com> wrote on 29 July 2014 at 07:44:
>>>>>
>>>>> Yes, really I should have said subclasses of PDFStreamEngine - that's
>>>>> where the .properties file originates. I'd propose replacing the
>>>>> properties mechanism with a simple method containing the mapping which
>>>>> can be overridden in subclasses. Ultimately, users expect to be able to
>>>>> override the behaviour of a class by just subclassing the class.
>>>> PDFStreamEngine doesn't configure any operator set itself. The
>>>> subclasses are supposed to configure their own set of operators
>>>> depending on the particular use case. E.g. to extend the text extraction
>>>> one has to subclass PDFTextStripper and so on.
>>> It’s PDFStreamEngine which implements the .properties mechanism though,
>>> via the PDFStreamEngine(Properties properties) constructor.
>>>
>>>> E.g. to extend the text extraction one has to subclass PDFTextStripper
>>>> and so on.
>>> That’s true, but it’s only half the story. Don’t forget that the
>>> .properties files need to be copied and pasted elsewhere and modified,
>>> along with overriding which .properties file is passed in the
>>> constructor, if you want to truly override the class’s behaviour.
>>>
>>>>> We've seen a number of incidents of confusion on the mailing list due
>>>>> to the current design.
>>>> IMHO, most of the confusion is based on a lack of knowledge of the PDF
>>>> spec. One can't understand how PDFBox works under the hood by simply
>>>> looking at the code. One has to understand the PDF spec as well, at
>>>> least the base concepts.
>>> I’m specifically talking about confusion surrounding how to override
>>> operators and .properties files; this has come up before. This entire
>>> thread has been caused by PDFBox’s design and *not* the PDF spec.
>>>
>>>>> I'd say that to the modern Java developer having non-code runtime
>>>>> binding has become an anti-pattern, resulting in brittle code which
>>>>> can't easily be navigated in an IDE and which resists automated
>>>>> analysis and exhibits runtime failures despite compiling ok. This is
>>>>> one of those cases where the collective wisdom has just evolved over
>>>>> the years.
>>>> It depends on the given use case. All solutions have advantages and
>>>> disadvantages. E.g. if someone wants to configure the PDFTextStripper
>>>> without recompiling the code, it is quite handy to keep the
>>>> configuration in a text file.
>>> Has anybody *ever* wanted to change the operators which PDFTextStripper
>>> is processing without recompiling the code? These are internal
>>> implementation details that shouldn’t be exposed in the first place -
>>> it’s not a “configuration” at all, especially as 99% of possible changes
>>> would just break PDFTextStripper.
>>>
>>>> In this case I'm neither for nor against a text-based config, but I tend
>>>> to agree with John to have the different configurations in some method
>>>> within the subclasses of PDFStreamEngine.
>>> As above, this isn’t “configuration” at all; it lacks even a basic use
>>> case. I don’t see any pros which aren’t fabricated for the sake of
>>> argument, but the cons are causing us significant problems right here,
>>> right now.
>>>
>>>> BR
>>>> Andreas Lehmkühler
>>>>
>>>>> -- John
>>>>>
>>>>>> On 28 Jul 2014, at 13:42, Tilman Hausherr <thaush...@t-online.de> wrote:
>>>>>>
>>>>>> I disagree - one doesn't *have* to pass a property file to
>>>>>> PDFTextStripper and PageDrawer. The properties file for
>>>>>> PDFTextStripper is optional. The property parameter was already there
>>>>>> before it became an Apache project.
>>>>>>
>>>>>> Tilman
>>>>>>
>>>>>> On 28.07.2014 at 22:08, John Hewson wrote:
>>>>>>> We need to get rid of these .properties files, they’re causing endless
>>>>>>> confusion, not to mention that they hide runtime dependencies in text
>>>>>>> files.
>>>>>>>
>>>>>>> We should make it so that overriding a TextStripper, PageDrawer, etc.
>>>>>>> doesn’t require external .properties files; currently Preflight works
>>>>>>> in this manner and it’s much clearer.
>>>>>>>
>>>>>>> I guess this is a legacy of the “old” ways of Java XML everything.
>>>>>>>
>>>>>>> -- John
>>>>>>>
>>>>>>>> On 27 Jul 2014, at 10:09, -A <aa...@hrtmn.net> wrote:
>>>>>>>>
>>>>>>>> Thank you, that works as promised and removes the warning. I'm still
>>>>>>>> hoping to find a resource that better explains the pieces of PDFBox
>>>>>>>> and how they work together. Unfortunately most posts on the internet
>>>>>>>> are solely how and not why.
>>>>>>>>
>>>>>>>> Appreciate it!
>>>>>>>>
>>>>>>>> -Aaron
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sun, Jul 27, 2014 at 8:00 AM, Tilman Hausherr
>>>>>>>> <thaush...@t-online.de> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> That didn't happen to me, but maybe it did happen to you with another
>>>>>>>>> file.
>>>>>>>>>
>>>>>>>>> Another solution would be to pass your own properties file, and it
>>>>>>>>> should have this content:
>>>>>>>>>
>>>>>>>>> =======================
>>>>>>>>> # Licensed to the Apache Software Foundation (ASF) under one or more
>>>>>>>>> # contributor license agreements. See the NOTICE file distributed with
>>>>>>>>> # this work for additional information regarding copyright ownership.
>>>>>>>>> # The ASF licenses this file to You under the Apache License, Version 2.0
>>>>>>>>> # (the "License"); you may not use this file except in compliance with
>>>>>>>>> # the License. You may obtain a copy of the License at
>>>>>>>>> #
>>>>>>>>> # http://www.apache.org/licenses/LICENSE-2.0
>>>>>>>>> #
>>>>>>>>> # Unless required by applicable law or agreed to in writing, software
>>>>>>>>> # distributed under the License is distributed on an "AS IS" BASIS,
>>>>>>>>> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>>>>>>>>> # See the License for the specific language governing permissions and
>>>>>>>>> # limitations under the License.
>>>>>>>>>
>>>>>>>>> # This table maps PDF stream operators to concrete OperatorProcessor
>>>>>>>>> # subclasses that are used by the PDFStreamEngine class to interpret the
>>>>>>>>> # PDF document. The classes configured here allow the PDFTextStripper
>>>>>>>>> # subclass of PDFStreamEngine to extract text content of the document.
>>>>>>>>>
>>>>>>>>> BT = org.apache.pdfbox.util.operator.BeginText
>>>>>>>>> cm = org.apache.pdfbox.util.operator.Concatenate
>>>>>>>>> Do = org.apache.pdfbox.util.operator.Invoke
>>>>>>>>> ET = org.apache.pdfbox.util.operator.EndText
>>>>>>>>> gs = org.apache.pdfbox.util.operator.SetGraphicsStateParameters
>>>>>>>>> q = org.apache.pdfbox.util.operator.GSave
>>>>>>>>> Q = org.apache.pdfbox.util.operator.GRestore
>>>>>>>>> T* = org.apache.pdfbox.util.operator.NextLine
>>>>>>>>> Tc = org.apache.pdfbox.util.operator.SetCharSpacing
>>>>>>>>> Td = org.apache.pdfbox.util.operator.MoveText
>>>>>>>>> TD = org.apache.pdfbox.util.operator.MoveTextSetLeading
>>>>>>>>> Tf = org.apache.pdfbox.util.operator.SetTextFont
>>>>>>>>> Tj = org.apache.pdfbox.util.operator.ShowText
>>>>>>>>> TJ = org.apache.pdfbox.util.operator.ShowTextGlyph
>>>>>>>>> TL = org.apache.pdfbox.util.operator.SetTextLeading
>>>>>>>>> Tm = org.apache.pdfbox.util.operator.SetMatrix
>>>>>>>>> Tr = org.apache.pdfbox.util.operator.SetTextRenderingMode
>>>>>>>>> Ts = org.apache.pdfbox.util.operator.SetTextRise
>>>>>>>>> Tw = org.apache.pdfbox.util.operator.SetWordSpacing
>>>>>>>>> Tz = org.apache.pdfbox.util.operator.SetHorizontalTextScaling
>>>>>>>>> w = org.apache.pdfbox.util.operator.SetLineWidth
>>>>>>>>> \' = org.apache.pdfbox.util.operator.MoveAndShow
>>>>>>>>> \" = org.apache.pdfbox.util.operator.SetMoveAndShow
>>>>>>>>>
>>>>>>>>> CS=org.apache.pdfbox.util.operator.SetStrokingColorSpace
>>>>>>>>> cs=org.apache.pdfbox.util.operator.SetNonStrokingColorSpace
>>>>>>>>> rg=org.apache.pdfbox.util.operator.SetNonStrokingRGBColor
>>>>>>>>> G=org.apache.pdfbox.util.operator.SetStrokingGrayColor
>>>>>>>>> g=org.apache.pdfbox.util.operator.SetNonStrokingGrayColor
>>>>>>>>> K=org.apache.pdfbox.util.operator.SetStrokingCMYKColor
>>>>>>>>> k=org.apache.pdfbox.util.operator.SetNonStrokingCMYKColor
>>>>>>>>> RG=org.apache.pdfbox.util.operator.SetStrokingRGBColor
>>>>>>>>> rg=org.apache.pdfbox.util.operator.SetNonStrokingRGBColor
>>>>>>>>> SC=org.apache.pdfbox.util.operator.SetStrokingColor
>>>>>>>>> sc=org.apache.pdfbox.util.operator.SetNonStrokingColor
>>>>>>>>> SCN=org.apache.pdfbox.util.operator.SetStrokingColor
>>>>>>>>> scn=org.apache.pdfbox.util.operator.SetNonStrokingColor
>>>>>>>>>
>>>>>>>>> # The following operators are not relevant to text extraction,
>>>>>>>>> # so we can silently ignore them.
>>>>>>>>>
>>>>>>>>> b
>>>>>>>>> B
>>>>>>>>> b*
>>>>>>>>> B*
>>>>>>>>> BDC
>>>>>>>>> BI
>>>>>>>>> BMC
>>>>>>>>> BX
>>>>>>>>> c
>>>>>>>>> d
>>>>>>>>> d0
>>>>>>>>> d1
>>>>>>>>> DP
>>>>>>>>> EI
>>>>>>>>> EMC
>>>>>>>>> EX
>>>>>>>>> f
>>>>>>>>> F
>>>>>>>>> f*
>>>>>>>>> h
>>>>>>>>> i
>>>>>>>>> ID
>>>>>>>>> j
>>>>>>>>> J
>>>>>>>>> l
>>>>>>>>> m
>>>>>>>>> M
>>>>>>>>> MP
>>>>>>>>> n
>>>>>>>>> re
>>>>>>>>> ri
>>>>>>>>> s
>>>>>>>>> S
>>>>>>>>> sh
>>>>>>>>> v
>>>>>>>>> W
>>>>>>>>> W*
>>>>>>>>> y
>>>>>>>>>
>>>>>>>>> =======================
>>>>>>>>>
>>>>>>>>> Tilman
>>>>>>>>>
>>>>>>>>> On 27.07.2014 at 15:54, -A wrote:
>>>>>>>>>
>>>>>>>>>> Tilman;
>>>>>>>>>> That is somewhat embarrassing. At one point I brought this to the
>>>>>>>>>> mailing list (because of the following warning) and was told to
>>>>>>>>>> remove that line because the TextStripper wasn't actually a
>>>>>>>>>> PageDrawer. The functionality still worked after that, however.
>>>>>>>>>>
>>>>>>>>>> Is there a way to do this without the warning, perhaps something
>>>>>>>>>> within PageDrawer?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thank you,
>>>>>>>>>> -Aaron
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> WARNING: java.lang.ClassCastException: IncrementalPDFStripper cannot be
>>>>>>>>>> cast to org.apache.pdfbox.pdfviewer.PageDrawer
>>>>>>>>>> java.lang.ClassCastException: IncrementalPDFStripper cannot be cast to
>>>>>>>>>> org.apache.pdfbox.pdfviewer.PageDrawer
>>>>>>>>>>     at org.apache.pdfbox.util.operator.pagedrawer.AppendRectangleToPath.process(AppendRectangleToPath.java:46)
>>>>>>>>>>     at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
>>>>>>>>>>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>>>>>>>>>>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>>>>>>>>>>     at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>>>>>>>>>>     at IncrementalPDFStripper.containsRed(IncrementalPDFStripper.java:90)
>>>>>>>>>>     at IncrementalPDFStripper.main(IncrementalPDFStripper.java:56)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sun, Jul 27, 2014 at 5:47 AM, Tilman Hausherr
>>>>>>>>>> <thaush...@t-online.de> wrote:
>>>>>>>>>>
>>>>>>>>>>> It is even easier than I thought - replace super() with this:
>>>>>>>>>>>
>>>>>>>>>>> super(ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties", true));
>>>>>>>>>>>
>>>>>>>>>>> Tilman
>>>>>>>>>>>
>>>>>>>>>>> On 27.07.2014 at 13:03, Tilman Hausherr wrote:
>>>>>>>>>>>
>>>>>>>>>>>> After having written the text below, I tested by including the "rg"
>>>>>>>>>>>> operator in the properties list and now it worked. I also tested
>>>>>>>>>>>> deleting your println and instead adding this if the text is red:
>>>>>>>>>>>>
>>>>>>>>>>>> System.out.print (textPos.getCharacter());
>>>>>>>>>>>>
>>>>>>>>>>>> and so I got this output:
>>>>>>>>>>>>
>>>>>>>>>>>> 21_Key .1295 R~Wall Prof LinP 0.003 0.004 0.000
>>>>>>>>>>>> true
>>>>>>>>>>>>
>>>>>>>>>>>> which is exactly what is red in the PDF.
>>>>>>>>>>>>
>>>>>>>>>>>> Another way (probably better) to do it would be to not derive from
>>>>>>>>>>>> PDFTextStripper but from PDFStreamEngine and construct it with
>>>>>>>>>>>>
>>>>>>>>>>>> ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties")
>>>>>>>>>>>>
>>>>>>>>>>>> see also http://stackoverflow.com/a/9157714/535646
>>>>>>>>>>>>
>>>>>>>>>>>> Tilman
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 27.07.2014 at 12:14, Tilman Hausherr wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> Do you still have the code that worked?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm not the text extraction specialist here, but what I did was to
>>>>>>>>>>>>> look in the uncompressed source of the PDF. The stream has code
>>>>>>>>>>>>> like this:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 0 0 0 rg
>>>>>>>>>>>>> 0 0.5019 0 rg
>>>>>>>>>>>>> 1 0 0 rg
>>>>>>>>>>>>>
>>>>>>>>>>>>> The first line sets to black, the second to green, the third to
>>>>>>>>>>>>> red. And from what I saw, it can't work at all, because the "rg"
>>>>>>>>>>>>> operator isn't processed when extracting text, because
>>>>>>>>>>>>> PDFTextStripper.properties doesn't contain the "rg" operator. (The
>>>>>>>>>>>>> operator is in another list, which is used when rendering.)
>>>>>>>>>>>>>
>>>>>>>>>>>>> So that is what puzzles me. I think it can't work at all. But you
>>>>>>>>>>>>> said it did work at one time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Tilman
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 27.07.2014 at 07:43, Tilman Hausherr wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>> Please upload the PDF somewhere and post the URL; PDF files are
>>>>>>>>>>>>>> removed from the mailing list.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Tilman
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 27.07.2014 at 02:35, -A wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello again. I've been trying to figure out this issue that has
>>>>>>>>>>>>>>> come up for me and in my research I found someone posting on
>>>>>>>>>>>>>>> StackOverflow
>>>>>>>>>>>>>>> (http://stackoverflow.com/questions/10844271/how-to-get-font-color-using-pdfbox)
>>>>>>>>>>>>>>> a similar issue where they could not read any colors from a PDF.
>>>>>>>>>>>>>>> The user posted the code and someone else took it, ran it, and
>>>>>>>>>>>>>>> reported that it worked. The user's approach was different than
>>>>>>>>>>>>>>> mine, but alas.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm not sure at this point what is going on. I have stepped
>>>>>>>>>>>>>>> through each individual character and checked the PDGraphicsState
>>>>>>>>>>>>>>> object, and even when I am looking at an open file with visibly
>>>>>>>>>>>>>>> red text (attached) the debugger only reports DeviceGray. If I
>>>>>>>>>>>>>>> print out the ColorSpace name from the PDGraphicsState this is
>>>>>>>>>>>>>>> what is printed - for every character.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would appreciate if someone could perhaps run the attached text
>>>>>>>>>>>>>>> stripper with the attached PDF file and report back if it
>>>>>>>>>>>>>>> actually prints true instead of false, as it does for me. Since I
>>>>>>>>>>>>>>> saw this occurrence elsewhere I'd like to rule that out - in case
>>>>>>>>>>>>>>> an IDE setting of some sort may be causing this?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It should be noted that I began using PDFBox with 1.8.5 and had
>>>>>>>>>>>>>>> this code working fine. Still with 1.8.5 yesterday it was
>>>>>>>>>>>>>>> failing. Upgrading to 1.8.6 yielded the same results.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If this is an actual issue I do not mind attempting to solve it
>>>>>>>>>>>>>>> if someone may have a general idea where to point me as to
>>>>>>>>>>>>>>> prevent needless meddling with graphics state objects. Or, if
>>>>>>>>>>>>>>> this should be reported I can do that as well.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -Aaron
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *Previous Message:*
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I've attached an updated stripper file with the only addition
>>>>>>>>>>>>>>> being a main function to test the class specifically.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> When run with the PDF I have also attached it indeed does not
>>>>>>>>>>>>>>> recognize the red text.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> At this point it seems that this issue is solely dependent on
>>>>>>>>>>>>>>> PDFBox. I'll stay tuned for some insight hopefully. If any other
>>>>>>>>>>>>>>> information is needed, let me know!
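
For reference, the register/unregister idea raised at the top of this thread could look roughly like the sketch below. It is only an illustration under stated assumptions: OperatorHandler, SketchStreamEngine, SketchColorAwareStripper, registerOperator and unregisterOperator are hypothetical names and not PDFBox API, and no real PDF parsing is done. The point is just that the operator mapping lives in code, where a subclass can change it without copying a .properties file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical handler for one content-stream operator; not a PDFBox type. */
interface OperatorHandler {
    void process(String operator, List<String> operands);
}

/** Hypothetical stand-in for a stream engine; only the registry idea is shown. */
class SketchStreamEngine {
    private final Map<String, OperatorHandler> handlers = new HashMap<>();

    /** Registers (or replaces) the handler for an operator such as "rg". */
    public final void registerOperator(String operator, OperatorHandler handler) {
        handlers.put(operator, handler);
    }

    /** Removes a handler; returns true if one was registered. */
    public final boolean unregisterOperator(String operator) {
        return handlers.remove(operator) != null;
    }

    /** Dispatches an operator; returns false (and does nothing) if it is unknown. */
    public boolean processOperator(String operator, List<String> operands) {
        OperatorHandler handler = handlers.get(operator);
        if (handler == null) {
            return false;
        }
        handler.process(operator, operands);
        return true;
    }
}

/** Hypothetical subclass that wires its operators in code, not in a text file. */
class SketchColorAwareStripper extends SketchStreamEngine {
    final List<String> seenColors = new ArrayList<>();

    SketchColorAwareStripper() {
        // Assumption: we only record the operands of the non-stroking RGB operator.
        registerOperator("rg", (op, operands) -> seenColors.add(String.join(" ", operands)));
    }
}

class OperatorRegistryDemo {
    public static void main(String[] args) {
        SketchColorAwareStripper stripper = new SketchColorAwareStripper();
        stripper.processOperator("rg", Arrays.asList("1", "0", "0"));        // handled
        stripper.processOperator("re", Arrays.asList("0", "0", "10", "10")); // ignored
        System.out.println(stripper.seenColors);
    }
}
```

With something along these lines, a user who needs the "rg" operator during text extraction would register it in the subclass constructor, and the compiler and IDE can follow the dependency. Whether this is actually easier to handle than the .properties files is exactly the open question in this thread.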