Right, but we need to address the confusion and complexity caused by the
.properties files, which made PDFBOX-2246 so tricky to figure out.
Let's remove this wart!

-- John

On 29 Jul 2014, at 10:44, Tilman Hausherr <thaush...@t-online.de> wrote:

> Hi,
>
> At this time, the problem I see and wanted to solve (PDFBOX-2246) exists
> regardless of whether we use a properties file or initialize directly in
> the code.
>
> Tilman
>
>
> On 29.07.2014 at 19:41, John Hewson wrote:
>> On 29 Jul 2014, at 03:44, Andreas Lehmkühler <andr...@lehmi.de> wrote:
>>
>>> Hi,
>>>
>>> it's not a black and white issue (comments inline)
>>>
>>>> John Hewson <j...@jahewson.com> wrote on 29 July 2014 at 07:44:
>>>>
>>>> Yes, really I should have said subclasses of PDFStreamEngine - that's
>>>> where the .properties file originates. I'd propose replacing the
>>>> properties mechanism with a simple method containing the mapping which
>>>> can be overridden in subclasses. Ultimately, users expect to be able to
>>>> override the behaviour of a class by just subclassing the class.
>>> PDFStreamEngine doesn't configure any operator set itself. The
>>> subclasses are supposed to configure their own set of operators
>>> depending on the particular use case. E.g. to extend the text extraction
>>> one has to subclass PDFTextStripper and so on.
>> It’s PDFStreamEngine which implements the .properties mechanism though,
>> via the PDFStreamEngine(Properties properties) constructor.
>>
>>> E.g. to extend the text extraction one has to subclass PDFTextStripper
>>> and so on.
>> That’s true, but it’s only half the story. Don’t forget that the
>> .properties files need to be copied and pasted elsewhere and modified,
>> along with overriding which .properties file is passed in the
>> constructor, if you want to truly override the class’s behaviour.
>>
>>>> We've seen a number of incidents of confusion on the mailing list due
>>>> to the current design.
>>> IMHO, most of the confusion is based on the lack of knowledge of the pdf
>>> spec. One can't understand how pdfbox works under the hood by simply
>>> looking at the code. One has to understand the pdf spec as well, at
>>> least the base concepts.
>> I’m specifically talking about confusion surrounding how to override
>> operators and .properties files; this has come up before. This entire
>> thread has been caused by PDFBox’s design and *not* the PDF spec.
>>
>>>> I'd say that to the modern Java developer having non-code runtime
>>>> binding has become an anti-pattern, resulting in brittle code which
>>>> can't easily be navigated in an IDE and which resists automated
>>>> analysis and exhibits runtime failures despite compiling ok. This is
>>>> one of those cases where the collective wisdom has just evolved over
>>>> the years.
>>> It depends on the given use case. All solutions have advantages and
>>> disadvantages. E.g. if someone wants to configure the PDFTextStripper
>>> without recompiling the code, it is quite handy to keep the
>>> configuration in a text file.
>> Has anybody *ever* wanted to change the operators which PDFTextStripper
>> is processing without recompiling the code? These are internal
>> implementation details that shouldn’t be exposed in the first place -
>> it’s not a “configuration” at all, especially as 99% of possible changes
>> would just break PDFTextStripper.
>>
>>> In this case I'm neither pro nor con a text based config, but I tend to
>>> agree with John to have the different configurations in some method
>>> within the subclasses of PDFStreamEngine.
>> As above, this isn’t “configuration” at all, it lacks even a basic use
>> case. I don’t see any pros which aren’t fabricated for the sake of
>> argument, but the cons are causing us significant problems right here,
>> right now.
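
A minimal sketch of the alternative John proposes in this thread, an overridable method that holds the operator mapping, might look like the following. All names here (`StreamEngineSketch`, `createOperators`, `OperatorHandler` and the two subclasses) are hypothetical stand-ins and not PDFBox API; the handlers only return a description instead of processing a content stream.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical handler type for illustration only; not a PDFBox class.
interface OperatorHandler {
    String describe();
}

// Hypothetical base engine: it owns no operators itself.
class StreamEngineSketch {
    private Map<String, OperatorHandler> handlers; // built on first lookup

    /** Subclasses override this to add, replace, or remove operators. */
    protected Map<String, OperatorHandler> createOperators() {
        return new HashMap<>();
    }

    /** Returns the handler for an operator name, or null if it is unmapped. */
    final OperatorHandler lookup(String operator) {
        if (handlers == null) {
            handlers = createOperators();
        }
        return handlers.get(operator);
    }
}

// Stand-in for a text-extraction engine with a small operator set.
class TextEngineSketch extends StreamEngineSketch {
    @Override
    protected Map<String, OperatorHandler> createOperators() {
        Map<String, OperatorHandler> ops = super.createOperators();
        ops.put("BT", () -> "begin text");
        ops.put("Tj", () -> "show text");
        return ops;
    }
}

// A user subclass that also wants the non-stroking RGB color operator.
class ColorAwareTextEngineSketch extends TextEngineSketch {
    @Override
    protected Map<String, OperatorHandler> createOperators() {
        Map<String, OperatorHandler> ops = super.createOperators();
        ops.put("rg", () -> "set non-stroking RGB color");
        return ops;
    }
}

class OperatorMappingDemo {
    public static void main(String[] args) {
        StreamEngineSketch plain = new TextEngineSketch();
        StreamEngineSketch colored = new ColorAwareTextEngineSketch();
        System.out.println("rg mapped in text engine: " + (plain.lookup("rg") != null));
        System.out.println("rg in subclass: " + colored.lookup("rg").describe());
    }
}
```

In this sketch a subclass changes the operator set by overriding a single method, so the mapping is visible to the compiler and the IDE. The map is built lazily on first lookup rather than in the constructor, which avoids calling an overridable method before the subclass is fully constructed.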
>>
>>> BR
>>> Andreas Lehmkühler
>>>
>>>> -- John
>>>>
>>>>> On 28 Jul 2014, at 13:42, Tilman Hausherr <thaush...@t-online.de> wrote:
>>>>>
>>>>> I disagree - one doesn't *have* to pass a property file to
>>>>> PDFTextStripper and PageDrawer. The properties file for
>>>>> PDFTextStripper is optional. The property parameter was already there
>>>>> before it became an Apache project.
>>>>>
>>>>> Tilman
>>>>>
>>>>> On 28.07.2014 at 22:08, John Hewson wrote:
>>>>>> We need to get rid of these .properties files, they’re causing
>>>>>> endless confusion, not to mention that they hide runtime
>>>>>> dependencies in text files.
>>>>>>
>>>>>> We should make it so that overriding a TextStripper, PageDrawer,
>>>>>> etc. doesn’t require external .properties files; currently Preflight
>>>>>> works in this manner and it’s much clearer.
>>>>>>
>>>>>> I guess this is a legacy of the “old” ways of Java XML everything.
>>>>>>
>>>>>> -- John
>>>>>>
>>>>>>> On 27 Jul 2014, at 10:09, -A <aa...@hrtmn.net> wrote:
>>>>>>>
>>>>>>> Thank you, that works as promised and removes the warning. I'm
>>>>>>> still hoping to find a resource that better explains the pieces of
>>>>>>> PDFBox and how they work together. Unfortunately most posts on the
>>>>>>> internet are solely how and not why.
>>>>>>>
>>>>>>> Appreciate it!
>>>>>>>
>>>>>>> -Aaron
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Jul 27, 2014 at 8:00 AM, Tilman Hausherr
>>>>>>> <thaush...@t-online.de> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> That didn't happen to me, but maybe it did happen to you with
>>>>>>>> another file.
>>>>>>>>
>>>>>>>> Another solution would be to pass your own properties file, and it
>>>>>>>> should have this content:
>>>>>>>>
>>>>>>>> =======================
>>>>>>>> # Licensed to the Apache Software Foundation (ASF) under one or more
>>>>>>>> # contributor license agreements. See the NOTICE file distributed with
>>>>>>>> # this work for additional information regarding copyright ownership.
>>>>>>>> # The ASF licenses this file to You under the Apache License, Version 2.0
>>>>>>>> # (the "License"); you may not use this file except in compliance with
>>>>>>>> # the License. You may obtain a copy of the License at
>>>>>>>> #
>>>>>>>> # http://www.apache.org/licenses/LICENSE-2.0
>>>>>>>> #
>>>>>>>> # Unless required by applicable law or agreed to in writing, software
>>>>>>>> # distributed under the License is distributed on an "AS IS" BASIS,
>>>>>>>> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>>>>>>>> # See the License for the specific language governing permissions and
>>>>>>>> # limitations under the License.
>>>>>>>>
>>>>>>>> # This table is maps PDF stream operators to concrete OperatorProcessor
>>>>>>>> # subclasses that are used by the PDFStreamEngine class to interpret the
>>>>>>>> # PDF document. The classes configured here allow the PDFTextStripper
>>>>>>>> # subclass of PDFStreamEngine to extract text content of the document.
>>>>>>>>
>>>>>>>> BT = org.apache.pdfbox.util.operator.BeginText
>>>>>>>> cm = org.apache.pdfbox.util.operator.Concatenate
>>>>>>>> Do = org.apache.pdfbox.util.operator.Invoke
>>>>>>>> ET = org.apache.pdfbox.util.operator.EndText
>>>>>>>> gs = org.apache.pdfbox.util.operator.SetGraphicsStateParameters
>>>>>>>> q = org.apache.pdfbox.util.operator.GSave
>>>>>>>> Q = org.apache.pdfbox.util.operator.GRestore
>>>>>>>> T* = org.apache.pdfbox.util.operator.NextLine
>>>>>>>> Tc = org.apache.pdfbox.util.operator.SetCharSpacing
>>>>>>>> Td = org.apache.pdfbox.util.operator.MoveText
>>>>>>>> TD = org.apache.pdfbox.util.operator.MoveTextSetLeading
>>>>>>>> Tf = org.apache.pdfbox.util.operator.SetTextFont
>>>>>>>> Tj = org.apache.pdfbox.util.operator.ShowText
>>>>>>>> TJ = org.apache.pdfbox.util.operator.ShowTextGlyph
>>>>>>>> TL = org.apache.pdfbox.util.operator.SetTextLeading
>>>>>>>> Tm = org.apache.pdfbox.util.operator.SetMatrix
>>>>>>>> Tr = org.apache.pdfbox.util.operator.SetTextRenderingMode
>>>>>>>> Ts = org.apache.pdfbox.util.operator.SetTextRise
>>>>>>>> Tw = org.apache.pdfbox.util.operator.SetWordSpacing
>>>>>>>> Tz = org.apache.pdfbox.util.operator.SetHorizontalTextScaling
>>>>>>>> w = org.apache.pdfbox.util.operator.SetLineWidth
>>>>>>>> \' = org.apache.pdfbox.util.operator.MoveAndShow
>>>>>>>> \" = org.apache.pdfbox.util.operator.SetMoveAndShow
>>>>>>>>
>>>>>>>> CS=org.apache.pdfbox.util.operator.SetStrokingColorSpace
>>>>>>>> cs=org.apache.pdfbox.util.operator.SetNonStrokingColorSpace
>>>>>>>> rg=org.apache.pdfbox.util.operator.SetNonStrokingRGBColor
>>>>>>>> G=org.apache.pdfbox.util.operator.SetStrokingGrayColor
>>>>>>>> g=org.apache.pdfbox.util.operator.SetNonStrokingGrayColor
>>>>>>>> K=org.apache.pdfbox.util.operator.SetStrokingCMYKColor
>>>>>>>> k=org.apache.pdfbox.util.operator.SetNonStrokingCMYKColor
>>>>>>>> RG=org.apache.pdfbox.util.operator.SetStrokingRGBColor
>>>>>>>> rg=org.apache.pdfbox.util.operator.SetNonStrokingRGBColor
>>>>>>>> SC=org.apache.pdfbox.util.operator.SetStrokingColor
>>>>>>>> sc=org.apache.pdfbox.util.operator.SetNonStrokingColor
>>>>>>>> SCN=org.apache.pdfbox.util.operator.SetStrokingColor
>>>>>>>> scn=org.apache.pdfbox.util.operator.SetNonStrokingColor
>>>>>>>>
>>>>>>>> # The following operators are not relevant to text extraction,
>>>>>>>> # so we can silently ignore them.
>>>>>>>>
>>>>>>>> b
>>>>>>>> B
>>>>>>>> b*
>>>>>>>> B*
>>>>>>>> BDC
>>>>>>>> BI
>>>>>>>> BMC
>>>>>>>> BX
>>>>>>>> c
>>>>>>>> d
>>>>>>>> d0
>>>>>>>> d1
>>>>>>>> DP
>>>>>>>> EI
>>>>>>>> EMC
>>>>>>>> EX
>>>>>>>> f
>>>>>>>> F
>>>>>>>> f*
>>>>>>>> h
>>>>>>>> i
>>>>>>>> ID
>>>>>>>> j
>>>>>>>> J
>>>>>>>> l
>>>>>>>> m
>>>>>>>> M
>>>>>>>> MP
>>>>>>>> n
>>>>>>>> re
>>>>>>>> ri
>>>>>>>> s
>>>>>>>> S
>>>>>>>> sh
>>>>>>>> v
>>>>>>>> W
>>>>>>>> W*
>>>>>>>> y
>>>>>>>>
>>>>>>>> =======================
>>>>>>>>
>>>>>>>> Tilman
>>>>>>>>
>>>>>>>> On 27.07.2014 at 15:54, -A wrote:
>>>>>>>>
>>>>>>>>> Tilman;
>>>>>>>>> That is somewhat embarrassing. At one point I brought this to the
>>>>>>>>> mailing list (because of the following warning) and was told to
>>>>>>>>> remove that line because the TextStripper wasn't actually a
>>>>>>>>> PageDrawer. The functionality still worked after that, however.
>>>>>>>>>
>>>>>>>>> Is there a way to do this without the warning, perhaps something
>>>>>>>>> within PageDrawer?
>>>>>>>>>
>>>>>>>>> Thank you,
>>>>>>>>> -Aaron
>>>>>>>>>
>>>>>>>>> WARNING: java.lang.ClassCastException: IncrementalPDFStripper cannot be cast to org.apache.pdfbox.pdfviewer.PageDrawer
>>>>>>>>> java.lang.ClassCastException: IncrementalPDFStripper cannot be cast to org.apache.pdfbox.pdfviewer.PageDrawer
>>>>>>>>>     at org.apache.pdfbox.util.operator.pagedrawer.AppendRectangleToPath.process(AppendRectangleToPath.java:46)
>>>>>>>>>     at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
>>>>>>>>>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>>>>>>>>>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>>>>>>>>>     at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>>>>>>>>>     at IncrementalPDFStripper.containsRed(IncrementalPDFStripper.java:90)
>>>>>>>>>     at IncrementalPDFStripper.main(IncrementalPDFStripper.java:56)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, Jul 27, 2014 at 5:47 AM, Tilman Hausherr
>>>>>>>>> <thaush...@t-online.de> wrote:
>>>>>>>>>
>>>>>>>>>> It is even easier than I thought - replace super() with this:
>>>>>>>>>>
>>>>>>>>>> super(ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties", true));
>>>>>>>>>>
>>>>>>>>>> Tilman
>>>>>>>>>>
>>>>>>>>>> On 27.07.2014 at 13:03, Tilman Hausherr wrote:
>>>>>>>>>>
>>>>>>>>>>> After having written the text below, I tested by including the
>>>>>>>>>>> "rg" operator in the properties list and now it worked. I also
>>>>>>>>>>> tested deleting your println and instead adding this if the
>>>>>>>>>>> text is red:
>>>>>>>>>>>
>>>>>>>>>>> System.out.print(textPos.getCharacter());
>>>>>>>>>>>
>>>>>>>>>>> and so I got this output:
>>>>>>>>>>>
>>>>>>>>>>> 21_Key .1295 R~Wall Prof LinP 0.003 0.004 0.000 true
>>>>>>>>>>>
>>>>>>>>>>> which is exactly what is red in the PDF.
>>>>>>>>>>>
>>>>>>>>>>> Another way (probably better) to do it would be to derive not
>>>>>>>>>>> from PDFTextStripper but from PDFStreamEngine, and construct it
>>>>>>>>>>> with
>>>>>>>>>>>
>>>>>>>>>>> ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties")
>>>>>>>>>>>
>>>>>>>>>>> see also http://stackoverflow.com/a/9157714/535646
>>>>>>>>>>>
>>>>>>>>>>> Tilman
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 27.07.2014 at 12:14, Tilman Hausherr wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> Do you still have the code that worked?
>>>>>>>>>>>>
>>>>>>>>>>>> I'm not the text extraction specialist here, but what I did
>>>>>>>>>>>> was to look in the uncompressed source of the PDF. The stream
>>>>>>>>>>>> has code like this:
>>>>>>>>>>>>
>>>>>>>>>>>> 0 0 0 rg
>>>>>>>>>>>> 0 0.5019 0 rg
>>>>>>>>>>>> 1 0 0 rg
>>>>>>>>>>>>
>>>>>>>>>>>> The first line sets the color to black, the second to green,
>>>>>>>>>>>> the third to red. And from what I saw, it can't work at all,
>>>>>>>>>>>> because the "rg" operator isn't processed when extracting
>>>>>>>>>>>> text, because PDFTextStripper.properties doesn't contain the
>>>>>>>>>>>> "rg" operator. (The operator is in another list, which is used
>>>>>>>>>>>> when rendering.)
>>>>>>>>>>>>
>>>>>>>>>>>> So that is what puzzles me. I think it can't work at all. But
>>>>>>>>>>>> you said it did work at one time.
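
To make the operand reading above concrete, here is a small, self-contained sketch that parses lines of the form `r g b rg` and checks for pure red. It does not use PDFBox: `RgOperandSketch`, `parseRg` and `isRed` are hypothetical helpers, and splitting on whitespace is an assumption that only holds for simple lines like the three quoted above, not for real content streams.

```java
class RgOperandSketch {

    /**
     * Parses a line such as "1 0 0 rg" into its three color components.
     * Returns null if the line is not exactly three numbers followed by "rg".
     * The match is case-sensitive, since "RG" is a different operator
     * (stroking color) from "rg" (non-stroking color).
     */
    static float[] parseRg(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens.length != 4 || !"rg".equals(tokens[3])) {
            return null;
        }
        try {
            return new float[] {
                Float.parseFloat(tokens[0]),
                Float.parseFloat(tokens[1]),
                Float.parseFloat(tokens[2])
            };
        } catch (NumberFormatException e) {
            return null;
        }
    }

    /** True only for pure red: full red component, no green, no blue. */
    static boolean isRed(float[] rgb) {
        return rgb != null && rgb[0] == 1f && rgb[1] == 0f && rgb[2] == 0f;
    }

    public static void main(String[] args) {
        // The three example lines from the message above.
        String[] lines = { "0 0 0 rg", "0 0.5019 0 rg", "1 0 0 rg" };
        for (String line : lines) {
            System.out.println(line + " -> red: " + isRed(parseRg(line)));
        }
    }
}
```

A check like this can only ever fire if the engine actually dispatches the `rg` operator, which is the point made above about the operator being absent from PDFTextStripper.properties.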
>>>>>>>>>>>>
>>>>>>>>>>>> Tilman
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 27.07.2014 at 07:43, Tilman Hausherr wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> Please upload the PDF somewhere and post the URL; PDF files
>>>>>>>>>>>>> are removed from the mailing list.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Tilman
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 27.07.2014 at 02:35, -A wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello again. I've been trying to figure out this issue that
>>>>>>>>>>>>>> has come up for me and in my research I found someone
>>>>>>>>>>>>>> posting on StackOverflow
>>>>>>>>>>>>>> (http://stackoverflow.com/questions/10844271/how-to-get-font-color-using-pdfbox)
>>>>>>>>>>>>>> a similar issue where they could not read any colors from a
>>>>>>>>>>>>>> PDF. The user posted the code and someone else took it, ran
>>>>>>>>>>>>>> it, and reported that it worked. The user's approach was
>>>>>>>>>>>>>> different than mine, but alas.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm not sure at this point what is going on. I have stepped
>>>>>>>>>>>>>> through each individual character and checked the
>>>>>>>>>>>>>> PDGraphicsState object, and even when I am looking at an
>>>>>>>>>>>>>> open file with visibly red text (attached) the debugger only
>>>>>>>>>>>>>> reports DeviceGray. If I print out the ColorSpace name from
>>>>>>>>>>>>>> the PDGraphicsState this is what is printed - for every
>>>>>>>>>>>>>> character.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would appreciate it if someone could perhaps run the
>>>>>>>>>>>>>> attached text stripper with the attached PDF file and report
>>>>>>>>>>>>>> back if it actually prints true instead of false, as it does
>>>>>>>>>>>>>> for me. Since I saw this occurrence elsewhere I'd like to
>>>>>>>>>>>>>> rule that out - in case an IDE setting of some sort may be
>>>>>>>>>>>>>> causing this?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It should be noted that I began using PDFBox with 1.8.5 and
>>>>>>>>>>>>>> had this code working fine. Still with 1.8.5 yesterday it
>>>>>>>>>>>>>> was failing. Upgrading to 1.8.6 yielded the same results.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If this is an actual issue I do not mind attempting to solve
>>>>>>>>>>>>>> it if someone has a general idea where to point me, so as to
>>>>>>>>>>>>>> prevent needless meddling with graphics state objects. Or,
>>>>>>>>>>>>>> if this should be reported I can do that as well.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Aaron
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *Previous Message:*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've attached an updated stripper file with the only
>>>>>>>>>>>>>> addition being a main function to test the class
>>>>>>>>>>>>>> specifically.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When run with the PDF I have also attached it indeed does
>>>>>>>>>>>>>> not recognize the red text.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> At this point it seems that this issue is solely dependent
>>>>>>>>>>>>>> on PDFBox. I'll stay tuned for some insight hopefully. If
>>>>>>>>>>>>>> any other information is needed, let me know!