We need to get rid of these .properties files. They’re causing endless confusion, not to mention that they hide runtime dependencies in text files.
We should make it so that overriding a TextStripper, PageDrawer, etc. doesn’t require external .properties files. Preflight currently works this way, and it’s much clearer. I guess this is a legacy of the “old” ways of Java XML everything.

-- John

On 27 Jul 2014, at 10:09, -A <aa...@hrtmn.net> wrote:

> Thank you, that works as promised and removes the warning. I'm still hoping
> to find a resource that better explains the pieces of PDFBox and how they
> work together. Unfortunately most posts on the internet are solely how and
> not why.
>
> Appreciate it!
>
> -Aaron
>
>
> On Sun, Jul 27, 2014 at 8:00 AM, Tilman Hausherr <thaush...@t-online.de> wrote:
>
>> Hi,
>>
>> That didn't happen to me, but maybe it did happen to you with another file.
>>
>> Another solution would be to pass your own properties file, and it should
>> have this content:
>>
>> =======================
>> # Licensed to the Apache Software Foundation (ASF) under one or more
>> # contributor license agreements. See the NOTICE file distributed with
>> # this work for additional information regarding copyright ownership.
>> # The ASF licenses this file to You under the Apache License, Version 2.0
>> # (the "License"); you may not use this file except in compliance with
>> # the License. You may obtain a copy of the License at
>> #
>> # http://www.apache.org/licenses/LICENSE-2.0
>> #
>> # Unless required by applicable law or agreed to in writing, software
>> # distributed under the License is distributed on an "AS IS" BASIS,
>> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>> # See the License for the specific language governing permissions and
>> # limitations under the License.
>>
>> # This table maps PDF stream operators to concrete OperatorProcessor
>> # subclasses that are used by the PDFStreamEngine class to interpret the
>> # PDF document.
>> # The classes configured here allow the PDFTextStripper
>> # subclass of PDFStreamEngine to extract text content of the document.
>>
>> BT = org.apache.pdfbox.util.operator.BeginText
>> cm = org.apache.pdfbox.util.operator.Concatenate
>> Do = org.apache.pdfbox.util.operator.Invoke
>> ET = org.apache.pdfbox.util.operator.EndText
>> gs = org.apache.pdfbox.util.operator.SetGraphicsStateParameters
>> q = org.apache.pdfbox.util.operator.GSave
>> Q = org.apache.pdfbox.util.operator.GRestore
>> T* = org.apache.pdfbox.util.operator.NextLine
>> Tc = org.apache.pdfbox.util.operator.SetCharSpacing
>> Td = org.apache.pdfbox.util.operator.MoveText
>> TD = org.apache.pdfbox.util.operator.MoveTextSetLeading
>> Tf = org.apache.pdfbox.util.operator.SetTextFont
>> Tj = org.apache.pdfbox.util.operator.ShowText
>> TJ = org.apache.pdfbox.util.operator.ShowTextGlyph
>> TL = org.apache.pdfbox.util.operator.SetTextLeading
>> Tm = org.apache.pdfbox.util.operator.SetMatrix
>> Tr = org.apache.pdfbox.util.operator.SetTextRenderingMode
>> Ts = org.apache.pdfbox.util.operator.SetTextRise
>> Tw = org.apache.pdfbox.util.operator.SetWordSpacing
>> Tz = org.apache.pdfbox.util.operator.SetHorizontalTextScaling
>> w = org.apache.pdfbox.util.operator.SetLineWidth
>> \' = org.apache.pdfbox.util.operator.MoveAndShow
>> \" = org.apache.pdfbox.util.operator.SetMoveAndShow
>>
>> CS=org.apache.pdfbox.util.operator.SetStrokingColorSpace
>> cs=org.apache.pdfbox.util.operator.SetNonStrokingColorSpace
>> rg=org.apache.pdfbox.util.operator.SetNonStrokingRGBColor
>> G=org.apache.pdfbox.util.operator.SetStrokingGrayColor
>> g=org.apache.pdfbox.util.operator.SetNonStrokingGrayColor
>> K=org.apache.pdfbox.util.operator.SetStrokingCMYKColor
>> k=org.apache.pdfbox.util.operator.SetNonStrokingCMYKColor
>> RG=org.apache.pdfbox.util.operator.SetStrokingRGBColor
>> rg=org.apache.pdfbox.util.operator.SetNonStrokingRGBColor
>> SC=org.apache.pdfbox.util.operator.SetStrokingColor
>> sc=org.apache.pdfbox.util.operator.SetNonStrokingColor
>> SCN=org.apache.pdfbox.util.operator.SetStrokingColor
>> scn=org.apache.pdfbox.util.operator.SetNonStrokingColor
>>
>> # The following operators are not relevant to text extraction,
>> # so we can silently ignore them.
>>
>> b
>> B
>> b*
>> B*
>> BDC
>> BI
>> BMC
>> BX
>> c
>> d
>> d0
>> d1
>> DP
>> EI
>> EMC
>> EX
>> f
>> F
>> f*
>> h
>> i
>> ID
>> j
>> J
>> l
>> m
>> M
>> MP
>> n
>> re
>> ri
>> s
>> S
>> sh
>> v
>> W
>> W*
>> y
>>
>> =======================
>>
>> Tilman
>>
>> On 27.07.2014 15:54, -A wrote:
>>
>>> Tilman;
>>>
>>> That is somewhat embarrassing. At one point I brought this to the mailing
>>> list (because of the following warning) and was told to remove that line
>>> because the TextStripper wasn't actually a PageDrawer. The functionality
>>> still worked after that, however.
>>>
>>> Is there a way to do this without the warning, perhaps something within
>>> PageDrawer?
>>>
>>>
>>> Thank you,
>>> -Aaron
>>>
>>>
>>> WARNING: java.lang.ClassCastException: IncrementalPDFStripper cannot be cast to org.apache.pdfbox.pdfviewer.PageDrawer
>>> java.lang.ClassCastException: IncrementalPDFStripper cannot be cast to org.apache.pdfbox.pdfviewer.PageDrawer
>>>     at org.apache.pdfbox.util.operator.pagedrawer.AppendRectangleToPath.process(AppendRectangleToPath.java:46)
>>>     at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
>>>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>>>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>>>     at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>>>     at IncrementalPDFStripper.containsRed(IncrementalPDFStripper.java:90)
>>>     at IncrementalPDFStripper.main(IncrementalPDFStripper.java:56)
>>>
>>>
>>> On Sun, Jul 27, 2014 at 5:47 AM, Tilman Hausherr <thaush...@t-online.de> wrote:
>>>
>>>> It is even easier than I thought - replace super() with this:
>>>>
>>>> super(ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties", true));
>>>>
>>>> Tilman
>>>>
>>>> On 27.07.2014 13:03, Tilman Hausherr wrote:
>>>>
>>>>> After having written the text below, I tested by including the "rg"
>>>>> operator in the properties list and now it worked. I also tested
>>>>> deleting your println and instead adding this if the text is red:
>>>>>
>>>>> System.out.print (textPos.getCharacter());
>>>>>
>>>>> and so I got this output:
>>>>>
>>>>> 21_Key .1295 R~Wall Prof LinP 0.003 0.004 0.000 true
>>>>>
>>>>> which is exactly what is red in the PDF.
>>>>>
>>>>> Another way (probably better) to do it would be to derive not from
>>>>> PDFTextStripper but from PDFStreamEngine, and construct it with
>>>>>
>>>>> ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties")
>>>>>
>>>>> see also http://stackoverflow.com/a/9157714/535646
>>>>>
>>>>> Tilman
>>>>>
>>>>>
>>>>> On 27.07.2014 12:14, Tilman Hausherr wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Do you still have the code that worked?
>>>>>>
>>>>>> I'm not the text extraction specialist here, but what I did was to look
>>>>>> in the uncompressed source of the PDF. The stream has code like this:
>>>>>>
>>>>>> 0 0 0 rg
>>>>>> 0 0.5019 0 rg
>>>>>> 1 0 0 rg
>>>>>>
>>>>>> The first line sets the color to black, the second to green, the third
>>>>>> to red. And from what I saw, it can't work at all, because the "rg"
>>>>>> operator isn't processed when extracting text, because
>>>>>> PDFTextStripper.properties doesn't contain the "rg" operator. (The
>>>>>> operator is in another list, which is used when rendering.)
>>>>>>
>>>>>> So that is what puzzles me. I think it can't work at all. But you said
>>>>>> it did work at one time.
>>>>>>
>>>>>> Tilman
>>>>>>
>>>>>>
>>>>>> On 27.07.2014 07:43, Tilman Hausherr wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Please upload the PDF somewhere and post the URL; PDF files are
>>>>>>> removed from the mailing list.
>>>>>>>
>>>>>>> Tilman
>>>>>>>
>>>>>>> On 27.07.2014 02:35, -A wrote:
>>>>>>>
>>>>>>>> Hello again. I've been trying to figure out this issue that has come
>>>>>>>> up for me and in my research I found someone posting on
>>>>>>>> StackOverflow (
>>>>>>>> http://stackoverflow.com/questions/10844271/how-to-get-font-color-using-pdfbox)
>>>>>>>> a similar issue where they could not read any colors from a PDF. The
>>>>>>>> user posted the code and someone else took it, ran it, and reported
>>>>>>>> that it worked. The user's approach was different than mine, but alas.
>>>>>>>>
>>>>>>>> I'm not sure at this point what is going on. I have stepped through
>>>>>>>> each individual character and checked the PDGraphicsState object,
>>>>>>>> and even when I am looking at an open file with visibly red text
>>>>>>>> (attached) the debugger only reports DeviceGray. If I print out the
>>>>>>>> ColorSpace name from the PDGraphicsState this is what is printed -
>>>>>>>> for every character.
>>>>>>>>
>>>>>>>> I would appreciate if someone could perhaps run the attached text
>>>>>>>> stripper with the attached PDF file and report back if it actually
>>>>>>>> prints true instead of false, as it does for me. Since I saw this
>>>>>>>> occurrence elsewhere I'd like to rule that out - in case an IDE
>>>>>>>> setting of some sort may be causing this?
>>>>>>>>
>>>>>>>> It should be noted that I began using PDFBox with 1.8.5 and had this
>>>>>>>> code working fine. Still with 1.8.5 yesterday it was failing.
>>>>>>>> Upgrading to 1.8.6 yielded the same results.
>>>>>>>>
>>>>>>>> If this is an actual issue I do not mind attempting to solve it if
>>>>>>>> someone has a general idea where to point me, so as to prevent
>>>>>>>> needless meddling with graphics state objects. Or, if this should be
>>>>>>>> reported I can do that as well.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> -Aaron
>>>>>>>>
>>>>>>>>
>>>>>>>> *Previous Message:*
>>>>>>>>
>>>>>>>> I've attached an updated stripper file with the only addition being a
>>>>>>>> main function to test the class specifically.
>>>>>>>>
>>>>>>>> When run with the PDF I have also attached it indeed does not
>>>>>>>> recognize the red text.
>>>>>>>>
>>>>>>>> At this point it seems that this issue is solely dependent on PDFBox.
>>>>>>>> I'll stay tuned for some insight hopefully. If any other information
>>>>>>>> is needed, let me know!
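
The explanation given in this thread is that an operator such as "rg" is only processed when the properties table in use maps it to a handler class. A minimal sketch of that lookup follows. It uses only java.util.Properties and is not PDFBox's actual implementation: the class OperatorTableSketch, its method names and the example.* handler names are all hypothetical and exist only for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Illustration only, NOT PDFBox code. It mimics an operator table kept in
// .properties syntax; the example.* handler names are made up.
class OperatorTableSketch {

    // Parse a table where each line is "operator = handler class name".
    // A bare operator with no value is stored with an empty string.
    static Properties loadTable(String text) throws IOException {
        Properties table = new Properties();
        table.load(new StringReader(text));
        return table;
    }

    // Keep only the operators that the table maps to a non-empty handler name.
    // Unlisted operators and bare (ignored) operators are dropped.
    static List<String> handled(Properties table, String... operators) {
        List<String> result = new ArrayList<>();
        for (String op : operators) {
            String handler = table.getProperty(op);
            if (handler != null && !handler.isEmpty()) {
                result.add(op);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // A text-only table: "re" is listed without a handler, "rg" is absent.
        String textOnly = "BT = example.BeginText\n"
                + "Tj = example.ShowText\n"
                + "ET = example.EndText\n"
                + "re\n";
        // The same table with a color operator added.
        String withColor = textOnly + "rg = example.SetNonStrokingRGBColor\n";

        String[] stream = {"BT", "rg", "Tj", "re", "ET"};
        System.out.println(handled(loadTable(textOnly), stream));
        System.out.println(handled(loadTable(withColor), stream));
    }
}
```

With the text-only table the "rg" operator never reaches a handler, which is consistent with the DeviceGray-only result reported above. Once the table contains an "rg" entry, the operator is dispatched like any other.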