PDFBox 3.0.1 renderer fails on certain files
I have a customer that uses a LOT of PDF files. They currently have 2 files that fail when we try to render them. The same files can be viewed with Acrobat Reader or Foxit PDF with no errors reported.

From Acrobat Reader file info:
PDF Producer: PDFOut V3.8 – build 201 – Oct 28 2022
PDF Version: 1.6 (Acrobat 7.x)

The stack trace makes me suspect that the file has an error in its image compression data, which other readers somehow ignore. Any suggestions?

This is the exception trace from PDFBox 3.0.1:

java.io.IOException: negative array index: -1 near offset 1
    at org.apache.pdfbox.filter.LZWFilter.checkIndexBounds(LZWFilter.java:136) ~[pdfbox-3.0.1.jar:3.0.1]
    at org.apache.pdfbox.filter.LZWFilter.doLZWDecode(LZWFilter.java:110) ~[pdfbox-3.0.1.jar:3.0.1]
    at org.apache.pdfbox.filter.LZWFilter.decode(LZWFilter.java:70) ~[pdfbox-3.0.1.jar:3.0.1]
    at org.apache.pdfbox.filter.Filter.decode(Filter.java:96) ~[pdfbox-3.0.1.jar:3.0.1]
    at org.apache.pdfbox.filter.Filter.decode(Filter.java:238) ~[pdfbox-3.0.1.jar:3.0.1]
    at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:73) ~[pdfbox-3.0.1.jar:3.0.1]
    at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:172) ~[pdfbox-3.0.1.jar:3.0.1]
    at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:166) ~[pdfbox-3.0.1.jar:3.0.1]
    at org.apache.pdfbox.pdmodel.common.PDStream.createInputStream(PDStream.java:188) ~[pdfbox-3.0.1.jar:3.0.1]
    at org.apache.pdfbox.pdmodel.common.PDStream.toByteArray(PDStream.java:407) ~[pdfbox-3.0.1.jar:3.0.1]
    at org.apache.pdfbox.pdmodel.common.function.PDFunctionType4.<init>(PDFunctionType4.java:51) ~[pdfbox-3.0.1.jar:3.0.1]
    at org.apache.pdfbox.pdmodel.common.function.PDFunction.create(PDFunction.java:143) ~[pdfbox-3.0.1.jar:3.0.1]
    at org.apache.pdfbox.pdmodel.graphics.color.PDDeviceN.<init>(PDDeviceN.java:93) ~[pdfbox-3.0.1.jar:3.0.1]
    at org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace.create(PDColorSpace.java:184) ~[pdfbox-3.0.1.jar:3.0.1]
    at org.apache.pdfbox.pdmodel.PDResources.getColorSpace(PDResources.java:223) ~[pdfbox-3.0.1.jar:3.0.1]
    at org.apache.pdfbox.pdmodel.PDResources.getColorSpace(PDResources.java:193) ~[pdfbox-3.0.1.jar:3.0.1]
    at org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColorSpace.process(SetNonStrokingColorSpace.java:56) ~[pdfbox-3.0.1.jar:3.0.1]
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:892) ~[pdfbox-3.0.1.jar:3.0.1]
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:530) ~[pdfbox-3.0.1.jar:3.0.1]
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:505) ~[pdfbox-3.0.1.jar:3.0.1]
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:152) ~[pdfbox-3.0.1.jar:3.0.1]
    at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:282) ~[pdfbox-3.0.1.jar:3.0.1]
    at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:330) ~[pdfbox-3.0.1.jar:3.0.1]
    at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:247) ~[pdfbox-3.0.1.jar:3.0.1]
    at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:233) ~[pdfbox-3.0.1.jar:3.0.1]
    at com.metrixsoftware.preview.PDFBoxRenderer.render(PDFBoxRenderer.java:79) [bin/:?]

-
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org
Re: Odd OCG error
Thanks, that really helps.

Since we are too close to release to try a newer PDFBox jar, I just added this little bit of code to our system so these PDFs will work (the if statement before creating the "PDOptionalContentGroup"):

    if (!dict.getItem(COSName.TYPE).equals(COSName.OCG)) {
        dict.setItem(COSName.TYPE, COSName.OCG);
    }
    PDOptionalContentGroup grp = new PDOptionalContentGroup(dict);

On 11/21/2023 10:52 PM, Andreas Lehmkühler wrote:
> On 21.11.23 at 21:26, John Lussmyer wrote:
>> Ugh, formatting mess.
>> For more info, this is the "addOCGs:OCG" log line just before the error message:
>> 10:53:09.765 [etrix SwingWorker[0]] DEBUG ImposedPDFEngine - addOCGs: OCG COSDictionary{COSName{Name}:COSObject{COSNull{}};COSName{Type}:COSObject{COSName{OCG}};}
>
> The value for the type is an indirect object. Usually such values are direct objects. The type check fails as it expects a direct object as type value.
Re: Odd OCG error
Ugh, formatting mess.

For more info, this is the "addOCGs:OCG" log line just before the error message:

10:53:09.765 [etrix SwingWorker[0]] DEBUG ImposedPDFEngine - addOCGs: OCG COSDictionary{COSName{Name}:COSObject{COSNull{}};COSName{Type}:COSObject{COSName{OCG}};}

On 11/21/2023 10:56 AM, John Lussmyer wrote:
> I'm using PDFBox 3.0.0 to combine some PDF files. One of the files uses an Optional Content Group.
> Note that this code has been working just fine for many other files both with and without OCGs.
> For this file, I get this exception:
>
> java.lang.IllegalArgumentException: Provided dictionary is not of type 'COSName{OCG}'
>     at org.apache.pdfbox.pdmodel.graphics.optionalcontent.PDOptionalContentGroup.<init>(PDOptionalContentGroup.java:48) ~[pdfbox-3.0.0.jar:3.0.0]
>
> Code:
>
>     if (obj instanceof COSDictionary) {
>         COSDictionary dict = (COSDictionary) obj;
>         COSName dType = dict.getCOSName(COSName.TYPE);
>         if (dType == null) {
>             continue;
>         }
>         if (dType.equals(COSName.OCG)) {
>             log.debug("addOCGs: OCG {}", dict);
>             PDOptionalContentGroup grp = new PDOptionalContentGroup(dict);
>             ocProps.addGroup(grp);
>             ocProps.setGroupEnabled(grp, layersON.contains(grp.getName()));
>             changed = true;
>         }
>     }
>
> It's failing on the "new PDOptionalContentGroup(dict)" call. Any ideas on why?
Odd OCG error
I'm using PDFBox 3.0.0 to combine some PDF files. One of the files uses an Optional Content Group.
Note that this code has been working just fine for many other files both with and without OCGs.
For this file, I get this exception:

java.lang.IllegalArgumentException: Provided dictionary is not of type 'COSName{OCG}'
    at org.apache.pdfbox.pdmodel.graphics.optionalcontent.PDOptionalContentGroup.<init>(PDOptionalContentGroup.java:48) ~[pdfbox-3.0.0.jar:3.0.0]

Code:

    if (obj instanceof COSDictionary) {
        COSDictionary dict = (COSDictionary) obj;
        COSName dType = dict.getCOSName(COSName.TYPE);
        if (dType == null) {
            continue;
        }
        if (dType.equals(COSName.OCG)) {
            log.debug("addOCGs: OCG {}", dict);
            PDOptionalContentGroup grp = new PDOptionalContentGroup(dict);
            ocProps.addGroup(grp);
            ocProps.setGroupEnabled(grp, layersON.contains(grp.getName()));
            changed = true;
        }
    }

It's failing on the "new PDOptionalContentGroup(dict)" call. Any ideas on why?
Re: PDF 2.0, PDF/A-4 support
On 11/8/2023 5:28 PM, Peter Wyatt wrote:
> I would think supporting the following PDF 2.0 features is highly relevant, given that other implementations are already generating PDF 2.0 files today (see https://pdfa.org/supporting-pdf20/)

A bunch of useful suggestions elided...

What I REALLY REALLY need is support for Overprint and Knockout when rendering to an image. I run into too many PDFs that are unrecognizable when generating an image due to this.
Re: Looking for a Debugger that can show which incremental save an object belongs to
I doubt there is a way.
It's most likely that the signing code makes an MD5 checksum (or similar) of the file when it is signed. If the file is changed, checking the signature will re-calculate the checksum and find that it is different. There isn't any info on what changed, just that SOMETHING changed.

On 10/6/2023 8:50 PM, Tilman Hausherr wrote:
> On 06.10.2023 19:50, Marc Kaufman wrote:
>> I find myself debugging PDF files where Acrobat claims "Document has been altered or corrupted since it was signed." I would dearly love to see which objects belong to the last xref (color code is OK). Has anyone added that feature to PDF Debugger, or know where I can find one? Just comparing revisions is not enough, since sometimes the "changed" object is identical to the same object in the previous revision.
>
> I don't know of any. I research such questions the hard way, with NOTEPAD++.
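To illustrate the point that a checksum can only say THAT something changed, here is a minimal sketch using only the JDK. It is not how Acrobat or PDFBox validate a signature: the class name, the sample bytes and the choice of SHA-256 are assumptions for illustration (a real PDF signature covers the byte ranges listed in its /ByteRange entry, not necessarily the whole file).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: compares digests of "signed" and "current" bytes.
class DigestCheck {
    static byte[] digest(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be present on every Java platform
            throw new IllegalStateException(e);
        }
    }

    // true if both byte sequences hash to the same value
    static boolean unchanged(byte[] signedBytes, byte[] currentBytes) {
        return MessageDigest.isEqual(digest(signedBytes), digest(currentBytes));
    }

    public static void main(String[] args) {
        byte[] signed = "1 0 obj << /Type /Catalog >> endobj"
                .getBytes(StandardCharsets.US_ASCII);
        // one extra space: same PDF meaning, different bytes
        byte[] edited = "1 0 obj << /Type /Catalog  >> endobj"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println("identical copy unchanged: " + unchanged(signed, signed.clone()));
        System.out.println("edited copy unchanged:    " + unchanged(signed, edited));
    }
}
```

The boolean is all the check yields; locating the differing object still means comparing the revisions byte by byte.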
Re: how to replace MemoryUsageSetting.setupMixed(100mb) ?
Thanks, that does help.
Having an example means I'll find the relevant classes to use MUCH faster.

On 10/5/2023 3:07 PM, Pados Attila wrote:
> I am using something like this:
>
>     PDDocument a1doc = Loader.loadPDF(new RandomAccessReadBuffer(resourceAsStream),
>             () -> new ScratchFile(MemoryUsageSetting.setupMixed(100)));
>
> (I use it with tempFileOnly, but the rest are the same)
>
> On Thu, Oct 5, 2023 at 9:50 PM John Lussmyer wrote:
>> I'm trying to update to the latest PDFBox 3.0.0.
>> The code was using a call to
>>
>>     loadPDF(file, MemoryUsageSetting.setupMixed(MB100)); // 100 MB
>>
>> I see that it no longer exists, but the only mention of it doesn't seem to provide any info on how to configure an equivalent replacement.
>> Any suggestions?
how to replace MemoryUsageSetting.setupMixed(100mb) ?
I'm trying to update to the latest PDFBox 3.0.0.
The code was using a call to

    loadPDF(file, MemoryUsageSetting.setupMixed(MB100)); // 100 MB

I see that it no longer exists, but the only mention of it doesn't seem to provide any info on how to configure an equivalent replacement.
Any suggestions?
RE: Optional Content Groups
Ah, thanks. I hadn't noticed the "show internal structure" choice. The tool I used to use just did that normally. (I thought things looked a bit odd...)

From: Tilman Hausherr
Sent: Wednesday, January 4, 2023 10:29 AM
To: users@pdfbox.apache.org
Subject: Re: Optional Content Groups

On 04.01.2023 19:22, John Lussmyer wrote:
> I have a pdf with several Optional Content groups.
> I can find their definitions in the Page/Resources/Properties dictionary, but I don't see how they are enabled or disabled.
> Where is that controlled?

This is below the document root, use PDFDebugger to look at it (first click "view", "show internal structure"). To learn more, you'll have to read the PDF specification, although some can be understood by looking at the structure below.

This is from the file at https://issues.apache.org/jira/browse/PDFBOX-5524 which has two "off" OCGs.

Tilman

Confidentiality notice: This message may contain confidential information. It is intended only for the person to whom it is addressed. If you are not that person, you should not use this message. We request that you notify us by replying to this message, and then delete all copies including any contained in your reply. Thank you.
Optional Content Groups
I have a pdf with several Optional Content groups. I can find their definitions in the Page/Resources/Properties dictionary, but I don't see how they are enabled or disabled. Where is that controlled?
Re: Possible bug with FunctionType3?
I was able to get hold of the customer's PDF file, but it (of course) works just FINE for me on my system.
I have logs showing multiple identical failures for the customer, and lots of other files succeeding.
I'd really like to test your possible fix, but first I have to figure out how to reproduce the problem.

On Tue Jun 14 21:06:02 PDT 2022 thaush...@t-online.de said:
>On 15.06.2022 at 05:42, Tilman Hausherr wrote:
>> float[] functionResult = function.eval(functionValues);
>>
>> eval is an abstract method, but I don't see how any of its
>> implementation would return null :-( (but I just woke up)
>
>oops, the return of eval() is irrelevant here.
>
>Anyway, I fixed the possible bug below in
>https://issues.apache.org/jira/browse/PDFBOX-5459 , try a snapshot in an
>hour or two
>
>https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/3.0.0-SNAPSHOT/

--
Tigers prowl and Dragons soar in my dreams...
Possible bug with FunctionType3?
We are using PDFBox to render various PDF files in our product.
One customer is having issues due to PDFBox throwing a NullPointerException when certain files are rendered.
(No, I don't have copies of the files - yet)
Any ideas on what could cause this?

java.lang.NullPointerException: null
    at org.apache.pdfbox.pdmodel.common.function.PDFunctionType3.eval(PDFunctionType3.java:123) ~[pdfbox.jar:?]
    at org.apache.pdfbox.pdmodel.graphics.shading.PDShading.evalFunction(PDShading.java:410) ~[pdfbox.jar:?]
    at org.apache.pdfbox.pdmodel.graphics.shading.PDShading.evalFunction(PDShading.java:393) ~[pdfbox.jar:?]
    at org.apache.pdfbox.pdmodel.graphics.shading.AxialShadingContext.calcColorTable(AxialShadingContext.java:151) ~[pdfbox.jar:?]
    at org.apache.pdfbox.pdmodel.graphics.shading.AxialShadingContext.<init>(AxialShadingContext.java:128) ~[pdfbox.jar:?]
    at org.apache.pdfbox.pdmodel.graphics.shading.AxialShadingPaint.createContext(AxialShadingPaint.java:62) ~[pdfbox.jar:?]
    at sun.java2d.pipe.AlphaPaintPipe.startSequence(Unknown Source) ~[?:?]
    at sun.java2d.pipe.SpanShapeRenderer$Composite.startSequence(Unknown Source) ~[?:?]
    at sun.java2d.pipe.SpanShapeRenderer.renderSpans(Unknown Source) ~[?:?]
    at sun.java2d.pipe.SpanShapeRenderer.fill(Unknown Source) ~[?:?]
    at sun.java2d.pipe.ValidatePipe.fill(Unknown Source) ~[?:?]
    at sun.java2d.SunGraphics2D.fill(Unknown Source) ~[?:?]
    at org.apache.pdfbox.rendering.PageDrawer.shadingFill(PageDrawer.java:1234) ~[pdfbox.jar:?]

I believe the version we are using is 3.0.0-alpha2.
Possible PDFBox bug?
We have an app that can generate multi-page PDF files. We recently ran into a problem where the library we were using would keep ALL the pages in memory.
As a quick workaround we have it write out single-page PDF files, then use PDFBox to combine them.
We recently found a bug in the way that the pages get modified when combined into a single PDF.
When we generate the pages, sometimes the MediaBox starts at negative coordinates.
When PDFBox adds that page to a document, it offsets it by that negative amount, which moves the page content up and to the right.

Our page combining code looks like this:

    try (PDDocument doc = new PDDocument(MemoryUsageSetting.setupTempFileOnly())) {
        for (File pagFile : srcPages) {
            log.debug("make: page {}", pagFile.getAbsolutePath());
            PDPage page = new PDPage();
            doc.addPage(page);
            try (PDPageContentStream contents = new PDPageContentStream(doc, page)) {
                try (PDDocument sourceDoc = Loader.loadPDF(pagFile, MemoryUsageSetting.setupTempFileOnly())) {
                    PDPage srcPage = sourceDoc.getPage(0);
                    page.setUserUnit(srcPage.getUserUnit());
                    page.setMediaBox(srcPage.getMediaBox());
                    page.setCropBox(srcPage.getCropBox());
                    page.setTrimBox(srcPage.getTrimBox());
                    // Create a Form XObject from the source document using LayerUtility
                    LayerUtility layerUtility = new LayerUtility(doc);
                    PDFormXObject form = layerUtility.importPageAsForm(sourceDoc, 0);
                    // draw the full form
                    contents.drawForm(form);
                }
            }
        }
        doc.save(outPDF);
    }

The original page PDF has a TrimBox[0,0,1296,864] and a MediaBox[-72,-72,1368,936].
The page in the PDFBox combined output has the same TrimBox and MediaBox, BUT the /Form1 it uses to place the contents has a BBox[-72,-72,1368,936] and a Matrix[1,0,0,1,72,72].
I'm not sure why it's adding a Matrix to offset the content.
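The arithmetic behind that shift can be checked with plain java.awt.geom, without PDFBox. This is only a sketch of the numbers reported above: the class and method names are made up for illustration, and it does not explain why LayerUtility writes that matrix. It shows that the reported Matrix moves the TrimBox origin from (0,0) to (72,72), and that a translation by the MediaBox lower-left corner (-72,-72) would cancel it.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

// Hypothetical demo class: applies a form /Matrix to a point in form space.
class FormOffsetDemo {
    // Returns where (x, y) in form space lands on the page.
    static Point2D place(AffineTransform formMatrix, double x, double y) {
        return formMatrix.transform(new Point2D.Double(x, y), null);
    }

    public static void main(String[] args) {
        // /Matrix [1 0 0 1 72 72] as reported for /Form1
        AffineTransform formMatrix = new AffineTransform(1, 0, 0, 1, 72, 72);

        // The TrimBox origin (0,0) of the source page ends up at (72,72),
        // i.e. 72 units up and to the right of where the page expects it.
        Point2D shifted = place(formMatrix, 0, 0);
        System.out.println("with form matrix only: " + shifted.getX() + ", " + shifted.getY());

        // Translating by the MediaBox lower-left corner (-72,-72) after the
        // form matrix puts the origin back at (0,0).
        AffineTransform compensated = AffineTransform.getTranslateInstance(-72, -72);
        compensated.concatenate(formMatrix);
        Point2D restored = place(compensated, 0, 0);
        System.out.println("with compensation:     " + restored.getX() + ", " + restored.getY());
    }
}
```

If that reasoning holds for the real file, one untested option would be to call contents.transform(Matrix.getTranslateInstance(llx, lly)) with the MediaBox lower-left corner before contents.drawForm(form); treat that as an assumption to verify, not a confirmed fix.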
Problem with text extraction
On Sun Jan 23 10:02:08 PST 2022 rc...@pobox.com said:
>I am using PDFBox's PDFTextStripper.getText() for a particular kind of
>PDF file generated by a government agency, and the text I'm getting does
>not match that displayed by Acrobat Reader for the same files. The
>getText() calls occasionally get characters Reader does not display, and
>in one case getText() gets an "O" instead of the "U" displayed by
>Reader. I would like to know if there's some way I can get the same text as
>Reader displays.

Have you checked for embedded fonts in the PDF?
It's quite possible to have fonts where the code for "A" is NOT the same as the ASCII "A".

--
Worlds only All Electric F-250 truck! http://john.casadelgato.com/Electric-Vehicles/1995-Ford-F-250
Re: memory requirements when merging PDF files?
On Fri Jan 07 08:55:38 PST 2022 ke...@trumpetinc.com said:
>If you use the temporary file memory storage, it should be possible to work
>with very large files.

Thanks, I was hoping there was some way to deal with this case.

I just ran a quick test, generating a 2000 page PDF by placing a 1 page PDF on each output page.
I used LayerUtility and PDFormXObject, as the real usage will involve placing multiple small PDFs on a large page, for many large pages.
The 1 page PDF was 291K, the resulting 2000 page PDF was 168MB.
(I was doing gc() just before reporting the usage.)

Doing it all in memory: 7m 38s, and peaked at 424MB in use.
With setupTempFileOnly on the output document: 7m 1s, 292MB.

--
Try my Sensible Email package! https://sourceforge.net/projects/sensibleemail/
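For anyone wanting to repeat that kind of measurement, this is roughly what "gc() just before reporting the usage" looks like using only the JDK. It is a sketch, not the code used for the numbers above: the class and method names are made up, System.gc() is only a request to the JVM, and the figure is a snapshot at one point in time rather than a true peak.

```java
// Hypothetical helper for rough heap and timing reports around a merge run.
class HeapProbe {
    // Requests a GC, then returns the heap bytes currently in use.
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        System.gc(); // a hint only; the JVM may ignore it
        return rt.totalMemory() - rt.freeMemory();
    }

    // Formats elapsed time since startNanos plus current heap usage in MB.
    static String report(String label, long startNanos) {
        long elapsedSeconds = (System.nanoTime() - startNanos) / 1_000_000_000L;
        long usedMb = usedHeapBytes() / (1024 * 1024);
        return label + ": " + (elapsedSeconds / 60) + "m " + (elapsedSeconds % 60)
                + "s, " + usedMb + "MB in use";
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        byte[] work = new byte[32 * 1024 * 1024]; // stand-in for the real merge
        System.out.println(report("after allocation", start));
        System.out.println("buffer length: " + work.length); // keeps 'work' reachable
    }
}
```

Calling report() every few hundred pages and keeping the largest value gives a closer approximation of the peak than a single reading at the end.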
memory requirements when merging PDF files?
I have a need to merge a couple thousand PDFs into one humongous PDF.
The old tool we use for PDF manipulation runs out of memory, as it builds the result PDF in memory and only writes it out when done.
Can PDFBox do something more like streaming the output as it's built? Or even not load all the source PDF content streams until needed for output?
Re: Rending text in thumbnail images
On Thu Sep 09 10:10:52 PDT 2021 thaush...@t-online.de said:
>In theory one could make separate rendering hints for fonts and for
>ordinary vectors, but that would be messy and hard to understand. (And
>who knows whether it will work for your file)
>
>I recommend that you try doing this yourself by downloading the source
>code and changing PageDrawer and put some hard-coded modifications.
>Search for "graphics.".

I was able to find a bit of time to take a look at this.
I experimented in PageDrawer.drawGlyph and found that I get pretty close to the old renderer if I change the RenderingMode to STROKE instead of FILL.
As far as I can tell, the FILL comes from the PDF interpreting code.

For my use of generating tiny thumbnails, I added the following code. Not sure if it would ever be useful to anyone else. If it might, I'd need to clean it up a bit and do whatever is needed to fit with the project style and conventions.

--I added this to my PageDrawer.java

    public static class BoxKey extends Key
    {
        public static final BoxKey KEY_TEXTHINT = new BoxKey(1984);

        private BoxKey(final int privatekey)
        {
            super(privatekey);
        }

        // Accept only strings that name a RenderingMode constant.
        @Override
        public boolean isCompatibleValue(final Object val)
        {
            if (!(val instanceof String))
            {
                return false;
            }
            try
            {
                RenderingMode.valueOf((String) val);
                return true;
            }
            catch (IllegalArgumentException e)
            {
                // not a RenderingMode name
                return false;
            }
        }
    }

    private RenderingMode textRenderModeHint = null;

-- then in the constructor, added

    if (renderingHints.containsKey(BoxKey.KEY_TEXTHINT))
    {
        textRenderModeHint = RenderingMode.valueOf((String) renderingHints.get(BoxKey.KEY_TEXTHINT));
    }

-- and in drawGlyph, changed renderingMode to be:

    RenderingMode renderingMode = (textRenderModeHint != null) ? textRenderModeHint : state.getTextState().getRenderingMode();

-- and finally, where I actually use the renderer, I added:

    hintlist.put(BoxKey.KEY_TEXTHINT, "STROKE");

--
Try my Sensible Email package! https://sourceforge.net/projects/sensibleemail/
Re: Rending text in thumbnail images
On Wed Sep 08 20:31:47 PDT 2021 thaush...@t-online.de said:
>Ooops, you didn't mention that you turned antialiasing off. The image
>looks as if interpolation was also turned off. If you set rendering
>hints you always have to set all the hints you need. Here's the default:
>
>    private RenderingHints createDefaultRenderingHints(Graphics2D graphics)
>    {
>        RenderingHints r = new RenderingHints(null);
>        r.put(RenderingHints.KEY_INTERPOLATION, isBitonal(graphics) ?
>                RenderingHints.VALUE_INTERPOLATION_NEAREST_NEIGHBOR :
>                RenderingHints.VALUE_INTERPOLATION_BICUBIC);
>        r.put(RenderingHints.KEY_RENDERING,
>                RenderingHints.VALUE_RENDER_QUALITY);
>        r.put(RenderingHints.KEY_ANTIALIASING, isBitonal(graphics) ?
>                RenderingHints.VALUE_ANTIALIAS_OFF :
>                RenderingHints.VALUE_ANTIALIAS_ON);
>        return r;
>    }

So, setting one Rendering Hint discards all default values? What does it use for those others then?

Just tried with this set:

    hintlist.put(RenderingHints.KEY_ANTIALIASING, RenderingHints.VALUE_ANTIALIAS_OFF);
    hintlist.put(RenderingHints.KEY_TEXT_ANTIALIASING, RenderingHints.VALUE_TEXT_ANTIALIAS_OFF);
    hintlist.put(RenderingHints.KEY_INTERPOLATION, RenderingHints.VALUE_INTERPOLATION_BICUBIC);
    hintlist.put(RenderingHints.KEY_RENDERING, RenderingHints.VALUE_RENDER_QUALITY);
    hintlist.put(RenderingHints.KEY_FRACTIONALMETRICS, RenderingHints.VALUE_FRACTIONALMETRICS_ON);

I also tried a variation with VALUE_INTERPOLATION_NEAREST_NEIGHBOR.
No change. Still looks like random pixels scattered on the page.

--
Tigers prowl and Dragons soar in my dreams...
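On the "discards all default values" question, a small JDK-only sketch may help. A RenderingHints object created with new RenderingHints(null) starts out empty, so a caller-supplied map contains only what the caller put into it. The class and method names below are made up for illustration; the three hints mirror the non-bitonal branch of the default quoted above, and how PDFBox treats keys that are absent is not shown here.

```java
import java.awt.RenderingHints;

// Hypothetical helper that builds a complete hint set for colour thumbnails.
class ThumbnailHints {
    // Mirrors the quoted PDFBox defaults for a non-bitonal target.
    static RenderingHints complete() {
        RenderingHints hints = new RenderingHints(null); // starts with no entries
        hints.put(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        hints.put(RenderingHints.KEY_RENDERING,
                RenderingHints.VALUE_RENDER_QUALITY);
        hints.put(RenderingHints.KEY_ANTIALIASING,
                RenderingHints.VALUE_ANTIALIAS_ON);
        return hints;
    }

    public static void main(String[] args) {
        RenderingHints onlyOne = new RenderingHints(null);
        onlyOne.put(RenderingHints.KEY_ANTIALIASING, RenderingHints.VALUE_ANTIALIAS_OFF);
        // A map built with a single put() has no interpolation entry at all.
        System.out.println("interpolation set: "
                + onlyOne.containsKey(RenderingHints.KEY_INTERPOLATION));
        System.out.println("complete set size: " + complete().size());
    }
}
```

The result could then be handed to PDFRenderer.setRenderingHints(...) before rendering; whether antialiasing on or off gives the more recognizable thumbnail for a given file is something to try, not something this sketch settles.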
Re: Rending text in thumbnail images
You can see the difference in the results in the images I used in a StackOverflow posting. (Before I remembered this email list.)
https://stackoverflow.com/questions/69107975/how-to-improve-text-contrast-in-pdfbox-rendered-thumbnail-image

On Wed Sep 08 12:40:14 PDT 2021 cou...@casadelgato.com said:
>On Wed Sep 08 12:20:59 PDT 2021 thaush...@t-online.de said:
>>On 08.09.2021 at 21:16, John Lussmyer wrote:
>>> Ok, just tried that - no change.
>>>
>>> We are currently trying PDFBox 3.0.0-RC1 - is that a problem?
>>
>>No, this is excellent; there will be a new release of another beta in a
>>few days. You can try it here
>>
>>https://dist.apache.org/repos/dist/dev/pdfbox/3.0.0-alpha2/
>>
>>Is the PDFBox code creating the same image size as the old code? What
>>code are you using? Can you share a file and the result (upload to
>>sharehoster)?
>
>Image size is within a couple of pixels. Same format (ARGB). The older
>renderer's image file is about twice as many bytes.
>Can't really share the original as it has some names and account numbers.
>The main feature of the PDF is that it is pure text, no images. 8.5 x 11.

--
Worlds only All Electric F-250 truck! http://john.casadelgato.com/Electric-Vehicles/1995-Ford-F-250
Re: Rending text in thumbnail images
On Wed Sep 08 12:20:59 PDT 2021 thaush...@t-online.de said:
>On 08.09.2021 at 21:16, John Lussmyer wrote:
>> Ok, just tried that - no change.
>>
>> We are currently trying PDFBox 3.0.0-RC1 - is that a problem?
>
>No, this is excellent; there will be a new release of another beta in a
>few days. You can try it here
>
>https://dist.apache.org/repos/dist/dev/pdfbox/3.0.0-alpha2/
>
>Is the PDFBox code creating the same image size as the old code? What
>code are you using? Can you share a file and the result (upload to
>sharehoster)?

Image size is within a couple of pixels. Same format (ARGB). The older renderer's image file is about twice as many bytes.
Can't really share the original as it has some names and account numbers.
The main feature of the PDF is that it is pure text, no images. 8.5 x 11.

--
Try my Sensible Email package! https://sourceforge.net/projects/sensibleemail/
Re: Rending text in thumbnail images
Ok, just tried that - no change.

We are currently trying PDFBox 3.0.0-RC1 - is that a problem?

On Wed Sep 08 11:55:56 PDT 2021 thaush...@t-online.de said:
>The default rendering is high quality over speed, although there is one
>obscure option you could try,
>PDFRenderer.setImageDownscalingOptimizationThreshold(0). And make sure
>you're using the latest version (2.0.24).
>No, there is no option to prioritize text pixels.

--
Bobcats and Cougars, oh my! http://john.casadelgato.com/Pets
Rending text in thumbnail images
We are trying to switch to using PDFBox to create the thumbnail images of PDF pages in our application. (The older product we currently use fails on OS 11.)

I'm running into a problem: if there is text on the page, the thumbnail image makes it hard to make any sense at all of the text. (Yes, these are thumbnails, and don't need to be readable - but they should be recognizable.)

The older renderer created thumbnails where the text, while not legible, was definitely visible, and you could tell the shapes of the words.
PDFBox creates thumbnails where the text is more of a scattering of random pixels, and you have to guess that it might be text.

Is there any way when generating the raster to have it treat text pixels as the higher priority for the color of a pixel?
It seems to only allow the text color to control the pixel if the entire pixel is part of the character.
Re: Parsing huge PDF (400Mb, 2700 pages)
On Thu Nov 14 08:32:20 PST 2019 sahy...@fileaffairs.de said:
>well - PDF is not really easily streamable as
>
>- it's organized as a random access format
>- the reference table about the objects forming the PDF is at the end of the
>file, so you have to read the last parts first and
>then move back

While the PDF file itself can't be usefully streamed, the CONTENT streams can be. Those are usually 99.99% of the file size.

--
Try my Sensible Email package! https://sourceforge.net/projects/sensibleemail/
Re: Exact PDF text - add it back as an annotation
On Tue Oct 29 21:59:57 PDT 2019 thaush...@t-online.de said:
>IIRC tesseract can do this. Not as annotation, but as invisible font.

As far as I can tell, it does it the same way that other programs do. It's added to the content stream, mixed with all the commands for positioning, font size, etc... Words are often broken up.
I'm looking for something that just embeds the plain text, with NO markup.

--
Bobcats and Cougars, oh my! http://john.casadelgato.com/Pets
Exact PDF text - add it back as an annotation
I have a bunch of PDF files that have had an OCR package run against them.
The problem is that it adds the text to the normal page content, and tries to position the recognized text at the location in the image where it was found. So the text is mixed with lots of positioning and similar information.
I'd like to extract all the text as a block of text, and just add it all as a single item. Probably an annotation.

There are lots of tools to extract text from a PDF, but they are all web based, or use a GUI to do one file at a time.
I want to just run this against a directory full of PDFs and have it do all of them.

Anyone know of such a tool? Have one written?
Re: PDFRendering
On Mon Jun 27 14:34:03 PDT 2016 j...@jahewson.com said:
>Right, and if it was a leak then system.gc would not have fixed it.

That is only SOMETIMES true. I've run into "memory leaks" where the leak was uncleared references to objects. So the old objects just hung around forever.

--
Bobcats and Cougars, oh my! http://john.casadelgato.com/Pets
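A tiny JDK-only sketch of that kind of "leak": objects kept alive by a reference nobody remembered to clear. The names are made up for illustration and have nothing to do with PDFBox internals. As long as the static list holds the array, no amount of System.gc() can reclaim it; once the reference is dropped it merely becomes eligible for collection.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;

// Hypothetical example of a leak caused by uncleared references.
class LeakDemo {
    // Stand-in for a cache or listener list that is never emptied.
    private static final List<byte[]> RENDERED = new ArrayList<>();

    // Allocates a "page", stores it in the list, and returns a weak handle
    // so the caller can observe whether the GC was able to reclaim it.
    static WeakReference<byte[]> renderAndForgetToClear() {
        byte[] page = new byte[1024 * 1024];
        RENDERED.add(page);
        return new WeakReference<>(page);
    }

    // Drops the strong references so the GC may reclaim the pages.
    static void release() {
        RENDERED.clear();
    }

    public static void main(String[] args) {
        WeakReference<byte[]> ref = renderAndForgetToClear();
        System.gc();
        // Still strongly reachable through RENDERED, so it cannot be collected.
        System.out.println("alive while referenced: " + (ref.get() != null));
        release();
        System.gc(); // now it MAY be collected; the JVM gives no guarantee when
        System.out.println("alive after release: " + (ref.get() != null));
    }
}
```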
Re: Call java file from PDF
So, install a Java app that starts when the system boots. It listens on a port, then your PDF can use http://localhost:789/didlysquat to submit the form to your Java app.
Your Java app doesn't need to be a full web server, just listen on the appropriate port and catch the data.

On Sun Feb 14 13:34:20 PST 2016 bigal...@gmail.com said:
>Yeah all the Java runtimes and virtual machines are installed I tested
>it.
>On 15/02/2016 10:23 am, "Olaf Drümmer" wrote:
>
>> But you have the rights to install a Java program?
>>
>> Olaf
>>
>> > On 14.02.2016, at 21:40, Al Grant wrote:
>> >
>> > I would not have the permission rights to install a web server :(
>> > On 15/02/2016 9:27 am, "John Lussmyer" wrote:
>> >
>> >> On Sun Feb 14 12:15:12 PST 2016 bigal...@gmail.com said:
>> >>> Thank you for both your answers.
>> >>>
>> >>> The html is very appealing, but what I did not mention is in working
>> >>> within a rather rigid IT environment.
>> >>>
>> >>> I won't be able to install a html server. So back to Java executable
>> >> (which
>> >>> I can use) unless there is a better way?
>> >>
>> >> You can't even have your app running on the local machine?
>> >> IT can be your html server. (Use a non-standard port, and ignore any
>> >> requests that aren't EXACTLY what you are expecting.)
>> >>
>> >> --
>> >>
>> >> Bobcats and Cougars, oh my! http://john.casadelgato.com/Pets

--
Tigers prowl and Dragons soar in my dreams...
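A minimal sketch of such a listener, using the HTTP server that ships with the JDK (com.sun.net.httpserver). The class name, the port 8789 and the choice to answer with 204 are assumptions for illustration; the example URL above uses port 789, but ports below 1024 usually need elevated rights on Unix-like systems. It binds to 127.0.0.1 only and rejects anything that is not a POST to exactly the expected path.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical local receiver for a form submitted from a PDF.
class FormCatcher {
    // Submitted request bodies, in arrival order.
    static final BlockingQueue<String> RECEIVED = new LinkedBlockingQueue<>();

    // Starts a loopback-only listener that accepts POSTs to exactly 'path'.
    static HttpServer start(int port, String path) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        server.createContext(path, exchange -> {
            try {
                boolean expected = "POST".equals(exchange.getRequestMethod())
                        && path.equals(exchange.getRequestURI().getPath());
                if (!expected) {
                    exchange.sendResponseHeaders(404, -1); // ignore everything else
                    return;
                }
                byte[] body = exchange.getRequestBody().readAllBytes();
                RECEIVED.add(new String(body, StandardCharsets.UTF_8));
                exchange.sendResponseHeaders(204, -1); // accepted, no response body
            } finally {
                exchange.close();
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = start(8789, "/didlysquat");
        System.out.println("listening on " + server.getAddress());
        System.out.println("got: " + RECEIVED.take()); // blocks until a form arrives
        server.stop(0);
    }
}
```

The PDF's submit action would then point at http://localhost:8789/didlysquat; how the viewer encodes the form data (FDF, HTML form encoding, and so on) depends on the submit settings in the PDF and is not handled here.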
Re: Call java file from PDF
On Sun Feb 14 12:15:12 PST 2016 bigal...@gmail.com said:
>Thank you for both your answers.
>
>The html is very appealing, but what I did not mention is in working
>within a rather rigid IT environment.
>
>I won't be able to install a html server. So back to Java executable (which
>I can use) unless there is a better way?

You can't even have your app running on the local machine?
IT can be your html server. (Use a non-standard port, and ignore any requests that aren't EXACTLY what you are expecting.)

--
Bobcats and Cougars, oh my! http://john.casadelgato.com/Pets
Re: Creating a page from a block of CCITTG42D data?
In this case I'm converting some proprietary image files from a program I wrote 18 years ago.

On Thu Feb 19 13:41:30 PST 2015 thaush...@t-online.de said:
>Glad it works. Where did you get the raw G4 files from / is this
>something that you think might be useful to many, or was it just
>something unique for you? I'm just wondering if I should add such code
>to the 2.0 or 2.1 version.
>
>Tilman
>
>On 19.02.2015 at 19:09, John Lussmyer wrote:
>> On Wed Feb 18 23:34:09 PST 2015 thaush...@t-online.de said:
>>> Assuming you are using 1.8.8, put the ccitt stream into a PDStream
>>> object, then call the PDCcitt constructor with that PDStream.
>>>
>>> PDStream pd =new PDStream(doc, new
>>> ByteArrayInputStream(data), true);
>>
>> Thanks, that worked! (with a few tweaks and typo corrections of course!)

--
Worlds only All Electric F-250 truck! http://john.casadelgato.com/Electric-Vehicles/1995-Ford-F-250
Re: Creating a page from a block of CCITTG42D data?
On Wed Feb 18 23:34:09 PST 2015 thaush...@t-online.de said:
>Assuming you are using 1.8.8, put the ccitt stream into a PDStream
>object, then call the PDCcitt constructor with that PDStream.
>
>PDStream pd =new PDStream(doc, new
>ByteArrayInputStream(data), true);

Thanks, that worked! (with a few tweaks and typo corrections of course!)

--
Try my Sensible Email package! https://sourceforge.net/projects/sensibleemail/
Creating a page from a block of CCITTG42D data?
So, I have a block of data (byte[]) that represents a scanned image, compressed using CCITT G4.
I'm new to PDFBox. (of course)
So far, I haven't been able to figure out how I can create a page that consists of just that image.
All the examples want to read the image from a file, and decompress it. Since I already have it as a compressed block, I'd prefer to just use it as-is.

The last time I did much work with PDFs, I was working directly with the dictionaries. I don't see a simple way of even getting to the Page dictionary in PDFBox.

Anyone have a suggestion on how to do this?

--
Worlds only All Electric F-250 truck! http://john.casadelgato.com/Electric-Vehicles/1995-Ford-F-250