Re: PDFA extension value type case sensitivity

2025-04-04 Thread Tim Allison
Not that Tilman needs it, but to support his point, we use JempBox in Apache Tika because it is much more relaxed. On Fri, Apr 4, 2025 at 5:55 AM Tilman Hausherr wrote: > Hi, > > It gets weirder, I found the XMP specification, sometimes they write > "closed Choice", sometimes "Closed Choice". >

Re: Does preflight check for "character encoding"?

2023-06-27 Thread Tim Allison
Over on Apache Tika (via PDFBox!), we report the number of characters without Unicode mappings, and, if you add our tika-eval jar, you can also get an "out of vocabulary" statistic that is an indicator that extracted text is garbage. Happy to chat over on u...@tika.apache.org on either of those top

Re: Non-embedded fonts

2022-10-13 Thread Tim Allison
2022 at 6:21 AM Tim Allison wrote: > > Thank you! > > On Wed, Oct 12, 2022 at 2:01 PM Tilman Hausherr wrote: > > > > On 12.10.2022 19:21, Tim Allison wrote: > > > Hi All, > > >Is there an easy-ish way for me to figure out if a PDF has >

Re: Non-embedded fonts

2022-10-13 Thread Tim Allison
Thank you! On Wed, Oct 12, 2022 at 2:01 PM Tilman Hausherr wrote: > > On 12.10.2022 19:21, Tim Allison wrote: > > Hi All, > >Is there an easy-ish way for me to figure out if a PDF has > > non-embedded fonts during TextStripping or otherwise? > > > If you h

Non-embedded fonts

2022-10-12 Thread Tim Allison
Hi All, Is there an easy-ish way for me to figure out if a PDF has non-embedded fonts during TextStripping or otherwise? Thank you. Best, Tim

Detecting black (or any?) highlighting?

2021-04-29 Thread Tim Allison
All, Tilman recently tweeted about a failed redaction with black highlighter [0]. I realize there are probably 18 different ways of failing with redaction, but is there a fairly straightforward way to look for black highlighting with PDFBox? In the particular file referenced from Tilman's tw

Re: Reverting incremental updates?

2021-03-16 Thread Tim Allison
Thank you, Marc. Is this using XrefTrailerResolver...do you happen to have any example code? On Tue, Mar 16, 2021 at 9:53 AM Marc Kaufman wrote: > > You can track back through the XREF chain to make sure the %%EOF you > found is correct. > > On 3/15/2021 1:40 PM, Tim Allison

Re: Reverting incremental updates?

2021-03-16 Thread Tim Allison
Got it. Thank you! On Tue, Mar 16, 2021 at 3:13 AM Tilman Hausherr wrote: > I don't have a better idea. > > Tilman > > Am 15.03.2021 um 21:40 schrieb Tim Allison: > > All, > > > >Is there an easy-ish/programmatic way to extract earlier versio

Reverting incremental updates?

2021-03-15 Thread Tim Allison
All, Is there an easy-ish/programmatic way to extract earlier versions of a PDF in PDFBox if there are incremental updates? I found [1], and that's easy enough, but I worry about %%EOF that might show up in uncompressed streams or comments. It looks like you actually have to do a full parse to
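
The naive approach that the message worries about can be sketched in plain Java, without PDFBox. The content of [1] is not shown in this archive, so this is only an assumption about what a simple "cut at %%EOF" technique looks like. The class and method names (EofRevisionScanner, candidateRevisionEnds, truncateAt) are hypothetical. The scan is deliberately byte-level, so it has exactly the weakness raised above: a %%EOF inside an uncompressed stream or a comment is reported as a candidate too.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox: a raw byte scan for "%%EOF".
final class EofRevisionScanner {
    private static final byte[] EOF_MARKER =
            "%%EOF".getBytes(StandardCharsets.US_ASCII);

    /** Returns the offset just past every "%%EOF" marker in the raw bytes. */
    static List<Integer> candidateRevisionEnds(byte[] pdf) {
        List<Integer> ends = new ArrayList<>();
        for (int i = 0; i + EOF_MARKER.length <= pdf.length; i++) {
            boolean match = true;
            for (int j = 0; j < EOF_MARKER.length; j++) {
                if (pdf[i + j] != EOF_MARKER[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                ends.add(i + EOF_MARKER.length);
                i += EOF_MARKER.length - 1; // skip past this marker
            }
        }
        return ends;
    }

    /** Copies the bytes up to a candidate end; the result still needs a real parse. */
    static byte[] truncateAt(byte[] pdf, int end) {
        return Arrays.copyOf(pdf, end);
    }

    public static void main(String[] args) {
        // Toy input standing in for a file with one incremental update.
        byte[] fake = "%PDF-1.7\nrev1\n%%EOF\nrev2\n%%EOF\n"
                .getBytes(StandardCharsets.US_ASCII);
        for (int end : candidateRevisionEnds(fake)) {
            System.out.println("candidate revision of "
                    + truncateAt(fake, end).length + " bytes");
        }
    }
}
```

Each truncated copy is only a candidate. It would still have to be loaded by a real parser, and its XREF chain checked, before it could be trusted as an earlier version. More than one marker is also not proof of an incremental update, since linearized files can contain more than one %%EOF as well.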

Re: ExceptionInInitializationError - PDDocument

2020-06-25 Thread Tim Allison
Is there anything we can do at the Tika level to work around this bit of joy? On Thu, Jun 25, 2020 at 2:28 AM Tilman Hausherr wrote: > Hi, > > Sadly, this is a longtime PITA - this code segment was put there because > of a (different) problem in multithreaded code. > > I suggest you find a way t

Incremental updates

2020-02-21 Thread Tim Allison
All, I’d like to add a flag in Tika to allow users to find PDFs with incremental updates. Is there an easy way to do this? Thank you! Best, Tim

Re: Parsing order issue

2019-12-17 Thread Tim Allison
PDFBox Colleagues, Any recommendations? On Mon, Dec 16, 2019 at 7:05 AM Lu Sun wrote: > Dear Tika Dev Team, > > > > Hope this email finds you well. > > > > I have been actively using Tika for pdf file reading. One issue I found is > the parsing order. As shown in attached image, the parsing or

Re: Parsing huge PDF (400Mb, 2700 pages)

2019-11-14 Thread Tim Allison
CC'ing colleagues on PDFBox...any recommendations? Sergey's recommendation is great for documents that can be parsed via streaming. However, PDFBox does not currently parse PDFs in a streaming mode. It builds the full document tree -- PDFBox colleagues let me know if I'm wrong. On Thu, Nov 14,

low level parsing example?

2019-10-17 Thread Tim Allison
All, Apologies for not digging into our codebase more before asking this... If I wanted a low level SAX-like parser where an event is a COSObject, where would I start? Should I start with the new on-demand parser in master/trunk or should I go back to 1.8.x? I'm interested in finding: * Objec

Extract actual /Table /TD /TR markup info?

2019-06-04 Thread Tim Allison
All, I have some pdfs with actual /Table /TD /TR markup. How much effort would it be to extend PDFTextStripper to add, e.g. startTable(), endTable(), startTD(), endTD(), etc...? If I do have time to work on this (uncertain at this point), would there be interest in putting this into PDFBox.

Re: Corrupted PDF file causing severe OOM

2019-05-15 Thread Tim Allison
.inflate(Inflater.java:259) > at java.util.zip.Inflater.inflate(Inflater.java:280) > at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:83) > at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:50) > ... 17 more > > > On Wed, May 15, 2019 at 4:54 PM Tim Allison wrote: > > >

Re: Corrupted PDF file causing severe OOM

2019-05-15 Thread Tim Allison
Sounds like it might be a bug. PDFBox colleagues, any recs? Slava, if you’re able to share the file even if only privately, that’ll help. On Wed, May 15, 2019 at 9:49 AM Slava G wrote: > I have small pdf file (142kb) while I'm trying to parse it with TIKA my > entire app is crashing on OOM wit

Re: No Unicode mapping for xx (xx) in font null

2019-04-04 Thread Tim Allison
> > Many of these fonts are proprietary and so impossible to obtain. > > I'd be happy to hear of others prepared to help with managing these - I've > spent months... > > > > On Wed, Apr 3, 2019 at 5:52 PM Tilman Hausherr > wrote: > > > Am 02.04.2

Re: No Unicode mapping for xx (xx) in font null

2019-04-04 Thread Tim Allison
a parallel array -> parallel arrays -j -> -J (tika-app commandline options) On Thu, Apr 4, 2019 at 7:06 AM Tim Allison wrote: > > And with TIKA-2846 (thanks to Tilman), you will now be able to see how > many unmapped chars there were per page. If there's more than one

Re: No Unicode mapping for xx (xx) in font null

2019-04-04 Thread Tim Allison
And with TIKA-2846 (thanks to Tilman), you will now be able to see how many unmapped chars there were per page. If there's more than one page, you'll get a parallel array of ints. These were the results on your doc: 0: pdf:unmappedUnicodeCharsPerPage : 3242 0: pdf:charsPerPage : 3242 Note, you'
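
As a worked illustration of how the two parallel arrays could be read together, the sketch below divides the unmapped count by the total count for each page. It is plain Java and does not call Tika. It assumes the values of pdf:unmappedUnicodeCharsPerPage and pdf:charsPerPage have already been parsed into int arrays, and the class and method names are hypothetical.

```java
// Hypothetical helper, not part of Tika or PDFBox.
final class UnmappedCharRatio {

    /**
     * Returns, per page, the fraction of characters without a Unicode mapping.
     * Index i in both arrays is assumed to refer to the same page.
     */
    static double[] ratios(int[] unmappedPerPage, int[] charsPerPage) {
        if (unmappedPerPage.length != charsPerPage.length) {
            throw new IllegalArgumentException("arrays must be parallel");
        }
        double[] result = new double[charsPerPage.length];
        for (int i = 0; i < charsPerPage.length; i++) {
            // A page with no characters gets 0.0 rather than a division by zero.
            result[i] = charsPerPage[i] == 0
                    ? 0.0
                    : (double) unmappedPerPage[i] / charsPerPage[i];
        }
        return result;
    }

    public static void main(String[] args) {
        // The single-page values quoted in the message above.
        double[] r = ratios(new int[] {3242}, new int[] {3242});
        System.out.println("page 0 unmapped ratio: " + r[0]);
    }
}
```

For the quoted document both counts are 3242, so every character on the page lacks a Unicode mapping.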

Re: hasContents in PDFTextStripper?

2019-04-02 Thread Tim Allison
Thank you! And, right, I see that Text is in the class name :D On Tue, Apr 2, 2019 at 1:26 PM Andreas Lehmkuehler wrote: > Am 02.04.19 um 13:32 schrieb Tim Allison: > > All, > >I just noticed this in PDFTextStripper's processPages(): > >

hasContents in PDFTextStripper?

2019-04-02 Thread Tim Allison
All, I just noticed this in PDFTextStripper's processPages(): if (page.hasContents()) { processPage(page); } If a page has an embedded file, inline images, annotations etc, but no text content, does this mean we're skipping the page by accident? In short, do we need to override processPage

Re: No Unicode mapping for xx (xx) in font null

2019-04-01 Thread Tim Allison
I defer to my colleagues on PDFBox, but the unicode mapping warning means what it says -- there is no way (short of nlp/language modeling/ai) to reconstruct how to map the characters as stored in the document to the correct unicode equivalents. The electronic text stored within the PDF may or may

Re: Fwd: Very slow PDF parsing.

2019-02-28 Thread Tim Allison
Thank you, Tilman! On Thu, Feb 28, 2019 at 2:19 PM Tilman Hausherr wrote: > > Thanks, I got the file. It has about 1000 objects but much more objects > are created. So I think this is a bug and not related to the size. > > The hashmap in decryption seems suspicious to me... Coincidentally, > toda

Re: Fwd: Very slow PDF parsing.

2019-02-28 Thread Tim Allison
14 it's 40 minutes running, no result, still working... >> Seems that issue is still there. >> Thanks >> >> On Wed, Feb 27, 2019 at 2:52 PM Slava G wrote: >> >>> Checking with 2.0.14. Started as an app. Will update soon. >>> >>> On Wed

Re: Fwd: Very slow PDF parsing.

2019-02-27 Thread Tim Allison
()); >>>> PDFParser tmpPdf = new PDFParser(); >>>> PDFParserConfig config = tmpPdf.getPDFParserConfig(); >>>> config.setMaxMainMemoryBytes(31457280); >>>> config.setExtractAcroFormContent(false); >>>> config.setExtractBookmarksText(false); >>>> config.setCatchInter

Re: Fwd: Very slow PDF parsing.

2019-02-26 Thread Tim Allison
t is used to avoid decrypting objects twice. > > The "not encrypted" file is likely encrypted with an empty user password. > > It would also be interesting to hear what parameter is passed to > MemoryUsageSetting when load() is called. > > Tilman > > > > Am 26

Fwd: Very slow PDF parsing.

2019-02-26 Thread Tim Allison
PDFBox Colleagues, Any ideas? -- Forwarded message - From: Tim Allison Date: Tue, Feb 26, 2019 at 12:13 PM Subject: Re: Very slow PDF parsing. To: Sorry...that's an OCR tool. One thing that can slow down processing dramatically is if you have tesseract installed (try t

Fwd: Memory Errors with PDFBOX

2019-01-30 Thread Tim Allison
forwarding to the correct pdfbox address... sorry for the noise... -- Forwarded message - From: Tim Allison Date: Wed, Jan 30, 2019 at 10:29 AM Subject: Re: Memory Errors with PDFBOX To: , Jim , @PDFBox colleagues, Any thoughts/recommendations? On Wed, Jan 30, 2019 at 9:43