Not that Tilman needs it, but to support his point, we use JempBox in
Apache Tika because it is much more relaxed.
On Fri, Apr 4, 2025 at 5:55 AM Tilman Hausherr
wrote:
> Hi,
>
> It gets weirder, I found the XMP specification, sometimes they write
> "closed Choice", sometimes "Closed Choice".
>
Over on Apache Tika (via PDFBox!), we report the number of characters
without Unicode mappings, and, if you add our tika-eval jar, you can also
get an "out of vocabulary" statistic that is an indicator that extracted
text is garbage. Happy to chat over on u...@tika.apache.org on either of
those top
2022 at 6:21 AM Tim Allison wrote:
>
> Thank you!
>
> On Wed, Oct 12, 2022 at 2:01 PM Tilman Hausherr wrote:
> >
> > On 12.10.2022 19:21, Tim Allison wrote:
> > > Hi All,
> > > Is there an easy-ish way for me to figure out if a PDF has
> >
Thank you!
On Wed, Oct 12, 2022 at 2:01 PM Tilman Hausherr wrote:
>
> On 12.10.2022 19:21, Tim Allison wrote:
> > Hi All,
> > Is there an easy-ish way for me to figure out if a PDF has
> > non-embedded fonts during TextStripping or otherwise?
>
>
> If you h
Hi All,
Is there an easy-ish way for me to figure out if a PDF has
non-embedded fonts during TextStripping or otherwise?
Thank you.
Best,
Tim
All,
Tilman recently tweeted about a failed redaction with black
highlighter [0]. I realize there are probably 18 different ways of
failing with redaction, but is there a fairly straightforward way to
look for black highlighting with PDFBox?
In the particular file referenced from Tilman's tw
Thank you, Marc. Is this using XrefTrailerResolver? Do you happen to
have any example code?
On Tue, Mar 16, 2021 at 9:53 AM Marc Kaufman wrote:
>
> You can track back through the XREF chain to make sure the %%EOF you
> found is correct.
>
> On 3/15/2021 1:40 PM, Tim Allison
Got it. Thank you!
On Tue, Mar 16, 2021 at 3:13 AM Tilman Hausherr
wrote:
> I don't have a better idea.
>
> Tilman
>
> Am 15.03.2021 um 21:40 schrieb Tim Allison:
> > All,
> >
> > Is there an easy-ish/programmatic way to extract earlier versio
All,
Is there an easy-ish/programmatic way to extract earlier versions of
a PDF in PDFBox if there are incremental updates? I found [1], and
that's easy enough, but I worry about %%EOF that might show up in
uncompressed streams or comments. It looks like you actually have to
do a full parse to
Is there anything we can do at the Tika level to work around this bit of
joy?
On Thu, Jun 25, 2020 at 2:28 AM Tilman Hausherr
wrote:
> Hi,
>
> Sadly, this is a longtime PITA - this code segment was put there because
> of a (different) problem in multithreaded code.
>
> I suggest you find a way t
All,
I’d like to add a flag in Tika to allow users to find PDFs with incremental
updates. Is there an easy way to do this? Thank you!
Best,
Tim
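One rough heuristic for such a flag, sketched without PDFBox: count the literal %%EOF markers in the raw bytes and treat more than one as a hint that the file may carry incremental updates. The class and method names here (EofMarkerCounter, countEofMarkers, mayHaveIncrementalUpdates) are hypothetical and not part of PDFBox or Tika. The caveat is that %%EOF can also show up in uncompressed streams or comments, so this can produce false positives; following the xref chain would be more reliable.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; not a PDFBox or Tika API. */
final class EofMarkerCounter {

    private static final byte[] EOF_MARKER =
            "%%EOF".getBytes(StandardCharsets.US_ASCII);

    /** Counts non-overlapping occurrences of the literal bytes "%%EOF". */
    static int countEofMarkers(byte[] pdfBytes) {
        int count = 0;
        int i = 0;
        while (i <= pdfBytes.length - EOF_MARKER.length) {
            if (matchesAt(pdfBytes, i)) {
                count++;
                i += EOF_MARKER.length; // skip past this marker
            } else {
                i++;
            }
        }
        return count;
    }

    /** True if the marker bytes start at the given offset. */
    private static boolean matchesAt(byte[] data, int offset) {
        for (int j = 0; j < EOF_MARKER.length; j++) {
            if (data[offset + j] != EOF_MARKER[j]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Heuristic only: more than one marker suggests, but does not prove,
     * that the file was saved with at least one incremental update.
     */
    static boolean mayHaveIncrementalUpdates(byte[] pdfBytes) {
        return countEofMarkers(pdfBytes) > 1;
    }
}
```

Reading the file with java.nio.file.Files.readAllBytes and passing the result to mayHaveIncrementalUpdates would be enough for a first-pass flag on small files; large files would want a streaming scan instead.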
PDFBox Colleagues,
Any recommendations?
On Mon, Dec 16, 2019 at 7:05 AM Lu Sun wrote:
> Dear Tika Dev Team,
>
>
>
> Hope this email finds you well.
>
>
>
> I have been actively using Tika for pdf file reading. One issue I found is
> the parsing order. As shown in attached image, the parsing or
CC'ing colleagues on PDFBox...any recommendations?
Sergey's recommendation is great for documents that can be parsed via
streaming. However, PDFBox does not currently parse PDFs in a streaming
mode. It builds the full document tree -- PDFBox colleagues, let me know if
I'm wrong.
On Thu, Nov 14,
All,
Apologies for not digging into our codebase more before asking this...
If I wanted a low level SAX-like parser where an event is a COSObject,
where would I start?
Should I start with the new on-demand parser in master/trunk or should I
go back to 1.8.x?
I'm interested in finding:
* Objec
All,
I have some PDFs with actual /Table /TD /TR markup.
How much effort would it be to extend PDFTextStripper to add, e.g.,
startTable(), endTable(), startTD(), endTD(), etc.?
If I do have time to work on this (uncertain at this point), would
there be interest in putting this into PDFBox?
.inflate(Inflater.java:259)
> at java.util.zip.Inflater.inflate(Inflater.java:280)
> at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:83)
> at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:50)
> ... 17 more
>
>
> On Wed, May 15, 2019 at 4:54 PM Tim Allison wrote:
>
> >
Sounds like it might be a bug.
PDFBox colleagues, any recs?
Slava, if you’re able to share the file even if only privately, that’ll
help.
On Wed, May 15, 2019 at 9:49 AM Slava G wrote:
> I have a small PDF file (142 KB). While I'm trying to parse it with Tika, my
> entire app is crashing on OOM wit
>
> Many of these fonts are proprietary and so impossible to obtain.
>
> I'd be happy to hear of others prepared to help with managing these - I've
> spent months...
>
>
>
> On Wed, Apr 3, 2019 at 5:52 PM Tilman Hausherr
> wrote:
>
> > Am 02.04.2
a parallel array -> parallel arrays
-j -> -J (tika-app commandline options)
On Thu, Apr 4, 2019 at 7:06 AM Tim Allison wrote:
>
> And with TIKA-2846 (thanks to Tilman), you will now be able to see how
> many unmapped chars there were per page. If there's more than one
And with TIKA-2846 (thanks to Tilman), you will now be able to see how
many unmapped chars there were per page. If there's more than one
page, you'll get a parallel array of ints. These were the results on
your doc:
0: pdf:unmappedUnicodeCharsPerPage : 3242
0: pdf:charsPerPage : 3242
Note, you'
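As a rough illustration of how the two parallel arrays might be combined, here is a minimal sketch that computes the fraction of unmapped characters per page. The class and method names (UnmappedCharRatio, ratiosPerPage) are hypothetical and not part of Tika; it assumes the two metadata values have already been parsed into int arrays of equal length.

```java
/** Hypothetical helper for illustration only; not a Tika API. */
final class UnmappedCharRatio {

    /**
     * Returns, for each page, unmapped chars divided by total chars.
     * Pages with zero characters get a ratio of 0.0.
     */
    static double[] ratiosPerPage(int[] unmappedPerPage, int[] charsPerPage) {
        if (unmappedPerPage.length != charsPerPage.length) {
            throw new IllegalArgumentException(
                    "parallel arrays must have the same length");
        }
        double[] ratios = new double[charsPerPage.length];
        for (int i = 0; i < charsPerPage.length; i++) {
            ratios[i] = charsPerPage[i] == 0
                    ? 0.0
                    : (double) unmappedPerPage[i] / charsPerPage[i];
        }
        return ratios;
    }
}
```

With the values reported above (3242 unmapped out of 3242 chars on page 0), the ratio for that page is 1.0, i.e. every character on the page lacks a Unicode mapping.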
Thank you! And, right, I see that Text is in the class name :D
On Tue, Apr 2, 2019 at 1:26 PM Andreas Lehmkuehler wrote:
> Am 02.04.19 um 13:32 schrieb Tim Allison:
> > All,
> > I just noticed this in PDFTextStripper's processPages():
> >
All,
I just noticed this in PDFTextStripper's processPages():
if (page.hasContents())
{
processPage(page);
}
If a page has an embedded file, inline images, annotations, etc., but no
text content, does this mean we're skipping the page by accident? In
short, do we need to override processPage
I defer to my colleagues on PDFBox, but the Unicode mapping warning
means what it says -- there is no way (short of NLP/language
modeling/AI) to reconstruct how to map the characters as stored in the
document to the correct Unicode equivalents. The electronic text
stored within the PDF may or may
Thank you, Tilman!
On Thu, Feb 28, 2019 at 2:19 PM Tilman Hausherr wrote:
>
> Thanks, I got the file. It has about 1000 objects but many more objects
> are created. So I think this is a bug and not related to the size.
>
> The hashmap in decryption seems suspicious to me... Coincidentally,
> toda
14 it's 40 minutes running, no result, still working...
>> Seems that issue is still there.
>> Thanks
>>
>> On Wed, Feb 27, 2019 at 2:52 PM Slava G wrote:
>>
>>> Checking with 2.0.14. Started as an app. Will update soon.
>>>
>>> On Wed
());
>>>> PDFParser tmpPdf = new PDFParser();
>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>>>> config.setMaxMainMemoryBytes(31457280);
>>>> config.setExtractAcroFormContent(false);
>>>> config.setExtractBookmarksText(false);
>>>> config.setCatchInter
t is used to avoid decrypting objects twice.
>
> The "not encrypted" file is likely encrypted with an empty user password.
>
> It would also be interesting to hear what parameter is passed to
> MemoryUsageSetting when load() is called.
>
> Tilman
>
>
>
> Am 26
PDFBox Colleagues,
Any ideas?
---------- Forwarded message ---------
From: Tim Allison
Date: Tue, Feb 26, 2019 at 12:13 PM
Subject: Re: Very slow PDF parsing.
To:
Sorry...that's an OCR tool. One thing that can slow down processing
dramatically is if you have tesseract installed (try t
forwarding to the correct pdfbox address... sorry for the noise...
---------- Forwarded message ---------
From: Tim Allison
Date: Wed, Jan 30, 2019 at 10:29 AM
Subject: Re: Memory Errors with PDFBOX
To: , Jim ,
@PDFBox colleagues,
Any thoughts/recommendations?
On Wed, Jan 30, 2019 at 9:43