> setExtractAcroFormContent(false)

Lol... that's what I was looking for, but grepping for xfa didn't find that... need more coffee.

Yes, of course, that would solve it. You can configure that via tika-config.xml.

On Fri, Aug 22, 2025 at 9:24 AM Tilman Hausherr <[email protected]> wrote:
>
> On 22.08.2025 at 14:38, Tim Allison wrote:
> > Unfortunately, there's no way via configuration to tell Tika to avoid
> > parsing XFA.
>
> I've been trying to research this but somehow I messed up my IDE while
> working on TIKA-4470 so I can't properly test right now. I was wondering
> whether disabling acroform (setExtractAcroFormContent(false)) would work
> (although we'd lose the classic form content as well), or if we could
> exclude the XMP parser. (There are two occurrences of XFA usage in
> AbstractPDF2XHTML.java)
>
> Another solution would be to check PDF files with PDFBox (easy), and
> also check for attachments (less easy because there are two types of
> attachments).
>
> Tilman
>
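
For reference, a rough sketch of what the tika-config.xml route might look like. This is untested here and assumes the XML config format of Tika 2.x/3.x, where a PDFParserConfig setter such as setExtractAcroFormContent is exposed as a param of the same name (extractAcroFormContent); the DefaultParser exclusion is my assumption about how to keep the customized PDFParser from being registered twice. As noted in the thread, this would also drop the classic AcroForm content, not only XFA.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: param name assumed to mirror PDFParserConfig.setExtractAcroFormContent -->
<properties>
  <parsers>
    <!-- Keep all default parsers, but leave PDF handling to the entry below -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <!-- PDFParser with AcroForm (and therefore XFA) extraction turned off -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="extractAcroFormContent" type="bool">false</param>
      </params>
    </parser>
  </parsers>
</properties>
```

If the param name or type differs in your Tika version, the setters on PDFParserConfig should show which names the config loader accepts.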
