Sorry. That answered your second question, which will enable a solution to the first. :D
For the first, we do have MetadataFilters that remove content by mime type:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=181309141#MetadataFilters-ClearByMImeMetadataFilter
These apply when you're using the RecursiveParserWrapper (RPW), that is, -J in tika-app or /rmeta in tika-server.

We could add a filter by file name, but I have worries about the robustness of that.

Moving to the RPW takes a bit of a lift (especially for end users), but I strongly recommend using it. Please do not hesitate to ask further questions as you move to integrate it. :D

On Sat, Sep 14, 2024 at 6:58 AM Tim Allison wrote:

> Try the RecursiveParserWrapper. With tika-app: java -jar tika-app.jar -J
> -t myZip.zip or the /rmeta endpoint with tika-server. Or, coming soon, the
> gRPC server will effectively give that output.
>
> On Fri, Sep 13, 2024 at 10:01 AM David Pilato wrote:
>
>> Hey team,
>>
>> I'm wondering if there is a way to filter the content being extracted by
>> Tika using filenames for example.
>> Let's say I have a zip file with foo.js, foo.pdf, foo.html, foo.png and I
>> only want to extract text from the pdf and html files.
>>
>> Also, I can see that a Zip is extracted this way as a full String:
>>
>> """
>> doc/ab1.js
>> CONTENT1
>> abc/abc2.pdf
>> CONTENT2
>> ...
>> """
>>
>> Would it be possible to extract the content as separated Objects,
>> something like:
>>
>> ```
>> [
>>   { "name": "doc/ab1.js", "content": "CONTENT1", "metadata": [ /* ... */ ] },
>>   { "name": "abc/abc2.pdf", "content": "CONTENT2", "metadata": [ /* ... */ ] },
>>   ...
>> ]
>> ```
>>
>> Thanks!
>>
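P.S. Until there is a filename filter on the Tika side, the filtering can be done client-side on the RPW output. Below is a minimal sketch, assuming the JSON list that -J or /rmeta returns has already been fetched as a string. The function name filter_by_extension and the SAMPLE data are made up for illustration (SAMPLE is hand-written, not real Tika output). The keys X-TIKA:content, X-TIKA:embedded_resource_path and resourceName are the ones I'd expect in that output, but check them against what your Tika version emits.

```python
import json
from pathlib import PurePosixPath

# Metadata keys expected in RecursiveParserWrapper JSON (verify for your version).
CONTENT_KEY = "X-TIKA:content"
PATH_KEY = "X-TIKA:embedded_resource_path"
NAME_KEY = "resourceName"


def _first(value):
    """Metadata values may be a string or a list of strings; return one string."""
    if isinstance(value, list):
        return value[0] if value else ""
    return value or ""


def filter_by_extension(rmeta_json, wanted=(".pdf", ".html")):
    """Keep only the documents whose file name ends in a wanted extension.

    Hypothetical helper: reshapes each kept entry into the
    name/content/metadata form asked for in the original question.
    """
    kept = []
    for doc in json.loads(rmeta_json):
        # Embedded files carry a path inside the container; the container
        # itself usually only has a resource name.
        name = _first(doc.get(PATH_KEY)) or _first(doc.get(NAME_KEY))
        if PurePosixPath(name).suffix.lower() in wanted:
            kept.append({
                "name": name.lstrip("/"),
                "content": _first(doc.get(CONTENT_KEY)),
                "metadata": {k: v for k, v in doc.items() if k != CONTENT_KEY},
            })
    return kept


# Hand-written stand-in for the output of: java -jar tika-app.jar -J -t myZip.zip
SAMPLE = json.dumps([
    {"resourceName": "myZip.zip", "Content-Type": "application/zip"},
    {"resourceName": "ab1.js", PATH_KEY: "/doc/ab1.js", CONTENT_KEY: "CONTENT1"},
    {"resourceName": "abc2.pdf", PATH_KEY: "/abc/abc2.pdf", CONTENT_KEY: "CONTENT2"},
    {"resourceName": "foo.html", PATH_KEY: "/abc/foo.html", CONTENT_KEY: "CONTENT3"},
])

if __name__ == "__main__":
    for entry in filter_by_extension(SAMPLE):
        print(entry["name"], "->", entry["content"])
```

This has the same robustness problem as any name-based filter: a file name says nothing reliable about what the file actually contains. Matching on the detected Content-Type in the same loop, or using the mime-based MetadataFilter above, is the safer route.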
