Try the RecursiveParserWrapper. With tika-app:

    java -jar tika-app.jar -J -t myZip.zip

With tika-server, use the /rmeta endpoint. The gRPC server, coming soon, will effectively give the same output.
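
To get from that JSON to the per-file objects you describe, and to keep only some file types, you could post-process the list. Below is a rough sketch, not anything that ships with Tika. It assumes the output is a JSON array with the container document first and one object per embedded file, and that the keys are "X-TIKA:content", "X-TIKA:embedded_resource_path" and "resourceName" (please check these against your Tika version). The function name split_entries and the sample data are made up for illustration.

```python
import json
from pathlib import PurePosixPath

# Key names assumed from RecursiveParserWrapper JSON output; verify for your version.
CONTENT_KEY = "X-TIKA:content"
PATH_KEY = "X-TIKA:embedded_resource_path"
NAME_KEY = "resourceName"


def split_entries(rmeta_json, wanted_suffixes):
    """Turn -J / rmeta style JSON into [{name, content, metadata}, ...].

    Only embedded files whose extension is in wanted_suffixes are kept.
    """
    records = json.loads(rmeta_json)
    entries = []
    # Assumption: the first record is the container (the zip itself), so skip it.
    for record in records[1:]:
        name = (record.get(PATH_KEY) or record.get(NAME_KEY) or "").lstrip("/")
        if PurePosixPath(name).suffix.lower() not in wanted_suffixes:
            continue
        entries.append({
            "name": name,
            "content": record.get(CONTENT_KEY, ""),
            # Everything except the extracted text is treated as metadata.
            "metadata": {k: v for k, v in record.items() if k != CONTENT_KEY},
        })
    return entries


# Hypothetical sample mimicking the zip from the question.
SAMPLE = json.dumps([
    {"resourceName": "myZip.zip", "Content-Type": "application/zip"},
    {"X-TIKA:embedded_resource_path": "/doc/ab1.js",
     "X-TIKA:content": "CONTENT1", "Content-Type": "text/javascript"},
    {"X-TIKA:embedded_resource_path": "/abc/abc2.pdf",
     "X-TIKA:content": "CONTENT2", "Content-Type": "application/pdf"},
])

if __name__ == "__main__":
    print(json.dumps(split_entries(SAMPLE, {".pdf", ".html"}), indent=2))
```

Note that this filters after the fact: Tika has still parsed every file in the zip, so it fixes the shape of the output but does not save any parsing work.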

On Fri, Sep 13, 2024 at 10:01 AM David Pilato <[email protected]> wrote:

> Hey team,
>
> I'm wondering if there is a way to filter the content being extracted by
> Tika using filenames for example.
> Let say I have a zip file with foo.js, foo.pdf, foo.html, foo.png and I
> only want to extract text from the pdf and html files.
>
> Also, I can see that a Zip is extracted this way as a full String:
>
> """
> doc/ab1.js
> CONTENT1
> abc/abc2.pdf
> CONTENT2
> ...
> """
>
> Would it be possible to extract the content as separated Objects,
> something like:
>
> ```
> [
>   { "name": "doc/ab1.js", "content": "CONTENT1", "metadata": [ /* ... */ ] },
>   { "name": "abc/abc2.pdf", "content": "CONTENT2", "metadata": [ /* ... */ ] },
>   ...
> ]
> ```
>
> Thanks!
