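Try the RecursiveParserWrapper. With tika-app: java -jar tika-app.jar -J -t
myZip.zip. With tika-server, use the /rmeta endpoint. Both return a JSON
array with one object per embedded file, with the extracted text under
X-TIKA:content. The gRPC server, coming soon, will effectively give the
same output.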
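
For the filename filtering part, here is a minimal Python sketch of one way
to do it on the client side, assuming a tika-server running locally on its
default port 9998. The helper names (fetch_rmeta, select_entries) and the
output shape are made up for illustration; the X-TIKA:content,
X-TIKA:embedded_resource_path and resourceName keys are what /rmeta emits.
Note that this only filters the result after the fact; Tika still parses
every entry in the zip.

```python
import json
import urllib.request

# Assumption: tika-server is running locally on its default port.
# /rmeta/text asks for plain-text content instead of the default XHTML.
TIKA_RMETA_URL = "http://localhost:9998/rmeta/text"


def fetch_rmeta(path, url=TIKA_RMETA_URL):
    """PUT a file to the /rmeta endpoint and return the parsed JSON list."""
    with open(path, "rb") as fh:
        request = urllib.request.Request(
            url,
            data=fh.read(),
            method="PUT",
            headers={"Accept": "application/json"},
        )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def select_entries(docs, extensions=(".pdf", ".html")):
    """Keep only the entries whose name ends with one of the extensions.

    Hypothetical helper: reshapes each /rmeta object into
    {"name", "content", "metadata"} as asked in the original question.
    """
    wanted = tuple(ext.lower() for ext in extensions)
    selected = []
    for doc in docs:
        # Embedded files carry their path inside the container; the
        # container itself only has resourceName (if anything).
        name = doc.get("X-TIKA:embedded_resource_path") or doc.get("resourceName", "")
        if isinstance(name, list):  # multi-valued metadata comes back as a list
            name = name[0]
        if not name.lower().endswith(wanted):
            continue
        metadata = {k: v for k, v in doc.items() if k != "X-TIKA:content"}
        selected.append(
            {"name": name, "content": doc.get("X-TIKA:content", ""), "metadata": metadata}
        )
    return selected


if __name__ == "__main__":
    for entry in select_entries(fetch_rmeta("myZip.zip")):
        print(entry["name"], len(entry["content"]))
```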

On Fri, Sep 13, 2024 at 10:01 AM David Pilato <[email protected]> wrote:

> Hey team,
>
>
> I'm wondering if there is a way to filter the content being extracted by
> Tika using filenames for example.
> Let's say I have a zip file with foo.js, foo.pdf, foo.html, foo.png and I
> only want to extract text from the pdf and html files.
>
> Also, I can see that a Zip is extracted this way as a full String:
>
> """
> doc/ab1.js
> CONTENT1
> abc/abc2.pdf
> CONTENT2
> ...
> """
>
> Would it be possible to extract the content as separate objects,
> something like:
>
> ```
> [
> { "name": "doc/ab1.js", "content": "CONTENT1", "metadata": [ /* ... */ ] },
> { "name": "abc/abc2.pdf", "content": "CONTENT2", "metadata": [ /* ... */ ] },
> ...
> ]
> ```
>
> Thanks!
>
