Hey team,
Is there a way to filter which embedded files Tika extracts content from,
for example by filename?
Let's say I have a zip file containing foo.js, foo.pdf, foo.html, and foo.png,
and I only want to extract text from the PDF and HTML files.
Also, I can see that the contents of a zip file are extracted as one
concatenated String, like this:
"""
doc/ab1.js
CONTENT1
abc/abc2.pdf
CONTENT2
...
"""
Would it be possible to get the content as separate objects, one per embedded
file, something like:
```
[
{ "name": "doc/ab1.js", "content": "CONTENT1", "metadata": [ /* ... */ ] },
{ "name": "abc/abc2.pdf", "content": "CONTENT2", "metadata": [ /* ... */ ] },
...
]
```
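To make the request concrete, here is a rough sketch of the behavior I have in
mind. It uses only plain java.util.zip, not Tika, so it reads each entry as raw
UTF-8 text and does no real PDF or HTML parsing. The names ZipEntryFilter,
Extracted, and extract are made up for illustration and are not Tika classes
or methods:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ZipEntryFilter {

    /** Hypothetical holder for one extracted entry (not a Tika class). */
    public static final class Extracted {
        public final String name;
        public final String content;

        Extracted(String name, String content) {
            this.name = name;
            this.content = content;
        }
    }

    /**
     * Walks the zip stream and returns one object per entry whose name
     * passes the filter. Entries that do not match are skipped unread.
     */
    public static List<Extracted> extract(InputStream zip, Predicate<String> nameFilter)
            throws IOException {
        List<Extracted> result = new ArrayList<>();
        try (ZipInputStream zin = new ZipInputStream(zip)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory() || !nameFilter.test(entry.getName())) {
                    continue; // getNextEntry() closes the skipped entry for us
                }
                // On a ZipInputStream, readAllBytes() stops at the end of the
                // current entry. Assumption: the bytes are treated as UTF-8
                // text, which is where a real parser would be plugged in.
                String text = new String(zin.readAllBytes(), StandardCharsets.UTF_8);
                result.add(new Extracted(entry.getName(), text));
            }
        }
        return result;
    }
}
```

With the example zip above, I would call it as
`ZipEntryFilter.extract(in, n -> n.endsWith(".pdf") || n.endsWith(".html"))`
and expect two objects back, one for foo.pdf and one for foo.html. What I am
looking for is the equivalent of this, done by Tika with its own parsers and
metadata.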
Thanks!