Sorry. My earlier reply only answered your second question, but it will
enable a solution to the first. :D

For the first, we do have MetadataFilters that remove content by mime type:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=181309141#MetadataFilters-ClearByMImeMetadataFilter
These apply when you're using the RecursiveParserWrapper (RPW), i.e. the -J
option in tika-app or the /rmeta endpoint in tika-server.

We could add a filter by file name, but I have worries about the robustness
of that.
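
In the meantime, filtering can also be done on the client side once you
have the RPW output. The following is only a rough sketch, not anything
Tika ships: it assumes the output is a JSON list with one metadata object
per document (container first, then each embedded file), and that the keys
are "Content-Type", "resourceName", "X-TIKA:embedded_resource_path" and
"X-TIKA:content". Please check those key names against the output of your
Tika version. The sample data and the select_documents helper are made up
for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical sample shaped like RecursiveParserWrapper JSON output.
# Key names are assumptions; verify them against your own /rmeta response.
SAMPLE_RMETA = [
    {"resourceName": "myZip.zip", "Content-Type": "application/zip",
     "X-TIKA:content": ""},
    {"X-TIKA:embedded_resource_path": "/doc/ab1.js",
     "Content-Type": "application/javascript", "X-TIKA:content": "CONTENT1"},
    {"X-TIKA:embedded_resource_path": "/abc/abc2.pdf",
     "Content-Type": "application/pdf", "X-TIKA:content": "CONTENT2"},
    {"X-TIKA:embedded_resource_path": "/foo.html",
     "Content-Type": "text/html; charset=UTF-8", "X-TIKA:content": "CONTENT3"},
    {"X-TIKA:embedded_resource_path": "/foo.png",
     "Content-Type": "image/png", "X-TIKA:content": ""},
]


def select_documents(rmeta, keep_mimes=(), keep_extensions=()):
    """Keep documents whose mime type or file extension is wanted.

    Returns a list of {"name", "content", "metadata"} dicts.
    """
    selected = []
    for doc in rmeta:
        # Drop parameters such as "; charset=UTF-8" before comparing.
        mime = doc.get("Content-Type", "").split(";")[0].strip().lower()
        # Embedded files carry a path; the container only has a name.
        name = (doc.get("X-TIKA:embedded_resource_path")
                or doc.get("resourceName", ""))
        extension = PurePosixPath(name).suffix.lower()
        if mime in keep_mimes or extension in keep_extensions:
            metadata = {k: v for k, v in doc.items() if k != "X-TIKA:content"}
            selected.append({
                "name": name.lstrip("/"),
                "content": doc.get("X-TIKA:content", ""),
                "metadata": metadata,
            })
    return selected


if __name__ == "__main__":
    wanted = select_documents(
        SAMPLE_RMETA, keep_mimes={"application/pdf", "text/html"})
    for doc in wanted:
        print(doc["name"], "->", doc["content"])
```

Matching on mime type is probably the safer of the two options here,
since file names inside an archive can be missing or misleading.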

Moving to the RPW takes a bit of a lift (especially for end users), but I
strongly recommend using it. Please do not hesitate to ask further
questions as you move to integrate it. :D

On Sat, Sep 14, 2024 at 6:58 AM Tim Allison <[email protected]> wrote:

> Try the RecursiveParserWrapper. With tika-app: java -jar tika-app.jar -J
> -t myZip.zip or the /rmeta endpoint with tika-server. Or, coming soon, the
> grpc server will effectively give that output.
>
> On Fri, Sep 13, 2024 at 10:01 AM David Pilato <[email protected]> wrote:
>
>> Hey team,
>>
>>
>> I'm wondering if there is a way to filter the content being extracted by
>> Tika using filenames for example.
>> Let's say I have a zip file with foo.js, foo.pdf, foo.html, foo.png and I
>> only want to extract text from the pdf and html files.
>>
>> Also, I can see that a Zip is extracted this way as a full String:
>>
>> """
>> doc/ab1.js
>> CONTENT1
>> abc/abc2.pdf
>> CONTENT2
>> ...
>> """
>>
>> Would it be possible to extract the content as separated Objects,
>> something like:
>>
>> ```
>> [
>> { "name": "doc/ab1.js", "content": "CONTENT1", "metadata": [ /* ... */ ]
>> },
>> { "name": "abc/abc2.pdf", "content": "CONTENT2", "metadata": [ /* ... */
>> ] },
>> ...
>> ]
>> ```
>>
>> Thanks!
>>
>
