First, I've never seen a pipeline that parses more than a few dozen
files (i.e. anything beyond a proof of concept) get away with trusting
file extensions.  I wouldn't recommend trusting a file's extension...ever.
I'd recommend sending everything to Tika.  Perhaps send the first few
thousand bytes for detect() and then decide, based on the detected type,
whether or not to parse.
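
A minimal sketch of that detect-then-decide step, assuming a tika-server
instance is running locally on its default port and reachable through its
/detect/stream endpoint.  The 8192-byte prefix size, the URL, the helper
names, and the allow-list of MIME types are all illustrative assumptions,
not recommendations from the Tika project:

```python
import urllib.request

# Assumption: "the first few thousand bytes" is taken to mean 8 KiB here.
PREFIX_BYTES = 8192
# Assumption: a tika-server instance listening locally on the default port.
TIKA_DETECT_URL = "http://localhost:9998/detect/stream"
# Hypothetical allow-list of types this application wants parsed.
WANTED_TYPES = {"application/pdf", "text/plain"}


def read_prefix(path, limit=PREFIX_BYTES):
    """Return at most `limit` bytes from the start of the file."""
    with open(path, "rb") as f:
        return f.read(limit)


def detect_mime(prefix, url=TIKA_DETECT_URL, timeout=10):
    """PUT the prefix to tika-server and return the detected MIME type."""
    req = urllib.request.Request(url, data=prefix, method="PUT")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()


def should_parse(mime, wanted=WANTED_TYPES):
    """Decide on the detected type, ignoring parameters such as charset."""
    base_type = mime.split(";", 1)[0].strip().lower()
    return base_type in wanted
```

With a server running, `should_parse(detect_mime(read_prefix(path)))` gives
the parse/skip decision without consulting the file name at all.  Note that
a truncated prefix may identify container formats only generically, so the
allow-list should be checked against real files.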

That said, I've started a trial of documenting file types and
extensions: 
https://cwiki.apache.org/confluence/display/TIKA/File+Types+and+Dependencies

Please offer feedback on that.  And, yes, Marc asked for this forever ago.

We don't have documentation on which parsers are resource hogs.  That
changes with every update to the various parsers.  In general,
docx/xlsx/pptx and PDF can raise eyebrows.  But every file set has
different challenges.  You need to isolate parsing so that problems
with parsing will not cause problems for your application:
https://cwiki.apache.org/confluence/display/TIKA/The+Robustness+of+Apache+Tika
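
One way to get that isolation, sketched under assumptions: run each parse
in its own child process with a hard timeout, so a hang or crash in a
parser only costs that one file.  The helper name `run_isolated`, the
60-second default, and the example command line are illustrative, not a
documented Tika recipe:

```python
import subprocess


def run_isolated(cmd, timeout_s=60):
    """Run `cmd` in a child process and return (status, output).

    status is "ok", "failed" (non-zero exit) or "timeout".  On timeout,
    subprocess.run kills the child before raising, so a runaway parse
    cannot take the calling application down with it.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return ("timeout", b"")
    if proc.returncode != 0:
        return ("failed", proc.stderr)
    return ("ok", proc.stdout)
```

For example, assuming a tika-app jar at a path of your choosing:
`run_isolated(["java", "-Xmx512m", "-jar", "tika-app.jar", "--text", path])`.
Capping the child's heap with -Xmx keeps a memory-hungry parse from
starving the parent process as well.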

This is our regression set: https://corpora.tika.apache.org/base/docs/

On Tue, May 16, 2023 at 3:02 AM Neha Kamat via user
<user@tika.apache.org> wrote:
>
> Hi team
>
>
>
> Is there documentation for Apache Tika that clearly lists the file
> extensions supported by a particular Tika version? I can see the file
> formats supported by Tika at
> https://tika.apache.org/2.8.0/formats.html but this page doesn’t make clear
> which extensions are covered by each file format.
>
> Based on the list of supported extensions, we plan to implement filters in
> our application so that only supported extensions are sent to Tika for
> extraction and unsupported ones are never sent at all. I am also looking
> for documentation that captures performance statistics and recommendations
> for the parsers currently supported by Tika, e.g. <x> parser is resource
> intensive and <y> parser is time consuming, with supporting statistics
> published.
>
>
>
> Is there a common shared test data location (something similar to govdocs
> or test data maintained by Tika) against which parser testing is done?