Hi team,

I am working on an application in which I would like to whitelist the file
types supported by Tika and discard all other files, to avoid unknown
behaviour and memory leaks. I am currently referring to
https://cwiki.apache.org/confluence/display/TIKA/File+Types+and+Dependencies.
However, when I tested JSON and log files, their content was extracted even
though neither type is listed on that page. Is the list of file extensions
given there for the standard package complete, or only partial?

I also came across a function that lists the supported MIME types for a
particular parser. How would this approach behave if I submit an untrusted or
unsupported file type to Tika to determine the parser and its supported MIME
types? Would it try to load the file contents into memory? Is there a chance
of a memory leak when we only detect the MIME type of a file using Tika's
detect method?
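
For context, here is a minimal sketch of the allowlist check I have in mind.
It uses only the Java standard library and assumes the MIME type string has
already been obtained elsewhere (for example from Tika's detect method). The
class name MimeAllowlist, its method names, and the entries in the set are
hypothetical placeholders, not anything taken from Tika itself.

```java
import java.util.Locale;
import java.util.Set;

final class MimeAllowlist {
    // Hypothetical allowlist: these entries are examples only and would be
    // replaced by whatever set of types we decide to accept.
    private static final Set<String> ALLOWED = Set.of(
            "application/pdf",
            "text/plain",
            "application/json");

    // Drop parameters such as "; charset=UTF-8" and normalise case so that
    // "Text/Plain; charset=UTF-8" compares equal to "text/plain".
    static String normalize(String mimeType) {
        if (mimeType == null) {
            return "";
        }
        int semicolon = mimeType.indexOf(';');
        String base = semicolon >= 0 ? mimeType.substring(0, semicolon) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // True only if the detected type is on the allowlist; anything else
    // would be discarded before being handed to a parser.
    static boolean isAllowed(String detectedMimeType) {
        return ALLOWED.contains(normalize(detectedMimeType));
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("text/plain; charset=UTF-8"));
        System.out.println(isAllowed("application/x-msdownload"));
    }
}
```

The idea is to run detection first and pass a file to a parser only when
isAllowed returns true, which is why the cost and safety of the detection
step itself matter for my question.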

Thanks,
Neha




