Checking for 0-byte files is one option.  The other is to configure logging 
to capture exceptions.  I’ve attached the config files and the shell script 
that I use for our large-scale regression testing here: 
https://wiki.apache.org/tika/TikaBatchUsage?action=AttachFile&do=view&target=tika-batch-sh.zip

To run those, unzip the archive, put tika-app.jar in the bin/ directory, 
update the shell script with your <input_dir> and <output_dir>, and you 
should be good to go.  You may need to create a “logs” directory.  Exceptions 
will be recorded in batch-process-warn.log, with the original file names 
included along with the stack traces.
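
For the first option (checking for 0-byte files), here is a minimal sketch in 
Python that walks the output directory and lists every empty file.  The 
function name find_empty_outputs is only illustrative, and the script assumes 
that an empty output file means the parse of the corresponding input failed, 
as described in the question below.  The warn log is the more reliable source 
for the actual exception.

```python
import os
import sys


def find_empty_outputs(output_dir):
    """Return the sorted paths of all zero-byte files under output_dir."""
    empty = []
    # os.walk descends into subdirectories, so nested output files are covered.
    for root, _dirs, files in os.walk(output_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)


# Only print a report when an output directory is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    for path in find_empty_outputs(sys.argv[1]):
        print(path)
```

Saved under a name of your choice (say, list_empty.py), it could be run as 
python list_empty.py <output_dir> after the batch run finishes.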

From: kostali hassan [mailto:med.has.kost...@gmail.com]
Sent: Friday, July 15, 2016 5:17 AM
To: user@tika.apache.org
Subject: detect corrupt file and build a list of them before indexing in solr

I am looking to index MS Word and PDF files by uploading data with Solr Cell 
using Apache Tika.
I would like to use Tika to detect corrupt files before indexing and get a 
list of the corrupted files, if that is possible.
I tried running java -jar tika-app.jar <input_dir> <output_dir>.  In the 
output_dir I get all the files of <input_dir> in XML format, and all the 
corrupt files come out with a size of 0 KB (empty).
