Y, the log tells you that the input directory wasn’t specified correctly:
1375 2016-07-15 17:33:17,354 [Thread-2] INFO org.apache.tika.batch.BatchProcessDriverCLI - BatchProcess: java.lang.RuntimeException: Crawler couldn't find this directory:D:\tika_batch_config\test

From: kostali hassan [mailto:med.has.kost...@gmail.com]
Sent: Friday, July 15, 2016 12:40 PM
To: user@tika.apache.org
Subject: Re: detect corrupt file and build a list of them before indexing in solr

Only -JXmx1g works, the inputDIR is empty, and I get these empty files in the logs:

batch-driver-warn.log
batch-process-warn.log
tika-batch-pdfbox.log

and these attached files.

2016-07-15 16:36 GMT+01:00 Allison, Timothy B. <talli...@mitre.org>:

Try changing the max heap to something that will work on your computer:
-JXmx5g
To (say):
-JXmx1g

From: kostali hassan [mailto:med.has.kost...@gmail.com]
Sent: Friday, July 15, 2016 11:27 AM
To: user@tika.apache.org
Subject: Re: detect corrupt file and build a list of them before indexing in solr

I get these files in the logs, and when I run the script it doesn't finish; it restarts all the time.

2016-07-15 13:19 GMT+01:00 Allison, Timothy B. <talli...@mitre.org>:

Sorry, you’ll get 0 byte files for an error that caused Tika batch to do a restart (hang/oom); and depending on cause, you may get an error logged in batch-process-error.xml. If your OS kills the process or something truly catastrophic happens, the only trace you have is the 0 byte file.

For regular caught exceptions, you can look in the .json file (key: TikaCoreProperties.TIKA_META_EXCEPTION_PREFIX+"runtime") for the stack trace, or you can look in the logs as described below.

From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Friday, July 15, 2016 8:11 AM
To: user@tika.apache.org
Subject: RE: detect corrupt file and build a list of them before indexing in solr

Checking for 0 byte files is one option.
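For what it's worth, the 0 byte check can be scripted. What follows is only a minimal sketch, not part of Tika: the function name find_empty_outputs is made up for illustration, and it assumes the output directory mirrors the input tree so that an empty output file points back at a problem input file.

```python
from pathlib import Path


def find_empty_outputs(output_dir):
    """Return sorted paths, relative to output_dir, of all zero-byte files.

    Hypothetical helper: assumes Tika batch has already written its
    results under output_dir, one output file per input file.
    """
    root = Path(output_dir)
    return sorted(
        str(p.relative_to(root))
        for p in root.rglob("*")          # walk the whole output tree
        if p.is_file() and p.stat().st_size == 0
    )
```

Calling find_empty_outputs on your <output_dir> and printing each entry gives the list of suspect files. Keep in mind that a 0 byte file only says that something went badly wrong (hang, oom, killed process); it does not say what.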
The other option is to configure the logs to capture exceptions. I’ve attached the config files and the shell script that I use when running our large scale regression testing here: https://wiki.apache.org/tika/TikaBatchUsage?action=AttachFile&do=view&target=tika-batch-sh.zip

To run those, unzip the folder, put the tika-app.jar in the bin/ directory, update the shell script for your <input_dir> and your <output_dir> and you should be good to go. You may need to create a “logs” directory. Exceptions will be recorded in the batch-process-warn.log, and original file names are included along with stack traces.

From: kostali hassan [mailto:med.has.kost...@gmail.com]
Sent: Friday, July 15, 2016 5:17 AM
To: user@tika.apache.org
Subject: detect corrupt file and build a list of them before indexing in solr

I'm looking to index MS Word and PDF files by uploading data with Solr Cell using Apache Tika. I would like to use Tika to detect corrupt files before indexing and get a list of the corrupted files, if that's possible. I tried running

java -jar tika-app.jar <input_dir> <output_dir>

and I get in the output_dir all the files of <input_dir> in XML format, and all the corrupt files with size 0 KB (empty).
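For the .json route mentioned earlier in the thread, here is a rough sketch of collecting caught exceptions from the output files. Several things here are assumptions rather than anything confirmed above: that the constant TikaCoreProperties.TIKA_META_EXCEPTION_PREFIX+"runtime" resolves to the literal string "X-TIKA:EXCEPTION:runtime", that each .json output holds either one metadata object or a list of them, and the function name find_exceptions itself. Check one of your own output files before relying on it.

```python
import json
from pathlib import Path

# ASSUMPTION: literal value of TIKA_META_EXCEPTION_PREFIX + "runtime".
EXCEPTION_KEY = "X-TIKA:EXCEPTION:runtime"


def find_exceptions(output_dir, key=EXCEPTION_KEY):
    """Map each .json output (path relative to output_dir) to its stack trace.

    Hypothetical helper: zero-byte and unparseable files are skipped,
    since they carry no recorded exception.
    """
    root = Path(output_dir)
    failures = {}
    for path in sorted(root.rglob("*.json")):
        if path.stat().st_size == 0:
            continue
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue
        # Accept a single metadata object or a list of them.
        records = data if isinstance(data, list) else [data]
        for record in records:
            if isinstance(record, dict) and key in record:
                failures[str(path.relative_to(root))] = record[key]
                break
    return failures
```

Together with the 0 byte check this gives two lists: files that failed with a caught exception (stack trace available) and files that failed catastrophically (nothing but the empty file).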