All,

I recently blogged about some of the work we're doing with a large-scale regression corpus to make Tika, POI and PDFBox more robust and to identify regressions before release. If you'd like to chip in with recommendations, requests or Hadoop/Spark clusters (why not shoot for the stars), please do!

http://openpreservation.org/blog/2016/10/04/apache-tikas-regression-corpus-tika-1302/

Many thanks, again, to Rackspace for our VM and to Common Crawl and govdocs1 for most of our files!

Cheers,

Tim