On Thu, 2 Aug 2012, Alexander Cougarman wrote:
Hi. Does the latest version of Tika index text in these file types? - Office 2007/2010 file types of DOCX, XLSX, PPTX
Yes (though the few tiny bits of new functionality introduced in 2010 will be skipped over)
- MHT file (MHTML Document)
Not sure. How close is this to a regular HTML file?
This page helped on many of the file formats, but wanted to clarify: http://tika.apache.org/1.2/formats.html
Often the best way to check is to grab the tika-app jar and try a few sample files with it.
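If you have more than a couple of files to try, a small wrapper script can save some typing. What follows is only a rough sketch in Python, not anything shipped with Tika: the jar filename tika-app-1.2.jar and the helper names build_tika_command and run_tika are my own assumptions, so adjust the path to wherever you saved the jar. It relies on the tika-app command line options --detect (report the detected media type) and --text (print the extracted plain text).

```python
import subprocess
import sys

# Assumption: the tika-app jar sits in the current directory under this name.
TIKA_JAR = "tika-app-1.2.jar"


def build_tika_command(sample, mode="--text", jar=TIKA_JAR):
    """Return the argument list for one tika-app run on a single file."""
    return ["java", "-jar", jar, mode, sample]


def run_tika(sample, mode="--text", jar=TIKA_JAR):
    """Run tika-app on a file and return (exit code, captured stdout)."""
    result = subprocess.run(
        build_tika_command(sample, mode, jar),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout


if __name__ == "__main__":
    # Usage: python try_tika.py sample.docx sample.xlsx sample.mht
    for sample in sys.argv[1:]:
        _, detected = run_tika(sample, "--detect")
        code, text = run_tika(sample, "--text")
        print(f"{sample}: type={detected.strip()!r}, "
              f"exit={code}, extracted {len(text)} characters")
```

A file that reports a sensible type and a non-trivial amount of extracted text is probably handled. For the MHT case in particular, the detected type should give a hint as to which parser, if any, is being picked.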
Nick
