Phillip - You may want to start with the example/files configuration that ships with Solr. It is designed specifically as a configuration (and UI!) for indexing rich files, and it does a bit more than the other examples: it pulls out acronyms, e-mail addresses, and URLs from text, as well as what you’ve asked about, mapping content types to friendlier human types (“image” instead of the whole gamut of image/* content types).
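On the question of which fields exist per file type: there is no fixed list, since the metadata names depend on what Tika finds in each individual file. One way to see them is to send a sample file to the extracting request handler with extractOnly=true, which returns what Tika extracted without indexing anything. Below is a minimal Python sketch of that idea. The base URL, the core name "files", and the helper names are assumptions for illustration, and it assumes the core has the /update/extract handler configured and that the JSON response carries the metadata under a key ending in "_metadata".

```python
import json
import urllib.parse
import urllib.request


def extract_only_url(base_url, core):
    """Build the URL for an extract-only request (nothing gets indexed)."""
    params = urllib.parse.urlencode({"extractOnly": "true", "wt": "json"})
    return f"{base_url.rstrip('/')}/{core}/update/extract?{params}"


def metadata_field_names(response):
    """Collect metadata names from a parsed extract-only response.

    Assumption: metadata sits under a key ending in "_metadata", either as
    a flat [name, values, name, values, ...] list or as a name -> values map.
    """
    names = set()
    for key, value in response.items():
        if not key.endswith("_metadata"):
            continue
        if isinstance(value, dict):
            names.update(value)
        else:
            # Flat list form: the names are at the even positions.
            names.update(value[0::2])
    return sorted(names)


def list_fields(path, base_url="http://localhost:8983/solr", core="files"):
    """Send one file to Solr and return the metadata names Tika reported."""
    with open(path, "rb") as handle:
        request = urllib.request.Request(
            extract_only_url(base_url, core),
            data=handle.read(),
            headers={"Content-Type": "application/octet-stream"},
        )
    with urllib.request.urlopen(request) as reply:
        return metadata_field_names(json.load(reply))
```

Calling list_fields on one sample of each type (a .doc, a .pdf, and so on) against a running Solr should give you a practical list of names to choose from when deciding what to store.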
Erik

> On Sep 24, 2017, at 10:55 PM, Phillip Wu <phillip...@unsw.edu.au> wrote:
>
> Hi,
> I'm starting out with Solr on a Windows box.
>
> I want to index the following documents:
> doc;docx
> xls;xlsx
> ppt
> vsd
>
> pdf
> txt
>
> gif;jpeg;tiff
>
> I understand that solr uses Apache Tika to read these file types and return an
> xml stream back to Solr.
> For Tika image processing, I've loaded Tesseract.
>
> To be able to search the documents, I need to define "fields" in a file
> called meta-schema.
>
> How do I get a list of all valid field names based on the file type? For
> example *.doc, what "fields" exist so I choose what to store?
>
> I'm assuming that for example, *.doc files there is metadata put into the
> file by Microsoft Word eg.author,date and "free form" text.
>
> So where is the list of valid fields per file type?
>
> Also how do I search the "free form" text for a word/pattern in the Solr
> search tool?