Phillip - You may want to start with example/files, which ships with Solr.
It is a configuration (and UI!) designed specifically for indexing rich
files, and it goes a bit further than the other examples: it pulls out
acronyms, e-mail addresses, and URLs from text, as well as what you’ve asked
about, mapping content types to friendlier human-readable types (“image”
instead of the whole gamut of image/* content-types).
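
To give a rough idea of the content-type mapping, here is a minimal Python
sketch. It is not the code example/files actually uses; the table, the labels,
and the friendly_type() name are all made up for illustration, and the MIME
strings are assumptions about what Tika would report for your file types.

```python
# Hypothetical lookup table: exact MIME type -> friendlier label.
# These labels are illustrative assumptions, not Solr's own values.
FRIENDLY_TYPES = {
    "application/pdf": "pdf",
    "application/msword": "word",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "word",
    "application/vnd.ms-excel": "spreadsheet",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "spreadsheet",
    "application/vnd.ms-powerpoint": "presentation",
    "application/vnd.visio": "diagram",
}


def friendly_type(content_type: str) -> str:
    """Map a raw content type to a short human-readable label."""
    # Drop parameters such as "; charset=UTF-8" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime in FRIENDLY_TYPES:
        return FRIENDLY_TYPES[mime]
    # Fall back to the top-level type, so image/gif, image/jpeg and
    # image/tiff all collapse to "image".
    major = mime.split("/", 1)[0]
    return major or "unknown"
```

With something like this, every image/* value ends up in one "image" bucket
that is much easier to facet on than the raw content types.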

        Erik

> On Sep 24, 2017, at 10:55 PM, Phillip Wu <phillip...@unsw.edu.au> wrote:
> 
> 
> Hi,
> I'm starting out with Solr on a Windows box.
> 
> I want to index the following documents:
> doc;docx
> xls;xlsx
> ppt
> vsd
> 
> pdf
> txt
> 
> gif;jpeg;tiff
> 
> I understand that Solr uses Apache Tika to read these file types and return an 
> XML stream back to Solr.
> For Tika image processing, I've loaded Tesseract.
> 
> To be able to search the documents, I need to define "fields" in a file 
> called meta-schema.
> 
> How do I get a list of all valid field names based on the file type? For 
> example *.doc, what "fields" exist so I choose what to store?
> 
> I'm assuming that for *.doc files, for example, there is metadata put into 
> the file by Microsoft Word, e.g. author and date, plus "free form" text.
> 
> So where is the list of valid fields per file type?
> 
> Also how do I search the "free form" text for a word/pattern in the Solr 
> search tool?
> 
