Solr fields for Microsoft files, image files, PDF, text files

Phillip Wu Sun, 24 Sep 2017 19:56:50 -0700

Hi,
I'm starting out with Solr on a Windows box.

I want to index the following documents:
doc;docx
xls;xlsx
ppt
vsd


pdf
txt

gif;jpeg;tiff

I understand that Solr uses Apache Tika to read these file types and return an 
XML stream back to Solr.
For Tika image processing (OCR), I've loaded Tesseract.

To be able to search the documents, I need to define "fields" in a file called 
managed-schema.

How do I get a list of all valid field names for a given file type? For 
example, for *.doc, what "fields" exist so that I can choose what to store?

I'm assuming that, for example, *.doc files contain metadata put into the file 
by Microsoft Word (e.g. author, date) as well as "free form" text.

So where is the list of valid fields per file type?
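
One way to explore this might be to ask Solr's extracting request handler to 
run Tika on a sample file without indexing it (extractOnly=true) and then read 
the metadata names out of the response, since the names Tika reports vary from 
file to file. The sketch below is only an illustration under assumptions: the 
core name "mycore", the localhost URL and the function names are hypothetical, 
the /update/extract handler has to be enabled in solrconfig.xml, and the 
response is assumed to carry the metadata under a key ending in "_metadata".

```python
import json
import urllib.request
from urllib.parse import urlencode


def extract_only_url(base_url, core):
    """Build an extract-only URL (nothing is indexed by this request)."""
    params = urlencode(
        {"extractOnly": "true", "extractFormat": "text", "wt": "json"}
    )
    return f"{base_url}/{core}/update/extract?{params}"


def metadata_field_names(response):
    """Collect metadata names from a parsed extract-only response.

    Assumption: metadata sits under a key ending in "_metadata", either as
    a dict or as a flat list of alternating names and value lists.
    """
    names = []
    for key, value in response.items():
        if not key.endswith("_metadata"):
            continue
        if isinstance(value, dict):
            names.extend(value.keys())
        else:
            # Flat layout: [name, values, name, values, ...]
            names.extend(value[0::2])
    return sorted(set(names))


def fetch_metadata_names(path, base_url="http://localhost:8983/solr",
                         core="mycore"):
    """Send one file to Solr and return the metadata names Tika reported."""
    with open(path, "rb") as handle:
        body = handle.read()
    request = urllib.request.Request(
        extract_only_url(base_url, core),
        data=body,
        headers={"Content-Type": "application/octet-stream"},
    )
    with urllib.request.urlopen(request) as reply:
        return metadata_field_names(json.load(reply))
```

Running fetch_metadata_names on one sample file per type (a .doc, a .pdf, a 
.tiff and so on) would give a practical list of candidate field names to 
define in the schema.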

Also, how do I search the "free form" text for a word or pattern in the Solr 
search tool?
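
For context, a search over the extracted body text would presumably be a 
normal query against whichever field the Tika content ends up in. A minimal 
sketch, assuming (hypothetically) that the content is mapped to a field named 
"_text_" and that the core is called "mycore"; both names and the helper 
function are placeholders, not something confirmed for this setup.

```python
from urllib.parse import urlencode


def search_url(base_url, core, phrase, field="_text_"):
    """Build a /select URL that searches one field for an exact phrase.

    The field name "_text_" is an assumption; use whatever field the
    extracted document body is actually stored or indexed in.
    """
    # Escape backslashes and double quotes so the phrase stays one quoted term.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    params = urlencode({"q": f'{field}:"{escaped}"', "wt": "json"})
    return f"{base_url}/{core}/select?{params}"


if __name__ == "__main__":
    print(search_url("http://localhost:8983/solr", "mycore", "annual report"))
```

The same q value (for example _text_:"annual report") could also be typed 
into the q box of the Query screen in the Solr admin UI.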
