RE: Solr fields for Microsoft files, image files, PDF, text files
bq: How do I get a list of all valid field names based on the file type

bq: You don't. At least I've never found any. Plus various document formats will allow custom meta-data fields so there's no definitive list.

It would be trivial to add field counts per MIME type to tika-eval. If you're interested in this, please open a ticket on Tika's JIRA.
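In the meantime, a rough per-type tally is easy to sketch outside of tika-eval. The Python below is only an illustration, not tika-eval code: it assumes you have already extracted metadata into one dict per document (for instance from Tika's JSON output) and that the detected type sits under the "Content-Type" key. The function name and the sample records are made up.

```python
from collections import Counter, defaultdict


def count_fields_by_mime(records):
    """Tally how often each metadata field name appears per MIME type."""
    counts = defaultdict(Counter)
    for metadata in records:
        content_type = metadata.get("Content-Type", "unknown")
        # Multi-valued metadata may arrive as a list; use the first value.
        if isinstance(content_type, list):
            content_type = content_type[0] if content_type else "unknown"
        # Drop parameters such as "; charset=UTF-8".
        mime = str(content_type).split(";")[0].strip()
        for field in metadata:
            counts[mime][field] += 1
    return {mime: dict(fields) for mime, fields in counts.items()}


# Hypothetical sample records standing in for real extracted metadata.
sample = [
    {"Content-Type": "application/pdf", "dc:creator": "A", "pdf:PDFVersion": "1.4"},
    {"Content-Type": "application/pdf", "dc:title": "T"},
    {"Content-Type": "text/plain; charset=UTF-8", "Content-Encoding": "UTF-8"},
]

for mime, fields in sorted(count_fields_by_mime(sample).items()):
    print(mime, fields)
```

Run over a representative pile of your own documents, the output shows which field names actually occur for each type, which is about as close to a "list of valid fields" as you are likely to get.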
Re: Solr fields for Microsoft files, image files, PDF, text files
Phillip - You may want to start with the example/files configuration that ships with Solr. It is specifically designed as a configuration (and UI!) for indexing rich files, and it does a bit more than the other examples: it pulls out acronyms, e-mail addresses, and URLs from text, and it also maps content types to friendlier human types (“image” instead of the whole gamut of image/* content-types).

Erik

> On Sep 24, 2017, at 10:55 PM, Phillip Wu wrote:
>
> Hi,
> I'm starting out with Solr on a Windows box.
>
> I want to index the following documents:
> doc;docx
> xls;xlsx
> ppt
> vsd
>
> pdf
> txt
>
> gif;jpeg;tiff
>
> I understand that Solr uses Apache Tika to read these file types and return an
> xml stream back to Solr.
> For Tika image processing, I've loaded Tesseract.
>
> To be able to search the documents, I need to define "fields" in a file
> called meta-schema.
>
> How do I get a list of all valid field names based on the file type? For
> example *.doc, what "fields" exist so I choose what to store?
>
> I'm assuming that for example, *.doc files there is metadata put into the
> file by Microsoft Word eg. author, date and "free form" text.
>
> So where is the list of valid fields per file type?
>
> Also how do I search the "free form" text for a word/pattern in the Solr
> search tool?
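To give a feel for the kind of enrichment described above, here is a small Python sketch. It is not the code used by example/files; the friendly labels, the lookup table, the regular expressions, and the function names are all assumptions chosen for illustration, and the patterns are deliberately simple.

```python
import re

# Hypothetical friendly labels for a few common content types.
FRIENDLY_TYPES = {
    "application/pdf": "pdf",
    "application/msword": "word",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "word",
    "application/vnd.ms-excel": "spreadsheet",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "spreadsheet",
    "application/vnd.ms-powerpoint": "presentation",
    "application/vnd.visio": "diagram",
}

# Simplified patterns; real-world e-mail and URL matching is messier.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def friendly_type(content_type):
    """Map a raw content type to a short human-readable label."""
    base = content_type.split(";")[0].strip().lower()
    if base in FRIENDLY_TYPES:
        return FRIENDLY_TYPES[base]
    major = base.partition("/")[0]
    # Collapse whole families such as image/* into a single label.
    if major in ("image", "audio", "video", "text"):
        return major
    return "other"


def extract_contacts(text):
    """Return sorted, de-duplicated e-mail addresses and URLs found in text."""
    emails = sorted(set(EMAIL_RE.findall(text)))
    urls = sorted(set(URL_RE.findall(text)))
    return emails, urls


print(friendly_type("image/tiff"))
print(extract_contacts("Contact bob@example.com or visit https://example.org/docs today"))
```

The values returned by such helpers would typically be stored in their own fields at index time so they can be used for faceting or filtering.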
Re: Solr fields for Microsoft files, image files, PDF, text files
bq: How do I get a list of all valid field names based on the file type

You don't. At least I've never found any. Plus various document formats will allow custom meta-data fields so there's no definitive list.

bq: Also how do I search the "free form" text for a word/pattern in the Solr search tool?

You put the extracted text (as opposed to meta-data) into an analyzed field and search that.

NOTE: Solr is a search engine. The closest thing to an OOB "Solr Search Tool" is the admin UI, which isn't intended to be an end-user facing app.

Here's some SolrJ code that'll let you explore the meta-data fields in various document types:
https://lucidworks.com/2012/02/14/indexing-with-solrj/

The example also indexes from a database, but you can pull out the RDBMS bits pretty easily.

Best,
Erick

On Sun, Sep 24, 2017 at 7:55 PM, Phillip Wu wrote:
>
> Hi,
> I'm starting out with Solr on a Windows box.
>
> I want to index the following documents:
> doc;docx
> xls;xlsx
> ppt
> vsd
>
> pdf
> txt
>
> gif;jpeg;tiff
>
> I understand that Solr uses Apache Tika to read these file types and return an
> xml stream back to Solr.
> For Tika image processing, I've loaded Tesseract.
>
> To be able to search the documents, I need to define "fields" in a file
> called meta-schema.
>
> How do I get a list of all valid field names based on the file type? For
> example *.doc, what "fields" exist so I choose what to store?
>
> I'm assuming that for example, *.doc files there is metadata put into the
> file by Microsoft Word eg. author, date and "free form" text.
>
> So where is the list of valid fields per file type?
>
> Also how do I search the "free form" text for a word/pattern in the Solr
> search tool?
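As a rough illustration of "put the extracted text into an analyzed field and search that", the Python sketch below builds the two requests involved without sending them: a Schema API "add-field" command and a /select query URL. The collection name "files", the field name "content", and the helper names are hypothetical, and the "text_general" field type is assumed to exist in your configset, as it does in Solr's default one.

```python
import json
from urllib.parse import urlencode


def add_text_field_command(name, field_type="text_general"):
    """Build a Schema API body that adds an analyzed, stored text field."""
    return json.dumps(
        {"add-field": {"name": name, "type": field_type, "indexed": True, "stored": True}}
    )


def select_url(base_url, collection, field, term, rows=10):
    """Build a /select URL that searches one field for a word or wildcard pattern."""
    params = urlencode({"q": f"{field}:{term}", "rows": rows})
    return f"{base_url}/solr/{collection}/select?{params}"


# Body to POST to http://localhost:8983/solr/files/schema (hypothetical collection).
print(add_text_field_command("content"))

# A single word, then a trailing-wildcard pattern, against the analyzed field.
print(select_url("http://localhost:8983", "files", "content", "tesseract"))
print(select_url("http://localhost:8983", "files", "content", "tess*"))
```

The same queries can be typed into the q box of the admin UI's Query screen, for example content:tesseract, which is handy for experimenting before you build a real front end.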
Solr fields for Microsoft files, image files, PDF, text files
Hi,
I'm starting out with Solr on a Windows box.

I want to index the following documents:
doc;docx
xls;xlsx
ppt
vsd

pdf
txt

gif;jpeg;tiff

I understand that Solr uses Apache Tika to read these file types and return an xml stream back to Solr.
For Tika image processing, I've loaded Tesseract.

To be able to search the documents, I need to define "fields" in a file called meta-schema.

How do I get a list of all valid field names based on the file type? For example *.doc, what "fields" exist so I choose what to store?

I'm assuming that for example, *.doc files there is metadata put into the file by Microsoft Word eg. author, date and "free form" text.

So where is the list of valid fields per file type?

Also how do I search the "free form" text for a word/pattern in the Solr search tool?