Hi,

I want to search Solr for server names in a set of Microsoft Word documents, PDFs, and image files such as JPG and GIF. Server names are given by the following regular expressions (regex):

    INFP[a-zA-Z0-9]{3,9}
    TRKP[a-zA-Z0-9]{3,9}
    PLCP[a-zA-Z0-9]{3,9}
    SQRP[a-zA-Z0-9]{3,9}
    ....
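To make the intended matches concrete outside of Solr, here is a minimal Python sketch. It assumes the character class is meant to be `[a-zA-Z0-9]`, covers only the four prefixes listed (the real list is longer), and adds word boundaries so that partial tokens are not picked up. `find_server_names` is a hypothetical helper name, not part of Solr or Tika.

```python
import re

# Only the four prefixes shown in the question; the full list is longer.
PREFIXES = ("INFP", "TRKP", "PLCP", "SQRP")

# Prefix followed by 3 to 9 alphanumeric characters, as a whole word.
SERVER_NAME = re.compile(
    r"\b(?:" + "|".join(PREFIXES) + r")[a-zA-Z0-9]{3,9}\b"
)


def find_server_names(text):
    """Return every server name found in the extracted text, in order."""
    return SERVER_NAME.findall(text)


sample = "Backups for INFPWSV01 and PLCPLDB01 run nightly; INFPAB is too short."
print(find_server_names(sample))  # ['INFPWSV01', 'PLCPLDB01']
```

This only checks the pattern against plain text; it says nothing about how Solr tokenizes or matches the indexed content.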
Problem
=======

I want to get the text in the documents that matches the regex, e.g. INFPWSV01, PLCPLDB01.

I've indexed the files using Solr/Tika/Tesseract with the default schema.

I've used the highlight search tool, with hl ticked and hl.usePhraseHighlighter ticked.

Solr only returns what is presumably the metadata, such as the filename, for the files containing the pattern(s).

Questions
=========

1. Would I have to modify the managed schema?
2. If so, would I have to store the file content in the schema?
3. If so, is this the way to do it?

   a. In solrconfig.xml (inside my "core"):

          <requestHandler class="solr.extraction.ExtractingRequestHandler" name="/update/extract" startup="lazy">
            <lst name="defaults">
              <str name="lowernames">true</str>
              <str name="fmap.meta">ignored_</str>
              <str name="fmap.content">_text_</str>
            </lst>
          ...

   b. Remove this line, as I want the metadata:

          <str name="fmap.meta">ignored_</str>

   c. In the managed schema, change stored to "true" for this field:

          <field name="_text_" type="text_general" multiValued="true" indexed="true" stored="false"/>

      using:

          curl -X POST -H 'Content-type:application/json' --data-binary '{
            "replace-field":{
              "name":"_text_",
              "type":"text_general",
              "multiValued":true,
              "indexed":true,
              "stored":true
            }
          }' http://localhost:8983/api/cores/gettingstarted/schema
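Hand-written JSON inside a curl command is easy to get subtly wrong, since a single missing comma is enough for the request to be rejected. As a rough sketch of the same replace-field request, the payload could be built as a Python dict and serialized with `json.dumps`. The URL and field attributes are copied from the curl command in step c; the function names `build_replace_field_payload` and `post_schema_change` are hypothetical, and the POST itself is untested here because it needs a running Solr instance.

```python
import json
import urllib.request

# Endpoint copied from the curl command in step c; adjust host and core name as needed.
SCHEMA_URL = "http://localhost:8983/api/cores/gettingstarted/schema"


def build_replace_field_payload(stored=True):
    """Return the replace-field command for the _text_ field as a dict."""
    return {
        "replace-field": {
            "name": "_text_",
            "type": "text_general",
            "multiValued": True,
            "indexed": True,
            "stored": stored,
        }
    }


def post_schema_change(payload, url=SCHEMA_URL):
    """POST the command to the schema endpoint and return the decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Serializing through json.dumps guarantees well-formed JSON.
print(json.dumps(build_replace_field_payload(), indent=2))
```

With Solr running, `post_schema_change(build_replace_field_payload())` would send the request. A schema change like this generally applies only to documents indexed afterwards, so the files would presumably need to be re-indexed before `_text_` content can be returned or highlighted.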