I just tried it (admittedly using just a simple input obviously not a PDF file) and it works perfectly as I'd expect.
So a couple of things:

1> What happens if you highlight the content field? The text field should be fine.

2> Did you completely blow away your index whenever you changed the schema file? As in "rm -rf data" where the "data" directory is the parent of "index"?

3> I'd consider backing off a bit and starting with the standard "techproducts" example and getting highlighting to work _there_ first.

My guess is that there's something you're doing that I don't know to ask about specifically with the PDF conversions.

er...@baffled.com

On Thu, Dec 17, 2015 at 3:00 AM, Evert R. <evert.ra...@gmail.com> wrote:

> Hello Teague,
>
> Thanks for your reply and tip! I think Solr will give me a better result
> than just using Tika to read up my files and send to a Fulltext Index in my
> MySQL, which has the precise point of not highlighting the text snippets...
>
> So, I will keep on trying to fix Solr to my needs, and sure it works... I
> am missing something.
>
> Thanks again and I will keep on track.
>
> When I find the solution I will post all files and configs here for future
> references.
>
> Best regards,
>
> *Evert*
>
> 2015-12-17 6:11 GMT-02:00 Teague James <teag...@insystechinc.com>:
>
>> Erick's comments notwithstanding, there are some gaps in my understanding
>> of your precise situation. Here are a few things that weren't necessarily
>> obvious to me when I took my first try with Solr.
>>
>> Highlighting is the end result of a good hit. It is essentially formatting
>> applied to your hit. It is possible to get a hit without a highlight if
>> certain conditions exist.
>>
>> First, start by making sure you are indexing your target (a PDF file?)
>> correctly. Assuming you are indexing PDFs, are you extracting metadata
>> only or are you parsing the document with Tika?
>> If you want hits on the
>> contents of your PDF, then you have to parse it at index time and store
>> that. That was why I suggested just running some queries through the
>> interface and the URL to see what Solr actually captured from your indexed
>> PDF before worrying about how it looks on the screen.
>>
>> Next, you should look carefully at the Analyzer's output. Notice the
>> abbreviations to the left of the columns? Hover over those to see what
>> filter factory it is. When words are split into multiple columns at one of
>> those points, it indicates that the filter factory broke apart the word
>> while analyzing it. Do a search for the filter factories that you
>> find and read up on them. In my case "1a" was being split into 4 by a word
>> delimiter filter factory - "1a", "1", "a", "1a" which caused highlighting
>> to fail in my case while still getting a hit. It also caused erroneous hits
>> elsewhere. Adding some switches to the schema is all it took to correct
>> that for me. However, every case is different based on your needs. That is
>> why it is important to go through the analyzer and see if Solr's indexing
>> and querying are doing what you expect.
>>
>> If that looks good and you've got solid hits all the way down, then it is
>> time to start looking at your highlighter implementation in the index and
>> query analyzers that you are using. My original issue of not being able to
>> highlight phrases with one set of tags necessitated me switching to the
>> fast vector highlighter - which had its own requirements for certain
>> parameters to be set. Here again - going to the Solr docs and reading up on
>> the various highlighters will be helpful in most cases.
>>
>> Solr has a very steep learning curve. I've been using it for several years
>> and I still consider myself a noob. It can be a deep dive, but don't be
>> discouraged. Keep at it. Cheers!
>>
>> -Teague
>>
>> On Wed, Dec 16, 2015 at 8:54 PM, Evert R.
>> <evert.ra...@gmail.com> wrote:
>>
>> > Hi Erick and Teague,
>> >
>> > I found that when using the field 'text' it shows the pdf file result
>> > id:pdf1 in this case, like:
>> >
>> > http://localhost:8983/solr/techproducts/select?fq=id:pdf1&q=nietava
>> >
>> > but when highlight, using the text field...nothing comes up...
>> >
>> > http://localhost:8983/solr/techproducts/select?q=text:nietava&fq=id:pdf1&wt=json&indent=true&hl=true&hl.fl=text&hl.simple.pre=%3Cem%3E&hl.simple.post=%3C%2Fem%3E
>> >
>> > or even with the option
>> >
>> > f.text.hl.snippets=2 under the hl.fl field.
>> >
>> > I tried as well with the standard configuration, did it all over, reindexed
>> > a couple times... and still did not work.
>> >
>> > Also,
>> >
>> > Using the Analysis, it brings below information:
>> >
>> > ST
>> > text     raw_bytes               start  end  positionLength  type        position
>> > nietava  [6e 69 65 74 61 76 61]  0      7    1               <ALPHANUM>  1
>> > SF
>> > text     raw_bytes               start  end  positionLength  type        position
>> > nietava  [6e 69 65 74 61 76 61]  0      7    1               <ALPHANUM>  1
>> > LCF
>> > text     raw_bytes               start  end  positionLength  type        position
>> > nietava  [6e 69 65 74 61 76 61]  0      7    1               <ALPHANUM>  1
>> >
>> > Alphanumeric I think... so, it's 'string', right? would that be a problem?
>> > Should be some other indication?
>> >
>> > Thanks again!
>> >
>> > *Evert*
>> >
>> > 2015-12-16 21:09 GMT-02:00 Erick Erickson <erickerick...@gmail.com>:
>> >
>> > > I think you're still missing the critical bit. Highlighting is
>> > > completely separate from searching. In other words, you can search on
>> > > one field and highlight another. What field is searched is governed by
>> > > the "qf" parameter when using edismax and by the "df" parameter
>> > > configured in your request handler in solrconfig.xml. These defaults
>> > > are overridden when you do a "fielded search" like
>> > >
>> > > q=content:nietava
>> > >
>> > > So this: q=content:nietava&hl=true&hl.fl=content
>> > > is searching the "content" field.
>> > > The word you're looking for isn't in
>> > > the content field so naturally no docs are returned. And no
>> > > highlighting either.
>> > >
>> > > This: q=nietava&hl=true&hl.fl=content
>> > >
>> > > is searching somewhere else, thus getting the hit. We already know
>> > > that "nietava" is not in the content field because the first search
>> > > failed. You need to find out what field is being matched (probably
>> > > something like "text") and then try highlighting on _that_ field. Try
>> > > adding "debug=query" to the URL and look at the "parsedquery" section
>> > > of the return and you'll see what field(s) is/are actually being
>> > > searched against.
>> > >
>> > > NOTE: The field you highlight on _must_ have stored="true" in schema.xml.
>> > >
>> > > As to why "nietava" isn't being found in the content field, probably
>> > > you have some kind of analysis chain configured for that field that
>> > > isn't searching as you expect. See the admin/analysis page for some
>> > > insight into why that would be. The most frequent reason is that the
>> > > field is a "string" type which is not broken up into words. Another
>> > > possibility is that your analysis chain is leaving in the quotes or
>> > > something similar. As James says, looking at admin/analysis is a good
>> > > way to figure this out.
>> > >
>> > > I still strongly recommend you go from the stock techproducts example
>> > > and get familiar with how Solr (and highlighting) work before jumping
>> > > in and changing things. There are a number of ways things can be
>> > > mis-configured and trying to change several things at once is a fine
>> > > way to go mad. The admin UI>>schema browser is another way you can see
>> > > what kind of terms are _actually_ in your index in a particular field.
>> > >
>> > > Best,
>> > > Erick
>> > >
>> > > On Wed, Dec 16, 2015 at 12:26 PM, Teague James <teag...@insystechinc.com>
>> > > wrote:
>> > > > Sorry to hear that didn't work!
>> > > > Let me ask a couple of questions...
>> > > >
>> > > > Have you tried the analyzer inside of the Admin Interface? It has helped
>> > > > me sort out a number of highlighting issues in the past. To access it, go
>> > > > to your Admin interface, select your core, then select Analysis from the
>> > > > list of options on the left. In the analyzer, enter the term you are
>> > > > indexing in the top left (in other words the term in the document you are
>> > > > indexing that you expect to get a hit on) and right input fields. Select
>> > > > the field that it is destined for (in your case that would be 'content'),
>> > > > then hit analyze. Helps if you have a big screen!
>> > > >
>> > > > This will show you the impact of the various filter factories that you
>> > > > have engaged and their effect on whether or not a 'hit' is being generated.
>> > > > Hits are identified by a very faint highlight. (PSST... Developers... It
>> > > > would be really cool if the highlight color were more visible or
>> > > > customizable... Thanks y'all) If it looks like you're getting hits, but not
>> > > > getting highlighting, then open up a new tab with the Admin's query
>> > > > interface. Same place on the left as the analyzer. Replace the "*:*" with
>> > > > your search term (assuming you already indexed your document) and if
>> > > > necessary you can put something in the FQ like "id:123456" to target a
>> > > > specific record.
>> > > >
>> > > > Did you get a hit? If no, then it's not highlighting that's the issue.
>> > > > If yes, then try dumping this in your address bar (using your URL/IP,
>> > > > search term, and core name of course. The fq= is an example):
>> > > > http://[URL/IP]/solr/[CORE-NAME]/select?fq=id:123456&q="[SEARCH-TERM]"
>> > > >
>> > > > That will dump Solr's output to your browser where you can see exactly
>> > > > what is getting hit.
>> > > >
>> > > > Hope that helps! Let me know how it goes. Good luck.
>> > > >
>> > > > -Teague
>> > > >
>> > > > -----Original Message-----
>> > > > From: Evert R. [mailto:evert.ra...@gmail.com]
>> > > > Sent: Wednesday, December 16, 2015 1:46 PM
>> > > > To: solr-user <solr-user@lucene.apache.org>
>> > > > Subject: Re: Solr Basic Configuration - Highlight - Begginer
>> > > >
>> > > > Hi Teague!
>> > > >
>> > > > I configured the solrconfig.xml and schema.xml exactly the way you did,
>> > > > only replacing the word 'documentText' with 'content', as used by the
>> > > > techproducts sample, and I reindexed through:
>> > > >
>> > > > curl 'http://localhost:8983/solr/techproducts/update/extract?literal.id=pdf1&commit=true' \
>> > > >   -F "Emmanuel=@/home/solr/dados/teste/Emmanuel.pdf"
>> > > >
>> > > > with the same result.... no highlight in the response as below:
>> > > >
>> > > > "highlighting": { "pdf1": {} }
>> > > >
>> > > > =(
>> > > >
>> > > > Really... do not know what to do...
>> > > >
>> > > > Thanks for your time, if you have any more suggestion where I could be
>> > > > missing something... please let me know.
>> > > >
>> > > > Best regards,
>> > > >
>> > > > *Evert*
>> > > >
>> > > > 2015-12-16 15:30 GMT-02:00 Teague James <teag...@insystechinc.com>:
>> > > >
>> > > >> Hi Evert,
>> > > >>
>> > > >> I recently needed help with phrase highlighting and was pointed to the
>> > > >> FastVectorHighlighter which worked out great. I just made a change to
>> > > >> the configuration to add generateWordParts="0" and
>> > > >> generateNumberParts="0" so that searches for things like "1a" would
>> > > >> get highlighted correctly. You may or may not need that feature. You
>> > > >> can always remove them or change the value to "1" to switch them on
>> > > >> explicitly. Anyway, hope this helps!
>> > > >>
>> > > >> solrconfig.xml (partial snip)
>> > > >>   <requestHandler name="/select" class="solr.SearchHandler">
>> > > >>     <lst name="defaults">
>> > > >>       <str name="wt">xml</str>
>> > > >>       <str name="echoParams">explicit</str>
>> > > >>       <int name="rows">10</int>
>> > > >>       <str name="df">documentText</str>
>> > > >>       <str name="hl">on</str>
>> > > >>       <str name="hl.fl">text</str>
>> > > >>       <str name="hl.useFastVectorHighlighter">true</str>
>> > > >>       <str name="hl.snippets">100</str>
>> > > >>       <str name="hl.tag.pre"><![CDATA[<b>]]></str>
>> > > >>       <str name="hl.tag.post"><![CDATA[</b>]]></str>
>> > > >>     </lst>
>> > > >>   </requestHandler>
>> > > >>
>> > > >> schema.xml (partial snip)
>> > > >>   <field name="id" type="string" indexed="true" stored="true"
>> > > >>          required="true" multiValued="false" />
>> > > >>   <field name="documentText" type="text_general" indexed="true"
>> > > >>          multiValued="true" termVectors="true" termOffsets="true"
>> > > >>          termPositions="true" />
>> > > >>
>> > > >>   <fieldType name="text_general" class="solr.TextField"
>> > > >>              positionIncrementGap="100">
>> > > >>     <analyzer type="index">
>> > > >>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> > > >>       <filter class="solr.StopFilterFactory" ignoreCase="true"
>> > > >>               words="stopwords.txt" />
>> > > >>       <filter class="solr.WordDelimiterFilterFactory"
>> > > >>               catenateAll="1" preserveOriginal="1" generateNumberParts="0"
>> > > >>               generateWordParts="0" />
>> > > >>       <filter class="solr.SynonymFilterFactory"
>> > > >>               synonyms="index_synonyms.txt" ignoreCase="true" expand="true"/>
>> > > >>       <filter class="solr.LowerCaseFilterFactory"/>
>> > > >>       <filter class="solr.PorterStemFilterFactory"/>
>> > > >>       <filter class="solr.ApostropheFilterFactory"/>
>> > > >>     </analyzer>
>> > > >>     <analyzer type="query">
>> > > >>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> > > >>       <filter class="solr.WordDelimiterFilterFactory"
>> > > >>               catenateAll="1" preserveOriginal="1" generateWordParts="0" />
>> > > >>       <filter class="solr.StopFilterFactory" ignoreCase="true"
>> > > >>               words="stopwords.txt" />
>> > > >>       <filter class="solr.LowerCaseFilterFactory"/>
>> > > >>       <filter class="solr.ApostropheFilterFactory"/>
>> > > >>     </analyzer>
>> > > >>   </fieldType>
>> > > >>
>> > > >> -Teague
>> > > >>
>> > > >> From: Evert R. [mailto:evert.ra...@gmail.com]
>> > > >> Sent: Tuesday, December 15, 2015 6:25 AM
>> > > >> To: solr-user@lucene.apache.org
>> > > >> Subject: Solr Basic Configuration - Highlight - Begginer
>> > > >>
>> > > >> Hi there!
>> > > >>
>> > > >> It's my first installation, not sure if here is the right channel...
>> > > >>
>> > > >> Here are my steps:
>> > > >>
>> > > >> 1. Set up a basic install of solr 5.4.0
>> > > >>
>> > > >> 2. Create a new core through command line (bin/solr create -c test)
>> > > >>
>> > > >> 3. Post 2 files: 1 .docx and 2 .pdf (bin/post -c test /docs/test/)
>> > > >>
>> > > >> 4. Query over the browser and it brings the correct search, but it
>> > > >> does not show the part of the text I am querying, the highlight.
>> > > >>
>> > > >> I have already flagged the 'hl' option. But still it does not work...
>> > > >>
>> > > >> Example: I am looking for the word 'peace' in my pdf file (book) I
>> > > >> have 4 matches for this word, it shows me the book name (pdf file) but
>> > > >> does not bring which part of the text it has the word peace on it.
>> > > >>
>> > > >> I am probably missing some configuration in schema.xml, which is
>> > > >> missing from my folder.... /solr/server/solr/test/conf/
>> > > >>
>> > > >> Or even the solrconfig.xml...
>> > > >>
>> > > >> I have read a bunch of things about highlight check these files,
>> > > >> copied the standard schema.xml to my core/conf folder, but still it
>> > > >> does not bring the highlight.
>> > > >>
>> > > >> Attached a copy of my solrconfig.xml file.
>> > > >>
>> > > >> I am very sorry for this, probably, dumb and too basic question...
>> > > >> First time I see solr in live.
>> > > >>
>> > > >> Any help will be appreciated.
>> > > >>
>> > > >> Best regards,
>> > > >>
>> > > >> Evert Ramos
>> > > >>
>> > > >> mailto:evert.ra...@gmail.com
>> > > >>
>>
>> --
>> Kind regards,
>>
>> -Teague James
>> *Senior Web Applications Developer*
>> Insystech Inc.
>> teag...@insystechinc.com
>> (703) 508-0008 (Cell)
>>
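
For anyone landing on this thread later, here is a rough sketch of the two checks discussed in it: first ask Solr which field an unfielded query really searches (debug=query), then search and highlight on that same field. It only builds the request URLs and does not contact Solr. The host and core name are copied from the URLs quoted in the thread; the helper name `solr_select_url` is made up for illustration, and "text" as the matching field is an assumption you should confirm from the debug output.

```python
from urllib.parse import urlencode

# Assumed local Solr instance and core, as in the URLs quoted in this thread.
BASE = "http://localhost:8983/solr/techproducts/select"


def solr_select_url(q, hl_field=None, fq=None, debug=False):
    """Hypothetical helper: assemble a /select URL from a few query parameters."""
    params = [("q", q), ("wt", "json"), ("indent", "true")]
    if fq:
        # Filter query, e.g. to narrow the result to a single document id.
        params.append(("fq", fq))
    if hl_field:
        # Highlighting is requested separately from the field being searched.
        params += [("hl", "true"), ("hl.fl", hl_field)]
    if debug:
        # debug=query makes Solr report how the query was parsed,
        # which shows the field(s) actually searched.
        params.append(("debug", "query"))
    return BASE + "?" + urlencode(params)


# Check 1: which field does the unfielded query search?
print(solr_select_url("nietava", fq="id:pdf1", debug=True))

# Check 2: search and highlight on the same field (assumed here to be "text").
print(solr_select_url("text:nietava", hl_field="text", fq="id:pdf1"))
```

Paste either URL into a browser or pass it to curl. If the second one still returns an empty "highlighting" entry, the stored="true" note above is the next thing to check for that field.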