Re: Solr Basic Configuration - Highlight - Begginer

Erick Erickson Thu, 17 Dec 2015 09:06:11 -0800

I just tried it (admittedly with a simple input, obviously not a PDF
file) and it works perfectly, just as I'd expect.


So a couple of things:
1> what happens if you highlight the content field? The text field
should be fine.
2> Did you completely blow away your index whenever you changed the
schema file? As in "rm -rf data" where the "data" directory is the
parent of "index"?
3> I'd consider backing off a bit and starting with the standard
"techproducts" example and getting highlighting to work _there_ first. My
guess is that there's something you're doing that I don't know to ask
about specifically with the PDF conversions.
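
For 1> and 3>, here is a minimal sketch (Python, standard library only) of the kind of request I mean: search one field explicitly, highlight that same field, and add debug=query so the parsed query shows which field is really searched. The helper names highlight_url and docs_without_snippets are made up for illustration. The host, core, term and id are the ones used elsewhere in this thread, so adjust them to your setup.

```python
from urllib.parse import urlencode

# Select handler of the techproducts core, as used elsewhere in this thread.
SOLR_SELECT = "http://localhost:8983/solr/techproducts/select"


def highlight_url(term, field, doc_id=None):
    """Build a select URL that searches and highlights the same field."""
    params = {
        "q": f"{field}:{term}",  # fielded search, so we know which field matched
        "wt": "json",
        "indent": "true",
        "hl": "true",
        "hl.fl": field,          # highlight the field we searched
        "debug": "query",        # adds the parsed query to the response
    }
    if doc_id is not None:
        params["fq"] = f"id:{doc_id}"  # restrict to a single document
    return f"{SOLR_SELECT}?{urlencode(params)}"


def docs_without_snippets(response):
    """Return ids listed under "highlighting" that have no snippets at all."""
    highlighting = response.get("highlighting", {})
    return [doc_id for doc_id, fields in highlighting.items() if not fields]


url = highlight_url("nietava", "content", doc_id="pdf1")
print(url)

# The empty highlighting section reported in this thread:
print(docs_without_snippets({"highlighting": {"pdf1": {}}}))
```

If the document comes back but its id shows up in docs_without_snippets, the search is matching and only the highlighting step is coming up empty, which points at the field definition (stored or not, analysis chain) rather than at the query.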

er...@baffled.com

On Thu, Dec 17, 2015 at 3:00 AM, Evert R. <evert.ra...@gmail.com> wrote:
> Hello Teague,
>
> Thanks for your reply and tip! I think Solr will give me a better result
> than just using Tika to read my files and send them to a Fulltext Index in
> my MySQL, which has the specific drawback of not highlighting the text
> snippets...
>
> So, I will keep on trying to fit Solr to my needs. I'm sure it works... I
> must be missing something.
>
> Thanks again and I will keep on track.
>
> When I find the solution I will post all files and configs here for future
> reference.
>
> Best regards,
>
> *Evert*
>
> 2015-12-17 6:11 GMT-02:00 Teague James <teag...@insystechinc.com>:
>
>> Erick's comments notwithstanding, there are some gaps in my understanding
>> of your precise situation. Here are a few things that weren't necessarily
>> obvious to me when I took my first try with Solr.
>>
>> Highlighting is the end result of a good hit. It is essentially formatting
>> applied to your hit. It is possible to get a hit without a highlight if
>> certain conditions exist.
>>
>> First, start by making sure you are indexing your target (a PDF file?)
>> correctly. Assuming you are indexing PDFs, are you extracting metadata
>> only or are you parsing the document with Tika? If you want hits on the
>> contents of your PDF, then you have to parse it at index time and store
>> that. That was why I suggested just running some queries through the
>> interface and the URL to see what Solr actually captured from your indexed
>> PDF before worrying about how it looks on the screen.
>>
>> Next, you should look carefully at the Analyzer's output. Notice the
>> abbreviations to the left of the columns? Hover over those to see what
>> filter factory it is. When words are split into multiple columns at one of
>> those points, it indicates that the filter factory broke apart the word
>> while analyzing it. Do a search for the filter factories that you
>> find and read up on them. In my case "1a" was being split into 4 by a word
>> delimiter filter factory - "1a", "1", "a", "1a" - which caused highlighting
>> to fail while still getting a hit. It also caused erroneous hits
>> elsewhere. Adding some switches to the schema is all it took to correct
>> that for me. However, every case is different based on your needs. That is
>> why it is important to go through the analyzer and see if Solr's indexing
>> and querying are doing what you expect.
>>
>> If that looks good and you've got solid hits all the way down, then it is
>> time to start looking at your highlighter implementation in the index and
>> query analyzers that you are using. My original issue of not being able to
>> highlight phrases with one set of tags necessitated me switching to the
>> fast vector highlighter - which had its own requirements for certain
>> parameters to be set. Here again - going to the Solr docs and reading up on
>> the various highlighters will be helpful in most cases.
>>
>> Solr has a very steep learning curve. I've been using it for several years
>> and I still consider myself a noob. It can be a deep dive, but don't be
>> discouraged. Keep at it. Cheers!
>>
>> -Teague
>>
>> On Wed, Dec 16, 2015 at 8:54 PM, Evert R. <evert.ra...@gmail.com> wrote:
>>
>> > Hi Erick and Teague,
>> >
>> >
>> > I found that when using the field 'text' it shows the pdf file result
>> > id:pdf1 in this case, like:
>> >
>> > http://localhost:8983/solr/techproducts/select?fq=id:pdf1&q=nietava
>> >
>> > but when I highlight using the text field, nothing comes up:
>> >
>> > http://localhost:8983/solr/techproducts/select?q=text:nietava&fq=id:pdf1&wt=json&indent=true&hl=true&hl.fl=text&hl.simple.pre=%3Cem%3E&hl.simple.post=%3C%2Fem%3E
>> >
>> > or even with the option
>> >
>> > f.text.hl.snippets=2 under the hl.fl field.
>> >
>> >
>> > I also tried with the standard configuration, did it all over, and
>> > reindexed a couple of times... and it still did not work.
>> >
>> > Also,
>> >
>> > Using the Analysis page, I get the information below:
>> >
>> > Stage  text     raw_bytes               start  end  positionLength  type        position
>> > ST     nietava  [6e 69 65 74 61 76 61]  0      7    1               <ALPHANUM>  1
>> > SF     nietava  [6e 69 65 74 61 76 61]  0      7    1               <ALPHANUM>  1
>> > LCF    nietava  [6e 69 65 74 61 76 61]  0      7    1               <ALPHANUM>  1
>> >
>> >
>> > Alphanumeric, I think... so it's 'string', right? Would that be a
>> > problem? Should there be some other indication?
>> >
>> >
>> > Thanks again!
>> >
>> >
>> > *Evert*
>> >
>> > 2015-12-16 21:09 GMT-02:00 Erick Erickson <erickerick...@gmail.com>:
>> >
>> > > I think you're still missing the critical bit. Highlighting is
>> > > completely separate from searching. In other words, you can search on
>> > > one field and highlight another. What field is searched is governed by
>> > > the "qf" parameter when using edismax and by the "df" parameter
>> > > configured in your request handler in solrconfig.xml. These defaults
>> > > are overridden when you do a "fielded search" like
>> > >
>> > > q=content:nietava
>> > >
>> > > So this: q=content:nietava&hl=true&hl.fl=content
>> > > is searching the "content" field. The word you're looking for isn't in
>> > > the content field so naturally no docs are returned. And no
>> > > highlighting either.
>> > >
>> > > This: q=nietava&hl=true&hl.fl=content
>> > >
>> > > is searching somewhere else, thus getting the hit. We already know
>> > > that "nietava" is not in the content field because the first search
>> > > failed. You need to find out what field is being matched (probably
>> > > something like "text") and then try highlighting on _that_ field. Try
>> > > adding "debug=query" to the URL and look at the "parsed_query" section
>> > > of the return and you'll see what field(s) is/are actually being
>> > > searched against.
>> > >
>> > > NOTE: The field you highlight on _must_ have stored="true" in
>> > > schema.xml.
>> > >
>> > > As to why "nietava" isn't being found in the content field, probably
>> > > you have some kind of analysis chain configured for that field that
>> > > isn't searching as you expect. See the admin/analysis page for some
>> > > insight into why that would be. The most frequent reason is that the
>> > > field is a "string" type which is not broken up into words. Another
>> > > possibility is that your analysis chain is leaving in the quotes or
>> > > something similar. As James says, looking at admin/analysis is a good
>> > > way to figure this out.
>> > >
>> > > I still strongly recommend you go from the stock techproducts example
>> > > and get familiar with how Solr (and highlighting) work before jumping
>> > > in and changing things. There are a number of ways things can be
>> > > mis-configured and trying to change several things at once is a fine
>> > > way to go mad. The admin UI>>schema browser is another way you can see
>> > > what kind of terms are _actually_ in your index in a particular field.
>> > >
>> > > Best,
>> > > Erick
>> > >
>> > >
>> > >
>> > >
>> > > On Wed, Dec 16, 2015 at 12:26 PM, Teague James <teag...@insystechinc.com> wrote:
>> > > > Sorry to hear that didn't work! Let me ask a couple of questions...
>> > > >
>> > > > Have you tried the analyzer inside of the Admin Interface? It has
>> > > > helped me sort out a number of highlighting issues in the past. To
>> > > > access it, go to your Admin interface, select your core, then select
>> > > > Analysis from the list of options on the left. In the analyzer, enter
>> > > > the term you are indexing (in other words, the term in the document
>> > > > that you expect to get a hit on) in the top left and right input
>> > > > fields. Select the field that it is destined for (in your case that
>> > > > would be 'content'), then hit analyze. Helps if you have a big screen!
>> > > >
>> > > > This will show you the impact of the various filter factories that
>> > > > you have engaged and their effect on whether or not a 'hit' is being
>> > > > generated. Hits are identified by a very faint highlight. (PSST...
>> > > > Developers... It would be really cool if the highlight color were
>> > > > more visible or customizable... Thanks y'all) If it looks like you're
>> > > > getting hits, but not getting highlighting, then open up a new tab
>> > > > with the Admin's query interface. Same place on the left as the
>> > > > analyzer. Replace the "*:*" with your search term (assuming you
>> > > > already indexed your document) and if necessary you can put something
>> > > > in the FQ like "id:123456" to target a specific record.
>> > > >
>> > > > Did you get a hit? If no, then it's not highlighting that's the
>> > > > issue. If yes, then try dumping this in your address bar (using your
>> > > > URL/IP, search term, and core name of course. The fq= is an example):
>> > > > http://[URL/IP]/solr/[CORE-NAME]/select?fq=id:123456&q="[SEARCH-TERM]"
>> > > >
>> > > > That will dump Solr's output to your browser where you can see
>> > > > exactly what is getting hit.
>> > > >
>> > > > Hope that helps! Let me know how it goes. Good luck.
>> > > >
>> > > > -Teague
>> > > >
>> > > > -----Original Message-----
>> > > > From: Evert R. [mailto:evert.ra...@gmail.com]
>> > > > Sent: Wednesday, December 16, 2015 1:46 PM
>> > > > To: solr-user <solr-user@lucene.apache.org>
>> > > > Subject: Re: Solr Basic Configuration - Highlight - Begginer
>> > > >
>> > > > Hi Teague!
>> > > >
>> > > > I configured solrconfig.xml and schema.xml exactly the way you did,
>> > > > only replacing 'documentText' with 'content', the field used by the
>> > > > techproducts sample. I reindexed with:
>> > > >
>> > > >  curl 'http://localhost:8983/solr/techproducts/update/extract?literal.id=pdf1&commit=true' \
>> > > >    -F "Emmanuel=@/home/solr/dados/teste/Emmanuel.pdf"
>> > > >
>> > > > with the same result... no highlighting in the response, as below:
>> > > >
>> > > > "highlighting": { "pdf1": {} }
>> > > >
>> > > > =(
>> > > >
>> > > > Really... I do not know what to do...
>> > > >
>> > > > Thanks for your time. If you have any more suggestions about where I
>> > > > could be missing something, please let me know.
>> > > >
>> > > >
>> > > > Best regards,
>> > > >
>> > > > *Evert*
>> > > >
>> > > > 2015-12-16 15:30 GMT-02:00 Teague James <teag...@insystechinc.com>:
>> > > >
>> > > >> Hi Evert,
>> > > >>
>> > > >> I recently needed help with phrase highlighting and was pointed to
>> > > >> the FastVectorHighlighter which worked out great. I just made a
>> > > >> change to the configuration to add generateWordParts="0" and
>> > > >> generateNumberParts="0" so that searches for things like "1a" would
>> > > >> get highlighted correctly. You may or may not need that feature. You
>> > > >> can always remove them or change the value to "1" to switch them on
>> > > >> explicitly. Anyway, hope this helps!
>> > > >>
>> > > >> solrconfig.xml (partial snip)
>> > > >> <requestHandler name="/select" class="solr.SearchHandler">
>> > > >>                 <lst name="defaults">
>> > > >>                         <str name="wt">xml</str>
>> > > >>                         <str name="echoParams">explicit</str>
>> > > >>                         <int name="rows">10</int>
>> > > >>                         <str name="df">documentText</str>
>> > > >>                         <str name="hl">on</str>
>> > > >>                         <str name="hl.fl">text</str>
>> > > >>                         <str name="hl.useFastVectorHighlighter">true</str>
>> > > >>                         <str name="hl.snippets">100</str>
>> > > >>                         <str name="hl.tag.pre"><![CDATA[<b>]]></str>
>> > > >>                         <str name="hl.tag.post"><![CDATA[</b>]]></str>
>> > > >>                 </lst>
>> > > >> </requestHandler>
>> > > >>
>> > > >> schema.xml (partial snip)
>> > > >>    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
>> > > >>    <field name="documentText" type="text_general" indexed="true" multiValued="true" termVectors="true" termOffsets="true" termPositions="true" />
>> > > >>
>> > > >> <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>> > > >>         <analyzer type="index">
>> > > >>                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> > > >>                 <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>> > > >>                 <filter class="solr.WordDelimiterFilterFactory" catenateAll="1" preserveOriginal="1" generateNumberParts="0" generateWordParts="0" />
>> > > >>                 <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="true"/>
>> > > >>                 <filter class="solr.LowerCaseFilterFactory"/>
>> > > >>                 <filter class="solr.PorterStemFilterFactory"/>
>> > > >>                 <filter class="solr.ApostropheFilterFactory"/>
>> > > >>         </analyzer>
>> > > >>         <analyzer type="query">
>> > > >>                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> > > >>                 <filter class="solr.WordDelimiterFilterFactory" catenateAll="1" preserveOriginal="1" generateWordParts="0" />
>> > > >>                 <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>> > > >>                 <filter class="solr.LowerCaseFilterFactory"/>
>> > > >>                 <filter class="solr.ApostropheFilterFactory"/>
>> > > >>         </analyzer>
>> > > >> </fieldType>
>> > > >>
>> > > >> -Teague
>> > > >>
>> > > >> From: Evert R. [mailto:evert.ra...@gmail.com]
>> > > >> Sent: Tuesday, December 15, 2015 6:25 AM
>> > > >> To: solr-user@lucene.apache.org
>> > > >> Subject: Solr Basic Configuration - Highlight - Begginer
>> > > >>
>> > > >> Hi there!
>> > > >>
>> > > >> It's my first installation, and I'm not sure if this is the right
>> > > >> channel...
>> > > >>
>> > > >> Here are my steps:
>> > > >>
>> > > >> 1. Set up a basic install of solr 5.4.0
>> > > >>
>> > > >> 2. Create a new core through command line (bin/solr create -c test)
>> > > >>
>> > > >> 3. Post 2 files: 1 .docx and 2 .pdf (bin/post -c test /docs/test/)
>> > > >>
>> > > >> 4. Query in the browser: it returns the correct results, but it
>> > > >> does not show the part of the text that matches my query, the
>> > > >> highlight.
>> > > >>
>> > > >>   I have already flagged the 'hl' option, but it still does not
>> > > >>   work...
>> > > >>
>> > > >> Example: I am looking for the word 'peace' in my PDF file (a book).
>> > > >> I have 4 matches for this word. It shows me the book name (PDF file)
>> > > >> but does not show which part of the text contains the word 'peace'.
>> > > >>
>> > > >>
>> > > >> I am probably missing some configuration in schema.xml, which is
>> > > >> missing from my folder... /solr/server/solr/test/conf/
>> > > >>
>> > > >> Or maybe in solrconfig.xml...
>> > > >>
>> > > >> I have read a bunch of things about highlighting, checked these files,
>> > > >> and copied the standard schema.xml to my core/conf folder, but it
>> > > >> still does not return the highlight.
>> > > >>
>> > > >>
>> > > >> Attached a copy of my solrconfig.xml file.
>> > > >>
>> > > >>
>> > > >> I am very sorry for this probably dumb and too basic question...
>> > > >> This is the first time I have seen Solr live.
>> > > >>
>> > > >>
>> > > >> Any help will be appreciated.
>> > > >>
>> > > >>
>> > > >>
>> > > >> Best regards,
>> > > >>
>> > > >>
>> > > >> Evert Ramos
>> > > >>
>> > > >> mailto:evert.ra...@gmail.com
>> > > >>
>> > > >>
>> > > >>
>> > > >
>> > >
>> >
>>
>>
>>
>> --
>> Kind regards,
>>
>> -Teague James
>> *Senior Web Applications Developer*
>> Insystech Inc.
>> teag...@insystechinc.com
>> (703) 508-0008 (Cell)
>>
