Re: both way synonyms with ManagedSynonymFilterFactory
Thanks a lot for following up on this and creating the patch!

On Thu, Feb 25, 2016 at 2:49 PM, Jan Høydahl wrote:

> Created https://issues.apache.org/jira/browse/SOLR-8737 to handle this
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > On 22 Feb 2016, at 11:21, Jan Høydahl wrote:
> >
> > Hi
> >
> > Did you get any further with this?
> > I reproduced your situation with Solr 5.5.
> >
> > Think the issue here is that when the SynonymFilter is created based on
> > the managed map, option “expand” is always set to “false”, while the
> > default for file-based synonym dictionary is “true”.
> >
> > So with expand=false, what happens is that the input word (e.g. “mb”) is
> > *replaced* with the synonym “megabytes”. Confusingly enough, when synonyms
> > are applied both on index and query side, your document will contain
> > “megabytes” instead of “mb”, but when you query for “mb”, the same happens
> > on query side, so you will actually match :-)
> >
> > I think what we need is to switch default to expand=true, and make it
> > configurable also in the managed factory.
> >
> > --
> > Jan Høydahl, search solution architect
> > Cominvent AS - www.cominvent.com
> >
> >> On 11 Feb 2016, at 10:16, Bjørn Hjelle wrote:
> >>
> >> Hi,
> >>
> >> one-way managed synonyms seems to work fine, but I cannot make both-way
> >> synonyms work.
> >>
> >> Steps to reproduce with Solr 5.4.1:
> >>
> >> 1. create a core:
> >> $ bin/solr create_core -c test -d server/solr/configsets/basic_configs
> >>
> >> 2. edit schema.xml so fieldType text_general looks like this:
> >>
> >>    positionIncrementGap="100">
> >>    />
> >>
> >> 3. reload the core:
> >>
> >> $ curl -X GET "http://localhost:8983/solr/admin/cores?action=RELOAD&core=test"
> >>
> >> 4. add synonyms, one one-way synonym, one two-way, reload the core again:
> >>
> >> $ curl -X PUT -H 'Content-type:application/json' --data-binary
> >> '{"mad":["angry","upset"]}' "http://localhost:8983/solr/test/schema/analysis/synonyms/english"
> >> $ curl -X PUT -H 'Content-type:application/json' --data-binary
> >> '["mb","megabytes"]' "http://localhost:8983/solr/test/schema/analysis/synonyms/english"
> >> $ curl -X GET "http://localhost:8983/solr/admin/cores?action=RELOAD&core=test"
> >>
> >> 5. list the synonyms:
> >> {
> >>   "responseHeader":{
> >>     "status":0,
> >>     "QTime":0},
> >>   "synonymMappings":{
> >>     "initArgs":{"ignoreCase":false},
> >>     "initializedOn":"2016-02-11T09:00:50.354Z",
> >>     "managedMap":{
> >>       "mad":["angry",
> >>         "upset"],
> >>       "mb":["megabytes"],
> >>       "megabytes":["mb"]}}}
> >>
> >> 6. add two documents:
> >>
> >> $ bin/post -c test -type 'application/json' -d '[{"id" : "1", "title_t" :
> >> "10 megabytes makes me mad" },{"id" : "2", "title_t" : "100 mb should be
> >> sufficient" }]'
> >> $ bin/post -c test -type 'application/json' -d '[{"id" : "2", "title_t" :
> >> "100 mb should be sufficient" }]'
> >>
> >> 7. search for the documents:
> >>
> >> - all these return the first document, so one-way synonyms work:
> >> $ curl -X GET "http://localhost:8983/solr/test/select?q=title_t:angry&indent=true"
> >> $ curl -X GET "http://localhost:8983/solr/test/select?q=title_t:upset&indent=true"
> >> $ curl -X GET "http://localhost:8983/solr/test/select?q=title_t:mad&indent=true"
> >>
> >> - this only returns the document with "mb":
> >>
> >> $ curl -X GET "http://localhost:8983/solr/test/select?q=title_t:mb&indent=true"
> >>
> >> - this only returns the document with "megabytes"
> >>
> >> $ curl -X GET "http://localhost:8983/solr/test/select?q=title_t:megabytes&indent=true"
> >>
> >>
> >> Any input on how to make this work would be appreciated.
> >>
> >> Thanks,
> >> Bjørn
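To make the expand=false behaviour from Jan's reply concrete, here is a small Python sketch. It is a toy model and not Solr code: the `analyze` and `search` helpers and the `keep_original` flag are hypothetical names used only for illustration, and whitespace tokenization is assumed. It reuses the managed map and the two documents from the report.

```python
# Toy model of synonym handling; NOT Solr's implementation.
# MANAGED_MAP and DOCS are copied from the report in this thread.
MANAGED_MAP = {
    "mad": ["angry", "upset"],
    "mb": ["megabytes"],
    "megabytes": ["mb"],
}

DOCS = {
    "1": "10 megabytes makes me mad",
    "2": "100 mb should be sufficient",
}


def analyze(text, synonyms, keep_original):
    """Return the set of terms produced for `text`.

    keep_original=False models the "replace" behaviour described for
    expand=false; keep_original=True models keeping the input term as well.
    """
    terms = set()
    for token in text.lower().split():
        if token in synonyms:
            terms.update(synonyms[token])
            if keep_original:
                terms.add(token)
        else:
            terms.add(token)
    return terms


def search(query, keep_original):
    """Apply the same analysis at index and query time, return matching ids."""
    query_terms = analyze(query, MANAGED_MAP, keep_original)
    return sorted(
        doc_id
        for doc_id, text in DOCS.items()
        if analyze(text, MANAGED_MAP, keep_original) & query_terms
    )


if __name__ == "__main__":
    for flag in (False, True):
        print("keep_original =", flag)
        for q in ("mad", "mb", "megabytes"):
            print("  ", q, "->", search(q, flag))
```

In this model, replacing on both sides means a query for "mb" only finds the document that originally contained "mb", which is the symptom reported; keeping the original term lets both queries find both documents.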
both way synonyms with ManagedSynonymFilterFactory
Hi,

one-way managed synonyms seem to work fine, but I cannot make both-way synonyms work.

Steps to reproduce with Solr 5.4.1:

1. create a core:

$ bin/solr create_core -c test -d server/solr/configsets/basic_configs

2. edit schema.xml so fieldType text_general looks like this:

3. reload the core:

$ curl -X GET "http://localhost:8983/solr/admin/cores?action=RELOAD&core=test"

4. add synonyms, one one-way synonym, one two-way, reload the core again:

$ curl -X PUT -H 'Content-type:application/json' --data-binary '{"mad":["angry","upset"]}' "http://localhost:8983/solr/test/schema/analysis/synonyms/english"
$ curl -X PUT -H 'Content-type:application/json' --data-binary '["mb","megabytes"]' "http://localhost:8983/solr/test/schema/analysis/synonyms/english"
$ curl -X GET "http://localhost:8983/solr/admin/cores?action=RELOAD&core=test"

5. list the synonyms:

{
  "responseHeader":{
    "status":0,
    "QTime":0},
  "synonymMappings":{
    "initArgs":{"ignoreCase":false},
    "initializedOn":"2016-02-11T09:00:50.354Z",
    "managedMap":{
      "mad":["angry",
        "upset"],
      "mb":["megabytes"],
      "megabytes":["mb"]}}}

6. add two documents:

$ bin/post -c test -type 'application/json' -d '[{"id" : "1", "title_t" : "10 megabytes makes me mad" },{"id" : "2", "title_t" : "100 mb should be sufficient" }]'
$ bin/post -c test -type 'application/json' -d '[{"id" : "2", "title_t" : "100 mb should be sufficient" }]'

7. search for the documents:

- all these return the first document, so one-way synonyms work:

$ curl -X GET "http://localhost:8983/solr/test/select?q=title_t:angry&indent=true"
$ curl -X GET "http://localhost:8983/solr/test/select?q=title_t:upset&indent=true"
$ curl -X GET "http://localhost:8983/solr/test/select?q=title_t:mad&indent=true"

- this only returns the document with "mb":

$ curl -X GET "http://localhost:8983/solr/test/select?q=title_t:mb&indent=true"

- this only returns the document with "megabytes":

$ curl -X GET "http://localhost:8983/solr/test/select?q=title_t:megabytes&indent=true"

Any input on how to make this work would be appreciated.

Thanks,
Bjørn
Solr 5.4, NGramFilterFactory highlighting
Hi,

I have problems getting hit highlighting to work in NGram fields with search terms longer than 8 characters. Without the luceneMatchVersion="4.3" parameter in the field type definition, the whole word is highlighted, not just the search term.

Here are the exact steps to reproduce the issue:

Download Solr 5.4.0:

$ wget http://archive.apache.org/dist/lucene/solr/5.4.0/solr-5.4.0.tgz
$ tar xvzf solr-5.4.0.tgz

Start Solr:

$ cd solr-5.4.0
$ bin/solr start

In another command prompt, create a core:

$ bin/solr create_core -c test -d server/solr/configsets/sample_techproducts_configs

Add to server/solr/test/conf/schema.xml:

Reload the core to pick up config changes:

$ curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=test"

Create file doc.xml with contents:

DOC2
thisisalongword in the document

Index the document:

$ bin/post -c test doc.xml

Perform a search that shows that we find the document and the search term is highlighted:

http://localhost:8983/solr/test/select?q=name_ngram%3Athis&wt=json&indent=true&hl=true&hl.fl=name_ngram&hl.simple.pre=%3Cem%3E&hl.simple.post=%3C%2Fem%3E

"highlighting":{
  "DOC2":{
    "name_ngram":["thisisalongword in the document"]}}}

Add more characters to the search term. We still find the document, but the search term is now NOT highlighted:

http://localhost:8983/solr/test/select?q=name_ngram%3Athisisalong&wt=json&indent=true&hl=true&hl.fl=name_ngram&hl.simple.pre=%3Cem%3E&hl.simple.post=%3C%2Fem%3E

"highlighting":{
  "DOC2":{
    "name_ngram":["thisisalongword in the document"]}}}

Thank you,
Bjørn Hjelle
Re: variable length ngramfilter highlights
Dan,

you could try to add luceneMatchVersion="4.3" to your fieldType, like so:

That worked for me with Solr versions prior to Solr 5.

Bjørn

On Thu, Apr 9, 2015 at 2:19 PM, Dan Sullivan wrote:

> Hi,
>
> I apologize if this question is redundant. I've spent a few days on it and
> scoured the Internet; I know that this question has been asked and answered
> in various capacities for different versions of Solr; the reason I am
> inquiring to this mailing list is because what I am attempting to do seems
> to be supported in the Solr API documentation at the following URL:
>
> https://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenFilter.html
>
> Here is what I am trying to do; I have a single text field that contains a
> large amount of data (it's not huge, but it may contain more than 2048
> characters of data for example). What I would like to do is have full
> search capabilities (for a single input term that is a word, i.e. 'a' or
> 'queue') via a variable length NGramFilter with a size of 1..10 (for
> example). I've read various posts that partial highlighting on variable
> length NGramFilters is 'broken' or that fast vector highlighting cannot be
> used. Basically, it seems that I can search using NGramFilters, however the
> highlights that are being returned are inaccurate.
>
> I think my question is fundamental in nature; should I be able to get
> accurate partial highlights of a variable length NGramfilter with any
> version of Solr (using any highlighter, standard fast vector or otherwise)?
> The documentation I linked above suggests it is possible.
>
> I appreciate you taking the time to help me.
>
> I have tried numerous configurations to no avail, so it might be moot to
> post my configuration, however here it is.
>
> schema.xml - https://gist.github.com/dsulli99/c1d8f3536ade65e8eb35
>
> solrconfig.xml https://gist.github.com/dsulli99/10e2af507cde4373adba
>
> Thank you,
>
> Dan
Solr 5: hit highlight with NGram/EdgeNgram-fields
With Solr 4.10.3 I was advised to set luceneMatchVersion to "4.3" to make hit highlighting work with NGram/EdgeNGram fields, like this:

In Solr 5 and 5.1 this no longer seems to work. The complete word is highlighted, not just the part that matches the search term.

In the Solr admin analysis page it again does not show the proper end-offset positions. What it shows is this:

LENGTF
text            t      te       tes         test
raw_bytes       [74]   [74 65]  [74 65 73]  [74 65 73 74]
start           0      0        0           0
end             4      4        4           4
positionLength  1      1        1           1
type            word   word     word        word
position        1      1        1           1

In Solr 4.10.3 with luceneMatchVersion set to "4.3" the end offsets would be 1, 2, 3, 4 and hit highlighting would work.

Any advice on making hit highlighting work with (Edge)NGram fields would be highly appreciated!

Thanks,
Bjørn
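The effect of the end offsets on highlighting can be illustrated with a short Python sketch. This is a simplified model and not the Solr highlighter: the `highlight` function and the two token lists are hypothetical, with the offsets copied from the analysis output described above (end offsets of 4, 4, 4, 4 versus 1, 2, 3, 4).

```python
# Simplified model of offset-based highlighting; NOT the Solr highlighter.
# Each token is (term, start_offset, end_offset) for the indexed word "test".

# Offsets as reported in this message for Solr 5 (every gram ends at 4).
TOKENS_FULL_WORD_OFFSETS = [
    ("t", 0, 4),
    ("te", 0, 4),
    ("tes", 0, 4),
    ("test", 0, 4),
]

# Offsets as reported for Solr 4.10.3 with luceneMatchVersion "4.3".
TOKENS_PER_GRAM_OFFSETS = [
    ("t", 0, 1),
    ("te", 0, 2),
    ("tes", 0, 3),
    ("test", 0, 4),
]


def highlight(text, tokens, query, pre="<em>", post="</em>"):
    """Wrap the character span of the first token equal to `query`."""
    for term, start, end in tokens:
        if term == query:
            return text[:start] + pre + text[start:end] + post + text[end:]
    return text


if __name__ == "__main__":
    print(highlight("test", TOKENS_FULL_WORD_OFFSETS, "te"))
    print(highlight("test", TOKENS_PER_GRAM_OFFSETS, "te"))
```

A highlighter that marks whatever span the matching token reports can only mark the whole word when every gram carries the offsets of the original token.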
Re: (Edge)NGramFilterFactory and highlight
Mingchun,

yes, that is better, and it works fine.

Thank you!

Bjørn

On Sat, Dec 20, 2014 at 1:26 PM, Mingchun Zhao wrote:

> Hi Bjørn,
>
> From Solr 4.4, the behavior of end offsets in EdgeNGramFilterFactory
> was changed due to the following issue,
> https://issues.apache.org/jira/browse/LUCENE-3907
> The related source code in this patch is as below,
> ==
> +    if (version.onOrAfter(Version.LUCENE_44)) {
> +      // Never update offsets
> +      updateOffsets = false;
> +    } else {
> +      // if length by start + end offsets doesn't match the term text then assume
> +      // this is a synonym and don't adjust the offsets.
> +      updateOffsets = (tokStart + curTermLength) == tokEnd;
> +    }
> ==
>
> It seems that there is no property for specifying the previous
> behavior of offsets as in LUCENE_43.
> Therefore, you might have to set luceneMatchVersion to deal with it as
> you mentioned.
> However, it would be better to apply luceneMatchVersion just on the
> EdgeNGramFilterFactory as below,
> ==
> <filter class="solr.EdgeNGramFilterFactory" maxGramSize="20" minGramSize="1" luceneMatchVersion="4.3"/>
> ==
> The setting of <luceneMatchVersion>LUCENE_43</luceneMatchVersion> in
> solrconfig.xml will also affect other configurations.
>
> Regards,
> Mingchun
>
>
> 2014-12-19 23:26 GMT+09:00 Bjørn Hjelle:
>> Hi,
>>
>> based on this example:
>> http://www.cominvent.com/2012/01/25/super-flexible-autocomplete-with-solr/
>> I have earlier successfully implemented highlight of terms in
>> (Edge)NGram-analyzed fields.
>>
>> In a new project, however, with Solr 4.10.2 it does not work.
>>
>> In the Solr admin analysis page I see the following in Solr 4.10.2
>> (simplified):
>>
>> ENGTF  text   t  te  tes  test
>>        start  0  0   0    0
>>        end    4  4   4    4
>>
>> But if I change to LUCENE_43 in solrconfig.xml, and reload the
>> analysis page I get this:
>>
>> ENGTF  text   t  te  tes  test
>>        start  0  0   0    0
>>        end    1  2   3    4
>>
>> So, in 4.10.2 it is not able to find the correct end-positions and the
>> highlighter will instead highlight the complete word ("test" in this
>> case).
>>
>>
>> To reproduce this:
>> 1. download Solr 4.10.2
>> 2. In the collection1 schema.xml, add field type:
>>
>>    mapping="mapping-ISOLatin1Accent.txt"/>
>>    generateWordParts="1" generateNumberParts="1" catenateWords="0"
>>    catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>>    maxGramSize="20" minGramSize="1"/>
>>    pattern="([^\w\d\*æøåÆØÅ ])" replacement="" replace="all"/>
>>
>>    mapping="mapping-ISOLatin1Accent.txt"/>
>>    generateWordParts="0" generateNumberParts="0" catenateWords="0"
>>    catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
>>    pattern="([^\w\d\*æøåÆØÅ ])" replacement="" replace="all"/>
>>    pattern="^(.{20})(.*)?" replacement="$1" replace="all"/>
>>
>> 3. Start solr and in analysis page add "Test" to Field Value (Index)
>> -field and check the output.
>> 4. Then change to this in solrconfig.xml
>>
>> <luceneMatchVersion>LUCENE_43</luceneMatchVersion>
>>
>> 5. reload the core and reload the analysis page.
>> 6. you will now see that the end-positions are correct.
>>
>>
>> Any ideas on how to make this work with Solr 4.10.2 without resorting
>> to changing lucene version in solrconfig.xml?
>>
>>
>> Thanks,
>> Bjørn
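The version switch in the patch excerpt quoted above can be mirrored in a few lines of Python to show how it produces the two sets of end offsets seen on the analysis page. This is a sketch under assumptions, not the Lucene source: `should_update_offsets` and `edge_ngrams` are hypothetical helper names, versions are modelled as (major, minor) tuples, and only front-edge grams are generated.

```python
# Sketch of the offset decision from the quoted LUCENE-3907 excerpt.
# NOT the Lucene implementation; helper names are hypothetical.

def should_update_offsets(version, tok_start, tok_end, term_length):
    """Mirror the quoted branch: never update offsets from 4.4 onwards."""
    if version >= (4, 4):
        return False
    # Before 4.4: only adjust when the offsets span exactly the term text;
    # otherwise assume a synonym and leave the offsets alone.
    return tok_start + term_length == tok_end


def edge_ngrams(term, tok_start, tok_end, min_gram, max_gram, version):
    """Return (gram, start_offset, end_offset) for each front-edge gram."""
    update = should_update_offsets(version, tok_start, tok_end, len(term))
    grams = []
    for n in range(min_gram, min(max_gram, len(term)) + 1):
        end = tok_start + n if update else tok_end
        grams.append((term[:n], tok_start, end))
    return grams


if __name__ == "__main__":
    for version in ((4, 3), (4, 10)):
        print(version, edge_ngrams("test", 0, 4, 1, 20, version))
```

With version (4, 3) the grams of "test" end at 1, 2, 3, 4; with (4, 10) they all end at 4, matching the two analysis outputs in the original report.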
(Edge)NGramFilterFactory and highlight
Hi,

based on this example:
http://www.cominvent.com/2012/01/25/super-flexible-autocomplete-with-solr/
I have earlier successfully implemented highlighting of terms in (Edge)NGram-analyzed fields.

In a new project, however, with Solr 4.10.2 it does not work.

In the Solr admin analysis page I see the following in Solr 4.10.2 (simplified):

ENGTF  text   t  te  tes  test
       start  0  0   0    0
       end    4  4   4    4

But if I change to LUCENE_43 in solrconfig.xml, and reload the analysis page I get this:

ENGTF  text   t  te  tes  test
       start  0  0   0    0
       end    1  2   3    4

So, in 4.10.2 it is not able to find the correct end-positions and the highlighter will instead highlight the complete word ("test" in this case).

To reproduce this:

1. download Solr 4.10.2
2. In the collection1 schema.xml, add field type:
3. Start solr and in analysis page add "Test" to Field Value (Index) -field and check the output.
4. Then change to this in solrconfig.xml

   <luceneMatchVersion>LUCENE_43</luceneMatchVersion>

5. reload the core and reload the analysis page.
6. you will now see that the end-positions are correct.

Any ideas on how to make this work with Solr 4.10.2 without resorting to changing lucene version in solrconfig.xml?

Thanks,
Bjørn