Re: both way synonyms with ManagedSynonymFilterFactory

2016-03-01 Thread Bjørn Hjelle
Thanks a lot for following up on this and creating the patch!

On Thu, Feb 25, 2016 at 2:49 PM, Jan Høydahl wrote:

> Created https://issues.apache.org/jira/browse/SOLR-8737 to handle this
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > On 22 Feb 2016, at 11:21, Jan Høydahl wrote:
> >
> > Hi
> >
> > Did you get any further with this?
> > I reproduced your situation with Solr 5.5.
> >
> > I think the issue here is that when the SynonymFilter is created based on
> > the managed map, the option “expand” is always set to “false”, while the
> > default for a file-based synonym dictionary is “true”.
> >
> > So with expand=false, what happens is that the input word (e.g. “mb”) is
> > *replaced* with the synonym “megabytes”. Confusingly enough, when synonyms
> > are applied on both the index and query side, your document will contain
> > “megabytes” instead of “mb”, but when you query for “mb”, the same happens
> > on the query side, so you will actually match :-)
> >
> > I think what we need is to switch the default to expand=true, and also make
> > it configurable in the managed factory.
> >
> > --
> > Jan Høydahl, search solution architect
> > Cominvent AS - www.cominvent.com
> >
> >> On 11 Feb 2016, at 10:16, Bjørn Hjelle wrote:
> >>
> >> Hi,
> >>
> >> one-way managed synonyms seem to work fine, but I cannot make both-way
> >> synonyms work.
> >>
> >> Steps to reproduce with Solr 5.4.1:
> >>
> >> 1. create a core:
> >> $ bin/solr create_core -c test -d server/solr/configsets/basic_configs
> >>
> >> 2. edit schema.xml so fieldType text_general looks like this:
> >>
> >>>> positionIncrementGap="100">
> >> 
> >>   
> >>>> />
> >>   
> >> 
> >>   
> >>
> >> 3. reload the core:
> >>
> >> $ curl -X GET "http://localhost:8983/solr/admin/cores?action=RELOAD&core=test"
> >>
> >> 4. add synonyms, one one-way synonym, one two-way, reload the core again:
> >>
> >> $ curl -X PUT -H 'Content-type:application/json' --data-binary \
> >>   '{"mad":["angry","upset"]}' \
> >>   "http://localhost:8983/solr/test/schema/analysis/synonyms/english"
> >> $ curl -X PUT -H 'Content-type:application/json' --data-binary \
> >>   '["mb","megabytes"]' \
> >>   "http://localhost:8983/solr/test/schema/analysis/synonyms/english"
> >> $ curl -X GET "http://localhost:8983/solr/admin/cores?action=RELOAD&core=test"
> >>
> >> 5. list the synonyms:
> >> {
> >>   "responseHeader":{
> >>     "status":0,
> >>     "QTime":0},
> >>   "synonymMappings":{
> >>     "initArgs":{"ignoreCase":false},
> >>     "initializedOn":"2016-02-11T09:00:50.354Z",
> >>     "managedMap":{
> >>       "mad":["angry",
> >>         "upset"],
> >>       "mb":["megabytes"],
> >>       "megabytes":["mb"]}}}
> >>
> >>
> >> 6. add two documents:
> >>
> >> $ bin/post -c test -type 'application/json' -d '[
> >>   {"id" : "1", "title_t" : "10 megabytes makes me mad" },
> >>   {"id" : "2", "title_t" : "100 mb should be sufficient" }]'
> >> $ bin/post -c test -type 'application/json' -d '[
> >>   {"id" : "2", "title_t" : "100 mb should be sufficient" }]'
> >>
> >> 7. search for the documents:
> >>
> >> - all these return the first document, so one-way synonyms work:
> >> $ curl -X GET "http://localhost:8983/solr/test/select?q=title_t:angry&indent=true"
> >> $ curl -X GET "http://localhost:8983/solr/test/select?q=title_t:upset&indent=true"
> >> $ curl -X GET "http://localhost:8983/solr/test/select?q=title_t:mad&indent=true"
> >>
> >> - this only returns the document with "mb":
> >>
> >> $ curl -X GET "http://localhost:8983/solr/test/select?q=title_t:mb&indent=true"
> >>
> >> - this only returns the document with "megabytes":
> >>
> >> $ curl -X GET "http://localhost:8983/solr/test/select?q=title_t:megabytes&indent=true"
> >>
> >>
> >> Any input on how to make this work would be appreciated.
> >>
> >> Thanks,
> >> Bjørn
> >
>
>
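
Jan's explanation of expand=false versus expand=true can be illustrated with a small Python model. This is only a sketch under stated assumptions: the whitespace tokenizer, the analyze and search helpers, and the idea that expand=true keeps the original token alongside its synonyms are illustrative and are not Solr's actual implementation. The synonym map and the two documents are the ones from the report.

```python
# Managed map as listed in step 5 of the original report.
MANAGED_MAP = {
    "mad": ["angry", "upset"],
    "mb": ["megabytes"],
    "megabytes": ["mb"],
}

# The two documents indexed in step 6.
DOCS = {
    "1": "10 megabytes makes me mad",
    "2": "100 mb should be sufficient",
}


def analyze(text, expand):
    """Toy analysis chain (assumption): whitespace split plus a synonym step."""
    tokens = set()
    for token in text.split():
        synonyms = MANAGED_MAP.get(token)
        if synonyms is None:
            tokens.add(token)          # no mapping: keep the token as-is
        elif expand:
            tokens.add(token)          # expand=true: keep the original ...
            tokens.update(synonyms)    # ... and add its synonyms
        else:
            tokens.update(synonyms)    # expand=false: replace the original
    return tokens


def search(term, expand):
    """Return ids of documents sharing at least one analyzed token with the query."""
    query_tokens = analyze(term, expand)
    return sorted(
        doc_id for doc_id, text in DOCS.items()
        if analyze(text, expand) & query_tokens
    )


for expand in (False, True):
    for term in ("mb", "megabytes", "mad"):
        print(f"expand={expand} q={term}: {search(term, expand)}")
```

With expand=False the model returns only the "mb" document for q=mb and only the "megabytes" document for q=megabytes, which is the behaviour reported; with expand=True both documents match either query.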


both way synonyms with ManagedSynonymFilterFactory

2016-02-11 Thread Bjørn Hjelle
Hi,

one-way managed synonyms seem to work fine, but I cannot make both-way
synonyms work.

Steps to reproduce with Solr 5.4.1:

1. create a core:
$ bin/solr create_core -c test -d server/solr/configsets/basic_configs

2. edit schema.xml so fieldType text_general looks like this:


  



  


3. reload the core:

$ curl -X GET "http://localhost:8983/solr/admin/cores?action=RELOAD&core=test"

4. add synonyms, one one-way synonym, one two-way, reload the core again:

$ curl -X PUT -H 'Content-type:application/json' --data-binary \
  '{"mad":["angry","upset"]}' \
  "http://localhost:8983/solr/test/schema/analysis/synonyms/english"
$ curl -X PUT -H 'Content-type:application/json' --data-binary \
  '["mb","megabytes"]' \
  "http://localhost:8983/solr/test/schema/analysis/synonyms/english"
$ curl -X GET "http://localhost:8983/solr/admin/cores?action=RELOAD&core=test"

5. list the synonyms:
{
  "responseHeader":{
    "status":0,
    "QTime":0},
  "synonymMappings":{
    "initArgs":{"ignoreCase":false},
    "initializedOn":"2016-02-11T09:00:50.354Z",
    "managedMap":{
      "mad":["angry",
        "upset"],
      "mb":["megabytes"],
      "megabytes":["mb"]}}}


6. add two documents:

$ bin/post -c test -type 'application/json' -d '[
  {"id" : "1", "title_t" : "10 megabytes makes me mad" },
  {"id" : "2", "title_t" : "100 mb should be sufficient" }]'
$ bin/post -c test -type 'application/json' -d '[
  {"id" : "2", "title_t" : "100 mb should be sufficient" }]'

7. search for the documents:

- all these return the first document, so one-way synonyms work:
$ curl -X GET "http://localhost:8983/solr/test/select?q=title_t:angry&indent=true"
$ curl -X GET "http://localhost:8983/solr/test/select?q=title_t:upset&indent=true"
$ curl -X GET "http://localhost:8983/solr/test/select?q=title_t:mad&indent=true"

- this only returns the document with "mb":

$ curl -X GET "http://localhost:8983/solr/test/select?q=title_t:mb&indent=true"

- this only returns the document with "megabytes":

$ curl -X GET "http://localhost:8983/solr/test/select?q=title_t:megabytes&indent=true"


Any input on how to make this work would be appreciated.

Thanks,
Bjørn
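
The listing in step 5 suggests how the two payload shapes end up in the managed map: a JSON object stores explicit one-way mappings, while a JSON array makes every term a synonym of the others. Below is a minimal Python sketch of that bookkeeping. It is modelled on the listing above, not on Solr's source code, and apply_put is a hypothetical helper name.

```python
def apply_put(managed_map, payload):
    """Merge one PUT payload into managed_map (a dict of term -> list) and return it.

    Assumption: this mirrors the managedMap listing in the report, not Solr internals.
    """
    if isinstance(payload, dict):
        # Object form: explicit one-way mappings, term -> synonyms.
        pairs = [(term, list(synonyms)) for term, synonyms in payload.items()]
    else:
        # Array form: each term maps to all the other terms in the list.
        pairs = [(term, [t for t in payload if t != term]) for term in payload]

    for term, synonyms in pairs:
        targets = managed_map.setdefault(term, [])
        for synonym in synonyms:
            if synonym not in targets:
                targets.append(synonym)
    return managed_map


managed = {}
apply_put(managed, {"mad": ["angry", "upset"]})   # one-way synonym
apply_put(managed, ["mb", "megabytes"])           # two-way synonym
print(managed)
```

In this model the array payload produces the symmetric pair "mb" -> "megabytes" and "megabytes" -> "mb", the same shape as the managedMap shown in step 5.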


Solr 5.4, NGramFilterFactory highlighting

2015-12-21 Thread Bjørn Hjelle
Hi,

I have problems getting hit highlighting to work in NGram-fields, with
search terms longer than 8 characters.
Without the luceneMatchVersion="4.3" parameter in the field type
definition, the whole word is highlighted, not just the search term.


Here are the exact steps to reproduce the issue:

Download Solr 5.4.0:

$ wget http://archive.apache.org/dist/lucene/solr/5.4.0/solr-5.4.0.tgz
$ tar xvfz solr-5.4.0.tgz

Start solr:

$ cd solr-5.4.0
$ bin/solr start

In another command prompt, create a core:

$ bin/solr create_core -c test -d \
  server/solr/configsets/sample_techproducts_configs


Add to server/solr/test/conf/schema.xml:





 










Reload the core to pick up config changes:
$ curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=test"


Create file doc.xml with contents:


  
DOC2
thisisalongword in the document
  



Index the document:

$ bin/post -c test doc.xml


Perform a search that shows that we find the document and the search term
is highlighted:
http://localhost:8983/solr/test/select?q=name_ngram%3Athis&wt=json&indent=true&hl=true&hl.fl=name_ngram&hl.simple.pre=%3Cem%3E&hl.simple.post=%3C%2Fem%3E

  "highlighting":{
    "DOC2":{
      "name_ngram":["<em>this</em>isalongword in the document"]}}}


Add more characters to the search term: we still find the document, but the
search term is now NOT highlighted:

http://localhost:8983/solr/test/select?q=name_ngram%3Athisisalong&wt=json&indent=true&hl=true&hl.fl=name_ngram&hl.simple.pre=%3Cem%3E&hl.simple.post=%3C%2Fem%3E

  "highlighting":{
    "DOC2":{
      "name_ngram":["thisisalongword in the document"]}}}


Thank you,
Bjørn Hjelle
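
The two request URLs above are percent-encoded by hand. As a sketch, the same query string can be built in Python with the standard library's urllib.parse.urlencode. The helper name highlight_url and its base argument are illustrative assumptions; the parameter names and values are the ones used in the message.

```python
from urllib.parse import urlencode


def highlight_url(term, base="http://localhost:8983/solr/test/select"):
    """Build the highlighting request URL for a search term in name_ngram."""
    params = {
        "q": f"name_ngram:{term}",
        "wt": "json",
        "indent": "true",
        "hl": "true",                 # switch highlighting on
        "hl.fl": "name_ngram",        # field to highlight
        "hl.simple.pre": "<em>",      # marker inserted before a match
        "hl.simple.post": "</em>",    # marker inserted after a match
    }
    return f"{base}?{urlencode(params)}"


print(highlight_url("this"))
print(highlight_url("thisisalong"))
```

The first printed URL is the same as the first request shown in the message; only the term changes for the second one.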


Re: variable length ngramfilter highlights

2015-04-20 Thread Bjørn Hjelle
Dan, you could try adding luceneMatchVersion="4.3" to your fieldType, like
so:



That worked for me with Solr versions prior to Solr 5.

Bjørn

On Thu, Apr 9, 2015 at 2:19 PM, Dan Sullivan wrote:

> Hi,
>
>
> I apologize if this question is redundant.  I've spent a few days on it and
> scoured the Internet; I know that this question has been asked and answered
> in various capacities for different versions of Solr; the reason I am
> inquiring to this mailing list is because what I am attempting to do seems
> to be supported in the Solr API documentation at the following URL:
>
>
>
>
> https://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenFilter.html
>
>
>
> Here is what I am trying to do; I have a single text field that contains a
> large amount of data (it's not huge, but it may contain more than 2048
> characters of data for example).  What I would like to do is have full
> search capabilities (for a single input term that is a word, i.e. 'a' or
> 'queue') via a variable length NGramFilter with a size of 1..10 (for
> example).   I've read various posts that partial highlighting on variable
> length NGramFilters is 'broken' or that fast vector highlighting cannot be
> used.  Basically, it seems that I can search using NGramFilters, however
> the highlights that are being returned are inaccurate.
>
>
>
> I think my question is fundamental  in nature; should I be able to get
> accurate partial highlights of a variable length NGramfilter with any
> version of Solr (using any highlighter, standard fast vector or otherwise)?
> The documentation I linked above suggests it is possible.
>
>
>
> I appreciate you taking the time to help me.
>
>
>
> I have tried numerous configurations to no avail, so it might be moot to
> post my configuration, however here it is.
>
>
>
> schema.xml - https://gist.github.com/dsulli99/c1d8f3536ade65e8eb35
>
> solrconfig.xml https://gist.github.com/dsulli99/10e2af507cde4373adba
>
>
>
> Thank you,
>
>
>
> Dan
>
>
>
>
>
>
>
>


Solr 5: hit highlight with NGram/EdgeNgram-fields

2015-04-20 Thread Bjørn Hjelle
With Solr 4.10.3 I was advised to set luceneMatchVersion to "4.3" to make
hit highlighting work with NGram/EdgeNGram fields, like this:

 

In Solr 5 and 5.1 this no longer seems to work.
The complete word is highlighted, not just the part that matches the
search term.

In the Solr admin analysis page it again does not show the proper end-offset
positions. What it shows is this:

LENGTF
text            t     te       tes         test
raw_bytes       [74]  [74 65]  [74 65 73]  [74 65 73 74]
start           0     0        0           0
end             4     4        4           4
positionLength  1     1        1           1
type            word  word     word        word
position        1     1        1           1

In Solr 4.10.3 with luceneMatchVersion set to "4.3" the end offsets would be
1, 2, 3, 4 and hit highlighting would work.

Any advice on making hit highlighting work with (Edge)NGram fields would be
highly appreciated!

Thanks,
Bjørn


Re: (Edge)NGramFilterFactory and highlight

2014-12-21 Thread Bjørn Hjelle
Mingchun,

yes, that is better, and it works fine.

Thank you!

Bjørn

On Sat, Dec 20, 2014 at 1:26 PM, Mingchun Zhao wrote:
> Hi Bjørn,
>
> From Solr 4.4, the behavior of end offsets in EdgeNGramFilterFactory
> was changed due to the following issue:
> https://issues.apache.org/jira/browse/LUCENE-3907
> The related source code in this patch is as below:
> ==
> +  if (version.onOrAfter(Version.LUCENE_44)) {
> +    // Never update offsets
> +    updateOffsets = false;
> +  } else {
> +    // if length by start + end offsets doesn't match the term text then assume
> +    // this is a synonym and don't adjust the offsets.
> +    updateOffsets = (tokStart + curTermLength) == tokEnd;
> +  }
> ==
>
> It seems that there is no property for specifying the previous
> behavior of offsets as in LUCENE_43.
> Therefore, you might have to set luceneMatchVersion to deal with it as
> you mentioned.
> However, it would be better to apply luceneMatchVersion just on the
> EdgeNGramFilterFactory as below,
> ==
> <filter class="solr.EdgeNGramFilterFactory" maxGramSize="20" minGramSize="1" luceneMatchVersion="4.3"/>
> ==
> The setting of <luceneMatchVersion>LUCENE_43</luceneMatchVersion> in
> solrconfig.xml will also affect other configurations.
>
> Regards,
> Mingchun
>
>
> 2014-12-19 23:26 GMT+09:00 Bjørn Hjelle :
>> Hi,
>>
>> based on this example:
>> http://www.cominvent.com/2012/01/25/super-flexible-autocomplete-with-solr/
>> I have earlier successfully implemented highlight of terms in
>> (Edge)NGram-analyzed fields.
>>
>> In a new project, however, with Solr 4.10.2 it does not work.
>>
>> In the Solr admin analysis page I see the following in Solr 4.10.2 
>> (simplified):
>>
>> ENGTF  text   t  te  tes  test
>>        start  0  0   0    0
>>        end    4  4   4    4
>>
>> But if I change to LUCENE_43 in solrconfig.xml, and reload the
>> analysis page I get this:
>>
>> ENGTF  text   t  te  tes  test
>>        start  0  0   0    0
>>        end    1  2   3    4
>>
>> So, in 4.10.2 it is not able to find the correct end-positions and the
>> highlighter will instead highlight the complete word ("test" in this
>> case).
>>
>>
>> To reproduce this:
>> 1. download Solr 4.10.2
>> 2. In the collection1 schema.xml, add field type:
>>
>>
>> 
>> 
>> > mapping="mapping-ISOLatin1Accent.txt"/>
>> 
>> > generateWordParts="1" generateNumberParts="1" catenateWords="0"
>> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>> 
>> > maxGramSize="20" minGramSize="1"/>
>> > pattern="([^\w\d\*æøåÆØÅ ])" replacement="" replace="all"/>
>> 
>> 
>> > mapping="mapping-ISOLatin1Accent.txt"/>
>> 
>> > generateWordParts="0" generateNumberParts="0" catenateWords="0"
>> catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
>> 
>> > pattern="([^\w\d\*æøåÆØÅ ])" replacement="" replace="all"/>
>> > pattern="^(.{20})(.*)?" replacement="$1" replace="all"/>
>> 
>> 
>>
>> 3. Start Solr and, in the analysis page, enter "Test" in the Field Value
>> (Index) field and check the output.
>> 4. Then change to this in solrconfig.xml
>>
>>   <luceneMatchVersion>LUCENE_43</luceneMatchVersion>
>>
>> 5. reload the core and reload the analysis page.
>> 6. you will now see that the end-positions are correct.
>>
>>
>>
>> Any ideas on how to make this work with Solr 4.10.2 without resorting
>> to changing lucene version in solrconfig.xml?
>>
>>
>> Thanks,
>> Bjørn
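
The effect of the updateOffsets flag in the quoted patch, and why it changes what gets highlighted, can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Lucene code: edge_ngrams and highlight are hypothetical helpers, and the highlighter here simply wraps the offset span of the gram that equals the query term.

```python
def edge_ngrams(token, start, min_gram, max_gram, update_offsets):
    """Return (gram, start_offset, end_offset) tuples for the front edge n-grams.

    update_offsets=True models the LUCENE_43 behaviour (each gram gets its own
    end offset); False models 4.4 and later (every gram keeps the offsets of
    the whole token).
    """
    grams = []
    for n in range(min_gram, min(max_gram, len(token)) + 1):
        end = start + n if update_offsets else start + len(token)
        grams.append((token[:n], start, end))
    return grams


def highlight(text, query, grams, pre="<em>", post="</em>"):
    """Wrap the offset span of the gram matching the query (toy highlighter)."""
    for gram, start, end in grams:
        if gram == query:
            return text[:start] + pre + text[start:end] + post + text[end:]
    return text


old_style = edge_ngrams("test", 0, 1, 20, update_offsets=True)
new_style = edge_ngrams("test", 0, 1, 20, update_offsets=False)

print([end for _, _, end in old_style])   # end offsets with LUCENE_43
print([end for _, _, end in new_style])   # end offsets from 4.4 onwards
print(highlight("test", "te", old_style))
print(highlight("test", "te", new_style))
```

In this model the LUCENE_43 variant yields end offsets 1, 2, 3, 4 and highlights only "te", while the newer variant yields 4, 4, 4, 4 and highlights the whole word, matching the two analysis tables in the quoted message.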


(Edge)NGramFilterFactory and highlight

2014-12-19 Thread Bjørn Hjelle
Hi,

based on this example:
http://www.cominvent.com/2012/01/25/super-flexible-autocomplete-with-solr/
I have earlier successfully implemented highlight of terms in
(Edge)NGram-analyzed fields.

In a new project, however, with Solr 4.10.2 it does not work.

In the Solr admin analysis page I see the following in Solr 4.10.2 (simplified):

ENGTF  text   t  te  tes  test
       start  0  0   0    0
       end    4  4   4    4

But if I change to LUCENE_43 in solrconfig.xml, and reload the
analysis page I get this:

ENGTF  text   t  te  tes  test
       start  0  0   0    0
       end    1  2   3    4

So, in 4.10.2 it is not able to find the correct end-positions and the
highlighter will instead highlight the complete word ("test" in this
case).


To reproduce this:
1. download Solr 4.10.2
2. In the collection1 schema.xml, add field type:





















3. Start Solr and, in the analysis page, enter "Test" in the Field Value
(Index) field and check the output.
4. Then change to this in solrconfig.xml

  <luceneMatchVersion>LUCENE_43</luceneMatchVersion>

5. reload the core and reload the analysis page.
6. you will now see that the end-positions are correct.



Any ideas on how to make this work with Solr 4.10.2 without resorting
to changing lucene version in solrconfig.xml?


Thanks,
Bjørn