Re: Phrase query no hits when stopwords and FlattenGraphFilterFactory used

2020-11-11 Thread Edward Turner
Many thanks Walter, that's useful information. And yes, if we are able to
keep stopwords, then we will. We have been exploring stopword removal because
we've noticed it leads to a sizable drop in index size (5%, in some of our
tests), which then had the knock-on effect of better performance. (Also,
unfortunately, we do not have the luxury of using super big
machines/storage -- so it's always a balancing act for us.)

Best,
Edd

Edward Turner


On Tue, 10 Nov 2020 at 16:22, Walter Underwood wrote:

> By far the simplest solution is to leave stopwords in the index. That also
> improves relevance, because it becomes possible to search for “vitamin a”
> or “to be or not to be”.
>
> Stopword removal was a performance and disk space hack from the 1960s. It
> is no longer needed. We were keeping stopwords in the index at Infoseek,
> back in 1996.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)

Re: Phrase query no hits when stopwords and FlattenGraphFilterFactory used

2020-11-10 Thread Walter Underwood
By far the simplest solution is to leave stopwords in the index. That also
improves relevance, because it becomes possible to search for “vitamin a” or
“to be or not to be”.

Stopword removal was a performance and disk space hack from the 1960s. It is
no longer needed. We were keeping stopwords in the index at Infoseek, back in
1996.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 10, 2020, at 1:16 AM, Edward Turner  wrote:
> 
> Hi all,
> 
> Okay, I've been doing more research about this problem and from what I
> understand, phrase queries + stopwords are known to have some difficulties
> working together in some circumstances.
> 
> E.g.,
> https://stackoverflow.com/questions/56802656/stopwords-and-phrase-queries-solr?rq=1
> https://issues.apache.org/jira/browse/SOLR-6468
> 
> I was thinking about workarounds, but each solution I've attempted doesn't
> quite work.
> 
> Therefore, maybe one possible solution is to take a step back and
> preprocess index/query data going to Solr, something like:
> 
> String wordsForSolr = removeStopWordsFrom("This is pretend index or query data");
> // wordsForSolr = "pretend index query data"
> 
> Off the top of my head, this should bypass the position issues, since Solr
> would never see the stopwords.
> 
> I will give this a go, but was wondering whether this is something others
> have done?
> 
> Best wishes,
> Edd
> 
> 
> Edward Turner



Re: Phrase query no hits when stopwords and FlattenGraphFilterFactory used

2020-11-10 Thread Edward Turner
Hi all,

Okay, I've been doing more research about this problem and from what I
understand, phrase queries + stopwords are known to have some difficulties
working together in some circumstances.

E.g.,
https://stackoverflow.com/questions/56802656/stopwords-and-phrase-queries-solr?rq=1
https://issues.apache.org/jira/browse/SOLR-6468

I was thinking about workarounds, but each solution I've attempted doesn't
quite work.

Therefore, maybe one possible solution is to take a step back and
preprocess index/query data going to Solr, something like:

String wordsForSolr = removeStopWordsFrom("This is pretend index or query data");
// wordsForSolr = "pretend index query data"

Off the top of my head, this should bypass the position issues, since Solr
would never see the stopwords.
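
A minimal sketch of what such a preprocessing step might look like, assuming
plain whitespace tokenization and the stopword list from the original post.
The class name StopwordPreprocessor is made up for illustration, and splitting
on whitespace is only an assumption: it will not match Solr's own tokenization
(for example, punctuation stays attached to words), so the same routine would
have to be applied to both index and query data.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Solr: strips stopwords before text is sent to Solr.
final class StopwordPreprocessor {

    // Stopword list copied from the original post in this thread.
    private static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
        "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
        "that", "the", "their", "then", "there", "these", "they", "this",
        "to", "was", "will", "with", "which"));

    // Splits on whitespace (an assumption), drops stopwords case-insensitively,
    // and joins the remaining words with single spaces.
    static String removeStopWordsFrom(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
            .filter(word -> !word.isEmpty())
            .filter(word -> !STOPWORDS.contains(word.toLowerCase(Locale.ROOT)))
            .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // prints "pretend index query data"
        System.out.println(removeStopWordsFrom("This is pretend index or query data"));
    }
}
```

Because the stopwords are gone before analysis, no position gaps are created
for them, but a phrase such as "vitamin a" can then no longer be told apart
from "vitamin".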

I will give this a go, but was wondering whether this is something others
have done?

Best wishes,
Edd


Edward Turner


On Fri, 6 Nov 2020 at 13:58, Edward Turner  wrote:

> Hi all,
>
> We are experiencing some unexpected behaviour for phrase queries which we
> believe might be related to the FlattenGraphFilterFactory and stopwords.
>
> Brief description: when performing a phrase query
> "Molecular cloning and evolution of the" => we get expected hits
> "Molecular cloning and evolution of the genes" => we get no hits
> (unexpected behaviour)
>
> I think it's worthwhile adding the analyzers we use to help you see what
> we're doing:
>  Analyzers 
> sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
>
> pattern="[- /()]+" />
> ignoreCase="true" />
> preserveOriginal="false" />
>   
> generateNumberParts="1" splitOnCaseChange="0" preserveOriginal="0"
>  splitOnNumerics="0" stemEnglishPossessive="1"
> generateWordParts="1"
>  catenateNumbers="0" catenateWords="1" catenateAll="1" />
>   
>
>
> pattern="[- /()]+" />
> ignoreCase="true" />
> preserveOriginal="false" />
>   
> generateNumberParts="1" splitOnCaseChange="0" preserveOriginal="0"
>  splitOnNumerics="0" stemEnglishPossessive="1"
> generateWordParts="1"
>  catenateNumbers="0" catenateWords="0" catenateAll="0" />
>
> 
>  End of Analyzers 
>
>  Stopwords 
> We use the following stopwords:
> a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not,
> of, on, or, such, that, the, their, then, there, these, they, this, to,
> was, will, with, which
>  End of Stopwords 
>
>  Analysis Admin page output ---
> To show what's going on when we're indexing/querying, I created a
> gist with an image of the (non-verbose) output of the analysis admin page,
> using "Molecular cloning and evolution of the genes" as both the index data
> and the query:
>
> https://gist.github.com/eddturner/81dbf409703aad402e9009b13d42e43c#file-analysis-admin-png
>
> Hopefully this link works, and you can see that the resulting terms and
> positions are identical until the FlattenGraphFilterFactory step in the
> "index" phase.
>
> Final stage of index analysis:
> (1)molecular (2)cloning (3) (4)evolution (5) (6)genes
>
> Final stage of query analysis:
> (1)molecular (2)cloning (3) (4)evolution (5) (6) (7)genes
>
> The empty positions are because of stopwords (presumably)
>  End of Analysis Admin page output ---
>
> Main question:
> Could someone explain why the FlattenGraphFilterFactory changes the
> position of the "genes" token? From what we see, this happens after a
> "the" (but we've not checked exhaustively, and continue to test).
>
> Perhaps we are doing something wrong in our analysis setup?
>
> Any help would be much appreciated -- getting phrase queries to work is an
> important use-case of ours.
>
> Kind regards and thank you in advance,
> Edd
> 
> Edward Turner
>


Phrase query no hits when stopwords and FlattenGraphFilterFactory used

2020-11-06 Thread Edward Turner
Hi all,

We are experiencing some unexpected behaviour for phrase queries which we
believe might be related to the FlattenGraphFilterFactory and stopwords.

Brief description: when performing a phrase query
"Molecular cloning and evolution of the" => we get expected hits
"Molecular cloning and evolution of the genes" => we get no hits
(unexpected behaviour)

I think it's worthwhile adding the analyzers we use to help you see what
we're doing:
 Analyzers 

   
  
  
  
  
  
  
   
   
  
  
  
  
  
   

 End of Analyzers 

 Stopwords 
We use the following stopwords:
a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not,
of, on, or, such, that, the, their, then, there, these, they, this, to,
was, will, with, which
 End of Stopwords 

 Analysis Admin page output ---
To show what's going on when we're indexing/querying, I created a
gist with an image of the (non-verbose) output of the analysis admin page,
using "Molecular cloning and evolution of the genes" as both the index data
and the query:
https://gist.github.com/eddturner/81dbf409703aad402e9009b13d42e43c#file-analysis-admin-png

Hopefully this link works, and you can see that the resulting terms and
positions are identical until the FlattenGraphFilterFactory step in the
"index" phase.

Final stage of index analysis:
(1)molecular (2)cloning (3) (4)evolution (5) (6)genes

Final stage of query analysis:
(1)molecular (2)cloning (3) (4)evolution (5) (6) (7)genes

The empty positions are because of stopwords (presumably)
 End of Analysis Admin page output ---
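
To make the mismatch concrete, here is a small sketch of how an exact phrase
match depends on those positions. It is a simplified model written for this
post, not Lucene's actual implementation: it assumes each term occurs once and
that an exact phrase match needs every query term to sit at the same offset
from the indexed positions. The class and method names are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified, hypothetical model of exact phrase matching on token positions.
final class PhrasePositionCheck {

    // True if every query term is indexed and all terms share one constant
    // shift between their query position and their indexed position.
    static boolean exactPhraseMatches(Map<String, Integer> indexed,
                                      Map<String, Integer> query) {
        Integer shift = null;
        for (Map.Entry<String, Integer> term : query.entrySet()) {
            Integer indexedPos = indexed.get(term.getKey());
            if (indexedPos == null) {
                return false; // term not in the document at all
            }
            int delta = indexedPos - term.getValue();
            if (shift == null) {
                shift = delta;
            } else if (shift != delta) {
                return false; // relative positions disagree
            }
        }
        return true;
    }

    // Final stage of index analysis, as reported above.
    static Map<String, Integer> indexedPositions() {
        Map<String, Integer> p = new LinkedHashMap<>();
        p.put("molecular", 1); p.put("cloning", 2);
        p.put("evolution", 4); p.put("genes", 6);
        return p;
    }

    // Final stage of query analysis for the full phrase, as reported above.
    static Map<String, Integer> fullQueryPositions() {
        Map<String, Integer> p = new LinkedHashMap<>();
        p.put("molecular", 1); p.put("cloning", 2);
        p.put("evolution", 4); p.put("genes", 7);
        return p;
    }

    // Query positions for the shorter phrase that ends in "... of the".
    static Map<String, Integer> shortQueryPositions() {
        Map<String, Integer> p = new LinkedHashMap<>();
        p.put("molecular", 1); p.put("cloning", 2); p.put("evolution", 4);
        return p;
    }

    public static void main(String[] args) {
        System.out.println(exactPhraseMatches(indexedPositions(), shortQueryPositions())); // true
        System.out.println(exactPhraseMatches(indexedPositions(), fullQueryPositions()));  // false
    }
}
```

Under this model the shorter phrase matches because its three surviving terms
line up, while the full phrase fails only because "genes" is indexed at
position 6 but expected at position 7.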

Main question:
Could someone explain why the FlattenGraphFilterFactory changes the
position of the "genes" token? From what we see, this happens after a
"the" (but we've not checked exhaustively, and continue to test).

Perhaps we are doing something wrong in our analysis setup?

Any help would be much appreciated -- getting phrase queries to work is an
important use-case of ours.

Kind regards and thank you in advance,
Edd

Edward Turner