Re: Phrase query no hits when stopwords and FlattenGraphFilterFactory used

2020-11-11 Thread Edward Turner
Many thanks, Walter, that's useful information. And yes, if we are able to
keep stopwords, then we will. We have been exploring stopword removal because
we've noticed it leads to a sizable drop in index size (5% in some of our
tests), which then has the knock-on effect of better performance. (Also,
unfortunately, we do not have the luxury of using super big
machines/storage -- so it's always a balancing act for us.)

Best,
Edd

Edward Turner


On Tue, 10 Nov 2020 at 16:22, Walter Underwood 
wrote:

> By far the simplest solution is to leave stopwords in the index. That also
> improves
> relevance, because it becomes possible to search for “vitamin a” or “to be
> or not to be”.
>
> Stopword removal was a performance and disk space hack from the 1960s. It
> is no
> longer needed. We were keeping stopwords in the index at Infoseek, back in
> 1996.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Nov 10, 2020, at 1:16 AM, Edward Turner  wrote:
> >
> > Hi all,
> >
> > Okay, I've been doing more research about this problem and from what I
> > understand, phrase queries + stopwords are known to have some
> difficulties
> > working together in some circumstances.
> >
> > E.g.,
> >
> https://stackoverflow.com/questions/56802656/stopwords-and-phrase-queries-solr?rq=1
> > https://issues.apache.org/jira/browse/SOLR-6468
> >
> > I was thinking about workarounds, but each solution I've attempted
> doesn't
> > quite work.
> >
> > Therefore, maybe one possible solution is to take a step back and
> > preprocess index/query data going to Solr, something like:
> >
> > String wordsForSolr = removeStopWordsFrom("This is pretend index or query
> > data")
> > // wordsForSolr = "pretend index query data"
> >
> > Off the top of my head, this will bypass position issues.
> >
> > I will give this a go, but was wondering whether this is something others
> > have done?
> >
> > Best wishes,
> > Edd
> >
> > 
> > Edward Turner
> >
> >
> > On Fri, 6 Nov 2020 at 13:58, Edward Turner  wrote:
> >
> >> Hi all,
> >>
> >> We are experiencing some unexpected behaviour for phrase queries which
> we
> >> believe might be related to the FlattenGraphFilterFactory and stopwords.
> >>
> >> Brief description: when performing a phrase query
> >> "Molecular cloning and evolution of the" => we get expected hits
> >> "Molecular cloning and evolution of the genes" => we get no hits
> >> (unexpected behaviour)
> >>
> >> I think it's worthwhile adding the analyzers we use to help you see what
> >> we're doing:
> >>  Analyzers 
> >>  >>   sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
> >>   
> >>   >> pattern="[- /()]+" />
> >>   >> ignoreCase="true" />
> >>   >> preserveOriginal="false" />
> >>  
> >>   >> generateNumberParts="1" splitOnCaseChange="0"
> preserveOriginal="0"
> >> splitOnNumerics="0" stemEnglishPossessive="1"
> >> generateWordParts="1"
> >> catenateNumbers="0" catenateWords="1" catenateAll="1" />
> >>  
> >>   
> >>   
> >>   >> pattern="[- /()]+" />
> >>   >> ignoreCase="true" />
> >>   >> preserveOriginal="false" />
> >>  
> >>   >> generateNumberParts="1" splitOnCaseChange="0"
> preserveOriginal="0"
> >> splitOnNumerics="0" stemEnglishPossessive="1"
> >> generateWordParts="1"
> >> catenateNumbers="0" catenateWords="0" catenateAll="0" />
> >>   
> >> 
> >>  End of Analyzers 
> >>
> >>  Stopwords 
> >> We use the following stopwords:
> >> a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no,
> not,
> >> of, on, or, such, that, the, their, then, there, these, they, this, to,
> >> was, will, with, which
> >>  End of Stopwords 

Re: Phrase query no hits when stopwords and FlattenGraphFilterFactory used

2020-11-10 Thread Edward Turner
Hi all,

Okay, I've been doing more research about this problem and from what I
understand, phrase queries + stopwords are known to have some difficulties
working together in some circumstances.

E.g.,
https://stackoverflow.com/questions/56802656/stopwords-and-phrase-queries-solr?rq=1
https://issues.apache.org/jira/browse/SOLR-6468

I was thinking about workarounds, but each solution I've attempted doesn't
quite work.

Therefore, maybe one possible solution is to take a step back and
preprocess index/query data going to Solr, something like:

String wordsForSolr = removeStopWordsFrom("This is pretend index or query
data")
// wordsForSolr = "pretend index query data"

Off the top of my head, this will bypass position issues.
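
A minimal sketch of what I mean by removeStopWordsFrom (the stopword set and
the whitespace tokenisation here are simplified assumptions, not production
code):

import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

public class StopWordStripper {

    // Assumed stopword set -- in practice this must match stopwords.txt exactly
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "is", "of", "or", "the", "this", "to");

    // Naive whitespace tokenisation; real text may need the same tokenisation Solr uses
    public static String removeStopWordsFrom(String text) {
        return Arrays.stream(text.split("\\s+"))
                .filter(token -> !STOP_WORDS.contains(token.toLowerCase()))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // prints "pretend index query data"
        System.out.println(removeStopWordsFrom("This is pretend index or query data"));
    }
}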

I will give this a go, but was wondering whether this is something others
have done?

Best wishes,
Edd

--------
Edward Turner


On Fri, 6 Nov 2020 at 13:58, Edward Turner  wrote:

> Hi all,
>
> We are experiencing some unexpected behaviour for phrase queries which we
> believe might be related to the FlattenGraphFilterFactory and stopwords.
>
> Brief description: when performing a phrase query
> "Molecular cloning and evolution of the" => we get expected hits
> "Molecular cloning and evolution of the genes" => we get no hits
> (unexpected behaviour)
>
> I think it's worthwhile adding the analyzers we use to help you see what
> we're doing:
>  Analyzers 
> sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
>
> pattern="[- /()]+" />
> ignoreCase="true" />
> preserveOriginal="false" />
>   
> generateNumberParts="1" splitOnCaseChange="0" preserveOriginal="0"
>  splitOnNumerics="0" stemEnglishPossessive="1"
> generateWordParts="1"
>  catenateNumbers="0" catenateWords="1" catenateAll="1" />
>   
>
>
> pattern="[- /()]+" />
> ignoreCase="true" />
> preserveOriginal="false" />
>   
> generateNumberParts="1" splitOnCaseChange="0" preserveOriginal="0"
>  splitOnNumerics="0" stemEnglishPossessive="1"
> generateWordParts="1"
>  catenateNumbers="0" catenateWords="0" catenateAll="0" />
>
> 
>  End of Analyzers 
>
>  Stopwords 
> We use the following stopwords:
> a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not,
> of, on, or, such, that, the, their, then, there, these, they, this, to,
> was, will, with, which
>  End of Stopwords 
>
>  Analysis Admin page output ---
> ... And to see what's going on when we're indexing/querying, I created a
> gist with an image of the (non-verbose) output of the analysis admin page
> > for the index/query data "Molecular cloning and evolution of the genes":
>
> https://gist.github.com/eddturner/81dbf409703aad402e9009b13d42e43c#file-analysis-admin-png
>
> Hopefully this link works, and you can see that the resulting terms and
> positions are identical until the FlattenGraphFilterFactory step in the
> "index" phase.
>
> Final stage of index analysis:
> (1)molecular (2)cloning (3) (4)evolution (5) (6)genes
>
> Final stage of query analysis:
> (1)molecular (2)cloning (3) (4)evolution (5) (6) (7)genes
>
> The empty positions are because of stopwords (presumably)
>  End of Analysis Admin page output ---
>
> Main question:
> Could someone explain why the FlattenGraphFilterFactory changes the
> > position of the "genes" token? From what we see, this happens after a
> "the" (but we've not checked exhaustively, and continue to test).
>
> Perhaps, we are doing something wrong in our analysis setup?
>
> Any help would be much appreciated -- getting phrase queries to work is an
> important use-case of ours.
>
> Kind regards and thank you in advance,
> Edd
> 
> Edward Turner
>


Phrase query no hits when stopwords and FlattenGraphFilterFactory used

2020-11-06 Thread Edward Turner
Hi all,

We are experiencing some unexpected behaviour for phrase queries which we
believe might be related to the FlattenGraphFilterFactory and stopwords.

Brief description: when performing a phrase query
"Molecular cloning and evolution of the" => we get expected hits
"Molecular cloning and evolution of the genes" => we get no hits
(unexpected behaviour)

I think it's worthwhile adding the analyzers we use to help you see what
we're doing:
 Analyzers 

   
  
  
  
  
  
  
   
   
  
  
  
  
  
   

 End of Analyzers 

 Stopwords 
We use the following stopwords:
a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not,
of, on, or, such, that, the, their, then, there, these, they, this, to,
was, will, with, which
 End of Stopwords 

 Analysis Admin page output ---
... And to see what's going on when we're indexing/querying, I created a
gist with an image of the (non-verbose) output of the analysis admin page
for the index/query data "Molecular cloning and evolution of the genes":
https://gist.github.com/eddturner/81dbf409703aad402e9009b13d42e43c#file-analysis-admin-png

Hopefully this link works, and you can see that the resulting terms and
positions are identical until the FlattenGraphFilterFactory step in the
"index" phase.

Final stage of index analysis:
(1)molecular (2)cloning (3) (4)evolution (5) (6)genes

Final stage of query analysis:
(1)molecular (2)cloning (3) (4)evolution (5) (6) (7)genes

The empty positions are because of stopwords (presumably)
 End of Analysis Admin page output ---

Main question:
Could someone explain why the FlattenGraphFilterFactory changes the
position of the "genes" token? From what we see, this happens after a
"the" (but we've not checked exhaustively, and continue to test).

Perhaps, we are doing something wrong in our analysis setup?

Any help would be much appreciated -- getting phrase queries to work is an
important use-case of ours.

Kind regards and thank you in advance,
Edd

Edward Turner


Re: Solr storage of fields <-> indexed data

2020-09-28 Thread Edward Turner
That's really good and helpful info, thank you. Perfect.

Best wishes,

Edd

On Mon, 28 Sep 2020, 5:53 pm Shawn Heisey,  wrote:

> On 9/28/2020 8:56 AM, Edward Turner wrote:
> > By removing the copyfields, we've found that our index sizes have reduced
> > by ~40% in some cases, which is great! We're just curious now as to
> exactly
> > how this can be ...
>
> That's not surprising.
>
> > My question is, given the following two schemas, if we index some data to
> > the "description" field, will the index for schema1 be twice as large as
> > the index of schema2? (I guess this relates to how, internally, Solr
> stores
> > field + index data)
> >
> > Old way -- schema1:
> > ===
> >  > multiValued="false"/>
> >  > multiValued="false" />
> >  > multiValued="false"/>
>
> If the only field in the indexed documents is "description", the index
> built with schema2 will be half the size of the index built with
> schema1.  Both fields referenced by "copyField" are the same type and
> have the same settings, so they would contain exactly the same data at
> the Lucene level.
>
> Having the same type for a source and destination field is normally only
> useful if multiple sources are copied to a destination, which requires
> multiValued="true" on the destination -- NOT the case in your example.
>
> There is one other use case for a copyField -- using the same data
> differently, with different type values.  For example you might have one
> type for faceting and one for searching.
>
> Thanks,
> Shawn
>


Solr storage of fields <-> indexed data

2020-09-28 Thread Edward Turner
Hi all,

We have recently switched to using edismax + qf fields, and no longer use
copyfields to allow us to easily search over values in multiple fields (by
copying multiple fields' values to the copyfield destinations, and then
performing queries over the destination field).

By removing the copyfields, we've found that our index sizes have reduced
by ~40% in some cases, which is great! We're just curious now as to exactly
how this can be ...

My question is, given the following two schemas, if we index some data to
the "description" field, will the index for schema1 be twice as large as
the index of schema2? (I guess this relates to how, internally, Solr stores
field + index data)

Old way -- schema1:
===
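(the field definitions below are a sketch -- the type name and attribute
values are assumed placeholders, not our exact schema)

<field name="description" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="content" type="text_general" indexed="true" stored="true" multiValued="false"/>
<copyField source="description" dest="content"/>

New way -- schema2:
===
<field name="description" type="text_general" indexed="true" stored="true" multiValued="false"/>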




Many thanks and kind regards,

Edd


Re: Manipulating client's query using a Query object

2020-08-17 Thread Edward Turner
Hi Markus,

Many thanks, I see what you are saying. My question was:

Question: is it possible to get a Lucene Query representation of the
client's query, which we can then navigate and manipulate -- before we then
send the String representation of this Query to Solr for evaluation?

... and from your answer, I think the last part of this question, "before
we then send the String representation of this Query to Solr for
evaluation" perhaps makes no sense when doing things in the proper Solr
way. I had originally thought about doing this manipulation in, say, a REST
application, before sending it to Solr, which I suppose isn't the way to do
it.
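
For future reference, a rough sketch of the server-side approach you describe
(the class and package names, and the no-op rewrite step, are placeholders
rather than anything we have written):

import org.apache.lucene.search.Query;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.ExtendedDismaxQParser;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;
import org.apache.solr.search.SyntaxError;

// Registered in solrconfig.xml with something like:
//   <queryParser name="rewritingEdismax" class="com.example.RewritingEdismaxQParserPlugin"/>
// and selected per request with defType=rewritingEdismax.
public class RewritingEdismaxQParserPlugin extends QParserPlugin {

    @Override
    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
        return new ExtendedDismaxQParser(qstr, localParams, params, req) {
            @Override
            public Query parse() throws SyntaxError {
                Query q = super.parse(); // the Lucene Query edismax built from the user's input
                return optimise(q);      // navigate/manipulate the Query here, inside Solr
            }
        };
    }

    // Placeholder for whatever query rewriting is wanted; a no-op in this sketch.
    private Query optimise(Query original) {
        return original;
    }
}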

Many thanks,

Edd
--------
Edward Turner


On Mon, 17 Aug 2020 at 21:23, Markus Jelsma 
wrote:

> Hello Edward,
>
> You asked for the 'Lucene Query representation of the client's query'
> which is already inside Solr and needs no forwarding to anything. Just
> return in parse() and you are good to go.
>
> The Query object contains the analyzed form of your query string.
> ExtendedDismax has some variable (i think it was qstr) that contains the
> original input string. In there you have access to that too.
>
> Regards,
> Markus
>
>
> -Original message-
> > From:Edward Turner 
> > Sent: Monday 17th August 2020 21:25
> > To: solr-user@lucene.apache.org
> > Subject: Re: Manipulating client's query using a Query object
> >
> > Hi Markus,
> >
> > That's really great info. Thank you.
> >
> > Supposing we've now modified the Query object, do you know how we would
> get
> > the corresponding query String, which we could then forward to our
> > Solrcloud via SolrClient?
> >
> > (Or should we be using this extended ExtendedDisMaxQParser class server
> > side in Solr?)
> >
> > Kind regards,
> >
> > Edd
> >
> > 
> > Edward Turner
> >
> >
> > On Mon, 17 Aug 2020 at 15:06, Markus Jelsma 
> > wrote:
> >
> > > Hello Edward,
> > >
> > > Yes you can by extending ExtendedDismaxQParser [1] and override its
> > > parse() method. You get the main Query object through super.parse().
> > >
> > > If you need even more fine grained control on how Query objects are
> > > created you can extend ExtendedSolrQueryParser's [2] (inner class)
> > > newFieldQuery() method.
> > >
> > > Regards,
> > > Markus
> > >
> > > [1]
> > >
> https://lucene.apache.org/solr/8_6_0/solr-core/org/apache/solr/search/ExtendedDismaxQParser.html
> > > [2]
> > >
> https://lucene.apache.org/solr/8_6_0/solr-core/org/apache/solr/search/ExtendedDismaxQParser.ExtendedSolrQueryParser.html
> > >
> > > -Original message-
> > > > From:Edward Turner 
> > > > Sent: Monday 17th August 2020 15:53
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Manipulating client's query using a Query object
> > > >
> > > > Hi all,
> > > >
> > > > Thanks for all your help recently. We're now using the edismax query
> > > parser
> > > > and are happy with its behaviour. We have another question which
> maybe
> > > > someone can help with.
> > > >
> > > > We have one use case where we optimise our query before sending it to
> > > Solr,
> > > > and we do this by manipulating the client's input query string.
> However,
> > > > we're slightly uncomfortable using String manipulation to do this as
> > > > there's always the possibility we parse their string wrongly. (We
> have a
> > > > large test suite to check if we're doing the right thing, but even
> then,
> > > we
> > > > String manipulation doesn't feel right ...).
> > > >
> > > > Question: is it possible to get a Lucene Query representation of the
> > > > client's query, which we can then navigate and manipulate -- before
> we
> > > then
> > > > send the String representation of this Query to Solr for evaluation?
> > > >
> > > > Kind regards and thank you for your help in advance,
> > > >
> > > > Edd
> > > >
> > >
> >
>


Re: Manipulating client's query using a Query object

2020-08-17 Thread Edward Turner
Hi Markus,

That's really great info. Thank you.

Supposing we've now modified the Query object, do you know how we would get
the corresponding query String, which we could then forward to our
Solrcloud via SolrClient?

(Or should we be using this extended ExtendedDisMaxQParser class server
side in Solr?)

Kind regards,

Edd


Edward Turner


On Mon, 17 Aug 2020 at 15:06, Markus Jelsma 
wrote:

> Hello Edward,
>
> Yes you can by extending ExtendedDismaxQParser [1] and override its
> parse() method. You get the main Query object through super.parse().
>
> If you need even more fine grained control on how Query objects are
> created you can extend ExtendedSolrQueryParser's [2] (inner class)
> newFieldQuery() method.
>
> Regards,
> Markus
>
> [1]
> https://lucene.apache.org/solr/8_6_0/solr-core/org/apache/solr/search/ExtendedDismaxQParser.html
> [2]
> https://lucene.apache.org/solr/8_6_0/solr-core/org/apache/solr/search/ExtendedDismaxQParser.ExtendedSolrQueryParser.html
>
> -Original message-
> > From:Edward Turner 
> > Sent: Monday 17th August 2020 15:53
> > To: solr-user@lucene.apache.org
> > Subject: Manipulating client's query using a Query object
> >
> > Hi all,
> >
> > Thanks for all your help recently. We're now using the edismax query
> parser
> > and are happy with its behaviour. We have another question which maybe
> > someone can help with.
> >
> > We have one use case where we optimise our query before sending it to
> Solr,
> > and we do this by manipulating the client's input query string. However,
> > we're slightly uncomfortable using String manipulation to do this as
> > there's always the possibility we parse their string wrongly. (We have a
> > large test suite to check if we're doing the right thing, but even then,
> > String manipulation doesn't feel right ...).
> >
> > Question: is it possible to get a Lucene Query representation of the
> > client's query, which we can then navigate and manipulate -- before we
> then
> > send the String representation of this Query to Solr for evaluation?
> >
> > Kind regards and thank you for your help in advance,
> >
> > Edd
> >
>


Manipulating client's query using a Query object

2020-08-17 Thread Edward Turner
Hi all,

Thanks for all your help recently. We're now using the edismax query parser
and are happy with its behaviour. We have another question which maybe
someone can help with.

We have one use case where we optimise our query before sending it to Solr,
and we do this by manipulating the client's input query string. However,
we're slightly uncomfortable using String manipulation to do this as
there's always the possibility we parse their string wrongly. (We have a
large test suite to check if we're doing the right thing, but even then,
String manipulation doesn't feel right ...).

Question: is it possible to get a Lucene Query representation of the
client's query, which we can then navigate and manipulate -- before we then
send the String representation of this Query to Solr for evaluation?

Kind regards and thank you for your help in advance,

Edd


Re: Multiple "df" fields

2020-08-13 Thread Edward Turner
Goodness me, whoops, yes, it was a typo -- sorry for the confusion. We're
indeed exploring qf, rather than pf! :). So far it's looking promising!

Thanks for your eagle-eye spotting!
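
For the record, the kind of request we're now trying looks roughly like this
(the field names are placeholders for our real schema):

q="molecular cloning"&defType=edismax&qf=name description&rows=10

i.e., with no field prefix in q, edismax searches both "name" and
"description", each with its own analyzer.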

Best,
Edd

Edward Turner


On Wed, 12 Aug 2020 at 13:15, Erick Erickson 
wrote:

> Probably a typo but I think you mean qf rather than pf?
>
> They’re both actually valid, but pf is “phrase field” which will give
> different results….
>
>  Best,
> Erick
>
> > On Aug 12, 2020, at 5:26 AM, Edward Turner  wrote:
> >
> > Many thanks for your suggestions.
> >
> > We do use edismax and bq fields to help with our result ranking, but we'd
> > never thought about using it for this purpose (we were stuck on the
> > copyfield pattern + df pattern). This is a good suggestion though thank
> you.
> >
> > We're now exploring the use of the pf field (thanks to Alexandre R. for
> > this) to automatically search on multiple fields, rather than relying on
> df.
> >
> > Kind regards,
> >
> > Edd
> > 
> > Edward Turner
> >
> >
> > On Tue, 11 Aug 2020 at 15:44, Erick Erickson 
> > wrote:
> >
> >> Have you explored edismax?
> >>
> >>> On Aug 11, 2020, at 10:34 AM, Alexandre Rafalovitch <
> arafa...@gmail.com>
> >> wrote:
> >>>
> >>> I can't remember if field aliasing works with df but it may be worth a
> >> try:
> >>>
> >>>
> >>
> https://lucene.apache.org/solr/guide/8_1/the-extended-dismax-query-parser.html#field-aliasing-using-per-field-qf-overrides
> >>>
> >>> Another example:
> >>>
> >>
> https://github.com/arafalov/solr-indexing-book/blob/master/published/languages/conf/solrconfig.xml
> >>>
> >>> Regards,
> >>>   Alex
> >>>
> >>> On Tue., Aug. 11, 2020, 9:59 a.m. Edward Turner, 
> >>> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> Is it possible to have multiple "df" fields? (We think the answer is
> no
> >>>> because our experiments did not work when adding multiple "df" values
> to
> >>>> solrconfig.xml -- but we just wanted to double check with those who
> know
> >>>> better.) The reason we would like to do this is that we have two main
> >> field
> >>>> types (with different analyzers) and we'd like queries without a field
> >> to
> >>>> be searched over both of them. We could also use copyfields, but this
> >> would
> >>>> require us to have a common analyzer, which isn't exactly what we
> want.
> >>>>
> >>>> An alternative solution is to pre-process the query prior to sending
> it
> >> to
> >>>> Solr, so that queries with no field are changed as follows:
> >>>>
> >>>> q=value -> q=(field1:value OR field2:value)
> >>>>
> >>>> ... however, we feel a bit uncomfortable doing this via String
> >>>> manipulation.
> >>>>
> >>>> Is there an obvious way we should tackle this problem that we are
> >> missing
> >>>> (e.g., which would be cleaner/safer and perhaps works at the Query
> >> object
> >>>> level)?
> >>>>
> >>>> Many thanks and best wishes,
> >>>>
> >>>> Edd
> >>>>
> >>
> >>
>
>


Re: Multiple "df" fields

2020-08-12 Thread Edward Turner
Many thanks for your suggestions.

We do use edismax and bq fields to help with our result ranking, but we'd
never thought about using it for this purpose (we were stuck on the
copyfield pattern + df pattern). This is a good suggestion though, thank you.

We're now exploring the use of the pf field (thanks to Alexandre R. for
this) to automatically search on multiple fields, rather than relying on df.

Kind regards,

Edd

Edward Turner


On Tue, 11 Aug 2020 at 15:44, Erick Erickson 
wrote:

> Have you explored edismax?
>
> > On Aug 11, 2020, at 10:34 AM, Alexandre Rafalovitch 
> wrote:
> >
> > I can't remember if field aliasing works with df but it may be worth a
> try:
> >
> >
> https://lucene.apache.org/solr/guide/8_1/the-extended-dismax-query-parser.html#field-aliasing-using-per-field-qf-overrides
> >
> > Another example:
> >
> https://github.com/arafalov/solr-indexing-book/blob/master/published/languages/conf/solrconfig.xml
> >
> > Regards,
> >Alex
> >
> > On Tue., Aug. 11, 2020, 9:59 a.m. Edward Turner, 
> > wrote:
> >
> >> Hi all,
> >>
> >> Is it possible to have multiple "df" fields? (We think the answer is no
> >> because our experiments did not work when adding multiple "df" values to
> >> solrconfig.xml -- but we just wanted to double check with those who know
> >> better.) The reason we would like to do this is that we have two main
> field
> >> types (with different analyzers) and we'd like queries without a field
> to
> >> be searched over both of them. We could also use copyfields, but this
> would
> >> require us to have a common analyzer, which isn't exactly what we want.
> >>
> >> An alternative solution is to pre-process the query prior to sending it
> to
> >> Solr, so that queries with no field are changed as follows:
> >>
> >> q=value -> q=(field1:value OR field2:value)
> >>
> >> ... however, we feel a bit uncomfortable doing this via String
> >> manipulation.
> >>
> >> Is there an obvious way we should tackle this problem that we are
> missing
> >> (e.g., which would be cleaner/safer and perhaps works at the Query
> object
> >> level)?
> >>
> >> Many thanks and best wishes,
> >>
> >> Edd
> >>
>
>


Re: Multiple "df" fields

2020-08-11 Thread Edward Turner
Hi David,

We tried using copyfields, and we can get this to work, but it's not
exactly what we want because we need to use a common type. E.g.,
















Then if our "df" is specified as the "content" field, we can search over
"id", "name" and "organism" in one swoop. However, "content" has a
different type to "id" and "name", and so our search results might be
different than if we had searched directly on "id" or "name".

e.g.,
q=id:value1 // hits id field, which uses the "simple" type
q=value1 // hits content field, which uses the "complex" type
... so results might differ between the two queries
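
In sketch form, the setup is something like this (the analyzers shown are
simplified placeholders, not our real chains):

<fieldType name="simple" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
<fieldType name="complex" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="id" type="simple" indexed="true" stored="true"/>
<field name="name" type="simple" indexed="true" stored="true"/>
<field name="organism" type="simple" indexed="true" stored="true"/>
<field name="content" type="complex" indexed="true" stored="false" multiValued="true"/>

<copyField source="id" dest="content"/>
<copyField source="name" dest="content"/>
<copyField source="organism" dest="content"/>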

I hope this clarifies our question?

Best,

Edd


Edward Turner


On Tue, 11 Aug 2020 at 15:03, David Hastings 
wrote:

> why not use a copyfield for indexing?
>
> On Tue, Aug 11, 2020 at 9:59 AM Edward Turner  wrote:
>
> > Hi all,
> >
> > Is it possible to have multiple "df" fields? (We think the answer is no
> > because our experiments did not work when adding multiple "df" values to
> > solrconfig.xml -- but we just wanted to double check with those who know
> > better.) The reason we would like to do this is that we have two main
> field
> > types (with different analyzers) and we'd like queries without a field to
> > be searched over both of them. We could also use copyfields, but this
> would
> > require us to have a common analyzer, which isn't exactly what we want.
> >
> > An alternative solution is to pre-process the query prior to sending it
> to
> > Solr, so that queries with no field are changed as follows:
> >
> > q=value -> q=(field1:value OR field2:value)
> >
> > ... however, we feel a bit uncomfortable doing this via String
> > manipulation.
> >
> > Is there an obvious way we should tackle this problem that we are missing
> > (e.g., which would be cleaner/safer and perhaps works at the Query object
> > level)?
> >
> > Many thanks and best wishes,
> >
> > Edd
> >
>


Multiple "df" fields

2020-08-11 Thread Edward Turner
Hi all,

Is it possible to have multiple "df" fields? (We think the answer is no
because our experiments did not work when adding multiple "df" values to
solrconfig.xml -- but we just wanted to double check with those who know
better.) The reason we would like to do this is that we have two main field
types (with different analyzers) and we'd like queries without a field to
be searched over both of them. We could also use copyfields, but this would
require us to have a common analyzer, which isn't exactly what we want.

An alternative solution is to pre-process the query prior to sending it to
Solr, so that queries with no field are changed as follows:

q=value -> q=(field1:value OR field2:value)

... however, we feel a bit uncomfortable doing this via String
manipulation.

Is there an obvious way we should tackle this problem that we are missing
(e.g., which would be cleaner/safer and perhaps works at the Query object
level)?

Many thanks and best wishes,

Edd


Re: Solrcloud export all results sorted by score

2019-10-04 Thread Edward Turner
Hi Chris,

Good info, thank you for that!

> What's your UI & middle layer like for this application and
> eventual "download" ?

I'm working in a team on the back-end side of things, where we provide a
REST API that can be used by clients, including our UI, a React JS based app
with various fancy bio visualisations in it. In slightly more detail, Solr is
used purely for search, giving us the IDs of the hits. We then use a
key-value store to fetch the entity data for those IDs. So, generally
speaking, each "download" involves:

1. user request asking for data in content-type X
2. our REST app makes solr request
3. IDs <- solr fetches results
4. entities <- fetch from key-value store entities with keys in IDs
5. write entities in format X

Using cursorMark, steps 3 & 4 will be performed repeatedly until all hits are
fetched; and we may run step 3 in a separate thread from 4 & 5, to ensure Solr
communication need not block fetching entity data / writing. We could do
more optimisation around these tasks, but I'm sure you've already
understood.
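
For step 3, the loop we have in mind is roughly the following SolrJ sketch
(the Solr URL, collection name, page size and field names are placeholders):

import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class ScoreOrderedExport {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient solr = new CloudSolrClient.Builder(
                List.of("http://localhost:8983/solr")).build()) {
            SolrQuery query = new SolrQuery("the user's query");
            query.setRows(500);                           // page size per round trip
            query.setSort("score", SolrQuery.ORDER.desc); // keep the relevancy order the user saw
            query.addSort("id", SolrQuery.ORDER.asc);     // cursorMark needs a uniqueKey tie-break

            String cursor = CursorMarkParams.CURSOR_MARK_START;
            while (true) {
                query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
                QueryResponse rsp = solr.query("my_collection", query);
                for (SolrDocument doc : rsp.getResults()) {
                    String id = (String) doc.getFieldValue("id");
                    // steps 4 & 5: fetch the entity for this id from the
                    // key-value store and write it out in the requested format
                }
                String next = rsp.getNextCursorMark();
                if (cursor.equals(next)) {
                    break;                                // cursor unchanged => all hits fetched
                }
                cursor = next;
            }
        }
    }
}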

Many thanks for your input.

Best,
Edd

On Thu, 3 Oct 2019 at 19:13, Chris Hostetter 
wrote:

>
> : We show a table of search results ordered by score (relevancy) that was
> : obtained from sending a query to the standard /select handler. We're
> : working in the life-sciences domain and it is common for our result sets
> to
> : contain many millions of results (unfortunately). After users browse
> their
> : results, they then may want to download the results that they see, to do
> : some post-processing. However, to do this, such that the results appear
> in
> : the order that the user originally saw them, we'd need to be able to
> export
> : results based on score/relevancy.
>
> What's your UI & middle layer like for this application and
> eventual "download" ?
>
> I'm going to presume your end user facing app is reading the data from
> Solr, buffering it locally while formatting it in some user selected
> export format, and then giving the user a download link?
>
> In which case using a cursor, and making iterative requests to solr from
> your app should work just fine...
>
>
> https://lucene.apache.org/solr/guide/8_0/pagination-of-results.html#fetching-a-large-number-of-sorted-results-cursors
>
> (The added benefit of cursors over /export is that it doesn't require doc
> values on every field you return ... which seems like something that you
> might care about if you have large (text) fields and an index growing as
> fast as you describe yours growing)
>
>
> If you don't have any sort of middle layer application, and you're just
> providing a very thin (ie: javascript) based UI in front of solr,
> and need a way to stream a full result set from solr that you can give
> your end users raw direct access to ... then i think you're out of luck?
>
>
> -Hoss
> http://www.lucidworks.com/
>


Re: Solrcloud export all results sorted by score

2019-10-03 Thread Edward Turner
Hi Walter,

Thank you also for your reply. Good to know of your experience. Roughly how
many documents were you fetching? Unfortunately, it's possible that some of
our users could attempt to "download" many records, meaning we'd need to
make a request to Solr where rows >= 150M. A key challenge for us is that
in the life sciences, when more sequencing data comes in, it's possible for
our data-sets to grow extremely quickly. Currently it doubles every 18
months or so (and today we have about 200M records, so not super big right
now).

Best,
Edd
----
Edward Turner


On Tue, 1 Oct 2019 at 17:33, Walter Underwood  wrote:

> I had to do this recently on a Solr Cloud cluster. I wanted to export all
> the IDs, but they weren’t stored as docvalues.
>
> The fastest approach was to fetch all the IDs in one request. First, I
> make a request for zero rows to get the numFound. Then I fetch
> numFound+1000 (in case docs were added while I wasn’t looking) in one
> request.
>
> I also have a hairy shell script to do /export on each leader after
> parsing cluster status. That might be a little large to post to this list,
> but I can do it if there is general interest.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Oct 1, 2019, at 9:14 AM, Erick Erickson 
> wrote:
> >
> > First, thanks for taking the time to ask a question with enough
> supporting details that I can hope to be able to answer in one exchange ;).
> It’s a pleasure to see.
> >
> > Second, NP with asking on Stack Overflow, they have some excellent
> answers there. But you’re right, this list gets more Solr-centered eyeballs.
> >
> > On to your question. I think the best answer was that “/export wasn’t
> designed to deal with scores”, which you’ll find disappointing.
> >
> > You could use the Streaming “search” expression (using qt=/select or
> just leave qt out) but that’ll sort all of the docs you’re exporting into a
> huge list, which may perform worse than CursorMark even if it doesn’t blow
> up memory.
> >
> > The root of this problem is that export can sort in batches since the
> values it’s sorting on are contained in each document, so it can iterate in
> batches, send them out, then iterate again on the remaining documents.
> >
> > Score, since it’s dynamic, can’t do that. Solr has to score _all_ the
> docs to know where a doc lands in the final set relative to any other doc,
> so if it were going to work it’d have to have enough memory to hold the
> scores of all the docs in an ordered list, which is very expensive.
> Conceptually this is an ordered list up to maxDoc long. Not only does there
> have to be enough memory to hold the entire list, every doc has to be
> inserted individually which can kill performance. This is the “deep paging”
> problem.
> >
> > In the usual case of returning, say, 20 docs, the sorted list only has
> to be 20 long, higher scoring docs evict lower scoring docs.
> >
> > So I think CursorMark is your best bet.
> >
> > Best,
> > Erick
> >
> >> On Oct 1, 2019, at 3:59 AM, Edward Turner  wrote:
> >>
> >> Hi all,
> >>
> >> As far as I understand, SolrCloud currently does not allow the use of
> >> sorting by the pseudofield, score in the /export request handler (i.e.,
> get
> >> the results in relevancy order). If we do attempt this, we get an
> >> exception, "org.apache.solr.search.SyntaxError: Scoring is not currently
> >> supported with xsort". We could use Solr's cursorMark, but this takes a
> >> very long time ...
> >>
> >> Exporting results does work, however, when exporting result sets by a
> >> specific document field that has docValues set to true.
> >>
> >> Question:
> >> Does anyone know if/when it will be possible to sort by score in the
> >> /export handler?
> >>
> >> Research on the problem:
> >> We've seen https://issues.apache.org/jira/browse/SOLR-5244 and
> >> https://issues.apache.org/jira/browse/SOLR-8664, which are related to
> this
> >> issue, but don't fix it. Maybe I've missed a more relevant issue?
> >>
> >> Our use-case: We are using Solrcloud in our team and it's added a huge
> >> amount of value to our users.
> >>
> >> We show a table of search results ordered by score (relevancy) that was
> >> obtained from sending a query to the standard /select handler. We're
> >> working in the life-sciences domain and it is common for our result
> sets to
> >> contain many millions of results (unfortunately). After us

Re: Solrcloud export all results sorted by score

2019-10-03 Thread Edward Turner
Hi Erick,

Many thanks for your detailed reply. It's really good information for us to
know, and although not exactly what we wanted to hear (that /export wasn't
designed to handle ranking), it's much better for us to definitively know
one way or the other -- and this allows us to move forward. We'll
experiment by going the cursorMark route. I'm hoping that the bottleneck
then isn't Solr, but rather the fetching and writing of the full records
(we use Solr as just a search engine, which gives us IDs of records of
interest; and we use a separate key-value store to get the actual record
data). Anyway, we'll see and fingers crossed :).

Best wishes,

Edd



On Tue, 1 Oct 2019 at 17:15, Erick Erickson  wrote:

> First, thanks for taking the time to ask a question with enough supporting
> details that I can hope to be able to answer in one exchange ;). It’s a
> pleasure to see.
>
> Second, NP with asking on Stack Overflow, they have some excellent answers
> there. But you’re right, this list gets more Solr-centered eyeballs.
>
> On to your question. I think the best answer was that “/export wasn’t
> designed to deal with scores”, which you’ll find disappointing.
>
> You could use the Streaming “search” expression (using qt=/select or just
> leave qt out) but that’ll sort all of the docs you’re exporting into a huge
> list, which may perform worse than CursorMark even if it doesn’t blow up
> memory.
>
> The root of this problem is that export can sort in batches since the
> values it’s sorting on are contained in each document, so it can iterate in
> batches, send them out, then iterate again on the remaining documents.
>
> Score, since it’s dynamic, can’t do that. Solr has to score _all_ the docs
> to know where a doc lands in the final set relative to any other doc, so if
> it were going to work it’d have to have enough memory to hold the scores of
> all the docs in an ordered list, which is very expensive. Conceptually this
> is an ordered list up to maxDoc long. Not only does there have to be enough
> memory to hold the entire list, every doc has to be inserted individually
> which can kill performance. This is the “deep paging” problem.
>
> In the usual case of returning, say, 20 docs, the sorted list only has to
> be 20 long, higher scoring docs evict lower scoring docs.
>
> So I think CursorMark is your best bet.
>
> Best,
> Erick
>
> > On Oct 1, 2019, at 3:59 AM, Edward Turner  wrote:
> >
> > Hi all,
> >
> > As far as I understand, SolrCloud currently does not allow the use of
> > sorting by the pseudofield, score in the /export request handler (i.e.,
> get
> > the results in relevancy order). If we do attempt this, we get an
> > exception, "org.apache.solr.search.SyntaxError: Scoring is not currently
> > supported with xsort". We could use Solr's cursorMark, but this takes a
> > very long time ...
> >
> > Exporting results does work, however, when exporting result sets by a
> > specific document field that has docValues set to true.
> >
> > Question:
> > Does anyone know if/when it will be possible to sort by score in the
> > /export handler?
> >
> > Research on the problem:
> > We've seen https://issues.apache.org/jira/browse/SOLR-5244 and
> > https://issues.apache.org/jira/browse/SOLR-8664, which are related to
> this
> > issue, but don't fix it. Maybe I've missed a more relevant issue?
> >
> > Our use-case: We are using Solrcloud in our team and it's added a huge
> > amount of value to our users.
> >
> > We show a table of search results ordered by score (relevancy) that was
> > obtained from sending a query to the standard /select handler. We're
> > working in the life-sciences domain and it is common for our result sets
> to
> > contain many millions of results (unfortunately). After users browse
> their
> > results, they then may want to download the results that they see, to do
> > some post-processing. However, to do this, such that the results appear
> in
> > the order that the user originally saw them, we'd need to be able to
> export
> > results based on score/relevancy.
> >
> > Any suggestions or advice on this would be greatly appreciated!
> >
> > Many thanks!
> >
> > Edd
> >
> > PS. apologies for posting also on Stackoverflow (
> >
> https://stackoverflow.com/questions/58167152/solrcloud-export-all-results-sorted-by-score
> )
> > --
> > I only discovered the Solr mailing-list afterwards and thought it
> probably
> > better to reach out directly to Solr's people (I can share any answer
> from
> > this forum on there retrospectively).
>
>


Solrcloud export all results sorted by score

2019-10-01 Thread Edward Turner
Hi all,

As far as I understand, SolrCloud currently does not allow the use of
sorting by the pseudofield, score in the /export request handler (i.e., get
the results in relevancy order). If we do attempt this, we get an
exception, "org.apache.solr.search.SyntaxError: Scoring is not currently
supported with xsort". We could use Solr's cursorMark, but this takes a
very long time ...

Exporting results does work, however, when exporting result sets by a
specific document field that has docValues set to true.

Question:
Does anyone know if/when it will be possible to sort by score in the
/export handler?

Research on the problem:
We've seen https://issues.apache.org/jira/browse/SOLR-5244 and
https://issues.apache.org/jira/browse/SOLR-8664, which are related to this
issue, but don't fix it. Maybe I've missed a more relevant issue?

Our use-case: We are using Solrcloud in our team and it's added a huge
amount of value to our users.

We show a table of search results ordered by score (relevancy) that was
obtained from sending a query to the standard /select handler. We're
working in the life-sciences domain and it is common for our result sets to
contain many millions of results (unfortunately). After users browse their
results, they then may want to download the results that they see, to do
some post-processing. However, to do this, such that the results appear in
the order that the user originally saw them, we'd need to be able to export
results based on score/relevancy.

Any suggestions or advice on this would be greatly appreciated!

Many thanks!

Edd

PS. apologies for posting also on Stackoverflow (
https://stackoverflow.com/questions/58167152/solrcloud-export-all-results-sorted-by-score)
--
I only discovered the Solr mailing-list afterwards and thought it probably
better to reach out directly to Solr's people (I can share any answer from
this forum on there retrospectively).