Re: Solr Slack Workspace

2021-01-15 Thread matthew sporleder
IRC has kind of died off,
https://lucene.apache.org/solr/community.html has a slack mentioned,
I'm on https://opensourceconnections.com/slack after taking their solr
training class and assume it's mostly open to solr community.

On Fri, Jan 15, 2021 at 8:10 PM Justin Sweeney
 wrote:
>
> Hi all,
>
> I did some googling and didn't find anything, but is there a Slack
> workspace for Solr? I think this could be useful to expand interaction
> within the community of Solr users and connect people solving similar
> problems.
>
> I'd be happy to get this set up if it does not exist already.
>
> Justin


Solr Slack Workspace

2021-01-15 Thread Justin Sweeney
Hi all,

I did some googling and didn't find anything, but is there a Slack
workspace for Solr? I think this could be useful to expand interaction
within the community of Solr users and connect people solving similar
problems.

I'd be happy to get this set up if it does not exist already.

Justin


Re: [Solr8.7] Performance of group.ngroups ?

2021-01-15 Thread Joel Bernstein
You can try collapse as well.
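
A minimal sketch of what a collapse request could look like, assuming the grouping field is `fid` (the name mentioned in the quoted message below) and a placeholder query. The collapse query parser is applied as a filter query, so `numFound` of the collapsed result can stand in for the group count; in SolrCloud that count is only reliable when documents sharing a field value live on the same shard. The helper name `collapse_params` is hypothetical.

```python
from urllib.parse import urlencode


def collapse_params(query, group_field):
    """Build select parameters that collapse results on group_field."""
    return {
        "q": query,
        # The collapse query parser keeps one document per distinct field value.
        "fq": "{!collapse field=%s}" % group_field,
        "rows": 10,
    }


params = collapse_params("*:*", "fid")
query_string = urlencode(params)  # ready to append to .../select?
```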



Joel Bernstein
http://joelsolr.blogspot.com/


On Fri, Jan 15, 2021 at 4:51 AM Bruno Mannina  wrote:

> Hello,
>
>
>
> I found a temporary solution to my problem.
>
>
>
> I run the request without ngroups=true, and the result comes back quickly.
>
> Immediately afterwards, I run a simple request with my query and this parameter:
>
> ….={x:"unique(fid)"}
>
> where the field "fid" is my group field name.
>
>
>
> That takes the total from 88 sec down to 3~4 sec for both requests.
>
>
>
> Regards,
>
> Bruno
>
>
>
> From: Matheo Software [mailto:i...@matheo-software.com]
> Sent: Thursday, 14 January 2021 14:48
> To: solr-user@lucene.apache.org
> Subject: [Solr8.7] Performance of group.ngroups ?
>
>
>
> Hi All,
>
>
>
> I have more than 130 million documents, with an index size of more than
> 400GB on Solr8.7.
>
>
>
> A simple query takes around 1400 ms, which is fine, but when I use
> ngroups=true, I get an answer in 88 sec.
>
> I know this is because Solr calculates the number of groups on a specific
> field, but is there a way to improve that? Or an alternative solution?
>
>
>
> Many thanks,
>
>
>
> Best regards
>
> Bruno Mannina
>
>   www.matheo-software.com
>
>   www.patent-pulse.com
>
> Tél. +33 0 970 738 743
>
> Mob. +33 0 634 421 817
>
>
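
The two-request workaround in the quoted message can be sketched as two parameter sets. The parameter name is truncated in the archive; this sketch assumes it was `json.facet` from Solr's JSON Facet API, which is a guess. The `rows=0` setting on the counting request and both function names are likewise assumptions made for illustration.

```python
import json


def grouped_request(query, group_field):
    """First request: grouping without the expensive group.ngroups option."""
    return {"q": query, "group": "true", "group.field": group_field}


def group_count_request(query, group_field):
    """Second request: count distinct values of the group field via a facet."""
    facet = {"x": "unique(%s)" % group_field}
    # rows=0 is an assumption: only the facet value is needed here.
    return {"q": query, "rows": 0, "json.facet": json.dumps(facet)}


first = grouped_request("*:*", "fid")
second = group_count_request("*:*", "fid")
```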


Re: Solrcloud - Reads on specific nodes

2021-01-15 Thread Shawn Heisey

On 1/15/2021 7:56 AM, Doss wrote:

1. Suppose we have 10 node SOLR Cloud setup, is it possible to dedicate 4
nodes for writes and 6 nodes for selects?

2. We have a SOLR cloud setup for our customer facing applications, and we
would like to have two more SOLR nodes for some backend jobs. Is it good
idea to form these nodes as slave nodes and making one node in the cloud as
Master?


SolrCloud does not have masters or slaves.

One thing you could do is set the replica types on four of those nodes 
to one type, and on the other nodes, use a different replica type.  For 
instance, the four nodes could be TLOG and the six nodes could be PULL.


Then you can use the shards.preference parameter on your queries to only 
query the type of replica that you want.


https://lucene.apache.org/solr/guide/8_7/distributed-requests.html#shards-preference-parameter
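
As a rough sketch, a query that prefers PULL replicas could be built like this. The host, the collection name `mycollection` and the helper name are placeholders, not anything from the original setup.

```python
from urllib.parse import urlencode


def select_url(base_url, collection, query, replica_type):
    """Build a select URL that prefers replicas of the given type."""
    params = {
        "q": query,
        "shards.preference": "replica.type:%s" % replica_type,
    }
    return "%s/%s/select?%s" % (base_url, collection, urlencode(params))


url = select_url("http://localhost:8983/solr", "mycollection", "*:*", "PULL")
```

Note that `shards.preference` expresses a preference: if no replica of the preferred type is available, the request can still be served by another type.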

Thanks,
Shawn


Re: Handling acronyms

2021-01-15 Thread Michael Gibney
EDIT: "the equivalent terms are separated by commas (as they should be)" =>
"the equivalent terms are _not_ separated by commas (as they should be)"

On Fri, Jan 15, 2021 at 10:09 AM Michael Gibney 
wrote:

> Shaun,
>
> I'm not 100% sure, but don't give up on this just yet:
>
> > For example if I enter diabetes it finds the acronym DM for diabetes
> mellitus
>
> I think the behavior you're observing may simply be a side-effect of a
> misconfiguration of synonyms.txt. In the example you posted, the equivalent
> terms are separated by commas (as they should be), which would lead to
> treating line `DM diabetes mellitus` as effectively "DM == diabetes ==
> mellitus", which as you point out is clearly not what you want. Do you see
> similar results for `DM, diabetes mellitus` (which should be parsed as
> meaning "DM == 'diabetes mellitus'", which iiuc _is_ what you want)?
>
> (see the note about ensuring proper comma-separation in my earlier
> response)
>
> Michael
>
>
> On Fri, Jan 15, 2021 at 9:52 AM Shaun Campbell 
> wrote:
>
>> Hi Michael
>>
>> Thanks for that I'll have a study later.  It's just reminded me of the
>> expand option which I meant to have a look at.
>>
>> Thanks
>> Shaun
>>
>> On Fri, 15 Jan 2021 at 14:33, Michael Gibney 
>> wrote:
>>
>> > The equivalent terms on the right-hand side of the `=>` operator in the
>> > example you sent should be separated by a comma. You mention you already
>> > tried only-comma-separated (e.g. one line: `SRN,Stroke Research
>> Network`)
>> > and that that yielded unexpected results as well. I would recommend
>> > pre-case-normalizing all the terms in synonyms.txt (i.e., lower-case),
>> and
>> > applying the synonym filter _after_ case normalization in the analysis
>> > chain (there are other ways you could do, but the key point being that
>> you
>> > need to pay attention to case and how it interacts with the order in
>> which
>> > filters are applied).
>> >
>> > Re: Charlie's recommendation to apply these at index-time, a word of
>> > caution (and it's possible that this is in fact the underlying cause of
>> > some of the unexpected behavior you're observing?): be careful if you're
>> > using term _expansion_ at index-time (i.e., mapping single terms to
>> > multiple terms, which I note appears to be what you're trying to do in
>> the
>> > example lines you provided). Multi-term index-time synonyms can lead to
>> > unexpected results for positional queries (either explicit phrase
>> queries,
>> > or implicit, e.g. as configured by `pf` param in edismax). I'm aware of
>> at
>> > least two good overviews of this topic, one by Mike McCandless focusing
>> on
>> > Elasticsearch [1], one by Steve Rowe focusing on Solr [2]. The
>> underlying
>> > issue is related LUCENE-4312 [3], so both posts (ES- & Solr-related) are
>> > relevant.
>> >
>> > One way to work around this is to "collapse" (rather than expand)
>> synonyms,
>> > at both index and query time. Another option would be to apply synonym
>> > expansion only at query-time. It's also worth noting that increasing
>> phrase
>> > slop (`ps` param, etc.) can cause the issues with index-time synonym
>> > expansion to "fly under the radar" a little, wrt the most blatant "false
>> > negative" manifestations of index-time synonym issues for phrase
>> queries.
>> >
>> > [1]
>> >
>> >
>> https://www.elastic.co/blog/multitoken-synonyms-and-graph-queries-in-elasticsearch
>> > [2]
>> >
>> >
>> https://lucidworks.com/post/multi-word-synonyms-solr-adds-query-time-support/
>> > [3] https://issues.apache.org/jira/browse/LUCENE-4312
>> >
>> > On Fri, Jan 15, 2021 at 6:18 AM Charlie Hull <
>> > ch...@opensourceconnections.com> wrote:
>> >
>> > > I'm wondering if you should be using these acronyms at index time, not
>> > > search time. It will make your index bigger and you'll have to
>> re-index
>> > > to add new synonyms (as they may apply to old documents) but this
>> could
>> > > be an occasional task, and in the meantime you could use query-time
>> > > synonyms for the new ones.
>> > >
>> > > Maintaining 9000 synonyms in Solr's synonyms.txt file seems unwieldy
>> to
>> > me.
>> > >
>> > > Cheers
>> > >
>> > > Charlie
>> > >
>> > > On 15/01/2021 09:48, Shaun Campbell wrote:
>> > > > I have a medical journals search application and I've a list of some
>> > > 9,000
>> > > > acronyms like this:
>> > > >
>> > > > MSNQ=>MSNQ Multiple Sclerosis Neuropsychological Screening
>> > Questionnaire
>> > > > SRN=>SRN Stroke Research Network
>> > > > IGBP=>IGBP isolated gastric bypass
>> > > > TOMADO=>TOMADO Trial of Oral Mandibular Advancement Devices for
>> > > Obstructive
>> > > > sleep apnoea–hypopnoea
>> > > > SRM=>SRM standardised response mean
>> > > > SRT=>SRT substrate reduction therapy
>> > > > SRS=>SRS Sexual Rating Scale
>> > > > SRU=>SRU stroke rehabilitation unit
>> > > > T2w=>T2w T2-weighted
>> > > > Ab-P=>Ab-P Aberdeen participation restriction subscale
>> > > > MSOA=>MSOA middle-layer super output area
>> > > > SSA=>SSA 

Re: Solrcloud - Reads on specific nodes

2021-01-15 Thread Michael Gibney
I know you're asking about nodes, not replicas; but depending on what
you're trying to achieve you might be as well off routing requests based on
replica. Have you considered the various options available via the
`shards.preference` param [1]? For instance, you could set up your "write"
replicas as `NRT`, and your "read" replicas as `PULL`, then use the
`replica.type` property of the `shards.preference` param to route "select"
requests to the `PULL` replicas.

It might also be worth looking at the options for stable routing provided
by the relatively new `replica.base` property (of `shards.preference`
param). If you have varying workloads with distinct cache usage patterns,
for instance, this could be useful to you.

To tie this back to nodes (your original question, if a replica-focused
solution is not sufficient): you could still use replica types and the
`shards.preference` param to control request routing, and implicitly route
by node by paying extra attention to careful replica placement on
particular nodes. As it happens, I'm actually doing a very simple variant
of this -- but not in a general-purpose enough way to feel I'm in a
position to make any specific recommendations.

[1]
https://lucene.apache.org/solr/guide/8_7/distributed-requests.html#shards-preference-parameter
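
A minimal sketch of the node-oriented variant described above: place replicas of a given type on chosen nodes with the Collections API `ADDREPLICA` action, then route reads with `shards.preference`. The node name, collection name, shard name and both helper names are hypothetical.

```python
from urllib.parse import urlencode


def add_replica_url(base_url, collection, shard, node, replica_type):
    """Build a Collections API call that adds a typed replica on one node."""
    params = {
        "action": "ADDREPLICA",
        "collection": collection,
        "shard": shard,
        "node": node,          # e.g. "host2:8983_solr" (placeholder)
        "type": replica_type,  # nrt, tlog or pull
    }
    return base_url + "/admin/collections?" + urlencode(params)


def read_preference(stable=False):
    """Compose a shards.preference value for 'select' traffic."""
    parts = ["replica.type:PULL"]
    if stable:
        # replica.base:stable asks for a stable ordering among equivalent replicas.
        parts.append("replica.base:stable")
    return ",".join(parts)


url = add_replica_url("http://localhost:8983/solr", "mycollection",
                      "shard1", "host2:8983_solr", "pull")
preference = read_preference(stable=True)
```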

On Fri, Jan 15, 2021 at 9:56 AM Doss  wrote:

> Dear All,
>
> 1. Suppose we have 10 node SOLR Cloud setup, is it possible to dedicate 4
> nodes for writes and 6 nodes for selects?
>
> 2. We have a SOLR cloud setup for our customer facing applications, and we
> would like to have two more SOLR nodes for some backend jobs. Is it good
> idea to form these nodes as slave nodes and making one node in the cloud as
> Master?
>
> Thanks!
> Mohandoss.
>


Re: Replication SolrCloud

2021-01-15 Thread Shawn Heisey

On 1/15/2021 7:20 AM, Jae Joo wrote:

Is non CDCR replication in SolrCloud still working in Solr 9.0?


Solr 9 doesn't exist yet.  Probably won't for at least a few months. 
The latest version is 8.7.0.


Solr's replication feature is used by SolrCloud internally for recovery 
operations, but the user doesn't configure it at all.  SolrCloud uses 
its own mechanisms to replicate indexes.  I doubt that those mechanisms 
will disappear when version 9.0 comes out.


Thanks,
Shawn


Re: Handling acronyms

2021-01-15 Thread Michael Gibney
Shaun,

I'm not 100% sure, but don't give up on this just yet:

> For example if I enter diabetes it finds the acronym DM for diabetes
mellitus

I think the behavior you're observing may simply be a side-effect of a
misconfiguration of synonyms.txt. In the example you posted, the equivalent
terms are separated by commas (as they should be), which would lead to
treating line `DM diabetes mellitus` as effectively "DM == diabetes ==
mellitus", which as you point out is clearly not what you want. Do you see
similar results for `DM, diabetes mellitus` (which should be parsed as
meaning "DM == 'diabetes mellitus'", which iiuc _is_ what you want)?

(see the note about ensuring proper comma-separation in my earlier response)

Michael


On Fri, Jan 15, 2021 at 9:52 AM Shaun Campbell 
wrote:

> Hi Michael
>
> Thanks for that I'll have a study later.  It's just reminded me of the
> expand option which I meant to have a look at.
>
> Thanks
> Shaun
>
> On Fri, 15 Jan 2021 at 14:33, Michael Gibney 
> wrote:
>
> > The equivalent terms on the right-hand side of the `=>` operator in the
> > example you sent should be separated by a comma. You mention you already
> > tried only-comma-separated (e.g. one line: `SRN,Stroke Research Network`)
> > and that that yielded unexpected results as well. I would recommend
> > pre-case-normalizing all the terms in synonyms.txt (i.e., lower-case),
> and
> > applying the synonym filter _after_ case normalization in the analysis
> > chain (there are other ways you could do, but the key point being that
> you
> > need to pay attention to case and how it interacts with the order in
> which
> > filters are applied).
> >
> > Re: Charlie's recommendation to apply these at index-time, a word of
> > caution (and it's possible that this is in fact the underlying cause of
> > some of the unexpected behavior you're observing?): be careful if you're
> > using term _expansion_ at index-time (i.e., mapping single terms to
> > multiple terms, which I note appears to be what you're trying to do in
> the
> > example lines you provided). Multi-term index-time synonyms can lead to
> > unexpected results for positional queries (either explicit phrase
> queries,
> > or implicit, e.g. as configured by `pf` param in edismax). I'm aware of
> at
> > least two good overviews of this topic, one by Mike McCandless focusing
> on
> > Elasticsearch [1], one by Steve Rowe focusing on Solr [2]. The underlying
> > issue is related LUCENE-4312 [3], so both posts (ES- & Solr-related) are
> > relevant.
> >
> > One way to work around this is to "collapse" (rather than expand)
> synonyms,
> > at both index and query time. Another option would be to apply synonym
> > expansion only at query-time. It's also worth noting that increasing
> phrase
> > slop (`ps` param, etc.) can cause the issues with index-time synonym
> > expansion to "fly under the radar" a little, wrt the most blatant "false
> > negative" manifestations of index-time synonym issues for phrase queries.
> >
> > [1]
> >
> >
> https://www.elastic.co/blog/multitoken-synonyms-and-graph-queries-in-elasticsearch
> > [2]
> >
> >
> https://lucidworks.com/post/multi-word-synonyms-solr-adds-query-time-support/
> > [3] https://issues.apache.org/jira/browse/LUCENE-4312
> >
> > On Fri, Jan 15, 2021 at 6:18 AM Charlie Hull <
> > ch...@opensourceconnections.com> wrote:
> >
> > > I'm wondering if you should be using these acronyms at index time, not
> > > search time. It will make your index bigger and you'll have to re-index
> > > to add new synonyms (as they may apply to old documents) but this could
> > > be an occasional task, and in the meantime you could use query-time
> > > synonyms for the new ones.
> > >
> > > Maintaining 9000 synonyms in Solr's synonyms.txt file seems unwieldy to
> > me.
> > >
> > > Cheers
> > >
> > > Charlie
> > >
> > > On 15/01/2021 09:48, Shaun Campbell wrote:
> > > > I have a medical journals search application and I've a list of some
> > > 9,000
> > > > acronyms like this:
> > > >
> > > > MSNQ=>MSNQ Multiple Sclerosis Neuropsychological Screening
> > Questionnaire
> > > > SRN=>SRN Stroke Research Network
> > > > IGBP=>IGBP isolated gastric bypass
> > > > TOMADO=>TOMADO Trial of Oral Mandibular Advancement Devices for
> > > Obstructive
> > > > sleep apnoea–hypopnoea
> > > > SRM=>SRM standardised response mean
> > > > SRT=>SRT substrate reduction therapy
> > > > SRS=>SRS Sexual Rating Scale
> > > > SRU=>SRU stroke rehabilitation unit
> > > > T2w=>T2w T2-weighted
> > > > Ab-P=>Ab-P Aberdeen participation restriction subscale
> > > > MSOA=>MSOA middle-layer super output area
> > > > SSA=>SSA site-specific assessment
> > > > SSC=>SSC Study Steering Committee
> > > > SSB=>SSB short-stretch bandage
> > > > SSE=>SSE sum squared error
> > > > SSD=>SSD social services department
> > > > NVPI=>NVPI Nausea and Vomiting of Pregnancy Instrument
> > > >
> > > > I tried to put them in a synonyms file, either just with a comma
> > between,
> > > > or with an arrow 

Solrcloud - Reads on specific nodes

2021-01-15 Thread Doss
Dear All,

1. Suppose we have 10 node SOLR Cloud setup, is it possible to dedicate 4
nodes for writes and 6 nodes for selects?

2. We have a SOLR cloud setup for our customer facing applications, and we
would like to have two more SOLR nodes for some backend jobs. Is it good
idea to form these nodes as slave nodes and making one node in the cloud as
Master?

Thanks!
Mohandoss.


Re: Handling acronyms

2021-01-15 Thread Shaun Campbell
Hi Michael

Thanks for that I'll have a study later.  It's just reminded me of the
expand option which I meant to have a look at.

Thanks
Shaun

On Fri, 15 Jan 2021 at 14:33, Michael Gibney 
wrote:

> The equivalent terms on the right-hand side of the `=>` operator in the
> example you sent should be separated by a comma. You mention you already
> tried only-comma-separated (e.g. one line: `SRN,Stroke Research Network`)
> and that that yielded unexpected results as well. I would recommend
> pre-case-normalizing all the terms in synonyms.txt (i.e., lower-case), and
> applying the synonym filter _after_ case normalization in the analysis
> chain (there are other ways you could do, but the key point being that you
> need to pay attention to case and how it interacts with the order in which
> filters are applied).
>
> Re: Charlie's recommendation to apply these at index-time, a word of
> caution (and it's possible that this is in fact the underlying cause of
> some of the unexpected behavior you're observing?): be careful if you're
> using term _expansion_ at index-time (i.e., mapping single terms to
> multiple terms, which I note appears to be what you're trying to do in the
> example lines you provided). Multi-term index-time synonyms can lead to
> unexpected results for positional queries (either explicit phrase queries,
> or implicit, e.g. as configured by `pf` param in edismax). I'm aware of at
> least two good overviews of this topic, one by Mike McCandless focusing on
> Elasticsearch [1], one by Steve Rowe focusing on Solr [2]. The underlying
> issue is related LUCENE-4312 [3], so both posts (ES- & Solr-related) are
> relevant.
>
> One way to work around this is to "collapse" (rather than expand) synonyms,
> at both index and query time. Another option would be to apply synonym
> expansion only at query-time. It's also worth noting that increasing phrase
> slop (`ps` param, etc.) can cause the issues with index-time synonym
> expansion to "fly under the radar" a little, wrt the most blatant "false
> negative" manifestations of index-time synonym issues for phrase queries.
>
> [1]
>
> https://www.elastic.co/blog/multitoken-synonyms-and-graph-queries-in-elasticsearch
> [2]
>
> https://lucidworks.com/post/multi-word-synonyms-solr-adds-query-time-support/
> [3] https://issues.apache.org/jira/browse/LUCENE-4312
>
> On Fri, Jan 15, 2021 at 6:18 AM Charlie Hull <
> ch...@opensourceconnections.com> wrote:
>
> > I'm wondering if you should be using these acronyms at index time, not
> > search time. It will make your index bigger and you'll have to re-index
> > to add new synonyms (as they may apply to old documents) but this could
> > be an occasional task, and in the meantime you could use query-time
> > synonyms for the new ones.
> >
> > Maintaining 9000 synonyms in Solr's synonyms.txt file seems unwieldy to
> me.
> >
> > Cheers
> >
> > Charlie
> >
> > On 15/01/2021 09:48, Shaun Campbell wrote:
> > > I have a medical journals search application and I've a list of some
> > 9,000
> > > acronyms like this:
> > >
> > > MSNQ=>MSNQ Multiple Sclerosis Neuropsychological Screening
> Questionnaire
> > > SRN=>SRN Stroke Research Network
> > > IGBP=>IGBP isolated gastric bypass
> > > TOMADO=>TOMADO Trial of Oral Mandibular Advancement Devices for
> > Obstructive
> > > sleep apnoea–hypopnoea
> > > SRM=>SRM standardised response mean
> > > SRT=>SRT substrate reduction therapy
> > > SRS=>SRS Sexual Rating Scale
> > > SRU=>SRU stroke rehabilitation unit
> > > T2w=>T2w T2-weighted
> > > Ab-P=>Ab-P Aberdeen participation restriction subscale
> > > MSOA=>MSOA middle-layer super output area
> > > SSA=>SSA site-specific assessment
> > > SSC=>SSC Study Steering Committee
> > > SSB=>SSB short-stretch bandage
> > > SSE=>SSE sum squared error
> > > SSD=>SSD social services department
> > > NVPI=>NVPI Nausea and Vomiting of Pregnancy Instrument
> > >
> > > I tried to put them in a synonyms file, either just with a comma
> between,
> > > or with an arrow in between and the acronym repeated on the right like
> > > above, and no matter what I try I'm getting really strange search
> > results.
> > > It's like words in one acronym are matching with the same word in
> another
> > > acronym and then searching with that acronym which is completely
> > unrelated.
> > >
> > > I don't think Solr can handle this, but does anyone know of any crafty
> > > tricks in Solr to handle this situation where I can either search by
> the
> > > acronym or by the text?
> > >
> > > Shaun
> > >
> >
> > --
> > Charlie Hull - Managing Consultant at OpenSource Connections Limited
> > 
> > Founding member of The Search Network 
> > and co-author of Searching the Enterprise
> > 
> > tel/fax: +44 (0)8700 118334
> > mobile: +44 (0)7767 825828
> >
>


Re: Handling acronyms

2021-01-15 Thread Shaun Campbell
Hi Charlie

I was indexing at index time only. The synonyms/acronyms were coming from
the published journals xml files so I wasn't expecting to maintain them
myself.  If it worked, I was expecting, hopefully, to update the synonyms
file automatically.

As I just explained to Bernd I'm finding that because I'm just using
supplied acronyms from the documents there's some overlap on the words used
and it's giving me unexpected results.  For example if I enter diabetes it
finds the acronym DM for diabetes mellitus, which then coincides with an
author's initials and puts them at the top of the list, which is completely
wrong, or is it?  Perhaps I was looking for an author DM. Just too much
noise to be useful I think.

Thanks for your input anyway.
Shaun



On Fri, 15 Jan 2021 at 11:18, Charlie Hull 
wrote:

> I'm wondering if you should be using these acronyms at index time, not
> search time. It will make your index bigger and you'll have to re-index
> to add new synonyms (as they may apply to old documents) but this could
> be an occasional task, and in the meantime you could use query-time
> synonyms for the new ones.
>
> Maintaining 9000 synonyms in Solr's synonyms.txt file seems unwieldy to me.
>
> Cheers
>
> Charlie
>
> On 15/01/2021 09:48, Shaun Campbell wrote:
> > I have a medical journals search application and I've a list of some
> 9,000
> > acronyms like this:
> >
> > MSNQ=>MSNQ Multiple Sclerosis Neuropsychological Screening Questionnaire
> > SRN=>SRN Stroke Research Network
> > IGBP=>IGBP isolated gastric bypass
> > TOMADO=>TOMADO Trial of Oral Mandibular Advancement Devices for
> Obstructive
> > sleep apnoea–hypopnoea
> > SRM=>SRM standardised response mean
> > SRT=>SRT substrate reduction therapy
> > SRS=>SRS Sexual Rating Scale
> > SRU=>SRU stroke rehabilitation unit
> > T2w=>T2w T2-weighted
> > Ab-P=>Ab-P Aberdeen participation restriction subscale
> > MSOA=>MSOA middle-layer super output area
> > SSA=>SSA site-specific assessment
> > SSC=>SSC Study Steering Committee
> > SSB=>SSB short-stretch bandage
> > SSE=>SSE sum squared error
> > SSD=>SSD social services department
> > NVPI=>NVPI Nausea and Vomiting of Pregnancy Instrument
> >
> > I tried to put them in a synonyms file, either just with a comma between,
> > or with an arrow in between and the acronym repeated on the right like
> > above, and no matter what I try I'm getting really strange search
> results.
> > It's like words in one acronym are matching with the same word in another
> > acronym and then searching with that acronym which is completely
> unrelated.
> >
> > I don't think Solr can handle this, but does anyone know of any crafty
> > tricks in Solr to handle this situation where I can either search by the
> > acronym or by the text?
> >
> > Shaun
> >
>
> --
> Charlie Hull - Managing Consultant at OpenSource Connections Limited
> 
> Founding member of The Search Network 
> and co-author of Searching the Enterprise
> 
> tel/fax: +44 (0)8700 118334
> mobile: +44 (0)7767 825828
>


Re: Handling acronyms

2021-01-15 Thread Shaun Campbell
Hi Bernd

Thanks for that. I think it is working, but I think unfortunately what I'm
trying to do is impossible/not logical.  When I enter a term it goes off
and searches using all the matching acronyms, because I'm finding a term
used in more than one synonym, e.g. diabetes.

I think at the end of the day this produces too much "noise" to make any
sense of the results.   Think I will have to park this for now.

Thanks
Shaun

On Fri, 15 Jan 2021 at 10:35, Bernd Fehling 
wrote:

> If you are using multi-word synonyms or acronyms,
> you should escape the spaces within the multi-word terms.
>
> As synonyms.txt:
> SRN, Stroke\ Research\ Network
> IGBP, isolated\ gastric\ bypass
> ...
>
> Regards
> Bernd
>
>
> Am 15.01.21 um 10:48 schrieb Shaun Campbell:
> > I have a medical journals search application and I've a list of some
> 9,000
> > acronyms like this:
> >
> > MSNQ=>MSNQ Multiple Sclerosis Neuropsychological Screening Questionnaire
> > SRN=>SRN Stroke Research Network
> > IGBP=>IGBP isolated gastric bypass
> > TOMADO=>TOMADO Trial of Oral Mandibular Advancement Devices for
> Obstructive
> > sleep apnoea–hypopnoea
> > SRM=>SRM standardised response mean
> > SRT=>SRT substrate reduction therapy
> > SRS=>SRS Sexual Rating Scale
> > SRU=>SRU stroke rehabilitation unit
> > T2w=>T2w T2-weighted
> > Ab-P=>Ab-P Aberdeen participation restriction subscale
> > MSOA=>MSOA middle-layer super output area
> > SSA=>SSA site-specific assessment
> > SSC=>SSC Study Steering Committee
> > SSB=>SSB short-stretch bandage
> > SSE=>SSE sum squared error
> > SSD=>SSD social services department
> > NVPI=>NVPI Nausea and Vomiting of Pregnancy Instrument
> >
> > I tried to put them in a synonyms file, either just with a comma between,
> > or with an arrow in between and the acronym repeated on the right like
> > above, and no matter what I try I'm getting really strange search
> results.
> > It's like words in one acronym are matching with the same word in another
> > acronym and then searching with that acronym which is completely
> unrelated.
> >
> > I don't think Solr can handle this, but does anyone know of any crafty
> > tricks in Solr to handle this situation where I can either search by the
> > acronym or by the text?
> >
> > Shaun
> >
>


Re: Handling acronyms

2021-01-15 Thread Michael Gibney
The equivalent terms on the right-hand side of the `=>` operator in the
example you sent should be separated by a comma. You mention you already
tried only-comma-separated (e.g. one line: `SRN,Stroke Research Network`)
and that that yielded unexpected results as well. I would recommend
pre-case-normalizing all the terms in synonyms.txt (i.e., lower-case), and
applying the synonym filter _after_ case normalization in the analysis
chain (there are other ways you could do this, but the key point is that you
need to pay attention to case and how it interacts with the order in which
filters are applied).

Re: Charlie's recommendation to apply these at index-time, a word of
caution (and it's possible that this is in fact the underlying cause of
some of the unexpected behavior you're observing?): be careful if you're
using term _expansion_ at index-time (i.e., mapping single terms to
multiple terms, which I note appears to be what you're trying to do in the
example lines you provided). Multi-term index-time synonyms can lead to
unexpected results for positional queries (either explicit phrase queries,
or implicit, e.g. as configured by `pf` param in edismax). I'm aware of at
least two good overviews of this topic, one by Mike McCandless focusing on
Elasticsearch [1], one by Steve Rowe focusing on Solr [2]. The underlying
issue is related to LUCENE-4312 [3], so both posts (ES- & Solr-related) are
relevant.

One way to work around this is to "collapse" (rather than expand) synonyms,
at both index and query time. Another option would be to apply synonym
expansion only at query-time. It's also worth noting that increasing phrase
slop (`ps` param, etc.) can cause the issues with index-time synonym
expansion to "fly under the radar" a little, wrt the most blatant "false
negative" manifestations of index-time synonym issues for phrase queries.

[1]
https://www.elastic.co/blog/multitoken-synonyms-and-graph-queries-in-elasticsearch
[2]
https://lucidworks.com/post/multi-word-synonyms-solr-adds-query-time-support/
[3] https://issues.apache.org/jira/browse/LUCENE-4312
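
A minimal sketch of the pre-normalization suggested above, assuming input lines shaped like the ones in the original question (`ACRONYM=>ACRONYM expansion`). It rewrites each one as a lower-cased, comma-separated equivalence line. The function name and the output format are assumptions for illustration; whether lower-casing is right depends on where the lowercase filter sits in your analysis chain.

```python
def normalize_acronym_line(line):
    """Turn 'SRN=>SRN Stroke Research Network' into 'srn, stroke research network'."""
    acronym, arrow, rhs = line.partition("=>")
    if not arrow:
        raise ValueError("expected a line of the form ACRONYM=>ACRONYM expansion")
    acronym = acronym.strip()
    rhs = rhs.strip()
    # The source lines repeat the acronym before the expansion; drop the repeat.
    if rhs.startswith(acronym + " "):
        expansion = rhs[len(acronym):].strip()
    else:
        expansion = rhs
    # A comma separates the two equivalent terms; lower-case to match a
    # synonym filter that runs after case normalization.
    return "%s, %s" % (acronym.lower(), expansion.lower())


converted = [
    normalize_acronym_line("SRN=>SRN Stroke Research Network"),
    normalize_acronym_line("T2w=>T2w T2-weighted"),
]
```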

On Fri, Jan 15, 2021 at 6:18 AM Charlie Hull <
ch...@opensourceconnections.com> wrote:

> I'm wondering if you should be using these acronyms at index time, not
> search time. It will make your index bigger and you'll have to re-index
> to add new synonyms (as they may apply to old documents) but this could
> be an occasional task, and in the meantime you could use query-time
> synonyms for the new ones.
>
> Maintaining 9000 synonyms in Solr's synonyms.txt file seems unwieldy to me.
>
> Cheers
>
> Charlie
>
> On 15/01/2021 09:48, Shaun Campbell wrote:
> > I have a medical journals search application and I've a list of some
> 9,000
> > acronyms like this:
> >
> > MSNQ=>MSNQ Multiple Sclerosis Neuropsychological Screening Questionnaire
> > SRN=>SRN Stroke Research Network
> > IGBP=>IGBP isolated gastric bypass
> > TOMADO=>TOMADO Trial of Oral Mandibular Advancement Devices for
> Obstructive
> > sleep apnoea–hypopnoea
> > SRM=>SRM standardised response mean
> > SRT=>SRT substrate reduction therapy
> > SRS=>SRS Sexual Rating Scale
> > SRU=>SRU stroke rehabilitation unit
> > T2w=>T2w T2-weighted
> > Ab-P=>Ab-P Aberdeen participation restriction subscale
> > MSOA=>MSOA middle-layer super output area
> > SSA=>SSA site-specific assessment
> > SSC=>SSC Study Steering Committee
> > SSB=>SSB short-stretch bandage
> > SSE=>SSE sum squared error
> > SSD=>SSD social services department
> > NVPI=>NVPI Nausea and Vomiting of Pregnancy Instrument
> >
> > I tried to put them in a synonyms file, either just with a comma between,
> > or with an arrow in between and the acronym repeated on the right like
> > above, and no matter what I try I'm getting really strange search
> results.
> > It's like words in one acronym are matching with the same word in another
> > acronym and then searching with that acronym which is completely
> unrelated.
> >
> > I don't think Solr can handle this, but does anyone know of any crafty
> > tricks in Solr to handle this situation where I can either search by the
> > acronym or by the text?
> >
> > Shaun
> >
>
> --
> Charlie Hull - Managing Consultant at OpenSource Connections Limited
> 
> Founding member of The Search Network 
> and co-author of Searching the Enterprise
> 
> tel/fax: +44 (0)8700 118334
> mobile: +44 (0)7767 825828
>


Replication SolrCloud

2021-01-15 Thread Jae Joo
Is non CDCR replication in SolrCloud still working in Solr 9.0?

Jae


Unicode Normalization and ICUNormalizer2Filter

2021-01-15 Thread Bernd Fehling

Hello list,

could it be that the Apache Solr Reference Guide is wrong in all versions?

Example:
https://lucene.apache.org/solr/guide/8_7/filter-descriptions.html#icu-normalizer-2-filter

NFC: (name="nfc" mode="compose") Normalization Form C, canonical decomposition
NFD: (name="nfc" mode="decompose") Normalization Form D, canonical 
decomposition, followed by canonical composition
...

versus:
https://unicode.org/reports/tr15/#Norm_Forms

Normalization Form C (NFC) - Canonical Decomposition, followed by Canonical 
Composition
Normalization Form D (NFD) - Canonical Decomposition
...


I assume that unicode.org is correct?
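
The unicode.org definitions can be checked quickly outside Solr. A minimal sketch, assuming Python's standard unicodedata module (nothing here is Solr-specific): NFD should leave "é" as two code points, while NFC should recompose it into one.

```python
import unicodedata

composed = "\u00e9"      # "é" as one precomposed code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# NFC: canonical decomposition, followed by canonical composition
nfc = unicodedata.normalize("NFC", decomposed)
# NFD: canonical decomposition only
nfd = unicodedata.normalize("NFD", composed)

print([hex(ord(c)) for c in nfc])  # ['0xe9']
print([hex(ord(c)) for c in nfd])  # ['0x65', '0x301']
```

This matches the unicode.org wording, which suggests the two descriptions in the Reference Guide are swapped.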

Can someone please check this and if needed update the Reference Guides?

Regards
Bernd


SolrCloud 8.7.0 with Zookeeper 3.4.5

2021-01-15 Thread Subhajit Das

Hi There,

I am planning to implement SolrCloud 8.7.0 with an existing ZooKeeper 3.4.5.
This is the Cloudera-provided ZooKeeper.

Are there any red flags for such a configuration? I couldn't find a
compatibility matrix.

Many thanks in advance.

Regards,
Subhajit


RE: Query over migrating a solr database from 7.7.1 to 8.7.0

2021-01-15 Thread Flowerday, Matthew J
Hi Jim

Thanks for looking into it for me.

I did some more testing. I created a base Solr 7.7.1 database using the
'out of the box' schema.xml and solrconfig, and added this item manually
using the Solr Admin tool (Documents/XML):

ABCD-N1
A test

I then updated it using:

ABCD-N1
A test updated

It correctly updates and deletes the old copy. 
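
The mail archive has stripped the markup from the update bodies in this message, so only the field values survive. A sketch of how such a body could be built, assuming Solr's standard XML update format and the field names id and title_t that appear in the query responses in this message; the helper name build_update_xml is made up for illustration:

```python
import xml.etree.ElementTree as ET

def build_update_xml(doc_id, title):
    """Build an <add><doc>...</doc></add> body for Solr's XML update handler."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # Field names are assumptions taken from the JSON responses in this thread.
    ET.SubElement(doc, "field", name="id").text = doc_id
    ET.SubElement(doc, "field", name="title_t").text = title
    return ET.tostring(add, encoding="unicode")

print(build_update_xml("ABCD-N1", "A test updated"))
```

Sending the same id a second time is what should normally replace the earlier document.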

 

I then 'migrated' it to Solr 8.7.0 and updated the record in the same manner
(using Documents/XML) with this:

ABCD-N1
A test updated again

It created a new record without deleting the old record:

{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "q":"*:*",
      "_":"1610703647168"}},
  "response":{"numFound":2,"start":0,"numFoundExact":true,"docs":[
      {
        "id":"ABCD-N1",
        "title_t":"A test updated",
        "_version_":1688944583266795520},
      {
        "id":"ABCD-N1",
        "title_t":"A test updated again",
        "_version_":1688950299184594944}]
  }}

It is almost as if the delete of the record from the segment set up 7.7.1 is
not recognised.
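
A small sketch for spotting this symptom in a larger index: it counts how often each id occurs in a select response. The JSON is the response shown above; the function name duplicate_ids is an assumption for illustration, and uniqueKey is assumed to be id.

```python
import json
from collections import Counter

RESPONSE = """
{"responseHeader":{"status":0,"QTime":1,"params":{"q":"*:*","_":"1610703647168"}},
 "response":{"numFound":2,"start":0,"numFoundExact":true,"docs":[
   {"id":"ABCD-N1","title_t":"A test updated","_version_":1688944583266795520},
   {"id":"ABCD-N1","title_t":"A test updated again","_version_":1688950299184594944}]}}
"""

def duplicate_ids(response_text):
    """Return {id: count} for every id that occurs more than once."""
    docs = json.loads(response_text)["response"]["docs"]
    counts = Counter(doc["id"] for doc in docs)
    return {doc_id: n for doc_id, n in counts.items() if n > 1}

duplicates = duplicate_ids(RESPONSE)
print(duplicates)  # {'ABCD-N1': 2}
```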

 

When I updated the record again using

ABCD-N1
A test updated again and again

It updated the newly created record and deleted the old version.

{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "q":"*:*",
      "_":"1610703647168"}},
  "response":{"numFound":2,"start":0,"numFoundExact":true,"docs":[
      {
        "id":"ABCD-N1",
        "title_t":"A test updated",
        "_version_":1688944583266795520},
      {
        "id":"ABCD-N1",
        "title_t":"A test updated again and again",
        "_version_":1688950897568120832}]
  }}

I did further testing by turning on Lucene TRACE logging on my database. The
first update generated:

2021-01-15 09:38:30.138 INFO  (qtp1458091526-18) [   x:uleaf]
o.a.s.u.LoggingInfoStream [BD][qtp1458091526-18]: now apply del packet
(org.apache.solr.update.SolrIndexWriter@15e9adf2) to 10 segments, mergeGen 0

2021-01-15 09:38:30.138 INFO  (qtp1458091526-18) [   x:uleaf]
o.a.s.u.LoggingInfoStream [BD][qtp1458091526-18]: applyTermDeletes took 0.44
msec for 10 segments and 1 del terms; 0 new deletions

Whilst the second update generated:

2021-01-15 09:44:21.543 INFO  (qtp1458091526-17) [   x:uleaf]
o.a.s.u.LoggingInfoStream [BD][qtp1458091526-17]: now apply del packet
(org.apache.solr.update.SolrIndexWriter@15e9adf2) to 11 segments, mergeGen 0

2021-01-15 09:44:21.544 INFO  (qtp1458091526-17) [   x:uleaf]
o.a.s.u.LoggingInfoStream [BD][qtp1458091526-17]: applyTermDeletes took 0.29
msec for 11 segments and 1 del terms; 1 new deletions

I think that it does not seem to find the document to delete in the old
segment.
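
To compare the two updates at a glance, the applyTermDeletes lines can be reduced to their counters. A minimal sketch, assuming the log format is exactly as quoted above (the regular expression and the name parse_term_deletes are illustrative, not part of Solr):

```python
import re

# The two applyTermDeletes lines from the log excerpts, rejoined onto one line each.
LOG_LINES = [
    "o.a.s.u.LoggingInfoStream [BD][qtp1458091526-18]: applyTermDeletes took 0.44 "
    "msec for 10 segments and 1 del terms; 0 new deletions",
    "o.a.s.u.LoggingInfoStream [BD][qtp1458091526-17]: applyTermDeletes took 0.29 "
    "msec for 11 segments and 1 del terms; 1 new deletions",
]

PATTERN = re.compile(
    r"applyTermDeletes took [\d.]+ msec for (\d+) segments "
    r"and (\d+) del terms; (\d+) new deletions"
)

def parse_term_deletes(lines):
    """Return (segments, delete_terms, new_deletions) for each matching line."""
    results = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            results.append(tuple(int(g) for g in match.groups()))
    return results

parsed = parse_term_deletes(LOG_LINES)
print(parsed)  # [(10, 1, 0), (11, 1, 1)]
```

The first update applies one delete term across 10 segments and deletes nothing; the second finds exactly one document, in what is by then an 11-segment index.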

 

Could this be a bug in Solr?

Many thanks

Matthew

Matthew Flowerday | Consultant | ULEAF
Unisys | 01908 774830 | matthew.flower...@unisys.com
Address Enigma | Wavendon Business Park | Wavendon | Milton Keynes | MK17 8LX

THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY
MATERIAL and is for use only by the intended recipient. If you received this
in error, please contact the sender and delete the e-mail and its
attachments from all devices.

From: Dyer, Jim  
Sent: 13 January 2021 18:21
To: solr-user@lucene.apache.org
Subject: RE: Query over migrating a solr database from 7.7.1 to 8.7.0

 

I think if you have _root_ in schema.xml you should look elsewhere. As I
recall, merely adding this one line to schema.xml took care of our
problem.
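
A quick way to confirm whether a schema declares the field at all. This is only a sketch: the sample schema below is illustrative, its attributes are assumptions rather than the line discussed in this thread, and has_field is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Illustrative schema fragment; attribute values are assumptions.
SAMPLE_SCHEMA = """<schema name="example" version="1.6">
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="_root_" type="string" indexed="true" stored="false"/>
  <field name="_version_" type="plong" indexed="false" stored="false"/>
  <uniqueKey>id</uniqueKey>
</schema>"""

def has_field(schema_xml, name):
    """Return True if the schema declares a <field> with the given name."""
    root = ET.fromstring(schema_xml)
    return any(field.get("name") == name for field in root.iter("field"))

print(has_field(SAMPLE_SCHEMA, "_root_"))
```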

 

From: Flowerday, Matthew J <matthew.flower...@gb.unisys.com>
Sent: Tuesday, January 12, 2021 3:23 AM
To: solr-user@lucene.apache.org
Subject: RE: Query over migrating a solr database from 7.7.1 to 8.7.0

Hi Jim

Thanks for getting back to me.

I checked the schema.xml that we are using and it has the line you
mentioned:

This is the only reference (apart from within a comment) to _root_ in
the schema.xml. Does your schema.xml have further references to _root_ that
I might need? I also checked the solrconfig.xml file for any references to
_root_ and there are none.

Many Thanks

Matthew

 


Re: Handling acronyms

2021-01-15 Thread Charlie Hull
I'm wondering if you should be using these acronyms at index time, not 
search time. It will make your index bigger and you'll have to re-index 
to add new synonyms (as they may apply to old documents) but this could 
be an occasional task, and in the meantime you could use query-time 
synonyms for the new ones.


Maintaining 9000 synonyms in Solr's synonyms.txt file seems unwieldy to me.

Cheers

Charlie


--
Charlie Hull - Managing Consultant at OpenSource Connections Limited 

Founding member of The Search Network  
and co-author of Searching the Enterprise 


tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828


Fieldname alias for Highlighter results

2021-01-15 Thread Michael Aleythe, Sternwald
Hi everybody,

I'm looking for a way to replace solr index field names in the highlighting 
response.

For the query part there is the param fl=substitute:REAL_FIELD_NAME, which
returns the field REAL_FIELD_NAME under the name "substitute". Sadly the
substitution is not applied to the highlighter response. I tried using the same
syntax on the hl.fl param but with no success. Does anybody have an idea how to
achieve this?
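
For illustration, this is the renaming I am after, done on the client side as a fallback. A minimal sketch, assuming the usual shape of the highlighting section (document id, then field name, then a list of snippets); alias_highlighting is a made-up helper, not a Solr feature.

```python
def alias_highlighting(highlighting, aliases):
    """Rename field keys in a Solr highlighting section using an alias map."""
    return {
        doc_id: {aliases.get(field, field): snippets
                 for field, snippets in fields.items()}
        for doc_id, fields in highlighting.items()
    }

# Hypothetical response fragment and alias map.
highlighting = {"doc1": {"REAL_FIELD_NAME": ["a <em>match</em> here"]}}
aliases = {"REAL_FIELD_NAME": "substitute"}

renamed = alias_highlighting(highlighting, aliases)
print(renamed)  # {'doc1': {'substitute': ['a <em>match</em> here']}}
```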

Best regards
Michael Aleythe



Re: Handling acronyms

2021-01-15 Thread Bernd Fehling

If you are using multiword synonyms, acronyms, ...
you should escape the spaces within the multiword terms.

In synonyms.txt:
SRN, Stroke\ Research\ Network
IGBP, isolated\ gastric\ bypass
...
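
Converting a long list into this form can be scripted. A minimal sketch in Python, assuming every input line has the shape ACRONYM=>ACRONYM expansion as in the original post; the helper name to_escaped_synonym is made up for illustration.

```python
def to_escaped_synonym(line):
    """Turn 'SRN=>SRN Stroke Research Network' into 'SRN, Stroke\\ Research\\ Network'."""
    acronym, _, rhs = line.partition("=>")
    acronym = acronym.strip()
    expansion = rhs.strip()
    # Drop the acronym repeated at the start of the right-hand side.
    if expansion.startswith(acronym + " "):
        expansion = expansion[len(acronym) + 1:]
    # Escape every space so the expansion is kept as one multiword term.
    return acronym + ", " + expansion.replace(" ", "\\ ")

for entry in ["SRN=>SRN Stroke Research Network", "IGBP=>IGBP isolated gastric bypass"]:
    print(to_escaped_synonym(entry))
```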

Regards
Bernd


RE: [Solr8.7] Performance of group.ngroups ?

2021-01-15 Thread Bruno Mannina
Hello,

I found a temporary solution to my problem.

I do a request without ngroups=true, which returns quickly.
Right after that, I do a simple request with my query and this param:

….={x:"unique(fid)"}

where the field "fid" is my group field name.

88 sec => 3~4 sec for both requests.
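
The two requests can be sketched as parameter sets. This is only a sketch under assumptions: the parameter name is elided in the message above, so json.facet (the JSON Facet API parameter) is a guess, and rows=0 and the helper name build_requests are also illustrative.

```python
import json
from urllib.parse import urlencode, parse_qs

def build_requests(query, group_field):
    """Return query strings for a grouped search and a separate group count."""
    # Request 1: grouping without group.ngroups, so no expensive group count.
    grouped = {"q": query, "group": "true", "group.field": group_field}
    # Request 2: count distinct values of the group field.
    # "json.facet" is an assumed parameter name; the original is elided.
    counting = {
        "q": query,
        "rows": 0,
        "json.facet": json.dumps({"x": "unique(%s)" % group_field}),
    }
    return urlencode(grouped), urlencode(counting)

grouped_qs, counting_qs = build_requests("*:*", "fid")
print(grouped_qs)
print(counting_qs)
```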



Regards,
Bruno

From: Matheo Software [mailto:i...@matheo-software.com]
Sent: Thursday, 14 January 2021 14:48
To: solr-user@lucene.apache.org
Subject: [Solr8.7] Performance of group.ngroups ?

Hi All,

I have more than 130 million documents, with an index size of more than
400GB on Solr8.7.

I do a simple query and it takes around 1400ms, which is OK, but when I use
ngroups=true, I get an answer in 88sec.

I know it's because Solr calculates the number of groups on a specific field,
but is there a way to improve that? An alternative approach?

Many thanks,

Best regards,
Bruno Mannina

www.matheo-software.com
www.patent-pulse.com
Tel. +33 0 970 738 743
Mob. +33 0 634 421 817



Handling acronyms

2021-01-15 Thread Shaun Campbell
I have a medical journals search application and I've a list of some 9,000
acronyms like this:

MSNQ=>MSNQ Multiple Sclerosis Neuropsychological Screening Questionnaire
SRN=>SRN Stroke Research Network
IGBP=>IGBP isolated gastric bypass
TOMADO=>TOMADO Trial of Oral Mandibular Advancement Devices for Obstructive
sleep apnoea–hypopnoea
SRM=>SRM standardised response mean
SRT=>SRT substrate reduction therapy
SRS=>SRS Sexual Rating Scale
SRU=>SRU stroke rehabilitation unit
T2w=>T2w T2-weighted
Ab-P=>Ab-P Aberdeen participation restriction subscale
MSOA=>MSOA middle-layer super output area
SSA=>SSA site-specific assessment
SSC=>SSC Study Steering Committee
SSB=>SSB short-stretch bandage
SSE=>SSE sum squared error
SSD=>SSD social services department
NVPI=>NVPI Nausea and Vomiting of Pregnancy Instrument

I tried to put them in a synonyms file, either just with a comma between,
or with an arrow in between and the acronym repeated on the right like
above, and no matter what I try I'm getting really strange search results.
It's like words in one acronym are matching with the same word in another
acronym and then searching with that acronym which is completely unrelated.
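
To illustrate what I mean, here is a small sketch that lists words shared between expansions. It uses three entries from the list above; whether this overlap is really what drives the odd matches is only my assumption, and shared_words is a made-up name.

```python
from collections import defaultdict

ACRONYMS = {
    "SRN": "Stroke Research Network",
    "SRU": "stroke rehabilitation unit",
    "SSA": "site-specific assessment",
}

def shared_words(acronyms):
    """Map each expansion word to the acronyms using it, keeping only shared words."""
    owners = defaultdict(set)
    for acronym, expansion in acronyms.items():
        for word in expansion.lower().split():
            owners[word].add(acronym)
    return {word: sorted(names) for word, names in owners.items() if len(names) > 1}

overlap = shared_words(ACRONYMS)
print(overlap)  # {'stroke': ['SRN', 'SRU']}
```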

I don't think Solr can handle this, but does anyone know of any crafty
tricks in Solr to handle this situation where I can either search by the
acronym or by the text?

Shaun


Re: Getting error "Bad Message 414 reason: URI Too Long"

2021-01-15 Thread Shawn Heisey

On 1/14/2021 2:31 AM, Abhay Kumar wrote:
I am trying to post the query below to Solr but I get the error “Bad
Message 414 reason: URI Too Long”.


I am sending query using SolrNet library. Please suggest how to resolve 
this issue.


*Query :* 
http://localhost:8983/solr/documents/select?q=%22Geisteswissenschaften


If your query is a POST request rather than a GET, then you won't get 
that error.  And if the request is identical to the REALLY long URL that 
you included (which seemed to be incomplete), then it's definitely not a 
POST.  If it were, everything after the ? would be in the request body, 
not on the URL itself.
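
To show the difference outside SolrNet, here is a minimal sketch in Python that builds (but does not send) a POST with the parameters in the request body. The query value is a placeholder, since the URL quoted above is incomplete, and build_post is a made-up helper name.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_post(select_url, params):
    """Build a POST request whose parameters travel in the body, not the URL."""
    body = urlencode(params).encode("utf-8")
    return Request(
        select_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )

# Placeholder query; substitute the real, long query string here.
req = build_post("http://localhost:8983/solr/documents/select", {"q": "*:*"})
print(req.get_method(), req.full_url)
print(req.data)
```

The URL stays short however long the query gets, which is why the 414 goes away.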


There is a note on the SolrNET FAQ about this.

https://github.com/SolrNet/SolrNet/blob/master/Documentation/FAQ.md#im-getting-a-uri-too-long-error

If you want more info on that, you'll need to ask SolrNET.  It's a 
completely different project.


Thanks,
Shawn