Re: CloudSolrClient getDocCollection

2019-02-08 Thread Hendrik Haddorp

Hi Jason,

thanks for your answer. Yes, you would need one watch per state.json and 
thus one watch per collection. That should however not really be a 
problem for ZK. I would assume that the Solr server instances need to 
monitor those nodes anyway to stay up to date on the cluster state. Using 
org.apache.solr.common.cloud.ZkStateReader.registerCollectionStateWatcher 
you can even add such a watch yourself via the SolrJ API. At least for the 
currently watched collections the client should thus actually already 
have the correct information available. The access to it would likely 
be a bit ugly though.
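
For reference, a minimal sketch of registering such a watch (SolrJ 7.x 
API; the ZK address and collection name below are placeholders):

import java.util.Collections;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.cloud.ZkStateReader;

public class StateWatchSketch {
  public static void main(String[] args) {
    CloudSolrClient client = new CloudSolrClient.Builder(
        Collections.singletonList("localhost:2181"), Optional.empty()).build();
    client.connect();
    ZkStateReader reader = client.getZkStateReader();
    // invoked whenever this collection's state.json changes;
    // return false to keep watching, true to remove the watcher
    reader.registerCollectionStateWatcher("myCollection",
        (liveNodes, docCollection) -> {
          System.out.println("new state: " + docCollection);
          return false;
        });
  }
}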


The CloudSolrClient also allows you to set a watch on /collections using 
org.apache.solr.common.cloud.ZkStateReader.registerCloudCollectionsListener. 
This is actually another thing I just ran into. As the code has a watch 
on /collections, the listener gets informed about a new collection as soon 
as the "directory" for the collection is created. If the listener 
then straight away tries to access the collection info via 
zkStateReader.getClusterState(), the DocCollection can be returned as 
null, because the DocCollection is built from the information stored in the 
state.json file, which might not exist yet. I'm trying to monitor the 
Solr cluster state and thus ran into this. Not sure if I should open a 
Jira for that.
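
A minimal sketch of the race, continuing from the sketch above (again 
assuming the SolrJ 7.x listener API; the null check is the point):

reader.registerCloudCollectionsListener((oldCollections, newCollections) -> {
  for (String name : newCollections) {
    // the /collections child can exist before state.json is written,
    // so the DocCollection may still be null here
    DocCollection coll = reader.getClusterState().getCollectionOrNull(name);
    if (coll == null) {
      // state.json not there yet: retry later or watch for it
    }
  }
});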


regards,
Hendrik

On 08.02.2019 23:20, Jason Gerlowski wrote:

Hi Hendrik,

I'll try to answer, and let others correct me if I stray.  I wasn't
around when CloudSolrClient was written, so take this with a grain of
salt:

"Why does the client need that timeout?Wouldn't it make sense to
use a watch?"

You could probably write a CloudSolrClient that uses watch(es) to keep
track of changing collection state.  But I suspect you'd need a
watch-per-collection, instead of just a single watch.

Modern versions of Solr store the state for each collection in
individual "state.json" ZK nodes
("/solr/collections//state.json").  To catch changes
to all of these collections, you'd need to watch each of those nodes.
Which wouldn't scale well for users who want lots of collections.  I
suspect this was one of the concerns that nudged the author(s) to use
a cache-based approach.

(Even when all collection state was stored in a single ZK node, a
watch-based CloudSolrClient would likely have scaling issues for the
many-collection use case.  The client would need to recalculate its
state information for _all_ collections any time that _any_ of the
collections changed, since it has no way to tell which collection was
changed.)

Best,

Jason

On Thu, Feb 7, 2019 at 11:44 AM Hendrik Haddorp  wrote:

Hi,

when I perform a query using the CloudSolrClient the code first
retrieves the DocCollection to determine to which instance the query
should be sent [1]. getDocCollection [2] does a lookup in a cache, which
has a 60s expiration time [3]. When a DocCollection has to be reloaded
this is guarded by a lock [4]. By default there are 3 locks, which can
cause some congestion. The main question though is why does the client
need that timeout? According to this [5] comment the code does not use a
watch. Wouldn't it make sense to use a watch? I thought the big
advantage of the CloudSolrClient is that it knows where to send requests
to, so that no extra hop is needed on the server side. Having to
query ZooKeeper for the current state, however, takes away some of
that advantage.

regards,
Hendrik

[1]
https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java#L849
[2]
https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java#L1180
[3]
https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java#L162
[4]
https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java#L1200
[5]
https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java#L821




Re: Java object binding not working

2019-02-08 Thread Jason Gerlowski
Hi Swapnil,

Ray did suggest a potential cause.  Your Java object has "name" as a
String, but Solr returns the "name" value as an ArrayList.
Usually Solr returns ArrayLists when the field in question is
multivalued, so it's a safe bet that Solr is treating your "name"
field as multivalued.

You can check this by opening Solr's admin UI, selecting your
collection from the collection dropdown menu, and clicking on the
Schema tab.  In the "Schema" window you can select your "name" field
from the dropdown and see if the table that appears shows it as
"multivalued".

If the field is multivalued, you've got a few options:
- you can start fresh with a new collection, and modify your schema so
that "name" is single-valued
- you can try to change the field-definition in place.  I'm not sure
whether Solr will allow this, but the API to try is here:
https://lucene.apache.org/solr/guide/7_6/schema-api.html#replace-a-field
- you can just change your Java object to represent "name" as a
List<String> instead of a String (see the sketch below).
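
A minimal sketch of that last option (class and field names taken from
your stack trace; other fields and getters omitted):

import java.util.List;
import org.apache.solr.client.solrj.beans.Field;

public class Employee {
  // a multivalued Solr field is returned as a list, so bind it as one
  @Field
  private List<String> name;
}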

If the field _isn't_ multivalued, then I'm not sure what's going on.

Best,

Jason

On Fri, Feb 8, 2019 at 1:40 PM Swapnil Katkar  wrote:
>
> Hi,
>
> It would help me a lot if you could provide at least some hint to
> resolve this problem. Thanks in advance!
>
> Regards,
> Swapnil Katkar
>
>
>
> -- Forwarded message -
> From: Swapnil Katkar 
> Date: Tue, Feb 5, 2019 at 10:58 PM
> Subject: Fwd: Java object binding not working
> To: 
>
>
> Hello,
>
> Could you please let me know how I can get the mentioned issue
> resolved?
>
> Regards,
> Swapnil Katkar
>
> -- Forwarded message -
> From: Swapnil Katkar 
> Date: Sun, Feb 3, 2019, 17:31
> Subject: Java object binding not working
> To: 
>
>
> Greetings!
>
> I am working on a requirement where I want to query the data and do
> object mapping on the retrieved results using Solrj. For this, I am
> referring to the official document at
> https://lucene.apache.org/solr/guide/7_6/using-solrj.html#java-object-binding.
> I set up the necessary class files and the collections.
>
> With the help of this document, I can create the documents in the Solr DB,
> but it is not working for fetching and mapping the fields to the Java POJO
> class. To do the mapping, I used @Field annotation.
>
> Details are as below:
> *1)* Solrj version: 7.6.0
> *2)* The line of code which is not working: *List<Employee> employees =
> response.getBeans(Employee.class);*
> *3)* Exception stack trace:
> *Caused by: java.lang.IllegalArgumentException: Can not set
> java.lang.String field demo.apache.solr.vo.Employee.name
>  to java.util.ArrayList*
> * at
> sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(Unknown
> Source)*
> * at
> sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(Unknown
> Source)*
> * at sun.reflect.UnsafeObjectFieldAccessorImpl.set(Unknown Source)*
> * at java.lang.reflect.Field.set(Unknown Source)*
> *4)* Collection was created using
> *solr.cmd create -c employees -s 2 -rf 2*
>
> Please find the attached source code files. Also, I attached the stack
> trace file. Can you please help me resolve this?
>
> Regards,
> Swapnil Katkar
>
>
> --
>
> Regards,
> Swapnil Katkar


Re: CloudSolrClient getDocCollection

2019-02-08 Thread Jason Gerlowski
Hi Hendrik,

I'll try to answer, and let others correct me if I stray.  I wasn't
around when CloudSolrClient was written, so take this with a grain of
salt:

"Why does the client need that timeout?Wouldn't it make sense to
use a watch?"

You could probably write a CloudSolrClient that uses watch(es) to keep
track of changing collection state.  But I suspect you'd need a
watch-per-collection, instead of just a single watch.

Modern versions of Solr store the state for each collection in
individual "state.json" ZK nodes
("/solr/collections//state.json").  To catch changes
to all of these collections, you'd need to watch each of those nodes.
Which wouldn't scale well for users who want lots of collections.  I
suspect this was one of the concerns that nudged the author(s) to use
a cache-based approach.

(Even when all collection state was stored in a single ZK node, a
watch-based CloudSolrClient would likely have scaling issues for the
many-collection use case.  The client would need to recalculate its
state information for _all_ collections any time that _any_ of the
collections changed, since it has no way to tell which collection was
changed.)

Best,

Jason

On Thu, Feb 7, 2019 at 11:44 AM Hendrik Haddorp  wrote:
>
> Hi,
>
> when I perform a query using the CloudSolrClient the code first
> retrieves the DocCollection to determine to which instance the query
> should be sent [1]. getDocCollection [2] does a lookup in a cache, which
> has a 60s expiration time [3]. When a DocCollection has to be reloaded
> this is guarded by a lock [4]. By default there are 3 locks, which can
> cause some congestion. The main question though is why does the client
> need that timeout? According to this [5] comment the code does not use a
> watch. Wouldn't it make sense to use a watch? I thought the big
> advantage of the CloudSolrClient is that it knows where to send requests
> to, so that no extra hop is needed on the server side. Having to
> query ZooKeeper for the current state, however, takes away some of
> that advantage.
>
> regards,
> Hendrik
>
> [1]
> https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java#L849
> [2]
> https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java#L1180
> [3]
> https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java#L162
> [4]
> https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java#L1200
> [5]
> https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java#L821


Re: Ignore accent in a request

2019-02-08 Thread elisabeth benoit
Yes, you do.

And use the char filter at both index and query time.
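
For example, a field type along these lines (just a sketch; the tokenizer
and other filters are placeholders, not necessarily what you need):

<fieldType name="text_noaccent" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>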

On Fri, Feb 8, 2019 at 19:20, SAUNIER Maxence wrote:

> For the charFilter, do I need to reindex all documents?
>
> -Original Message-
> From: Erick Erickson
> Sent: Friday, February 8, 2019 18:03
> To: solr-user
> Subject: Re: Ignore accent in a request
>
> Elisabeth's suggestion is spot on for the accent.
>
> One other thing I noticed. You are using KeywordTokenizerFactory combined
> with EdgeNGramFilterFactory. This implies that you can't search for
> individual _words_, only prefix queries, i.e.
> je
> je s
> je su
> je sui
> je suis
>
> You can't search for "suis" for instance.
>
> Basically this is an efficient way to search for anything starting with
> a three-or-more-letter prefix, at the expense of index size. You might be
> better off just using wildcards (restrict to three letters at the prefix
> though).
>
> This is perfectly valid, I'm mostly asking if it's your intent.
>
> Best,
> Erick
>
> On Fri, Feb 8, 2019 at 9:35 AM SAUNIER Maxence  wrote:
> >
> > Thank you!
> >
> > -Original Message-
> > From: elisabeth benoit
> > Sent: Friday, February 8, 2019 14:12
> > To: solr-user@lucene.apache.org
> > Subject: Re: Ignore accent in a request
> >
> > Hello,
> >
> > We use solr 7 and use
> >
> > <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
> >
> > with mapping-ISOLatin1Accent.txt
> >
> > containing lines like
> >
> > # À => A
> > "\u00C0" => "A"
> >
> > # Á => A
> > "\u00C1" => "A"
> >
> > # Â => A
> > "\u00C2" => "A"
> >
> > # Ã => A
> > "\u00C3" => "A"
> >
> > # Ä => A
> > "\u00C4" => "A"
> >
> > # Å => A
> > "\u00C5" => "A"
> >
> > # Ā Ă Ą =>
> > "\u0100" => "A"
> > "\u0102" => "A"
> > "\u0104" => "A"
> >
> > # Æ => AE
> > "\u00C6" => "AE"
> >
> > # Ç => C
> > "\u00C7" => "C"
> >
> > # é => e
> > "\u00E9" => "e"
> >
> > Best regards,
> > Elisabeth
> >
> On Fri, Feb 8, 2019 at 11:18, Gopesh Sharma wrote:
> >
> > We have fixed this type of issue by using synonyms, adding
> > SynonymFilterFactory (before Solr 7).
> > >
> > > -Original Message-
> > > From: SAUNIER Maxence 
> > > Sent: Friday, February 8, 2019 3:36 PM
> > > To: solr-user@lucene.apache.org
> > > Subject: RE: Ignore accent in a request
> > >
> > > Hello,
> > >
> > > Thanks for your answer.
> > >
> > > I have tested:
> > >
> > > select?defType=dismax&q=je suis avarié&qf=content
> > > 90.000 results
> > >
> > > select?defType=dismax&q=je suis avarie&qf=content
> > > 60.000 results
> > >
> > > With avarié, I don't find documents with avarie, and with avarie, I
> > > don't find documents with avarié.
> > >
> > > I want to find the 150.000 documents with avarié or avarie.
> > >
> > > Thanks
> > >
> > > -Original Message-
> > > From: Erick Erickson
> > > Sent: Thursday, February 7, 2019 19:37
> > > To: solr-user
> > > Subject: Re: Ignore accent in a request
> > > Ignore accent in a request
> > >
> > > exactly _how_ is it "not working"?
> > >
> > > Try building your parameters _up_ rather than starting with a lot, e.g.
> > > select?defType=dismax&q=je suis avarié&qf=title ^^ assumes you
> > > expect a match on title. Then:
> > > select?defType=dismax&q=je suis avarié&qf=title subject
> > >
> > > etc.
> > >
> > > Because mm=757 looks really wrong. From the docs:
> > > Defines the minimum number of clauses that must match, regardless of
> > > how many clauses there are in total.
> > >
> > > edismax is used much more than dismax as it's more flexible, but
> > > that's not germane here.
> > >
> > > finally, try adding &debug=query to the url to see exactly how the
> > > query is parsed.
> > >
> > > Best,
> > > Erick
> > >
> > > On Mon, Feb 4, 2019 at 9:09 AM SAUNIER Maxence 
> wrote:
> > > >
> > > > Hello,
> > > >
> > > How can I ignore accents in the query results?
> > > >
> > > Request:
> > > http://*:8983/solr/***/select?defType=dismax&q=je+suis+avarié&qf=title%5e20+subject%5e15+category%5e1+content%5e0.5&mm=757
> > > >
> > > I want to get docs with both avarié and avarie.
> > >
> > > I have added this to my schema:
> > > >
> > > >   {
> > > > "name": "string",
> > > > "positionIncrementGap": "100",
> > > > "analyzer": {
> > > >   "filters": [
> > > > {
> > > >   "class": "solr.LowerCaseFilterFactory"
> > > > },
> > > > {
> > > >   "class": "solr.ASCIIFoldingFilterFactory"
> > > > },
> > > > {
> > > >   "class": "solr.EdgeNGramFilterFactory",
> > > >   "minGramSize": "3",
> > > >   "maxGramSize": "50"
> > > > }
> > > >   ],
> > > >   "tokenizer": {
> > > > "class": "solr.KeywordTokenizerFactory"
> > > >   }
> > > > },
> > > > "stored": true,
> > > > "indexed": true,
> > > > "sortMissingLast": true,
> > > > "class": "solr.TextField"
> > > >   },
> > > >
> > > But it is not working.
> > > >
> > > > Thanks.
> > >
>


Fwd: Java object binding not working

2019-02-08 Thread Swapnil Katkar
Hi,

It would help me a lot if you could provide at least some hint to
resolve this problem. Thanks in advance!

Regards,
Swapnil Katkar



-- Forwarded message -
From: Swapnil Katkar 
Date: Tue, Feb 5, 2019 at 10:58 PM
Subject: Fwd: Java object binding not working
To: 


Hello,

Could you please let me know how I can get the mentioned issue
resolved?

Regards,
Swapnil Katkar

-- Forwarded message -
From: Swapnil Katkar 
Date: Sun, Feb 3, 2019, 17:31
Subject: Java object binding not working
To: 


Greetings!

I am working on a requirement where I want to query the data and do
object mapping on the retrieved results using Solrj. For this, I am
referring to the official document at
https://lucene.apache.org/solr/guide/7_6/using-solrj.html#java-object-binding.
I set up the necessary class files and the collections.

With the help of this document, I can create the documents in the Solr DB,
but it is not working for fetching and mapping the fields to the Java POJO
class. To do the mapping, I used @Field annotation.

Details are as below:
*1)* Solrj version: 7.6.0
*2)* The line of code which is not working: *List<Employee> employees =
response.getBeans(Employee.class);*
*3)* Exception stack trace:
*Caused by: java.lang.IllegalArgumentException: Can not set
java.lang.String field demo.apache.solr.vo.Employee.name
 to java.util.ArrayList*
* at
sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(Unknown
Source)*
* at
sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(Unknown
Source)*
* at sun.reflect.UnsafeObjectFieldAccessorImpl.set(Unknown Source)*
* at java.lang.reflect.Field.set(Unknown Source)*
*4)* Collection was created using
*solr.cmd create -c employees -s 2 -rf 2*

Please find the attached source code files. Also, I attached the stack
trace file. Can you please help me resolve this?

Regards,
Swapnil Katkar


-- 
Regards,
Swapnil Katkar


RE: Ignore accent in a request

2019-02-08 Thread SAUNIER Maxence
For the charFilter, do I need to reindex all documents?

-Original Message-
From: Erick Erickson
Sent: Friday, February 8, 2019 18:03
To: solr-user
Subject: Re: Ignore accent in a request

Elisabeth's suggestion is spot on for the accent.

One other thing I noticed. You are using KeywordTokenizerFactory combined with 
EdgeNGramFilterFactory. This implies that you can't search for individual 
_words_, only prefix queries, i.e.
je
je s
je su
je sui
je suis

You can't search for "suis" for instance.

Basically this is an efficient way to search for anything starting with 
three-or-more letter prefixes, at the expense of index size. You might be better 
off just using wildcards (restrict to three letters at the prefix though).

This is perfectly valid, I'm mostly asking if it's your intent.

Best,
Erick

On Fri, Feb 8, 2019 at 9:35 AM SAUNIER Maxence  wrote:
>
> Thank you!
>
> -Original Message-
> From: elisabeth benoit
> Sent: Friday, February 8, 2019 14:12
> To: solr-user@lucene.apache.org
> Subject: Re: Ignore accent in a request
>
> Hello,
>
> We use solr 7 and use
>
> <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>
> with mapping-ISOLatin1Accent.txt
>
> containing lines like
>
> # À => A
> "\u00C0" => "A"
>
> # Á => A
> "\u00C1" => "A"
>
> # Â => A
> "\u00C2" => "A"
>
> # Ã => A
> "\u00C3" => "A"
>
> # Ä => A
> "\u00C4" => "A"
>
> # Å => A
> "\u00C5" => "A"
>
> # Ā Ă Ą =>
> "\u0100" => "A"
> "\u0102" => "A"
> "\u0104" => "A"
>
> # Æ => AE
> "\u00C6" => "AE"
>
> # Ç => C
> "\u00C7" => "C"
>
> # é => e
> "\u00E9" => "e"
>
> Best regards,
> Elisabeth
>
> On Fri, Feb 8, 2019 at 11:18, Gopesh Sharma wrote:
>
> > We have fixed this type of issue by using synonyms, adding 
> > SynonymFilterFactory (before Solr 7).
> >
> > -Original Message-
> > From: SAUNIER Maxence 
> > Sent: Friday, February 8, 2019 3:36 PM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Ignore accent in a request
> >
> > Hello,
> >
> > Thanks for your answer.
> >
> > I have tested:
> >
> > select?defType=dismax&q=je suis avarié&qf=content
> > 90.000 results
> >
> > select?defType=dismax&q=je suis avarie&qf=content
> > 60.000 results
> >
> > With avarié, I don't find documents with avarie, and with avarie, I 
> > don't find documents with avarié.
> >
> > I want to find the 150.000 documents with avarié or avarie.
> >
> > Thanks
> >
> > -Original Message-
> > From: Erick Erickson
> > Sent: Thursday, February 7, 2019 19:37
> > To: solr-user
> > Subject: Re: Ignore accent in a request
> >
> > exactly _how_ is it "not working"?
> >
> > Try building your parameters _up_ rather than starting with a lot, e.g.
> > select?defType=dismax&q=je suis avarié&qf=title ^^ assumes you 
> > expect a match on title. Then:
> > select?defType=dismax&q=je suis avarié&qf=title subject
> >
> > etc.
> >
> > Because mm=757 looks really wrong. From the docs:
> > Defines the minimum number of clauses that must match, regardless of 
> > how many clauses there are in total.
> >
> > edismax is used much more than dismax as it's more flexible, but 
> > that's not germane here.
> >
> > finally, try adding &debug=query to the url to see exactly how the 
> > query is parsed.
> >
> > Best,
> > Erick
> >
> > On Mon, Feb 4, 2019 at 9:09 AM SAUNIER Maxence  wrote:
> > >
> > > Hello,
> > >
> > > How can I ignore accents in the query results?
> > >
> > > Request:
> > > http://*:8983/solr/***/select?defType=dismax&q=je+suis+avarié&qf=title%5e20+subject%5e15+category%5e1+content%5e0.5&mm=757
> > >
> > > I want to get docs with both avarié and avarie.
> > >
> > > I have added this to my schema:
> > >
> > >   {
> > > "name": "string",
> > > "positionIncrementGap": "100",
> > > "analyzer": {
> > >   "filters": [
> > > {
> > >   "class": "solr.LowerCaseFilterFactory"
> > > },
> > > {
> > >   "class": "solr.ASCIIFoldingFilterFactory"
> > > },
> > > {
> > >   "class": "solr.EdgeNGramFilterFactory",
> > >   "minGramSize": "3",
> > >   "maxGramSize": "50"
> > > }
> > >   ],
> > >   "tokenizer": {
> > > "class": "solr.KeywordTokenizerFactory"
> > >   }
> > > },
> > > "stored": true,
> > > "indexed": true,
> > > "sortMissingLast": true,
> > > "class": "solr.TextField"
> > >   },
> > >
> > > But it is not working.
> > >
> > > Thanks.
> >


Re: Ignore accent in a request

2019-02-08 Thread Erick Erickson
Elisabeth's suggestion is spot on for the accent.

One other thing I noticed. You are using
KeywordTokenizerFactory combined with
EdgeNGramFilterFactory. This implies that you
can't search for individual _words_, only
prefix queries, i.e.
je
je s
je su
je sui
je suis

You can't search for "suis" for instance.

Basically this is an efficient way to search
for anything starting with three-or-more letter prefixes
at the expense of index size. You might be better
off just using wildcards (restrict to three letters
at the prefix though).
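
For example, with a field named "content" (an assumption) a prefix
wildcard query would be:

q=content:sui*

which matches "suis" and anything else starting with the three-letter
prefix "sui".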

This is perfectly valid, I'm mostly asking if it's
your intent.

Best,
Erick

On Fri, Feb 8, 2019 at 9:35 AM SAUNIER Maxence  wrote:
>
> Thank you!
>
> -Original Message-
> From: elisabeth benoit
> Sent: Friday, February 8, 2019 14:12
> To: solr-user@lucene.apache.org
> Subject: Re: Ignore accent in a request
>
> Hello,
>
> We use solr 7 and use
>
> <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>
> with mapping-ISOLatin1Accent.txt
>
> containing lines like
>
> # À => A
> "\u00C0" => "A"
>
> # Á => A
> "\u00C1" => "A"
>
> # Â => A
> "\u00C2" => "A"
>
> # Ã => A
> "\u00C3" => "A"
>
> # Ä => A
> "\u00C4" => "A"
>
> # Å => A
> "\u00C5" => "A"
>
> # Ā Ă Ą =>
> "\u0100" => "A"
> "\u0102" => "A"
> "\u0104" => "A"
>
> # Æ => AE
> "\u00C6" => "AE"
>
> # Ç => C
> "\u00C7" => "C"
>
> # é => e
> "\u00E9" => "e"
>
> Best regards,
> Elisabeth
>
> On Fri, Feb 8, 2019 at 11:18, Gopesh Sharma wrote:
>
> > We have fixed this type of issue by using synonyms, adding
> > SynonymFilterFactory (before Solr 7).
> >
> > -Original Message-
> > From: SAUNIER Maxence 
> > Sent: Friday, February 8, 2019 3:36 PM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Ignore accent in a request
> >
> > Hello,
> >
> > Thanks for your answer.
> >
> > I have tested:
> >
> > select?defType=dismax&q=je suis avarié&qf=content
> > 90.000 results
> >
> > select?defType=dismax&q=je suis avarie&qf=content
> > 60.000 results
> >
> > With avarié, I don't find documents with avarie, and with avarie, I
> > don't find documents with avarié.
> >
> > I want to find the 150.000 documents with avarié or avarie.
> >
> > Thanks
> >
> > -Original Message-
> > From: Erick Erickson
> > Sent: Thursday, February 7, 2019 19:37
> > To: solr-user
> > Subject: Re: Ignore accent in a request
> >
> > exactly _how_ is it "not working"?
> >
> > Try building your parameters _up_ rather than starting with a lot, e.g.
> > select?defType=dismax&q=je suis avarié&qf=title ^^ assumes you expect
> > a match on title. Then:
> > select?defType=dismax&q=je suis avarié&qf=title subject
> >
> > etc.
> >
> > Because mm=757 looks really wrong. From the docs:
> > Defines the minimum number of clauses that must match, regardless of
> > how many clauses there are in total.
> >
> > edismax is used much more than dismax as it's more flexible, but
> > that's not germane here.
> >
> > finally, try adding &debug=query to the url to see exactly how the
> > query is parsed.
> >
> > Best,
> > Erick
> >
> > On Mon, Feb 4, 2019 at 9:09 AM SAUNIER Maxence  wrote:
> > >
> > > Hello,
> > >
> > > How can I ignore accents in the query results?
> > >
> > > Request:
> > > http://*:8983/solr/***/select?defType=dismax&q=je+suis+avarié&qf=title%5e20+subject%5e15+category%5e1+content%5e0.5&mm=757
> > >
> > > I want to get docs with both avarié and avarie.
> > >
> > > I have added this to my schema:
> > >
> > >   {
> > > "name": "string",
> > > "positionIncrementGap": "100",
> > > "analyzer": {
> > >   "filters": [
> > > {
> > >   "class": "solr.LowerCaseFilterFactory"
> > > },
> > > {
> > >   "class": "solr.ASCIIFoldingFilterFactory"
> > > },
> > > {
> > >   "class": "solr.EdgeNGramFilterFactory",
> > >   "minGramSize": "3",
> > >   "maxGramSize": "50"
> > > }
> > >   ],
> > >   "tokenizer": {
> > > "class": "solr.KeywordTokenizerFactory"
> > >   }
> > > },
> > > "stored": true,
> > > "indexed": true,
> > > "sortMissingLast": true,
> > > "class": "solr.TextField"
> > >   },
> > >
> > > But it is not working.
> > >
> > > Thanks.
> >


Re: Query of Death Lucene/Solr 7.6

2019-02-08 Thread Michael Gibney
Hi Markus,
As of 7.6, LUCENE-8531
reverted a graph/Spans-based phrase query implementation (introduced in 6.5
-- LUCENE-7699) to an
implementation that builds a separate phrase query for each possible
enumerated path through the graph described by a parsed query.
The potential for combinatoric explosion of the enumerated approach was (as
far as I can tell) one of the main motivations for introducing the
Spans-based implementation. Some real-world use cases would be good to
explore. Markus, could you send (as an attachment) the debug toString() for
the queries with/without synonyms enabled? I'm also guessing you may have
WordDelimiterGraphFilter on the query analyzer?
As an alternative to disabling pf, LUCENE-8531 only reverts to the
enumerated approach for phrase queries where slop>0, so setting ps=0 would
probably also help.
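
For example, something along these lines (field names are placeholders):

q=dubbele dijk&defType=edismax&qf=title content&pf=title content&ps=0

This keeps the pf boost, but with ps=0 the slop is 0, so the reverted
enumerated approach should not be triggered.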
Michael

On Fri, Feb 8, 2019 at 5:57 AM Markus Jelsma 
wrote:

> Hello (apologies for cross-posting),
>
> While working on SOLR-12743, using 7.6 on two nodes and 7.2.1 on the
> remaining four, we stumbled upon a situation where the 7.6 nodes quickly
> succumb when a 'Query-of-Death' is issued, 7.2.1 up to 7.5 are all
> unaffected (tested and confirmed).
>
> Following Smiley's suggestion I used Eclipse MAT to find the problem in
> the heap dump I obtained; this fantastic tool revealed within minutes that
> a query thread ate 65 % of all resources. In the class variables I could
> find the query, and reproduce the problem.
>
> The problematic query is 'dubbele dijk/rijke dijkproject in het dijktracé
> eemshaven-delfzijl', on 7.6 this input produces a 40+ MB toString() output
> in edismax' newFieldQuery. If the node survives it takes 2+ seconds for the
> query to run (150 ms otherwise). If i disable all query time
> SynonymGraphFilters it still takes a second and produces just a 9 MB
> toString() for the query.
>
> I could not find anything like this in Jira. I did think of LUCENE-8479
> and LUCENE-8531 but they were about graphs, this problem looked related
> though.
>
> I think I tracked it further down to LUCENE-8589 or SOLR-12243. When I
> leave Solr's edismax' pf parameter empty, everything runs fast. When all
> fields are configured for pf, the node dies.
>
> I am now unsure whether this is a Solr or a Lucene issue.
>
> Please let me know.
>
> Many thanks,
> Markus
>
> ps. in Solr I even got an 'Impossible Exception', my first!
>


Re: Indexing in one collection affect index in another collection

2019-02-08 Thread Zheng Lin Edwin Yeo
Hi Shawn,

Thanks for your reply.

Although the space in the OS disk cache could be the issue, we didn't
face this problem previously, especially in our other setup using Solr
6.5.1, which contains much more data (more than 1 TB), compared to our
current setup on Solr 7.6.0, in which the data size is only 20 GB.

Regards,
Edwin



On Wed, 6 Feb 2019 at 23:52, Shawn Heisey  wrote:

> On 2/6/2019 7:58 AM, Zheng Lin Edwin Yeo wrote:
> > Hi everyone,
> >
> > Does anyone has further updates on this issue?
>
> It is my strong belief that all the software running on this server
> OTHER than Solr is competing with Solr for space in the OS disk cache,
> and that Solr's data is getting pushed out of that cache.
>
> Best guess is that with only one collection, the disk cache was able to
> hold onto Solr's data better, and that with another collection present,
> there's not enough disk cache space available to cache both of them
> effectively.
>
> I think you're going to need a dedicated machine for Solr, so Solr isn't
> competing for system resources.
>
> Thanks,
> Shawn
>


Re: Get recent documents from solr

2019-02-08 Thread Jan Høydahl
Add a field to the schema which will record the actual indexing date:

<field name="indextime" type="pdate" indexed="true" stored="true" default="NOW"/>

Then query q=*:*&sort=indextime desc

But if you re-index everything (why?) then you need to map some date stamp from 
the source DB into the same field in the Solr schema, which you can then sort on. 
You're indexing from 4 DB views into the same collection in Solr, yes?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 8 Feb 2019 at 14:38, shruti suri wrote:
> 
> Hi,
> 
> I want to get the latest updated documents from Solr. I am indexing data from
> multiple views and each view has its own update date. Also I am running a
> full-indexing job every 4 hours, so I can't use the Solr timestamp (NOW). Is
> there any Solr functionality by which I can achieve this?
> 
> Thanks
> Shruti 
> 
> 
> 
> 
> -
> Regards
> Shruti
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



RE: Ignore accent in a request

2019-02-08 Thread SAUNIER Maxence
Thank you!

-Original Message-
From: elisabeth benoit
Sent: Friday, February 8, 2019 14:12
To: solr-user@lucene.apache.org
Subject: Re: Ignore accent in a request

Hello,

We use solr 7 and use

<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>

with mapping-ISOLatin1Accent.txt

containing lines like

# À => A
"\u00C0" => "A"

# Á => A
"\u00C1" => "A"

# Â => A
"\u00C2" => "A"

# Ã => A
"\u00C3" => "A"

# Ä => A
"\u00C4" => "A"

# Å => A
"\u00C5" => "A"

# Ā Ă Ą =>
"\u0100" => "A"
"\u0102" => "A"
"\u0104" => "A"

# Æ => AE
"\u00C6" => "AE"

# Ç => C
"\u00C7" => "C"

# é => e
"\u00E9" => "e"

Best regards,
Elisabeth

On Fri, Feb 8, 2019 at 11:18, Gopesh Sharma wrote:

> We have fixed this type of issue by using synonyms, adding 
> SynonymFilterFactory (before Solr 7).
>
> -Original Message-
> From: SAUNIER Maxence 
> Sent: Friday, February 8, 2019 3:36 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Ignore accent in a request
>
> Hello,
>
> Thanks for your answer.
>
> I have tested:
>
> select?defType=dismax&q=je suis avarié&qf=content
> 90.000 results
>
> select?defType=dismax&q=je suis avarie&qf=content
> 60.000 results
>
> With avarié, I don't find documents with avarie, and with avarie, I 
> don't find documents with avarié.
>
> I want to find the 150.000 documents with avarié or avarie.
>
> Thanks
>
> -Original Message-
> From: Erick Erickson
> Sent: Thursday, February 7, 2019 19:37
> To: solr-user
> Subject: Re: Ignore accent in a request
>
> exactly _how_ is it "not working"?
>
> Try building your parameters _up_ rather than starting with a lot, e.g.
> select?defType=dismax&q=je suis avarié&qf=title ^^ assumes you expect 
> a match on title. Then:
> select?defType=dismax&q=je suis avarié&qf=title subject
>
> etc.
>
> Because mm=757 looks really wrong. From the docs:
> Defines the minimum number of clauses that must match, regardless of 
> how many clauses there are in total.
>
> edismax is used much more than dismax as it's more flexible, but 
> that's not germane here.
>
> finally, try adding &debug=query to the url to see exactly how the 
> query is parsed.
>
> Best,
> Erick
>
> On Mon, Feb 4, 2019 at 9:09 AM SAUNIER Maxence  wrote:
> >
> > Hello,
> >
> > How can I ignore accents in the query results?
> >
> > Request:
> > http://*:8983/solr/***/select?defType=dismax&q=je+suis+avarié&qf=title%5e20+subject%5e15+category%5e1+content%5e0.5&mm=757
> >
> > I want to get docs with both avarié and avarie.
> >
> > I have added this to my schema:
> >
> >   {
> > "name": "string",
> > "positionIncrementGap": "100",
> > "analyzer": {
> >   "filters": [
> > {
> >   "class": "solr.LowerCaseFilterFactory"
> > },
> > {
> >   "class": "solr.ASCIIFoldingFilterFactory"
> > },
> > {
> >   "class": "solr.EdgeNGramFilterFactory",
> >   "minGramSize": "3",
> >   "maxGramSize": "50"
> > }
> >   ],
> >   "tokenizer": {
> > "class": "solr.KeywordTokenizerFactory"
> >   }
> > },
> > "stored": true,
> > "indexed": true,
> > "sortMissingLast": true,
> > "class": "solr.TextField"
> >   },
> >
> > But it is not working.
> >
> > Thanks.
>


RE: Solr Index Size after reindex

2019-02-08 Thread Mathieu Menard
Hi Andrea,

I've checked this information and here is the result:



           PRODUCTION   STAGING
numDocs    5.365.213    4.537.651
MaxDoc     5.845.469    5.129.556


It seems that there are more than 800.000 more docs in PRODUCTION, which would 
explain the larger index size. But there is one thing that I don't 
understand: since we have copied the DB and the contentstore, the numDocs for 
the two environments should be the same, no?

Could you also explain the meaning of the maxDocs value, please?

Thanks

Matthieu


From: Andrea Gazzarini [mailto:a.gazzar...@sease.io]
Sent: vendredi 8 février 2019 14:54
To: solr-user@lucene.apache.org
Subject: Re: Solr Index Size after reindex

Hi Mathieu,
what about the docs in the two infrastructures? Do they have the same numbers 
(numdocs / maxdocs)? Any meaningful message (error or not) in log files?

Andrea
On 08/02/2019 14:19, Mathieu Menard wrote:
Hello,

I would like to have your point of view about an observation we have made on 
our two Alfresco installs (Production and Staging environments), and more 
specifically on the size of our Solr indexes in these two environments.

Regularly we do an rsync between the Production and the Staging environments: we 
make a copy of Alfresco's DB and a copy of the entire contentstore, and after 
that we reindex all the Alfresco content.

We have noticed that for the Production environment we have 19 Gb of indexes, 
while in Staging we have "only" 11 Gb of indexes. We have some 
difficulties understanding this difference, because we assumed that index 
optimization is the same for a full reindex as for the normal use of Solr.

I've verified the configuration of the two Solr instances and I don't see 
any differences. Could you help me to better understand this phenomenon?

Here you can find some information about our two environments; if you need more 
details, I will give them to you as soon as possible:

                  PRODUCTION                                  STAGING
Alfresco version  5.1.1.4                                     5.1.1.4
Solr Version      (screenshot in attachment)                  (screenshot in attachment)
Java version      (screenshot in attachment)                  (screenshot in attachment)
Linux Machine     see Staging_caracteristics.txt attachment   see Staging_caracteristics.txt attachment


Please let me know if you need any other information; I will send it to you rapidly.

Kind Regards

Matthieu




RE: change in White Space when upgrading 6.6 to 7.4

2019-02-08 Thread Oakley, Craig (NIH/NLM/NCBI) [C]
> Can we take this thread back to the mailing list, please? It would be good to 
> allow other people to weigh in!

Sure

-Original Message-
From: Matt Pearce  
Sent: Friday, February 08, 2019 6:45 AM
To: Oakley, Craig (NIH/NLM/NCBI) [C] 
Subject: Re: change in White Space when upgrading 6.6 to 7.4


The first (sow=false) query parses to:
"+(+((text:pd text:2485621) | isolation_source:PDS24856.21) 
-erd_group:PDS24856.21)"
while the sow=true query parses to:
"+(+(text:\"pd 2485621\" | isolation_source:PDS24856.21) 
-erd_group:PDS24856.21)"

This suggests to me that the analyzer on the text field is using the 
WordDelimiterFilterFactory (or WordDelimiterGraphFilterFactory), and 
splitting the query text into separate tokens on number/word boundaries 
- so "ABC123" => "ABC" "123". It is also stripping the "S" from "PDS", 
and the decimal point from the numeric part, as you can see from the 
"text:2485621" part of both queries - this may not be the 
WordDelimiter filter, but I suspect it probably is.

It works when sow=true, because it's generating a phrase query. When 
sow=false, it doesn't generate a phrase query and you're getting matches 
on both "pd" and "2485621" - presumably "pd" appears in a lot of 
your documents.

A possible solution without using sow=true would be to modify the 
analyzer on your text field so it doesn't use 
WordDelimiterFilterFactory, and retains "PD24856.21" as a single 
token, or modify the behaviour of that filter so it doesn't split the 
tokens the same way. Of course, this may not be what you want, depending 
on the other data you have in the text field.

Can we take this thread back to the mailing list, please? It would be 
good to allow other people to weigh in!

Thanks,
Matt

On 07/02/2019 15:58, Oakley, Craig (NIH/NLM/NCBI) [C] wrote:
> Thanks. Here is what I have.
> 
> The first curl output is the problem results. The next two were changing the 
> query (adding quotation marks or adding "*:")
> 
> After the third curl output, I upload a new solrconfig.xml (in another 
> window) to include sow=true in the /select requestHandler defaults; 
> I then RELOAD the core and run the final curl command
> 
> The correct answer should have numFound 0 (and the only one which fails to 
> get the correct answer is the first: the original query with sow defaulting 
> to false in Solr7.4)
> 
> Let me know if you see any clarification in the debugQuery output
> 
> Thanks again
> 
> 
> 
> *[10:33 ~ 2209]$ curl -s 
> 'http://host:/solr/isolates/select?indent=on&q=PDS24856.21%20AND%20-erd_group:PDS24856.21&wt=json&rows=0&debugQuery=on'|tee
>  ~/solr/DBH14432debug190207a.out
> {
>"responseHeader":{
>  "zkConnected":true,
>  "status":0,
>  "QTime":1,
>  "params":{
>"q":"PDS24856.21 AND -erd_group:PDS24856.21",
>"indent":"on",
>"rows":"0",
>"wt":"json",
>"debugQuery":"on"}},
>"response":{"numFound":21322074,"start":0,"docs":[]
>},
>"debug":{
>  "rawquerystring":"PDS24856.21 AND -erd_group:PDS24856.21",
>  "querystring":"PDS24856.21 AND -erd_group:PDS24856.21",
>  "parsedquery":"+(+DisjunctionMaxQuery(((text:pd text:2485621) | 
> isolation_source:PDS24856.21)) -erd_group:PDS24856.21)",
>  "parsedquery_toString":"+(+((text:pd text:2485621) | 
> isolation_source:PDS24856.21) -erd_group:PDS24856.21)",
>  "explain":{},
>  "QParser":"ExtendedDismaxQParser",
>  "altquerystring":null,
>  "boost_queries":null,
>  "parsed_boost_queries":[],
>  "boostfuncs":null,
>  "timing":{
>"time":1.0,
>"prepare":{
>  "time":0.0,
>  "query":{
>"time":0.0},
>  "facet":{
>"time":0.0},
>  "facet_module":{
>"time":0.0},
>  "mlt":{
>"time":0.0},
>  "highlight":{
>"time":0.0},
>  "stats":{
>"time":0.0},
>  "expand":{
>"time":0.0},
>  "terms":{
>"time":0.0},
>  "debug":{
>"time":0.0}},
>"process":{
>  "time":0.0,
>  "query":{
>"time":0.0},
>  "facet":{
>"time":0.0},
>  "facet_module":{
>"time":0.0},
>  "mlt":{
>"time":0.0},
>  "highlight":{
>"time":0.0},
>  "stats":{
>"time":0.0},
>  "expand":{
>"time":0.0},
>  "terms":{
>"time":0.0},
>  "debug":{
>"time":0.0}
> *[10:33 ~ 2210]$ curl -s 
> 'http://host:/solr/isolates/select?indent=on&q="PDS24856.21"%20AND%20-erd_group:PDS24856.21&wt=json&rows=0&debugQuery=on'|tee
>  ~/solr/DBH14432debug190207b.out
> {
>"responseHeader":{
>  "zkConnected":true,
>  "status":0,
>  "QTime":1,
>  "params":{
>"q":"\"PDS24856.21\" AND -erd_group:PDS24856.21",
>"indent":"on",
>

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-02-08 Thread Zheng Lin Edwin Yeo
Hi Paul,

Regarding the regex (\n\s*){2,} that we are using: when we try it on
https://regex101.com/, it gives us the correct result for all the
examples (i.e. all of them end up with only <br><br>, and not more than
that, unlike what we are getting in Solr in our earlier examples).
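
The same check can be reproduced outside Solr with plain Java (a minimal
sketch; the sample input is abridged from Example 2 below):

import java.util.regex.Pattern;

public class RegexCheck {
  public static void main(String[] args) {
    String content = "exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa Chu Kang Avenue 4";
    Pattern p = Pattern.compile("(\\n\\s*){2,}");
    // each run of blank lines should collapse into exactly one <br><br>
    System.out.println(p.matcher(content).replaceAll("<br><br>"));
  }
}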

Could there be a possibility of a bug in Solr?

Regards,
Edwin

On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo 
wrote:

> Hi Paul,
>
> We have tried it with the space preceding the \n, i.e. <str
> name="pattern">(\s*\n){2,}</str>, with the following regex pattern:
>
> <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">(\s*\n){2,}</str>
>    <str name="replacement"><br><br></str>
> </processor>
>
> However, we are also getting the exact same results as in the earlier
> Examples 1, 2 and 3.
>
> As for your point 2, that the data might contain other (non-printing)
> characters than \n: we have found that there are no non-printing characters.
> It is just a new line with a space. You can refer to the original content in
> the same examples below.
>
>
> Example 1: The sentence where the above regex pattern is working correctly
> *Original content in EML file:*
> Dear Sir,
>
>
> I am terminating
> *Original content:* Dear Sir,  \n\n \n \n\n I am terminating
> *Index content: *Dear Sir, <br><br> I am terminating
>
> Example 2: A sentence where the above regex pattern is only partially working
> (as you can see, instead of 2 <br>, there are 4 <br>)
> *Original content in EML file:*
>
> *exalted*
>
> *Psalm 89:17*
>
>
> 3 Choa Chu Kang Avenue 4
> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
> Chu Kang Avenue 4, Singapore
> *Index content: *exalted <br><br><br><br> Psalm 89:17 <br><br><br><br> 3 Choa
> Chu Kang Avenue 4, Singapore
>
> Example 3: A sentence where the above regex pattern is only partially working
> (as you can see, instead of 2 <br>, there are 4 <br>)
> *Original content in EML file:*
>
> http://www.concordpri.moe.edu.sg/
>
>
>
>
>
>
>
>
> On Tue, Dec 18, 2018 at 10:07 AM
> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
> 2018 at 10:07 AM
> *Index content: *http://www.concordpri.moe.edu.sg/ <br><br><br><br> On
> Tue, Dec 18, 2018 at 10:07 AM
>
>
> Appreciate any other ideas or suggestions that you may have.
>
> Thank you.
>
> Regards,
> Edwin
>
> On Thu, 7 Feb 2019 at 22:49,  wrote:
>
>> Hi Edwin
>>
>>
>>
>>   1.  Sorry, the pattern was wrong, the space should precede the \n i.e.
>> (\s*\n){2,}
>>   2.  Perhaps in the data you have other (non printing) characters than
>> \n?
>>
>>
>>
>> Sent from Mail for Windows 10
>>
>>
>>
>> From: Zheng Lin Edwin Yeo
>> Sent: Thursday, February 7, 2019 15:23
>> To: solr-user@lucene.apache.org
>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>
>>
>>
>> Hi Paul,
>>
>> We have tried this suggested regex pattern as follows:
>> <processor class="solr.RegexReplaceProcessorFactory">
>>    <str name="fieldName">content</str>
>>    <str name="pattern">(\n\s*){2,}</str>
>>    <str name="replacement"><br><br></str>
>> </processor>
>>
>> But we still have exactly the same problem as in Examples 1, 2 and 3 below.
>>
>> Example 1: The sentence where the above regex pattern is working correctly
>> *Original content:* Dear Sir,  \n\n \n \n\n I am terminating
>> *Index content: *Dear Sir, <br><br> I am terminating
>>
>> Example 2: A sentence where the above regex pattern is only partially working
>> (as you can see, instead of 2 <br>, there are 4 <br>)
>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
>> Chu Kang Avenue 4, Singapore
>> *Index content: *exalted <br><br><br><br> Psalm 89:17 <br><br><br><br> 3 Choa
>> Chu Kang Avenue 4, Singapore
>>
>> Example 3: A sentence where the above regex pattern is only partially working
>> (as you can see, instead of 2 <br>, there are 4 <br>)
>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
>> \n\n
>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
>> 2018
>> at 10:07 AM
>> *Index content: *http://www.concordpri.moe.edu.sg/ <br><br><br><br> On
>> Tue, Dec 18, 2018 at 10:07 AM
>>
>> Any further suggestion?
>>
>> Thank you.
>>
>> Regards,
>> Edwin
>>
>> On Thu, 7 Feb 2019 at 22:20,  wrote:
>>
>> > To avoid the «\n+\s*» matching too many \n and then failing on the {2,}
>> > part you could try
>> >
>> >
>> >
>> > (\n\s*){2,}
>> >
>> >
>> >
>> > If you also want to match CRLF then
>> >
>> > (\r?\n\s*){2,}
>> >
>> >
>> >
>> >
>> >
>> > Sent from Mail for Windows 10
>> >
>> >
>> >
>> > From: Zheng Lin Edwin Yeo
>> > Sent: Thursday, February 7, 2019 15:10
>> > To: solr-user@lucene.apache.org
>> > Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> >
>> >
>> >
>> > Hi Paul,
>> >
>> > Thanks for your reply.
>> >
>> > When I use this pattern:
>> > <processor class="solr.RegexReplaceProcessorFactory">
>> >    <str name="fieldName">content</str>
>> >    <str name="pattern">(\n+\s*){2,}</str>
>> >    <str name="replacement"><br><br></str>
>> > </processor>
>> >
>> > It is working for some sentences within the same content and not working
>> > for
>> > some sentences. Please see below for the one that is working and another

Re: Solr Index Size after reindex

2019-02-08 Thread Andrea Gazzarini

Hi Mathieu,
what about the docs in the two infrastructures? Do they have the same 
numbers (numdocs / maxdocs)? Any meaningful message (error or not) in 
log files?


Andrea

On 08/02/2019 14:19, Mathieu Menard wrote:


Hello,

I would like to have your point of view about an observation we have 
made on our two Alfresco installs (Production and Staging environments), 
and more specifically on the size of our Solr indexes in these two 
environments.

Regularly we do an rsync between the Production and the Staging 
environments: we make a copy of Alfresco's DB and a copy of the 
entire contentstore, and after that we reindex all the Alfresco content.

We have noticed that for the Production environment we have 19 Gb of 
indexes while in Staging we have "only" 11 Gb of indexes. We have 
some difficulties understanding this difference, because we assumed that 
index optimization is the same for a full reindex as for the 
normal use of Solr.

I've verified the configuration of the two Solr instances and I 
don't see any differences. Could you help me to better understand this 
phenomenon?


Here you can find some information about our two environments; if you 
need more details, I will give them to you as soon as possible:

                  PRODUCTION                                  STAGING
Alfresco version  5.1.1.4                                     5.1.1.4
Solr Version      (screenshot in attachment)                  (screenshot in attachment)
Java version      (screenshot in attachment)                  (screenshot in attachment)
Linux Machine     see Staging_caracteristics.txt attachment   see Staging_caracteristics.txt attachment

Please let me know if you need any other information; I will send it to 
you rapidly.


Kind Regards

Matthieu





Get recent documents from solr

2019-02-08 Thread shruti suri
Hi,

I want to get the latest updated documents from Solr. I am indexing data from
multiple views and each view has its own update date. Also I am running a
full-indexing job every 4 hours, so I can't use the Solr timestamp (NOW). Is
there any Solr functionality by which I can achieve this?

Thanks
Shruti 




-
Regards
Shruti
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Ignore accent in a request

2019-02-08 Thread elisabeth benoit
Hello,

We use solr 7 and use

<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>

with mapping-ISOLatin1Accent.txt

containing lines like

# À => A
"\u00C0" => "A"

# Á => A
"\u00C1" => "A"

# Â => A
"\u00C2" => "A"

# Ã => A
"\u00C3" => "A"

# Ä => A
"\u00C4" => "A"

# Å => A
"\u00C5" => "A"

# Ā Ă Ą =>
"\u0100" => "A"
"\u0102" => "A"
"\u0104" => "A"

# Æ => AE
"\u00C6" => "AE"

# Ç => C
"\u00C7" => "C"

# é => e
"\u00E9" => "e"

Best regards,
Elisabeth

On Fri, Feb 8, 2019 at 11:18, Gopesh Sharma wrote:

> We have fixed this type of issue by using synonyms, adding
> SynonymFilterFactory (before Solr 7).
>
> -Original Message-
> From: SAUNIER Maxence 
> Sent: Friday, February 8, 2019 3:36 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Ignore accent in a request
>
> Hello,
>
> Thanks for your answer.
>
> I have tested:
>
> select?defType=dismax&q=je suis avarié&qf=content
> 90.000 results
>
> select?defType=dismax&q=je suis avarie&qf=content
> 60.000 results
>
> With avarié, I don't find documents with avarie, and with avarie, I don't
> find documents with avarié.
>
> I want to find the 150.000 documents with avarié or avarie.
>
> Thanks
>
> -Original Message-
> From: Erick Erickson
> Sent: Thursday, February 7, 2019 19:37
> To: solr-user
> Subject: Re: Ignore accent in a request
>
> exactly _how_ is it "not working"?
>
> Try building your parameters _up_ rather than starting with a lot, e.g.
> select?defType=dismax&q=je suis avarié&qf=title ^^ assumes you expect a
> match on title. Then:
> select?defType=dismax&q=je suis avarié&qf=title subject
>
> etc.
>
> Because mm=757 looks really wrong. From the docs:
> Defines the minimum number of clauses that must match, regardless of how
> many clauses there are in total.
>
> edismax is used much more than dismax as it's more flexible, but that's
> not germane here.
>
> finally, try adding &debug=query to the url to see exactly how the query
> is parsed.
>
> Best,
> Erick
>
> On Mon, Feb 4, 2019 at 9:09 AM SAUNIER Maxence  wrote:
> >
> > Hello,
> >
> > How can I ignore accents in the query results?
> >
> > Request:
> > http://*:8983/solr/***/select?defType=dismax&q=je+suis+avarié&qf=title%5e20+subject%5e15+category%5e1+content%5e0.5&mm=757
> >
> > I want to get docs with both avarié and avarie.
> >
> > I have added this to my schema:
> >
> >   {
> > "name": "string",
> > "positionIncrementGap": "100",
> > "analyzer": {
> >   "filters": [
> > {
> >   "class": "solr.LowerCaseFilterFactory"
> > },
> > {
> >   "class": "solr.ASCIIFoldingFilterFactory"
> > },
> > {
> >   "class": "solr.EdgeNGramFilterFactory",
> >   "minGramSize": "3",
> >   "maxGramSize": "50"
> > }
> >   ],
> >   "tokenizer": {
> > "class": "solr.KeywordTokenizerFactory"
> >   }
> > },
> > "stored": true,
> > "indexed": true,
> > "sortMissingLast": true,
> > "class": "solr.TextField"
> >   },
> >
> > But it is not working.
> >
> > Thanks.
>


Re: Solr relevancy score different on replicated nodes

2019-02-08 Thread Aman Tandon
Hi Erick,

I find this thread very relevant to the people who are facing the same
problem.

In our case, we have a signals aggregation collection which has a total
of around 8 million records. We have a Solr cloud architecture (3 shards and 4
replicas) and the whole index is around 2.5 GB.

We use this collection to fetch the most-clicked products for a query
and boost them in search results. The boost score is the query score from the
aggregation collection.

But when the query goes to a different replica we get a different boost score
for some of the keywords, hence on page refresh the result ordering keeps
changing.

In order to solve this we tried the ExactStatsCache for distributed IDF, and
at debug level I am seeing the global stats merge in the logs, but we still get
different scores on refreshing the results from the aggregation collection.
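
For reference, we enabled it with the statsCache element in solrconfig.xml,
along these lines:

<statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>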

Our indexing occurs once a day, so should we do a daily optimization, or should
we reduce the merge segment count to 2/3 (currently it is -1)?

What are your suggestions on this?

Regards,
Aman

On Fri, Feb 8, 2019, 00:15 Erick Erickson wrote:

> Optimization is safe. The large segment is irrelevant, you'll
> lose a little parallelization, but on an index with this few
> documents I doubt you'll notice.
>
> As of Solr 5, optimize will respect the max segment size
> which defaults to 5G, but you're well under that limit.
>
> Best,
> Erick
>
> On Sun, Feb 3, 2019 at 11:54 PM Ashish Bisht 
> wrote:
> >
> > Thanks Erick and everyone.We are checking on stats cache.
> >
> > I noticed stats skew again and optimized the index to correct the same.As
> > per the documents.
> >
> >
> https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/
> > and
> >
> https://lucidworks.com/2018/06/20/solr-and-optimizing-your-index-take-ii/
> >
> > I wanted to check on the below points, considering we want the stats skew
> > to be corrected.
> >
> > 1. When optimized, the single segment won't be naturally merged easily. As
> > we might be doing a manual optimize every time, what I visualize is that at
> > a certain point in the future we might have a single large segment. What
> > impact is this large segment going to have?
> > Our index has ~30k documents, i.e. files with content (segment size < 1 Gb
> > as of now).
> >
> > 2. Do you recommend going for optimize in these situations? Probably it
> > will be done only when stats skew. Is it safe?
> >
> > Regards
> > Ashish
> >
> >
> >
> >
> >
> >
> > --
> > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Query of Death Lucene/Solr 7.6

2019-02-08 Thread Markus Jelsma
Hello (apologies for cross-posting),

While working on SOLR-12743, using 7.6 on two nodes and 7.2.1 on the remaining 
four, we stumbled upon a situation where the 7.6 nodes quickly succumb when a 
'Query-of-Death' is issued, 7.2.1 up to 7.5 are all unaffected (tested and 
confirmed).

Following Smiley's suggestion I used Eclipse MAT to find the problem in the 
heap dump I obtained; this fantastic tool revealed within minutes that a query 
thread ate 65 % of all resources. In the class variables I could find the 
query, and reproduce the problem.

The problematic query is 'dubbele dijk/rijke dijkproject in het dijktracé 
eemshaven-delfzijl', on 7.6 this input produces a 40+ MB toString() output in 
edismax' newFieldQuery. If the node survives it takes 2+ seconds for the query 
to run (150 ms otherwise). If i disable all query time SynonymGraphFilters it 
still takes a second and produces just a 9 MB toString() for the query.

I could not find anything like this in Jira. I did think of LUCENE-8479 and 
LUCENE-8531 but they were about graphs, this problem looked related though.

I think I tracked it further down to LUCENE-8589 or SOLR-12243. When I leave 
Solr's edismax' pf parameter empty, everything runs fast. When all fields are 
configured for pf, the node dies.

I am now unsure whether this is a Solr or a Lucene issue. 

Please let me know.

Many thanks,
Markus

ps. in Solr I even got an 'Impossible Exception', my first!


Issue with dataimport xml validation with dtd and jetty: conflict of use for user.dir variable

2019-02-08 Thread jerome . dupont
Hello,

I use solr and dataimport to index xml files with a dtd.
The dtd is referenced like this


Previously we were using solr4 in a tomcat container.
During the import process, solr tries to validate the xml file with the 
dtd.
To find it we were defining -Duser.dir=pathToDtd, and solr could find the 
dtd, so validation was working.

Now, we are migrating to solr7 (and jetty embedded)
When we start solr with -a "-Duser.dir=pathToDtd", solr doesn't start and 
returns an error: Cannot find jetty main class

So I removed the -a "-Duser.dir=pathToDtd" option, and solr starts. 
BUT
Now solr cannot anymore open xml file, because it doesn't find the dtd 
during validation stage.

Is there a way to:
- activate an xml catalog file to indicate where the dtd is? (It seems this 
would be the better way, but I didn't find out how to do it)
- disable dtd validation 

Regards,
---
Jérôme Dupont
Bibliothèque Nationale de France
Département des Systèmes d'Information
Tour T3 - Quai François Mauriac
75706 Paris Cedex 13
téléphone: 33 (0)1 53 79 45 40
e-mail: jerome.dup...@bnf.fr
---

Pass BnF lecture/culture: libraries, exhibitions, conferences, concerts, 
unlimited for 15 € / year. Buy online. Before printing, think of the 
environment.

RE: Ignore accent in a request

2019-02-08 Thread Gopesh Sharma
We have fixed this type of issue by using synonyms, adding 
SynonymFilterFactory (before Solr 7).

-Original Message-
From: SAUNIER Maxence  
Sent: Friday, February 8, 2019 3:36 PM
To: solr-user@lucene.apache.org
Subject: RE: Ignore accent in a request

Hello,

Thanks for your answer.

I have tested:

select?defType=dismax&q=je suis avarié&qf=content
90.000 results

select?defType=dismax&q=je suis avarie&qf=content
60.000 results

With avarié, I don't find documents with avarie, and with avarie, I don't find 
documents with avarié.

I want to find the 150.000 documents with avarié or avarie.

Thanks

-Original Message-
From: Erick Erickson
Sent: Thursday, February 7, 2019 19:37
To: solr-user
Subject: Re: Ignore accent in a request

exactly _how_ is it "not working"?

Try building your parameters _up_ rather than starting with a lot, e.g.
select?defType=dismax&q=je suis avarié&qf=title ^^ assumes you expect a match 
on title. Then:
select?defType=dismax&q=je suis avarié&qf=title subject

etc.

Because mm=757 looks really wrong. From the docs:
Defines the minimum number of clauses that must match, regardless of how many 
clauses there are in total.

edismax is used much more than dismax as it's more flexible, but that's not 
germane here.

finally, try adding &debug=query to the url to see exactly how the query is 
parsed.

Best,
Erick

On Mon, Feb 4, 2019 at 9:09 AM SAUNIER Maxence  wrote:
>
> Hello,
>
> How can I ignore accents in the query results?
>
> Request:
> http://*:8983/solr/***/select?defType=dismax&q=je+suis+avarié&qf=title%5e20+subject%5e15+category%5e1+content%5e0.5&mm=757
>
> I want to get docs with both avarié and avarie.
>
> I have added this to my schema:
>
>   {
> "name": "string",
> "positionIncrementGap": "100",
> "analyzer": {
>   "filters": [
> {
>   "class": "solr.LowerCaseFilterFactory"
> },
> {
>   "class": "solr.ASCIIFoldingFilterFactory"
> },
> {
>   "class": "solr.EdgeNGramFilterFactory",
>   "minGramSize": "3",
>   "maxGramSize": "50"
> }
>   ],
>   "tokenizer": {
> "class": "solr.KeywordTokenizerFactory"
>   }
> },
> "stored": true,
> "indexed": true,
> "sortMissingLast": true,
> "class": "solr.TextField"
>   },
>
> But it is not working.
>
> Thanks.


RE: Ignore accent in a request

2019-02-08 Thread SAUNIER Maxence
Hello,

Thanks for your answer.

I have tested:

select?defType=dismax&q=je suis avarié&qf=content
90.000 results

select?defType=dismax&q=je suis avarie&qf=content
60.000 results

With avarié, I don't find documents with avarie, and with avarie, I don't find 
documents with avarié.

I want to find the 150.000 documents with avarié or avarie.

Thanks

-Original Message-
From: Erick Erickson
Sent: Thursday, February 7, 2019 19:37
To: solr-user
Subject: Re: Ignore accent in a request

exactly _how_ is it "not working"?

Try building your parameters _up_ rather than starting with a lot, e.g.
select?defType=dismax&q=je suis avarié&qf=title ^^ assumes you expect a match 
on title. Then:
select?defType=dismax&q=je suis avarié&qf=title subject

etc.

Because mm=757 looks really wrong. From the docs:
Defines the minimum number of clauses that must match, regardless of how many 
clauses there are in total.

edismax is used much more than dismax as it's more flexible, but that's not 
germane here.

finally, try adding &debug=query to the url to see exactly how the query is 
parsed.

Best,
Erick

On Mon, Feb 4, 2019 at 9:09 AM SAUNIER Maxence  wrote:
>
> Hello,
>
> How can I ignore accents in the query results?
>
> Request:
> http://*:8983/solr/***/select?defType=dismax&q=je+suis+avarié&qf=title%5e20+subject%5e15+category%5e1+content%5e0.5&mm=757
>
> I want to get docs with both avarié and avarie.
>
> I have added this to my schema:
>
>   {
> "name": "string",
> "positionIncrementGap": "100",
> "analyzer": {
>   "filters": [
> {
>   "class": "solr.LowerCaseFilterFactory"
> },
> {
>   "class": "solr.ASCIIFoldingFilterFactory"
> },
> {
>   "class": "solr.EdgeNGramFilterFactory",
>   "minGramSize": "3",
>   "maxGramSize": "50"
> }
>   ],
>   "tokenizer": {
> "class": "solr.KeywordTokenizerFactory"
>   }
> },
> "stored": true,
> "indexed": true,
> "sortMissingLast": true,
> "class": "solr.TextField"
>   },
>
> But it is not working.
>
> Thanks.