Re: Solr Size Limitation up to 32 KB

2019-01-16 Thread Kranthi Kumar K
Hi Team,


Can we have any updates on the below issue? We are awaiting your reply.


Thanks,

Kranthi kumar.K


From: Kranthi Kumar K
Sent: Friday, January 4, 2019 5:01:38 PM
To: d...@lucene.apache.org
Cc: Ananda Babu medida; Srinivasa Reddy Karri
Subject: Solr Size Limitation up to 32 KB


Hi team,



We are currently using Solr 4.2.1 in our project and everything is
going well. But recently we hit an issue with Solr Data Import: it does
not import files larger than 32766 bytes (~32 KB) and throws two
exceptions:



  1.  java.lang.IllegalArgumentException
  2.  org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException



Please find the attached screenshot for reference.



We have searched for solutions in many forums and didn't find an exact
solution for this issue. Interestingly, we found an article suggesting that
changing the field type from 'string' to 'text_general' might solve it.
Please have a look at the forum thread below:



https://stackoverflow.com/questions/29445323/adding-a-document-to-the-index-in-solr-document-contains-at-least-one-immense-t



Schema.xml:

We changed the type of the field from 'string' to 'text_general', roughly as
sketched below.
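As a sketch, the change looks like this (the field name here is illustrative,
not necessarily our actual schema):

  <!-- Before: a 'string' field indexes the whole value as a single term,
       and Lucene caps an indexed term at 32766 bytes. -->
  <field name="content" type="string" indexed="true" stored="true"/>

  <!-- After: 'text_general' tokenizes the value, so no single term hits
       that cap. -->
  <field name="content" type="text_general" indexed="true" stored="true"/>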



We have tried it, but it still does not import files larger than 32766 bytes (~32 KB).



Could you please let us know the solution to fix this issue? We’ll be awaiting 
your reply.



Re: Solr index writing to s3

2019-01-16 Thread Jörn Franke
This is not a requirement; it is a statement of a problem that could have
other solutions. S3 is only eventually consistent, and I am not sure Solr
works properly in that case. You may also need to check which S3 consistency
guarantees apply.

> On 16.01.2019 at 19:39, Naveen M wrote:
> 
> hi,
> 
> My requirement is to write the index data into S3, we have solr installed
> on aws instances. Please let me know if there is any documentation on how
> to achieve writing the index data to s3.
> 
> Thanks


Re: Content from EML files indexing from text/html (which is not clean) instead of text/plain

2019-01-16 Thread Zheng Lin Edwin Yeo
Based on the discussion in Tika and also on the Jira (TIKA-2814), it was
said that the issue could be with Solr's ExtractingRequestHandler, in
which the HTMLParser is either not being applied, or is somehow not
stripping the content of <style> elements. The standalone Tika app is able
to do the right thing.
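For anyone reproducing this, the standalone check is a one-liner (jar and
file names illustrative, Tika 1.18 per the thread below):

  java -jar tika-app-1.18.jar --text sample.eml

This prints the extracted plain text, without the CSS clutter quoted below.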

Regards,
Edwin

On Tue, 15 Jan 2019 at 10:56, Zheng Lin Edwin Yeo 
wrote:

> Hi Alex,
>
> Thanks for the suggestions.
> Yes, I have posted it in the Tika mailing list too.
>
> Regards,
> Edwin
>
> On Mon, 14 Jan 2019 at 21:16, Alexandre Rafalovitch 
> wrote:
>
>> I think asking this question on Tika mailing list may give you better
>> answers. Then, if the conclusion is that the behavior is configurable,
>> you can see how to do it in Solr. It may be, however, that you need to
>> do the parsing outside of Solr with standalone Tika. Standalone Tika
>> is the recommended practice for production anyway.
>>
>> I would suggest the title be something like "How to prefer the text/plain
>> part of an email message when parsing .eml files".
>>
>> Regards,
>>   Alex.
>>
>> On Mon, 14 Jan 2019 at 00:20, Zheng Lin Edwin Yeo 
>> wrote:
>> >
>> > Hi,
>> >
>> > I have uploaded a sample EML file here:
>> >
>> > https://drive.google.com/file/d/1z1gujv4SiacFeganLkdb0DhfZsNeGD2a/view?usp=sharing
>> >
>> > This is what is indexed in the content:
>> >
>> > "content":"  font-size: 14pt; font-family: book antiqua,
>> > palatino, serif;  Hi There,font-size: 14pt; font-family:
>> > book antiqua, palatino, serif;  My client owns the domain name “
>> > font-size: 14pt; color: #ff; font-family: arial black, sans-serif;
>> >  TravelInsuranceEurope.com   font-size: 14pt; font-family: book
>> > antiqua, palatino, serif;  ” and is considering putting it in market.
>> > It is keyword rich domain with good search volume,adword bidding and
>> > type-in-traffic.font-size: 14pt; font-family: book
>> > antiqua, palatino, serif;  Based on our extensive study, we strongly
>> > feel that you should consider buying this domain name to improve the
>> > SEO, Online visibility, brand image, authority and type-in-traffic for
>> > your business. We also do provide free 1 year hosting and unlimited
>> > emails along with domain name.font-size: 14pt;
>> > font-family: book antiqua, palatino, serif;  Besides this, if you need
>> > any other domain name, web and app designing services and digital
>> > marketing services (SEO, PPC and SMO) at reasonable charges, feel free
>> > to contact us.font-size: 14pt; font-family: book antiqua,
>> > palatino, serif;  Best Regards,font-size: 14pt;
>> > font-family: book antiqua, palatino, serif;  Josh   ",
>> >
>> >
>> > As you can see, this is taken from the Content-Type: text/html.
>> > However, the Content-Type: text/plain part looks clean, and that is what
>> > we want to be indexed.
>> >
>> > How can we configure Tika in Solr to change the priority to get the
>> > content from Content-Type: text/plain instead of Content-Type: text/html?
>> >
>> > On Mon, 14 Jan 2019 at 11:18, Zheng Lin Edwin Yeo > >
>> > wrote:
>> >
>> > > Hi,
>> > >
>> > > I am using Solr 7.5.0 with Tika 1.18.
>> > >
>> > > Currently I am facing a situation during the indexing of EML files,
>> > > whereby the content is being extracted from the Content-type=text/html
>> > > instead of Content-type=text/plain.
>> > >
>> > > The problem with Content-type=text/html is that it contains a lot of
>> > > words like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" in the content, and
>> > > all of these get indexed in Solr as well, which makes the content very
>> > > cluttered. It also affects search: when we search for words like
>> > > "font", all the contents get returned because of this.
>> > >
>> > > We would like to enquire about the following:
>> > > 1. Why didn't Tika pick the text part (text/plain)? Is there any way to
>> > > configure Tika in Solr to change the priority to get the text part
>> > > (text/plain) instead of the html part (text/html)?
>> > > 2. If that is not possible: as you can see, the content is not clean,
>> > > which is not right. How can we get it to be clean when Tika is
>> > > extracting text?
>> > >
>> > > Regards,
>> > > Edwin
>> > >
>>
>


Re: How to config security.json?

2019-01-16 Thread Mutuhprasannth
Hi Byzen, I also want to implement authentication in the same way. Did you
get any solution for this?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


RE: Re: Delayed/waiting requests

2019-01-16 Thread Markus Jelsma
Hello,

There is a completely undocumented parameter that displays the cache's
contents: set showItems="100" on the filter cache.
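For example, on the filterCache entry in solrconfig.xml (the sizes echo the
12K figures from this thread and are illustrative, not a recommendation):

  <!-- showItems exposes the cache's current entries in the stats/admin UI.
       Note: each filterCache entry costs roughly maxDoc/8 bytes, so with the
       ~14M docs mentioned below that is ~1.75 MB per entry, and 12K entries
       could approach ~21 GB in the worst case. -->
  <filterCache class="solr.FastLRUCache"
               size="12288"
               initialSize="12288"
               autowarmCount="12288"
               showItems="100"/>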

Regards,
Markus

 
 
-Original message-
> From:Erick Erickson 
> Sent: Wednesday 16th January 2019 17:40
> To: solr-user 
> Subject: Re: Re: Delayed/waiting requests
> 
> I don't know of any tools to inspect the cache. Under the covers,
> these are things like Java's ConcurrentHashMap which don't, for
> instance, carry along information like last access time IIUC.
> 
> I usually have to cull the Solr logs and eyeball the fq clauses to see
> if anything jumps out. If you do find any such patterns, you can
> always add {!cache=false} to those clauses to not use up cache
> entries.
> 
> Best,
> Erick
> 
> On Wed, Jan 16, 2019 at 7:53 AM Gael Jourdan-Weil
>  wrote:
> >
> > Ok, I get your point.
> >
> >
> > Do you know if there is a tool to easily view filterCache content?
> >
> > I know we can see the top entries in the API or the UI but could we see 
> > more?
> >
> >
> > Regards,
> >
> > Gaël
> >
> > 
> > De : Erick Erickson 
> > Envoyé : mardi 15 janvier 2019 19:46:19
> > À : solr-user
> > Objet : Re: Re: Delayed/waiting requests
> >
> > bq. If I get your point, having a big cache might cause more troubles
> > than help if the cache hit ratio is not high enough because the cache
> > is constantly evicting/inserting entries?
> >
> > Pretty much. Although there are nuances.
> >
> > Right now, you have a 12K autowarm count. That means your cache will
> > eventually always contain 12K entries whether or not you ever use the
> > last 11K! I'm simplifying a bit, but it grows like this.
> >
> > Let's say I start Solr. Initially it has no cache entries. Now I start
> > both querying and indexing. For simplicity, say I have 100 _new_  fq
> > clauses come in between each commit. The first commit will autowarm
> > 100. The next will autowarm 200, then 300, etc. Eventually this
> > will grow to 12K. So your performance will start to vary depending on
> > how long Solr has been running.
> >
> > Worse, it's not clear that you _ever_ re-use those clauses. One example:
> > fq=date_field:[* TO NOW]
> > NOW is really a Unix timestamp. So issuing the same fq 1 millisecond
> > from the first one will not re-use the entry. In the worst case almost
> > all of your autwarming is useless. It neither loads relevant index
> > data into RAM nor is reusable.
> >
> > Even if you use "date math" to round to, say, a minute, if you run
> > Solr long enough you'll still fill up with useless fq clauses.
> >
> > Best,
> > Erick
> >
> > On Tue, Jan 15, 2019 at 9:33 AM Gael Jourdan-Weil
> >  wrote:
> > >
> > > @Erick:
> > >
> > >
> > > We will try to lower the autowarm and run some tests to compare.
> > >
> > > If I get your point, having a big cache might cause more troubles than 
> > > help if the cache hit ratio is not high enough because the cache is 
> > > constantly evicting/inserting entries?
> > >
> > >
> > >
> > > @Jeremy:
> > >
> > >
> > > Index size: ~20G and ~14M documents
> > >
> > > Server memory available: 256G from which ~30G used and ~100G system cache
> > >
> > > Server CPU count: 32, ~10% usage
> > >
> > > JVM memory settings: -Xms12G -Xmx12G
> > >
> > >
> > > We have 3 servers and 3 clusters of 3 Solr instances.
> > >
> > > That is each server hosts 1 Solr instance for each cluster.
> > >
> > > And, indeed, each cluster only has 1 shard with replication factor 3.
> > >
> > >
> > > Among all these Solr instances, the pauses are observed on only one 
> > > single cluster but on every server at different times (sometimes on all 
> > > servers at the same time but I would say it's very rare).
> > >
> > > We do observe the traffic is evenly balanced across the 3 servers, around 
> > > 30-40 queries per second sent to each server.
> > >
> > >
> > >
> > > Regards,
> > >
> > > Gaël
> > >
> > >
> > > 
> > > De : Branham, Jeremy (Experis) 
> > > Envoyé : mardi 15 janvier 2019 17:59:56
> > > À : solr-user@lucene.apache.org
> > > Objet : Re: Re: Delayed/waiting requests
> > >
> > > Hi Gael –
> > >
> > > Could you share this information?
> > > Size of the index
> > > Server memory available
> > > Server CPU count
> > > JVM memory settings
> > >
> > > You mentioned a cloud configuration of 3 replicas.
> > > Does that mean you have 1 shard with a replication factor of 3?
> > > Do the pauses occur on all 3 servers?
> > > Is the traffic evenly balanced across those servers?
> > >
> > >
> > > Jeremy Branham
> > > jb...@allstate.com
> > >
> > >
> > > On 1/15/19, 9:50 AM, "Erick Erickson"  wrote:
> > >
> > > Well, it was a nice theory anyway.
> > >
> > > "Other collections with the same settings"
> > > doesn't really mean much unless those other collections are very 
> > > similar,
> > > especially in terms of numbers of docs.
> > >
> > > You should only see a new searcher opening when you do a
> > > hard-commit-with-opensearcher-true or soft commit.

Re: Solr index writing to s3

2019-01-16 Thread Hendrik Haddorp
Theoretically you should be able to use the HDFS backend, which you can
configure to use S3. However, the last time I tried that it did not work for
some reason. Here is an example of such an attempt, which also seems to have
ultimately failed:
https://community.plm.automation.siemens.com/t5/Developer-Space/Running-Solr-on-S3/td-p/449360
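For reference, the attempted configuration looks something like this in
solrconfig.xml (bucket and paths illustrative; as said, it may not work
reliably given S3's consistency model):

  <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
    <!-- Point the "HDFS" home at an s3a:// location. -->
    <str name="solr.hdfs.home">s3a://my-bucket/solr</str>
    <!-- Hadoop config directory holding the s3a credentials/settings. -->
    <str name="solr.hdfs.confdir">/etc/hadoop/conf</str>
  </directoryFactory>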


On 16.01.2019 19:39, Naveen M wrote:

hi,

My requirement is to write the index data into S3, we have solr installed
on aws instances. Please let me know if there is any documentation on how
to achieve writing the index data to s3.

Thanks





Solr index writing to s3

2019-01-16 Thread Naveen M
hi,

My requirement is to write the index data into S3, we have solr installed
on aws instances. Please let me know if there is any documentation on how
to achieve writing the index data to s3.

Thanks


Re: Inconsistent debugQuery score with multiplicative boost

2019-01-16 Thread Thomas Aglassinger
On 04.01.19, 09:11, "Thomas Aglassinger"  wrote:

>  When debugging a query using multiplicative boost based on the product() 
> function I noticed that the score computed in the explain section is correct 
> while the score in the actual result is wrong.

We dug into this further and seem to have found the culprit.

The last working version is Solr 7.2.1. Using git bisect we found that the
issue was introduced by LUCENE-8099 (a refactoring). There are two changes
that break the scoring in different ways:

LUCENE-8099: Deprecate CustomScoreQuery, BoostedQuery, BoostingQuery
LUCENE-8099: Replace BoostQParserPlugin.boostQuery() with 
FunctionScoreQuery.boostByValue()

Reverting parts of these changes to the previous approach based on a deprecated
class (which LUCENE-8099 cleaned up) seems to fix the issue.
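For anyone wanting to repeat the bisect, it ran roughly like this (the tag
name follows the lucene-solr repo's release tagging; the reproducing test is
whatever fails for you):

  git bisect start
  git bisect bad master                        # scoring broken on master
  git bisect good releases/lucene-solr/7.2.1   # last known good release
  # at each step: build, run the reproducing test, then mark the commit
  git bisect good   # or: git bisect bad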

We created a Solr issue to document our current findings and changes: 
https://issues.apache.org/jira/browse/SOLR-13126

It contains a patch with our experimental fix (currently in a rough state)
and a test case that reproduces the issue from Solr 7.3 up to the current
master.

A proper fix would of course not revert to deprecated classes again, but fix
whatever went wrong during LUCENE-8099.

Hopefully someone with a deeper understanding of the mechanics behind this
can take a look.

Best regards, Thomas.




Re: Re: Delayed/waiting requests

2019-01-16 Thread Erick Erickson
I don't know of any tools to inspect the cache. Under the covers,
these are things like Java's ConcurrentHashMap which don't, for
instance, carry along information like last access time IIUC.

I usually have to cull the Solr logs and eyeball the fq clauses to see
if anything jumps out. If you do find any such patterns, you can
always add {!cache=false} to those clauses to not use up cache
entries.
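For example (field name illustrative; the clause is the one discussed below):

  fq={!cache=false}date_field:[* TO NOW]

With date math such as date_field:[* TO NOW/MINUTE], identical filters issued
within the same minute can at least share a single cache entry.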

Best,
Erick

On Wed, Jan 16, 2019 at 7:53 AM Gael Jourdan-Weil
 wrote:
>
> Ok, I get your point.
>
>
> Do you know if there is a tool to easily view filterCache content?
>
> I know we can see the top entries in the API or the UI but could we see more?
>
>
> Regards,
>
> Gaël
>
> 
> De : Erick Erickson 
> Envoyé : mardi 15 janvier 2019 19:46:19
> À : solr-user
> Objet : Re: Re: Delayed/waiting requests
>
> bq. If I get your point, having a big cache might cause more troubles
> than help if the cache hit ratio is not high enough because the cache
> is constantly evicting/inserting entries?
>
> Pretty much. Although there are nuances.
>
> Right now, you have a 12K autowarm count. That means your cache will
> eventually always contain 12K entries whether or not you ever use the
> last 11K! I'm simplifying a bit, but it grows like this.
>
> Let's say I start Solr. Initially it has no cache entries. Now I start
> both querying and indexing. For simplicity, say I have 100 _new_  fq
> clauses come in between each commit. The first commit will autowarm
> 100. The next will autowarm 200, then 300, etc. Eventually this
> will grow to 12K. So your performance will start to vary depending on
> how long Solr has been running.
>
> Worse, it's not clear that you _ever_ re-use those clauses. One example:
> fq=date_field:[* TO NOW]
> NOW is really a Unix timestamp. So issuing the same fq 1 millisecond
> from the first one will not re-use the entry. In the worst case almost
> all of your autowarming is useless. It neither loads relevant index
> data into RAM nor is reusable.
>
> Even if you use "date math" to round to, say, a minute, if you run
> Solr long enough you'll still fill up with useless fq clauses.
>
> Best,
> Erick
>
> On Tue, Jan 15, 2019 at 9:33 AM Gael Jourdan-Weil
>  wrote:
> >
> > @Erick:
> >
> >
> > We will try to lower the autowarm and run some tests to compare.
> >
> > If I get your point, having a big cache might cause more troubles than help 
> > if the cache hit ratio is not high enough because the cache is constantly 
> > evicting/inserting entries?
> >
> >
> >
> > @Jeremy:
> >
> >
> > Index size: ~20G and ~14M documents
> >
> > Server memory available: 256G from which ~30G used and ~100G system cache
> >
> > Server CPU count: 32, ~10% usage
> >
> > JVM memory settings: -Xms12G -Xmx12G
> >
> >
> > We have 3 servers and 3 clusters of 3 Solr instances.
> >
> > That is each server hosts 1 Solr instance for each cluster.
> >
> > And, indeed, each cluster only has 1 shard with replication factor 3.
> >
> >
> > Among all these Solr instances, the pauses are observed on only one single 
> > cluster but on every server at different times (sometimes on all servers at 
> > the same time but I would say it's very rare).
> >
> > We do observe the traffic is evenly balanced across the 3 servers, around 
> > 30-40 queries per second sent to each server.
> >
> >
> >
> > Regards,
> >
> > Gaël
> >
> >
> > 
> > De : Branham, Jeremy (Experis) 
> > Envoyé : mardi 15 janvier 2019 17:59:56
> > À : solr-user@lucene.apache.org
> > Objet : Re: Re: Delayed/waiting requests
> >
> > Hi Gael –
> >
> > Could you share this information?
> > Size of the index
> > Server memory available
> > Server CPU count
> > JVM memory settings
> >
> > You mentioned a cloud configuration of 3 replicas.
> > Does that mean you have 1 shard with a replication factor of 3?
> > Do the pauses occur on all 3 servers?
> > Is the traffic evenly balanced across those servers?
> >
> >
> > Jeremy Branham
> > jb...@allstate.com
> >
> >
> > On 1/15/19, 9:50 AM, "Erick Erickson"  wrote:
> >
> > Well, it was a nice theory anyway.
> >
> > "Other collections with the same settings"
> > doesn't really mean much unless those other collections are very 
> > similar,
> > especially in terms of numbers of docs.
> >
> > You should only see a new searcher opening when you do a
> > hard-commit-with-opensearcher-true or soft commit.
> >
> > So what happens when you just try lowering the autowarm
> > count? I'm assuming you're free to test in some non-prod
> > system.
> >
> > Focusing on the hit ratio is something of a red herring. Remember
> > that each entry in your filterCache is roughly maxDoc/8 + a little
> > overhead, the increase in GC pressure has to be balanced
> > against getting the hits from the cache.
> >
> > Now, all that said if there's no correlation, then you need to put
> > a profiler on the system when you see this kind of thing and
> > find out where the hotspots are, otherwise it's guesswork and
> > I'm out of ideas.

Re: Starting optimize... Reading and rewriting the entire index! Use with care

2019-01-16 Thread Walter Underwood
Agreed, make both of those changes.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jan 16, 2019, at 7:21 AM, Erick Erickson  wrote:
> 
> +1 to defaulting it to false. Maybe call it forceMerge to be more in
> line with Lucene for all the reasons that we've discussed elsewhere?
> 
> Talhanather:
> 
> Frankly, I'd just stop optimizing. It's mis-named and there are a
> series of very good reasons to avoid it _unless_
> 1> you can _measure_ a significant improvement after the op
> 2> you can afford the time etc. to do it every time.
> 
> See: 
> https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/
> It's not as bad, but still expensive in Solr 7.5 and later:
> https://lucidworks.com/2018/06/20/solr-and-optimizing-your-index-take-ii/
> 
> If you see that message in your logs, then you're optimizing and need
> do nothing. If you absolutely
> feel you must, use .../update?optimize=true
> 
> Best,
> Erick
> 
> On Wed, Jan 16, 2019 at 6:38 AM Jan Høydahl  wrote:
>> 
>> Should we consider defaulting optimize to false in the DIH UI?
>> 
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> 
>>> 16. jan. 2019 kl. 14:23 skrev Jeremy Smith :
>>> 
>>> How are you calling the dataimport?  As I understand it, optimize defaults 
>>> to true, so unless you explicitly set it to false, the optimize will occur 
>>> after the import.
>>> 
>>> 
>>> 
>>> From: talhanather 
>>> Sent: Wednesday, January 16, 2019 7:57:29 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Starting optimize... Reading and rewriting the entire index! 
>>> Use with care
>>> 
>>> Hi Erick,
>>> 
>>> PFB the solrconfig.xml; it does not have the optimize flag set to true.
>>> Then how is optimization continuously occurring for me?
>>> 
>>> 
>>> 
>>> 
>>> <requestHandler class="org.apache.solr.handler.dataimport.DataImportHandler"
>>>                 name="/dataimport">
>>>   <lst name="defaults">
>>>     <str name="update.chain">uuid</str>
>>>     <str name="config">db-data-config.xml</str>
>>>   </lst>
>>> </requestHandler>
>>> 
>>> 
>>> 
>>> 
>>> 
>>> --
>>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>> 



RE: Re: Delayed/waiting requests

2019-01-16 Thread Gael Jourdan-Weil
Ok, I get your point.


Do you know if there is a tool to easily view filterCache content?

I know we can see the top entries in the API or the UI but could we see more?


Regards,

Gaël


De : Erick Erickson 
Envoyé : mardi 15 janvier 2019 19:46:19
À : solr-user
Objet : Re: Re: Delayed/waiting requests

bq. If I get your point, having a big cache might cause more troubles
than help if the cache hit ratio is not high enough because the cache
is constantly evicting/inserting entries?

Pretty much. Although there are nuances.

Right now, you have a 12K autowarm count. That means your cache will
eventually always contain 12K entries whether or not you ever use the
last 11K! I'm simplifying a bit, but it grows like this.

Let's say I start Solr. Initially it has no cache entries. Now I start
both querying and indexing. For simplicity, say I have 100 _new_  fq
clauses come in between each commit. The first commit will autowarm
100. The next will autowarm 200, then 300, etc. Eventually this
will grow to 12K. So your performance will start to vary depending on
how long Solr has been running.

Worse, it's not clear that you _ever_ re-use those clauses. One example:
fq=date_field:[* TO NOW]
NOW is really a Unix timestamp. So issuing the same fq 1 millisecond
from the first one will not re-use the entry. In the worst case almost
all of your autowarming is useless. It neither loads relevant index
data into RAM nor is reusable.

Even if you use "date math" to round to, say, a minute, if you run
Solr long enough you'll still fill up with useless fq clauses.

Best,
Erick

On Tue, Jan 15, 2019 at 9:33 AM Gael Jourdan-Weil
 wrote:
>
> @Erick:
>
>
> We will try to lower the autowarm and run some tests to compare.
>
> If I get your point, having a big cache might cause more troubles than help 
> if the cache hit ratio is not high enough because the cache is constantly 
> evicting/inserting entries?
>
>
>
> @Jeremy:
>
>
> Index size: ~20G and ~14M documents
>
> Server memory available: 256G from which ~30G used and ~100G system cache
>
> Server CPU count: 32, ~10% usage
>
> JVM memory settings: -Xms12G -Xmx12G
>
>
> We have 3 servers and 3 clusters of 3 Solr instances.
>
> That is each server hosts 1 Solr instance for each cluster.
>
> And, indeed, each cluster only has 1 shard with replication factor 3.
>
>
> Among all these Solr instances, the pauses are observed on only one single 
> cluster but on every server at different times (sometimes on all servers at 
> the same time but I would say it's very rare).
>
> We do observe the traffic is evenly balanced across the 3 servers, around 
> 30-40 queries per second sent to each server.
>
>
>
> Regards,
>
> Gaël
>
>
> 
> De : Branham, Jeremy (Experis) 
> Envoyé : mardi 15 janvier 2019 17:59:56
> À : solr-user@lucene.apache.org
> Objet : Re: Re: Delayed/waiting requests
>
> Hi Gael –
>
> Could you share this information?
> Size of the index
> Server memory available
> Server CPU count
> JVM memory settings
>
> You mentioned a cloud configuration of 3 replicas.
> Does that mean you have 1 shard with a replication factor of 3?
> Do the pauses occur on all 3 servers?
> Is the traffic evenly balanced across those servers?
>
>
> Jeremy Branham
> jb...@allstate.com
>
>
> On 1/15/19, 9:50 AM, "Erick Erickson"  wrote:
>
> Well, it was a nice theory anyway.
>
> "Other collections with the same settings"
> doesn't really mean much unless those other collections are very similar,
> especially in terms of numbers of docs.
>
> You should only see a new searcher opening when you do a
> hard-commit-with-opensearcher-true or soft commit.
>
> So what happens when you just try lowering the autowarm
> count? I'm assuming you're free to test in some non-prod
> system.
>
> Focusing on the hit ratio is something of a red herring. Remember
> that each entry in your filterCache is roughly maxDoc/8 + a little
> overhead, the increase in GC pressure has to be balanced
> against getting the hits from the cache.
>
> Now, all that said if there's no correlation, then you need to put
> a profiler on the system when you see this kind of thing and
> find out where the hotspots are, otherwise it's guesswork and
> I'm out of ideas.
>
> Best,
> Erick
>
> On Tue, Jan 15, 2019 at 12:06 AM Gael Jourdan-Weil
>  wrote:
> >
> > Hi Erick,
> >
> >
> > Thank you for your detailed answer, I better understand autowarming.
> >
> >
> > We have an autowarming time of ~10s for the filterCache (the
> > queryResultCache is not used at all, ratio = 0.02).
> >
> > We increased the size of the filterCache from 6k to 12k (with the
> > autowarming size set to the same values) to get a better ratio, which is
> > _only_ around 0.85/0.90.
> >
> >
> > The thing I don't understand is that I should see "Opening new searcher"
> > in the logs every time a new searcher is opened.

Re: Starting optimize... Reading and rewriting the entire index! Use with care

2019-01-16 Thread Erick Erickson
+1 to defaulting it to false. Maybe call it forceMerge to be more in
line with Lucene for all the reasons that we've discussed elsewhere?

Talhanather:

Frankly, I'd just stop optimizing. It's mis-named and there are a
series of very good reasons to avoid it _unless_
1> you can _measure_ a significant improvement after the op
2> you can afford the time etc. to do it every time.

See: 
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/
It's not as bad, but still expensive in Solr 7.5 and later:
https://lucidworks.com/2018/06/20/solr-and-optimizing-your-index-take-ii/

If you see that message in your logs, then you're optimizing and need
do nothing. If you absolutely
feel you must, use .../update?optimize=true

Best,
Erick

On Wed, Jan 16, 2019 at 6:38 AM Jan Høydahl  wrote:
>
> Should we consider defaulting optimize to false in the DIH UI?
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > 16. jan. 2019 kl. 14:23 skrev Jeremy Smith :
> >
> > How are you calling the dataimport?  As I understand it, optimize defaults 
> > to true, so unless you explicitly set it to false, the optimize will occur 
> > after the import.
> >
> >
> > 
> > From: talhanather 
> > Sent: Wednesday, January 16, 2019 7:57:29 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Starting optimize... Reading and rewriting the entire index! 
> > Use with care
> >
> > Hi Erick,
> >
> > PFB the solrconfig.xml; it does not have the optimize flag set to true.
> > Then how is optimization continuously occurring for me?
> >
> >
> >
> >
> > <requestHandler class="org.apache.solr.handler.dataimport.DataImportHandler"
> >                 name="/dataimport">
> >   <lst name="defaults">
> >     <str name="update.chain">uuid</str>
> >     <str name="config">db-data-config.xml</str>
> >   </lst>
> > </requestHandler>
> >
> > 
> >
> >
> >
> >
> > --
> > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Starting optimize... Reading and rewriting the entire index! Use with care

2019-01-16 Thread Jan Høydahl
Should we consider defaulting optimize to false in the DIH UI?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> 16. jan. 2019 kl. 14:23 skrev Jeremy Smith :
> 
> How are you calling the dataimport?  As I understand it, optimize defaults to 
> true, so unless you explicitly set it to false, the optimize will occur after 
> the import.
> 
> 
> 
> From: talhanather 
> Sent: Wednesday, January 16, 2019 7:57:29 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Starting optimize... Reading and rewriting the entire index! Use 
> with care
> 
> Hi Erick,
> 
> PFB the solrconfig.xml; it does not have the optimize flag set to true.
> Then how is optimization continuously occurring for me?
> 
> 
> 
> 
> <requestHandler class="org.apache.solr.handler.dataimport.DataImportHandler"
>                 name="/dataimport">
>   <lst name="defaults">
>     <str name="update.chain">uuid</str>
>     <str name="config">db-data-config.xml</str>
>   </lst>
> </requestHandler>
>
> 
> 
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Starting optimize... Reading and rewriting the entire index! Use with care

2019-01-16 Thread Jeremy Smith
How are you calling the dataimport?  As I understand it, optimize defaults to 
true, so unless you explicitly set it to false, the optimize will occur after 
the import.
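Passing it explicitly looks like this (host, port, and core name
illustrative):

  http://localhost:8983/solr/mycore/dataimport?command=full-import&optimize=false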



From: talhanather 
Sent: Wednesday, January 16, 2019 7:57:29 AM
To: solr-user@lucene.apache.org
Subject: Re: Starting optimize... Reading and rewriting the entire index! Use 
with care

Hi Erick,

PFB the solrconfig.xml; it does not have the optimize flag set to true.
Then how is optimization continuously occurring for me?




  

<requestHandler class="org.apache.solr.handler.dataimport.DataImportHandler"
                name="/dataimport">
  <lst name="defaults">
    <str name="update.chain">uuid</str>
    <str name="config">db-data-config.xml</str>
  </lst>
</requestHandler>






--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Starting optimize... Reading and rewriting the entire index! Use with care

2019-01-16 Thread talhanather
Hi Erick,

PFB the solrconfig.xml; it does not have the optimize flag set to true.
Then how is optimization continuously occurring for me?




  

<requestHandler class="org.apache.solr.handler.dataimport.DataImportHandler"
                name="/dataimport">
  <lst name="defaults">
    <str name="update.chain">uuid</str>
    <str name="config">db-data-config.xml</str>
  </lst>
</requestHandler>






--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html