Re: Urgent- General Question about document Indexing frequency in solr

2021-02-04 Thread Scott Stults
Manisha,

The most general recommendation around commits is to not explicitly commit
after every update. There are settings that will let Solr automatically
commit after some threshold is met, and by delegating commits to that
mechanism you can generally ingest faster.

See this blog post that goes into detail about how to set that up for your
situation:

https://lucidworks.com/post/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
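
As a rough illustration only (the exact values below are an assumption, not a
recommendation), the settings that post describes live inside <updateHandler> in
solrconfig.xml and look roughly like this:

  <autoCommit>
    <maxTime>60000</maxTime>            <!-- hard commit every 60s; flushes changes to disk -->
    <openSearcher>false</openSearcher>  <!-- don't open a new searcher on hard commit -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>5000</maxTime>             <!-- soft commit every 5s; makes changes visible -->
  </autoSoftCommit>

Individual update requests can also carry a commitWithin parameter instead of an
explicit commit, which lets Solr batch the commit work in the same way.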


Kind regards,
Scott


On Wed, Feb 3, 2021 at 5:44 PM Manisha Rahatadkar <
manisha.rahatad...@anjusoftware.com> wrote:

> Hi All
>
> Looking for some help on document indexing frequency. I am using apache
> solr 7.7 and SolrNet library to commit documents to Solr. Summary for this
> function is:
> // Summary:
> // Commits posted documents, blocking until index changes are flushed
> to disk and
> // blocking until a new searcher is opened and registered as the main
> query searcher,
> // making the changes visible.
>
> I understand that the document gets reindexed after every commit. I have
> noticed that as the number of documents increases, the reindexing takes
> longer, and sometimes I am getting a Solr connection timeout error.
> I have the following questions:
>
>   1.  Is there any frequency suggested by Solr for document insert/update
> and reindex? Is there any standard recommendation?
>   2.  If I remove the copy fields from managed-schema.xml, do I need to
> delete the existing indexed data from solr core and then insert data and
> reindex it again?
>
> Thanks in advance.
>
> Regards
> Manisha
>
>
>
>


-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Urgent- General Question about document Indexing frequency in solr

2021-02-03 Thread Manisha Rahatadkar
Hi All

Looking for some help on document indexing frequency. I am using apache solr 
7.7 and SolrNet library to commit documents to Solr. Summary for this function 
is:
// Summary:
// Commits posted documents, blocking until index changes are flushed to 
disk and
// blocking until a new searcher is opened and registered as the main query 
searcher,
// making the changes visible.

I understand that the document gets reindexed after every commit. I have 
noticed that as the number of documents increases, the reindexing takes 
longer, and sometimes I am getting a Solr connection timeout error.
I have the following questions:

  1.  Is there any frequency suggested by Solr for document insert/update and 
reindex? Is there any standard recommendation?
  2.  If I remove the copy fields from managed-schema.xml, do I need to delete 
the existing indexed data from solr core and then insert data and reindex it 
again?

Thanks in advance.

Regards
Manisha





Question: JavaBinCodec cannot handle BytesRef object

2021-01-12 Thread Boqi Gao
Dear all:

We have recently been facing a problem when utilizing BinaryDocValuesField in 
Solr 7.3.1.
We have created a binary docValues field.
The constructor BinaryDocValuesField(String name, BytesRef value) needs a 
BytesRef object to be set as its fieldData.

However, the JavaBinCodec cannot handle a BytesRef object. It writes the 
BytesRef as a string of class name and value.
(please see: 
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/7.3.1/solr/solrj/src/java/org/apache/solr/common/util/JavaBinCodec.java#L247)

And the response is like:
"response":{"docs":[
  { fieldName: "org.apache.lucene.util.BytesRef:[3c c1 a1 28 3d ……]"}
]
}

However, if the value of the field could be handled as a BytesRef object by 
JavabinCodec, the TextResponseWriter will write the response as a Base64 string.
(please see:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/7.3.1/solr/core/src/java/org/apache/solr/response/TextResponseWriter.java#L190-L192
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/7.3.1/solr/solrj/src/java/org/apache/solr/common/util/JavaBinCodec.java#L247
)

And the response, which we hope to get, is like:
"response":{"docs":[
  { fieldName: "vApp0zDtHj69e9mq……"}
]
}

We would like to ask whether you have any idea or suggestion to fix this 
problem. We hope to get the response as a Base64 string.
Many thanks!

Best wishes,
Gao
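
(One possible workaround, assuming the bytes only need to be stored and returned
rather than used as docValues: Solr's stored solr.BinaryField type accepts its
content as a Base64 string and returns it Base64-encoded in responses. A minimal
managed-schema sketch, with a hypothetical field name:

  <fieldType name="binary" class="solr.BinaryField"/>
  <field name="payload_b64" type="binary" indexed="false" stored="true"/>

This does not change how JavaBinCodec handles BytesRef docValues, so it only
helps if docValues are not actually required for the use case.)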


Re: Question on solr metrics

2020-10-27 Thread Emir Arnautović
Hi,
In order to see time-range metrics, you’ll need to collect metrics periodically 
and send them to some storage and then query/visualise them. Solr has exporters for 
some popular backends, or you can use a cloud based solution. One such 
solution is ours: https://sematext.com/integrations/solr-monitoring/ and we’ve 
also just added Solr logs integration, so you can collect/visualise/alert on 
both metrics and logs.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 26 Oct 2020, at 22:08, yaswanth kumar  wrote:
> 
> Can we get the metrics for a particular time range? I know metrics history
> was not enabled, so I will only have metrics from when the Solr node last
> came up, but even so, can we do a date range, for example to see CPU usage
> over a particular time range?
> 
> Note: Solr version: 8.2
> 
> -- 
> Thanks & Regards,
> Yaswanth Kumar Konathala.
> yaswanth...@gmail.com



Question on solr metrics

2020-10-26 Thread yaswanth kumar
Can we get the metrics for a particular time range? I know metrics history
was not enabled, so I will only have metrics from when the Solr node last
came up, but even so, can we do a date range, for example to see CPU usage
over a particular time range?

Note: Solr version: 8.2

-- 
Thanks & Regards,
Yaswanth Kumar Konathala.
yaswanth...@gmail.com


Re: Question on metric values

2020-10-26 Thread Andrzej Białecki
The “requests” metric is a simple counter. Please see the documentation in the 
Reference Guide on the available metrics and their meaning. This counter is 
initialised when the replica starts up, and it’s not persisted (so if you 
restart this Solr node it will reset to 0).


If by “frequency” you mean rate of requests over a time period then the 1-, 5- 
and 15-min rates are available from “QUERY./select.requestTimes”

—

Andrzej Białecki

> On 26 Oct 2020, at 17:25, yaswanth kumar  wrote:
> 
> I am new to the metrics API in Solr. When I try
> solr/admin/metrics?prefix=QUERY./select.requests it returns numbers
> against each collection that I have. I understand those are the
> requests coming in against each collection, but over what period?
> Are those numbers from the time the collection went live, or are they
> for the last n minutes, or is it config based? Also, what is the default
> period when we don't configure anything?
> 
> Note: I am using solr 8.2
> 
> -- 
> Thanks & Regards,
> Yaswanth Kumar Konathala.
> yaswanth...@gmail.com



Question on metric values

2020-10-26 Thread yaswanth kumar
I am new to the metrics API in Solr. When I try
solr/admin/metrics?prefix=QUERY./select.requests it returns numbers
against each collection that I have. I understand those are the
requests coming in against each collection, but over what period?
Are those numbers from the time the collection went live, or are they
for the last n minutes, or is it config based? Also, what is the default
period when we don't configure anything?

Note: I am using solr 8.2

-- 
Thanks & Regards,
Yaswanth Kumar Konathala.
yaswanth...@gmail.com


Re: TieredMergePolicyFactory question

2020-10-26 Thread Moulay Hicham
Thanks Shawn and Erick.

So far I haven't noticed any performance issues before and after the change.

My concern all along is COST. We could have left the configuration as is -
keeping the deleted documents in the index - but then we would have to scale up our
Solr cluster. This would double our Solr cluster cost, and the additional
cost is what we are trying to avoid.

I will test the expungeDeletes and revert the max segment size back to 5G.

Thanks again,

Moulay

On Mon, Oct 26, 2020 at 5:49 AM Erick Erickson 
wrote:

> "Some large segments were merged into 12GB segments and
> deleted documents were physically removed.”
> and
> “So with the current natural merge strategy, I need to update
> solrconfig.xml
> and increase the maxMergedSegmentMB often"
>
> I strongly recommend you do not continue down this path. You’re making a
> mountain out of a mole-hill. You have offered no proof that removing the
> deleted documents is noticeably improving performance. If you replace
> docs randomly, deleted docs will be removed eventually with the default
> merge policy without you doing _anything_ special at all.
>
> The fact that you think you need to continuously bump up the size of
> your segments indicates your understanding is incomplete. When
> you start changing settings basically at random in order to “fix” a
> problem,
> especially one that you haven’t demonstrated _is_ a problem, you
> invariably make the problem worse.
>
> By making segments larger, you’ve increased the work Solr (well Lucene) has
> to do in order to merge them since the merge process has to handle these
> larger segments. That’ll take longer. There are a fixed number of threads
> that do merging. If they’re all tied up, incoming updates will block until
> a thread frees up. I predict that if you continue down this path,
> eventually
> your updates will start to misbehave and you’ll spend a week trying to
> figure
> out why.
>
> If you insist on worrying about deleted documents, just expungeDeletes
> occasionally. I’d also set the segments size back to the default 5G. I
> can’t
> emphasize strongly enough that the way you’re approaching this will lead
> to problems, not to mention maintenance that is harder than it needs to
> be. If you do set the max segment size back to 5G, your 12G segments will
> _not_ merge until they have lots of deletes, making your problem worse.
> Then you’ll spend time trying to figure out why.
>
> Recovering from what you’ve done already has problems. Those large segments
> _will_ get rewritten (we call it “singleton merge”) when they’ve
> accumulated a
> lot of deletes, but meanwhile you’ll think that your problem is getting
> worse and worse.
>
> When those large segments have more than 10% deleted documents,
> expungeDeletes
> will singleton merge them and they’ll gradually shrink.
>
> So my prescription is:
>
> 1> set the max segment size back to 5G
>
> 2> monitor your segments. When you see your large segments  > 5G have
> more than 10% deleted documents, issue an expungeDeletes command (not
> optimize).
> This will recover your index from the changes you’ve already made.
>
> 3> eventually, all your segments will be under 5G. When that happens, stop
> issuing expungeDeletes.
>
> 4> gather some performance statistics and prove one way or another that as
> deleted
> docs accumulate over time, it impacts performance. NOTE: after your last
> expungeDeletes, deleted docs will accumulate over time until they reach a
> plateau and
> shouldn’t continue increasing after that. If you can _prove_ that
> accumulating deleted
> documents affects performance, institute a regular expungeDeletes.
> Optimize, but
> expungeDeletes is less expensive and on a changing index expungeDeletes is
> sufficient. Optimize is only really useful for a static index, so I’d
> avoid it in your
> situation.
>
> Best,
> Erick
>
> > On Oct 26, 2020, at 1:22 AM, Moulay Hicham 
> wrote:
> >
> > Some large segments were merged into 12GB segments and
> > deleted documents were physically removed.
>
>


Re: TieredMergePolicyFactory question

2020-10-26 Thread Erick Erickson
"Some large segments were merged into 12GB segments and
deleted documents were physically removed.”
and
“So with the current natural merge strategy, I need to update solrconfig.xml
and increase the maxMergedSegmentMB often"

I strongly recommend you do not continue down this path. You’re making a
mountain out of a mole-hill. You have offered no proof that removing the
deleted documents is noticeably improving performance. If you replace
docs randomly, deleted docs will be removed eventually with the default
merge policy without you doing _anything_ special at all.

The fact that you think you need to continuously bump up the size of
your segments indicates your understanding is incomplete. When
you start changing settings basically at random in order to “fix” a problem,
especially one that you haven’t demonstrated _is_ a problem, you 
invariably make the problem worse.

By making segments larger, you’ve increased the work Solr (well Lucene) has
to do in order to merge them since the merge process has to handle these
larger segments. That’ll take longer. There are a fixed number of threads
that do merging. If they’re all tied up, incoming updates will block until
a thread frees up. I predict that if you continue down this path, eventually
your updates will start to misbehave and you’ll spend a week trying to figure
out why.

If you insist on worrying about deleted documents, just expungeDeletes
occasionally. I’d also set the segments size back to the default 5G. I can’t
emphasize strongly enough that the way you’re approaching this will lead
to problems, not to mention maintenance that is harder than it needs to
be. If you do set the max segment size back to 5G, your 12G segments will
_not_ merge until they have lots of deletes, making your problem worse. 
Then you’ll spend time trying to figure out why.

Recovering from what you’ve done already has problems. Those large segments
_will_ get rewritten (we call it “singleton merge”) when they’ve accumulated a
lot of deletes, but meanwhile you’ll think that your problem is getting worse 
and worse.

When those large segments have more than 10% deleted documents, expungeDeletes
will singleton merge them and they’ll gradually shrink.

So my prescription is:

1> set the max segment size back to 5G

2> monitor your segments. When you see your large segments  > 5G have 
more than 10% deleted documents, issue an expungeDeletes command (not optimize).
This will recover your index from the changes you’ve already made.

3> eventually, all your segments will be under 5G. When that happens, stop
issuing expungeDeletes.

4> gather some performance statistics and prove one way or another that as 
deleted
docs accumulate over time, it impacts performance. NOTE: after your last
expungeDeletes, deleted docs will accumulate over time until they reach a 
plateau and
shouldn’t continue increasing after that. If you can _prove_ that accumulating 
deleted
documents affects performance, institute a regular expungeDeletes. Optimize, but
expungeDeletes is less expensive and on a changing index expungeDeletes is
sufficient. Optimize is only really useful for a static index, so I’d avoid it 
in your
situation.
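
For reference, a minimal sketch of issuing expungeDeletes as an XML update
message posted to the collection's /update handler (these are the standard
commit attributes; adapt it to whatever client you use):

  <commit expungeDeletes="true" waitSearcher="false"/>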

Best,
Erick

> On Oct 26, 2020, at 1:22 AM, Moulay Hicham  wrote:
> 
> Some large segments were merged into 12GB segments and
> deleted documents were physically removed.



Re: TieredMergePolicyFactory question

2020-10-26 Thread Shawn Heisey

On 10/25/2020 11:22 PM, Moulay Hicham wrote:

I am wondering about 3 other things:

1 - You mentioned that I need free disk space. Just to make sure that we
are talking about disc space here. RAM can still remain at the same size?
My current RAM size is  Index size < RAM < 1.5 Index size


You must always have enough disk space available for your indexes to 
double in size.  We recommend having enough disk space for your indexes 
to *triple* in size, because there is a real-world scenario that will 
require that much disk space.



2 - When the merge is happening, it happens in disc and when it's
completed, then the data is sync'ed with RAM. I am just guessing here ;-).
I couldn't find a good explanation online about this.


If you have enough free memory, then the OS will make sure that the data 
is available in RAM.  All modern operating systems do this 
automatically.  Note that I am talking about memory that is not 
allocated to programs.  Any memory assigned to the Solr heap (or any 
other program) will NOT be available for caching index data.


If you want ideal performance in typical situations, you must have as 
much free memory as the space your indexes take up on disk.  For ideal 
performance in ALL situations, you'll want enough free memory to be able 
to hold both the original and optimized copies of your index data at the 
same time.  We have seen that good performance can be achieved without 
going to this extreme, but if you have little free memory, Solr 
performance will be terrible.


I wrote a wiki page that covers this in some detail:

https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems


3 - Also I am wondering what recommendation you have for continuously
purging deleted documents. optimize? expungeDeletes? Natural Merge?
Here are more details about the need to purge documents.


The only way to guarantee that all deleted docs are purged is to 
optimize.   You could use the expungeDeletes action ... but this might 
not get rid of all the deleted documents, and depending on how those 
documents are distributed across the whole index, expungeDeletes might 
not do anything at all.  These operations are expensive (require a lot 
of time and system resources) and will temporarily increase the size of 
your index, up to double the starting size.


Before you go down the road of optimizing regularly, you should 
determine whether freeing up the disk space for deleted documents 
actually makes a substantial difference in performance.  In very old 
Solr versions, optimizing the index did produce major performance 
gains... but current versions have much better performance on indexes 
that have deleted documents.  Because performance is typically 
drastically reduced while the optimize is happening, the tradeoff may 
not be worthwhile.


Thanks,
Shawn


Re: TieredMergePolicyFactory question

2020-10-25 Thread Moulay Hicham
Thanks so much for clarifying. I have deployed the change to prod and seems
to be working. Some large segments were merged into 12GB segments and
deleted documents were physically removed.

I am wondering about 3 other things:

1 - You mentioned that I need free disk space. Just to make sure that we
are talking about disc space here. RAM can still remain at the same size?
My current RAM size is  Index size < RAM < 1.5 Index size

2 - When the merge is happening, it happens in disc and when it's
completed, then the data is sync'ed with RAM. I am just guessing here ;-).
I couldn't find a good explanation online about this.

3 - Also I am wondering what recommendation you have for continuously
purging deleted documents. optimize? expungeDeletes? Natural Merge?
Here are more details about the need to purge documents.
My solr cluster is very expensive. So we would like to maintain the cost
and avoid scaling up if possible.
The solr index is being written at a rate > 100 TPS
Also we have a requirement to delete old data. So we are
continuously trimming millions of documents daily that are older than X
years.
So with the current natural merge strategy, I need to update solrconfig.xml
and increase the maxMergedSegmentMB often. So that I can reclaim physical
disc space.

Wondering if a feature of rewriting one single large merged segment into
another segment - and purging deleted documents in this process - can be
useful for use cases like mine. This will help purge deleted documents
without the need of continuously increasing the maxMergedSegmentMB.

Thanks,
Moulay

On Fri, Oct 23, 2020 at 11:10 AM Erick Erickson 
wrote:

> Well, you mentioned that the segments you’re concerned about were merged a year
> ago.
> If segments aren’t being merged, they’re pretty static.
>
> There’s no real harm in optimizing _occasionally_, even in an NRT index.
> If you have
> segments that were merged that long ago, you may be indexing continually
> but it
> sounds like it’s a situation where you update more recent docs rather than
> random
> ones over the entire corpus.
>
> That caution is more for indexes where you essentially replace docs in your
> corpus randomly, and it’s really about wasting a lot of cycles rather than
> bad stuff happening. When you randomly update documents (or delete them),
> the extra work isn’t worth it.
>
> Either operation will involve a lot of CPU cycles and can require that you
> have
> at least as much free space on your disk as the indexes occupy, so do be
> aware
> of that.
>
> All that said, what evidence do you have that this is worth any effort at
> all?
> Depending on the environment, you may not even be able to measure
> performance changes so this all may be irrelevant anyway.
>
> But to your question. Yes, you can cause regular merging to more
> aggressively
> merge segments with deleted docs by setting the
> deletesPctAllowed
> in solrconfig.xml. The default value is 33, and you can set it as low as
> 20 or as
> high as 50. We put
> a floor of 20% because the cost starts to rise quickly if it’s lower than
> that, and
> expungeDeletes is a better alternative at that point.
>
> This is not a hard number, and in practice the percentage of your index
> that consists
> of deleted documents tends to be lower than this number, depending of
> course
> on your particular environment.
>
> Best,
> Erick
>
> > On Oct 23, 2020, at 12:59 PM, Moulay Hicham 
> wrote:
> >
> > Thanks Eric.
> >
> > My index is near real time and frequently updated.
> > I checked this page
> >
> https://lucene.apache.org/solr/guide/8_1/uploading-data-with-index-handlers.html#xml-update-commands
> > and using forceMerge/expungeDeletes are NOT recommended.
> >
> > So I was hoping that the change in mergePolicyFactory will affect the
> > segments with high percent of deletes as part of the REGULAR segment
> > merging cycles. Is my understanding correct?
> >
> >
> >
> >
> > On Fri, Oct 23, 2020 at 9:47 AM Erick Erickson 
> > wrote:
> >
> >> Just go ahead and optimize/forceMerge, but do _not_ optimize to one
> >> segment. Or you can expungeDeletes, that will rewrite all segments with
> >> more than 10% deleted docs. As of Solr 7.5, these operations respect
> the 5G
> >> limit.
> >>
> >> See:
> https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/
> >>
> >> Best
> >> Erick
> >>
> >> On Fri, Oct 23, 2020, 12:36 Moulay Hicham 
> wrote:
> >>
> >>> Hi,
> >>>
> >>> I am using solr 8.1 in production. We have about 30%-50% of deleted
> >>> documents in some

Re: TieredMergePolicyFactory question

2020-10-23 Thread Erick Erickson
Well, you mentioned that the segments you’re concerned about were merged a year ago.
If segments aren’t being merged, they’re pretty static.

There’s no real harm in optimizing _occasionally_, even in an NRT index. If you 
have
segments that were merged that long ago, you may be indexing continually but it
sounds like it’s a situation where you update more recent docs rather than 
random
ones over the entire corpus.

That caution is more for indexes where you essentially replace docs in your
corpus randomly, and it’s really about wasting a lot of cycles rather than
bad stuff happening. When you randomly update documents (or delete them),
the extra work isn’t worth it.

Either operation will involve a lot of CPU cycles and can require that you have
at least as much free space on your disk as the indexes occupy, so do be aware
of that.

All that said, what evidence do you have that this is worth any effort at all?
Depending on the environment, you may not even be able to measure
performance changes so this all may be irrelevant anyway.

But to your question. Yes, you can cause regular merging to more aggressively 
merge segments with deleted docs by setting the
deletesPctAllowed
in solrconfig.xml. The default value is 33, and you can set it as low as 20 or
as
high as 50. We put
a floor of 20% because the cost starts to rise quickly if it’s lower than that, 
and
expungeDeletes is a better alternative at that point.

This is not a hard number, and in practice the percentage of your index that
consists
of deleted documents tends to be lower than this number, depending of course
on your particular environment.
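
As an illustration of that setting, a hedged solrconfig.xml sketch; the property
name matches the deletesPctAllowed setting mentioned above, while the element
type and the value 25 are assumptions for illustration only:

  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <double name="deletesPctAllowed">25</double>
  </mergePolicyFactory>

Leaving maxMergedSegmentMB out keeps the default 5G cap discussed above.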

Best,
Erick

> On Oct 23, 2020, at 12:59 PM, Moulay Hicham  wrote:
> 
> Thanks Eric.
> 
> My index is near real time and frequently updated.
> I checked this page
> https://lucene.apache.org/solr/guide/8_1/uploading-data-with-index-handlers.html#xml-update-commands
> and using forceMerge/expungeDeletes are NOT recommended.
> 
> So I was hoping that the change in mergePolicyFactory will affect the
> segments with high percent of deletes as part of the REGULAR segment
> merging cycles. Is my understanding correct?
> 
> 
> 
> 
> On Fri, Oct 23, 2020 at 9:47 AM Erick Erickson 
> wrote:
> 
>> Just go ahead and optimize/forceMerge, but do _not_ optimize to one
>> segment. Or you can expungeDeletes, that will rewrite all segments with
>> more than 10% deleted docs. As of Solr 7.5, these operations respect the 5G
>> limit.
>> 
>> See: https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/
>> 
>> Best
>> Erick
>> 
>> On Fri, Oct 23, 2020, 12:36 Moulay Hicham  wrote:
>> 
>>> Hi,
>>> 
>>> I am using solr 8.1 in production. We have about 30%-50% of deleted
>>> documents in some old segments that were merged a year ago.
>>> 
>>> These segments size is about 5GB.
>>> 
>>> I was wondering why these segments have a high % of deleted docs and
>> found
>>> out that they are NOT being candidates for merging because the
>>> default TieredMergePolicy maxMergedSegmentMB is 5G.
>>> 
>>> So I have modified the TieredMergePolicyFactory config as below to
>>> lower the delete docs %
>>>
>>> <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
>>>   <int name="maxMergeAtOnce">10</int>
>>>   <int name="segmentsPerTier">10</int>
>>>   <double name="maxMergedSegmentMB">12000</double>
>>>   <double name="deletesPctAllowed">20</double>
>>> </mergePolicyFactory>
>>> 
>>> 
>>> Do you see any issues with increasing the max merged segment to 12GB and
>>> lowered the deletedPctAllowed to 20%?
>>> 
>>> Thanks,
>>> 
>>> Moulay
>>> 
>> 



Re: TieredMergePolicyFactory question

2020-10-23 Thread Moulay Hicham
Thanks Eric.

My index is near real time and frequently updated.
I checked this page
https://lucene.apache.org/solr/guide/8_1/uploading-data-with-index-handlers.html#xml-update-commands
and using forceMerge/expungeDeletes are NOT recommended.

So I was hoping that the change in mergePolicyFactory will affect the
segments with high percent of deletes as part of the REGULAR segment
merging cycles. Is my understanding correct?




On Fri, Oct 23, 2020 at 9:47 AM Erick Erickson 
wrote:

> Just go ahead and optimize/forceMerge, but do _not_ optimize to one
> segment. Or you can expungeDeletes, that will rewrite all segments with
> more than 10% deleted docs. As of Solr 7.5, these operations respect the 5G
> limit.
>
> See: https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/
>
> Best
> Erick
>
> On Fri, Oct 23, 2020, 12:36 Moulay Hicham  wrote:
>
> > Hi,
> >
> > I am using solr 8.1 in production. We have about 30%-50% of deleted
> > documents in some old segments that were merged a year ago.
> >
> > These segments size is about 5GB.
> >
> > I was wondering why these segments have a high % of deleted docs and
> found
> > out that they are NOT being candidates for merging because the
> > default TieredMergePolicy maxMergedSegmentMB is 5G.
> >
> > So I have modified the TieredMergePolicyFactory config as below to
> > lower the delete docs %
> >
> > <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
> >   <int name="maxMergeAtOnce">10</int>
> >   <int name="segmentsPerTier">10</int>
> >   <double name="maxMergedSegmentMB">12000</double>
> >   <double name="deletesPctAllowed">20</double>
> > </mergePolicyFactory>
> >
> >
> > Do you see any issues with increasing the max merged segment to 12GB and
> > lowered the deletedPctAllowed to 20%?
> >
> > Thanks,
> >
> > Moulay
> >
>


Re: TieredMergePolicyFactory question

2020-10-23 Thread Erick Erickson
Just go ahead and optimize/forceMerge, but do _not_ optimize to one
segment. Or you can expungeDeletes, that will rewrite all segments with
more than 10% deleted docs. As of Solr 7.5, these operations respect the 5G
limit.

See: https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/

Best
Erick

On Fri, Oct 23, 2020, 12:36 Moulay Hicham  wrote:

> Hi,
>
> I am using solr 8.1 in production. We have about 30%-50% of deleted
> documents in some old segments that were merged a year ago.
>
> These segments size is about 5GB.
>
> I was wondering why these segments have a high % of deleted docs and found
> out that they are NOT being candidates for merging because the
> default TieredMergePolicy maxMergedSegmentMB is 5G.
>
> So I have modified the TieredMergePolicyFactory config as below to
> lower the delete docs %
>
> <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
>   <int name="maxMergeAtOnce">10</int>
>   <int name="segmentsPerTier">10</int>
>   <double name="maxMergedSegmentMB">12000</double>
>   <double name="deletesPctAllowed">20</double>
> </mergePolicyFactory>
>
>
> Do you see any issues with increasing the max merged segment to 12GB and
> lowered the deletedPctAllowed to 20%?
>
> Thanks,
>
> Moulay
>


TieredMergePolicyFactory question

2020-10-23 Thread Moulay Hicham
Hi,

I am using solr 8.1 in production. We have about 30%-50% of deleted
documents in some old segments that were merged a year ago.

These segments size is about 5GB.

I was wondering why these segments have a high % of deleted docs and found
out that they are NOT being candidates for merging because the
default TieredMergePolicy maxMergedSegmentMB is 5G.

So I have modified the TieredMergePolicyFactory config as below to
lower the delete docs %

<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">10</int>
  <double name="maxMergedSegmentMB">12000</double>
  <double name="deletesPctAllowed">20</double>
</mergePolicyFactory>


Do you see any issues with increasing the max merged segment to 12GB and
lowered the deletedPctAllowed to 20%?

Thanks,

Moulay


Re: Question about solr commits

2020-10-08 Thread Erick Erickson
This is a bit confused. There will be only one timer that starts at time T when
the first doc comes in. At T+ 15 seconds, all docs that have been received since
time T will be committed. The first doc to hit Solr _after_ T+15 seconds starts
a single new timer and the process repeats.

Best,
Erick

> On Oct 8, 2020, at 2:26 PM, Rahul Goswami  wrote:
> 
> Shawn,
> So if the autoCommit interval is 15 seconds, and one update request arrives
> at t=0 and another at t=10 seconds, then will there be two timers one
> expiring at t=15 and another at t=25 seconds, but this would amount to ONLY
> ONE commit at t=15 since that one would include changes from both updates.
> Is this understanding correct ?
> 
> Thanks,
> Rahul
> 
> On Wed, Oct 7, 2020 at 11:39 PM yaswanth kumar 
> wrote:
> 
>> Thank you very much both Eric and Shawn
>> 
>> Sent from my iPhone
>> 
>>> On Oct 7, 2020, at 10:41 PM, Shawn Heisey  wrote:
>>> 
>>> On 10/7/2020 4:40 PM, yaswanth kumar wrote:
>>>> I have the below in my solrconfig.xml
>>>>
>>>> <updateHandler class="solr.DirectUpdateHandler2">
>>>>   <updateLog>
>>>>     <str name="dir">${solr.Data.dir:}</str>
>>>>   </updateLog>
>>>>   <autoCommit>
>>>>     <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
>>>>     <openSearcher>false</openSearcher>
>>>>   </autoCommit>
>>>>   <autoSoftCommit>
>>>>     <maxTime>${solr.autoSoftCommit.maxTime:5000}</maxTime>
>>>>   </autoSoftCommit>
>>>> </updateHandler>
>>>>
>>>> Does this mean even though we are always sending data with commit=false on
>>>> update solr api, the above should do the commit every minute (60000 ms)
>>>> right?
>>> 
>>> Assuming that you have not defined the "solr.autoCommit.maxTime" and/or
>> "solr.autoSoftCommit.maxTime" properties, this config has autoCommit set to
>> 60 seconds without opening a searcher, and autoSoftCommit set to 5 seconds.
>>> 
>>> So five seconds after any indexing begins, Solr will do a soft commit.
>> When that commit finishes, changes to the index will be visible to
>> queries.  One minute after any indexing begins, Solr will do a hard commit,
>> which guarantees that data is written to disk, but it will NOT open a new
>> searcher, which means that when the hard commit happens, any pending
>> changes to the index will not be visible.
>>> 
>>> It's not "every five seconds" or "every 60 seconds" ... When any changes
>> are made, Solr starts a timer.  When the timer expires, the commit is
>> fired.  If no changes are made, no commits happen, because the timer isn't
>> started.
>>> 
>>> Thanks,
>>> Shawn
>> 



Re: Question about solr commits

2020-10-08 Thread Rahul Goswami
Shawn,
So if the autoCommit interval is 15 seconds, and one update request arrives
at t=0 and another at t=10 seconds, then will there be two timers one
expiring at t=15 and another at t=25 seconds, but this would amount to ONLY
ONE commit at t=15 since that one would include changes from both updates.
Is this understanding correct ?

Thanks,
Rahul

On Wed, Oct 7, 2020 at 11:39 PM yaswanth kumar 
wrote:

> Thank you very much both Eric and Shawn
>
> Sent from my iPhone
>
> > On Oct 7, 2020, at 10:41 PM, Shawn Heisey  wrote:
> >
> > On 10/7/2020 4:40 PM, yaswanth kumar wrote:
> >> I have the below in my solrconfig.xml
> >>
> >> <updateHandler class="solr.DirectUpdateHandler2">
> >>   <updateLog>
> >>     <str name="dir">${solr.Data.dir:}</str>
> >>   </updateLog>
> >>   <autoCommit>
> >>     <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
> >>     <openSearcher>false</openSearcher>
> >>   </autoCommit>
> >>   <autoSoftCommit>
> >>     <maxTime>${solr.autoSoftCommit.maxTime:5000}</maxTime>
> >>   </autoSoftCommit>
> >> </updateHandler>
> >>
> >> Does this mean even though we are always sending data with commit=false on
> >> update solr api, the above should do the commit every minute (60000 ms)
> >> right?
> >
> > Assuming that you have not defined the "solr.autoCommit.maxTime" and/or
> "solr.autoSoftCommit.maxTime" properties, this config has autoCommit set to
> 60 seconds without opening a searcher, and autoSoftCommit set to 5 seconds.
> >
> > So five seconds after any indexing begins, Solr will do a soft commit.
> When that commit finishes, changes to the index will be visible to
> queries.  One minute after any indexing begins, Solr will do a hard commit,
> which guarantees that data is written to disk, but it will NOT open a new
> searcher, which means that when the hard commit happens, any pending
> changes to the index will not be visible.
> >
> > It's not "every five seconds" or "every 60 seconds" ... When any changes
> are made, Solr starts a timer.  When the timer expires, the commit is
> fired.  If no changes are made, no commits happen, because the timer isn't
> started.
> >
> > Thanks,
> > Shawn
>


Re: Question about solr commits

2020-10-07 Thread yaswanth kumar
Thank you very much both Eric and Shawn

Sent from my iPhone

> On Oct 7, 2020, at 10:41 PM, Shawn Heisey  wrote:
> 
> On 10/7/2020 4:40 PM, yaswanth kumar wrote:
>> I have the below in my solrconfig.xml
>>
>> <updateHandler class="solr.DirectUpdateHandler2">
>>   <updateLog>
>>     <str name="dir">${solr.Data.dir:}</str>
>>   </updateLog>
>>   <autoCommit>
>>     <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
>>     <openSearcher>false</openSearcher>
>>   </autoCommit>
>>   <autoSoftCommit>
>>     <maxTime>${solr.autoSoftCommit.maxTime:5000}</maxTime>
>>   </autoSoftCommit>
>> </updateHandler>
>>
>> Does this mean even though we are always sending data with commit=false on
>> update solr api, the above should do the commit every minute (60000 ms)
>> right?
> 
> Assuming that you have not defined the "solr.autoCommit.maxTime" and/or 
> "solr.autoSoftCommit.maxTime" properties, this config has autoCommit set to 
> 60 seconds without opening a searcher, and autoSoftCommit set to 5 seconds.
> 
> So five seconds after any indexing begins, Solr will do a soft commit. When 
> that commit finishes, changes to the index will be visible to queries.  One 
> minute after any indexing begins, Solr will do a hard commit, which 
> guarantees that data is written to disk, but it will NOT open a new searcher, 
> which means that when the hard commit happens, any pending changes to the 
> index will not be visible.
> 
> It's not "every five seconds" or "every 60 seconds" ... When any changes are 
> made, Solr starts a timer.  When the timer expires, the commit is fired.  If 
> no changes are made, no commits happen, because the timer isn't started.
> 
> Thanks,
> Shawn


Re: Question about solr commits

2020-10-07 Thread Shawn Heisey

On 10/7/2020 4:40 PM, yaswanth kumar wrote:

I have the below in my solrconfig.xml

<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.Data.dir:}</str>
  </updateLog>
  <autoCommit>
    <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>${solr.autoSoftCommit.maxTime:5000}</maxTime>
  </autoSoftCommit>
</updateHandler>

Does this mean even though we are always sending data with commit=false on
update solr api, the above should do the commit every minute (60000 ms)
right?


Assuming that you have not defined the "solr.autoCommit.maxTime" and/or 
"solr.autoSoftCommit.maxTime" properties, this config has autoCommit set 
to 60 seconds without opening a searcher, and autoSoftCommit set to 5 
seconds.


So five seconds after any indexing begins, Solr will do a soft commit. 
When that commit finishes, changes to the index will be visible to 
queries.  One minute after any indexing begins, Solr will do a hard 
commit, which guarantees that data is written to disk, but it will NOT 
open a new searcher, which means that when the hard commit happens, any 
pending changes to the index will not be visible.


It's not "every five seconds" or "every 60 seconds" ... When any changes 
are made, Solr starts a timer.  When the timer expires, the commit is 
fired.  If no changes are made, no commits happen, because the timer 
isn't started.


Thanks,
Shawn


Re: Question about solr commits

2020-10-07 Thread Erick Erickson
Yes.

> On Oct 7, 2020, at 6:40 PM, yaswanth kumar  wrote:
> 
> I have the below in my solrconfig.xml
>
> <updateHandler class="solr.DirectUpdateHandler2">
>   <updateLog>
>     <str name="dir">${solr.Data.dir:}</str>
>   </updateLog>
>   <autoCommit>
>     <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
>     <openSearcher>false</openSearcher>
>   </autoCommit>
>   <autoSoftCommit>
>     <maxTime>${solr.autoSoftCommit.maxTime:5000}</maxTime>
>   </autoSoftCommit>
> </updateHandler>
>
> Does this mean even though we are always sending data with commit=false on
> update solr api, the above should do the commit every minute (60000 ms)
> right?
> 
> -- 
> Thanks & Regards,
> Yaswanth Kumar Konathala.
> yaswanth...@gmail.com



Question about solr commits

2020-10-07 Thread yaswanth kumar
I have the below in my solrconfig.xml

<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.Data.dir:}</str>
  </updateLog>
  <autoCommit>
    <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>${solr.autoSoftCommit.maxTime:5000}</maxTime>
  </autoSoftCommit>
</updateHandler>

Does this mean even though we are always sending data with commit=false on
update solr api, the above should do the commit every minute (60000 ms)
right?

-- 
Thanks & Regards,
Yaswanth Kumar Konathala.
yaswanth...@gmail.com


Re: Solr 7.6 query performace question

2020-10-01 Thread raj.yadav
harjags wrote
> Below errors are very common in 7.6 and we have solr nodes failing with
> tanking memory.
> 
> The request took too long to iterate over terms. Timeout: timeoutAt:
> 162874656583645 (System.nanoTime(): 162874701942020),
> TermsEnum=org.apache.lucene.codecs.blocktree.SegmentTermsEnum@74507f4a
> 
> or 
> 
> #*BitSetDocTopFilter*]; The request took too long to iterate over terms.
> Timeout: timeoutAt: 33288640223586 (System.nanoTime(): 33288700895778),
> TermsEnum=org.apache.lucene.codecs.blocktree.SegmentTermsEnum@5e458644
> 
> 
> or 
> 
> #SortedIntDocSetTopFilter]; The request took too long to iterate over
> terms.
> Timeout: timeoutAt: 552497919389297 (System.nanoTime(): 552508251053558),
> TermsEnum=org.apache.lucene.codecs.blocktree.SegmentTermsEnum@60b7186e
> 
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html



We are also seeing such errors in our logs, but our nodes are not failing,
and the frequency of such warnings is less than 5% of overall traffic.
What does this error mean?
Can someone elaborate on the following:
1. What does `The request took too long to iterate over terms` mean?
2. What are `BitSetDocTopFilter` and `SortedIntDocSetTopFilter`?

Regards,
Raj



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Quick Question

2020-09-02 Thread William Morin
  Hi,
I was looking for some articles to read about "Schema Markup" today when I
stumbled on your page [
https://cwiki.apache.org/confluence/display/SOLR/UsingMailingLists ].
Very cool.
Anyway, I noticed that there is a text on your page about "Schema Markup", and
luckily it's my keyword. I hope, if you don't mind, that you can give me a
backlink for "Schema Markup" on your page or in the resource page
section. It might be worth adding to your page.
Thanks and have a great day!
Regards,
William


Re: Question on sorting

2020-07-23 Thread Saurabh Sharma
Hi,
It is because the field is a string and the numbers are getting sorted
lexicographically. It has nothing to do with the number of digits.
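
A hedged sketch of the usual fix, assuming the data can be reindexed: give
TRACK_ID a numeric point type so it sorts numerically (the plong type below
matches the default configsets, but treat the names as an example):

  <fieldType name="plong" class="solr.LongPointField" docValues="true"/>
  <field name="TRACK_ID" type="plong" indexed="true" stored="true"/>

Changing the type requires deleting the existing data and reindexing;
alternatively, keep TRACK_ID as a string and copyField it into a separate
numeric field that is used only for sorting.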

Thanks
Saurabh


On Thu, Jul 23, 2020, 11:24 AM Srinivas Kashyap
 wrote:

> Hello,
>
> I have schema and field definition as shown below:
>
> <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
>
> <field name="TRACK_ID" type="string" indexed="true" stored="true" />
>
> TRACK_ID field contains "NUMERIC VALUE".
>
> When I use sort on track_id (TRACK_ID desc) it is not working properly.
>
> ->I have below values in Track_ID
>
> Doc1: "84806"
> Doc2: "124561"
>
> Ideally, when I use sort command, query result should be
>
> Doc2: "124561"
> Doc1: "84806"
>
> But I'm getting:
>
> Doc1: "84806"
> Doc2: "124561"
>
> Is this because, field type is string and doc1 has 5 digits and doc2 has 6
> digits?
>
> Please provide solution for this.
>
> Thanks,
> Srinivas
>
>
> 
>


Question on sorting

2020-07-22 Thread Srinivas Kashyap
Hello,

I have schema and field definition as shown below:

<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

<field name="TRACK_ID" type="string" indexed="true" stored="true" />
TRACK_ID field contains "NUMERIC VALUE".

When I use sort on track_id (TRACK_ID desc) it is not working properly.

->I have below values in Track_ID

Doc1: "84806"
Doc2: "124561"

Ideally, when I use sort command, query result should be

Doc2: "124561"
Doc1: "84806"

But I'm getting:

Doc1: "84806"
Doc2: "124561"

Is this because, field type is string and doc1 has 5 digits and doc2 has 6 
digits?

Please provide solution for this.

Thanks,
Srinivas





Re: Question regarding replica leader

2020-07-20 Thread Vishal Vaibhav
So how do we recover from such a state? When I try ADDREPLICA, it
returns a 503. Also, my node has multiple replicas, and most of them are
dead. How do we get rid of those dead replicas via a script? Is that a
possibility?

On Mon, 20 Jul 2020 at 11:00 AM, Radu Gheorghe 
wrote:

> Hi Vishal,
>
> I think that’s true, yes. The cluster has a leader (overseer), but this
> particular shard doesn’t seem to have a leader (yet). Logs should give you
> some pointers about why this happens (it may be, for example, that each
> replica is waiting for the other to become a leader, because each missed
> some updates).
>
> Best regards,
> Radu
> --
> Sematext Cloud - Full Stack Observability - https://sematext.com
> Solr and Elasticsearch Consulting, Training and Production Support
>
> > On 20 Jul 2020, at 04:17, Vishal Vaibhav  wrote:
> >
> > Hi any pointers on this ?
> >
> > On Wed, 15 Jul 2020 at 11:13 AM, Vishal Vaibhav 
> wrote:
> >
> >> Hi Solr folks,
> >>
> >> I am using solr cloud 8.4.1 . I am using*
> >> `/solr/admin/collections?action=CLUSTERSTATUS`*. Hitting this endpoint I
> >> get a list of replicas in which one is active but neither of them is
> >> leader. Something like this
> >>
> >> "core_node72": {"core": "rules_shard1_replica_n71","base_url": "node3,"
> >> node_name": "node3 base url","state": "active","type": "NRT","
> >> force_set_state": "false"},"core_node74": {"core":
> >> "rules_shard1_replica_n73","base_url": "node1","node_name":
> >> "node1_base_url","state": "down","type": "NRT","force_set_state":
> "false"}
> >> }}},"router": {"name": "compositeId"},"maxShardsPerNode": "1","
> >> autoAddReplicas": "false","nrtReplicas": "1","tlogReplicas": "0","
> >> znodeVersion": 276,"configName": "rules"}},"live_nodes":
> ["node1","node2",
> >> "node3","node4"] And when i see overseer status
> >> solr/admin/collections?action=OVERSEERSTATUS I get response like this
> which
> >> shows node 3 as leaderresponseHeader": {"status": 0,"QTime": 66},"leader
> >> ": "node 3","overseer_queue_size": 0,"overseer_work_queue_size": 0,"
> >> overseer_collection_queue_size": 2,"overseer_operations": ["addreplica",
> >>
> >> Does it mean the cluster is having a leader node but there is no leader
> >> replica as of now? And why the leader election is not happening?
> >>
>
>


Re: Question regarding replica leader

2020-07-19 Thread Radu Gheorghe
Hi Vishal,

I think that’s true, yes. The cluster has a leader (overseer), but this 
particular shard doesn’t seem to have a leader (yet). Logs should give you some 
pointers about why this happens (it may be, for example, that each replica is 
waiting for the other to become a leader, because each missed some updates).

Best regards,
Radu
--
Sematext Cloud - Full Stack Observability - https://sematext.com
Solr and Elasticsearch Consulting, Training and Production Support

> On 20 Jul 2020, at 04:17, Vishal Vaibhav  wrote:
> 
> Hi any pointers on this ?
> 
> On Wed, 15 Jul 2020 at 11:13 AM, Vishal Vaibhav  wrote:
> 
>> Hi Solr folks,
>> 
>> I am using solr cloud 8.4.1 . I am using*
>> `/solr/admin/collections?action=CLUSTERSTATUS`*. Hitting this endpoint I
>> get a list of replicas in which one is active but neither of them is
>> leader. Something like this
>> 
>> "core_node72": {"core": "rules_shard1_replica_n71","base_url": "node3,"
>> node_name": "node3 base url","state": "active","type": "NRT","
>> force_set_state": "false"},"core_node74": {"core":
>> "rules_shard1_replica_n73","base_url": "node1","node_name":
>> "node1_base_url","state": "down","type": "NRT","force_set_state": "false"}
>> }}},"router": {"name": "compositeId"},"maxShardsPerNode": "1","
>> autoAddReplicas": "false","nrtReplicas": "1","tlogReplicas": "0","
>> znodeVersion": 276,"configName": "rules"}},"live_nodes": ["node1","node2",
>> "node3","node4"] And when i see overseer status
>> solr/admin/collections?action=OVERSEERSTATUS I get response like this which
>> shows node 3 as leaderresponseHeader": {"status": 0,"QTime": 66},"leader
>> ": "node 3","overseer_queue_size": 0,"overseer_work_queue_size": 0,"
>> overseer_collection_queue_size": 2,"overseer_operations": ["addreplica",
>> 
>> Does it mean the cluster is having a leader node but there is no leader
>> replica as of now? And why the leader election is not happening?
>> 



Re: Question regarding replica leader

2020-07-19 Thread Vishal Vaibhav
Hi any pointers on this ?

On Wed, 15 Jul 2020 at 11:13 AM, Vishal Vaibhav  wrote:

> Hi Solr folks,
>
> I am using solr cloud 8.4.1 . I am using*
> `/solr/admin/collections?action=CLUSTERSTATUS`*. Hitting this endpoint I
> get a list of replicas in which one is active but neither of them is
> leader. Something like this
>
> "core_node72": {"core": "rules_shard1_replica_n71","base_url": "node3,"
> node_name": "node3 base url","state": "active","type": "NRT","
> force_set_state": "false"},"core_node74": {"core":
> "rules_shard1_replica_n73","base_url": "node1","node_name":
> "node1_base_url","state": "down","type": "NRT","force_set_state": "false"}
> }}},"router": {"name": "compositeId"},"maxShardsPerNode": "1","
> autoAddReplicas": "false","nrtReplicas": "1","tlogReplicas": "0","
> znodeVersion": 276,"configName": "rules"}},"live_nodes": ["node1","node2",
> "node3","node4"] And when i see overseer status
> solr/admin/collections?action=OVERSEERSTATUS I get response like this which
> shows node 3 as leaderresponseHeader": {"status": 0,"QTime": 66},"leader
> ": "node 3","overseer_queue_size": 0,"overseer_work_queue_size": 0,"
> overseer_collection_queue_size": 2,"overseer_operations": ["addreplica",
>
> Does it mean the cluster is having a leader node but there is no leader
> replica as of now? And why the leader election is not happening?
>


Question regarding replica leader

2020-07-14 Thread Vishal Vaibhav
Hi Solr folks,

I am using solr cloud 8.4.1 . I am using*
`/solr/admin/collections?action=CLUSTERSTATUS`*. Hitting this endpoint I
get a list of replicas in which one is active but neither of them is
leader. Something like this

"core_node72": {"core": "rules_shard1_replica_n71","base_url": "node3,"
node_name": "node3 base url","state": "active","type": "NRT","
force_set_state": "false"},"core_node74": {"core":
"rules_shard1_replica_n73","base_url": "node1","node_name": "node1_base_url"
,"state": "down","type": "NRT","force_set_state": "false","router": {"
name": "compositeId"},"maxShardsPerNode": "1","autoAddReplicas": "false","
nrtReplicas": "1","tlogReplicas": "0","znodeVersion": 276,"configName":
"rules"}},"live_nodes": ["node1","node2","node3","node4"] And when i see
overseer status solr/admin/collections?action=OVERSEERSTATUS I get response
like this which shows node 3 as leaderresponseHeader": {"status": 0,"QTime
": 66},"leader": "node 3","overseer_queue_size": 0,"overseer_work_queue_size
": 0,"overseer_collection_queue_size": 2,"overseer_operations": [
"addreplica",

Does it mean the cluster is having a leader node but there is no leader
replica as of now? And why the leader election is not happening?


Re: eDismax query syntax question

2020-06-16 Thread Shawn Heisey

On 6/15/2020 8:01 AM, Webster Homer wrote:

Only the minus following the parenthesis is treated as a NOT.
Are parentheses special? They're not mentioned in the eDismax documentation.


Yes, parentheses are special to edismax.  They are used just like in 
math equations, to group and separate things or to override the default 
operator order.


https://lucene.apache.org/solr/guide/8_5/the-standard-query-parser.html#escaping-special-characters

The edismax parser supports a superset of what the standard (lucene) 
parser does, so they have the same special characters.


Thanks,
Shawn


Re: eDismax query syntax question

2020-06-15 Thread Mikhail Khludnev
Hello.
Not sure if it's useful or relevant, I encountered another problem with
parentheses (braces) in eDisMax recently
https://issues.apache.org/jira/browse/SOLR-14557.

On Mon, Jun 15, 2020 at 5:01 PM Webster Homer <
webster.ho...@milliporesigma.com> wrote:

> Markus,
> Thanks for the reference, but that doesn't answer my question. If - is a
> special character, it's not consistently special. In my example
> "3-DIMETHYL" behaves quite differently than ")-PYRIMIDINE".  If I escape
> the closing parenthesis the following minus no longer behaves specially.
> The referred article does not even mention parenthesis, but it changes the
> behavior of the following "-" if it is escaped. In "3-DIMETHYL" the minus
> is not special.
>
> These all fix the problem:
> 1,3-DIMETHYL-5-(3-PHENYL-ALLYLIDENE\)-PYRIMIDINE-2,4,6-TRIONE
> 1,3-DIMETHYL-5-(3-PHENYL-ALLYLIDENE)\-PYRIMIDINE-2,4,6-TRIONE
> 1,3-DIMETHYL-5-\(3-PHENYL-ALLYLIDENE\)-PYRIMIDINE-2,4,6-TRIONE
>
> Only the minus following the parenthesis is treated as a NOT.
> Are parentheses special? They're not mentioned in the eDismax
> documentation.
>
> -Original Message-
> From: Markus Jelsma 
> Sent: Saturday, June 13, 2020 4:57 AM
> To: solr-user@lucene.apache.org
> Subject: RE: eDismax query syntax question
>
> Hello,
>
> These are special characters, if you don't need them, you must escape them.
>
> See top of the article:
>
> https://lucene.apache.org/solr/guide/8_5/the-extended-dismax-query-parser.html
>
> Markus
>
>
>
>
> -Original message-
> > From:Webster Homer 
> > Sent: Friday 12th June 2020 22:09
> > To: solr-user@lucene.apache.org
> > Subject: eDismax query syntax question
> >
> > Recently we found strange behavior in a query. We use eDismax as the
> query parser.
> >
> > This is the query term:
> > 1,3-DIMETHYL-5-(3-PHENYL-ALLYLIDENE)-PYRIMIDINE-2,4,6-TRIONE
> >
> > It should hit one document in our index. It does not. However, if you
> use the Dismax query parser it does match the record.
> >
> > The problem seems to involve the parenthesis and the dashes. If you
> > escape the dash after the parenthesis it matches
> > 1,3-DIMETHYL-5-(3-PHENYL-ALLYLIDENE)\-PYRIMIDINE-2,4,6-TRIONE
> >
> > I thought that eDismax and Dismax escaped all lucene special characters
> before passing the query to lucene. Although I also remember reading that +
> and - can have special significance in a query if preceded with white
> space. I can find very little documentation on either query parser in how
> they work.
> >
> > Is this expected behavior or is this a bug? If expected, where can I
> find documentation?
> >
> >
> >
> >
>
>
>


-- 
Sincerely yours
Mikhail Khludnev


Re: eDismax query syntax question

2020-06-15 Thread Andrea Gazzarini
Hi Webster,
What does the query debug say? If you set debug=true in the request, you can
get a better idea of how the two queries are interpreted.
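
For example, a request along these lines (collection and qf field are just
placeholders, and the q value would normally be URL-encoded) returns a debug
section whose "parsedquery" entry shows how eDismax interpreted the term:

   /solr/products/select?defType=edismax
      &qf=name
      &debug=true
      &q=1,3-DIMETHYL-5-(3-PHENYL-ALLYLIDENE)-PYRIMIDINE-2,4,6-TRIONE

Comparing that with defType=dismax should show where the "-" after the closing
parenthesis starts being treated as a prohibit (NOT) operator.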

Andrea

On Mon, 15 Jun 2020 at 16:01, Webster Homer <
webster.ho...@milliporesigma.com> wrote:

> Markus,
> Thanks, for the reference, but that doesn't answer my question. If - is a
> special character, it's not consistently special. In my example
> "3-DIMETHYL" behaves quite differently than ")-PYRIMIDINE".  If I escape
> the closing parenthesis the following minus no longer behaves specially.
> The referred article does not even mention parenthesis, but it changes the
> behavior of the following "-" if it is escaped. In "3-DIMETHYL" the minus
> is not special.
>
> These all fix the problem:
> 1,3-DIMETHYL-5-(3-PHENYL-ALLYLIDENE\)-PYRIMIDINE-2,4,6-TRIONE
> 1,3-DIMETHYL-5-(3-PHENYL-ALLYLIDENE)\-PYRIMIDINE-2,4,6-TRIONE
> 1,3-DIMETHYL-5-\(3-PHENYL-ALLYLIDENE\)-PYRIMIDINE-2,4,6-TRIONE
>
> Only the minus following the parenthesis is treated as a NOT.
> Are parentheses special? They're not mentioned in the eDismax
> documentation.
>
> -Original Message-
> From: Markus Jelsma 
> Sent: Saturday, June 13, 2020 4:57 AM
> To: solr-user@lucene.apache.org
> Subject: RE: eDismax query syntax question
>
> Hello,
>
> These are special characters, if you don't need them, you must escape them.
>
> See top of the article:
>
> https://lucene.apache.org/solr/guide/8_5/the-extended-dismax-query-parser.html
>
> Markus
>
>
>
>
> -Original message-
> > From:Webster Homer 
> > Sent: Friday 12th June 2020 22:09
> > To: solr-user@lucene.apache.org
> > Subject: eDismax query syntax question
> >
> > Recently we found strange behavior in a query. We use eDismax as the
> query parser.
> >
> > This is the query term:
> > 1,3-DIMETHYL-5-(3-PHENYL-ALLYLIDENE)-PYRIMIDINE-2,4,6-TRIONE
> >
> > It should hit one document in our index. It does not. However, if you
> use the Dismax query parser it does match the record.
> >
> > The problem seems to involve the parenthesis and the dashes. If you
> > escape the dash after the parenthesis it matches
> > 1,3-DIMETHYL-5-(3-PHENYL-ALLYLIDENE)\-PYRIMIDINE-2,4,6-TRIONE
> >
> > I thought that eDismax and Dismax escaped all lucene special characters
> before passing the query to lucene. Although I also remember reading that +
> and - can have special significance in a query if preceded with white
> space. I can find very little documentation on either query parser in how
> they work.
> >
> > Is this expected behavior or is this a bug? If expected, where can I
> find documentation?
> >
> >
> >
-- 
Andrea Gazzarini
*Search Consultant, R&D Software Engineer*


www.sease.io

email: a.gazzar...@sease.io
cell: +39 349 513 86 25


Re: Question about Atomic Update

2020-06-15 Thread david . davila
Hi Erick,


Thank you for your answer.

Unfortunately, our most important field is that text field, so we need to 
index it. We will have to accept that big documents take a long time to 
index.


Best,

David




David Dávila Atienza
AEAT - Departamento de Informática Tributaria
Subdirección de Tecnologías de Análisis de la Información e Investigación 
del Fraude
Telephone: 915828763
Extension: 36763



De: "Erick Erickson" 
Para:   solr-user@lucene.apache.org
Fecha:  15/06/2020 14:27
Asunto: Re: Question about Atomic Update



All Atomic Updates do is 
1> read all the stored fields from the record being updated
2> overlay your updates
3> re-index the document.

At <3> it’s exactly as though you sent the entire document
again, so your observation that the whole document is 
re-indexed is accurate.

If the fields you want to update are single-valued, docValues=true
numeric fields you can update those without the whole doc being
re-indexed. But if you need to search on those fields it’ll probably
be unacceptably slow. However, if you _do_ need to search,
sometimes you can get creative with function queries. OK, this
last is opaque but say you have a “quantity” field and only want to
find docs that have quantity > 0. You can add a function query
to your query (either q or fq) that returns the value of that field,
which means the score is 0 for docs where quantity==0 and the
doc drops out of the result set.

It’s not clear whether you search the text field, but if not you can
store it somewhere else and only fetch it as needed.

Best,
Erick

> On Jun 15, 2020, at 7:55 AM, david.dav...@correo.aeat.es wrote:
> 
> Hi,
> 
> I have a question related to atomic updates in Solr.
> 
> In our collection, documents have a lot of fields, most of them small.
> However, there is one of them that includes the text of the document.
> Sometimes, not often fortunately, this text is very long, more than 3 or 4
> MB of plain text. We use different analyzers such as synonyms, etc., and
> this causes the index time for those documents to be long, about 15 seconds.
> 
> Sometimes we need to update some small fields, and it is a big problem for
> us because of the time that it consumes. We have been testing with atomic
> update, but the time is exactly the same as sending the document again. We
> expected that with atomic update only the updated fields would be indexed
> and the time would be reduced. But it seems that internally Solr gets the
> whole document and reindexes all the fields.
> 
> Does it work that way? Am I wrong? Any advice?
> 
> We have tested with Solr 7.4 and Solr 4.10
> 
> Thanks,
> 
> David 

This message was sent from an email address external to the Agencia 
Tributaria. Please do not click on links or open attached documents 
unless you recognize the sender of the email and its subject matter.




 





RE: eDismax query syntax question

2020-06-15 Thread Webster Homer
Markus,
Thanks for the reference, but that doesn't answer my question. If - is a 
special character, it's not consistently special. In my example "3-DIMETHYL" 
behaves quite differently than ")-PYRIMIDINE". If I escape the closing 
parenthesis, the following minus no longer behaves specially. The referenced 
article does not even mention parentheses, yet escaping one changes the 
behavior of the following "-". In "3-DIMETHYL" the minus is not special.

These all fix the problem:
1,3-DIMETHYL-5-(3-PHENYL-ALLYLIDENE\)-PYRIMIDINE-2,4,6-TRIONE
1,3-DIMETHYL-5-(3-PHENYL-ALLYLIDENE)\-PYRIMIDINE-2,4,6-TRIONE
1,3-DIMETHYL-5-\(3-PHENYL-ALLYLIDENE\)-PYRIMIDINE-2,4,6-TRIONE

Only the minus following the parenthesis is treated as a NOT.
Are parentheses special? They're not mentioned in the eDismax documentation.

-Original Message-
From: Markus Jelsma 
Sent: Saturday, June 13, 2020 4:57 AM
To: solr-user@lucene.apache.org
Subject: RE: eDismax query syntax question

Hello,

These are special characters, if you don't need them, you must escape them.

See top of the article:
https://lucene.apache.org/solr/guide/8_5/the-extended-dismax-query-parser.html

Markus




-Original message-
> From:Webster Homer 
> Sent: Friday 12th June 2020 22:09
> To: solr-user@lucene.apache.org
> Subject: eDismax query syntax question
>
> Recently we found strange behavior in a query. We use eDismax as the query 
> parser.
>
> This is the query term:
> 1,3-DIMETHYL-5-(3-PHENYL-ALLYLIDENE)-PYRIMIDINE-2,4,6-TRIONE
>
> It should hit one document in our index. It does not. However, if you use the 
> Dismax query parser it does match the record.
>
> The problem seems to involve the parenthesis and the dashes. If you
> escape the dash after the parenthesis it matches
> 1,3-DIMETHYL-5-(3-PHENYL-ALLYLIDENE)\-PYRIMIDINE-2,4,6-TRIONE
>
> I thought that eDismax and Dismax escaped all lucene special characters 
> before passing the query to lucene. Although I also remember reading that + 
> and - can have special significance in a query if preceded with white space. 
> I can find very little documentation on either query parser in how they work.
>
> Is this expected behavior or is this a bug? If expected, where can I find 
> documentation?
>
>
>


This message and any attachment are confidential and may be privileged or 
otherwise protected from disclosure. If you are not the intended recipient, you 
must not copy this message or attachment or disclose the contents to any other 
person. If you have received this transmission in error, please notify the 
sender immediately and delete the message and any attachment from your system. 
Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not accept 
liability for any omissions or errors in this message which may arise as a 
result of E-Mail-transmission or for damages resulting from any unauthorized 
changes of the content of this message and any attachment thereto. Merck KGaA, 
Darmstadt, Germany and any of its subsidiaries do not guarantee that this 
message is free of viruses and does not accept liability for any damages caused 
by any virus transmitted therewith.



Click http://www.merckgroup.com/disclaimer to access the German, French, 
Spanish and Portuguese versions of this disclaimer.


Re: Question about Atomic Update

2020-06-15 Thread Erick Erickson
All Atomic Updates do is 
1> read all the stored fields from the record being updated
2> overlay your updates
3> re-index the document.

At <3> it’s exactly as though you sent the entire document
again, so your observation that the whole document is 
re-indexed is accurate.
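
As a small illustration (field names are made up), an atomic update is an
ordinary update request that uses modifiers such as "set" instead of full
field values, e.g. posted to /update as JSON:

   [ { "id": "doc1",
       "small_field": { "set": "new value" } } ]

Solr still reads the stored fields of doc1, applies the "set", and re-indexes
the whole resulting document, which is why the big text field costs the same
as re-sending the document.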

If the fields you want to update are single-valued, docValues=true
numeric fields you can update those without the whole doc being
re-indexed. But if you need to search on those fields it’ll probably
be unacceptably slow. However, if you _do_ need to search,
sometimes you can get creative with function queries. OK, this
last is opaque but say you have a “quantity” field and only want to
find docs that have quantity > 0. You can add a function query
to your query (either q or fq) that returns the value of that field,
which means the score is 0 for docs where quantity==0 and the
doc drops out of the result set.
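
As a rough sketch of what that can look like (field and type names are only
examples, not from your schema), an in-place updatable field is declared with
docValues only:

   <field name="quantity" type="pint" indexed="false" stored="false"
          docValues="true" multiValued="false"/>

and updated with the usual modifiers, e.g. [ { "id":"doc1", "quantity":{"inc":1} } ]
posted to /update. The quantity > 0 trick can also be written as a function
range filter, something like:

   fq={!frange l=0 incl=false}quantity

which keeps only documents whose quantity is greater than 0.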

It’s not clear whether you search the text field, but if not you can
store it somewhere else and only fetch it as needed.

Best,
Erick

> On Jun 15, 2020, at 7:55 AM, david.dav...@correo.aeat.es wrote:
> 
> Hi,
> 
> I have a question related to atomic updates in Solr.
> 
> In our collection, documents have a lot of fields, most of them small.
> However, there is one of them that includes the text of the document.
> Sometimes, not often fortunately, this text is very long, more than 3 or 4
> MB of plain text. We use different analyzers such as synonyms, etc., and
> this causes the index time for those documents to be long, about 15 seconds.
> 
> Sometimes we need to update some small fields, and it is a big problem for
> us because of the time that it consumes. We have been testing with atomic
> update, but the time is exactly the same as sending the document again. We
> expected that with atomic update only the updated fields would be indexed
> and the time would be reduced. But it seems that internally Solr gets the
> whole document and reindexes all the fields.
> 
> Does it work that way? Am I wrong? Any advice?
> 
> We have tested with Solr 7.4 and Solr 4.10
> 
> Thanks,
> 
> David 



Question about Atomic Update

2020-06-15 Thread david . davila
Hi,

I have a question related to atomic updates in Solr.

In our collection, documents have a lot of fields, most of them small.
However, there is one of them that includes the text of the document.
Sometimes, not often fortunately, this text is very long, more than 3 or 4
MB of plain text. We use different analyzers such as synonyms, etc., and
this causes the index time for those documents to be long, about 15 seconds.

Sometimes we need to update some small fields, and it is a big problem for
us because of the time that it consumes. We have been testing with atomic
update, but the time is exactly the same as sending the document again. We
expected that with atomic update only the updated fields would be indexed
and the time would be reduced. But it seems that internally Solr gets the
whole document and reindexes all the fields.

Does it work that way? Am I wrong? Any advice?

We have tested with Solr 7.4 and Solr 4.10

Thanks,

David 


RE: eDismax query syntax question

2020-06-13 Thread Markus Jelsma
Hello,

These are special characters, if you don't need them, you must escape them.

See top of the article:
https://lucene.apache.org/solr/guide/8_5/the-extended-dismax-query-parser.html
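
If the whole term should be treated literally, one option is to escape it on
the client side before building the query, for example with SolrJ (assuming a
Java client; other clients have similar helpers):

   import org.apache.solr.client.solrj.util.ClientUtils;

   String raw = "1,3-DIMETHYL-5-(3-PHENYL-ALLYLIDENE)-PYRIMIDINE-2,4,6-TRIONE";
   // escapes characters such as ( ) + - so they are not parsed as operators
   String escaped = ClientUtils.escapeQueryChars(raw);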

Markus


 
 
-Original message-
> From:Webster Homer 
> Sent: Friday 12th June 2020 22:09
> To: solr-user@lucene.apache.org
> Subject: eDismax query syntax question
> 
> Recently we found strange behavior in a query. We use eDismax as the query 
> parser.
> 
> This is the query term:
> 1,3-DIMETHYL-5-(3-PHENYL-ALLYLIDENE)-PYRIMIDINE-2,4,6-TRIONE
> 
> It should hit one document in our index. It does not. However, if you use the 
> Dismax query parser it does match the record.
> 
> The problem seems to involve the parenthesis and the dashes. If you escape 
> the dash after the parenthesis it matches
> 1,3-DIMETHYL-5-(3-PHENYL-ALLYLIDENE)\-PYRIMIDINE-2,4,6-TRIONE
> 
> I thought that eDismax and Dismax escaped all lucene special characters 
> before passing the query to lucene. Although I also remember reading that + 
> and - can have special significance in a query if preceded with white space. 
> I can find very little documentation on either query parser in how they work.
> 
> Is this expected behavior or is this a bug? If expected, where can I find 
> documentation?
> 
> 
> 


eDismax query syntax question

2020-06-12 Thread Webster Homer
Recently we found strange behavior in a query. We use eDismax as the query 
parser.

This is the query term:
1,3-DIMETHYL-5-(3-PHENYL-ALLYLIDENE)-PYRIMIDINE-2,4,6-TRIONE

It should hit one document in our index. It does not. However, if you use the 
Dismax query parser it does match the record.

The problem seems to involve the parenthesis and the dashes. If you escape the 
dash after the parenthesis it matches
1,3-DIMETHYL-5-(3-PHENYL-ALLYLIDENE)\-PYRIMIDINE-2,4,6-TRIONE

I thought that eDismax and Dismax escaped all lucene special characters before 
passing the query to lucene. Although I also remember reading that + and - can 
have special significance in a query if preceded with white space. I can find 
very little documentation on either query parser in how they work.

Is this expected behavior or is this a bug? If expected, where can I find 
documentation?



This message and any attachment are confidential and may be privileged or 
otherwise protected from disclosure. If you are not the intended recipient, you 
must not copy this message or attachment or disclose the contents to any other 
person. If you have received this transmission in error, please notify the 
sender immediately and delete the message and any attachment from your system. 
Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not accept 
liability for any omissions or errors in this message which may arise as a 
result of E-Mail-transmission or for damages resulting from any unauthorized 
changes of the content of this message and any attachment thereto. Merck KGaA, 
Darmstadt, Germany and any of its subsidiaries do not guarantee that this 
message is free of viruses and does not accept liability for any damages caused 
by any virus transmitted therewith.



Click http://www.merckgroup.com/disclaimer to access the German, French, 
Spanish and Portuguese versions of this disclaimer.


Re: question about setup for maximizing solr performance

2020-06-01 Thread Shawn Heisey

On 6/1/2020 9:29 AM, Odysci wrote:

Hi,
I'm looking for some advice on improving performance of our solr setup.




Does anyone have any insights on what would be better for maximizing
throughput on multiple searches being done at the same time?
thanks!


In almost all cases, adding memory will provide the best performance 
boost.  This is because memory is faster than disks, even SSD.  I have 
put relevant information on a wiki page so that it is easy for people to 
find and digest:


https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems

Thanks,
Shawn


question about setup for maximizing solr performance

2020-06-01 Thread Odysci
Hi,
I'm looking for some advice on improving performance of our solr setup. In
particular, about the trade-offs between applying larger machines, vs more
smaller machines. Our full index has just over 100 million docs, and we do
almost all searches using fq's (with q=*:*) and facets. We are using solr
8.3.

Currently, I have a solrcloud setup with 2 physical machines (let's call
them A and B), and my index is divided into 2 shards, and 2 replicas, such
that each machine has a full copy of the index.
The nodes and replicas are as follows:
Machine A:
  core_node3 / shard1_replica_n1
  core_node7 / shard2_replica_n4
Machine B:
  core_node5 / shard1_replica_n2
  core_node8 / shard2_replica_n6

My Zookeeper setup uses 3 instances. It's also the case that most of the
searches we do, we have results returning from both shards (from the same
search).

My experiments indicate that our setup is cpu-bound.
Due to cost constraints, I could either double the cpu in each of the 2
machines, or make it a 4-machine setup (using current-size machines) with 2
shards and 4 replicas (or 4 shards with 4 replicas). I assume that keeping
the full index on all machines will allow all searches to be evenly
distributed.

Does anyone have any insights on what would be better for maximizing
throughput on multiple searches being done at the same time?
thanks!

Reinaldo


Question for SOLR-14471

2020-05-26 Thread Kayak28
Hello, Solr community members:

I am working on translating Solr's release notes for every release.
Now, I am not clear about what SOLR-14471 actually fixes.

URL for SOLR-14471: https://issues.apache.org/jira/browse/SOLR-14471

My questions are the following.
- what does "all inherently equivalent groups of replicas mean"?
- does it mean children of the same shard?
- are they different from "all available replicas"?
- what does "last place" mean?
 - does it mean that a replica, which is created at the last?

Honestly, I am not familiar with Solr Cloud structure, so
I would be happy if anyone could help me to understand the issue.



-- 

Sincerely,
Kaya
github: https://github.com/28kayak


Re: LTR - FieldValueFeature Question

2020-04-26 Thread Dmitry Paramzin
It seems that in order to be available for FieldValueFeature score calculation, 
the field should be 'stored', otherwise it is not present in the document. It 
also seems that indexed/docValues do not matter:

  final IndexableField indexableField = document.getField(field);
  if (indexableField == null) {
return getDefaultValue();
  }
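
So a feature definition like the sketch below (feature and field names are
made up) only produces meaningful values when "popularity" is a stored field;
otherwise the extracted value falls back to the feature's default, hence the
'0.0' reported in the original question:

   {
     "name"  : "popularityValue",
     "class" : "org.apache.solr.ltr.feature.FieldValueFeature",
     "params": { "field" : "popularity" }
   }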

On 2020/04/24 06:39:30, Ashwin Ramesh  wrote: 
> Hi everybody,
> 
> Do we need to have 'indexed=true' to be able to retrieve the value of a
> field via FieldValueFeature or is having docValue=true enough?
> 
> Currently, we have some dynamic fields as [dynamicField=true, stored=false,
> indexed=false, docValue=true]. However when we noticing that the value
> extracted is '0.0'.
> 
> This is the code I read around FieldFeatureValue:
> https://github.com/apache/lucene-solr/blob/master/solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/FieldValueFeature.java
> 
> Thanks,
> 
> Ash
> 


LTR - FieldValueFeature Question

2020-04-24 Thread Ashwin Ramesh
Hi everybody,

Do we need to have 'indexed=true' to be able to retrieve the value of a
field via FieldValueFeature or is having docValue=true enough?

Currently, we have some dynamic fields as [dynamicField=true, stored=false,
indexed=false, docValue=true]. However, we are noticing that the value
extracted is '0.0'.

This is the code I read around FieldValueFeature:
https://github.com/apache/lucene-solr/blob/master/solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/FieldValueFeature.java

Thanks,

Ash



Re: A question about underscore

2020-04-06 Thread Erick Erickson
I _strongly_ urge you to become acquainted with the Admin UI, particularly the 
“analysis” section. It’ll show you exactly what transformations each step in 
your analysis chain performs.

Without you providing the fieldType definition, all I can do is guess but my 
guess is that you have WordDelimiterFilterFactory in your analysis chain, which 
splits on underscores. If you change that definition, you’ll have to re-index 
from scratch.
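
As an illustration only (this is not your actual schema), a chain like the
following will turn AAA_001 into the two tokens AAA and 001 at both index and
query time:

   <fieldType name="text_general" class="solr.TextField">
     <analyzer>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.WordDelimiterFilterFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
   </fieldType>

Removing the WordDelimiterFilterFactory (or using a string/keyword-style field)
keeps AAA_001 as a single token, but either change requires re-indexing.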

Best,
Erick

> On Apr 6, 2020, at 10:13 AM, chalaulait 808  
> wrote:
> 
> I am using Solr4.0.13 to implement the search function of the document
> management system.
> I am currently having issues with search results when the search string
> contains an underscore.
> For example, if I search for the character string "AAA_001", the search
> results will return results like "AAA" OR "001"
> 
> I checked the Solr manual below and found that some characters needed
> escaping, but they didn't include underscores.
> 
> https://archive.apache.org/dist/lucene/solr/ref-guide/apache-solr-ref-guide-4.10.pdf
> 
> Is this event a specification?
> Please let me know if any information is missing.
> Thank you.



A question about underscore

2020-04-06 Thread chalaulait 808
I am using Solr 4.0.13 to implement the search function of the document
management system.
I am currently having issues with search results when the search string
contains an underscore.
For example, if I search for the character string "AAA_001", the search
results will return results like "AAA" OR "001"

I checked the Solr manual below and found that some characters needed
escaping, but they didn't include underscores.

https://archive.apache.org/dist/lucene/solr/ref-guide/apache-solr-ref-guide-4.10.pdf

Is this behavior by design?
Please let me know if any information is missing.
Thank you.


Autoscaling question

2020-03-26 Thread Kudrettin Güleryüz
Hi,

I'd like to balance freedisk and cores across eight nodes. Here is my
cluster-preferences and cluster-policy:

{
  "responseHeader":{
"status":0,
"QTime":0},
  "cluster-preferences":[{
  "precision":10,
  "maximize":"freedisk"}
,{
  "minimize":"cores",
  "precision":10}
,{
  "minimize":"sysLoadAvg",
  "precision":3}],
  "cluster-policy":[{
  "freedisk":"<10",
  "replica":"0",
  "strict":"true"}],
  "triggers":{".auto_add_replicas":{
  "name":".auto_add_replicas",
  "event":"nodeLost",
  "waitFor":120,
  "actions":[{
  "name":"auto_add_replicas_plan",
  "class":"solr.AutoAddReplicasPlanAction"},
{
  "name":"execute_plan",
  "class":"solr.ExecutePlanAction"}],
  "enabled":true}},
  "listeners":{".auto_add_replicas.system":{
  "trigger":".auto_add_replicas",
  "afterAction":[],
  "stage":["STARTED",
"ABORTED",
"SUCCEEDED",
"FAILED",
"BEFORE_ACTION",
"AFTER_ACTION",
"IGNORED"],
  "class":"org.apache.solr.cloud.autoscaling.SystemLogListener",
  "beforeAction":[]}},
  "properties":{},
  "WARNING":"This response format is experimental.  It is likely to
change in the future."}

Can you help me understand why the least loaded node is test-54 in this case?
{
  "responseHeader":{
"status":0,
"QTime":1294},
  "diagnostics":{
"sortedNodes":[{
"node":"test-52:8983_solr",
"cores":99,
"freedisk":1136.8754272460938,
"sysLoadAvg":0.0},
  {
"node":"test-56:8983_solr",
"cores":99,
"freedisk":1045.345874786377,
"sysLoadAvg":6.0},
  {
"node":"test-51:8983_solr",
"cores":94,
"freedisk":1029.996826171875,
"sysLoadAvg":17.0},
  {
"node":"test-55:8983_solr",
"cores":98,
"freedisk":876.639045715332,
"sysLoadAvg":2.0},
  {
"node":"test-53:8983_solr",
"cores":91,
"freedisk":715.8955001831055,
"sysLoadAvg":17.0},
  {
"node":"test-58:8983_solr",
"cores":104,
"freedisk":927.1832389831543,
"sysLoadAvg":0.0},
  {
"node":"test-57:8983_solr",
"cores":120,
"freedisk":934.3348655700684,
"sysLoadAvg":0.0},
  {
"node":"test-54:8983_solr",
"cores":165,
"freedisk":580.5822525024414,
"sysLoadAvg":0.0}],
"violations":[]},
  "WARNING":"This response format is experimental.  It is likely to
change in the future."}

Solr 7.3.1 is running.

Thank you


Disastor Scenario Question Regarding Tlog+pull solrcloud setup

2020-03-04 Thread Sandeep Dharembra
Hi,

My question is about the solrcloud cluster we are trying to set up. We have a
collection with Tlog and pull type replicas. We intend to keep all the Tlog
replicas on one node and use it for writing, with pull replicas distributed
across the remaining nodes.

What we have noticed is that when the Tlog node goes down and, say, its disk
is also lost, the following happens when the node is brought back up (since
it was the only node holding Tlog replicas for all shards, they are not
removed from state.json):

1) There is no way to remove the node from the cluster, since the remaining
pull replicas cannot become leaders.

2) Since the disk is blank (like a new node), when solr comes up the Tlog
replicas remain down. If we create the core folders with only
core.properties, the Tlog replicas become active without any data, but in
this case the pull replicas sync against them and become blank as well.

Is there a way to temporarily stop this sync of pull replicas in solrcloud
mode while we populate the index on the Tlog replicas? We can do this in
legacy mode.

We understand that we can have multiple tlogs as replicas to solve this but
wanted to know how we can stop this replication.

Thanks,
Sandeep


Question About Solr Query Parser

2020-03-02 Thread Kayak28
Hello, Community:

I have a question about interpreting a parsed query from Debug Query.
I used Solr 8.4.1 and LuceneQueryParser.
I was learning the behavior of ManagedSynonymFilter because I was curious
about how "ManagedSynonymGraphFilter" fails to generate a graph.
So, I try to interpret the parsed query, which gives me:
MultiPhraseQuery(managed_synonym_filter_query:\"AA (B SYNONYM_AA_1)
SYNOYM_AA_2 SYNONYM_AA
SYNONYM_AA_3 SYNONYM_AA_4 SYNONYM_AA_5\")
when I query q=managed_synonym_filter_query:"AAB" (SYNONYM_AA_n means
synonyms for AA that is defined in the managed-resource (1 <= n <= 5) )

I wonder why (B SYNONYM_AA_1) appears and what these parentheses mean.

If anyone knows any reasons or clues, I would very much appreciate you
sharing the information.

Sincerely,
Kaya Ota


Re: Solr Cloud Question

2020-02-24 Thread Erick Erickson
Assuming that the Solr node you stop does not contain all of the replicas for 
any shard, there should really be no effect. I do strongly recommend that you 
stop the node gracefully if possible unless what you really want to test is 
when a node goes away mysteriously...

What’ll happen is that all of the replicas on the downed node will be marked as 
“down”. When the node comes back, the replicas on it will re-sync with the 
current leader and start handling queries and updates again. There should be no 
loss of data or data inconsistencies.
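
If you want to watch this happen while testing your alarms, the Collections
API cluster status call (path assumes the default /solr context) shows each
replica's state going from active to down and back:

   /solr/admin/collections?action=CLUSTERSTATUS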

Best,
Erick

> On Feb 24, 2020, at 4:40 PM, Kevin Sante  wrote:
> 
> Hello guys,
> 
> I need some help understanding the setup with solr cloud. I am a newbie to 
> solr and I have successfully set up solr cloud with some alarms on AWS.
> 
> I have a two solr nodes and 3 zookeeper nodes for my set up. I already have 
> data indexed on the nodes and I am able to query the data from my website.
> 
> The question I have is what impact it will have for me to stop one of the 
> solr cloud nodes and then restart it. I want to test if my alarms are right 
> or not.
> 
> Thank you
> 


Solr Cloud Question

2020-02-24 Thread Kevin Sante
Hello guys,

I need some help understanding the setup with solr cloud. I am a newbie to solr 
and I have successfully set up solr cloud with some alarms on AWS.

I have two solr nodes and 3 zookeeper nodes in my setup. I already have 
data indexed on the nodes and I am able to query the data from my website.

The question I have is what impact it will have for me to stop one of the solr 
cloud nodes and then restart it. I want to test if my alarms are right or not.

Thank you



Re: A question about solr filter cache

2020-02-18 Thread Erick Erickson
Again, depending on the version of Solr, the metrics endpoint (added in 
6.4) has a TON of information. Be prepared to wade through it for half a day to 
find out the things you need ;). There are something like 150 different metrics 
returned…

Frankly I don’t remember if cache RAM usage is one of them, but that’s what 
grep was made for ;)
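
For example, something like this (the prefix value is just one possibility)
narrows the output to the filterCache metrics:

   /solr/admin/metrics?group=core&prefix=CACHE.searcher.filterCache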

Best,
Erick



> On Feb 18, 2020, at 2:53 AM, Hongxu Ma  wrote:
> 
> @Vadim Ivanov<mailto:vadim.iva...@spb.ntk-intourist.ru>
> 
> Thank you!
> 
> From: Vadim Ivanov 
> Sent: Tuesday, February 18, 2020 15:27
> To: solr-user@lucene.apache.org 
> Subject: RE: A question about solr filter cache
> 
> Hi!
> Yes, it may depends on Solr version
> Solr 8.3 Admin filterCache page stats looks like:
> 
> stats:
> CACHE.searcher.filterCache.cleanupThread:false
> CACHE.searcher.filterCache.cumulative_evictions:0
> CACHE.searcher.filterCache.cumulative_hitratio:0.94
> CACHE.searcher.filterCache.cumulative_hits:198
> CACHE.searcher.filterCache.cumulative_idleEvictions:0
> CACHE.searcher.filterCache.cumulative_inserts:12
> CACHE.searcher.filterCache.cumulative_lookups:210
> CACHE.searcher.filterCache.evictions:0
> CACHE.searcher.filterCache.hitratio:1
> CACHE.searcher.filterCache.hits:84
> CACHE.searcher.filterCache.idleEvictions:0
> CACHE.searcher.filterCache.inserts:0
> CACHE.searcher.filterCache.lookups:84
> CACHE.searcher.filterCache.maxRamMB:-1
> CACHE.searcher.filterCache.ramBytesUsed:70768
> CACHE.searcher.filterCache.size:12
> CACHE.searcher.filterCache.warmupTime:1
> 
>> -Original Message-
>> From: Hongxu Ma [mailto:inte...@outlook.com]
>> Sent: Tuesday, February 18, 2020 5:32 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: A question about solr filter cache
>> 
>> @Erick Erickson<mailto:erickerick...@gmail.com> and @Mikhail Khludnev
>> 
>> got it, the explanation is very clear.
>> 
>> Thank you for your help.
>> 
>> From: Hongxu Ma 
>> Sent: Tuesday, February 18, 2020 10:22
>> To: Vadim Ivanov ; solr-
>> u...@lucene.apache.org 
>> Subject: Re: A question about solr filter cache
>> 
>> Thank you @Vadim Ivanov<mailto:vadim.iva...@spb.ntk-intourist.ru>
>> I know that admin page, but I cannot find the memory usage of filter cache
>> (only has "CACHE.searcher.filterCache.size", I think it's the used slot
> number
>> of filtercache)
>> 
>> There is my output (solr version 7.3.1):
>> 
>> filterCache
>>   class: org.apache.solr.search.FastLRUCache
>>   description: Concurrent LRU Cache(maxSize=512, initialSize=512,
>>     minSize=460, acceptableSize=486, cleanupThread=false)
>>   stats:
>>     CACHE.searcher.filterCache.cumulative_evictions: 0
>>     CACHE.searcher.filterCache.cumulative_hitratio: 0.5
>>     CACHE.searcher.filterCache.cumulative_hits: 1
>>     CACHE.searcher.filterCache.cumulative_inserts: 1
>>     CACHE.searcher.filterCache.cumulative_lookups: 2
>>     CACHE.searcher.filterCache.evictions: 0
>>     CACHE.searcher.filterCache.hitratio: 0.5
>>     CACHE.searcher.filterCache.hits: 1
>>     CACHE.searcher.filterCache.inserts: 1
>>     CACHE.searcher.filterCache.lookups: 2
>>     CACHE.searcher.filterCache.size: 1
>>     CACHE.searcher.filterCache.warmupTime: 0
>> 
>> 
>> 
>> 
>> From: Vadim Ivanov 
>> Sent: Monday, February 17, 2020 17:51
>> To: solr-user@lucene.apache.org 
>> Subject: RE: A question about solr filter cache
>> 
>> You can easily check amount of RAM used by core filterCache in Admin UI:
>> Choose core - Plugins/Stats - Cache - filterCache It shows useful
> information
>> on configuration, statistics and current RAM usage by filter cache, as
> well as
>> some examples of current filtercaches in RAM Core, for ex, with 10 mln
> docs
>> uses 1.3 MB of Ram for every filterCache
>> 
>> 
>>> -Original Message-
>>> From: Hongxu Ma [mailto:inte...@outlook.com]
>>> Sent: Monday, February 17, 2020 12:13 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: A question about solr filter cache

Re: A question about solr filter cache

2020-02-17 Thread Hongxu Ma
@Vadim Ivanov<mailto:vadim.iva...@spb.ntk-intourist.ru>

Thank you!

From: Vadim Ivanov 
Sent: Tuesday, February 18, 2020 15:27
To: solr-user@lucene.apache.org 
Subject: RE: A question about solr filter cache

Hi!
Yes, it may depends on Solr version
Solr 8.3 Admin filterCache page stats looks like:

stats:
CACHE.searcher.filterCache.cleanupThread:false
CACHE.searcher.filterCache.cumulative_evictions:0
CACHE.searcher.filterCache.cumulative_hitratio:0.94
CACHE.searcher.filterCache.cumulative_hits:198
CACHE.searcher.filterCache.cumulative_idleEvictions:0
CACHE.searcher.filterCache.cumulative_inserts:12
CACHE.searcher.filterCache.cumulative_lookups:210
CACHE.searcher.filterCache.evictions:0
CACHE.searcher.filterCache.hitratio:1
CACHE.searcher.filterCache.hits:84
CACHE.searcher.filterCache.idleEvictions:0
CACHE.searcher.filterCache.inserts:0
CACHE.searcher.filterCache.lookups:84
CACHE.searcher.filterCache.maxRamMB:-1
CACHE.searcher.filterCache.ramBytesUsed:70768
CACHE.searcher.filterCache.size:12
CACHE.searcher.filterCache.warmupTime:1

> -Original Message-
> From: Hongxu Ma [mailto:inte...@outlook.com]
> Sent: Tuesday, February 18, 2020 5:32 AM
> To: solr-user@lucene.apache.org
> Subject: Re: A question about solr filter cache
>
> @Erick Erickson<mailto:erickerick...@gmail.com> and @Mikhail Khludnev
>
> got it, the explanation is very clear.
>
> Thank you for your help.
> 
> From: Hongxu Ma 
> Sent: Tuesday, February 18, 2020 10:22
> To: Vadim Ivanov ; solr-
> u...@lucene.apache.org 
> Subject: Re: A question about solr filter cache
>
> Thank you @Vadim Ivanov<mailto:vadim.iva...@spb.ntk-intourist.ru>
> I know that admin page, but I cannot find the memory usage of filter cache
> (only has "CACHE.searcher.filterCache.size", I think it's the used slot
number
> of filtercache)
>
> There is my output (solr version 7.3.1):
>
> filterCache
>   class: org.apache.solr.search.FastLRUCache
>   description: Concurrent LRU Cache(maxSize=512, initialSize=512,
>     minSize=460, acceptableSize=486, cleanupThread=false)
>   stats:
>     CACHE.searcher.filterCache.cumulative_evictions: 0
>     CACHE.searcher.filterCache.cumulative_hitratio: 0.5
>     CACHE.searcher.filterCache.cumulative_hits: 1
>     CACHE.searcher.filterCache.cumulative_inserts: 1
>     CACHE.searcher.filterCache.cumulative_lookups: 2
>     CACHE.searcher.filterCache.evictions: 0
>     CACHE.searcher.filterCache.hitratio: 0.5
>     CACHE.searcher.filterCache.hits: 1
>     CACHE.searcher.filterCache.inserts: 1
>     CACHE.searcher.filterCache.lookups: 2
>     CACHE.searcher.filterCache.size: 1
>     CACHE.searcher.filterCache.warmupTime: 0
>
>
>
> 
> From: Vadim Ivanov 
> Sent: Monday, February 17, 2020 17:51
> To: solr-user@lucene.apache.org 
> Subject: RE: A question about solr filter cache
>
> You can easily check amount of RAM used by core filterCache in Admin UI:
> Choose core - Plugins/Stats - Cache - filterCache It shows useful
information
> on configuration, statistics and current RAM usage by filter cache, as
well as
> some examples of current filtercaches in RAM Core, for ex, with 10 mln
docs
> uses 1.3 MB of Ram for every filterCache
>
>
> > -Original Message-
> > From: Hongxu Ma [mailto:inte...@outlook.com]
> > Sent: Monday, February 17, 2020 12:13 PM
> > To: solr-user@lucene.apache.org
> > Subject: A question about solr filter cache
> >
> > Hi
> > I want to know the internal of solr filter cache, especially its
> > memory
> usage.
> >
> > I googled some pages:
> > https://teaspoon-consulting.com/articles/solr-cache-tuning.html
> > https://lucene.472066.n3.nabble.com/Solr-Filter-Cache-Size-td4120912.h
> > tml
> > (Erick Erickson's answer)
> >
> > All of them said its structure is: fq => a bitmap (total doc number
> > bits),
> but I
> > think it's not so simple, reason:
> > Given total doc number is 1 billion, each filter cache entry will use
> nearly
> > 1GB(10/8 bit), it's too big and very easy to make solr OOM (I
> > have
> a
> > 1 billion doc cluster, looks it works well)
> >
> > And I also checked solr node, but cannot find the details (only saw
> > using DocSets structure)
> >
> > So far, I guess:
> >
> >   *   degenerate into an doc id array/list when the bitmap is sparse
> >   *   using some compressed bitmap, e.g. roaring bitmaps
> >
> > which one is correct? or another answer, thanks you very much!
>




RE: A question about solr filter cache

2020-02-17 Thread Vadim Ivanov
Hi!
Yes, it may depend on the Solr version.
Solr 8.3 Admin filterCache page stats look like:

stats:
CACHE.searcher.filterCache.cleanupThread:false
CACHE.searcher.filterCache.cumulative_evictions:0
CACHE.searcher.filterCache.cumulative_hitratio:0.94
CACHE.searcher.filterCache.cumulative_hits:198
CACHE.searcher.filterCache.cumulative_idleEvictions:0
CACHE.searcher.filterCache.cumulative_inserts:12
CACHE.searcher.filterCache.cumulative_lookups:210
CACHE.searcher.filterCache.evictions:0
CACHE.searcher.filterCache.hitratio:1
CACHE.searcher.filterCache.hits:84
CACHE.searcher.filterCache.idleEvictions:0
CACHE.searcher.filterCache.inserts:0
CACHE.searcher.filterCache.lookups:84
CACHE.searcher.filterCache.maxRamMB:-1
CACHE.searcher.filterCache.ramBytesUsed:70768
CACHE.searcher.filterCache.size:12
CACHE.searcher.filterCache.warmupTime:1

> -Original Message-
> From: Hongxu Ma [mailto:inte...@outlook.com]
> Sent: Tuesday, February 18, 2020 5:32 AM
> To: solr-user@lucene.apache.org
> Subject: Re: A question about solr filter cache
> 
> @Erick Erickson<mailto:erickerick...@gmail.com> and @Mikhail Khludnev
> 
> got it, the explanation is very clear.
> 
> Thank you for your help.
> 
> From: Hongxu Ma 
> Sent: Tuesday, February 18, 2020 10:22
> To: Vadim Ivanov ; solr-
> u...@lucene.apache.org 
> Subject: Re: A question about solr filter cache
> 
> Thank you @Vadim Ivanov<mailto:vadim.iva...@spb.ntk-intourist.ru>
> I know that admin page, but I cannot find the memory usage of filter cache
> (only has "CACHE.searcher.filterCache.size", I think it's the used slot
number
> of filtercache)
> 
> There is my output (solr version 7.3.1):
> 
> filterCache
>   class: org.apache.solr.search.FastLRUCache
>   description: Concurrent LRU Cache(maxSize=512, initialSize=512,
>     minSize=460, acceptableSize=486, cleanupThread=false)
>   stats:
>     CACHE.searcher.filterCache.cumulative_evictions: 0
>     CACHE.searcher.filterCache.cumulative_hitratio: 0.5
>     CACHE.searcher.filterCache.cumulative_hits: 1
>     CACHE.searcher.filterCache.cumulative_inserts: 1
>     CACHE.searcher.filterCache.cumulative_lookups: 2
>     CACHE.searcher.filterCache.evictions: 0
>     CACHE.searcher.filterCache.hitratio: 0.5
>     CACHE.searcher.filterCache.hits: 1
>     CACHE.searcher.filterCache.inserts: 1
>     CACHE.searcher.filterCache.lookups: 2
>     CACHE.searcher.filterCache.size: 1
>     CACHE.searcher.filterCache.warmupTime: 0
> 
> 
> 
> 
> From: Vadim Ivanov 
> Sent: Monday, February 17, 2020 17:51
> To: solr-user@lucene.apache.org 
> Subject: RE: A question about solr filter cache
> 
> You can easily check amount of RAM used by core filterCache in Admin UI:
> Choose core - Plugins/Stats - Cache - filterCache It shows useful
information
> on configuration, statistics and current RAM usage by filter cache, as
well as
> some examples of current filtercaches in RAM Core, for ex, with 10 mln
docs
> uses 1.3 MB of Ram for every filterCache
> 
> 
> > -Original Message-
> > From: Hongxu Ma [mailto:inte...@outlook.com]
> > Sent: Monday, February 17, 2020 12:13 PM
> > To: solr-user@lucene.apache.org
> > Subject: A question about solr filter cache
> >
> > Hi
> > I want to know the internal of solr filter cache, especially its
> > memory
> usage.
> >
> > I googled some pages:
> > https://teaspoon-consulting.com/articles/solr-cache-tuning.html
> > https://lucene.472066.n3.nabble.com/Solr-Filter-Cache-Size-td4120912.h
> > tml
> > (Erick Erickson's answer)
> >
> > All of them said its structure is: fq => a bitmap (total doc number
> > bits),
> but I
> > think it's not so simple, reason:
> > Given total doc number is 1 billion, each filter cache entry will use
> nearly
> > 1GB(10/8 bit), it's too big and very easy to make solr OOM (I
> > have
> a
> > 1 billion doc cluster, looks it works well)
> >
> > And I also checked solr node, but cannot find the details (only saw
> > using DocSets structure)
> >
> > So far, I guess:
> >
> >   *   degenerate into an doc id array/list when the bitmap is sparse
> >   *   using some compressed bitmap, e.g. roaring bitmaps
> >
> > which one is correct? or another answer, thanks you very much!
> 




Re: A question about solr filter cache

2020-02-17 Thread Hongxu Ma
@Erick Erickson<mailto:erickerick...@gmail.com> and @Mikhail Khludnev

got it, the explanation is very clear.

Thank you for your help.

From: Hongxu Ma 
Sent: Tuesday, February 18, 2020 10:22
To: Vadim Ivanov ; 
solr-user@lucene.apache.org 
Subject: Re: A question about solr filter cache

Thank you @Vadim Ivanov<mailto:vadim.iva...@spb.ntk-intourist.ru>
I know that admin page, but I cannot find the memory usage of filter cache 
(only has "CACHE.searcher.filterCache.size", I think it's the used slot number 
of filtercache)

There is my output (solr version 7.3.1):

filterCache
  class: org.apache.solr.search.FastLRUCache
  description: Concurrent LRU Cache(maxSize=512, initialSize=512,
    minSize=460, acceptableSize=486, cleanupThread=false)
  stats:
    CACHE.searcher.filterCache.cumulative_evictions: 0
    CACHE.searcher.filterCache.cumulative_hitratio: 0.5
    CACHE.searcher.filterCache.cumulative_hits: 1
    CACHE.searcher.filterCache.cumulative_inserts: 1
    CACHE.searcher.filterCache.cumulative_lookups: 2
    CACHE.searcher.filterCache.evictions: 0
    CACHE.searcher.filterCache.hitratio: 0.5
    CACHE.searcher.filterCache.hits: 1
    CACHE.searcher.filterCache.inserts: 1
    CACHE.searcher.filterCache.lookups: 2
    CACHE.searcher.filterCache.size: 1
    CACHE.searcher.filterCache.warmupTime: 0




From: Vadim Ivanov 
Sent: Monday, February 17, 2020 17:51
To: solr-user@lucene.apache.org 
Subject: RE: A question about solr filter cache

You can easily check amount of RAM used by core filterCache in Admin UI:
Choose core - Plugins/Stats - Cache - filterCache
It shows useful information on configuration, statistics and current RAM
usage by filter cache,
as well as some examples of current filtercaches in RAM
Core, for ex, with 10 mln docs uses 1.3 MB of Ram for every filterCache


> -Original Message-
> From: Hongxu Ma [mailto:inte...@outlook.com]
> Sent: Monday, February 17, 2020 12:13 PM
> To: solr-user@lucene.apache.org
> Subject: A question about solr filter cache
>
> Hi
> I want to know the internal of solr filter cache, especially its memory
usage.
>
> I googled some pages:
> https://teaspoon-consulting.com/articles/solr-cache-tuning.html
> https://lucene.472066.n3.nabble.com/Solr-Filter-Cache-Size-td4120912.html
> (Erick Erickson's answer)
>
> All of them said its structure is: fq => a bitmap (total doc number bits),
but I
> think it's not so simple, reason:
> Given total doc number is 1 billion, each filter cache entry will use
nearly
> 1GB(10/8 bit), it's too big and very easy to make solr OOM (I have
a
> 1 billion doc cluster, looks it works well)
>
> And I also checked solr node, but cannot find the details (only saw using
> DocSets structure)
>
> So far, I guess:
>
>   *   degenerate into an doc id array/list when the bitmap is sparse
>   *   using some compressed bitmap, e.g. roaring bitmaps
>
> which one is correct? or another answer, thanks you very much!




Re: A question about solr filter cache

2020-02-17 Thread Hongxu Ma
Thank you @Vadim Ivanov<mailto:vadim.iva...@spb.ntk-intourist.ru>
I know that admin page, but I cannot find the memory usage of the filter cache 
(it only has "CACHE.searcher.filterCache.size", which I think is the number of 
used slots in the filterCache).

There is my output (solr version 7.3.1):

filterCache
  class: org.apache.solr.search.FastLRUCache
  description: Concurrent LRU Cache(maxSize=512, initialSize=512,
    minSize=460, acceptableSize=486, cleanupThread=false)
  stats:
    CACHE.searcher.filterCache.cumulative_evictions: 0
    CACHE.searcher.filterCache.cumulative_hitratio: 0.5
    CACHE.searcher.filterCache.cumulative_hits: 1
    CACHE.searcher.filterCache.cumulative_inserts: 1
    CACHE.searcher.filterCache.cumulative_lookups: 2
    CACHE.searcher.filterCache.evictions: 0
    CACHE.searcher.filterCache.hitratio: 0.5
    CACHE.searcher.filterCache.hits: 1
    CACHE.searcher.filterCache.inserts: 1
    CACHE.searcher.filterCache.lookups: 2
    CACHE.searcher.filterCache.size: 1
    CACHE.searcher.filterCache.warmupTime: 0




From: Vadim Ivanov 
Sent: Monday, February 17, 2020 17:51
To: solr-user@lucene.apache.org 
Subject: RE: A question about solr filter cache

You can easily check amount of RAM used by core filterCache in Admin UI:
Choose core - Plugins/Stats - Cache - filterCache
It shows useful information on configuration, statistics and current RAM
usage by filter cache,
as well as some examples of current filtercaches in RAM
Core, for ex, with 10 mln docs uses 1.3 MB of Ram for every filterCache


> -Original Message-
> From: Hongxu Ma [mailto:inte...@outlook.com]
> Sent: Monday, February 17, 2020 12:13 PM
> To: solr-user@lucene.apache.org
> Subject: A question about solr filter cache
>
> Hi
> I want to know the internal of solr filter cache, especially its memory
usage.
>
> I googled some pages:
> https://teaspoon-consulting.com/articles/solr-cache-tuning.html
> https://lucene.472066.n3.nabble.com/Solr-Filter-Cache-Size-td4120912.html
> (Erick Erickson's answer)
>
> All of them said its structure is: fq => a bitmap (total doc number bits),
but I
> think it's not so simple, reason:
> Given total doc number is 1 billion, each filter cache entry will use
nearly
> 1GB(10/8 bit), it's too big and very easy to make solr OOM (I have
a
> 1 billion doc cluster, looks it works well)
>
> And I also checked solr node, but cannot find the details (only saw using
> DocSets structure)
>
> So far, I guess:
>
>   *   degenerate into an doc id array/list when the bitmap is sparse
>   *   using some compressed bitmap, e.g. roaring bitmaps
>
> which one is correct? or another answer, thanks you very much!




Re: A question about solr filter cache

2020-02-17 Thread Erick Erickson
That’s the upper limit of a filter cache entry (maxDoc/8). For low numbers of 
hits,
more space-efficient structures are used. Specifically a list of doc IDs is 
kept. So say
you have an fq clause that marks 10 doc. The filterCache entry is closer to 40 
bytes
+ sizeof(query object) etc.

Still, it’s what you have to be prepared for.

filterCache is local to the core. So if you split the index into 8 shards, each 
would have 128M docs or so and each filterCache entry would be bounded by about 
128M/8 bytes.

Checking the filterCache via the admin UI is a way to find current usage, but 
be sure it’s
full. The memory is allocated as needed, not up front.

All that said, you’re certainly right, the filterCache can certainly lead to 
OOMs.
What I try to emphasize to people is that they cannot allocate huge filterCaches
without considering memory implications...
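
For instance, with a definition like this in solrconfig.xml (sizes are only an
example):

   <filterCache class="solr.FastLRUCache"
                size="512"
                initialSize="512"
                autowarmCount="0"/>

the worst case is roughly 512 entries times maxDoc/8 bytes each, per core, so
the size needs to be chosen with maxDoc in mind.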

Best,
Erick

> On Feb 17, 2020, at 4:51 AM, Vadim Ivanov  
> wrote:
> 
> You can easily check amount of RAM used by core filterCache in Admin UI:
> Choose core - Plugins/Stats - Cache - filterCache
> It shows useful information on configuration, statistics and current RAM
> usage by filter cache,
> as well as some examples of current filtercaches in RAM
> Core, for ex, with 10 mln docs uses 1.3 MB of Ram for every filterCache
> 
> 
>> -Original Message-
>> From: Hongxu Ma [mailto:inte...@outlook.com]
>> Sent: Monday, February 17, 2020 12:13 PM
>> To: solr-user@lucene.apache.org
>> Subject: A question about solr filter cache
>> 
>> Hi
>> I want to know the internal of solr filter cache, especially its memory
> usage.
>> 
>> I googled some pages:
>> https://teaspoon-consulting.com/articles/solr-cache-tuning.html
>> https://lucene.472066.n3.nabble.com/Solr-Filter-Cache-Size-td4120912.html
>> (Erick Erickson's answer)
>> 
>> All of them said its structure is: fq => a bitmap (total doc number bits),
> but I
>> think it's not so simple, reason:
>> Given total doc number is 1 billion, each filter cache entry will use
> nearly
>> 1GB(10/8 bit), it's too big and very easy to make solr OOM (I have
> a
>> 1 billion doc cluster, looks it works well)
>> 
>> And I also checked solr node, but cannot find the details (only saw using
>> DocSets structure)
>> 
>> So far, I guess:
>> 
>>  *   degenerate into an doc id array/list when the bitmap is sparse
>>  *   using some compressed bitmap, e.g. roaring bitmaps
>> 
>> which one is correct? or another answer, thanks you very much!
> 
> 



RE: A question about solr filter cache

2020-02-17 Thread Vadim Ivanov
You can easily check the amount of RAM used by a core's filterCache in the Admin UI:
Choose core - Plugins/Stats - Cache - filterCache
It shows useful information on configuration, statistics and current RAM
usage by the filter cache, as well as some examples of current filter caches
in RAM. A core with 10 mln docs, for example, uses 1.3 MB of RAM for every
filterCache entry.


> -Original Message-
> From: Hongxu Ma [mailto:inte...@outlook.com]
> Sent: Monday, February 17, 2020 12:13 PM
> To: solr-user@lucene.apache.org
> Subject: A question about solr filter cache
> 
> Hi
> I want to know the internal of solr filter cache, especially its memory
usage.
> 
> I googled some pages:
> https://teaspoon-consulting.com/articles/solr-cache-tuning.html
> https://lucene.472066.n3.nabble.com/Solr-Filter-Cache-Size-td4120912.html
> (Erick Erickson's answer)
> 
> All of them said its structure is: fq => a bitmap (total doc number bits),
but I
> think it's not so simple, reason:
> Given total doc number is 1 billion, each filter cache entry will use
nearly
> 1GB(10/8 bit), it's too big and very easy to make solr OOM (I have
a
> 1 billion doc cluster, looks it works well)
> 
> And I also checked solr node, but cannot find the details (only saw using
> DocSets structure)
> 
> So far, I guess:
> 
>   *   degenerate into an doc id array/list when the bitmap is sparse
>   *   using some compressed bitmap, e.g. roaring bitmaps
> 
> which one is correct? or another answer, thanks you very much!




Re: A question about solr filter cache

2020-02-17 Thread Mikhail Khludnev
Hello,
The former
https://github.com/apache/lucene-solr/blob/188f620208012ba1d726b743c5934abf01988d57/solr/core/src/java/org/apache/solr/search/DocSetCollector.java#L84
More efficient sets (roaring and/or elias-fano, iirc) are present in Lucene,
but not yet used in Solr.

On Mon, Feb 17, 2020 at 1:13 AM Hongxu Ma  wrote:

> Hi
> I want to know the internal of solr filter cache, especially its memory
> usage.
>
> I googled some pages:
> https://teaspoon-consulting.com/articles/solr-cache-tuning.html
> https://lucene.472066.n3.nabble.com/Solr-Filter-Cache-Size-td4120912.html
> (Erick Erickson's answer)
>
> All of them said its structure is: fq => a bitmap (total doc number bits),
> but I think it's not so simple, reason:
> Given total doc number is 1 billion, each filter cache entry will use
> nearly 1GB(10/8 bit), it's too big and very easy to make solr OOM
> (I have a 1 billion doc cluster, looks it works well)
>
> And I also checked solr node, but cannot find the details (only saw using
> DocSets structure)
>
> So far, I guess:
>
>   *   degenerate into an doc id array/list when the bitmap is sparse
>   *   using some compressed bitmap, e.g. roaring bitmaps
>
> which one is correct? or another answer, thanks you very much!
>
>

-- 
Sincerely yours
Mikhail Khludnev


Re: A question about solr filter cache

2020-02-17 Thread Nicolas Franck
If 1GB per filter cache entry were enough to make solr go out of memory,
then it would have already happened during the initial upload of the
solr documents. Imagine the amount of memory you need for one billion
documents...
A filter cache would be the least of your problems; 1GB is small in comparison
to the entire solr index.

> On 17 Feb 2020, at 10:13, Hongxu Ma  wrote:
> 
> Hi
> I want to know the internal of solr filter cache, especially its memory usage.
> 
> I googled some pages:
> https://teaspoon-consulting.com/articles/solr-cache-tuning.html
> https://lucene.472066.n3.nabble.com/Solr-Filter-Cache-Size-td4120912.html 
> (Erick Erickson's answer)
> 
> All of them said its structure is: fq => a bitmap (total doc number bits), 
> but I think it's not so simple, reason:
> Given total doc number is 1 billion, each filter cache entry will use nearly 
> 1GB (10^9/8 bit), it's too big and very easy to make solr OOM (I have a 
> 1 billion doc cluster, looks it works well)
> 
> And I also checked solr node, but cannot find the details (only saw using 
> DocSets structure)
> 
> So far, I guess:
> 
>  *   degenerate into an doc id array/list when the bitmap is sparse
>  *   using some compressed bitmap, e.g. roaring bitmaps
> 
> which one is correct? or another answer, thanks you very much!
> 



A question about solr filter cache

2020-02-17 Thread Hongxu Ma
Hi
I want to know the internal of solr filter cache, especially its memory usage.

I googled some pages:
https://teaspoon-consulting.com/articles/solr-cache-tuning.html
https://lucene.472066.n3.nabble.com/Solr-Filter-Cache-Size-td4120912.html 
(Erick Erickson's answer)

All of them said its structure is: fq => a bitmap (total doc number bits), but 
I think it's not so simple, for this reason:
Given a total doc number of 1 billion, each filter cache entry will use nearly 
1GB (10^9/8 bit); that's too big and could easily make solr OOM (yet I have a 
1 billion doc cluster and it looks like it works well).

And I also checked the solr code, but cannot find the details (I only saw that 
it uses a DocSet structure).

So far, I guess it either:

  *   degenerates into a doc id array/list when the bitmap is sparse
  *   uses some compressed bitmap, e.g. roaring bitmaps

Which one is correct? Or is there another answer? Thank you very much!



Re: Question about the max num of solr node

2020-01-03 Thread Jörn Franke
Why do you want to set up so many? What are your designs in terms of volumes / 
no of documents etc? 


> Am 03.01.2020 um 10:32 schrieb Hongxu Ma :
> 
> Hi community
> I plan to set up a 128 host cluster: 2 solr nodes on each host.
> But I have a little concern about whether solr can support so many nodes.
> 
> I searched on wiki and found:
> https://cwiki.apache.org/confluence/display/SOLR/2019-11+Meeting+on+SolrCloud+and+project+health
> "If you create thousands of collections, it’ll lock up and become inoperable. 
>  Scott reported that If you boot up a 100+ node cluster, SolrCloud won’t get 
> to a happy state; currently you need to start them gradually."
> 
> I wonder to know:
> Beside the quoted items, does solr have known issues in a big cluster?
> And does solr have a hard limit number of max node?
> 
> Thanks.


Question about the max num of solr node

2020-01-03 Thread Hongxu Ma
Hi community
I plan to set up a 128 host cluster: 2 solr nodes on each host.
But I have a little concern about whether solr can support so many nodes.

I searched on wiki and found:
https://cwiki.apache.org/confluence/display/SOLR/2019-11+Meeting+on+SolrCloud+and+project+health
"If you create thousands of collections, it’ll lock up and become inoperable.  
Scott reported that If you boot up a 100+ node cluster, SolrCloud won’t get to 
a happy state; currently you need to start them gradually."

I want to know:
Besides the quoted items, does solr have any known issues in a big cluster?
And does solr have a hard limit on the maximum number of nodes?

Thanks.


Re: A question of solr recovery

2019-12-12 Thread Hongxu Ma
Thank you @Erick Erickson for your explanation! (although I don't fully 
understand all the details).

I am using Solr 6.6, so I think there are only NRT replicas in this version, and 
I understand the whole recovery process now.

Maybe I will upgrade to Solr 7+ in the future, and try the new TLOG/PULL replicas.
Thanks.
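
For when you do upgrade, here is a minimal sketch of creating a collection with
TLOG and PULL replicas on Solr 7+ (collection name, shard/replica counts and
host/port are placeholders):

curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=test_tlog&numShards=2&tlogReplicas=2&pullReplicas=1"

With TLOG/PULL replicas, a recovering replica can pull only the changed segments
from the leader, instead of the full copy that an NRT replica often needs.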



From: Erick Erickson 
Sent: Thursday, December 12, 2019 22:49
To: Hongxu Ma 
Subject: Re: A question of solr recovery

If you’re using TLOG/PULL replica types, then only changed segments
are downloaded. That replication pattern has a very different
algorithm. The problem with NRT replicas is that segments on
different replicas may not contain the same documents (in fact,
almost all the time won’t). This is because the wall-clock time
that the autocommit interval expires at, which closes segments,
will be different due to network delays and the like. This was
a deliberate design choice to make indexing as fast as possible
in distributed situations. If the leader coordinated all the commits,
it’d introduce a delay, potentially quite long if, say, the leader
needed to wait for a timeout.

Even if commits were exactly synchronous over all replicas in a shard,
the leader indexes a document and forwards it to the replica. The
commit could expire on both while the doc was in-flight.

Best,
Erick

On Dec 12, 2019, at 5:37 AM, Hongxu Ma  wrote:

Thank you very much @Erick Erickson
It's very clear.

And I found my "full sync" log:
"IndexFetcher Total time taken for download 
(fullCopy=true,bytesDownloaded=178161685180) : 4377 secs (40704063 bytes/sec) 
to NIOFSDirectory@..."

A more question:
Form the log, looks it downloaded all segment files (178GB), it's very big and 
took a long time.
Is it possible only download the segment file which contains the missing part? 
No need all files, maybe it can save time?

For example, there is my fabricated algorithm (like database does):
• recovery form local tlog as much as possible
• calculate the latest version
• only download the segment file which contains data > this version
Thanks.

From: Erick Erickson 
Sent: Wednesday, December 11, 2019 20:56
To: solr-user@lucene.apache.org 
Subject: Re: A question of solr recovery

Updates in this context are individual documents, either new ones
or a new version of an existing document. Long recoveries are
quite unlikely to be replaying a few documents from the tlog.

My bet is that you had to do a “full sync” (there should be messages
to that effect in the Solr log). This means that the replica had to
copy the entire index from the leader, and that varies with the size
of the index, network speed and contention, etc.

And to make it more complicated, and despite the comment about 100
docs and the tlog…. while that copy is going on, _new_ updates are
written to the tlog of the recovering replica and after the index
has been copied, those new updates are replayed locally. The 100
doc limit does _not_ apply in this case. So say the recovery starts
at time T and lasts for 60 seconds. All updates sent to the shard
leader over that 60 seconds are put in the local tlog and after the
copy is done, they’re replayed. And then, you guessed it, any
updates received by the leader over that 60 second period are written
to the recovering replica’s tlog and replayed… Under heavy
indexing loads, this can go on for quite a long time. Not certain
that’s what’s happening, but something to be aware of.

Best,
Erick

On Dec 10, 2019, at 10:39 PM, Hongxu Ma  wrote:

Hi all
In my cluster, Solr node turned into long time recovery sometimes.
So I want to know more about recovery and have read a good blog:
https://lucidworks.com/post/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

It mentioned in the recovery section:
"Replays the documents from its own tlog if < 100 new updates have been 
received by the leader. "

My question: what's the meaning of "updates"? commits? or documents?
I refered solr code but still not sure about it.

Hope you can help, thanks.




Re: A question of solr recovery

2019-12-12 Thread Shawn Heisey

On 12/12/2019 8:53 AM, Shawn Heisey wrote:
I do not think the replication handler deals with tlog files at all. The 
transaction log capability did not exist when the replication handler 
was built.


I may have mixed up your message with a different one.  Looking back 
over this, I don't see any mention of the replication handler ... 
that's in another thread about making backups.  Apologies.


Thanks,
Shawn


Re: A question of solr recovery

2019-12-12 Thread Shawn Heisey

On 12/12/2019 3:37 AM, Hongxu Ma wrote:

And I found my "full sync" log:
"IndexFetcher Total time taken for download 
(fullCopy=true,bytesDownloaded=178161685180) : 4377 secs (40704063 bytes/sec) to 
NIOFSDirectory@..."

A more question:
Form the log, looks it downloaded all segment files (178GB), it's very big and 
took a long time.
Is it possible only download the segment file which contains the missing part? 
No need all files, maybe it can save time?


I'm finding the code in the replication handler to be difficult to 
follow ... but I do not think there is any way to have it do a partial 
or catchup backup copy.  I believe it must copy the entire index every time.


To achieve what you're after, using an alternate backup strategy would 
be advisable.  There are two good ways:


1) Assuming you're on an OS that supports them, make a hardlink copy of 
the data directory to an alternate location on the same filesystem. 
This should happen almost instantly.  Then at your leisure, copy from 
the hardlink copy to your final destination.  When finished, delete the 
hardlink copy.


2) Use rsync (probably with "-avH --delete" options) to copy from the 
current index directory to another location.  This will skip any 
segments that already exist in the destination.


You could potentially even combine those two options in your final solution.
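
A minimal sketch of both approaches, assuming the index lives in
/var/solr/data/mycore/data/index and the backup target is /backup (both paths
are placeholders):

# 1) near-instant hardlink snapshot on the same filesystem
cp -al /var/solr/data/mycore/data/index /var/solr/data/mycore/data/index.snapshot
# ... copy it off at your leisure, then remove the snapshot
rsync -av /var/solr/data/mycore/data/index.snapshot/ /backup/mycore/
rm -rf /var/solr/data/mycore/data/index.snapshot

# 2) incremental copy that skips segment files already in the destination
rsync -avH --delete /var/solr/data/mycore/data/index/ /backup/mycore/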

I do not think the replication handler deals with tlog files at all. 
The transaction log capability did not exist when the replication 
handler was built.


Thanks,
Shawn


Re: A question of solr recovery

2019-12-12 Thread Hongxu Ma
Thank you very much @Erick Erickson
It's very clear.

And I found my "full sync" log:
"IndexFetcher Total time taken for download 
(fullCopy=true,bytesDownloaded=178161685180) : 4377 secs (40704063 bytes/sec) 
to NIOFSDirectory@..."

One more question:
From the log, it looks like it downloaded all segment files (178GB); that's very 
big and took a long time.
Is it possible to only download the segment files which contain the missing 
part? Not needing all the files might save time.

For example, here is my imagined algorithm (like a database does):

  *   recover from the local tlog as much as possible
  *   calculate the latest version
  *   only download the segment files which contain data > this version

Thanks.


From: Erick Erickson 
Sent: Wednesday, December 11, 2019 20:56
To: solr-user@lucene.apache.org 
Subject: Re: A question of solr recovery

Updates in this context are individual documents, either new ones
or a new version of an existing document. Long recoveries are
quite unlikely to be replaying a few documents from the tlog.

My bet is that you had to do a “full sync” (there should be messages
to that effect in the Solr log). This means that the replica had to
copy the entire index from the leader, and that varies with the size
of the index, network speed and contention, etc.

And to make it more complicated, and despite the comment about 100
docs and the tlog…. while that copy is going on, _new_ updates are
written to the tlog of the recovering replica and after the index
has been copied, those new updates are replayed locally. The 100
doc limit does _not_ apply in this case. So say the recovery starts
at time T and lasts for 60 seconds. All updates sent to the shard
leader over that 60 seconds are put in the local tlog and after the
copy is done, they’re replayed. And then, you guessed it, any
updates received by the leader over that 60 second period are written
to the recovering replica’s tlog and replayed… Under heavy
indexing loads, this can go on for quite a long time. Not certain
that’s what’s happening, but something to be aware of.

Best,
Erick

> On Dec 10, 2019, at 10:39 PM, Hongxu Ma  wrote:
>
> Hi all
> In my cluster, Solr node turned into long time recovery sometimes.
> So I want to know more about recovery and have read a good blog:
> https://lucidworks.com/post/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> It mentioned in the recovery section:
> "Replays the documents from its own tlog if < 100 new updates have been 
> received by the leader. "
>
> My question: what's the meaning of "updates"? commits? or documents?
> I refered solr code but still not sure about it.
>
> Hope you can help, thanks.
>



Re: A question of solr recovery

2019-12-11 Thread Erick Erickson
Updates in this context are individual documents, either new ones
or a new version of an existing document. Long recoveries are
quite unlikely to be replaying a few documents from the tlog.

My bet is that you had to do a “full sync” (there should be messages
to that effect in the Solr log). This means that the replica had to
copy the entire index from the leader, and that varies with the size
of the index, network speed and contention, etc.

And to make it more complicated, and despite the comment about 100
docs and the tlog…. while that copy is going on, _new_ updates are
written to the tlog of the recovering replica and after the index
has been copied, those new updates are replayed locally. The 100
doc limit does _not_ apply in this case. So say the recovery starts
at time T and lasts for 60 seconds. All updates sent to the shard
leader over that 60 seconds are put in the local tlog and after the
copy is done, they’re replayed. And then, you guessed it, any
updates received by the leader over that 60 second period are written
to the recovering replica’s tlog and replayed… Under heavy
indexing loads, this can go on for quite a long time. Not certain
that’s what’s happening, but something to be aware of.

Best,
Erick

> On Dec 10, 2019, at 10:39 PM, Hongxu Ma  wrote:
> 
> Hi all
> In my cluster, Solr node turned into long time recovery sometimes.
> So I want to know more about recovery and have read a good blog:
> https://lucidworks.com/post/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> 
> It mentioned in the recovery section:
> "Replays the documents from its own tlog if < 100 new updates have been 
> received by the leader. "
> 
> My question: what's the meaning of "updates"? commits? or documents?
> I refered solr code but still not sure about it.
> 
> Hope you can help, thanks.
> 



A question of solr recovery

2019-12-10 Thread Hongxu Ma
Hi all
In my cluster, a Solr node sometimes goes into a long recovery.
So I want to know more about recovery and have read a good blog:
https://lucidworks.com/post/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

It mentioned in the recovery section:
"Replays the documents from its own tlog if < 100 new updates have been 
received by the leader. "

My question: what's the meaning of "updates"? commits? or documents?
I referred to the solr code but am still not sure about it.

Hope you can help, thanks.



Re: hi question about solr

2019-12-03 Thread Paras Lehana
That's not my question. It's a suggestion. I was asking if Highlighting
could fulfill your requirement?

On Tue, 3 Dec 2019 at 17:31, Bernd Fehling 
wrote:

> No, I don't use any highlighting.
>
> > On 03.12.19 at 12:28, Paras Lehana wrote:
> > Hi Bernd,
> >
> > Have you gone through Highlighting
> > <https://lucene.apache.org/solr/guide/8_3/highlighting.html>?
> >
> > On Mon, 2 Dec 2019 at 17:00, eli chen  wrote:
> >
> >> yes
> >>
> >> On Mon, 2 Dec 2019 at 13:29, Bernd Fehling <
> bernd.fehl...@uni-bielefeld.de
> >>>
> >> wrote:
> >>
> >>> In short,
> >>>
> >>> you are trying to use an indexer as a full-text search engine, right?
> >>>
> >>> Regards
> >>> Bernd
> >>>
> >>> On 02.12.19 at 12:24, eli chen wrote:
> >>>> hi im kind of new to solr so please be patient
> >>>>
> >>>> i'll try to explain what do i need and what im trying to do.
> >>>>
> >>>> we a have a lot of books content and we want to index them and allow
> >>> search
> >>>> in the books.
> >>>> when someone search for a term
> >>>> i need to get back the position of matchen word in the book
> >>>> for example
> >>>> if the book content is "hello my name is jeff" and someone search for
> >>> "my".
> >>>> i want to get back the position of my in the content field (which is 1
> >> in
> >>>> this case)
> >>>> i tried to do that with payloads but no success. and another problem i
> >>>> encourage is .
> >>>> lets say the content field is "hello my name is jeff what is your
> >> name".
> >>>> now if someone search for "name" i want to get back the index of all
> >>>> occurrences not just the first one
> >>>>
> >>>> is there any way to that with solr without develop new plugins
> >>>>
> >>>> thx
> >>>>
> >>>
> >>
> >
> >
>


-- 
-- 
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn:  *8173*



Re: hi question about solr

2019-12-03 Thread Bernd Fehling
No, I don't use any highlighting.

On 03.12.19 at 12:28, Paras Lehana wrote:
> Hi Bernd,
> 
> Have you gone through Highlighting
> <https://lucene.apache.org/solr/guide/8_3/highlighting.html>?
> 
> On Mon, 2 Dec 2019 at 17:00, eli chen  wrote:
> 
>> yes
>>
>> On Mon, 2 Dec 2019 at 13:29, Bernd Fehling >>
>> wrote:
>>
>>> In short,
>>>
>>> you are trying to use an indexer as a full-text search engine, right?
>>>
>>> Regards
>>> Bernd
>>>
>>> On 02.12.19 at 12:24, eli chen wrote:
 hi im kind of new to solr so please be patient

 i'll try to explain what do i need and what im trying to do.

 we a have a lot of books content and we want to index them and allow
>>> search
 in the books.
 when someone search for a term
 i need to get back the position of matchen word in the book
 for example
 if the book content is "hello my name is jeff" and someone search for
>>> "my".
 i want to get back the position of my in the content field (which is 1
>> in
 this case)
 i tried to do that with payloads but no success. and another problem i
 encourage is .
 lets say the content field is "hello my name is jeff what is your
>> name".
 now if someone search for "name" i want to get back the index of all
 occurrences not just the first one

 is there any way to that with solr without develop new plugins

 thx

>>>
>>
> 
> 


Re: hi question about solr

2019-12-03 Thread Paras Lehana
Hi Bernd,

Have you gone through Highlighting
<https://lucene.apache.org/solr/guide/8_3/highlighting.html>?
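
For the use case quoted below (finding where a term matches inside a large
stored content field), a hedged example of a highlighting request; the
collection and field names are placeholders, and the content field must be
stored:

curl "http://localhost:8983/solr/books/select?q=content:name&fl=id&hl=true&hl.fl=content&hl.snippets=100&hl.fragsize=50"

This returns highlighted text fragments rather than numeric token positions; if
the positions themselves are needed, the TermVectorComponent mentioned elsewhere
in this thread is the closer fit.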

On Mon, 2 Dec 2019 at 17:00, eli chen  wrote:

> yes
>
> On Mon, 2 Dec 2019 at 13:29, Bernd Fehling  >
> wrote:
>
> > In short,
> >
> > you are trying to use an indexer as a full-text search engine, right?
> >
> > Regards
> > Bernd
> >
> > On 02.12.19 at 12:24, eli chen wrote:
> > > hi im kind of new to solr so please be patient
> > >
> > > i'll try to explain what do i need and what im trying to do.
> > >
> > > we a have a lot of books content and we want to index them and allow
> > search
> > > in the books.
> > > when someone search for a term
> > > i need to get back the position of matchen word in the book
> > > for example
> > > if the book content is "hello my name is jeff" and someone search for
> > "my".
> > > i want to get back the position of my in the content field (which is 1
> in
> > > this case)
> > > i tried to do that with payloads but no success. and another problem i
> > > encourage is .
> > > lets say the content field is "hello my name is jeff what is your
> name".
> > > now if someone search for "name" i want to get back the index of all
> > > occurrences not just the first one
> > >
> > > is there any way to that with solr without develop new plugins
> > >
> > > thx
> > >
> >
>


-- 
-- 
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn:  *8173*



Re: hi question about solr

2019-12-02 Thread eli chen
First of all, thank you very much. I was looking for a good resource to read
on solr.

I actually already tried the term vector component, but for it to work I had to
set fl=content, which responds with the value of the content field (which is
really, really big).
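
For what it's worth, the TermVectorComponent lets you name the field via tv.fl,
so the stored content does not have to be returned in fl. A sketch, assuming the
field was indexed with termVectors/termPositions/termOffsets enabled and that a
handler with the tv component is available (e.g. /tvrh in the sample configs):

curl "http://localhost:8983/solr/books/tvrh?q=content:name&fl=id&tv=true&tv.fl=content&tv.positions=true&tv.offsets=true"

Note that the full term vector of a whole book can itself be large, since it
lists every term in the field, not just the query terms.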


Re: hi question about solr

2019-12-02 Thread Charlie Hull

Hi,

https://livebook.manning.com/book/solr-in-action/chapter-3 may help (I'd 
suggest reading the whole book as well).


Basically what you're looking for is the 'term position'. The 
TermVectorComponent in Solr will allow you to return this for each result.


Cheers

Charlie

On 02/12/2019 11:24, eli chen wrote:

hi im kind of new to solr so please be patient

i'll try to explain what do i need and what im trying to do.

we a have a lot of books content and we want to index them and allow search
in the books.
when someone search for a term
i need to get back the position of matchen word in the book
for example
if the book content is "hello my name is jeff" and someone search for "my".
i want to get back the position of my in the content field (which is 1 in
this case)
i tried to do that with payloads but no success. and another problem i
encourage is .
lets say the content field is "hello my name is jeff what is your name".
now if someone search for "name" i want to get back the index of all
occurrences not just the first one

is there any way to that with solr without develop new plugins

thx



--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk



Re: hi question about solr

2019-12-02 Thread eli chen
yes

On Mon, 2 Dec 2019 at 13:29, Bernd Fehling 
wrote:

> In short,
>
> you are trying to use an indexer as a full-text search engine, right?
>
> Regards
> Bernd
>
> On 02.12.19 at 12:24, eli chen wrote:
> > hi im kind of new to solr so please be patient
> >
> > i'll try to explain what do i need and what im trying to do.
> >
> > we a have a lot of books content and we want to index them and allow
> search
> > in the books.
> > when someone search for a term
> > i need to get back the position of matchen word in the book
> > for example
> > if the book content is "hello my name is jeff" and someone search for
> "my".
> > i want to get back the position of my in the content field (which is 1 in
> > this case)
> > i tried to do that with payloads but no success. and another problem i
> > encourage is .
> > lets say the content field is "hello my name is jeff what is your name".
> > now if someone search for "name" i want to get back the index of all
> > occurrences not just the first one
> >
> > is there any way to that with solr without develop new plugins
> >
> > thx
> >
>


Re: hi question about solr

2019-12-02 Thread Bernd Fehling
In short,

you are trying to use an indexer as a full-text search engine, right?

Regards
Bernd

On 02.12.19 at 12:24, eli chen wrote:
> hi im kind of new to solr so please be patient
> 
> i'll try to explain what do i need and what im trying to do.
> 
> we a have a lot of books content and we want to index them and allow search
> in the books.
> when someone search for a term
> i need to get back the position of matchen word in the book
> for example
> if the book content is "hello my name is jeff" and someone search for "my".
> i want to get back the position of my in the content field (which is 1 in
> this case)
> i tried to do that with payloads but no success. and another problem i
> encourage is .
> lets say the content field is "hello my name is jeff what is your name".
> now if someone search for "name" i want to get back the index of all
> occurrences not just the first one
> 
> is there any way to that with solr without develop new plugins
> 
> thx
> 


hi question about solr

2019-12-02 Thread eli chen
Hi, I'm kind of new to solr, so please be patient.

I'll try to explain what I need and what I'm trying to do.

We have a lot of book content and we want to index it and allow search
in the books.
When someone searches for a term,
I need to get back the position of the matched word in the book.
For example,
if the book content is "hello my name is jeff" and someone searches for "my",
I want to get back the position of "my" in the content field (which is 1 in
this case).
I tried to do that with payloads but had no success. Another problem I
encountered is this:
let's say the content field is "hello my name is jeff what is your name".
Now if someone searches for "name", I want to get back the index of all
occurrences, not just the first one.

Is there any way to do that with solr without developing new plugins?

thx


Re: Question about Luke

2019-11-20 Thread Tomoko Uchida
Hello,

> Is it different from checkIndex -exorcise option?
> (As far as I recently leaned, checkIndex -exorcise will delete unreadable 
> indices. )

If you mean desktop app Luke, "Repair" is just a wrapper of
CheckIndex.exorciseIndex(). There is no difference between doing
"Repair" from Luke GUI and calling "CheckIndex -exorcise" from CLI.


On Mon, Nov 11, 2019 at 20:36 Kayak28 wrote:
>
> Hello, Community:
>
> I am using Solr7.4.0 currently, and I was testing how Solr actually behaves 
> when it has a corrupted index.
> And I used Luke to fix the broken index from GUI.
> I just came up with the following questions.
> Is it possible to use the repair index tool from CLI? (in the case, Solr was 
> on AWS for example.)
> Is it different from checkIndex -exorcise option?
> (As far as I recently leaned, checkIndex -exorcise will delete unreadable 
> indices. )
>
> If anyone gives me a reply, I would be very thankful.
>
> Sincerely,
> Kaya Ota


Re: Question about startup memory usage

2019-11-14 Thread Shawn Heisey

On 11/14/2019 1:46 AM, Hongxu Ma wrote:

Thank you @Shawn Heisey , you help me many times.

My -xms=1G
When restart solr, I can see the progress of memory increasing (from 1G to 9G, 
took near 10s).

I have a guess: maybe solr is loading some needed files into heap memory, e.g. 
*.tip : term index file. What's your thoughts?


Solr's basic operation involves quite a lot of Java memory allocation. 
Most of what gets allocated turns into garbage almost immediately, but 
Java does not reuse that memory right away ... it can only be reused 
after garbage collection on the appropriate memory region runs.


The algorithms in Java that decide between either grabbing more memory 
(up to the configured heap limit) or running garbage collection are 
beyond my understanding.  For programs with heavy memory allocation, 
like Solr, the preference does seem to lean towards allocating more 
memory if it's available than performing garbage collection.


I can imagine that initial loading of indexes containing billions of 
documents will require quite a bit of heap.  I do not know what data is 
stored in that memory.


Thanks,
Shawn


Re: Question about startup memory usage

2019-11-14 Thread Hongxu Ma
Thank you @Shawn Heisey, you have helped me many times.

My -xms=1G
When restart solr, I can see the progress of memory increasing (from 1G to 9G, 
took near 10s).

I have a guess: maybe solr is loading some needed files into heap memory, e.g. 
*.tip (the term index file). What are your thoughts?

thanks.



From: Shawn Heisey 
Sent: Thursday, November 14, 2019 1:15
To: solr-user@lucene.apache.org 
Subject: Re: Question about startup memory usage

On 11/13/2019 2:03 AM, Hongxu Ma wrote:
> I have a solr-cloud cluster with a big collection, after startup (no any 
> search/index operations), its jvm memory usage is 9GB (via top: RES).
>
> Cluster and collection info:
> each host: total 64G mem, two solr nodes with -xmx=15G
> collection: total 9B billion docs (but each doc is very small: only some 
> bytes), total size 3TB.
>
> My question is:
> Is the 9G mem usage after startup normal? If so, I am worried that the follow 
> up index/search operations will cause an OOM error.
> And how can I reduce the memory usage? Maybe I should introduce more host 
> with nodes, but besides this, is there any other solution?

With the "-Xmx=15G" option, you've told Java that it can use up to 15GB
for heap.  It's total resident memory usage is eventually going to reach
a little over 15GB and probably never go down.  This is how Java works.

The amount of memory that Java allocates immediately on program startup
is related to the -Xms setting.  Normally Solr uses the same number for
both -Xms and -Xmx, but that can be changed if you desire.  We recommend
using the same number.  If -Xms is smaller than -Xmx, Java may allocate
less memory as soon as it starts, then Solr is going to run through its
startup procedure.  We will not know exactly how much memory allocation
is going to occur when that happens ... but with billions of documents,
it's not going to be small.

Thanks,
Shawn


Re: Question about startup memory usage

2019-11-13 Thread Shawn Heisey

On 11/13/2019 2:03 AM, Hongxu Ma wrote:

I have a solr-cloud cluster with a big collection, after startup (no any 
search/index operations), its jvm memory usage is 9GB (via top: RES).

Cluster and collection info:
each host: total 64G mem, two solr nodes with -xmx=15G
collection: total 9B billion docs (but each doc is very small: only some 
bytes), total size 3TB.

My question is:
Is the 9G mem usage after startup normal? If so, I am worried that the follow 
up index/search operations will cause an OOM error.
And how can I reduce the memory usage? Maybe I should introduce more host with 
nodes, but besides this, is there any other solution?


With the "-Xmx=15G" option, you've told Java that it can use up to 15GB 
for heap.  Its total resident memory usage is eventually going to reach 
a little over 15GB and probably never go down.  This is how Java works.


The amount of memory that Java allocates immediately on program startup 
is related to the -Xms setting.  Normally Solr uses the same number for 
both -Xms and -Xmx, but that can be changed if you desire.  We recommend 
using the same number.  If -Xms is smaller than -Xmx, Java may allocate 
less memory as soon as it starts, then Solr is going to run through its 
startup procedure.  We will not know exactly how much memory allocation 
is going to occur when that happens ... but with billions of documents, 
it's not going to be small.


Thanks,
Shawn


Question about startup memory usage

2019-11-13 Thread Hongxu Ma
Hi
I have a solr-cloud cluster with a big collection; after startup (without any 
search/index operations), its jvm memory usage is 9GB (via top: RES).

Cluster and collection info:
each host: total 64G mem, two solr nodes with -xmx=15G
collection: 9 billion docs in total (but each doc is very small: only some 
bytes), total size 3TB.

My question is:
Is the 9GB mem usage after startup normal? If so, I am worried that the 
follow-up index/search operations will cause an OOM error.
And how can I reduce the memory usage? Maybe I should introduce more hosts with 
nodes, but besides this, is there any other solution?

Thanks.






Re: Question about memory usage and file handling

2019-11-11 Thread Erick Erickson
(1) no. The internal Ram buffer will pretty much limit the amount of heap used 
however.

(2) You actually have several segments. “.cfs” stands for “Compound File”, see: 

https://lucene.apache.org/core/7_1_0/core/org/apache/lucene/codecs/lucene70/package-summary.html
"An optional "virtual" file consisting of all the other index files for systems 
that frequently run out of file handles.”

IOW, _0.cfs is a complete segment. _1.cfs is a different, complete segment etc. 
The merge policy (TieredMergePolicy) controls when these are used .vs. the 
segment being kept in separate files.

New segments are created whenever the ram buffer is flushed or whenever you do 
a commit (closing the IW also creates a segment IIUC). However, under control 
of the merge policy, segments are merged. See: 
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

You’re confusing closing a writer with merging segments. Essentially, every 
time a commit happens, the merge policy is called to determine if segments 
should be merged, see Mike’s blog above.

Additionally, you say "I was hoping there would be only _0.cfs file”. This’ll 
pretty much never happen. Segment names always increase, at best you’d have 
something like _ab.cfs, if not 10-15 _ab* files.

Lucene likes file handles; essentially, when searching, a file handle will be 
open for _every_ file in your index all the time.

All that said, counting the number of files seems like a waste of time. If 
you’re running on a *nix box, the usual advice (for Solr I’ll admit, but I think 
it applies to Lucene as well) is to set the open-file limit to 65K or so.
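
For example, on a typical Linux box you can check the current limit and raise it
for the user that runs the Lucene/Solr process (user name and values are
illustrative; a systemd service needs LimitNOFILE instead):

ulimit -n                      # current open-file limit for this shell

# /etc/security/limits.conf
solr  soft  nofile  65000
solr  hard  nofile  65000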

And if you’re truly concerned, and since you say this is an immutable, you can 
do a forceMerge. Prior to Lucene 7.5, the would by default form exactly one 
segment. For Lucene 7.5 and later, it’ll respect max segment size (a parameter 
in TMP, defaults to 5g) unless you specify a segment count of 1.

Best,
Erick

> On Nov 11, 2019, at 5:47 PM, Shawn Heisey  wrote:
> 
> On 11/11/2019 1:40 PM, siddharth teotia wrote:
>> I have a few questions about Lucene indexing and file handling. It would be
>> great if someone can help with these. I had earlier asked these questions
>> on gene...@lucene.apache.org but was asked to seek help here.
> 
> This mailing list (solr-user) is for Solr.  Questions about Lucene do not 
> belong on this list.
> 
> You should ask on the java-user mailing list, which is for questions related 
> to the core (Java) version of Lucene.
> 
> http://lucene.apache.org/core/discussion.html#java-user-list-java-userluceneapacheorg
> 
> I have put the original sender address in the BCC field just in case you are 
> not subscribed here.
> 
> Thanks,
> Shawn



Re: Question about memory usage and file handling

2019-11-11 Thread Shawn Heisey

On 11/11/2019 1:40 PM, siddharth teotia wrote:

I have a few questions about Lucene indexing and file handling. It would be
great if someone can help with these. I had earlier asked these questions
on gene...@lucene.apache.org but was asked to seek help here.


This mailing list (solr-user) is for Solr.  Questions about Lucene do 
not belong on this list.


You should ask on the java-user mailing list, which is for questions 
related to the core (Java) version of Lucene.


http://lucene.apache.org/core/discussion.html#java-user-list-java-userluceneapacheorg

I have put the original sender address in the BCC field just in case you 
are not subscribed here.


Thanks,
Shawn


Question about memory usage and file handling

2019-11-11 Thread siddharth teotia
Hi All,

I have a few questions about Lucene indexing and file handling. It would be
great if someone can help with these. I had earlier asked these questions
on gene...@lucene.apache.org but was asked to seek help here.


(1) During indexing, is there any knob to tell the writer to use off-heap
for buffering. I didn't find anything in the docs so probably the answer is
no. Just confirming.

(2) I did some experiments with buffering threshold using
setMaxRAMBufferSizeMB() on IndexWriterConfig. I varied it from 16MB
(default), 128MB, 256MB and 512MB. The experiment was ingesting 5million
documents. It turns out that buffering threshold also controls the number
of files that are created in the index directory. In all the cases, I see
only 1 segment (since there was just one segments_1 file), but there were
multiple .cfs files -- _0.cfs, _1.cfs, _2.cfs, _3.cfs.

How can there be multiple cfs files when there is just one segment? My
understanding from the documentation was that all files for each segment
will have the same name but different extension. In this case, even though
there is only 1 segment, there are still cfs files. Does each flush result
in a new file?

The reason to do this experiment is to understand the number of open files
both while building the index and querying. I am not quite sure why I am
seeing multiple CFS files when there is only 1 segment. I was hoping there
would be only a _0.cfs file.  This is true when the buffer threshold is 512MB, but
there are 2 cfs files when threshold is set to 256MB, 5 cfs files when set
to 128MB and I didn't see the CFS file for the default 16MB threshold.
There were individual files (.fdx, .fdt, .tip etc). I thought by default
Lucene creates a compound file at least after the writer closes. Is that
not true?

I can see that during querying, only the cfs file is kept opened. But I
would like to understand a little bit about the number of cfs files and
based on that we can set the buffering threshold to control the heap
overhead while building the index.

(3) In my experiments, the writer commits and is closed after ingesting all
the 5million documents and after that there is no need for us to index
more. So essentially it is an immutable index. However, I want to
understand the threshold for creating a new segment. Is that pretty high?
Or if the writer is reopened, then the next set of documents will go into
the next segment and so on?

I would really appreciate some help with above questions.

Thanks,
Siddharth


Question about Luke

2019-11-11 Thread Kayak28
Hello, Community:

I am using Solr7.4.0 currently, and I was testing how Solr actually behaves
when it has a corrupted index.
And I used Luke to fix the broken index from GUI.
I just came up with the following questions.
Is it possible to use the repair index tool from the CLI? (in case Solr
is on AWS, for example.)
Is it different from the checkIndex -exorcise option?
(As far as I recently learned, checkIndex -exorcise will delete unreadable
indices.)

If anyone gives me a reply, I would be very thankful.

Sincerely,
Kaya Ota


Re: Solr 7.6 query performace question

2019-10-13 Thread Erick Erickson
Well, It Depends (tm).

Certainly 2 and 3 are _not_ memory intensive

4 depends on the number of terms in the fields.

But I suspect your real problem has nothing to do with memory and is <1>. Try 
q=*:* rather than q=*. In case your e-mail tries to make things bold, that’s 
q=asterisk-colon-asterisk, not q=asterisk
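
In other words, the department restriction would normally go into a filter query
with a match-all main query; a sketch with placeholder collection and field
names:

curl "http://localhost:8983/solr/products/select?q=*:*&fq=department:shoes&rows=10"

q=*:* is a cheap match-all query, while q=* can expand into a wildcard over the
default field's terms, which fits the "too long to iterate over terms" timeouts
quoted below.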

Best,
Erick

> On Oct 13, 2019, at 12:59 AM, harjags  wrote:
> 
> We are upgrading to solr 7.6 from 6.1
> Our query has below pattern predominantly
> 
> 1.q is * as we filter based on a department of products always
> 2. 100+ bq's to boost certain document
> 3. Collapsing using a non DocValue field
> 4.Many Facet Fields and Many Facet queries
> 
> Which of the above is the most memory consuming operations?
> 
> Below errors are very common in 7.6 and we have solr nodes failing with
> tanking memory.
> 
> The request took too long to iterate over terms. Timeout: timeoutAt:
> 162874656583645 (System.nanoTime(): 162874701942020),
> TermsEnum=org.apache.lucene.codecs.blocktree.SegmentTermsEnum@74507f4a
> 
> or 
> 
> #*BitSetDocTopFilter*]; The request took too long to iterate over terms.
> Timeout: timeoutAt: 33288640223586 (System.nanoTime(): 33288700895778),
> TermsEnum=org.apache.lucene.codecs.blocktree.SegmentTermsEnum@5e458644
> 
> 
> or 
> 
> #SortedIntDocSetTopFilter]; The request took too long to iterate over terms.
> Timeout: timeoutAt: 552497919389297 (System.nanoTime(): 552508251053558),
> TermsEnum=org.apache.lucene.codecs.blocktree.SegmentTermsEnum@60b7186e
> 
> 
> 
> OR
> 
> 
> 
> 
> 
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Solr 7.6 query performace question

2019-10-13 Thread harjags
We are upgrading to solr 7.6 from 6.1.
Our queries predominantly have the pattern below:

1. q is * as we always filter on a department of products
2. 100+ bq's to boost certain documents
3. Collapsing using a non-DocValues field
4. Many facet fields and many facet queries

Which of the above are the most memory-consuming operations?

The errors below are very common in 7.6, and we have solr nodes failing with
tanking memory.

The request took too long to iterate over terms. Timeout: timeoutAt:
162874656583645 (System.nanoTime(): 162874701942020),
TermsEnum=org.apache.lucene.codecs.blocktree.SegmentTermsEnum@74507f4a

or 

#*BitSetDocTopFilter*]; The request took too long to iterate over terms.
Timeout: timeoutAt: 33288640223586 (System.nanoTime(): 33288700895778),
TermsEnum=org.apache.lucene.codecs.blocktree.SegmentTermsEnum@5e458644


or 

#SortedIntDocSetTopFilter]; The request took too long to iterate over terms.
Timeout: timeoutAt: 552497919389297 (System.nanoTime(): 552508251053558),
TermsEnum=org.apache.lucene.codecs.blocktree.SegmentTermsEnum@60b7186e


 
OR





--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Question regarding subqueries

2019-10-03 Thread Bram Biesbrouck
Hi Mikhail,

You're right, I'm probably over-complicating things. I was stuck trying to
combine a function in a regular query using a local variable, but Solr
doesn't seem to bend the way my mind did ;-)
Anyway, I worked around it using your suggestion and/or a slightly modified
prefix parser plugin.
Thanks for taking the time to reply, btw!

best,

b.

On Wed, Oct 2, 2019 at 9:05 PM Mikhail Khludnev  wrote:

> Hello, Bram.
>
> Something like that is possible in principle, but it will take enormous
> efforts to tackle exact syntax.
> Why not something like children.fq=-parent:true ?
>
> On Wed, Oct 2, 2019 at 8:52 PM Bram Biesbrouck <
> bram.biesbro...@reinvention.be> wrote:
>
> > Hi all,
> >
> > I'm struggling with a little period-sign difficulty and instead of
> pulling
> > out my hair, I wonder if any of you could help me out...
> >
> > Here's the query:
> > q=uri:"/en/blah"=id,uri,children:[subquery]={!prefix f=id
> v=$
> > row.id}=*
> >
> > It just searches for a document with the field "uri" set to "/en/blah".
> > For every hit (just one), it tries to manually fetch the subdocuments
> using
> > the id field of the hit since its children have id's like
> > ..
> > Note that I know this should be done with nested documents and the
> > ChildDocTransformer... this is just an exercise to train my brain...
> >
> > The query above works fine. However, it also returns the parent document,
> > because the prefix search includes it as well, of course. However, if I'm
> > changing the subquery to something along the lines of this:
> >
> > {!prefix f=id v=concat($row.id,".")}
> > or
> > {!prefix f=id v="$row.id\.")}
> > or
> > {!query defType=lucene v=concat("id:",$row.id,".")}
> >
> > I get no results back.
> >
> > I feel like I'm missing only a simple thing here, but can't seem to
> > pinpoint it.
> >
> > Any help?
> >
> > b.
> >  *We do video technology*
> > Visit our new website!  *Bram Biesbrouck*
> > bram.biesbro...@reinvention.be
> > +32 486 118280 <0032%20486%20118280>
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


Re: Question regarding subqueries

2019-10-02 Thread Mikhail Khludnev
Hello, Bram.

Something like that is possible in principle, but it will take enormous
efforts to tackle exact syntax.
Why not something like children.fq=-parent:true ?
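
For instance, if parent documents carried a boolean flag (a field named "parent"
is assumed here; it is not in the original schema), the whole request could look
roughly like this, using the documented children.* prefix for [subquery]
parameters:

q=uri:"/en/blah"
  &fl=id,uri,children:[subquery]
  &children.q={!prefix f=id v=$row.id}
  &children.fq=-parent:true
  &children.fl=id,uri
  &children.rows=100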

On Wed, Oct 2, 2019 at 8:52 PM Bram Biesbrouck <
bram.biesbro...@reinvention.be> wrote:

> Hi all,
>
> I'm struggling with a little period-sign difficulty and instead of pulling
> out my hair, I wonder if any of you could help me out...
>
> Here's the query:
> q=uri:"/en/blah"=id,uri,children:[subquery]={!prefix f=id v=$
> row.id}=*
>
> It just searches for a document with the field "uri" set to "/en/blah".
> For every hit (just one), it tries to manually fetch the subdocuments using
> the id field of the hit since its children have id's like
> ..
> Note that I know this should be done with nested documents and the
> ChildDocTransformer... this is just an exercise to train my brain...
>
> The query above works fine. However, it also returns the parent document,
> because the prefix search includes it as well, of course. However, if I'm
> changing the subquery to something along the lines of this:
>
> {!prefix f=id v=concat($row.id,".")}
> or
> {!prefix f=id v="$row.id\.")}
> or
> {!query defType=lucene v=concat("id:",$row.id,".")}
>
> I get no results back.
>
> I feel like I'm missing only a simple thing here, but can't seem to
> pinpoint it.
>
> Any help?
>
> b.
>  *We do video technology*
> Visit our new website!  *Bram Biesbrouck*
> bram.biesbro...@reinvention.be
> +32 486 118280 <0032%20486%20118280>
>


-- 
Sincerely yours
Mikhail Khludnev


Question regarding subqueries

2019-10-02 Thread Bram Biesbrouck
Hi all,

I'm struggling with a little period-sign difficulty and instead of pulling
out my hair, I wonder if any of you could help me out...

Here's the query:
q=uri:"/en/blah"=id,uri,children:[subquery]={!prefix f=id v=$
row.id}=*

It just searches for a document with the field "uri" set to "/en/blah".
For every hit (just one), it tries to manually fetch the subdocuments using
the id field of the hit, since its children have id's like
<parent id>.<child id>.
Note that I know this should be done with nested documents and the
ChildDocTransformer... this is just an exercise to train my brain...

The query above works fine. However, it also returns the parent document,
because the prefix search includes it as well, of course. However, if I'm
changing the subquery to something along the lines of this:

{!prefix f=id v=concat($row.id,".")}
or
{!prefix f=id v="$row.id\.")}
or
{!query defType=lucene v=concat("id:",$row.id,".")}

I get no results back.

I feel like I'm missing only a simple thing here, but can't seem to
pinpoint it.

Any help?

b.
 *We do video technology*
Visit our new website!  *Bram Biesbrouck*
bram.biesbro...@reinvention.be
+32 486 118280 <0032%20486%20118280>


Re: auto scaling question - solr 8.2.0

2019-09-26 Thread Joe Obernberger
Just as another data point.  I just tried again, and this time, I got an 
error from one of the remaining 3 nodes:


Error while trying to recover. 
core=UNCLASS_2019_6_8_36_shard2_replica_n21:java.util.concurrent.ExecutionException:
 org.apache.solr.client.solrj.SolrServerException: IOException occurred when 
talking to server at: http://telesto:9100/solr
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at 
org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:902)
at 
org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:603)
at 
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:336)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:317)
at 
com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:181)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.solr.client.solrj.SolrServerException: IOException 
occurred when talking to server at: http://telesto:9100/solr
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:670)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.lambda$httpUriRequest$0(HttpSolrClient.java:306)
... 5 more
Caused by: java.net.SocketException: Socket closed
at java.net.SocketInputStream.read(SocketInputStream.java:204)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at 
org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
at 
org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
at 
org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282)
at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
at 
org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
at 
org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
at 
org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165)
at 
org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
at 
org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
at 
org.apache.solr.util.stats.InstrumentedHttpRequestExecutor.execute(InstrumentedHttpRequestExecutor.java:120)
at 
org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
at 
org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
at 
org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at 
org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:555)
... 6 more


At this point, no nodes are hosting one of the collections.

-Joe

On 9/26/2019 1:32 PM, Joe Obernberger wrote:
Hi all - I have a 4 node cluster for test, and created several solr 
collections with 2 shards and 2 replicas each.


I'd like the global policy to be to not place more than one replica of 
the same shard on the same node.  I did this with this curl command:
curl -X POST -H 'Content-type:application/json' --data-binary 
'{"set-cluster-policy":[{"replica": "<2", "shard": "#EACH", "node": 
"#ANY"}]}' http://localhost:9100/solr/admin/autoscaling


Creating the collections works great - they are distributed across the 
nodes nicely.  When I turn a node off, however, (going from 4 nodes to 
3), the same node was assigned to not only be both replicas of a 
shard, but one node is now hosting all of the replicas of a collection 
ie:

collection->shard1>replica1,replica2
collection->shard2->replica1,replica2

all of those replicas above are hosted by the same node.  What am I 
doing wrong here?  

auto scaling question - solr 8.2.0

2019-09-26 Thread Joe Obernberger
Hi all - I have a 4 node cluster for test, and created several solr 
collections with 2 shards and 2 replicas each.


I'd like the global policy to be to not place more than one replica of 
the same shard on the same node.  I did this with this curl command:
curl -X POST -H 'Content-type:application/json' --data-binary 
'{"set-cluster-policy":[{"replica": "<2", "shard": "#EACH", "node": 
"#ANY"}]}' http://localhost:9100/solr/admin/autoscaling


Creating the collections works great - they are distributed across the 
nodes nicely.  When I turn a node off, however (going from 4 nodes to 
3), the same node was assigned to not only be both replicas of a shard, 
but one node is now hosting all of the replicas of a collection, i.e.:

collection->shard1->replica1,replica2
collection->shard2->replica1,replica2

all of those replicas above are hosted by the same node.  What am I 
doing wrong here?  Thank you!


-Joe
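
One way to see how the autoscaling framework judges the layout after the node
goes down is to ask it for violations and suggested moves (same host/port as the
curl above):

curl "http://localhost:9100/solr/admin/autoscaling/diagnostics"
curl "http://localhost:9100/solr/admin/autoscaling/suggestions"

The diagnostics output lists any violations of the cluster policy, and
suggestions lists the replica moves Solr would make to fix them.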



ASCIIFoldingFilter question

2019-09-25 Thread Jarett Lear
Hope this is the right list to ask this, not sure if this is a bug or if
I'm doing something wrong.

We're running some text containing emojis through this filter, and if I'm
reading the code right, when it finds a U+203C (:bangbang: | double
exclamation) it replaces that with the appropriate !! ASCII characters; but
if it's a "fully qualified" emoji then it is also followed by U+FE0F, which
is a zero-width "VARIATION SELECTOR-16".

The issue we are running into is that the emoji is replaced with !! like it
should be, but then directly after the ASCII !! there is this character
that's just hanging out now because it's not matched or changed into
anything. This causes some weird behavior down the line in other filters
and trying to strip off punctuation, for some reason it doesn't seem to be
detected as punctuation anymore. Ultimately we are trying to get down to an
array of meaningful tokens out of the content, but we are getting certain
emoji's all the way through the filters and we aren't sure why these ones
that are ASCII folded are making it through, where the ones that aren't are
filtered out like normal.

Thanks,
Jarett
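
One possible workaround (not something ASCIIFoldingFilter does itself) is to
strip the variation selectors before tokenization with a
PatternReplaceCharFilterFactory. A hedged sketch via the Schema API; the field
type name and the rest of the analysis chain are placeholders:

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field-type": {
    "name": "text_emoji_folded",
    "class": "solr.TextField",
    "analyzer": {
      "charFilters": [
        { "class": "solr.PatternReplaceCharFilterFactory",
          "pattern": "[\uFE0E\uFE0F]", "replacement": "" }
      ],
      "tokenizer": { "class": "solr.StandardTokenizerFactory" },
      "filters": [
        { "class": "solr.ASCIIFoldingFilterFactory" },
        { "class": "solr.LowerCaseFilterFactory" }
      ]
    }
  }
}' http://localhost:8983/solr/mycollection/schema

Fields already using an existing type would need to be switched over and
reindexed for this to take effect.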


Re: Question about "No registered leader" error

2019-09-19 Thread Hongxu Ma
@Shawn @Erick Thanks for your kind help!

There is no OOM log, and I confirm no OOM happened.

My ZK ticktime is set to 5000, so 5000*20 = 100s > 60s, and I checked the solr 
code: the leader waiting time of 4000ms is a constant and is not configurable. 
(Why isn't it a configurable param?)

My solr version is 7.3.1, xmx = 3MB (via solr UI, peak memory is 22GB)
I have already used the CMS GC tuning (the params differ a little from your wiki 
page).

I will try the following advice:

  *   lower heap size
  *   turn to G1 (the same param as wiki)
  *   try to restart one SOLR node when this error happens.

Thanks again.
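
For reference, G1 can be switched on through the GC_TUNE variable in solr.in.sh;
a minimal sketch (these particular flag values are illustrative, not the exact
settings from the wiki page):

GC_TUNE="-XX:+UseG1GC \
  -XX:+ParallelRefProcEnabled \
  -XX:G1HeapRegionSize=8m \
  -XX:MaxGCPauseMillis=250"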


From: Shawn Heisey 
Sent: Wednesday, September 18, 2019 20:21
To: solr-user@lucene.apache.org 
Subject: Re: Question about "No registered leader" error

On 9/18/2019 6:11 AM, Shawn Heisey wrote:
> On 9/17/2019 9:35 PM, Hongxu Ma wrote:
>> My questions:
>>
>>*   Is this error possible caused by "long gc pause"? my solr
>> zkClientTimeout=6
>
> It's possible.  I can't say for sure that this is the issue, but it
> might be.

A followup.  I was thinking about the interactions here.  It looks like
Solr only waits four seconds for the leader election, and both of the
pauses you mentioned are longer than that.

Four seconds is probably too short a time to wait, and I do not think
that timeout is configurable anywhere.

> What version of Solr do you have, and what is your max heap?  The CMS
> garbage collection that Solr 5.0 and later incorporate by default is
> pretty good.  My G1 settings might do slightly better, but the
> improvement won't be dramatic unless your existing commandline has
> absolutely no gc tuning at all.

That question will be important.  If you already have our CMS GC tuning,
switching to G1 probably is not going to solve this.  Lowering the max
heap might be the only viable solution in that case, and depending on
what you're dealing with, it will either be impossible or it will
require more servers.

Thanks,
Shawn


Re: Question about "No registered leader" error

2019-09-18 Thread Erick Erickson
Check whether the oom killer script was called. If so, there will be
log files obviously relating to that. I've seen nodes mysteriously
disappear as a result of this with no message in the regular solr
logs. If that's the case, you need to increase your heap.
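
For example, the killer script leaves a log named after the port and timestamp
in the Solr logs directory (the directory shown is the service-install default;
adjust for a tarball install):

ls -l /var/solr/logs/solr_oom_killer-*.log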

Erick

On Wed, Sep 18, 2019 at 8:21 AM Shawn Heisey  wrote:
>
> On 9/18/2019 6:11 AM, Shawn Heisey wrote:
> > On 9/17/2019 9:35 PM, Hongxu Ma wrote:
> >> My questions:
> >>
> >>*   Is this error possible caused by "long gc pause"? my solr
> >> zkClientTimeout=6
> >
> > It's possible.  I can't say for sure that this is the issue, but it
> > might be.
>
> A followup.  I was thinking about the interactions here.  It looks like
> Solr only waits four seconds for the leader election, and both of the
> pauses you mentioned are longer than that.
>
> Four seconds is probably too short a time to wait, and I do not think
> that timeout is configurable anywhere.
>
> > What version of Solr do you have, and what is your max heap?  The CMS
> > garbage collection that Solr 5.0 and later incorporate by default is
> > pretty good.  My G1 settings might do slightly better, but the
> > improvement won't be dramatic unless your existing commandline has
> > absolutely no gc tuning at all.
>
> That question will be important.  If you already have our CMS GC tuning,
> switching to G1 probably is not going to solve this.  Lowering the max
> heap might be the only viable solution in that case, and depending on
> what you're dealing with, it will either be impossible or it will
> require more servers.
>
> Thanks,
> Shawn


Re: Question about "No registered leader" error

2019-09-18 Thread Shawn Heisey

On 9/18/2019 6:11 AM, Shawn Heisey wrote:

On 9/17/2019 9:35 PM, Hongxu Ma wrote:

My questions:

   *   Is this error possible caused by "long gc pause"? my solr 
zkClientTimeout=6


It's possible.  I can't say for sure that this is the issue, but it 
might be.


A followup.  I was thinking about the interactions here.  It looks like 
Solr only waits four seconds for the leader election, and both of the 
pauses you mentioned are longer than that.


Four seconds is probably too short a time to wait, and I do not think 
that timeout is configurable anywhere.


What version of Solr do you have, and what is your max heap?  The CMS 
garbage collection that Solr 5.0 and later incorporate by default is 
pretty good.  My G1 settings might do slightly better, but the 
improvement won't be dramatic unless your existing commandline has 
absolutely no gc tuning at all.


That question will be important.  If you already have our CMS GC tuning, 
switching to G1 probably is not going to solve this.  Lowering the max 
heap might be the only viable solution in that case, and depending on 
what you're dealing with, it will either be impossible or it will 
require more servers.


Thanks,
Shawn


Re: Question about "No registered leader" error

2019-09-18 Thread Shawn Heisey

On 9/17/2019 9:35 PM, Hongxu Ma wrote:

My questions:

   *   Is it possible that this error is caused by a "long gc pause"? My solr 
zkClientTimeout=60000


It's possible.  I can't say for sure that this is the issue, but it 
might be.



   *   If so, how can I prevent this error from happening? My thoughts: use the G1 
collector (as in 
https://cwiki.apache.org/confluence/display/SOLR/ShawnHeisey#ShawnHeisey-GCTuningforSolr)
 or enlarge zkClientTimeout again. What's your idea?


If your ZK server ticktime setting is the typical value of 2000, that 
means that the largest value you can use for the ZK timeout (which 
Solr's zkClientTimeout value ultimately gets used to set) is 40 seconds 
-- 20 times the ticktime is the biggest value ZK will allow.


So if your ZK server ticktime is 2000 milliseconds, you're not actually 
getting 60 seconds, and I don't know what happens when you try ... I 
would expect ZK to either just use its max value or ignore the setting 
entirely, and I do not know which it is.  That's something we should ask 
the ZK mailing list and/or do testing on.
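
To make the moving parts concrete, here is a rough sketch of where the two values 
live, assuming a stock zoo.cfg and solr.in.sh (the 60000 values are just the 
60-second target discussed in this thread, not a recommendation):

  # zoo.cfg (ZooKeeper side): sessions are capped at 20 x tickTime
  # unless maxSessionTimeout is raised explicitly
  tickTime=2000
  maxSessionTimeout=60000

  # solr.in.sh (Solr side): feeds zkClientTimeout, which ZK may still clamp
  # to its configured range when the session is negotiated
  ZK_CLIENT_TIMEOUT="60000"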


Dealing with the "no registered leader" problem probably will 
involve restarting at least one of the Solr server JVMs in the cloud, 
and if that doesn't work, restart all of them.
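
Before restarting, it can be worth confirming which shard is actually missing its 
leader, for example via the Collections API (host, port and collection name here 
are placeholders):

  curl "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=foo&wt=json"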


What version of Solr do you have, and what is your max heap?  The CMS 
garbage collection that Solr 5.0 and later incorporate by default is 
pretty good.  My G1 settings might do slightly better, but the 
improvement won't be dramatic unless your existing commandline has 
absolutely no gc tuning at all.
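
As an illustration only -- not the exact wiki values, and the right numbers depend 
on your heap and hardware -- G1 tuning in solr.in.sh looks something like this:

  # solr.in.sh: hedged sketch of G1 settings; adjust the region size and
  # pause target to your own heap and workload
  GC_TUNE="-XX:+UseG1GC \
    -XX:+ParallelRefProcEnabled \
    -XX:G1HeapRegionSize=8m \
    -XX:MaxGCPauseMillis=250 \
    -XX:+AlwaysPreTouch"
  # keep the heap as small as the workload allows; 8g is an assumption
  SOLR_HEAP="8g"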


Thanks,
Shawn


Question about "No registered leader" error

2019-09-17 Thread Hongxu Ma
Hi all
I got an error while doing an index operation:

"2019-09-18 02:35:44.427244 ... No registered leader was found after waiting 
for 4000ms , collection: foo slice: shard2"

Besides that, there is no other error in the Solr log.

Collection foo has 2 shards, so I checked their JVM GC logs:

  *   2019-09-18T02:34:08.252+0000: 150961.017: Total time for which 
application threads were stopped: 10.4617864 seconds, Stopping threads took: 
0.0005226 seconds

  *   2019-09-18T02:34:30.194+0000: 151014.108: Total time for which 
application threads were stopped: 44.4809415 seconds, Stopping threads took: 
0.0005976 seconds

I saw long GC pauses at around those timepoints.
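
For reference, pauses like these can be pulled out of the GC log with something 
like the following (solr_gc.log is an assumed file name; this lists safepoint 
pauses longer than about 4 seconds):

  grep "application threads were stopped" solr_gc.log \
    | awk -F'stopped: ' '$2+0 > 4.0'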

My questions:

  *   Is it possible that this error is caused by a "long gc pause"? My solr 
zkClientTimeout=60000
  *   If so, how can I prevent this error from happening? My thoughts: use the G1 
collector (as in 
https://cwiki.apache.org/confluence/display/SOLR/ShawnHeisey#ShawnHeisey-GCTuningforSolr)
 or enlarge zkClientTimeout again. What's your idea?


Thanks.



Re: Question: Solr perform well with thousands of replicas?

2019-09-04 Thread Hongxu Ma
Hi Erick
Thanks for your help.

Before visiting the wiki/mailing list, I knew Solr is unstable with 1000+ collections and 
should be safe with 10~100 collections.
But in a specific environment, what's the exact number at which Solr begins to become 
unstable? I don't know.

So I am deploying a test cluster to find that number, and to try to push it 
higher (to save cost).
That's my purpose: quantitative analysis --> how many replicas can be supported 
in my environment?
After I get that number, I will adjust my application: when it is near the maximum, 
prevent the creation of too many indexes or give a warning message to the user.


From: Erick Erickson 
Sent: Monday, September 2, 2019 21:20
To: solr-user@lucene.apache.org 
Subject: Re: Question: Solr perform well with thousands of replicas?

> why so many collections/replicas: it's what our customer needs, for example: each 
> database table maps to a collection.

I always cringe when I see statements like this. What this means is that your 
customer doesn’t understand search and needs guidance in the proper use of any 
search technology, Solr included.

Solr is _not_ an RDBMS. Simply mapping the DB tables onto collections will 
almost certainly result in a poor experience. Next the customer will want to 
ask Solr to do the same thing a DB does, i.e. run a join across 10 tables etc., 
which will be abysmal. Solr isn’t designed for that. Some brilliant RDBMS 
people have spent many years making DBs do what they do, and do it well.

That said, RDBMSs have poor search capabilities, they aren’t built to solve the 
search problem.

I suspect the time you spend making Solr load a thousand cores will be wasted. 
Once you do get them loaded, performance will be horrible. IMO you’d be far 
better off helping the customer define their problem so they properly model 
their search problem. This may mean that the result will be a hybrid where Solr 
is used for the free-text search and the RDBMS uses the results of the search 
to do something. Or vice versa.

FWIW
Erick

> On Sep 2, 2019, at 5:55 AM, Hongxu Ma  wrote:
>
> Thanks @Jörn and @Erick
> I enlarged my JVM memory; so far it's stable (but it uses a lot of memory).
> And I will check for lower-level errors per your suggestion if an error 
> happens.
>
> About my scenario:
>
>  *   why so many collections/replicas: it's what our customer needs, for example: 
> each database table maps to a collection.
>  *   this env is just a test cluster: I want to verify the max collection 
> number solr can support stably.
>
>
> 
> From: Erick Erickson 
> Sent: Friday, August 30, 2019 20:05
> To: solr-user@lucene.apache.org 
> Subject: Re: Question: Solr perform well with thousands of replicas?
>
> “no registered leader” is the effect of some problem usually, not the root 
> cause. In this case, for instance, you could be running out of file handles 
> and see other errors like “too many open files”. That’s just one example.
>
> One common problem is that Solr needs a lot of file handles and the system 
> defaults are too low. We usually recommend you start with 65K file handles 
> (ulimit) and bump up the number of processes to 65K too.
>
> So to throw some numbers out. With 1,000 replicas, and let’s say you have 50 
> segments in the index in each replica. Each segment consists of multiple 
> files (I’m skipping “compound files” here as an advanced topic), so each 
> segment has, let’s say, 10 segments. 1,000 * 50 * 10 would require 500,000 
> file handles on your system.
>
> Bottom line: look for other, lower-level errors in the log to try to 
> understand what limit you’re running into.
>
> All that said, there’ll be a number of “gotchas” when running that many 
> replicas on a particular node, I second Jörn;’s question...
>
> Best,
> Erick
>
>> On Aug 30, 2019, at 3:18 AM, Jörn Franke  wrote:
>>
>> What is the reason for this number of replicas? Solr should work fine, but 
>> maybe it is worth consolidating some collections to also avoid 
>> administrative overhead.
>>
>>> Am 29.08.2019 um 05:27 schrieb Hongxu Ma :
>>>
>>> Hi
>>> I have a solr-cloud cluster, but it's unstable when collection number is 
>>> big: 1000 replica/core per solr node.
>>>
>>> To solve this issue, I have read the performance guide:
>>> https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems
>>>
>>> I noted there is a sentence on solr-cloud section:
>>> "Recent Solr versions perform well with thousands of replicas."
>>>
>>> I want to know does it mean a single solr node can handle thousands of 
>>> replicas? or a solr cluster can (if so, what's the size of the cluster?)
>>>
>>> My solr version is 7.3.1 and 6.6.2 (looks they are the same in performance)
>>>
>>> Thanks for your help.
>>>
>
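
For reference, the 65K file-handle and process limits recommended above are 
typically set along these lines on Linux; the exact mechanism is an assumption 
and differs when Solr runs under systemd:

  # /etc/security/limits.conf -- assumes Solr runs as the "solr" user
  solr  soft  nofile  65000
  solr  hard  nofile  65000
  solr  soft  nproc   65000
  solr  hard  nproc   65000

  # verify against the running process (Solr 7 runs Jetty via start.jar)
  grep -E "open files|processes" /proc/$(pgrep -f start.jar | head -1)/limits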



Re: Question: Solr perform well with thousands of replicas?

2019-09-02 Thread Erick Erickson
> why so many collections/replicas: it's what our customer needs, for example: each 
> database table maps to a collection.

I always cringe when I see statements like this. What this means is that your 
customer doesn’t understand search and needs guidance in the proper use of any 
search technology, Solr included.

Solr is _not_ an RDBMS. Simply mapping the DB tables onto collections will 
almost certainly result in a poor experience. Next the customer will want to 
ask Solr to do the same thing a DB does, i.e. run a join across 10 tables etc., 
which will be abysmal. Solr isn’t designed for that. Some brilliant RDBMS 
people have spent many years making DBs do what they do, and do it well. 

That said, RDBMSs have poor search capabilities, they aren’t built to solve the 
search problem.

I suspect the time you spend making Solr load a thousand cores will be wasted. 
Once you do get them loaded, performance will be horrible. IMO you’d be far 
better off helping the customer define their problem so they properly model 
their search problem. This may mean that the result will be a hybrid where Solr 
is used for the free-text search and the RDBMS uses the results of the search 
to do something. Or vice versa.

FWIW
Erick

> On Sep 2, 2019, at 5:55 AM, Hongxu Ma  wrote:
> 
> Thanks @Jörn and @Erick
> I enlarged my JVM memory; so far it's stable (but it uses a lot of memory).
> And I will check for lower-level errors per your suggestion if an error 
> happens.
> 
> About my scenario:
> 
>  *   why so many collections/replicas: it's what our customer needs, for example: 
> each database table maps to a collection.
>  *   this env is just a test cluster: I want to verify the max collection 
> number solr can support stably.
> 
> 
> 
> From: Erick Erickson 
> Sent: Friday, August 30, 2019 20:05
> To: solr-user@lucene.apache.org 
> Subject: Re: Question: Solr perform well with thousands of replicas?
> 
> “no registered leader” is the effect of some problem usually, not the root 
> cause. In this case, for instance, you could be running out of file handles 
> and see other errors like “too many open files”. That’s just one example.
> 
> One common problem is that Solr needs a lot of file handles and the system 
> defaults are too low. We usually recommend you start with 65K file handles 
> (ulimit) and bump up the number of processes to 65K too.
> 
> So to throw some numbers out. With 1,000 replicas, and let’s say you have 50 
> segments in the index in each replica. Each segment consists of multiple 
> files (I’m skipping “compound files” here as an advanced topic), so each 
> segment has, let’s say, 10 segments. 1,000 * 50 * 10 would require 500,000 
> file handles on your system.
> 
> Bottom line: look for other, lower-level errors in the log to try to 
> understand what limit you’re running into.
> 
> All that said, there’ll be a number of “gotchas” when running that many 
> replicas on a particular node, I second Jörn;’s question...
> 
> Best,
> Erick
> 
>> On Aug 30, 2019, at 3:18 AM, Jörn Franke  wrote:
>> 
>> What is the reason for this number of replicas? Solr should work fine, but 
>> maybe it is worth consolidating some collections to also avoid 
>> administrative overhead.
>> 
>>> Am 29.08.2019 um 05:27 schrieb Hongxu Ma :
>>> 
>>> Hi
>>> I have a solr-cloud cluster, but it's unstable when collection number is 
>>> big: 1000 replica/core per solr node.
>>> 
>>> To solve this issue, I have read the performance guide:
>>> https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems
>>> 
>>> I noted there is a sentence on solr-cloud section:
>>> "Recent Solr versions perform well with thousands of replicas."
>>> 
>>> I want to know does it mean a single solr node can handle thousands of 
>>> replicas? or a solr cluster can (if so, what's the size of the cluster?)
>>> 
>>> My solr version is 7.3.1 and 6.6.2 (looks they are the same in performance)
>>> 
>>> Thanks for your help.
>>> 
> 


