RE: Solr working £ Symbol

2018-05-02 Thread Mohan Cheema
>> We are using Solr to index our data. The data contains the £ symbol within the 
>> text and for currency. When data is exported from the source system the data 
>> contains the £ symbol; however, when the data is imported into Solr the £ symbol 
>> is converted to  .
>>
>> How can we keep the £ symbol as is when importing data?
>
>What tools are you using to look at Solr results?  What tools are you using to 
>send update data to Solr?
Our application is written in Python and uses the UTF-8 charset. We are 
using the Solr post tool to send data to Solr.

>
>Solr expects and delivers UTF-8 characters.  If the data you're sending to 
>Solr is using another character set, Java may not interpret it correctly.
The generated JSON file does show the £ symbol. The post tool we used will, IMHO, 
use the system LANG setting, which is set to 'LANG=en_GB.UTF-8'.
>
>Conversely, if whatever you're using to look at Solr's results is also not 
>expecting/displaying UTF-8, you might not be shown correct characters.
When we check the data using the Solr web UI, we cannot see the £ 
symbol there either.

Regards,

Mohan
Disclaimer: www.arrkgroup.com/EmailDisclaimer


Re: SolrCloud replication

2018-05-02 Thread Greenhorn Techie
Shalin,

Given Erick's earlier response, I am wondering about when this scenario occurs,
i.e. when the replica node recovers after a period of time, wouldn’t it
automatically recover all the missed updates by connecting to the leader?
My understanding from the responses so far is as below (assuming a
replication factor of 2 for simplicity):

1. The client sends an update request, which is received by the shard leader
2. Once the leader has applied the update on its own node, it sends the update
to the unavailable replica node
3. The leader keeps trying to send the update to the replica node
4. After a while the leader gives up and communicates this to the client (not sure
what kind of message the client will receive in this case?)
5. The replica node recovers, realises that it needs to catch up, and hence
receives all the missed updates in recovery mode

Correct me if I am wrong in my understanding.

Thnx!!


On 3 May 2018 at 04:10:12, Shalin Shekhar Mangar (shalinman...@gmail.com)
wrote:

The min_rf parameter does not fail indexing. It only tells you how many
replicas received the live update. So if the value is less than what you
wanted then it is up to you to retry the update later.

On Wed, May 2, 2018 at 3:33 PM, Greenhorn Techie wrote:

> Hi,
>
> Good Morning!!
>
> In the case of a SolrCloud setup with sharding and replication in place,
> when a document is sent for indexing, what happens when only the shard
> leader has indexed the document, but the replicas failed, for whatever
> reason. Will the document be resent by the leader to the replica shards to
> index the document after some time, or how is this scenario addressed?
>
> Also, given the above context, when I set the value of the min_rf parameter to
> say 2, does that mean the calling application will be informed that the
> indexing failed?
>



-- 
Regards,
Shalin Shekhar Mangar.


Re: SolrCloud replication

2018-05-02 Thread Shalin Shekhar Mangar
The min_rf parameter does not fail indexing. It only tells you how many
replicas received the live update. So if the value is less than what you
wanted then it is up to you to retry the update later.
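
For reference, a minimal SolrJ sketch of that flow (the collection name and
ZooKeeper address are placeholders, and the helper for reading the achieved
factor should be verified against your SolrJ version):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;

public class MinRfCheck {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("zk1:2181,zk2:2181,zk3:2181").build()) {  // placeholder ensemble
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-1");

      UpdateRequest req = new UpdateRequest();
      req.setParam("min_rf", "2");   // ask Solr to report how many replicas got the update
      req.add(doc);
      UpdateResponse rsp = req.process(client, "mycollection");  // placeholder collection

      // The achieved factor comes back in the response; if it is below what we
      // asked for, it is up to the client to retry the update later.
      int achieved = client.getMinAchievedReplicationFactor("mycollection", rsp.getResponse());
      if (achieved < 2) {
        // schedule a retry of this update
      }
    }
  }
}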

On Wed, May 2, 2018 at 3:33 PM, Greenhorn Techie wrote:

> Hi,
>
> Good Morning!!
>
> In the case of a SolrCloud setup with sharding and replication in place,
> when a document is sent for indexing, what happens when only the shard
> leader has indexed the document, but the replicas failed, for whatever
> reason. Will the document be resent by the leader to the replica shards to
> index the document after some time, or how is this scenario addressed?
>
> Also, given the above context, when I set the value of min_rf parameter to
> say 2, does that mean the calling application will be informed that the
> indexing failed?
>



-- 
Regards,
Shalin Shekhar Mangar.


Re: Load balanced Solr cluster not updating leader

2018-05-02 Thread Shawn Heisey
On 5/2/2018 6:23 PM, Erick Erickson wrote:
> Perhaps this is: SOLR-11660?

That definitely looks like the problem that Michael describes.  And it
indicates that restarting Solr instances after restore is a workaround.

The issue also says something that might indicate that collection reload
after restore would also fix it, but I can't be sure about that part. 
If it works, that would be far less disruptive than a Solr restart.

I've tried to reproduce the issue with the cloud example on 7.3.0, but I
can't get the collection restore to work right and give me two replicas.

Thanks,
Shawn



Re: Load balanced Solr cluster not updating leader

2018-05-02 Thread Erick Erickson
Perhaps this is: SOLR-11660?

On Wed, May 2, 2018 at 4:46 PM, Shawn Heisey  wrote:
> On 5/2/2018 3:52 PM, Michael B. Klein wrote:
>> It works ALMOST perfectly. The restore operation reports success, and if I
>> look at the UI, everything looks great in the Cloud graph view. All green,
>> one leader and two other active instances per collection.
>>
>> But once we start updating, we run into problems. The two NON-leaders in
>> each collection get the updates, but the leader never does. Since the
>> instances are behind a round robin load balancer, every third query hits an
>> out-of-date core, with unfortunate (for our near-real-time indexing
>> dependent app) results.
>
> That is completely backwards from what I would expect in a problem
> report.  The leader coordinates all indexing, so if the two other
> replicas are getting the updates, that means that at least part of the
> functionality of the leader replica *IS* working.
>
> Side FYI: Unless you're using preferLocalShards=true, Solr will actually
> load balance your load balanced requests.  If your external load
> balancer sends queries to replica1, replica1 may forward the request to
> replica3 because of SolrCloud's own internal load balancing.  The
> preferLocalShards parameter will keep that from happening *if* the
> machine receiving the query has the replicas required to satisfy the query.
>
>> Reloading the collection doesn't seem to help, but if I use the Collections
>> API to DELETEREPLICA the leader of each collection and follow it with an
>> ADDREPLICA, everything syncs up (with a new leader) and stays in sync from
>> there on out.
>>
>> I don't know what to look for in my settings or my logs to diagnose or try
>> to fix this issue. It only affects collections that have been restored from
>> backup. Any suggestions or guidance would be a big help.
>
> I don't know what to look for in the logs either, but the first thing to
> check for is any messages at WARN or ERROR logging levels.  These kinds
> of messages should also show up in the admin UI logging tab, but
> recovering the full text of those messages is much easier in the logfile
> than the admin UI.
>
> Have you tried restarting the Solr instances after restoring the
> collection?  This shouldn't be required, but at this point I'm hoping to
> at least get you limping along, even if it requires steps that are
> obvious indications of a bug.
>
> Since you're running 6.6 and 6.x is in maintenance mode, it's not likely
> that any bugs revealed will be fixed on 6.x, but maybe we can track it
> down and see if it's still a problem in 7.x.  How much pain will it
> cause you to get upgraded?
>
> Also FYI:  Two zookeeper servers is actually LESS fault tolerant than
> only having one, because if either server goes down, quorum is lost.
> You need at least three for fault tolerance.
>
> Thanks,
> Shawn
>


Re: Load balanced Solr cluster not updating leader

2018-05-02 Thread Shawn Heisey
On 5/2/2018 3:52 PM, Michael B. Klein wrote:
> It works ALMOST perfectly. The restore operation reports success, and if I
> look at the UI, everything looks great in the Cloud graph view. All green,
> one leader and two other active instances per collection.
>
> But once we start updating, we run into problems. The two NON-leaders in
> each collection get the updates, but the leader never does. Since the
> instances are behind a round robin load balancer, every third query hits an
> out-of-date core, with unfortunate (for our near-real-time indexing
> dependent app) results.

That is completely backwards from what I would expect in a problem
report.  The leader coordinates all indexing, so if the two other
replicas are getting the updates, that means that at least part of the
functionality of the leader replica *IS* working.

Side FYI: Unless you're using preferLocalShards=true, Solr will actually
load balance your load balanced requests.  If your external load
balancer sends queries to replica1, replica1 may forward the request to
replica3 because of SolrCloud's own internal load balancing.  The
preferLocalShards parameter will keep that from happening *if* the
machine receiving the query has the replicas required to satisfy the query.
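
For reference, it is just another request parameter on the query (collection
name here is hypothetical):

http://localhost:8983/solr/mycollection/select?q=*:*&preferLocalShards=true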

> Reloading the collection doesn't seem to help, but if I use the Collections
> API to DELETEREPLICA the leader of each collection and follow it with an
> ADDREPLICA, everything syncs up (with a new leader) and stays in sync from
> there on out.
>
> I don't know what to look for in my settings or my logs to diagnose or try
> to fix this issue. It only affects collections that have been restored from
> backup. Any suggestions or guidance would be a big help.

I don't know what to look for in the logs either, but the first thing to
check for is any messages at WARN or ERROR logging levels.  These kinds
of messages should also show up in the admin UI logging tab, but
recovering the full text of those messages is much easier in the logfile
than the admin UI.

Have you tried restarting the Solr instances after restoring the
collection?  This shouldn't be required, but at this point I'm hoping to
at least get you limping along, even if it requires steps that are
obvious indications of a bug.

Since you're running 6.6 and 6.x is in maintenance mode, it's not likely
that any bugs revealed will be fixed on 6.x, but maybe we can track it
down and see if it's still a problem in 7.x.  How much pain will it
cause you to get upgraded?

Also FYI:  Two zookeeper servers is actually LESS fault tolerant than
only having one, because if either server goes down, quorum is lost. 
You need at least three for fault tolerance.
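>
> For what it's worth, pointing Solr at a three-node ensemble is just a matter
> of listing all three hosts in solr.in.sh (host names and the optional /solr
> chroot are placeholders):
>
> ZK_HOST="zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/solr"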

Thanks,
Shawn



Re: Too many commits

2018-05-02 Thread Shawn Heisey
On 5/2/2018 11:45 AM, Patrick Recchia wrote:
> Is there any logging I can turn on to know when a commit happens and/or
> when a segment is flushed?

The normal INFO-level logging that Solr ships with will log all
commits.  It probably doesn't log segment flushes unless they happen as
a result of a commit, though.  The infoStream logging would have that
information.

Your autoCommit settings are ensuring that commitWithin is never going
to actually cause a commit.  Your interval for autoCommit is 60000 (one
minute), commitWithin is 500000 (a little over eight minutes).  The
autoCommit has openSearcher set to true, so there will always be a
commit with a new searcher occurring within one minute after an update
is sent, and commitWithin will never be needed.

Here's what I think I would try:  On autoCommit, set openSearcher to
false.  If you want to have less than an eight minute window for
document visibility, then reduce commitWithin to 120000.  Increase
ramBufferSizeMB to 256 or 512, which might require an increase in heap
size as well.  Instead of using commitWithin, you could configure
autoSoftCommit with a maxTime of 120000.
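
In solrconfig.xml terms, that suggestion would look roughly like this (a sketch
with the values discussed above; adjust to your own visibility requirements):

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>120000</maxTime>
  </autoSoftCommit>
</updateHandler>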

Here's some additional info about commits:

https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

The title says "SolrCloud" but the concepts are equally applicable when
not running in cloud mode.

Thanks,
Shawn



Re: Way for DataImportHandler to use bind variables

2018-05-02 Thread Shawn Heisey
On 5/2/2018 1:03 PM, Mike Konikoff wrote:
> Is there a way to configure the DataImportHandler to use bind variables for
> the entity queries? To improve database performance.

Can you clarify where these variables would come from and precisely what
you want to do?

From what I can tell, you're talking about ? placeholders in a
PreparedStatement.  Is that correct?  This works well for situations
where you are writing JDBC code, but DIH is a configuration-based setup
where the user cannot write the JDBC code.
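
(For clarity, this is the kind of JDBC construct in question; a sketch with a
made-up table and column:)

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

class BindVariableSketch {
  // Plain JDBC, outside of DIH: the '?' is the bind variable.
  static void fetch(Connection connection) throws Exception {
    try (PreparedStatement ps = connection.prepareStatement(
             "SELECT id, title FROM documents WHERE category = ?")) {  // hypothetical table
      ps.setString(1, "books");  // the driver binds the value; the statement plan can be reused
      try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
          // map columns to Solr fields here
        }
      }
    }
  }
}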

The only DIH-related code where PreparedStatement or prepareStatement
appears is in a test for DIH, not in DIH code itself.  I don't think DIH
has any support for what you want, but until you clarify exactly what
your intent is, I can't say for sure.

Thanks,
Shawn



Re: Faceting question

2018-05-02 Thread Shawn Heisey
On 5/2/2018 2:56 PM, Weffelmeyer, Stacie wrote:
> Question on faceting.  We have a dynamicField that we want to facet
> on. Below is the field and the type of information that field generates.
>
>  
>
> cid:image001.png@01D3E22D.DE028870
>

This image is not available.  This mailing list will almost always strip
attachments from email that it receives.

>    
> "*customMetadata*":["{\"controlledContent\":{\"metadata\":{\"programs\":[\"program1\"],\"departments\":[\"department1\"],\"locations\":[\"location1\"],\"functions\":[\"function1\"],\"customTags\":[\"customTag1\",\"customTag2\"],\"corporate\":false,\"redline\":false},\"who\":{\"lastUpdateDate\":\"2018-04-26T14:35:02.268Z\",\"creationDate\":\"2018-04-26T14:35:01.445Z\",\"createdBy\":38853},\"clientOwners\":[38853],\"clientLastUpdateDate\":\"2018-04-25T21:15:06.000Z\",\"clientCreationDate\":\"2018-04-25T20:58:34.000Z\",\"clientContentId\":\"DOC-8030\",\"type\":{\"applicationId\":2574,\"code\":\"WI\",\"name\":\"Work
> Instruction\",\"id\":\"5ac3d4d111570f0047a8ceb9\"},\"status\":\"active\",\"version\":1}}"],
>

I do not know what this is.  It looks a little like JSON.  But if it's
json, there are a lot of escaped quotes in it, and I don't really know
what I'm looking at.

>  
>
> It will always have customMetadata.controlledContent.metadata
>
>  
>
> Then from metadata, it could be anything, which is why it is a
> dynamicField.
>
>  
>
> In this example there is
>
> customMetadata.controlledContent.metadata.programs
>
> customMetadata.controlledContent.metadata.departments
>
> customMetadata.controlledContent.metadata.locations
>

Solr does not have the concept of a nested data type.  So how are you
getting from all that text above to period-delimited strings in a
hierarchy?  If you're using some kind of custom plugin for Solr to have
it support something it doesn't do out of the box, you're probably going
to need to talk to the author of that plugin.

Solr's dynamicField support is only dynamic in the sense that the
precise field name is not found in the schema.  The field name is
dynamic.  When it comes to what's IN the field, it doesn't matter
whether it's a dynamic field or not.

> If I enable faceting, it will do so with the field customMetadata. But
> it doesn’t help because it separates every space as a term.  But
> ideally I want to facet on customMetadata.controlledContent.metadata.
> Doing so brings back no facets.
>
>  
>
> Is this possible?  How can we best accomplish this?
>

We will need to understand exactly what you are indexing, what's in your
schema, the exact query requests you are sending, and what you are
expecting back.

Thanks,
Shawn



Re: Indexing throughput

2018-05-02 Thread Shawn Heisey
On 5/2/2018 10:58 AM, Greenhorn Techie wrote:
> The current hardware profile for our production cluster is 20 nodes, each
> with 24cores and 256GB memory. Data being indexed is very structured in
> nature and is about 30 columns or so, out of which half of them are
> categorical with a defined list of values. The expected peak indexing
> throughput is to be about *50,000* documents per second (expected to be done
> at off-peak hours so that search requests will be minimal during this time)
> and the average throughput around *10,000* documents (normal business
> hours).
>
> Given the hardware profile, is it realistic and practical to achieve the
> desired throughput? What factors affect the performance of indexing apart
> from the above hardware characteristics? I understand that its very
> difficult to provide any guidance unless a prototype is done. But wondering
> what are the considerations and dependencies we need to be aware of and
> whether our throughput expectations are realistic or not.

50,000 docs per second is not a slow indexing rate.  It has been
achieved, and as Erick noted, surpassed by a very large margin.  Whether
you can get there with your planned hardware on your index is not a
question that I can answer.  If I had to guess, I think that as long as
the source system can push the data that fast, it SHOULD be possible to
create an indexing system that can do it.

The important thing to do for fast indexing with Solr is to have a lot
of threads or processes indexing all at the same time.  Indexing with a
single thread will not achieve the fastest possible performance.

Since you're planning SolrCloud, you should put some effort into having
your indexing system be aware of your cluster state and the shard
routing so that it can send indexing requests directly to shard
leaders.  Indexing is faster if Solr doesn't need to forward requests. 
The SolrJ client named "CloudSolrClient" is always aware of the
clusterstate.  So if you can use that, updates can always be sent to the
leaders.
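
A bare-bones SolrJ sketch of that approach (ZooKeeper address, collection name,
field names and batch size are all placeholders); run several of these in
parallel threads or processes to get the multi-threaded indexing mentioned above:

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
  public static void main(String[] args) throws Exception {
    // CloudSolrClient watches the cluster state in ZooKeeper and routes each
    // document to the correct shard leader, so Solr does not have to forward it.
    try (CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("zk1:2181,zk2:2181,zk3:2181").build()) {
      client.setDefaultCollection("mycollection");
      List<SolrInputDocument> batch = new ArrayList<>();
      for (int i = 0; i < 100_000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-" + i);
        doc.addField("title_s", "document " + i);
        batch.add(doc);
        if (batch.size() == 1000) {   // send batches, not one document at a time
          client.add(batch);
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        client.add(batch);
      }
      // rely on autoCommit / commitWithin for visibility rather than committing per batch
    }
  }
}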

Thanks,
Shawn



Re: Introducing a stopword in a query causes ExtendedDismaxQueryParser to produce a radically different parsed query

2018-05-02 Thread Doug Turnbull
This is a problem that we’ve noted too.

This blog post discusses the underlying cause
https://opensourceconnections.com/blog/2018/02/20/edismax-and-multiterm-synonyms-oddities/

Hope that helps
On Wed, May 2, 2018 at 3:07 PM Chris Wilt  wrote:

> I began with a 7.2.1 solr instance using the techproducts sample data.
> Next, I added "a" as a stopword (there were originally no stopwords).
>
>
>
> I tried two queries: "x a b" and "x b".
>
> Here are the raw query parameters:
> q=x b&fl=id,score,price&sort=score desc&qf=name^0.75 manu cat^3.0
> features^10.0&defType=edismax
>
>
> and
> q=x a b&fl=id,score,price&sort=score desc&qf=name^0.75 manu cat^3.0
> features^10.0&defType=edismax
>
>
> The idea is that I want different weights for the different fields, and I
> want to be able to take the score of each term from its best field, i.e.
> score the "x" from its match against the "cat" field and the "b" against
> the "features" field.
>
> When I have "x b" I get this behavior exactly, with the parsed query as
> follows:
> +(((name:x)^0.75 | manu:x | (features:x)^10.0 | (cat:x)^3.0)
> ((name:b)^0.75 | manu:b | (features:b)^10.0 | (cat:b)^3.0))
>
>
> When I use "x a b" I instead get:
> +((name:x name:b)^0.75 | (manu:x manu:b) | (features:x features:b)^10.0 |
> (cat:x cat:a cat:b)^3.0)
>
>
> With the "x a b" query suppose document 1 matches "x" in "features" and
> matches "b" in "cat". This document will get a single score based upon
> either its "x" or its "b", but the score will not be the sum, as would have
> been the case had the query been, "x b".
>
>
>
>
> How do I get the edismax parser to behave the same for queries with
> stopwords as it does without stopwords, keeping the behavior constant for
> queries with no stopwords?
>
>
> I tried using the stopwords parameter, but I get the same results with
> that parameter taking the value of true or false. I also tried using the
> tie parameter, but the tie parameter seems to change a max to a sum (it sums
> up the scores of each field for each query term, rather than taking the max
> of all fields for how well they match a query term).
>
-- 
CTO, OpenSource Connections
Author, Relevant Search
http://o19s.com/doug


Load balanced Solr cluster not updating leader

2018-05-02 Thread Michael B. Klein
Hi all,

I've encountered a reproducible and confusing issue with our Solr 6.6
cluster. (Updating to 7.x is an option, but not an immediate one.) This is
in our staging environment, running on AWS. To save money, we scale our
entire stack down to zero instances every night and spin it back up every
morning. Here's the process:

SCALE DOWN:
1) Commit & Optimize all collections.
2) Back up each collection to a shared volume (using the Collections API).
3) Spin down all (3) solr instances.
4) Spin down all (2) zookeeper instances.

SPIN UP:
1) Spin up zookeeper instances; wait for the instances to find each other
and the ensemble to stabilize.
2) Spin up solr instances; wait for them all to stabilize and for zookeeper
to recognize them as live nodes.
3) Restore each collection (using the Collections API).

It works ALMOST perfectly. The restore operation reports success, and if I
look at the UI, everything looks great in the Cloud graph view. All green,
one leader and two other active instances per collection.

But once we start updating, we run into problems. The two NON-leaders in
each collection get the updates, but the leader never does. Since the
instances are behind a round robin load balancer, every third query hits an
out-of-date core, with unfortunate (for our near-real-time indexing
dependent app) results.

Reloading the collection doesn't seem to help, but if I use the Collections
API to DELETEREPLICA the leader of each collection and follow it with an
ADDREPLICA, everything syncs up (with a new leader) and stays in sync from
there on out.

I don't know what to look for in my settings or my logs to diagnose or try
to fix this issue. It only affects collections that have been restored from
backup. Any suggestions or guidance would be a big help.

Thanks,
Michael

-- 
Michael B. Klein
Lead Developer, Repository Development and Administration
Northwestern University Libraries


Faceting question

2018-05-02 Thread Weffelmeyer, Stacie
Hi,

Question on faceting.  We have a dynamicField that we want to facet on. Below 
is the field and the type of information that field generates.

[cid:image001.png@01D3E22D.DE028870]


"customMetadata":["{\"controlledContent\":{\"metadata\":{\"programs\":[\"program1\"],\"departments\":[\"department1\"],\"locations\":[\"location1\"],\"functions\":[\"function1\"],\"customTags\":[\"customTag1\",\"customTag2\"],\"corporate\":false,\"redline\":false},\"who\":{\"lastUpdateDate\":\"2018-04-26T14:35:02.268Z\",\"creationDate\":\"2018-04-26T14:35:01.445Z\",\"createdBy\":38853},\"clientOwners\":[38853],\"clientLastUpdateDate\":\"2018-04-25T21:15:06.000Z\",\"clientCreationDate\":\"2018-04-25T20:58:34.000Z\",\"clientContentId\":\"DOC-8030\",\"type\":{\"applicationId\":2574,\"code\":\"WI\",\"name\":\"Work
 
Instruction\",\"id\":\"5ac3d4d111570f0047a8ceb9\"},\"status\":\"active\",\"version\":1}}"],



It will always have customMetadata.controlledContent.metadata

Then from metadata, it could be anything, which is why it is a dynamicField.

In this example there is
customMetadata.controlledContent.metadata.programs
customMetadata.controlledContent.metadata.departments
customMetadata.controlledContent.metadata.locations
etc.

If I enable faceting, it will do so with the field customMetadata. But it 
doesn’t help because it separates every space as a term.  But ideally I want to 
facet on customMetadata.controlledContent.metadata. Doing so brings back no 
facets.

Is this possible?  How can we best accomplish this?

Thank you,
Stacie Weffelmeyer
World Wide Technology, Inc.


Re: Median Date

2018-05-02 Thread Jim Freeby
 All,
percentiles only work with numbers, not dates.
If I use the ms function, I can get the number of milliseconds between NOW and 
the import date.  Then we can use that result in calculating the median age of 
the documents using percentiles.
rows=0&stats=true&stats.field={!tag=piv1 percentiles='50' func}ms(NOW,
importDate)&facet=true&facet.pivot={!stats=piv1}status
I hope this helps someone else :)  Also, let me know if there's a better way to 
do this.
Cheers,

Jim


On Tuesday, May 1, 2018 03:27:10 PM PDT, Jim Freeby wrote:
 
 All,
We have a dateImported field in our schema.
I'd like to generate a statistic showing the median dateImported (actually we 
want median age of the documents, based on the dateImported value).
I have other stats that calculate the median value of numbers (like price).
This was achieved with something like:
rows=0&stats=true&stats.field={!tag=piv1
percentiles='50'}price&facet=true&facet.pivot={!stats=piv1}status
I have not found a way to calculate the median dateImported.  The mean works, 
but we  need median.
Any help would be appreciated?
Cheers,

Jim  

RE: User queries end up in filterCache if facetting is enabled

2018-05-02 Thread Markus Jelsma
Hello,

Anyone here to reproduce this oddity? It shows up in all our collections once 
we enable the stats page to show filterCache entries.

Is this normal? Am i completely missing something?

Thanks,
Markus

 
 
-Original message-
> From:Markus Jelsma 
> Sent: Tuesday 1st May 2018 17:32
> To: Solr-user 
> Subject: User queries end up in filterCache if facetting is enabled
> 
> Hello,
> 
> We noticed the number of entries of the filterCache to be higher than we 
> expected, using showItems="1024" something unexpected was listed as entries 
> of the filterCache, the complete Query.toString() of our user queries, 
> massive entries, a lot of them.
> 
> We also spotted all entries of fields we facet on, even though we don't use 
> them as filters, but that is caused by facet.method=enum, and should be 
> expected, right?
> 
> Now, the user query entries are not expected. In the simplest set up, 
> searching for something and only enabling the facet engine with facet=true 
> causes it to appears in the cache as an entry. The following queries:
> 
> http://localhost:8983/solr/search/select?q=content_nl:nog&facet=true
> http://localhost:8983/solr/search/select?q=*:*&facet=true
> 
> become listed as:
> 
> CACHE.searcher.filterCache.item_*:*:
> org.apache.solr.search.BitDocSet@70051ee0
> 
> CACHE.searcher.filterCache.item_content_nl:nog:
> org.apache.solr.search.BitDocSet@13150cf6
> 
> This is on 7.3, but 7.2.1 does this as well. 
> 
> So, should i expect this? Can i disable this? Bug?
> 
> 
> Thanks,
> Markus
> 
> 
> 
> 


Re: Solr Heap usage

2018-05-02 Thread Greenhorn Techie
Thanks Shawn for the inputs, which will definitely help us to scale our
cluster better.

Regards


On 2 May 2018 at 18:15:12, Shawn Heisey (apa...@elyograg.org) wrote:

On 5/1/2018 5:33 PM, Greenhorn Techie wrote:
> Wondering what are the considerations to be aware to arrive at an optimal
> heap size for Solr JVM? Though I did discuss this on the IRC, I am still
> unclear on how Solr uses the JVM heap space. Are there any pointers to
> understand this aspect better?

I'm one of the people you've been chatting with on IRC.

I also wrote the wiki page that Susheel has recommended to you.

> Given that Solr requires an optimally configured heap, so that the
> remaining unused memory can be used for OS disk cache, I wonder how to best
> configure Solr heap. Also, on the IRC it was discussed that having 31GB of
> heap is better than having 32GB due to Java’s internal usage of heap. Can
> anyone guide further on heap configuration please?

With the index size you mentioned on IRC, it's very difficult to project
how much heap you're going to need. Actually setting up a system,
putting data on it, and firing real queries at it may be the only way to
be sure.

The only concrete advice I can give you with the information available
is this: Install as much memory as you can. It is extremely unlikely
that you would ever have too much memory when you're dealing with
terabyte-scale indexes.

Heavy indexing (which you have mentioned as a requirement in another
thread) will tend to require a larger heap.

Thanks,
Shawn


Re: Indexing throughput

2018-05-02 Thread Greenhorn Techie
Thanks Walter and Erick for the valuable suggestions. We shall try out
various values for shards, as well as the other tuning metrics I discussed in
various threads earlier.

Kind Regards


On 2 May 2018 at 18:24:31, Erick Erickson (erickerick...@gmail.com) wrote:

I've seen 1.5 M docs/second. Basically the indexing throughput is gated
by two things:
1> the number of shards. Indexing throughput essentially scales up
reasonably linearly with the number of shards.
2> the indexing program that pushes data to Solr. Before thinking Solr
is the bottleneck, check how fast your ETL process is pushing docs.

This pre-supposes using SolrJ and CloudSolrClient for the final push
to Solr. This pre-buckets the updates and sends the updates for each
shard to the shard leader, thus reducing the amount of work Solr has
to do. If you use SolrJ, you can easily do <2> above by just
commenting out the single call that pushes the docs to Solr in your
program.

Speaking of which, it's definitely best to batch the updates, see:
https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/

Best,
Erick

On Wed, May 2, 2018 at 10:07 AM, Walter Underwood 
wrote:
> We have a similar sized cluster, 32 nodes with 36 processors and 60 Gb
RAM each
> (EC2 C4.8xlarge). The collection is 24 million documents with four
shards. The cluster
> is Solr 6.6.2. All storage is SSD EBS.
>
> We built a simple batch loader in Java. We get about one million
documents per minute
> with 64 threads. We do not use the cloud-smart SolrJ client. We just send
all the
> batches to the load balancer and let Solr sort it out.
>
> You are looking for 3 million documents per minute. You will just have to
test that.
>
> I haven’t tested it, but indexing should speed up linearly with the
number of shards,
> because those are indexing in parallel.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
>> On May 2, 2018, at 9:58 AM, Greenhorn Techie 
wrote:
>>
>> Hi,
>>
>> The current hardware profile for our production cluster is 20 nodes,
each
>> with 24cores and 256GB memory. Data being indexed is very structured in
>> nature and is about 30 columns or so, out of which half of them are
>> categorical with a defined list of values. The expected peak indexing
> >> throughput is to be about *50,000* documents per second (expected to be done
> >> at off-peak hours so that search requests will be minimal during this time)
> >> and the average throughput around *10,000* documents (normal business
>> hours).
>>
>> Given the hardware profile, is it realistic and practical to achieve the
>> desired throughput? What factors affect the performance of indexing
apart
>> from the above hardware characteristics? I understand that its very
>> difficult to provide any guidance unless a prototype is done. But
wondering
>> what are the considerations and dependencies we need to be aware of and
>> whether our throughput expectations are realistic or not.
>>
>> Thanks
>


Introducing a stopword in a query causes ExtendedDismaxQueryParser to produce a radically different parsed query

2018-05-02 Thread Chris Wilt
I began with a 7.2.1 solr instance using the techproducts sample data. Next, I 
added "a" as a stopword (there were originally no stopwords). 



I tried two queries: "x a b" and "x b". 

Here are the raw query parameters:
q=x b&fl=id,score,price&sort=score desc&qf=name^0.75 manu cat^3.0
features^10.0&defType=edismax


and
q=x a b&fl=id,score,price&sort=score desc&qf=name^0.75 manu cat^3.0
features^10.0&defType=edismax


The idea is that I want different weights for the different fields, and I want 
to be able to take the score of each term from its best field, i.e. score the 
"x" from its match against the "cat" field and the "b" against the "features" 
field. 

When I have "x b" I get this behavior exactly, with the parsed query as 
follows: 
+(((name:x)^0.75 | manu:x | (features:x)^10.0 | (cat:x)^3.0) ((name:b)^0.75 | 
manu:b | (features:b)^10.0 | (cat:b)^3.0)) 


When I use "x a b" I instead get: 
+((name:x name:b)^0.75 | (manu:x manu:b) | (features:x features:b)^10.0 | 
(cat:x cat:a cat:b)^3.0) 


With the "x a b" query suppose document 1 matches "x" in "features" and matches 
"b" in "cat". This document will get a single score based upon either its "x" 
or its "b", but the score will not be the sum, as would have been the case had 
the query been, "x b". 




How do I get the edismax parser to behave the same for queries with stopwords 
as it does without stopwords, keeping the behavior constant for queries with no 
stopwords? 


I tried using the stopwords parameter, but I get the same results with that 
parameter taking the value of true or false. I also tried using the tie 
parameter, but the tie parameter seems to change a max to a sum (it sums up the 
scores of each field for each query term, rather than taking the max of all 
fields for how well they match a query term). 


Way for DataImportHandler to use bind variables

2018-05-02 Thread Mike Konikoff
Is there a way to configure the DataImportHandler to use bind variables for
the entity queries? To improve database performance.

Thanks,

Mike


Re: Too many commits

2018-05-02 Thread Erick Erickson
You can turn on "infoStream", but that is _very_ voluminous. The
regular Solr logs at INFO level should show commits, though.



On Wed, May 2, 2018 at 10:45 AM, Patrick Recchia
 wrote:
> Shawn,
> thank you very much for your answer.
>
>
> On Wed, May 2, 2018 at 6:27 PM, Shawn Heisey  wrote:
>
>> On 5/2/2018 4:54 AM, Patrick Recchia wrote:
>> > I'm seeing way too many commits on our solr cluster, and I don't know
>> why.
>>
>> Are you sure there are commits happening?  Do you have logs actually
>> saying that a commit is occurring?  The creation of a new segment does
>> not necessarily mean a commit happened -- this can happen even without a
>> commit.
>>
>
> You're right, I assumed a new segment would be created only as part of a
> commit; but I realize now that there can be other situations.
>
> Is there any logging I can turn on to know when a commit happens and/or
> when a segment is flushed?
>
> I would be very interested in that
> I've already enabled InfoStream logging from the IndexWriter, but have
> found nothing yet there to help me understand that
>
>
>
>> > - IndexConfig is set to autoCommit every minute:
>> >
>> > <autoCommit> <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
>> > <openSearcher>true</openSearcher> </autoCommit>
>> >
>> > (solr.autoCommit.maxTime is not set)
>>
>> It's recommended to set openSearcher to false on autoCommit.  Do you
>> have autoSoftCommit configured?
>>
>
> autoSoftCommit is left at its default '-1' (which means infinity, I
> suppose).
>
>
>
>>
>> > There is nothing else customized (when it comes to IndexWriter, at least)
>> > within solrconfig.xml
>> >
>> > The data is sent without commit, but with commitWithin=500000 ms.
>> >
>> > All that said, I would have expected a rate of about 1 segment created per
>> > minute, of about 100MB.
>>
>> One of the events that can cause a new segment to be flushed is the ram
>> buffer filling up.  Solr defaults to a ramBufferSizeMB value of 100.
>> But that does not translate to a segment size of 100MB -- it's merely
>> the size of the ram buffer that Lucene uses for all the work related to
>> building a segment.  A segment resulting from a full memory buffer is
>> going to be smaller than the buffer.  I do not know how MUCH smaller, or
>> what causes variations in that size.
>>
>> The general advice is to leave the buffer size alone.  But with the high
>> volume you've got, you might want to increase it so segments are not
>> flushed as frequently.  Be aware that increasing it will have an impact
>> on how much heap memory gets used.  Every Solr core (shard replica in
>> SolrCloud terminology) that does indexing is going to need one of these
>> ram buffers.
>>
>
> I will definitely investigate this ramBufferSizeMB.
> And, see through lucene code when a segment is flushed.
>
> Again, many thanks.
> Patrick


Re: Too many commits

2018-05-02 Thread Patrick Recchia
Shawn,
thank you very much for your answer.


On Wed, May 2, 2018 at 6:27 PM, Shawn Heisey  wrote:

> On 5/2/2018 4:54 AM, Patrick Recchia wrote:
> > I'm seeing way too many commits on our solr cluster, and I don't know
> why.
>
> Are you sure there are commits happening?  Do you have logs actually
> saying that a commit is occurring?  The creation of a new segment does
> not necessarily mean a commit happened -- this can happen even without a
> commit.
>

You're right, I assumed a new segment would be created only as part of a
commit; but I realize now that there can be other situations.

Is there any logging I can turn on to know when a commit happens and/or
when a segment is flushed?

I would be very interested in that
I've already enabled InfoStream logging from the IndexWriter, but have
found nothing yet there to help me understand that



> > - IndexConfig is set to autoCommit every minute:
> >
> > <autoCommit> <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
> > <openSearcher>true</openSearcher> </autoCommit>
> >
> > (solr.autoCommit.maxTime is not set)
>
> It's recommended to set openSearcher to false on autoCommit.  Do you
> have autoSoftCommit configured?
>

autoSoftCommit is left at its default '-1' (which means infinity, I
suppose).



>
> > There is nothing else customized (when it comes to IndexWriter, at least)
> > within solrconfig.xml
> >
> > The data is sent without commit, but with commitWithin=500000 ms.
> >
> > All that said, I would have expected a rate of about 1 segment created per
> > minute, of about 100MB.
>
> One of the events that can cause a new segment to be flushed is the ram
> buffer filling up.  Solr defaults to a ramBufferSizeMB value of 100.
> But that does not translate to a segment size of 100MB -- it's merely
> the size of the ram buffer that Lucene uses for all the work related to
> building a segment.  A segment resulting from a full memory buffer is
> going to be smaller than the buffer.  I do not know how MUCH smaller, or
> what causes variations in that size.
>
> The general advice is to leave the buffer size alone.  But with the high
> volume you've got, you might want to increase it so segments are not
> flushed as frequently.  Be aware that increasing it will have an impact
> on how much heap memory gets used.  Every Solr core (shard replica in
> SolrCloud terminology) that does indexing is going to need one of these
> ram buffers.
>

I will definitely investigate this ramBufferSizeMB,
and look through the Lucene code to see when a segment is flushed.

Again, many thanks.
Patrick


Re: Shard size variation

2018-05-02 Thread Erick Erickson
You can always increase the maximum segment size. For large indexes
that should reduce the number of segments. But watch your indexing
stats; I can't predict the consequences of bumping it to 100G, for
instance. I'd _expect_ bursty I/O when those large segments started
to be created or merged.

You'll be interested in LUCENE-7976 (Solr 7.4?), especially (probably)
the idea of increasing the segment sizes and/or a related JIRA that
allows you to tweak how aggressively solr merges segments that have
deleted docs.

NOTE: that JIRA has the consequence that _by default_ the optimize
with no parameters respects the maximum segment size, which is a
change from now.

Finally, expungeDeletes may be useful as that too will respect max
segment size, again after LUCENE-7976 is committed.
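
For reference, a sketch of raising the maximum merged segment size in
solrconfig.xml (the value is only illustrative; the default is 5 GB):

<indexConfig>
  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <double name="maxMergedSegmentMB">20480</double>
  </mergePolicyFactory>
</indexConfig>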

Best,
Erick

On Wed, May 2, 2018 at 9:22 AM, Michael Joyner  wrote:
> The main reason we go this route is that after a while (with default
> settings) we end up with hundreds of segments and performance of course drops
> abysmally as a result. By using a stepped optimize a) we don't run into the
> "we need 3x+ head room" issue, b) the performance penalty during the
> optimize is less than the performance penalty of the hundreds of segments
> not being optimized.
>
> BTW, as we use a batched insert/update cycle [once daily], we only
> optimize down to a single segment after a complete batch has been run. Though
> during the batch we reduce segment counts down to a max of 16 every 250K
> insert/updates to prevent the large segment count performance penalty.
>
>
> On 04/30/2018 07:10 PM, Erick Erickson wrote:
>>
>> There's really no good way to purge deleted documents from the index
>> other than to wait until merging happens.
>>
>> Optimize/forceMerge and expungeDeletes both suffer from the problem
>> that they create massive segments that then stick around for a very
>> long time, see:
>>
>> https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/
>>
>> Best,
>> Erick
>>
>> On Mon, Apr 30, 2018 at 1:56 PM, Michael Joyner 
>> wrote:
>>>
>>> Based on experience, 2x head room is not always enough, sometimes
>>> not even 3x, if you are optimizing from many segments down to 1 segment
>>> in a
>>> single go.
>>>
>>> We have however figured out a way that can work with as little as 51%
>>> free
>>> space via the following iteration cycle:
>>>
>>> public void solrOptimize() {
>>>  int initialMaxSegments = 256;
>>>  int finalMaxSegments = 1;
>>>  if (isShowSegmentCounter()) {
>>>  log.info("Optimizing ...");
>>>  }
>>>  try (SolrClient solrServerInstance = getSolrClientInstance()){
>>>  for (int segments=initialMaxSegments;
>>> segments>=finalMaxSegments; segments--) {
>>>  if (isShowSegmentCounter()) {
>>>  System.out.println("Optimizing to a max of
>>> "+segments+"
>>> segments.");
>>>  }
>>>  solrServerInstance.optimize(true, true, segments);
>>>  }
>>>  } catch (SolrServerException | IOException e) {
>>>  throw new RuntimeException(e);
>>>
>>>  }
>>>  }
>>>
>>>
>>> On 04/30/2018 04:23 PM, Walter Underwood wrote:

 You need 2X the minimum index size in disk space anyway, so don’t worry
 about keeping the indexes as small as possible. Worry about having
 enough
 headroom.

 If your indexes are 250 GB, you need 250 GB of free space.

 wunder
 Walter Underwood
 wun...@wunderwood.org
 http://observer.wunderwood.org/  (my blog)

> On Apr 30, 2018, at 1:13 PM, Antony A  wrote:
>
> Thanks Erick/Deepak.
>
> The cloud is running on baremetal (128 GB/24 cpu).
>
> Is there an option to run a compact on the data files to make the size
> equal on both the clouds? I am trying to find all the options before I add the
> new fields into the production cloud.
>
> Thanks
> AA
>
> On Mon, Apr 30, 2018 at 10:45 AM, Erick Erickson
> 
> wrote:
>
>> Anthony:
>>
>> You are probably seeing the results of removing deleted documents from
>> the shards as they're merged. Even on replicas in the same _shard_,
>> the size of the index on disk won't necessarily be identical. This has
>> to do with which segments are selected for merging, which are not
>> necessarily coordinated across replicas.
>>
>> The test is if the number of docs on each collection is the same. If
>> it is, then don't worry about index sizes.
>>
>> Best,
>> Erick
>>
>> On Mon, Apr 30, 2018 at 9:38 AM, Deepak Goel 
>> wrote:
>>>
>>> Could you please also give the machine details of the two clouds you
>>> are
>>> running?
>>>
>>>
>>>
>>> Deepak
>>> "The greatness of a 

Re: Indexing throughput

2018-05-02 Thread Erick Erickson
I've seen 1.5 M docs/second. Basically the indexing throughput is gated
by two things:
1> the number of shards. Indexing throughput essentially scales up
reasonably linearly with the number of shards.
2> the indexing program that pushes data to Solr. Before thinking Solr
is the bottleneck, check how fast your ETL process is pushing docs.

This pre-supposes using SolrJ and CloudSolrClient for the final push
to Solr. This pre-buckets the updates and sends the updates for each
shard to the shard leader, thus reducing the amount of work Solr has
to do. If you use SolrJ, you can easily do <2> above by just
commenting out the single call that pushes the docs to Solr in your
program.

Speaking of which, it's definitely best to batch the updates, see:
https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/

Best,
Erick

On Wed, May 2, 2018 at 10:07 AM, Walter Underwood  wrote:
> We have a similar sized cluster, 32 nodes with 36 processors and 60 Gb RAM 
> each
> (EC2 C4.8xlarge). The collection is 24 million documents with four shards. 
> The cluster
> is Solr 6.6.2. All storage is SSD EBS.
>
> We built a simple batch loader in Java. We get about one million documents 
> per minute
> with 64 threads. We do not use the cloud-smart SolrJ client. We just send all 
> the
> batches to the load balancer and let Solr sort it out.
>
> You are looking for 3 million documents per minute. You will just have to 
> test that.
>
> I haven’t tested it, but indexing should speed up linearly with the number of 
> shards,
> because those are indexing in parallel.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>> On May 2, 2018, at 9:58 AM, Greenhorn Techie  
>> wrote:
>>
>> Hi,
>>
>> The current hardware profile for our production cluster is 20 nodes, each
>> with 24cores and 256GB memory. Data being indexed is very structured in
>> nature and is about 30 columns or so, out of which half of them are
>> categorical with a defined list of values. The expected peak indexing
>> throughput is to be about *50,000* documents per second (expected to be done
>> at off-peak hours so that search requests will be minimal during this time)
>> and the average throughput around *10,000* documents (normal business
>> hours).
>>
>> Given the hardware profile, is it realistic and practical to achieve the
>> desired throughput? What factors affect the performance of indexing apart
>> from the above hardware characteristics? I understand that its very
>> difficult to provide any guidance unless a prototype is done. But wondering
>> what are the considerations and dependencies we need to be aware of and
>> whether our throughput expectations are realistic or not.
>>
>> Thanks
>


Re: Solr Heap usage

2018-05-02 Thread Shawn Heisey
On 5/1/2018 5:33 PM, Greenhorn Techie wrote:
> Wondering what are the considerations to be aware to arrive at an optimal
> heap size for Solr JVM? Though I did discuss this on the IRC, I am still
> unclear on how Solr uses the JVM heap space. Are there any pointers to
> understand this aspect better?

I'm one of the people you've been chatting with on IRC.

I also wrote the wiki page that Susheel has recommended to you.

> Given that Solr requires an optimally configured heap, so that the
> remaining unused memory can be used for OS disk cache, I wonder how to best
> configure Solr heap. Also, on the IRC it was discussed that having 31GB of
> heap is better than having 32GB due to Java’s internal usage of heap. Can
> anyone guide further on heap configuration please?

With the index size you mentioned on IRC, it's very difficult to project
how much heap you're going to need.  Actually setting up a system,
putting data on it, and firing real queries at it may be the only way to
be sure.

The only concrete advice I can give you with the information available
is this:  Install as much memory as you can.  It is extremely unlikely
that you would ever have too much memory when you're dealing with
terabyte-scale indexes.

Heavy indexing (which you have mentioned as a requirement in another
thread) will tend to require a larger heap.

Thanks,
Shawn


Re: Indexing throughput

2018-05-02 Thread Walter Underwood
We have a similar sized cluster, 32 nodes with 36 processors and 60 Gb RAM each
(EC2 C4.8xlarge). The collection is 24 million documents with four shards. The 
cluster
is Solr 6.6.2. All storage is SSD EBS.

We built a simple batch loader in Java. We get about one million documents per 
minute
with 64 threads. We do not use the cloud-smart SolrJ client. We just send all 
the
batches to the load balancer and let Solr sort it out.

You are looking for 3 million documents per minute. You will just have to test 
that.

I haven’t tested it, but indexing should speed up linearly with the number of 
shards,
because those are indexing in parallel.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 2, 2018, at 9:58 AM, Greenhorn Techie  
> wrote:
> 
> Hi,
> 
> The current hardware profile for our production cluster is 20 nodes, each
> with 24cores and 256GB memory. Data being indexed is very structured in
> nature and is about 30 columns or so, out of which half of them are
> categorical with a defined list of values. The expected peak indexing
> throughput is to be about *50,000* documents per second (expected to be done
> at off-peak hours so that search requests will be minimal during this time)
> and the average throughput around *10,000* documents (normal business
> hours).
> 
> Given the hardware profile, is it realistic and practical to achieve the
> desired throughput? What factors affect the performance of indexing apart
> from the above hardware characteristics? I understand that its very
> difficult to provide any guidance unless a prototype is done. But wondering
> what are the considerations and dependencies we need to be aware of and
> whether our throughput expectations are realistic or not.
> 
> Thanks



Indexing throughput

2018-05-02 Thread Greenhorn Techie
Hi,

The current hardware profile for our production cluster is 20 nodes, each
with 24cores and 256GB memory. Data being indexed is very structured in
nature and is about 30 columns or so, out of which half of them are
categorical with a defined list of values. The expected peak indexing
throughput is to be about *50,000* documents per second (expected to be done
at off-peak hours so that search requests will be minimal during this time)
and the average throughput around *10,000* documents (normal business
hours).

Given the hardware profile, is it realistic and practical to achieve the
desired throughput? What factors affect the performance of indexing apart
from the above hardware characteristics? I understand that its very
difficult to provide any guidance unless a prototype is done. But wondering
what are the considerations and dependencies we need to be aware of and
whether our throughput expectations are realistic or not.

Thanks


Re: Learning to Rank (LTR) with grouping

2018-05-02 Thread ilayaraja
Figured out that the offset is used as part of the grouping patch which I applied
(SOLR-8776):
solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java
+  if (query instanceof AbstractReRankQuery) {
+    topNGroups = cmd.getOffset() + ((AbstractReRankQuery)query).getReRankDocs();
+  } else {
+    topNGroups = cmd.getOffset() + cmd.getLen();






-
--Ilay
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Too many commits

2018-05-02 Thread Shawn Heisey
On 5/2/2018 4:54 AM, Patrick Recchia wrote:
> I'm seeing way too many commits on our solr cluster, and I don't know why.

Are you sure there are commits happening?  Do you have logs actually
saying that a commit is occurring?  The creation of a new segment does
not necessarily mean a commit happened -- this can happen even without a
commit.

> - IndexConfig is set to autoCommit every minute:
>
> <autoCommit> <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
> <openSearcher>true</openSearcher> </autoCommit>
>
> (solr.autoCommit.maxTime is not set)

It's recommended to set openSearcher to false on autoCommit.  Do you
have autoSoftCommit configured?

> There is nothing else customized (when it comes to IndexWriter, at least)
> within solrconfig.xml
>
> The data is sent without commit, but with commitWithin=500000 ms.
>
> All that said, I would have expected a rate of about 1 segment created per
> minute, of about 100MB.

One of the events that can cause a new segment to be flushed is the ram
buffer filling up.  Solr defaults to a ramBufferSizeMB value of 100. 
But that does not translate to a segment size of 100MB -- it's merely
the size of the ram buffer that Lucene uses for all the work related to
building a segment.  A segment resulting from a full memory buffer is
going to be smaller than the buffer.  I do not know how MUCH smaller, or
what causes variations in that size.

The general advice is to leave the buffer size alone.  But with the high
volume you've got, you might want to increase it so segments are not
flushed as frequently.  Be aware that increasing it will have an impact
on how much heap memory gets used.  Every Solr core (shard replica in
SolrCloud terminology) that does indexing is going to need one of these
ram buffers.
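
For reference, the buffer is set in the indexConfig section of solrconfig.xml,
e.g. (value illustrative):

<indexConfig>
  <ramBufferSizeMB>256</ramBufferSizeMB>
</indexConfig>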

Thanks,
Shawn



Re: Shard size variation

2018-05-02 Thread Michael Joyner
The main reason we go this route is that after a while (with default 
settings) we end up with hundreds of segments and performance of course 
drops abysmally as a result. By using a stepped optimize a) we don't run 
into the "we need 3x+ head room" issue, b) the performance penalty during 
the optimize is less than the performance penalty of the hundreds of 
segments not being optimized.


BTW, as we use a batched insert/update cycle [once daily], we only 
optimize down to a single segment after a complete batch has been run. 
Though during the batch we reduce segment counts down to a max of 16 
every 250K insert/updates to prevent the large segment count performance 
penalty.



On 04/30/2018 07:10 PM, Erick Erickson wrote:

There's really no good way to purge deleted documents from the index
other than to wait until merging happens.

Optimize/forceMerge and expungeDeletes both suffer from the problem
that they create massive segments that then stick around for a very
long time, see:
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/

Best,
Erick

On Mon, Apr 30, 2018 at 1:56 PM, Michael Joyner  wrote:

Based on experience, 2x head room is not always enough, sometimes
not even 3x, if you are optimizing from many segments down to 1 segment in a
single go.

We have however figured out a way that can work with as little as 51% free
space via the following iteration cycle:

public void solrOptimize() {
    int initialMaxSegments = 256;
    int finalMaxSegments = 1;
    if (isShowSegmentCounter()) {
        log.info("Optimizing ...");
    }
    try (SolrClient solrServerInstance = getSolrClientInstance()) {
        for (int segments = initialMaxSegments; segments >= finalMaxSegments; segments--) {
            if (isShowSegmentCounter()) {
                System.out.println("Optimizing to a max of " + segments + " segments.");
            }
            solrServerInstance.optimize(true, true, segments);
        }
    } catch (SolrServerException | IOException e) {
        throw new RuntimeException(e);
    }
}


On 04/30/2018 04:23 PM, Walter Underwood wrote:

You need 2X the minimum index size in disk space anyway, so don’t worry
about keeping the indexes as small as possible. Worry about having enough
headroom.

If your indexes are 250 GB, you need 250 GB of free space.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Apr 30, 2018, at 1:13 PM, Antony A  wrote:

Thanks Erick/Deepak.

The cloud is running on baremetal (128 GB/24 cpu).

Is there an option to run a compact on the data files to make the size
equal on both the clouds? I am trying to find all the options before I add the
new fields into the production cloud.

Thanks
AA

On Mon, Apr 30, 2018 at 10:45 AM, Erick Erickson

wrote:


Anthony:

You are probably seeing the results of removing deleted documents from
the shards as they're merged. Even on replicas in the same _shard_,
the size of the index on disk won't necessarily be identical. This has
to do with which segments are selected for merging, which are not
necessarily coordinated across replicas.

The test is if the number of docs on each collection is the same. If
it is, then don't worry about index sizes.

Best,
Erick

On Mon, Apr 30, 2018 at 9:38 AM, Deepak Goel  wrote:

Could you please also give the machine details of the two clouds you
are
running?



Deepak
"The greatness of a nation can be judged by the way its animals are
treated. Please stop cruelty to Animals, become a Vegan"

+91 73500 12833
deic...@gmail.com

Facebook: https://www.facebook.com/deicool
LinkedIn: www.linkedin.com/in/deicool

"Plant a Tree, Go Green"

Make In India : http://www.makeinindia.com/home

On Mon, Apr 30, 2018 at 9:51 PM, Antony A 

wrote:

Hi Shawn,

The cloud is running version 6.2.1. with ClassicIndexSchemaFactory

The sum of size from admin UI on all the shards is around 265 G vs 224
G
between the two clouds.

I created the collection using "numShards" so compositeId router.

If you need more information, please let me know.

Thanks
AA

On Mon, Apr 30, 2018 at 10:04 AM, Shawn Heisey 
wrote:


On 4/30/2018 9:51 AM, Antony A wrote:


I am running two separate solr clouds. I have 8 shards in each with
a
total
of 300 million documents. Both the clouds are indexing the document

from

the same source/configuration.

I am noticing there is a difference in the size of the collection

between

them. I am planning to add more shards to see if that helps solve
the
issue. Has anyone come across similar issue?


There's no information here about exactly what you are seeing, what you
are expecting to see, and why you believe that what you are seeing is
wrong.

You did say that there is "a difference in size".  That is a very vague


RE: Collection reload leaves dangling SolrCore instances

2018-05-02 Thread Markus Jelsma
Sounds just like it, I will check it out!

Thanks both!
Markus

 
 
-Original message-
> From:Erick Erickson 
> Sent: Wednesday 2nd May 2018 17:21
> To: solr-user 
> Subject: Re: Collection reload leaves dangling SolrCore instances
> 
> Markus:
> 
> You may well be hitting SOLR-11882.
> 
> On Wed, May 2, 2018 at 8:18 AM, Shawn Heisey  wrote:
> > On 5/2/2018 4:40 AM, Markus Jelsma wrote:
> >> One of our collections, that is heavy with tons of TokenFilters using 
> >> large dictionaries, has a lot of trouble dealing with collection reload. I 
> >> removed all custom plugins from solrconfig, dumbed the schema down and 
> >> removed all custom filters and replaced a customized decompounder with 
> >> Lucene's vanilla filter, and the problem still exists.
> >>
> >> After collection reload a second SolrCore instance appears for each real 
> >> core in use, each next reload causes the number of instances to grow. The 
> >> dangling instances are eventually removed except for one or two. When 
> >> working locally with for example two shards/one replica in one JVM, a 
> >> single reload eats about 500 MB for each reload.
> >>
> >> How can we force Solr to remove those instances sooner? Forcing a GC won't 
> >> do it so it seems Solr itself actively keeps some stale instances alive.
> >
> > Custom plugins, which you did mention, would be the most likely
> > culprit.  Those sometimes have bugs where they don't properly close
> > resources.  Are you absolutely sure that there is no custom software
> > loading at all?  Removing the jars entirely (not just the config that
> > might use the jars) might be required.
> >
> > Have you been able to get heap dumps and figure out what object is
> > keeping the SolrCore alive?
> >
> > Thanks,
> > Shawn
> >
> 


Re: SolrCloud replicaition

2018-05-02 Thread Erick Erickson
That's a pretty open-ended question. The short form
is that when the replica switches back to "active" (or green
on the admin UI), it has caught up.

This is all about NRT replicas.

PULL and TLOG replicas pull the segments from the
leader so the idea of "sending a doc to the replica"
doesn't really apply. Well, TLOG replicas get a copy of
the doc for _their_ tlogs but there is no active
indexing going on.

Best,
Erick

On Wed, May 2, 2018 at 8:43 AM, kumar gaurav  wrote:
> Hi Erick
>
> What will happen after the replica has recovered? Does the leader continuously
> check the status of the replica and resend once it has recovered, or will the
> replica pull the documents for indexing after recovering?
>
> Please clarify this behavior for all of the replica types, i.e. NRT, TLOG and
> PULL. (I have implemented Solr 7.3.)
>
> Thanks .
> Kumar Gaurav
>
>
> On Wed, May 2, 2018 at 9:04 PM, Erick Erickson 
> wrote:
>
>> 1> When the replica fails, the leader tries to resend it, and if the
>> resends fail,
>>  then the follower goes into recovery which will eventually get the
>> document
>>  caught up.
>>
>> 2> Yes, the client will get a failure indication.
>>
>> Best,
>> Erick
>>
>> On Wed, May 2, 2018 at 3:03 AM, Greenhorn Techie
>>  wrote:
>> > Hi,
>> >
>> > Good Morning!!
>> >
>> > In the case of a SolrCloud setup with sharing and replication in place,
>> > when a document is sent for indexing, what happens when only the shard
>> > leader has indexed the document, but the replicas failed, for whatever
>> > reason. Will the document be resent by the leader to the replica shards
>> to
>> > index the document after sometime or how is scenario addressed?
>> >
>> > Also, given the above context, when I set the value of min_rf parameter
>> to
>> > say 2, does that mean the calling application will be informed that the
>> > indexing failed?
>>


Re: SolrCloud replicaition

2018-05-02 Thread kumar gaurav
Hi Erick

What will happen after the replica has recovered? Does the leader continuously
check the status of the replica and resend once it has recovered, or will the
replica pull the documents for indexing after recovering?

Please clarify this behavior for all of the replica types, i.e. NRT, TLOG and
PULL. (I have implemented Solr 7.3.)

Thanks .
Kumar Gaurav


On Wed, May 2, 2018 at 9:04 PM, Erick Erickson 
wrote:

> 1> When the replica fails, the leader tries to resend it, and if the
> resends fail,
>  then the follower goes into recovery which will eventually get the
> document
>  caught up.
>
> 2> Yes, the client will get a failure indication.
>
> Best,
> Erick
>
> On Wed, May 2, 2018 at 3:03 AM, Greenhorn Techie
>  wrote:
> > Hi,
> >
> > Good Morning!!
> >
> > In the case of a SolrCloud setup with sharing and replication in place,
> > when a document is sent for indexing, what happens when only the shard
> > leader has indexed the document, but the replicas failed, for whatever
> > reason. Will the document be resent by the leader to the replica shards
> to
> > index the document after sometime or how is scenario addressed?
> >
> > Also, given the above context, when I set the value of min_rf parameter
> to
> > say 2, does that mean the calling application will be informed that the
> > indexing failed?
>


Re: SolrCloud replicaition

2018-05-02 Thread Erick Erickson
1> When the replica fails, the leader tries to resend it, and if the
resends fail,
 then the follower goes into recovery which will eventually get the document
 caught up.

2> Yes, the client will get a failure indication.

Best,
Erick
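
For reference, a minimal SolrJ sketch of that pattern (the base URL,
collection name and document are placeholders, and the retry policy is left
to the application): the achieved replication factor comes back as "rf" in
the response header, and it is up to the client to retry when it is below
the target.

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;

public class MinRfSketch {
  public static void main(String[] args) throws Exception {
    // base URL and collection name below are placeholders
    try (HttpSolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-1");

      UpdateRequest req = new UpdateRequest();
      req.add(doc);
      req.setParam("min_rf", "2");   // ask Solr to report the achieved replication factor

      UpdateResponse rsp = req.process(client, "mycollection");
      Object rf = rsp.getResponseHeader().get("rf");
      if (rf != null && Integer.parseInt(rf.toString()) < 2) {
        // fewer than 2 replicas acknowledged the update: queue the doc and retry later
        System.out.println("achieved rf=" + rf + ", scheduling a retry");
      }
    }
  }
}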

On Wed, May 2, 2018 at 3:03 AM, Greenhorn Techie
 wrote:
> Hi,
>
> Good Morning!!
>
> In the case of a SolrCloud setup with sharing and replication in place,
> when a document is sent for indexing, what happens when only the shard
> leader has indexed the document, but the replicas failed, for whatever
> reason. Will the document be resent by the leader to the replica shards to
> index the document after sometime or how is scenario addressed?
>
> Also, given the above context, when I set the value of min_rf parameter to
> say 2, does that mean the calling application will be informed that the
> indexing failed?


Re: Too many commits

2018-05-02 Thread Erick Erickson
Two possibilities:
1> you have multiple replicas in the same JVM and are seeing commits
happen with all of them.

2> ramBufferSizeMB: when you index docs, segments are flushed when the
in-memory structures exceed this limit. Is this perhaps what you're
seeing?

Best,
Erick
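
For reference, that knob lives in the <indexConfig> block of solrconfig.xml;
a sketch with the stock default (illustration only, not a recommendation):

<indexConfig>
  <!-- segments are flushed whenever the in-memory buffer exceeds this size,
       independently of the autoCommit interval; 100 MB is the default -->
  <ramBufferSizeMB>100</ramBufferSizeMB>
</indexConfig>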

On Wed, May 2, 2018 at 3:54 AM, Patrick Recchia
 wrote:
> Hello,
>
> I'm seeing way too many commits on our solr cluster, and I don't know why.
>
> Here is the landscape:
> - Each collection we create (one per day) is created with 10 shards with 2
> replicas each.
> - we send live data, 2B records / day, so on average 200M records/shard per
> day - for a size of approx 180GB/shard/day.
> on peak hours that makes approx 10M records/hour;
> - so approx. 15 records/minute. For a size of ~115MB/Minute?
>
> - IndexConfig is set to autoCommit every minute:
>
> <autoCommit>
>   <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
>   <openSearcher>true</openSearcher>
> </autoCommit>
>
> (solr.autoCommit.maxTime is not set)
>
> There is nothing else customized (when it comes to IndexWriter, at least)
> within solrconfig.xml
>
> The data is sent without commit, but with commitWithin=50 ms.
>
> All that said, I would have expected a rate of about 1 segment created per
> minute, of about 100MB.
>
> Instead of that, I see a lot of very small segments (between a few KB and a few
> MB) created at a very high rate.
>
> And I have no idea why this would happen.
> Where I can look to explain such a rate of segments being written?
>
>
>
>
>
> --
> One way of describing a computer is as an electric box which hums.
> Never ascribe to malice what can be explained by stupidity
> --
> Patrick Recchia
> GSM (BE): +32 486 828311
> GSM(IT): +39 347 2300830


Re: Collection reload leaves dangling SolrCore instances

2018-05-02 Thread Erick Erickson
Markus:

You may well be hitting SOLR-11882.

On Wed, May 2, 2018 at 8:18 AM, Shawn Heisey  wrote:
> On 5/2/2018 4:40 AM, Markus Jelsma wrote:
>> One of our collections, that is heavy with tons of TokenFilters using large 
>> dictionaries, has a lot of trouble dealing with collection reload. I removed 
>> all custom plugins from solrconfig, dumbed the schema down and removed all 
>> custom filters and replaced a customized decompounder with Lucene's vanilla 
>> filter, and the problem still exists.
>>
>> After collection reload a second SolrCore instance appears for each real 
>> core in use, each next reload causes the number of instances to grow. The 
>> dangling instances are eventually removed except for one or two. When 
>> working locally with for example two shards/one replica in one JVM, a single 
>> reload eats about 500 MB for each reload.
>>
>> How can we force Solr to remove those instances sooner? Forcing a GC won't 
>> do it so it seems Solr itself actively keeps some stale instances alive.
>
> Custom plugins, which you did mention, would be the most likely
> culprit.  Those sometimes have bugs where they don't properly close
> resources.  Are you absolutely sure that there is no custom software
> loading at all?  Removing the jars entirely (not just the config that
> might use the jars) might be required.
>
> Have you been able to get heap dumps and figure out what object is
> keeping the SolrCore alive?
>
> Thanks,
> Shawn
>


Re: SorCloud Sharding

2018-05-02 Thread Erick Erickson
1> You have to prototype, see:
https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

2> No. It could be done, but it'd take some very careful work.
Basically you'd have to merge "adjacent" shards where "adjacent" is
measured by the shard range of each replica, then fiddle with the
state.json file and hope you get it all right. I'm not sure whether
the new autoscaling stuff will handle this or not.

3> Yep.

But why bother reducing the number of shards? Agreed, there's a little
overhead in having more shards than you need, but you can host
multiple replicas in the same JVM so as long as you get satisfactory
performance, there's no particularly good reason to merge them that
pops to mind.

Best,
Erick

On Wed, May 2, 2018 at 6:22 AM, Greenhorn Techie
 wrote:
> Hi,
>
> I have few questions on sharding in a SolrCloud setup:
>
> 1. How to know the optimal number of shards required for a SolrCloud setup?
> What are the factors to consider to decide on the value for *numShards*
> parameter?
> 2. In case if over sharding has been done i.e. if numShards has been set to
> a very high value, is there a mechanism to merge multiple shards in a
> SolrCloud setup?
> 3. In case if no such merge mechanism is available, is reindexing the only
> option to set numShards to a new lower value?
>
> Thnx.


Re: Collection reload leaves dangling SolrCore instances

2018-05-02 Thread Shawn Heisey
On 5/2/2018 4:40 AM, Markus Jelsma wrote:
> One of our collections, that is heavy with tons of TokenFilters using large 
> dictionaries, has a lot of trouble dealing with collection reload. I removed 
> all custom plugins from solrconfig, dumbed the schema down and removed all 
> custom filters and replaced a customized decompounder with Lucene's vanilla 
> filter, and the problem still exists.
>
> After collection reload a second SolrCore instance appears for each real core 
> in use, each next reload causes the number of instances to grow. The dangling 
> instances are eventually removed except for one or two. When working locally 
> with for example two shards/one replica in one JVM, a single reload eats 
> about 500 MB for each reload.
>
> How can we force Solr to remove those instances sooner? Forcing a GC won't do 
> it so it seems Solr itself actively keeps some stale instances alive.

Custom plugins, which you did mention, would be the most likely
culprit.  Those sometimes have bugs where they don't properly close
resources.  Are you absolutely sure that there is no custom software
loading at all?  Removing the jars entirely (not just the config that
might use the jars) might be required.

Have you been able to get heap dumps and figure out what object is
keeping the SolrCore alive?

Thanks,
Shawn



Re: count mismatch: number of records indexed

2018-05-02 Thread Erick Erickson
And if you _do_ have a uniqueKey ("id" by default), subsequent records
will overwrite older records with the same key.

The tip from Annameneni is the first thing I'd try though, make sure
you've issued a commit.

Best,
Erick
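
If it helps, a minimal SolrJ sketch of that check (core name and URL are
placeholders; on SolrJ 5.x you would use new HttpSolrClient(url) instead of
the Builder): force a hard commit, then count what is actually visible. If
numFound is still far below 65 million after the commit, colliding uniqueKey
values are the likely explanation.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class CountCheck {
  public static void main(String[] args) throws Exception {
    // placeholder core URL
    try (HttpSolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
      client.commit();              // hard commit: make everything indexed so far visible
      SolrQuery q = new SolrQuery("*:*");
      q.setRows(0);                 // only the count is needed
      long numFound = client.query(q).getResults().getNumFound();
      System.out.println("numFound after commit: " + numFound);
    }
  }
}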

On Wed, May 2, 2018 at 7:09 AM, ANNAMANENI RAVEENDRA
 wrote:
> Possible causes can be:
>
> If you don’t have a unique key then there are high chances that you will see
> less data.
> Try a hard commit or check your commit times (hard/soft).
>
>
> On Wed, May 2, 2018 at 9:30 AM Srinivas Kashyap <
> srini...@tradestonesoftware.com> wrote:
>
>> Hi,
>>
>> I have standalone solr index server 5.2.1 and have a core with 15
>> fields(all indexed and stored).
>>
>> Through DIH I'm indexing the data (around 65million records). The index
>> process took 6hours to complete. But after the completion when I checked
>> through Solr admin query console(*:*), numfound is only 41 thousand
>> records. Am I missing some configuration to index all records?
>>
>> Physical memory: 16GB
>> JVM memory: 4GB
>>
>> Thanks,
>> Srinivas
>>


Re: Query regarding solr 7.3.0

2018-05-02 Thread Erick Erickson
Just what it says. Solr/Lucene like lots of file handles; I regularly
see several thousand in use. If you run out of file handles, Solr stops
working.

Ditto processes. Solr in particular spawns a lot of threads,
particularly when handling many incoming requests through Jetty. If
you exceed the limit, requests fail.

The comments about solr.in.sh are only if you want to stop the
warning. To really fix the underlying issue you need to talk to your
system administrator and up the ulimit values. This is a system-level
operation, not something configured in Solr.

Best,
Erick
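
For example, on a typical Linux host that uses PAM limits, and assuming Solr
runs as a user named "solr" (adjust to your install), the persistent fix is a
set of entries in /etc/security/limits.conf, followed by a restart of Solr
from a fresh login session. SOLR_ULIMIT_CHECKS=false in solr.in.sh only hides
the warning, it does not raise the limits.

# /etc/security/limits.conf (sketch; user name and values are assumptions)
solr  soft  nofile  65000
solr  hard  nofile  65000
solr  soft  nproc   65000
solr  hard  nproc   65000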

On Wed, May 2, 2018 at 4:52 AM, Agarwal, Monica (Nokia - IN/Bangalore)
 wrote:
> Hi ,
>
> I am trying to upgrade solr from 7.1.0 to 7.3.0 .
>
> While trying to start the  solr process the below warnings are observed:
>
> *** [WARN] *** Your open file limit is currently 1024.
>  It should be set to 65000 to avoid operational disruption.
>  If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false 
> in your profile or solr.in.sh
> *** [WARN] ***  Your Max Processes Limit is currently 1024.
>  It should be set to 65000 to avoid operational disruption.
>  If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false 
> in your profile or solr.in.sh
>
> Could any of you help me understand these warnings and whether they could lead
> to some issues?
> Also, do I need to make any configuration changes in the solr.in.sh file?
>
> Regards,
> Monica
>
>


Query regarding solr 7.3.0

2018-05-02 Thread Agarwal, Monica (Nokia - IN/Bangalore)
Hi ,

I am trying to upgrade solr from 7.1.0 to 7.3.0 .

While trying to start the  solr process the below warnings are observed:

*** [WARN] *** Your open file limit is currently 1024.
 It should be set to 65000 to avoid operational disruption.
 If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false in 
your profile or solr.in.sh
*** [WARN] ***  Your Max Processes Limit is currently 1024.
 It should be set to 65000 to avoid operational disruption.
 If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false in 
your profile or solr.in.sh

Could any of you help me understand these warnings and whether they could lead to
some issues?
Also, do I need to make any configuration changes in the solr.in.sh file?

Regards,
Monica




Re: Solr working £ Symbol

2018-05-02 Thread Shawn Heisey
On 5/2/2018 3:13 AM, Mohan Cheema wrote:
> We are using Solr to index our data. The data contains £ symbol within the 
> text and for currency. When data is exported from the source system data 
> contains £ symbol, however, when the data is imported into the Solr £ symbol 
> is converted to �.
>
> How can we keep the £ symbol as is when importing data?

What tools are you using to look at Solr results?  What tools are you
using to send update data to Solr?

Solr expects and delivers UTF-8 characters.  If the data you're sending
to Solr is using another character set, Java may not interpret it correctly.

Conversely, if whatever you're using to look at Solr's results is also
not expecting/displaying UTF-8, you might not be shown correct characters.

Thanks,
Shawn
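
If it helps to narrow down which side is at fault, here is a small
stand-alone check (the file name is a placeholder): in UTF-8 the pound sign
U+00A3 is the byte pair C2 A3, while a lone A3 byte means the file was
written as Latin-1/Windows-1252 and will come out as � once it is decoded as
UTF-8.

import java.nio.file.Files;
import java.nio.file.Paths;

public class PoundCheck {
  public static void main(String[] args) throws Exception {
    byte[] bytes = Files.readAllBytes(Paths.get("data.json"));  // placeholder file name
    for (int i = 0; i < bytes.length; i++) {
      if ((bytes[i] & 0xFF) == 0xA3) {
        boolean utf8 = i > 0 && (bytes[i - 1] & 0xFF) == 0xC2;
        System.out.println("0xA3 at offset " + i
            + (utf8 ? " -> UTF-8 encoded £"
                    : " -> probably a Latin-1 £ (or part of another multi-byte sequence)"));
      }
    }
  }
}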



Re: count mismatch: number of records indexed

2018-05-02 Thread ANNAMANENI RAVEENDRA
Possible causes can be:

If you don’t have a unique key then there are high chances that you will see
less data.
Try a hard commit or check your commit times (hard/soft).


On Wed, May 2, 2018 at 9:30 AM Srinivas Kashyap <
srini...@tradestonesoftware.com> wrote:

> Hi,
>
> I have standalone solr index server 5.2.1 and have a core with 15
> fields(all indexed and stored).
>
> Through DIH I'm indexing the data (around 65million records). The index
> process took 6hours to complete. But after the completion when I checked
> through Solr admin query console(*:*), numfound is only 41 thousand
> records. Am I missing some configuration to index all records?
>
> Physical memory: 16GB
> JVM memory: 4GB
>
> Thanks,
> Srinivas
>


Is it normal for BlendedInfixLookupFactory to not show terms?

2018-05-02 Thread O. Klein
BlendedInfixLookupFactory is not returning terms, but returns the field
value. If I change to FuzzyLookupFactory it works fine. Am I doing something
wrong?

   
  
default
BlendedInfixLookupFactory
position_linear
DocumentDictionaryFactory
weight
text_suggest
language
textSuggest
true
  




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


count mismatch: number of records indexed

2018-05-02 Thread Srinivas Kashyap
Hi,

I have standalone solr index server 5.2.1 and have a core with 15 fields(all 
indexed and stored).

Through DIH I'm indexing the data (around 65 million records). The index process 
took 6 hours to complete. But after the completion, when I checked through Solr 
admin query console(*:*), numfound is only 41 thousand records. Am I missing 
some configuration to index all records?

Physical memory: 16GB
JVM memory: 4GB

Thanks,
Srinivas


SorCloud Sharding

2018-05-02 Thread Greenhorn Techie
Hi,

I have few questions on sharding in a SolrCloud setup:

1. How to know the optimal number of shards required for a SolrCloud setup?
What are the factors to consider to decide on the value for *numShards*
parameter?
2. In case if over sharding has been done i.e. if numShards has been set to
a very high value, is there a mechanism to merge multiple shards in a
SolrCloud setup?
3. In case if no such merge mechanism is available, is reindexing the only
option to set numShards to a new lower value?

Thnx.


Re: Query Regarding Solr Garbage Collection

2018-05-02 Thread Susheel Kumar
A very high rate of indexing documents could cause heap usage to go high
(all the temporary objects being created live in JVM memory, and at a very
high indexing rate heap utilization may climb).

Having caches that are not sized/set correctly would also result in high JVM
usage, since as searches happen they keep filling the caches and thus the JVM
heap. Other factors like sorting/faceting etc. also require JVM memory, and
deep paging could even cause the JVM to run out of memory/OOM.

Thnx
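
To make the cache point concrete: a full filterCache entry is a bitset of
roughly maxDoc/8 bytes, so on a 100M-document core a single cached filter can
cost around 12 MB, and the stock size of 512 adds up quickly. A sketch of
where this is tuned in solrconfig.xml (the numbers are made up, not a
recommendation):

<query>
  <!-- deliberately modest caches for a large, heavily-filtered index -->
  <filterCache class="solr.FastLRUCache" size="256" initialSize="256" autowarmCount="32"/>
  <queryResultCache class="solr.LRUCache" size="256" initialSize="256" autowarmCount="32"/>
</query>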

On Tue, May 1, 2018 at 6:18 PM, Greenhorn Techie 
wrote:

> Hi,
>
> Following the https://wiki.apache.org/solr/SolrPerformanceFactors article,
> I understand that Garbage Collection might be triggered due to significant
> increase in JVM heap usage unless a commit is performed. Given this
> background, I am curious to understand the reasons / factors that
> contribute to increased heap usage of Solr JVM, which would thus force a
> Garbage Collection cycle.
>
> Especially, what are the factors that contribute to heap usage increase
> during indexing time and what factors contribute during search/query time?
>
> Thanks
>


Re: Solr Heap usage

2018-05-02 Thread Susheel Kumar
Take a look at https://wiki.apache.org/solr/SolrPerformanceProblems. The
section "How much heap do I need?" talks about that.
Caches also live in the JVM heap, so take a look at how much you are
allocating for the different caches.

Thnx


On Tue, May 1, 2018 at 7:33 PM, Greenhorn Techie 
wrote:

> Hi,
>
> Wondering what are the considerations to be aware to arrive at an optimal
> heap size for Solr JVM? Though I did discuss this on the IRC, I am still
> unclear on how Solr uses the JVM heap space. Are there any pointers to
> understand this aspect better?
>
> Given that Solr requires an optimally configured heap, so that the
> remaining unused memory can be used for OS disk cache, I wonder how to best
> configure Solr heap. Also, on the IRC it was discussed that having 31GB of
> heap is better than having 32GB due to Java’s internal usage of heap. Can
> anyone guide further on heap configuration please?
>
> Thanks
>


Autocomplete returning shingles

2018-05-02 Thread O. Klein
I need to use autocomplete with edismax (ngrams, edge grams) to return shingled
suggestions. The field value "new york city" needs to return, on the query "ne",
"new", "new york" and "new york city". With the suggester this is easy, but I'm
forced to use edismax because I need to apply multiple filter queries.

What is best approach to deal with this?

Any suggestions are appreciated.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Too many commits

2018-05-02 Thread Patrick Recchia
Hello,

I'm seeing way too many commits on our solr cluster, and I don't know why.

Here is the landscape:
- Each collection we create (one per day) is created with 10 shards with 2
replicas each.
- we send live data, 2B records / day. so on average 200M records/shard per
day - for a size of approx 180GB/sahrd*Day.
on peak hours that makes approx 10M records/hour;
- so approx. 15 records/minute. For a size of ~115MB/Minute?

- IndexConfig is set to autoCommit every minute:

<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
  <openSearcher>true</openSearcher>
</autoCommit>

(solr.autoCommit.maxTime is not set)

There is nothing else customized (when it comes to IndexWriter, at least)
within solrconfig.xml

The data is sent without commit, but with commitWithin=50 ms.

All that said, I would have expected a rate of about 1 segment created per
minute, of about 100MB.

Instead of that, I see a lot of very small segments (between a few KB and a few
MB) created at a very high rate.

And I have no idea why this would happen.
Where I can look to explain such a rate of segments being written?





-- 
One way of describing a computer is as an electric box which hums.
Never ascribe to malice what can be explained by stupidity
--
Patrick Recchia
GSM (BE): +32 486 828311
GSM(IT): +39 347 2300830


Collection reload leaves dangling SolrCore instances

2018-05-02 Thread Markus Jelsma
Hello,

One of our collections, that is heavy with tons of TokenFilters using large 
dictionaries, has a lot of trouble dealing with collection reload. I removed 
all custom plugins from solrconfig, dumbed the schema down and removed all 
custom filters and replaced a customized decompounder with Lucene's vanilla 
filter, and the problem still exists.

After collection reload a second SolrCore instance appears for each real core 
in use, each next reload causes the number of instances to grow. The dangling 
instances are eventually removed except for one or two. When working locally 
with for example two shards/one replica in one JVM, a single reload eats about 
500 MB for each reload.

How can we force Solr to remove those instances sooner? Forcing a GC won't do 
it so it seems Solr itself actively keeps some stale instances alive.

Many thanks,
Markus


SolrCloud replicaition

2018-05-02 Thread Greenhorn Techie
Hi,

Good Morning!!

In the case of a SolrCloud setup with sharding and replication in place,
when a document is sent for indexing, what happens when only the shard
leader has indexed the document, but the replicas failed, for whatever
reason? Will the document be resent by the leader to the replica shards to
index the document after some time, or how is this scenario addressed?

Also, given the above context, when I set the value of min_rf parameter to
say 2, does that mean the calling application will be informed that the
indexing failed?


Solr working £ Symbol

2018-05-02 Thread Mohan Cheema
Hi There,

We are using Solr to index our data. The data contains the £ symbol within the
text and for currency. When data is exported from the source system the data
contains the £ symbol; however, when the data is imported into Solr the £ symbol
is converted to �.

How can we keep the £ symbol as is when importing data?

Note: When a file is viewed using less the pound symbol is displayed as  
and when viewed in vi editor it shows up properly.

Regards,

Mohan
Disclaimer: www.arrkgroup.com/EmailDisclaimer


Re: Regarding LTR feature

2018-05-02 Thread Prateek

Hi Alessandro,

Thanks for responding.

Let me take a step back and tell you the problem I have been facing with
this. So, one of the features in my LTR model is:

{
"store" : "my_feature_store",
"name" : "in_aggregated_terms",
"class" : "org.apache.solr.ltr.feature.SolrFeature",
"params" : { "q" : "{!func}scale(query({!payload_score
f=aggregated_terms func=max v=${query}}),0,100)" }
}

So now, with this feature, if I apply an FQ in Solr it will scale the
values over all the documents irrespective of the FQ filter.

But if I change the feature to something like this:

{
"store" : "my_feature_store",
"name" : "in_aggregated_terms",
"class" : "org.apache.solr.ltr.feature.SolrFeature",
"params" : { "q" : "{!func}scale(query({!field f=aggregated_terms
v=${query}}),0,100)" }
}

Then it scales properly with the FQ as well.

And about that verification: I simply check the results returned. As in
case 1, after applying the FQ filter that feature score doesn't
scale to its maximum value of 100, which I think is because
it scales over all the documents and then returns only the subset with
the FQ filter applied.

Alternatively, is there any way I can scale these values at
normalization time with a customized class which iterates over only the
re-ranked documents?

Thanks a lot in advance.

Looking forward to hearing back from you soon.


Regards,

Prateek

On 2018/04/30 11:58:44, Alessandro Benedetti wrote:
> Hi Prateek,
>
> with query and FQ Solr is expected to score a document only if that document
> is a match of all the FQ results intersected with the query results [1].
> Then re-ranking happens, so effectively, only the top K intersected
> documents will be re-ranked.
>
> If you are curious about the code, this can be debugged running a variation
> of org.apache.solr.ltr.TestLTRWithFacet#testRankingSolrFacet (introducing
> filter queries) and setting the breakpoint somewhere around:
> org/apache/solr/ltr/LTRRescorer.java:181
>
> Can you elaborate how you have verified that it is currently not working like
> that? I am familiar with LTR code and I would be surprised to see this
> different behavior.
>
> [1] https://lucidworks.com/2017/11/27/caching-and-filters-and-post-filters/
>
> -----
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
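
(For the normalization-time idea: the stock LTR module does ship a
MinMaxNormalizer that rescales a feature inside the model definition,
sketched below with made-up bounds. Note that it applies constants fixed at
model-upload time rather than the min/max of the current re-ranked page, so
truly per-query dynamic scaling would still need a custom Normalizer
subclass.)

{
  "store" : "my_feature_store",
  "name" : "my_linear_model",
  "class" : "org.apache.solr.ltr.model.LinearModel",
  "features" : [ {
    "name" : "in_aggregated_terms",
    "norm" : {
      "class" : "org.apache.solr.ltr.norm.MinMaxNormalizer",
      "params" : { "min" : "0.0", "max" : "50.0" }
    }
  } ],
  "params" : { "weights" : { "in_aggregated_terms" : 1.0 } }
}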

