[ANNOUNCE] Apache Solr 6.6.1 released

2017-09-07 Thread Varun Thacker
7 September 2017, Apache Solr™ 6.6.1 available

The Lucene PMC is pleased to announce the release of Apache Solr 6.6.1

Solr is the popular, blazing fast, open source NoSQL search platform from the
Apache Lucene project. Its major features include powerful full-text search,
hit highlighting, faceted search and analytics, rich document parsing,
geospatial search, extensive REST APIs as well as parallel SQL. Solr is
enterprise grade, secure and highly scalable, providing fault tolerant
distributed search and indexing, and powers the search and navigation features
of many of the world's largest internet sites.

This release includes 15 bug fixes since the 6.6.0 release. Some of the
major fixes are:

* Standalone Solr loads UNLOADed core on request

* ParallelStream should set the StreamContext when constructing SolrStreams

* CloudSolrStream.toExpression incorrectly handles fq clauses

* CoreContainer.load needs to send lazily loaded core descriptors to the
proper list rather than send them all to the transient lists

* Creating a core should write a core.properties file first and clean up on
failure

* Clean up a few details left over from pluggable transient core and
untangling

* Provide a way to know when Core Discovery is finished and when all async
cores are done loading

* CDCR bootstrapping can get into an infinite loop when a core is reloaded

* SolrJmxReporter is broken on core reload. This resulted in some or most
metrics not being reported via JMX after core reloads, depending on timing

* Creating a core.properties fails if the parent of core.properties is a
symlinked directory

* StreamHandler should allow connections to be closed early

* Certain admin UI pages would not load up correctly with kerberos enabled

* Fix DOWNNODE -> queue-work znode explosion in ZooKeeper

* Upgrade to Hadoop 2.7.4 to fix incompatibility with Java 9

* Fix bin/solr.cmd so it can run properly on Java 9

Furthermore, this release includes Apache Lucene 6.6.1 which includes 2 bug
fixes since the 6.6.0 release.

The release is available for immediate download at:

  http://www.apache.org/dyn/closer.lua/lucene/solr/6.6.1

Please read CHANGES.txt for a detailed list of changes:

  https://lucene.apache.org/solr/6_6_1/changes/Changes.html

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases. It is possible that the mirror you
are using may not have replicated the release yet. If that is the
case, please try another mirror. This also goes for Maven access.


Re: Solr Commit Thread Blocked because of excessive number of merging threads

2017-09-07 Thread Walter Underwood
Agree, if the merge tuning isn’t working, then stop tuning the merges and go 
back to defaults. I’ve been running Solr in production for about eight of the 
last ten years and I’ve never tuned merges.

Are your PHP clients sending batches or single documents?

1 k documents per minute seems very, very slow. Our Solr cluster indexes 1000X 
faster, at about 1 million docs/minute.

We have a 16 node cluster, 4 shards, replication factor 4, with about 20 
million documents. That is 5 million per shard, not that different from yours. 
Our documents average around 1 kbyte, but are quite variable. Our index gets 
pretty big because we have EdgeNGrams on the main field.

We load the cluster (Solr 6.5.1) through a dumb load balancer with a Java 
program. Batch size of 1000 docs, 64 threads (connections). Because we don’t 
use the cloud-aware client, roughly 75% of batches go to replicas and need to 
be forwarded to leaders. Also, docs don’t go to the right shard. It still runs 
like a bat out of hell.
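
(For illustration only: a minimal, single-threaded sketch of that kind of
batched SolrJ loader. The URL, collection and field names below are made up,
and the real program described above runs 64 threads/connections.)

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchLoader {
    public static void main(String[] args) throws Exception {
        // Point at the load balancer in front of the cluster (hypothetical URL).
        try (HttpSolrClient client = new HttpSolrClient.Builder(
                "http://loadbalancer:8983/solr/mycollection").build()) {
            List<SolrInputDocument> batch = new ArrayList<>();
            for (int i = 0; i < 100_000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                doc.addField("title_t", "document " + i);
                batch.add(doc);
                if (batch.size() == 1000) {      // send in batches of 1000, as above
                    client.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                client.add(batch);               // flush the final partial batch
            }
            client.commit();                     // or rely on autoCommit instead
        }
    }
}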

We run Solr with an 8 Gb heap, Java 8u131, G1 collector with Shawn Heisey’s 
recommended GC settings. 
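
(For reference, a representative set of G1 options in that spirit -- these are
real HotSpot flags, but illustrative only, not necessarily the exact published
values:)

  -XX:+UseG1GC
  -XX:+ParallelRefProcEnabled
  -XX:G1HeapRegionSize=8m
  -XX:MaxGCPauseMillis=250
  -XX:InitiatingHeapOccupancyPercent=75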

The hosts are Amazon instances with 36 CPUs and 59 Gb of RAM. Storage is SSD 
EBS.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Sep 7, 2017, at 5:02 PM, Erick Erickson  wrote:
> 
> Skimming and to add to what Shawn said about ramBufferSizeMB.
> 
> It's totally wasted space pretty much since you've set maxDocs to 10,000.
> It doesn't matter how big ramBufferSizeMB is, when you reach 10,000 docs
> indexed the buffer will be flushed and set back to zero.
> 
> And +1 to all of Shawn's comments about the changes you've made to
> the merge policy. I'd set them all back to the defaults unless you have some
> kind of proof that they're helping since there's ample cause for concern that
> they're hurting instead.
> 
> Best,
> Erick
> 
> On Thu, Sep 7, 2017 at 3:53 PM, Shawn Heisey  wrote:
>> On 9/6/2017 11:54 PM, yasoobhaider wrote:
>>> My team has tasked me with upgrading Solr from the version we are using
>>> (5.4) to the latest stable version 6.6. I am stuck for a few days now on the
>>> indexing part.
>>> 
>>> So in total I'm indexing about 2.5million documents. The average document
>>> size is ~5KB. I have 10 (PHP) workers which are running in parallel, hitting
>>> Solr with ~1K docs/minute. (This sometimes goes up to ~3K docs/minute).
>>> 
>>> System specifications:
>>> RAM: 120G
>>> Processors: 16
>>> 
>>> Solr configuration:
>>> Heap size: 80G
>> 
>> That's an ENORMOUS heap.  Why is it that big? If the index only has 2.5
>> million documents and reaches a size of 10GB, I cannot imagine that
>> index ever needing a heap that big.  That's just asking for extreme (but
>> perhaps infrequent) garbage collection pauses.  Assuming those numbers
>> for all your index data are correct, I'd drop it to something like 4GB.
>> If your queries are particularly complex, you might want to go to 8GB.
>> Note that this is also going to require that you significantly reduce
>> your ramBufferSizeMB value, which I already advised you to do on another
>> thread.
>> 
>>> 
>>> solrconfig.xml: (Relevant parts; please let me know if there's anything else
>>> you would like to look at)
>>> 
>>> 
>>>  1
>>>  380
>>>  true
>>> 
>>> 
>>> 
>>>  ${solr.autoSoftCommit.maxTime:-1}
>>> 
>>> 
>>> 5000
>>> 1
>>> 
>>> 
>>>  30
>>>  30
>>> 
>>> 
>>> 
>>>  8
>>>  7
>>> 
>> 
>> I've given you suggestions on how to change this part of the config.
>> See the message that I sent earlier on another thread -- at 14:21 UTC
>> today.  If you change those settings as I recommended, the merging is
>> less likely to overwhelm your system.
>> 
>>> 
>>> 
>>> The main problem:
>>> 
>>> When I start indexing everything is good until I reach about 2 million docs,
>>> which takes ~10 hours. But then the  commitscheduler thread gets blocked. It
>>> is stuck at doStall() in ConcurrentMergeScheduler(CMS). Looking at the logs
>>> from InfoStream, I found "too many merges; stalling" message from the
>>> commitscheduler thread, post which it gets stuck in the while loop forever.
>> 
>> This means that there are more merges scheduled than you have allowed
>> with maxMergeCount, so the thread that's doing the actual indexing is
>> paused.
>> 
>> Best guess is that you are overwhelming your disks with multiple merge
>> threads, because you've set maxMergeThreads to 7.  In most situations
>> that should be 1, so multiple merges are not running simultaneously.
>> Instead, they will be run one at a time so that each one can complete
>> faster.  You may have plenty of CPU power to run multiple threads, but
>> when multiple threads 

Re: Solr Commit Thread Blocked because of excessive number of merging threads

2017-09-07 Thread Erick Erickson
Skimming and to add to what Shawn said about ramBufferSizeMB.

It's totally wasted space pretty much since you've set maxDocs to 10,000.
It doesn't matter how big ramBufferSizeMB is, when you reach 10,000 docs
indexed the buffer will be flushed and set back to zero.

And +1 to all of Shawn's comments about the changes you've made to
the merge policy. I'd set them all back to the defaults unless you have some
kind of proof that they're helping since there's ample cause for concern that
they're hurting instead.

Best,
Erick

On Thu, Sep 7, 2017 at 3:53 PM, Shawn Heisey  wrote:
> On 9/6/2017 11:54 PM, yasoobhaider wrote:
>> My team has tasked me with upgrading Solr from the version we are using
>> (5.4) to the latest stable version 6.6. I am stuck for a few days now on the
>> indexing part.
>>
>> So in total I'm indexing about 2.5million documents. The average document
>> size is ~5KB. I have 10 (PHP) workers which are running in parallel, hitting
>> Solr with ~1K docs/minute. (This sometimes goes up to ~3K docs/minute).
>>
>> System specifications:
>> RAM: 120G
>> Processors: 16
>>
>> Solr configuration:
>> Heap size: 80G
>
> That's an ENORMOUS heap.  Why is it that big? If the index only has 2.5
> million documents and reaches a size of 10GB, I cannot imagine that
> index ever needing a heap that big.  That's just asking for extreme (but
> perhaps infrequent) garbage collection pauses.  Assuming those numbers
> for all your index data are correct, I'd drop it to something like 4GB.
> If your queries are particularly complex, you might want to go to 8GB.
> Note that this is also going to require that you significantly reduce
> your ramBufferSizeMB value, which I already advised you to do on another
> thread.
>
>> 
>> solrconfig.xml: (Relevant parts; please let me know if there's anything else
>> you would like to look at)
>>
>> 
>>   1
>>   380
>>   true
>> 
>>
>> 
>>   ${solr.autoSoftCommit.maxTime:-1}
>> 
>>
>> 5000
>> 1
>>
>> 
>>   30
>>   30
>> 
>>
>> 
>>   8
>>   7
>> 
>
> I've given you suggestions on how to change this part of the config.
> See the message that I sent earlier on another thread -- at 14:21 UTC
> today.  If you change those settings as I recommended, the merging is
> less likely to overwhelm your system.
>
>> 
>>
>> The main problem:
>>
>> When I start indexing everything is good until I reach about 2 million docs,
>> which takes ~10 hours. But then the  commitscheduler thread gets blocked. It
>> is stuck at doStall() in ConcurrentMergeScheduler(CMS). Looking at the logs
>> from InfoStream, I found "too many merges; stalling" message from the
>> commitscheduler thread, post which it gets stuck in the while loop forever.
>
> This means that there are more merges scheduled than you have allowed
> with maxMergeCount, so the thread that's doing the actual indexing is
> paused.
>
> Best guess is that you are overwhelming your disks with multiple merge
> threads, because you've set maxMergeThreads to 7.  In most situations
> that should be 1, so multiple merges are not running simultaneously.
> Instead, they will be run one at a time so that each one can complete
> faster.  You may have plenty of CPU power to run multiple threads, but
> when multiple threads are accessing data on one disk volume, the random
> access can cause serious problems with disk I/O performance.
>
>> I also increased my maxMergeAtOnce and segmentsPerTier from 10 to 20 and
>> then to 30, in hopes of having fewer merging threads to be running at once,
>> but that just results in more segments to be created (not sure why this
>> would happen). I also tried going the other way by reducing it to 5, but
>> that experiment failed quickly (commit thread blocked).
>
> When you increase the values in the mergePolicy, you are explicitly
> telling Lucene to allow more segments in the index at any given moment.
> These settings should not be tweaked unless you know for sure that you
> can benefit from changing them.  Higher values should result in less
> merging, but the size of each merge that DOES happen will be larger, so
> it will take longer.
>
>> I increased the ramBufferSizeMB to 5000MB so that there are fewer flushes
>> happening, so that fewer segments are created, so that fewer merges happen
>> (I haven't dug deep here, so please correct me if this is something I should
>> revert. Our current (5.x) config has this set at 324MB).
>
> With large ram buffers, commits are more likely to control how big each
> segment is and how frequently they are flushed.  Tests by Solr and
> Lucene developers have shown that increasing the buffer size beyond
> 128MB rarely offers any advantage, unless the documents are huge.  At
> 5KB, yours aren't huge.
>
>> The 

Re: Conditions with multiple boosts in bf exists query

2017-09-07 Thread Erick Erickson
I'd sidestep the problem ;)

Are these scores
1> known at index time
2> unchanging (at least until the doc is re-indexed)?

If so, pre-compute your boost and put it in the doc at index time.

The other thing you can do is use payloads to add a float to specific
tokens and incorporate them in at scoring time. See the Solr
documentation, if you have a relatively recent one the payload support
has been built in to Solr, otherwise here's a primer:
https://lucidworks.com/2014/06/13/end-to-end-payload-example-in-solr/

Best,
Erick

On Thu, Sep 7, 2017 at 8:40 AM, Eric Kurzenberger  wrote:
> I need to do a bf exists query that matches the following conditions:
>
>
> -  IF a_score = 1 AND b_score = 2 THEN boost 30
>
> -  IF a_score = 3 AND b_score = 4 THEN boost 20
>
> So far, the bf portion of my query looks like this:
>
> if(exists(query({!v="a_score_is:1"})),30,0)
>
> But I’m having difficulty finding the correct syntax for the multiple 
> conditions and boosts.
>
> I was originally doing a bq query that looked like this:
>
> bq=(a_score_is:1 AND b_score_is:2)^30 OR (a_score_is:3 AND 
> b_score_is:4)^20
>
> but I found that idf was skewing my expected results, as I don’t care about 
> document frequency.
>
> Can anyone assist?
>
> Cheers,
>
> Eric
>


Re: Sort across collapsed document is not working

2017-09-07 Thread Ray Niu
This is not a sharded collection; it only has one shard. I want to use
collapse to replace the current group query, but the results are not the same. I
feel there is some functional issue in the collapse plugin.

2017-09-07 14:59 GMT-07:00 Erick Erickson :

> Is this a sharded collection? group.ngroups isn't supported (see the
> docs, "group.ngroups and group.facet require that all documents in
> each group must be co-located on the same shard") in sharded
> situations so it's not surprising that the results differ.
>
> Best,
> Erick
>
> On Thu, Sep 7, 2017 at 10:35 AM, Ray Niu  wrote:
> > Hello:
> >I tried to use Collapsing Query Parser per following link:
> >
> > https://cwiki.apache.org/confluence/display/solr/
> Collapse+and+Expand+Results
> > here is the query I am using
> > http:///solr/collection/select?q=groupId:*&
> > fl=id,groupId,date=%7B!collapse%20field=groupId%
> 20sort=%27id%20asc%27%7D&
> > expand=true=3=date%20asc=id%20asc=3
> >
> > but I found the result is different from group query:
> > http:///solr/collection/select?q=groupId:*&
> > fl=id,date,groupId=true=groupId
> > limit=4=true=date%20asc=id%20asc=3
> >
> > it seems sort across collapsed document is not working.
> >
> > Can anyone help on this?
>


Re: Solr Commit Thread Blocked because of excessive number of merging threads

2017-09-07 Thread Shawn Heisey
On 9/6/2017 11:54 PM, yasoobhaider wrote:
> My team has tasked me with upgrading Solr from the version we are using
> (5.4) to the latest stable version 6.6. I am stuck for a few days now on the
> indexing part.
>
> So in total I'm indexing about 2.5million documents. The average document
> size is ~5KB. I have 10 (PHP) workers which are running in parallel, hitting
> Solr with ~1K docs/minute. (This sometimes goes up to ~3K docs/minute).
>
> System specifications:
> RAM: 120G
> Processors: 16
>
> Solr configuration:
> Heap size: 80G

That's an ENORMOUS heap.  Why is it that big? If the index only has 2.5
million documents and reaches a size of 10GB, I cannot imagine that
index ever needing a heap that big.  That's just asking for extreme (but
perhaps infrequent) garbage collection pauses.  Assuming those numbers
for all your index data are correct, I'd drop it to something like 4GB. 
If your queries are particularly complex, you might want to go to 8GB. 
Note that this is also going to require that you significantly reduce
your ramBufferSizeMB value, which I already advised you to do on another
thread.

> 
> solrconfig.xml: (Relevant parts; please let me know if there's anything else
> you would like to look at)
>
> 
>   1
>   380
>   true
> 
>
> 
>   ${solr.autoSoftCommit.maxTime:-1}
> 
>
> 5000
> 1
>
> 
>   30
>   30
> 
>
> 
>   8
>   7
> 

I've given you suggestions on how to change this part of the config. 
See the message that I sent earlier on another thread -- at 14:21 UTC
today.  If you change those settings as I recommended, the merging is
less likely to overwhelm your system.

> 
>
> The main problem:
>
> When I start indexing everything is good until I reach about 2 million docs,
> which takes ~10 hours. But then the  commitscheduler thread gets blocked. It
> is stuck at doStall() in ConcurrentMergeScheduler(CMS). Looking at the logs
> from InfoStream, I found "too many merges; stalling" message from the
> commitscheduler thread, post which it gets stuck in the while loop forever.

This means that there are more merges scheduled than you have allowed
with maxMergeCount, so the thread that's doing the actual indexing is
paused.

Best guess is that you are overwhelming your disks with multiple merge
threads, because you've set maxMergeThreads to 7.  In most situations
that should be 1, so multiple merges are not running simultaneously. 
Instead, they will be run one at a time so that each one can complete
faster.  You may have plenty of CPU power to run multiple threads, but
when multiple threads are accessing data on one disk volume, the random
access can cause serious problems with disk I/O performance.
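
As a rough sketch of what that looks like in solrconfig.xml (6.x syntax; the
values only mirror the suggestion above, they are not a verified copy of
anyone's config):

  <indexConfig>
    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
      <!-- merges that may be queued before indexing threads are stalled -->
      <int name="maxMergeCount">6</int>
      <!-- run merges one at a time so a single disk volume is not thrashed -->
      <int name="maxThreadCount">1</int>
    </mergeScheduler>
  </indexConfig>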

> I also increased my maxMergeAtOnce and segmentsPerTier from 10 to 20 and
> then to 30, in hopes of having fewer merging threads to be running at once,
> but that just results in more segments to be created (not sure why this
> would happen). I also tried going the other way by reducing it to 5, but
> that experiment failed quickly (commit thread blocked).

When you increase the values in the mergePolicy, you are explicitly
telling Lucene to allow more segments in the index at any given moment. 
These settings should not be tweaked unless you know for sure that you
can benefit from changing them.  Higher values should result in less
merging, but the size of each merge that DOES happen will be larger, so
it will take longer.

> I increased the ramBufferSizeMB to 5000MB so that there are fewer flushes
> happening, so that fewer segments are created, so that fewer merges happen
> (I haven't dug deep here, so please correct me if this is something I should
> revert. Our current (5.x) config has this set at 324MB).

With large ram buffers, commits are more likely to control how big each
segment is and how frequently they are flushed.  Tests by Solr and
Lucene developers have shown that increasing the buffer size beyond
128MB rarely offers any advantage, unless the documents are huge.  At
5KB, yours aren't huge.

> The autoCommit and autoSoftCommit settings look good to me, as I've turned
> off softCommits, and I am autoCommitting at 10000 docs (every 5-10 minutes),
> which finishes smoothly, unless it gets stuck in the first problem described
> above.

Your autoCommit has openSearcher set to true.  Commits that open a new
searcher are very expensive.  It should be set to false.  You can rely
on autoSoftCommit to make documents visible, with a much longer maxTime
than you use for autoCommit.
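
A minimal sketch of that arrangement in solrconfig.xml (the times are only
illustrative, not a recommendation for this specific index):

  <autoCommit>
    <maxTime>60000</maxTime>            <!-- hard commit every 60s, flush to disk -->
    <openSearcher>false</openSearcher>  <!-- don't open a new searcher on hard commits -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>120000</maxTime>           <!-- soft commits control document visibility -->
  </autoSoftCommit>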

With a schema that's typical and documents that are not enormous, Solr
should be able to index at several thousand documents per second,
especially if there are multiple threads or multiple processes sending
documents.  A few thousand documents per minute 

Re: Consecutive calls to a query give different results

2017-09-07 Thread Erick Erickson
bq: So apparently it IS essential to run optimize after a data load

Don't do this if you can avoid it; you run the risk of excessive
amounts of your index consisting of deleted documents unless you are
following a process whereby you periodically (and I'm talking at least
hours apart, if not once per day) index data and then don't change the index for
a bunch more hours.

You're missing the point when it comes to deleted docs. Different
replicas of the _same_ shard commit at different wall clock times due
to network delays. Therefore, which segments are merged will not be
identical between replicas when a commit happens, since commits are
local.

So replica1 may merge segments 1, 3, 6 into segment 7
replica2 may merge segments 1, 2, 4 into segment 7

Here's the key: now replica1 may have 100 deleted documents (ones
marked as deleted but still in segments 2, 4 and 5), while
replica2 may have 90 deleted
documents (the ones still in segments 3, 5 and 6).

The statistics in the term frequency and document frequency for some
terms are _not_ the same. Therefore the scoring will be slightly
different. Therefore, depending on which replica serves the query, the
order of docs may be somewhat different if the scores are close.

optimizing squeezes all the deleted documents out of all the replicas
so the scores become identical.

This doesn't happen, of course, if you have only one replica.
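
(A hedged numeric illustration: Lucene 6.x scores with BM25 by default, where
idf = ln(1 + (N - df + 0.5) / (df + 0.5)) and N still counts documents marked
as deleted. The df values below are made up. If replica1's statistics count
N = 611,592 docs with df = 1,200 for some term while replica2 counts
N = 571,737 with df = 1,050, the idf works out to roughly ln(509.4) ≈ 6.23
versus ln(544.3) ≈ 6.30. A difference that small is still enough to reorder
documents whose scores are within a percent of each other, which is the same
order of magnitude as the 428-432 score spread reported in this thread.)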

Best,
Erick

On Thu, Sep 7, 2017 at 8:13 AM, Webster Homer  wrote:
> We have several solr clouds, a couple of them have only 1 replica per
> shard. We have never observed the problem when we have a single replica
> only when there are multiple replicas per shard.
>
> On Thu, Sep 7, 2017 at 10:08 AM, Webster Homer 
> wrote:
>
>> the scores are not the same
>> Doc
>> 305340 432.44238
>> C2646 428.24185
>> 12837 430.61722
>>
>> One other thing. I just ran optimize and now document 305340 is
>> consistently the top score.
>> So apparently it IS essential to run optimize after a data load
>>
>> Note we see this behavior fairly commonly on our solr cloud instances.
>> This was not the first time. This particular situation was on a development
>> system
>>
>> On Thu, Sep 7, 2017 at 10:04 AM, Webster Homer 
>> wrote:
>>
>>> the scores are not the same
>>> Doc
>>> 305340 432.44238
>>>
>>> On Thu, Sep 7, 2017 at 10:02 AM, David Hastings <
>>> hastings.recurs...@gmail.com> wrote:
>>>
 "I am concerned that the same
 search gives different results after each search. The top document seems
 to
 cycle between 3 different documents"


 if you do debug query on the search, are the scores for the top 3
 documents
 the same or not?  you can easily have three documents with the same
 score,
 so when you have a result set that is ranked 1-1-1-2-3-4 you can
 expect
 1-1-1 to rotate based on whatever.  use a second element like id to your
 ranking perhaps.




 On Thu, Sep 7, 2017 at 10:54 AM, Webster Homer 
 wrote:

 > I am not concerned about deleted documents. I am concerned that the
 same
 > search gives different results after each search. The top document
 seems to
 > cycle between 3 different documents
 >
 > I have an enhanced collections info api call that calls the core admin
 api
 > to get the index information for the replica.
 > When I said the numdocs were the same I meant exactly that. maxdocs and
 > deleted documents are not the same for the replicas, but the number of
 > numdocs is.
 >
 > Or are you saying that the search is looking at deleted documents
 wouldn't
 > that be a very significant bug?
 >
 > The four replicas:
 > shard1
 > core_node1
 > "numDocs": 383817,
 > "maxDocs": 611592,
 > "deletedDocs": 227775,
 > "size": "2.49 GB",
 > "lastModified": "2017-09-07T08:18:03.639Z",
 > "current": true,
 > "version": 35644,
 > "segmentCount": 28
 >
 > core_node3
 > "numDocs": 383817,
 > "maxDocs": 571737,
 > "deletedDocs": 187920,
 > "size": "2.85 GB",
 > "lastModified": "2017-09-07T08:18:03.634Z",
 > "current": false,
 > "version": 35562,
 > "segmentCount": 36
 > shard2
 > core_node2
 > "numDocs": 385326,
 > "maxDocs": 529214,
 > "deletedDocs": 143888,
 > "size": "2.13 GB",
 > "lastModified": "2017-09-07T08:18:03.632Z",
 > "current": true,
 > "version": 34783,
 > "segmentCount": 24
 > core_node4
 > "numDocs": 385326,
 > "maxDocs": 488201,
 > "deletedDocs": 102875,
 > "size": "1.96 GB",
 > "lastModified": "2017-09-07T08:18:03.633Z",
 > "current": true,
 > "version": 34932,
 > "segmentCount": 21
 >
 >
 > On Thu, Sep 7, 2017 at 7:58 AM, Yonik Seeley 
 wrote:
 >
 > > On 

Re: Sort across collapsed document is not working

2017-09-07 Thread Erick Erickson
Is this a sharded collection? group.ngroups isn't supported (see the
docs, "group.ngroups and group.facet require that all documents in
each group must be co-located on the same shard") in sharded
situations so it's not surprising that the results differ.

Best,
Erick

On Thu, Sep 7, 2017 at 10:35 AM, Ray Niu  wrote:
> Hello:
>I tried to use Collapsing Query Parser per following link:
>
> https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
> here is the query I am using
> http:///solr/collection/select?q=groupId:*&
> fl=id,groupId,date=%7B!collapse%20field=groupId%20sort=%27id%20asc%27%7D&
> expand=true=3=date%20asc=id%20asc=3
>
> but I found the result is different from group query:
> http:///solr/collection/select?q=groupId:*&
> fl=id,date,groupId=true=groupId
> limit=4=true=date%20asc=id%20asc=3
>
> it seems sort across collapsed document is not working.
>
> Can anyone help on this?


origFreq/freq ratio for filtering spell-check suggestions

2017-09-07 Thread Arnold Bronley
Hi Solr users,

I can see there are some parameters that can help in controlling the
trigger condition for spellcheck mechanism or filter the spell suggestions
like maxQueryFrequency or thresholdTokenFrequency. I could not find a
parameter that will filter the suggestions based on (origFreq/freq) ratio.
Is there any parameter like this? Or will I need to add custom logic at
client side to handle this? Please help.
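
If it does come down to client-side logic, a minimal SolrJ sketch might look
like the following (untested; it assumes spellcheck.extendedResults=true so
that origFreq and per-suggestion frequencies are returned, and the ratio
threshold and its direction are made up):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.response.SpellCheckResponse;
import org.apache.solr.client.solrj.response.SpellCheckResponse.Suggestion;

public class SuggestionFilter {
    /** Keep alternatives whose origFreq/freq ratio stays under a chosen threshold. */
    public static List<String> filterByRatio(SpellCheckResponse spell, double maxRatio) {
        List<String> kept = new ArrayList<>();
        if (spell == null) {
            return kept;
        }
        for (Suggestion s : spell.getSuggestions()) {
            int origFreq = s.getOriginalFrequency();              // needs extendedResults=true
            List<String> alts = s.getAlternatives();
            List<Integer> freqs = s.getAlternativeFrequencies();  // same order as alternatives
            for (int i = 0; i < alts.size(); i++) {
                double ratio = (double) origFreq / Math.max(1, freqs.get(i));
                if (ratio <= maxRatio) {                          // flip the comparison if needed
                    kept.add(alts.get(i));
                }
            }
        }
        return kept;
    }
}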


Re: Solr cloud optimizer

2017-09-07 Thread Tomas Fernandez Lobbe
By default Solr uses the “TieredMergePolicy”[1], but it can be configured in 
solrconfig, see [2].  Merges can be triggered for different reasons, but most 
commonly by segment flushes (commits) or other merges finishing.

Here is a nice visual demo of segment merging (a bit old but still mostly 
applies AFAIK): [3]

[1] 
https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/index/TieredMergePolicy.html
[2] https://lucene.apache.org/solr/guide/6_6/indexconfig-in-solrconfig.html
[3] 
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
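
For example, the main knobs of that default policy are set in solrconfig.xml's
<indexConfig> block, roughly like this (Solr 6.x syntax; the values shown are
just the Lucene defaults):

  <indexConfig>
    <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
      <!-- how many segments a single merge may combine -->
      <int name="maxMergeAtOnce">10</int>
      <!-- how many similar-sized segments are allowed per tier before merging -->
      <int name="segmentsPerTier">10</int>
    </mergePolicyFactory>
  </indexConfig>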

Tomas

> On Sep 7, 2017, at 10:00 AM, calamita.agost...@libero.it wrote:
> 
> 
> Hi all,
> I use SolrCloud with some collections with 3 shards each.
> Every day I insert and remove documents from the collections. I know that Solr
> starts an optimizer in the background to optimize indexes.
> Which policy does Solr apply in order to start the optimizer
> automatically? Number of deleted documents? Number of segments?
> Thanks.



Sort across collapsed document is not working

2017-09-07 Thread Ray Niu
Hello:
   I tried to use Collapsing Query Parser per following link:

https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
here is the query I am using
http:///solr/collection/select?q=groupId:*&
fl=id,groupId,date=%7B!collapse%20field=groupId%20sort=%27id%20asc%27%7D&
expand=true=3=date%20asc=id%20asc=3

but I found the result is different from group query:
http:///solr/collection/select?q=groupId:*&
fl=id,date,groupId=true=groupId
limit=4=true=date%20asc=id%20asc=3

it seems sort across collapsed document is not working.

Can anyone help on this?
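
(Spelled out parameter by parameter, the two requests being compared have
roughly this shape -- an illustrative reconstruction, not the exact URLs:)

  collapse/expand version:
    q=groupId:*
    fl=id,groupId,date
    fq={!collapse field=groupId sort='id asc'}
    expand=true
    expand.rows=3
    sort=date asc, id asc
    rows=3

  grouping version:
    q=groupId:*
    fl=id,date,groupId
    group=true
    group.field=groupId
    group.limit=4
    group.ngroups=true
    sort=date asc, id asc
    rows=3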


Solr cloud optimizer

2017-09-07 Thread calamita . agostino

Hi all,
I use SolrCloud with some collections with 3 shards each.
Every day I insert and remove documents from the collections. I know that Solr
starts an optimizer in the background to optimize indexes.
Which policy does Solr apply in order to start the optimizer
automatically? Number of deleted documents? Number of segments?
Thanks.

RE: Customizing JSON response of a query

2017-09-07 Thread Davis, Daniel (NIH/NLM) [C]
Sarvo,

I agree with Rick.   It is better to put something in front of Solr (or any 
search engine), because the search engine sort of fits into a 3-tier hierarchy 
along with the database service:

Load Balancer/Httpd front-end -> App -> RDBMS

Becomes:

Load Balancer/Httpd front-end -> App --+-> RDBMS
                                       |
                                       +-> Search Engine


This page, https://lucene.apache.org/solr/guide/6_6/response-writers.html,
covers the response writers available out of the box. There is an XLSX
Response Writer, and so the way to make your request into a feature would be to
have some sort of JSON transformation response writer.

As a temporary fix, you could probably use the VelocityResponseWriter to return
JSON structured the way you want, but this is a bit of a hack. Note that I have
ignored the issue of trailing commas, which any real
implementation would need to deal with:

   [
   #foreach($doc in $response.results)
     {
       ...
       "subdoc": {
         "innerkey": "$doc.subdoc.innerkey",
         "multivalue": [
         #foreach($val in $doc.subdoc.multivalue)
           "$val",
         #end
         ]
       }
     },
   #end
   ]

If you must do this transformation on the Solr Server, a better approach would 
be to deploy a servlet listener that transformed the JSON response.  In the 
bad-old days, implementing a custom response writer would be the recommended 
way to go, but I think Solr has grown up to the point where it would be better 
to handle this transformation outside of Solr.

One point I want to make is that JSON is very malleable in an application, 
whether front-end or not.   Check out http://jmespath.org/ - it functions as a 
sort of XPath for JSON documents, even if it isn't supported by a standards 
organization like W3C.   In front-end, there are also very useful 
transformational packages such as https://lodash.com/.   I prefer jmespath, 
because I am often jumping from Java to Python to JavaScript.


-Original Message-
From: Rick Leir [mailto:rl...@leirtech.com] 
Sent: Wednesday, September 06, 2017 8:21 PM
To: solr-user@lucene.apache.org
Subject: Re: Customizing JSON response of a query

Sarvo,
I hope the users do not read JSON. I would have thought you'd have a web app in 
front of Solr and some Javascript in the browser. Either would be able to 
transform Solr's output into a display format. But I suspect there is more to 
the problem, and I do not understand it all.
Cheers -- Rick

On September 6, 2017 4:42:03 PM EDT, Sarvothaman Madhavan  
wrote:
>Rick,
>
>My use case is this :
>
>I have a set of documents each of which have "sub documents" associated 
>with it. I have this in the json format and I am able to load this into 
>a solr collection. When I search within this set of documents using 
>solr, I want the response in "grouped" json format
>
>i.e
>
>{
>
>  "key": "value",
>
>  "sub_doc": [
>
>{
>
>  "inner_key": "inner_value"
>
>}
>
>  ]
>
>}
>
>
>
>instead of solrs default flat json format:
>
>i.e.
>
>{
>
>   "key":"value",
>
>   "subdoc.inner_key"= ["inner_value"]
>
>}
>
>
>
>I think the "grouped" json format will be much more intuitive to my end 
>users who are going to use the search
>
>
>
>P.S: Just to be clear I am not having any trouble querying 
>children/parent document since I have all of this stored using fully 
>qualified names in each document in the collection.
>
>
>
>
>
>Regards,
>
>Sarvo
>
>
>
>On Wed, Sep 6, 2017 at 3:52 PM, Rick Leir  wrote:
>
>> Sarvo,
>> What are you trying to achieve? Describe the use case.
>> Cheers -- Rick
>>
>> On September 6, 2017 12:36:08 PM EDT, "Davis, Daniel (NIH/NLM) [C]" < 
>> daniel.da...@nih.gov> wrote:
>> >It should be possible with a custom response handler.
>> >
>> >-Original Message-
>> >From: Sarvothaman Madhavan [mailto:relad...@gmail.com]
>> >Sent: Wednesday, September 06, 2017 10:17 AM
>> >To: solr-user@lucene.apache.org
>> >Subject: Customizing JSON response of a query
>> >
>> >Hello all,
>> >After a week of research I've come to the conclusion that there is
>no
>> >mechanism within solr where I can create a nested json response like
>> >this:
>> >https://pastebin.com/XavvUP94 . I am able to get something like this 
>> >https://pastebin.com/FeXRqG59.
>> >1. Am I right in assuming that within solr this is not possbile?
>> >2. Assuming it is, I imagine I would need to write custom response 
>> >writer in Java to customize the response. I am having a hard time 
>> >locating the right resource to get me started on writing this.
>> >
>> >Any ideas?
>> >
>> >Thanks,
>> >Sarvo
>>
>> --
>> Sorry for being brief. Alternate email is 

Conditions with multiple boosts in bf exists query

2017-09-07 Thread Eric Kurzenberger
I need to do a bf exists query that matches the following conditions:


-  IF a_score = 1 AND b_score = 2 THEN boost 30

-  IF a_score = 3 AND b_score = 4 THEN boost 20

So far, the bf portion of my query looks like this:

if(exists(query({!v="a_score_is:1"})),30,0)

But I’m having difficulty finding the correct syntax for the multiple 
conditions and boosts.

I was originally doing a bq query that looked like this:

bq=(a_score_is:1 AND b_score_is:2)^30 OR (a_score_is:3 AND b_score_is:4)^20

but I found that idf was skewing my expected results, as I don’t care about 
document frequency.
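
(One untested possibility is chaining Solr's boolean function queries -- if(),
and(), exists() and query() -- over both conditions; whitespace added here only
for readability:)

  bf=sum(
       if(and(exists(query({!v='a_score_is:1'})),
              exists(query({!v='b_score_is:2'}))), 30, 0),
       if(and(exists(query({!v='a_score_is:3'})),
              exists(query({!v='b_score_is:4'}))), 20, 0))

Another route that sidesteps idf entirely would be the constant-score boost
operator, e.g. bq=((a_score_is:1 AND b_score_is:2)^=30 OR (a_score_is:3 AND
b_score_is:4)^=20), if the Solr version in use supports ^=.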

Can anyone assist?

Cheers,

Eric



Re: Consecutive calls to a query give different results

2017-09-07 Thread Webster Homer
We have several solr clouds, a couple of them have only 1 replica per
shard. We have never observed the problem when we have a single replica
only when there are multiple replicas per shard.

On Thu, Sep 7, 2017 at 10:08 AM, Webster Homer 
wrote:

> the scores are not the same
> Doc
> 305340 432.44238
> C2646 428.24185
> 12837 430.61722
>
> One other thing. I just ran optimize and now document 305340 is
> consistently the top score.
> So apparently it IS essential to run optimize after a data load
>
> Note we see this behavior fairly commonly on our solr cloud instances.
> This was not the first time. This particular situation was on a development
> system
>
> On Thu, Sep 7, 2017 at 10:04 AM, Webster Homer 
> wrote:
>
>> the scores are not the same
>> Doc
>> 305340 432.44238
>>
>> On Thu, Sep 7, 2017 at 10:02 AM, David Hastings <
>> hastings.recurs...@gmail.com> wrote:
>>
>>> "I am concerned that the same
>>> search gives different results after each search. The top document seems
>>> to
>>> cycle between 3 different documents"
>>>
>>>
>>> if you do debug query on the search, are the scores for the top 3
>>> documents
>>> the same or not?  you can easily have three documents with the same
>>> score,
>>> so when you have a result set that is ranked 1-1-1-2-3-4 you can
>>> expect
>>> 1-1-1 to rotate based on whatever.  use a second element like id to your
>>> ranking perhaps.
>>>
>>>
>>>
>>>
>>> On Thu, Sep 7, 2017 at 10:54 AM, Webster Homer 
>>> wrote:
>>>
>>> > I am not concerned about deleted documents. I am concerned that the
>>> same
>>> > search gives different results after each search. The top document
>>> seems to
>>> > cycle between 3 different documents
>>> >
>>> > I have an enhanced collections info api call that calls the core admin
>>> api
>>> > to get the index information for the replica.
>>> > When I said the numdocs were the same I meant exactly that. maxdocs and
>>> > deleted documents are not the same for the replicas, but the number of
>>> > numdocs is.
>>> >
>>> > Or are you saying that the search is looking at deleted documents
>>> wouldn't
>>> > that be a very significant bug?
>>> >
>>> > The four replicas:
>>> > shard1
>>> > core_node1
>>> > "numDocs": 383817,
>>> > "maxDocs": 611592,
>>> > "deletedDocs": 227775,
>>> > "size": "2.49 GB",
>>> > "lastModified": "2017-09-07T08:18:03.639Z",
>>> > "current": true,
>>> > "version": 35644,
>>> > "segmentCount": 28
>>> >
>>> > core_node3
>>> > "numDocs": 383817,
>>> > "maxDocs": 571737,
>>> > "deletedDocs": 187920,
>>> > "size": "2.85 GB",
>>> > "lastModified": "2017-09-07T08:18:03.634Z",
>>> > "current": false,
>>> > "version": 35562,
>>> > "segmentCount": 36
>>> > shard2
>>> > core_node2
>>> > "numDocs": 385326,
>>> > "maxDocs": 529214,
>>> > "deletedDocs": 143888,
>>> > "size": "2.13 GB",
>>> > "lastModified": "2017-09-07T08:18:03.632Z",
>>> > "current": true,
>>> > "version": 34783,
>>> > "segmentCount": 24
>>> > core_node4
>>> > "numDocs": 385326,
>>> > "maxDocs": 488201,
>>> > "deletedDocs": 102875,
>>> > "size": "1.96 GB",
>>> > "lastModified": "2017-09-07T08:18:03.633Z",
>>> > "current": true,
>>> > "version": 34932,
>>> > "segmentCount": 21
>>> >
>>> >
>>> > On Thu, Sep 7, 2017 at 7:58 AM, Yonik Seeley 
>>> wrote:
>>> >
>>> > > On Thu, Sep 7, 2017 at 12:47 AM, Erick Erickson <
>>> erickerick...@gmail.com
>>> > >
>>> > > wrote:
>>> > > > bq: and deleted documents are irrelevant to term statistics...
>>> > > >
>>> > > > Did you mean "relevant"? Or do I have to adjust my thinking
>>> _again_?
>>> > >
>>> > > One can make it work either way ;-)
>>> > > Whether a document is marked as deleted or not has no effect on term
>>> > > statistics (i.e. irrelevant)
>>> > > OR documents marked for deletion still count in term statistics (i.e.
>>> > > relevant)
>>> > >
>>> > > I guess I used the former because we don't go out of our way to still
>>> > > include deleted documents... it's just a side effect of the index
>>> > > structure that we don't (and can't easily) update statistics when a
>>> > > document is marked as deleted.
>>> > >
>>> > > -Yonik
>>> > >
>>> > >
>>> > > > Erick
>>> > > >
>>> > > > On Wed, Sep 6, 2017 at 7:48 PM, Yonik Seeley 
>>> > wrote:
>>> > > >> Different replicas of the same shard can have different numbers of
>>> > > >> deleted documents (really just marked as deleted), and deleted
>>> > > >> documents are irrelevant to term statistics (like the number of
>>> > > >> documents a term appears in).  Documents marked for deletion stop
>>> > > >> contributing to corpus statistics when they are actually removed
>>> (via
>>> > > >> expunge deletes, merges, optimizes).
>>> > > >> -Yonik
>>> > > >>
>>> > > >>
>>> > > >> On Wed, Sep 6, 2017 at 5:51 PM, Webster Homer <
>>> webster.ho...@sial.com
>>> > >
>>> > > wrote:
>>> > > >>> I am using Solr 6.2.0 configured as a solr cloud with 2 shards
>>> and 4
>>> > 

Re: Consecutive calls to a query give different results

2017-09-07 Thread Webster Homer
the scores are not the same
Doc
305340 432.44238
C2646 428.24185
12837 430.61722

One other thing. I just ran optimize and now document 305340 is
consistently the top score.
So apparently it IS essential to run optimize after a data load

Note we see this behavior fairly commonly on our solr cloud instances. This
was not the first time. This particular situation was on a development
system

On Thu, Sep 7, 2017 at 10:04 AM, Webster Homer 
wrote:

> the scores are not the same
> Doc
> 305340 432.44238
>
> On Thu, Sep 7, 2017 at 10:02 AM, David Hastings <
> hastings.recurs...@gmail.com> wrote:
>
>> "I am concerned that the same
>> search gives different results after each search. The top document seems
>> to
>> cycle between 3 different documents"
>>
>>
>> if you do debug query on the search, are the scores for the top 3
>> documents
>> the same or not?  you can easily have three documents with the same score,
>> so when you have a result set that is ranked 1-1-1-2-3-4 you can
>> expect
>> 1-1-1 to rotate based on whatever.  use a second element like id to your
>> ranking perhaps.
>>
>>
>>
>>
>> On Thu, Sep 7, 2017 at 10:54 AM, Webster Homer 
>> wrote:
>>
>> > I am not concerned about deleted documents. I am concerned that the same
>> > search gives different results after each search. The top document
>> seems to
>> > cycle between 3 different documents
>> >
>> > I have an enhanced collections info api call that calls the core admin
>> api
>> > to get the index information for the replica.
>> > When I said the numdocs were the same I meant exactly that. maxdocs and
>> > deleted documents are not the same for the replicas, but the number of
>> > numdocs is.
>> >
>> > Or are you saying that the search is looking at deleted documents
>> wouldn't
>> > that be a very significant bug?
>> >
>> > The four replicas:
>> > shard1
>> > core_node1
>> > "numDocs": 383817,
>> > "maxDocs": 611592,
>> > "deletedDocs": 227775,
>> > "size": "2.49 GB",
>> > "lastModified": "2017-09-07T08:18:03.639Z",
>> > "current": true,
>> > "version": 35644,
>> > "segmentCount": 28
>> >
>> > core_node3
>> > "numDocs": 383817,
>> > "maxDocs": 571737,
>> > "deletedDocs": 187920,
>> > "size": "2.85 GB",
>> > "lastModified": "2017-09-07T08:18:03.634Z",
>> > "current": false,
>> > "version": 35562,
>> > "segmentCount": 36
>> > shard2
>> > core_node2
>> > "numDocs": 385326,
>> > "maxDocs": 529214,
>> > "deletedDocs": 143888,
>> > "size": "2.13 GB",
>> > "lastModified": "2017-09-07T08:18:03.632Z",
>> > "current": true,
>> > "version": 34783,
>> > "segmentCount": 24
>> > core_node4
>> > "numDocs": 385326,
>> > "maxDocs": 488201,
>> > "deletedDocs": 102875,
>> > "size": "1.96 GB",
>> > "lastModified": "2017-09-07T08:18:03.633Z",
>> > "current": true,
>> > "version": 34932,
>> > "segmentCount": 21
>> >
>> >
>> > On Thu, Sep 7, 2017 at 7:58 AM, Yonik Seeley  wrote:
>> >
>> > > On Thu, Sep 7, 2017 at 12:47 AM, Erick Erickson <
>> erickerick...@gmail.com
>> > >
>> > > wrote:
>> > > > bq: and deleted documents are irrelevant to term statistics...
>> > > >
>> > > > Did you mean "relevant"? Or do I have to adjust my thinking _again_?
>> > >
>> > > One can make it work either way ;-)
>> > > Whether a document is marked as deleted or not has no effect on term
>> > > statistics (i.e. irrelevant)
>> > > OR documents marked for deletion still count in term statistics (i.e.
>> > > relevant)
>> > >
>> > > I guess I used the former because we don't go out of our way to still
>> > > include deleted documents... it's just a side effect of the index
>> > > structure that we don't (and can't easily) update statistics when a
>> > > document is marked as deleted.
>> > >
>> > > -Yonik
>> > >
>> > >
>> > > > Erick
>> > > >
>> > > > On Wed, Sep 6, 2017 at 7:48 PM, Yonik Seeley 
>> > wrote:
>> > > >> Different replicas of the same shard can have different numbers of
>> > > >> deleted documents (really just marked as deleted), and deleted
>> > > >> documents are irrelevant to term statistics (like the number of
>> > > >> documents a term appears in).  Documents marked for deletion stop
>> > > >> contributing to corpus statistics when they are actually removed
>> (via
>> > > >> expunge deletes, merges, optimizes).
>> > > >> -Yonik
>> > > >>
>> > > >>
>> > > >> On Wed, Sep 6, 2017 at 5:51 PM, Webster Homer <
>> webster.ho...@sial.com
>> > >
>> > > wrote:
>> > > >>> I am using Solr 6.2.0 configured as a solr cloud with 2 shards
>> and 4
>> > > >>> replicas (total of 4 nodes).
>> > > >>>
>> > > >>> If I run the query multiple times I see the three different top
>> > scoring
>> > > >>> results.
>> > > >>> No data load is running, all data has been commited
>> > > >>>
>> > > >>> I get these three different hits with their scores:
>> > > >>> copperiinitratehemipentahydrate2325919004194430.61722
>> > > >>> copperiinitrateoncelite1234598765
>> > >  432.44238
>> > > >>> 

Re: Consecutive calls to a query give different results

2017-09-07 Thread Webster Homer
the scores are not the same
Doc
305340 432.44238

On Thu, Sep 7, 2017 at 10:02 AM, David Hastings <
hastings.recurs...@gmail.com> wrote:

> "I am concerned that the same
> search gives different results after each search. The top document seems to
> cycle between 3 different documents"
>
>
> if you do debug query on the search, are the scores for the top 3 documents
> the same or not?  you can easily have three documents with the same score,
> so when you have a result set that is ranked 1-1-1-2-3-4 you can expect
> 1-1-1 to rotate based on whatever.  use a second element like id to your
> ranking perhaps.
>
>
>
>
> On Thu, Sep 7, 2017 at 10:54 AM, Webster Homer 
> wrote:
>
> > I am not concerned about deleted documents. I am concerned that the same
> > search gives different results after each search. The top document seems
> to
> > cycle between 3 different documents
> >
> > I have an enhanced collections info api call that calls the core admin
> api
> > to get the index information for the replica.
> > When I said the numdocs were the same I meant exactly that. maxdocs and
> > deleted documents are not the same for the replicas, but the number of
> > numdocs is.
> >
> > Or are you saying that the search is looking at deleted documents
> wouldn't
> > that be a very significant bug?
> >
> > The four replicas:
> > shard1
> > core_node1
> > "numDocs": 383817,
> > "maxDocs": 611592,
> > "deletedDocs": 227775,
> > "size": "2.49 GB",
> > "lastModified": "2017-09-07T08:18:03.639Z",
> > "current": true,
> > "version": 35644,
> > "segmentCount": 28
> >
> > core_node3
> > "numDocs": 383817,
> > "maxDocs": 571737,
> > "deletedDocs": 187920,
> > "size": "2.85 GB",
> > "lastModified": "2017-09-07T08:18:03.634Z",
> > "current": false,
> > "version": 35562,
> > "segmentCount": 36
> > shard2
> > core_node2
> > "numDocs": 385326,
> > "maxDocs": 529214,
> > "deletedDocs": 143888,
> > "size": "2.13 GB",
> > "lastModified": "2017-09-07T08:18:03.632Z",
> > "current": true,
> > "version": 34783,
> > "segmentCount": 24
> > core_node4
> > "numDocs": 385326,
> > "maxDocs": 488201,
> > "deletedDocs": 102875,
> > "size": "1.96 GB",
> > "lastModified": "2017-09-07T08:18:03.633Z",
> > "current": true,
> > "version": 34932,
> > "segmentCount": 21
> >
> >
> > On Thu, Sep 7, 2017 at 7:58 AM, Yonik Seeley  wrote:
> >
> > > On Thu, Sep 7, 2017 at 12:47 AM, Erick Erickson <
> erickerick...@gmail.com
> > >
> > > wrote:
> > > > bq: and deleted documents are irrelevant to term statistics...
> > > >
> > > > Did you mean "relevant"? Or do I have to adjust my thinking _again_?
> > >
> > > One can make it work either way ;-)
> > > Whether a document is marked as deleted or not has no effect on term
> > > statistics (i.e. irrelevant)
> > > OR documents marked for deletion still count in term statistics (i.e.
> > > relevant)
> > >
> > > I guess I used the former because we don't go out of our way to still
> > > include deleted documents... it's just a side effect of the index
> > > structure that we don't (and can't easily) update statistics when a
> > > document is marked as deleted.
> > >
> > > -Yonik
> > >
> > >
> > > > Erick
> > > >
> > > > On Wed, Sep 6, 2017 at 7:48 PM, Yonik Seeley 
> > wrote:
> > > >> Different replicas of the same shard can have different numbers of
> > > >> deleted documents (really just marked as deleted), and deleted
> > > >> documents are irrelevant to term statistics (like the number of
> > > >> documents a term appears in).  Documents marked for deletion stop
> > > >> contributing to corpus statistics when they are actually removed
> (via
> > > >> expunge deletes, merges, optimizes).
> > > >> -Yonik
> > > >>
> > > >>
> > > >> On Wed, Sep 6, 2017 at 5:51 PM, Webster Homer <
> webster.ho...@sial.com
> > >
> > > wrote:
> > > >>> I am using Solr 6.2.0 configured as a solr cloud with 2 shards and
> 4
> > > >>> replicas (total of 4 nodes).
> > > >>>
> > > >>> If I run the query multiple times I see the three different top
> > scoring
> > > >>> results.
> > > >>> No data load is running, all data has been commited
> > > >>>
> > > >>> I get these three different hits with their scores:
> > > >>> copperiinitratehemipentahydrate2325919004194430.61722
> > > >>> copperiinitrateoncelite1234598765
> > >  432.44238
> > > >>> copperiinitratehydrate18756anhydrousbasis13778319 428.24185
> > > >>>
> > > >>> How is it that the same search against the same data can give
> > different
> > > >>> responses?
> > > >>> I looked at the specific cores they look OK the numdocs for the
> > > replicas in
> > > >>> a shard match
> > > >>>
> > > >>> This is the query:
> > > >>> http://ae1c-ecomdev-msc01.sial.com:8983/solr/sial-
> > > catalog-product/select?defType=edismax=searchmv_
> > > en_keywords,%20searchmv_keywords,searchmv_pno,%
> > 20searchmv_en_s_pri_name,%
> > > 20search_en_p_pri_name,%20search_pno%20[explain%
> > > 20style=nl]=id_s=30=true&
> > > 

Re: Consecutive calls to a query give different results

2017-09-07 Thread David Hastings
"I am concerned that the same
search gives different results after each search. The top document seems to
cycle between 3 different documents"


If you do debug query on the search, are the scores for the top 3 documents
the same or not?  You can easily have three documents with the same score,
so when you have a result set that is ranked 1-1-1-2-3-4 you can expect
1-1-1 to rotate based on whatever. Use a second element like id in your
ranking, perhaps.
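
For example, something along the lines of

  sort=score desc, id asc

so that ties on score fall back to a stable field.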




On Thu, Sep 7, 2017 at 10:54 AM, Webster Homer 
wrote:

> I am not concerned about deleted documents. I am concerned that the same
> search gives different results after each search. The top document seems to
> cycle between 3 different documents
>
> I have an enhanced collections info api call that calls the core admin api
> to get the index information for the replica.
> When I said the numdocs were the same I meant exactly that. maxdocs and
> deleted documents are not the same for the replicas, but the number of
> numdocs is.
>
> Or are you saying that the search is looking at deleted documents wouldn't
> that be a very significant bug?
>
> The four replicas:
> shard1
> core_node1
> "numDocs": 383817,
> "maxDocs": 611592,
> "deletedDocs": 227775,
> "size": "2.49 GB",
> "lastModified": "2017-09-07T08:18:03.639Z",
> "current": true,
> "version": 35644,
> "segmentCount": 28
>
> core_node3
> "numDocs": 383817,
> "maxDocs": 571737,
> "deletedDocs": 187920,
> "size": "2.85 GB",
> "lastModified": "2017-09-07T08:18:03.634Z",
> "current": false,
> "version": 35562,
> "segmentCount": 36
> shard2
> core_node2
> "numDocs": 385326,
> "maxDocs": 529214,
> "deletedDocs": 143888,
> "size": "2.13 GB",
> "lastModified": "2017-09-07T08:18:03.632Z",
> "current": true,
> "version": 34783,
> "segmentCount": 24
> core_node4
> "numDocs": 385326,
> "maxDocs": 488201,
> "deletedDocs": 102875,
> "size": "1.96 GB",
> "lastModified": "2017-09-07T08:18:03.633Z",
> "current": true,
> "version": 34932,
> "segmentCount": 21
>
>
> On Thu, Sep 7, 2017 at 7:58 AM, Yonik Seeley  wrote:
>
> > On Thu, Sep 7, 2017 at 12:47 AM, Erick Erickson  >
> > wrote:
> > > bq: and deleted documents are irrelevant to term statistics...
> > >
> > > Did you mean "relevant"? Or do I have to adjust my thinking _again_?
> >
> > One can make it work either way ;-)
> > Whether a document is marked as deleted or not has no effect on term
> > statistics (i.e. irrelevant)
> > OR documents marked for deletion still count in term statistics (i.e.
> > relevant)
> >
> > I guess I used the former because we don't go out of our way to still
> > include deleted documents... it's just a side effect of the index
> > structure that we don't (and can't easily) update statistics when a
> > document is marked as deleted.
> >
> > -Yonik
> >
> >
> > > Erick
> > >
> > > On Wed, Sep 6, 2017 at 7:48 PM, Yonik Seeley 
> wrote:
> > >> Different replicas of the same shard can have different numbers of
> > >> deleted documents (really just marked as deleted), and deleted
> > >> documents are irrelevant to term statistics (like the number of
> > >> documents a term appears in).  Documents marked for deletion stop
> > >> contributing to corpus statistics when they are actually removed (via
> > >> expunge deletes, merges, optimizes).
> > >> -Yonik
> > >>
> > >>
> > >> On Wed, Sep 6, 2017 at 5:51 PM, Webster Homer  >
> > wrote:
> > >>> I am using Solr 6.2.0 configured as a solr cloud with 2 shards and 4
> > >>> replicas (total of 4 nodes).
> > >>>
> > >>> If I run the query multiple times I see the three different top
> scoring
> > >>> results.
> > >>> No data load is running, all data has been commited
> > >>>
> > >>> I get these three different hits with their scores:
> > >>> copperiinitratehemipentahydrate2325919004194430.61722
> > >>> copperiinitrateoncelite1234598765
> >  432.44238
> > >>> copperiinitratehydrate18756anhydrousbasis13778319 428.24185
> > >>>
> > >>> How is it that the same search against the same data can give
> different
> > >>> responses?
> > >>> I looked at the specific cores they look OK the numdocs for the
> > replicas in
> > >>> a shard match
> > >>>
> > >>> This is the query:
> > >>> http://ae1c-ecomdev-msc01.sial.com:8983/solr/sial-
> > catalog-product/select?defType=edismax=searchmv_
> > en_keywords,%20searchmv_keywords,searchmv_pno,%
> 20searchmv_en_s_pri_name,%
> > 20search_en_p_pri_name,%20search_pno%20[explain%
> > 20style=nl]=id_s=30=true&
> > group.sort=sort_ds%20asc=on=2%3C-25%25=
> > OR=copper%20nitrate=search_pid
> > >>> ^500%20search_concat_pno^400%20searchmv_concat_sku^400%
> > 20searchmv_pno^300%20search_concat_pno_genr^100%20searchmv_pno_genr%
> > 20searchmv_p_skus_genr%20searchmv_user_term^200%
> > 20search_lform^190%20searchmv_en_acronym^180%20search_en_
> > root_name^170%20searchmv_en_s_pri_name^160%20search_en_p_
> > 

Re: Consecutive calls to a query give different results

2017-09-07 Thread Webster Homer
I am not concerned about deleted documents. I am concerned that the same
search gives different results after each search. The top document seems to
cycle between 3 different documents

I have an enhanced collections info api call that calls the core admin api
to get the index information for the replica.
When I said the numdocs were the same I meant exactly that. maxdocs and
deleted documents are not the same for the replicas, but the number of
numdocs is.

Or are you saying that the search is looking at deleted documents? Wouldn't
that be a very significant bug?

The four replicas:
shard1
core_node1
"numDocs": 383817,
"maxDocs": 611592,
"deletedDocs": 227775,
"size": "2.49 GB",
"lastModified": "2017-09-07T08:18:03.639Z",
"current": true,
"version": 35644,
"segmentCount": 28

core_node3
"numDocs": 383817,
"maxDocs": 571737,
"deletedDocs": 187920,
"size": "2.85 GB",
"lastModified": "2017-09-07T08:18:03.634Z",
"current": false,
"version": 35562,
"segmentCount": 36
shard2
core_node2
"numDocs": 385326,
"maxDocs": 529214,
"deletedDocs": 143888,
"size": "2.13 GB",
"lastModified": "2017-09-07T08:18:03.632Z",
"current": true,
"version": 34783,
"segmentCount": 24
core_node4
"numDocs": 385326,
"maxDocs": 488201,
"deletedDocs": 102875,
"size": "1.96 GB",
"lastModified": "2017-09-07T08:18:03.633Z",
"current": true,
"version": 34932,
"segmentCount": 21


On Thu, Sep 7, 2017 at 7:58 AM, Yonik Seeley  wrote:

> On Thu, Sep 7, 2017 at 12:47 AM, Erick Erickson 
> wrote:
> > bq: and deleted documents are irrelevant to term statistics...
> >
> > Did you mean "relevant"? Or do I have to adjust my thinking _again_?
>
> One can make it work either way ;-)
> Whether a document is marked as deleted or not has no effect on term
> statistics (i.e. irrelevant)
> OR documents marked for deletion still count in term statistics (i.e.
> relevant)
>
> I guess I used the former because we don't go out of our way to still
> include deleted documents... it's just a side effect of the index
> structure that we don't (and can't easily) update statistics when a
> document is marked as deleted.
>
> -Yonik
>
>
> > Erick
> >
> > On Wed, Sep 6, 2017 at 7:48 PM, Yonik Seeley  wrote:
> >> Different replicas of the same shard can have different numbers of
> >> deleted documents (really just marked as deleted), and deleted
> >> documents are irrelevant to term statistics (like the number of
> >> documents a term appears in).  Documents marked for deletion stop
> >> contributing to corpus statistics when they are actually removed (via
> >> expunge deletes, merges, optimizes).
> >> -Yonik
> >>
> >>
> >> On Wed, Sep 6, 2017 at 5:51 PM, Webster Homer 
> wrote:
> >>> I am using Solr 6.2.0 configured as a solr cloud with 2 shards and 4
> >>> replicas (total of 4 nodes).
> >>>
> >>> If I run the query multiple times I see the three different top scoring
> >>> results.
> >>> No data load is running, all data has been commited
> >>>
> >>> I get these three different hits with their scores:
> >>> copperiinitratehemipentahydrate2325919004194430.61722
> >>> copperiinitrateoncelite1234598765
>  432.44238
> >>> copperiinitratehydrate18756anhydrousbasis13778319 428.24185
> >>>
> >>> How is it that the same search against the same data can give different
> >>> responses?
> >>> I looked at the specific cores they look OK the numdocs for the
> replicas in
> >>> a shard match
> >>>
> >>> This is the query:
> >>> http://ae1c-ecomdev-msc01.sial.com:8983/solr/sial-
> catalog-product/select?defType=edismax=searchmv_
> en_keywords,%20searchmv_keywords,searchmv_pno,%20searchmv_en_s_pri_name,%
> 20search_en_p_pri_name,%20search_pno%20[explain%
> 20style=nl]=id_s=30=true&
> group.sort=sort_ds%20asc=on=2%3C-25%25=
> OR=copper%20nitrate=search_pid
> >>> ^500%20search_concat_pno^400%20searchmv_concat_sku^400%
> 20searchmv_pno^300%20search_concat_pno_genr^100%20searchmv_pno_genr%
> 20searchmv_p_skus_genr%20searchmv_user_term^200%
> 20search_lform^190%20searchmv_en_acronym^180%20search_en_
> root_name^170%20searchmv_en_s_pri_name^160%20search_en_p_
> pri_name^150%20searchmv_en_synonyms^145%20searchmv_en_
> keywords^140%20search_en_sortkey^120%20searchmv_p_skus^
> 100%20searchmv_chem_comp^90%20searchmv_en_name_suf%
> 20searchmv_cas_number^80%20searchmv_component_cas^70%
> 20search_beilstein^50%20search_color_idx^40%20search_ecnumber^30%20search_
> egecnumber^30%20search_femanumber^20%20searchmv_isbn^
> 10%20search_mdl_number%20searchmv_en_page_title%
> 20searchmv_en_descriptions%20searchmv_en_attributes%
> 20searchmv_rtecs%20searchmv_lookahead_terms%20searchmv_
> xref_comparable_pno%20searchmv_xref_comparable_sku%20searchmv_xref_
> equivalent_pno%20searchmv_xref_exact_pno%20searchmv_
> xref_exact_sku%20searchmv_component_molform=30&
> sort=score%20desc,sort_en_name%20asc,sort_ds%20asc,
> search_pid%20asc=json
> >>>

Re: Consecutive calls to a query give different results

2017-09-07 Thread Erick Erickson
Whew! I haven't been lying to people for _years_..

On Thu, Sep 7, 2017 at 5:58 AM, Yonik Seeley  wrote:
> On Thu, Sep 7, 2017 at 12:47 AM, Erick Erickson  
> wrote:
>> bq: and deleted documents are irrelevant to term statistics...
>>
>> Did you mean "relevant"? Or do I have to adjust my thinking _again_?
>
> One can make it work either way ;-)
> Whether a document is marked as deleted or not has no effect on term
> statistics (i.e. irrelevant)
> OR documents marked for deletion still count in term statistics (i.e. 
> relevant)
>
> I guess I used the former because we don't go out of our way to still
> include deleted documents... it's just a side effect of the index
> structure that we don't (and can't easily) update statistics when a
> document is marked as deleted.
>
> -Yonik
>
>
>> Erick
>>
>> On Wed, Sep 6, 2017 at 7:48 PM, Yonik Seeley  wrote:
>>> Different replicas of the same shard can have different numbers of
>>> deleted documents (really just marked as deleted), and deleted
>>> documents are irrelevant to term statistics (like the number of
>>> documents a term appears in).  Documents marked for deletion stop
>>> contributing to corpus statistics when they are actually removed (via
>>> expunge deletes, merges, optimizes).
>>> -Yonik
>>>
>>>
>>> On Wed, Sep 6, 2017 at 5:51 PM, Webster Homer  
>>> wrote:
 I am using Solr 6.2.0 configured as a solr cloud with 2 shards and 4
 replicas (total of 4 nodes).

 If I run the query multiple times I see the three different top scoring
 results.
 No data load is running, all data has been commited

 I get these three different hits with their scores:
 copperiinitratehemipentahydrate2325919004194430.61722
 copperiinitrateoncelite1234598765   432.44238
 copperiinitratehydrate18756anhydrousbasis13778319 428.24185

 How is it that the same search against the same data can give different
 responses?
 I looked at the specific cores they look OK the numdocs for the replicas in
 a shard match

 This is the query:
 http://ae1c-ecomdev-msc01.sial.com:8983/solr/sial-catalog-product/select?defType=edismax=searchmv_en_keywords,%20searchmv_keywords,searchmv_pno,%20searchmv_en_s_pri_name,%20search_en_p_pri_name,%20search_pno%20[explain%20style=nl]=id_s=30=true=sort_ds%20asc=on=2%3C-25%25=OR=copper%20nitrate=search_pid
 ^500%20search_concat_pno^400%20searchmv_concat_sku^400%20searchmv_pno^300%20search_concat_pno_genr^100%20searchmv_pno_genr%20searchmv_p_skus_genr%20searchmv_user_term^200%20search_lform^190%20searchmv_en_acronym^180%20search_en_root_name^170%20searchmv_en_s_pri_name^160%20search_en_p_pri_name^150%20searchmv_en_synonyms^145%20searchmv_en_keywords^140%20search_en_sortkey^120%20searchmv_p_skus^100%20searchmv_chem_comp^90%20searchmv_en_name_suf%20searchmv_cas_number^80%20searchmv_component_cas^70%20search_beilstein^50%20search_color_idx^40%20search_ecnumber^30%20search_egecnumber^30%20search_femanumber^20%20searchmv_isbn^10%20search_mdl_number%20searchmv_en_page_title%20searchmv_en_descriptions%20searchmv_en_attributes%20searchmv_rtecs%20searchmv_lookahead_terms%20searchmv_xref_comparable_pno%20searchmv_xref_comparable_sku%20searchmv_xref_equivalent_pno%20searchmv_xref_exact_pno%20searchmv_xref_exact_sku%20searchmv_component_molform=30=score%20desc,sort_en_name%20asc,sort_ds%20asc,search_pid%20asc=json



Solr Commit Thread Blocked because of excessive number of merging threads

2017-09-07 Thread yasoobhaider
Hi

My team has tasked me with upgrading Solr from the version we are using
(5.4) to the latest stable version, 6.6. I have been stuck for a few days now
on the indexing part.

First I'll list the requirements, then all the configuration settings I have
tried.

In total I'm indexing about 2.5 million documents. The average document
size is ~5 KB. I have 10 (PHP) workers running in parallel, hitting Solr
with ~1K docs/minute (this sometimes goes up to ~3K docs/minute).

System specifications:
RAM: 120G
Processors: 16

Solr configuration:
Heap size: 80G


solrconfig.xml: (Relevant parts; please let me know if there's anything else
you would like to look at)


  1
  380
  true



  ${solr.autoSoftCommit.maxTime:-1}


5000
1


  30
  30



  8
  7
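
(The XML element names above were lost in the archived message; assuming the
values map to ramBufferSizeMB=5000, maxMergeAtOnce=30, segmentsPerTier=30,
maxMergeCount=8 and maxThreadCount=7 -- which is consistent with the rest of
the thread -- the Lucene-level equivalent of these merge settings is roughly:)

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

public class MergeSettingsSketch {
  public static IndexWriterConfig build() {
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    iwc.setRAMBufferSizeMB(5000);            // ~5 GB indexing buffer

    TieredMergePolicy policy = new TieredMergePolicy();
    policy.setMaxMergeAtOnce(30);            // assumed value, see note above
    policy.setSegmentsPerTier(30);           // assumed value, see note above
    iwc.setMergePolicy(policy);

    ConcurrentMergeScheduler scheduler = new ConcurrentMergeScheduler();
    scheduler.setMaxMergesAndThreads(8, 7);  // maxMergeCount, maxThreadCount
    iwc.setMergeScheduler(scheduler);
    return iwc;
  }
}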




The main problem:

When I start indexing, everything is good until I reach about 2 million docs,
which takes ~10 hours. But then the commitscheduler thread gets blocked. It
is stuck at doStall() in ConcurrentMergeScheduler (CMS). Looking at the logs
from InfoStream, I found a "too many merges; stalling" message from the
commitscheduler thread, after which it gets stuck in the while loop forever.

Here's the check that's stalling our commitscheduler thread.

while (writer.hasPendingMerges() && mergeThreadCount() >= maxMergeCount) {
  ..
  ..
  if (verbose() && startStallTime == 0) {
    message("too many merges; stalling...");
  }
  startStallTime = System.currentTimeMillis();
  doStall();
}

This is the reason I have put maxMergeCount and maxThreadCount explicitly in
my solrconfig. I thought increasing the number of threads would make sure
that there is always one extra thread for commit to go through. But now that
I have increased the allowed number of threads, Lucene just spawns that many
"Lucene Merge Thread"s and leaves none for when a commit comes along and
triggers a merge. And then it gets stuck forever.

Well, not really forever: I'm guessing that once one of the merging threads
is removed (via removeMergeThread() in CMS) the commit will go through, but
for some reason the merging is so slow that this never happens (I gave it a
couple of hours, and the commit thread was still stuck). Which brings us to
the second problem.



The second problem:
Merging is extremely slow. I'm not sure what I'm missing here. Maybe there's
a change in the 6.x line that has significantly hampered merging speed. From
the thread dump, what I can see is that the "Lucene Merge Thread"s are in the
RUNNABLE state, inside a TreeMap.getEntry() call. Is this normal?

Another thing I noticed is that disk I/O is throttled at ~20 MB/s, but I'm
not sure whether that alone can slow merging this much.

My index size was ~10 GB; I left it overnight (~6 hours) and almost no
merging happened.

Here's another infoStream message from logs. Just putting it here in case it
helps.

-

2017-09-06 14:11:07.921 INFO  (qtp834133664-115) [   x:collection1]
o.a.s.u.LoggingInfoStream [MS][qtp834133664-115]: updateMergeThreads
ioThrottle=true targetMBPerSec=23.6 MB/sec
merge thread Lucene Merge Thread #4 estSize=5116.1 MB (written=4198.1 MB)
runTime=8100.1s (stopped=0.0s, paused=142.5s) rate=19.7 MB/sec
  now change from 19.7 MB/sec to 23.6 MB/sec
merge thread Lucene Merge Thread #7 estSize=1414.3 MB (written=0.0 MB)
runTime=0.0s (stopped=0.0s, paused=0.0s) rate=23.6 MB/sec
  leave running at 23.6 MB/sec
merge thread Lucene Merge Thread #5 estSize=1014.4 MB (written=427.2 MB)
runTime=6341.9s (stopped=0.0s, paused=12.3s) rate=19.7 MB/sec
  now change from 19.7 MB/sec to 23.6 MB/sec
merge thread Lucene Merge Thread #3 estSize=752.8 MB (written=362.8 MB)
runTime=8100.1s (stopped=0.0s, paused=12.4s) rate=19.7 MB/sec
  now change from 19.7 MB/sec to 23.6 MB/sec
merge thread Lucene Merge Thread #2 estSize=312.5 MB (written=151.9 MB)
runTime=8100.7s (stopped=0.0s, paused=8.7s) rate=19.7 MB/sec
  now change from 19.7 MB/sec to 23.6 MB/sec
merge thread Lucene Merge Thread #6 estSize=87.7 MB (written=63.0 MB)
runTime=3627.8s (stopped=0.0s, paused=0.9s) rate=19.7 MB/sec
  now change from 19.7 MB/sec to 23.6 MB/sec
merge thread Lucene Merge Thread #1 estSize=57.3 MB (written=21.7 MB)
runTime=8101.2s (stopped=0.0s, paused=0.2s) rate=19.7 MB/sec
  now change from 19.7 MB/sec to 23.6 MB/sec
merge thread Lucene Merge Thread #0 estSize=4.6 MB (written=0.0 MB)
runTime=8101.0s (stopped=0.0s, paused=0.0s) rate=unlimited
  leave running at Infinity MB/sec

-

I also increased my maxMergeAtOnce and segmentsPerTier from 10 to 20 and
then to 30, in the hope of having fewer merges running at once, but

Re: CommitScheduler Thread blocked due to excessive number of Merging Threads

2017-09-07 Thread Shawn Heisey
On 9/7/2017 4:25 AM, yasoobhaider wrote:
> So I did a little more digging around why the merging is taking so
> long, and it looks like merging postings is the culprit. On the 5.4
> version, merging 500 docs is taking approximately 100 msec, while on
> the 6.6 version, it is taking more than 3000 msec. The difference
> seems to get worse when more docs are being merged. Any ideas why this
> may be the case? 

The rest of this thread is completely lost here; I only found the info
by going to Nabble, which is a mirror of the mailing list in forum
format.  The mailing list is the canonical repository.

Setting the ramBufferSizeMB to nearly 5 gigabytes is only going to be
helpful if the docs you are indexing into Solr are enormous -- many
megabytes of text data in each one.  Testing by Solr developers has
shown that values above about 128MB do not typically provide any
performance advantage with normal sized documents.  The commit
characteristics will have more to do with how large each segment is than
the ramBufferSizeMB.  The default ramBufferSizeMB value in modern Solr
versions is 100.

Assuming we are dealing with relatively small documents, I would
recommend these settings in indexConfig (removing ramBufferSizeMB,
mergePolicyFactory, and maxBufferedDocs entirely):


      6
      false



      60



      6
      1


If your data is on standard disks, then you want maxThreadCount at one. 
If it's on SSD, then you can raise it a little bit, but I wouldn't go
beyond about 2 or 3.  On standard disks with many threads writing merged
segments, the disk will begin thrashing excessively and I/O will slow to
a crawl.

If the documents are huge, then you can raise ramBufferSizeMB, but five
gigabytes is REALLY BIG and will require a very large heap.

If there is good reason to increase the values in mergePolicy, then this
is what I would recommend:


      30
      30
      90


The settings I've described here may help, or it may do nothing.  If it
doesn't help, then the problems may be memory-related, which is a whole
separate discussion.

When Lucene says "too many merge threads, stalling" it means there are
many merges scheduled at the same time, which usually means that there
are multiple *levels* of merging scheduled -- one that combines a bunch
of initial level segments into one second level segment, one that
combines multiple second level segments into third-level segments, and
so on.  The "stalling" means that the *indexing* thread is paused until
the number of merges drops below maxMergeCount.  If this is happening
with maxMergeCount at eight, it is likely because of the current
autoCommit maxDocs setting of 1 -- each of the initial segments is
very small, so there are a LOT of segments that need merging.  The
autoCommit and autoSoftCommit settings that I provided will hopefully
make that less of a problem.
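
One client-side habit that compounds this (a general note, not something from
this thread): committing explicitly after every small batch. A SolrJ sketch of
a batch add that leaves commits entirely to the autoCommit/autoSoftCommit
settings described above (the ZooKeeper host and collection name are
placeholders):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
  public static void main(String[] args) throws Exception {
    // Placeholder ZooKeeper connect string and collection name.
    try (CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("zk1:2181").build()) {
      client.setDefaultCollection("collection1");

      List<SolrInputDocument> batch = new ArrayList<>();
      for (int i = 0; i < 500; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-" + i);
        batch.add(doc);
      }

      // Send the batch but do not call client.commit() here; the server-side
      // autoCommit/autoSoftCommit settings decide when segments are flushed
      // and searchers are reopened.
      client.add(batch);
    }
  }
}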

Merging segments goes slower than the speed of your disks.  This is
because Lucene must collect a lot of information from each source
segment and combine it in memory to write a new segment.  The gathering
and combining is much slower than modern disk speeds.

Thanks,
Shawn



Re: deep paging in parallel sql

2017-09-07 Thread Susmit Shukla
You could use a filter clause to create a custom cursor, since the results
are sorted. I have used this approach with a raw CloudSolrStream, though not
with Parallel SQL. This may be useful:
https://lucidworks.com/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
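
A rough SolrJ sketch of that filter-based cursor against the plain /select
handler (ZooKeeper host, collection, unique key and page size are
placeholders; Parallel SQL itself does not expose this):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class FilterCursor {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("zk1:2181").build()) {        // placeholder ZK host
      client.setDefaultCollection("collection1");  // placeholder collection
      String lastId = null;                        // last unique key seen on the previous page
      while (true) {
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(1000);
        q.setSort(SolrQuery.SortClause.asc("id"));
        if (lastId != null) {
          // The "cursor": only ask for ids strictly greater than the last one processed.
          q.addFilterQuery("id:{" + lastId + " TO *]");
        }
        QueryResponse rsp = client.query(q);
        if (rsp.getResults().isEmpty()) {
          break;                                   // no more pages
        }
        for (SolrDocument doc : rsp.getResults()) {
          lastId = (String) doc.getFieldValue("id"); // process doc, remember last id
        }
      }
    }
  }
}

The cursorMark support described in the link (a sort on the unique key plus
cursorMark=*) does the same job without hand-building the filter, if you can
query /select directly.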

Thanks,
Susmit

On Wed, Sep 6, 2017 at 10:45 PM, Imran Rajjad  wrote:

> My only concern is the performance as the cursor moves forward in
> resultset with approximately 2 billion records
>
> Regards,
> Imran
>
> Sent from Mail for Windows 10
>
> From: Joel Bernstein
> Sent: Wednesday, September 6, 2017 7:04 PM
> To: solr-user@lucene.apache.org
> Subject: Re: deep paging in parallel sql
>
> Parallel SQL supports unlimited SELECT statements which return the entire
> result set. The documentation discusses the differences between the limited
> and unlimited SELECT statements. Other than the LIMIT clause there is not
> yet support for paging.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, Sep 6, 2017 at 9:11 AM, Imran Rajjad  wrote:
>
> > Dear list,
> >
> > Is it possible to enable deep paging when querying data through Parallel
> > SQL?
> >
> > Regards,
> > Imran
> >
> > Sent from Mail for Windows 10
> >
> >
>
>


Re: Consecutive calls to a query give different results

2017-09-07 Thread Yonik Seeley
On Thu, Sep 7, 2017 at 12:47 AM, Erick Erickson  wrote:
> bq: and deleted documents are irrelevant to term statistics...
>
> Did you mean "relevant"? Or do I have to adjust my thinking _again_?

One can make it work either way ;-)
Whether a document is marked as deleted or not has no effect on term
statistics (i.e. irrelevant)
OR documents marked for deletion still count in term statistics (i.e. relevant)

I guess I used the former because we don't go out of our way to still
include deleted documents... it's just a side effect of the index
structure that we don't (and can't easily) update statistics when a
document is marked as deleted.

-Yonik


> Erick
>
> On Wed, Sep 6, 2017 at 7:48 PM, Yonik Seeley  wrote:
>> Different replicas of the same shard can have different numbers of
>> deleted documents (really just marked as deleted), and deleted
>> documents are irrelevant to term statistics (like the number of
>> documents a term appears in).  Documents marked for deletion stop
>> contributing to corpus statistics when they are actually removed (via
>> expunge deletes, merges, optimizes).
>> -Yonik
>>
>>
>> On Wed, Sep 6, 2017 at 5:51 PM, Webster Homer  wrote:
>>> I am using Solr 6.2.0 configured as a solr cloud with 2 shards and 4
>>> replicas (total of 4 nodes).
>>>
>>> If I run the query multiple times I see the three different top scoring
>>> results.
>>> No data load is running, all data has been commited
>>>
>>> I get these three different hits with their scores:
>>> copperiinitratehemipentahydrate2325919004194430.61722
>>> copperiinitrateoncelite1234598765   432.44238
>>> copperiinitratehydrate18756anhydrousbasis13778319 428.24185
>>>
>>> How is it that the same search against the same data can give different
>>> responses?
>>> I looked at the specific cores they look OK the numdocs for the replicas in
>>> a shard match
>>>
>>> This is the query:
>>> http://ae1c-ecomdev-msc01.sial.com:8983/solr/sial-catalog-product/select?defType=edismax=searchmv_en_keywords,%20searchmv_keywords,searchmv_pno,%20searchmv_en_s_pri_name,%20search_en_p_pri_name,%20search_pno%20[explain%20style=nl]=id_s=30=true=sort_ds%20asc=on=2%3C-25%25=OR=copper%20nitrate=search_pid
>>> ^500%20search_concat_pno^400%20searchmv_concat_sku^400%20searchmv_pno^300%20search_concat_pno_genr^100%20searchmv_pno_genr%20searchmv_p_skus_genr%20searchmv_user_term^200%20search_lform^190%20searchmv_en_acronym^180%20search_en_root_name^170%20searchmv_en_s_pri_name^160%20search_en_p_pri_name^150%20searchmv_en_synonyms^145%20searchmv_en_keywords^140%20search_en_sortkey^120%20searchmv_p_skus^100%20searchmv_chem_comp^90%20searchmv_en_name_suf%20searchmv_cas_number^80%20searchmv_component_cas^70%20search_beilstein^50%20search_color_idx^40%20search_ecnumber^30%20search_egecnumber^30%20search_femanumber^20%20searchmv_isbn^10%20search_mdl_number%20searchmv_en_page_title%20searchmv_en_descriptions%20searchmv_en_attributes%20searchmv_rtecs%20searchmv_lookahead_terms%20searchmv_xref_comparable_pno%20searchmv_xref_comparable_sku%20searchmv_xref_equivalent_pno%20searchmv_xref_exact_pno%20searchmv_xref_exact_sku%20searchmv_component_molform=30=score%20desc,sort_en_name%20asc,sort_ds%20asc,search_pid%20asc=json
>>>


RE: [EXTERNAL] - Re: NumberFormatException for multvalue, pint

2017-09-07 Thread Steve Pruitt
Sigh.  You are right and thank you for pointing out the obvious, much to my 
chagrin.  :>)

Again, thanks.

-S

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Wednesday, September 06, 2017 4:42 PM
To: solr-user
Subject: [EXTERNAL] - Re: NumberFormatException for multvalue, pint

You're making a common mistake as to the meaning of multiValued. The input doc 
should look something like (xml format)

<doc>
  <field name="mv_int_field">1</field>
  <field name="mv_int_field">2</field>
</doc>


Each "mv_int_field" is a separate, complete single integer. But there can be a 
many of them.

when you specify
   <field name="mv_int_field">1,2,3</field>

you're telling Solr that the _single_ value of the field is "1,2,3"
which, of course, doesn't parse as an integer.
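
In SolrJ the same distinction looks like this (just a sketch, reusing the
mv_int_field name from above):

import org.apache.solr.common.SolrInputDocument;

public class MultiValuedExample {
  public static SolrInputDocument build() {
    SolrInputDocument doc = new SolrInputDocument();
    // Right: one addField call per value -- the field ends up with two integers.
    doc.addField("mv_int_field", 1);
    doc.addField("mv_int_field", 2);
    // Wrong: a single comma-joined string is one value and won't parse as an integer.
    // doc.addField("mv_int_field", "1,2,3");
    return doc;
  }
}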

Best,
Erick



On Wed, Sep 6, 2017 at 1:09 PM, Steve Pruitt  wrote:
> Can't get a multi-valued pint field to update.
>
> The schema defines the field: <field name="dnis" type="pints" multiValued="true" required="false" docValues="true" stored="true"/>
>
> I get the exception on this input: <field name="dnis">7780386,7313483</field>
>
> Caused by: java.lang.NumberFormatException: For input string: "7780386, 
> 7313483"
> at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> at java.lang.Integer.parseInt(Integer.java:580)
> at java.lang.Integer.parseInt(Integer.java:615)
> at 
> org.apache.solr.schema.IntPointField.createField(IntPointField.java:181)
> at org.apache.solr.schema.PointField.createFields(PointField.java:216)
> at 
> org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:72)
> at 
> org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java
> :179)
>
> Not sure why the parser thinks the values are strings.  I don't see any 
> non-numeric extraneous characters.
>
> Do I need docValues and multivalued in my field definition, since they are 
> defined on the pints field type?
>
> Thanks.
>
> -Steve


RE: ERR_SSL_VERSION_OR_CIPHER_MISMATCH

2017-09-07 Thread Younge, Kent A - Norman, OK - Contractor
Still receiving the same issue.  I have cloned another machine and it has the
same issue.  Not sure what to do next.  As a last resort I will build a machine
from scratch and see if it has the same issue; if it still does, then I have no
clue what is going on.








-Original Message-
From: Younge, Kent A - Norman, OK - Contractor 
[mailto:kent.a.you...@usps.gov.INVALID] 
Sent: Tuesday, September 05, 2017 6:54 AM
To: solr-user@lucene.apache.org
Subject: RE: ERR_SSL_VERSION_OR_CIPHER_MISMATCH

The new box is a clone of the other boxes, so nothing should have changed other
than the certificates and the keystore.  That is why I am at such a loss on
this issue.  Java is the same across all five servers, and all settings are the
same across all five servers.  I will look into the JVM security settings and
see if they are the same across all the boxes.






-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: Friday, September 01, 2017 5:46 PM
To: solr-user@lucene.apache.org
Subject: Re: ERR_SSL_VERSION_OR_CIPHER_MISMATCH


all of the low level SSL code used by Solr comes from the JVM.

double check which version of java you are using and make sure it's consistent 
on all of your servers -- if you disable SSL on the affected server you can use 
the Solr Admin UI to be 100% certain of exactly which version of java is being 
used...

https://lucene.apache.org/solr/guide/6_6/overview-of-the-solr-admin-ui.html

If the JVM Runtime *versions* are identical, the next thing to check would be
the JVM security settings which control which ciphers are used.
For Oracle JVMs this file is named "java.security" -- compare that file between 
your functional/non-functional servers.

There are lots of docs out there on SSL protocol and cipher configuration in 
java's java.security file, here's a quick one that links deep into the details 
of enabling/disabling protocols...

http://docs.oracle.com/javase/8/docs/technotes/guides/security/SunProviders.html#SunJSSE_Protocols

...but the bottomline is: you probably want to fix your broken server to match 
your working servers, and unless the JVM versions are different, that means 
someone/thing must have modified the JVM security settings on one of your 
servers -- find out who & why.
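
One quick way to compare what each JVM will actually offer (a standalone
sketch, not something from this thread) is to print the default SSL
parameters with the same java binary that starts Solr, then diff the output
between the working and failing servers:

import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLParameters;

public class ListTlsSupport {
  public static void main(String[] args) throws Exception {
    // Uses the JVM's default SSLContext, so java.security edits
    // (e.g. jdk.tls.disabledAlgorithms) show up here.
    SSLParameters params = SSLContext.getDefault().getDefaultSSLParameters();
    System.out.println("Protocols:");
    for (String p : params.getProtocols()) {
      System.out.println("  " + p);
    }
    System.out.println("Cipher suites:");
    for (String c : params.getCipherSuites()) {
      System.out.println("  " + c);
    }
  }
}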


-Hoss
http://www.lucidworks.com/


Re: Solr Issue

2017-09-07 Thread Michael Kuhlmann
Hi Patrick,

can you attach the query you're sending to Solr and one example result?
Or more specifically, what are your hl.* parameters?

-Michael
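
(For reference, the kind of parameters being asked about, expressed as a
SolrJ sketch; the "content" field name is a placeholder:)

import org.apache.solr.client.solrj.SolrQuery;

public class HighlightQuery {
  public static SolrQuery build(String userQuery) {
    SolrQuery q = new SolrQuery(userQuery);
    q.setHighlight(true);              // hl=true
    q.addHighlightField("content");    // hl.fl=content (placeholder field name)
    q.setHighlightSnippets(3);         // hl.snippets=3
    q.setHighlightFragsize(150);       // hl.fragsize=150
    return q;
  }
}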

On 07.09.2017 at 09:36, Patrick Fallert wrote:
>
> Hey guys,
> I've got a problem with my Solr highlighter.
> When I search for a word, I get some results. For every result I want
> to display the highlighted text, and here is my problem: some of the
> returned documents have a highlighted text and the others don't. I
> don't know why, but I need to fix this problem. Below is the
> configuration of my managed-schema. The configuration of the
> highlighter in solrconfig.xml is the default.
> I hope someone can help me. If you need more details, just ask.
>
> managed-schema: (field and fieldType definitions omitted; the XML markup
> was not preserved in the archived message)

Re: CommitScheduler Thread blocked due to excessive number of Merging Threads

2017-09-07 Thread yasoobhaider
So I did a little more digging around why the merging is taking so long, and
it looks like merging postings is the culprit.

On the 5.4 version, merging 500 docs is taking approximately 100 msec, while
on the 6.6 version, it is taking more than 3000 msec. The difference seems
to get worse when more docs are being merged.

Any ideas why this may be the case?

Yasoob



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Solr Issue

2017-09-07 Thread Patrick Fallert
Hey Guys,
I've got a problem with my Solr highlighter.
When I search for a word, I get some results. For every result I want to
display the highlighted text, and here is my problem: some of the returned
documents have a highlighted text and the others don't. I don't know why,
but I need to fix this problem. Below is the configuration of my
managed-schema. The configuration of the highlighter in solrconfig.xml is
the default.
I hope someone can help me. If you need more details, just ask.

managed-schema: (field and fieldType definitions omitted; the XML markup was
not preserved in the archived message)

Kind regards,

Patrick Fallert


Rainer-Haungs-Straße 7
D-77933 Lahr

Tel.: +49 7821 9509-0
Fax: +49 7821 9509-99

i...@schrempp-edv.de
www.schrempp-edv.de


