Re: Solr Ref Guide Changes - now HTML only

2019-10-28 Thread Alexandre Rafalovitch
I've done some experiments with indexing the RefGuide (from source) into
Solr at: https://github.com/arafalov/solr-refguide-indexing . But the
problem was creating the UI, hosting, etc.

There was also a thought (mine) of either shipping the RefGuide in Solr
with a pre-built index as an example, or even just shipping an index with
links to the live version. Both of these were complicated because the PDF
was throwing the publication schedule off, and also because we are
trying to make the Solr distribution smaller, not bigger. A bit of a
catch-22 there. But maybe now it could be revisited.

Regards,
   Alex.
P.S. A personal offline copy of the Solr RefGuide can certainly be built
from source, and it will become even easier to do that soon. But yes,
perhaps a compressed download of the HTML version would be a nice
replacement for the PDF.
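As a rough sketch, a locally built HTML copy could also be indexed into a
local Solr for searching; something along these lines (collection name and
path are just examples):

  bin/solr create -c refguide
  bin/post -c refguide /path/to/built-ref-guide-html/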

On Tue, 29 Oct 2019 at 09:04, Shawn Heisey  wrote:
>
> On 10/28/2019 3:51 PM, Nicolas Paris wrote:
> > I am not very happy with the search engine embedded within the html
> > documentation I admit. Hope this is not solr under the hood :S
>
> It's not Solr under the hood.  It is done by a javascript library that
> runs in the browser.  It only searches page titles, not the whole document.
>
> The fact that a search engine has terrible search in its documentation
> is not lost on us.  We talked about what it would take to use Solr ...
> the infrastructure that would have to be set up and maintained is
> prohibitive.
>
> We are looking into improving things in this area.  It's going a lot
> slower than we'd like.
>
> Thanks,
> Shawn


Re: Solr Ref Guide Changes - now HTML only

2019-10-28 Thread Shawn Heisey

On 10/28/2019 3:51 PM, Nicolas Paris wrote:

I am not very happy with the search engine embedded within the html
documentation I admit. Hope this is not solr under the hood :S


It's not Solr under the hood.  It is done by a javascript library that 
runs in the browser.  It only searches page titles, not the whole document.


The fact that a search engine has terrible search in its documentation 
is not lost on us.  We talked about what it would take to use Solr ... 
the infrastructure that would have to be set up and maintained is
prohibitive.


We are looking into improving things in this area.  It's going a lot 
slower than we'd like.


Thanks,
Shawn


Re: Solr Ref Guide Changes - now HTML only

2019-10-28 Thread Nicolas Paris
> If you are someone who wishes the PDF would continue, please share your
> feedback.

I have not particularly explored the documentation format, only the
content. However, here are my thoughts on this:

The PDF version of the solr documentation has two advantages:
1. readable offline
2. makes searching easier than the html version


If there were a "one page" version of the html documentation, this
would make it easier to search across the whole thing. Also, a monolithic
html page makes things easier to access offline (transform back to pdf, ebook..?).
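For offline use, one stopgap could be to mirror the html version locally,
something along the lines of:

  wget --mirror --convert-links --page-requisites --no-parent \
    https://lucene.apache.org/solr/guide/8_2/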

I am not very happy with the search engine embedded within the html
documentation I admit. Hope this is not solr under the hood :S

-- 
nicolas


Solr Ref Guide Changes - now HTML only

2019-10-28 Thread Cassandra Targett
Hi all -

Some have already noticed this change, but to state it formally, as of 8.2,
the Lucene PMC will no longer treat the PDF version of the Solr
Reference Guide as the primary format, and we will no longer release a PDF
version. The Guide will now be available online only.

Some of you may prefer the PDF and will be disappointed by this change. To
explain, there are several reasons why we're doing this:

1. We believe that most in our community rely on the HTML version (at
https://lucene.apache.org/solr/guide), but since our release focus has been
the PDF version, we are not spending time making sure the HTML works as
well as it should and could.
2. The PDF has grown far too large. The 8.1 version is 1,483 pages and
16 MB. Attempting to cut it back would be complex and, considering it is a
less effective medium, possibly not worth the effort.
3. The release process held us back from getting the Guide out at the same
time as the artifact release (which is what has happened so far with 8.x
versions of the Guide).
4. Focusing on supporting the PDF first holds us back from several things
we would like to do in the HTML for better content presentation (including
easy-to-maintain architecture diagrams, proper formatting of math formulas,
and more complete language examples, among other things).

So, starting with 8.2 we are making a few changes:

1. The 8.2 version of the Ref Guide has been published in HTML form only (
https://lucene.apache.org/solr/guide/8_2/), and a PDF will not be available.
2. When 8.3 is released (soon), the HTML version will be available online
at the same time, and will be announced together.
3. For those who follow the development list, starting with 8.3 and going
forward a DRAFT version of the Guide will be available online as soon as a
Lucene & Solr release candidate is prepared and a VOTE thread has started.

If you are someone who wishes the PDF would continue, please share your
feedback. While the PDF is not sustainable in its current form - there are
pending changes that will break our current tooling entirely - we could see
if it's possible to find alternate ways to satisfy the same use cases.

Thanks to all of you for your continued support of Lucene and Solr, and we
look forward to making substantial improvements to the Guide in the months
to come.

Regards,
Cassandra


Re: CDCR cpu usage 100% with some errors

2019-10-28 Thread Louis
I just saw this issue:
https://issues.apache.org/jira/browse/SOLR-13349

Can my issue be related to this?



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


CDCR cpu usage 100% with some errors

2019-10-28 Thread Louis
* Solr Version 7.7. Using Cloud with CDCR
* 3 replicas 1 shard on production and disaster recovery

Hi,

Last week, I posted a question about tlogs -
https://lucene.472066.n3.nabble.com/tlogs-are-not-deleted-td4451323.html#a4451430

I disabled the buffer based on the advice, but the tlogs in "production" are
still not being deleted. (tlogs in "disaster recovery" nodes are cleaned.)
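For reference, the buffer is disabled through the CDCR API on the source
collection, roughly like this (the host name is a placeholder):

  curl "http://prod-solr-host:8983/solr/test_collection/cdcr?action=DISABLEBUFFER"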

And there is another issue, which I suspect is related to the problem
that I previously posted.

I am getting tons of logs from our "disaster recovery" nodes. The log files
are building up at an incredibly fast rate with the messages below repeating
forever, and CPU usage is always at 100% ("production" nodes' CPU usage is
normal).

It looks like it is replicating from the production server to disaster
recovery, but it actually never ends.

Is this high CPU usage on the disaster recovery nodes normal?
And are the tlogs that are not being cleaned properly on the production
nodes related to the high CPU usage on the DR nodes?
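In case it helps with diagnosing this, the CDCR API also has monitoring
actions that report tlog queue sizes and forwarding errors on the source,
roughly:

  curl "http://prod-solr-host:8983/solr/test_collection/cdcr?action=QUEUES"
  curl "http://prod-solr-host:8983/solr/test_collection/cdcr?action=ERRORS"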


*These are sample messages from the tons of logs on the disaster recovery
nodes*

2019-10-28 18:25:09.817 INFO  (qtp404214852-90778) [c:test_collection
s:shard1 r:core_node3 x:test_collection_shard1_replica_n1] o.a.s.c.S.Request
[test_collection1_shard1_replica_n1]  webapp=/solr path=/cdcr
params={action=LASTPROCESSEDVERSION&wt=javabin&version=2} status=0 QTime=0
2019-10-28 18:25:09.817 INFO  (qtp404214852-90778) [c:test_collection
s:shard1 r:core_node3 x:test_collection_shard1_replica_n1] o.a.s.c.S.Request
[test_collection2_shard1_replica_n1]  webapp=/solr path=/cdcr
params={action=LASTPROCESSEDVERSION&wt=javabin&version=2} status=0 QTime=0
2019-10-28 18:25:09.817 INFO  (qtp404214852-90778) [c:test_collection
s:shard1 r:core_node3 x:test_collection_shard1_replica_n1] o.a.s.c.S.Request
[test_collection3_shard1_replica_n1]  webapp=/solr path=/cdcr
params={action=LASTPROCESSEDVERSION&wt=javabin&version=2} status=0 QTime=0
2019-10-28 18:18:11.729 INFO  (cdcr-replicator-378-thread-1) [   ]
o.a.s.h.CdcrReplicator Forwarded 0 updates to target test_collection1
2019-10-28 18:18:11.730 INFO  (cdcr-replicator-282-thread-1) [   ]
o.a.s.h.CdcrReplicator Forwarded 0 updates to target test_collection2
2019-10-28 18:18:11.730 INFO  (cdcr-replicator-332-thread-1) [   ]
o.a.s.h.CdcrReplicator Forwarded 0 updates to target test_collection3
...


*And in the middle of logs, I see the following exception for some of the
collections.*


2019-10-28 18:18:11.732 WARN  (cdcr-replicator-404-thread-1) [   ]
o.a.s.h.CdcrReplicator Failed to forward update request to target:
collection_steps
java.lang.ClassCastException: java.lang.Long cannot be cast to
java.util.List
at
org.apache.solr.update.CdcrUpdateLog$CdcrLogReader.getVersion(CdcrUpdateLog.java:732)
~[solr-core-7.7.0.jar:7.7.0 8c831daf4eb41153c25ddb152501ab5bae3ea3d5 -
jimczi - 2019-02-04 23:23:46]
at
org.apache.solr.update.CdcrUpdateLog$CdcrLogReader.next(CdcrUpdateLog.java:635)
~[solr-core-7.7.0.jar:7.7.0 8c831daf4eb41153c25ddb152501ab5bae3ea3d5 -
jimczi - 2019-02-04 23:23:46]
at
org.apache.solr.handler.CdcrReplicator.run(CdcrReplicator.java:77)
~[solr-core-7.7.0.jar:7.7.0 8c831daf4eb41153c25ddb152501ab5bae3ea3d5 -
jimczi - 2019-02-04 23:23:46]
at
org.apache.solr.handler.CdcrReplicatorScheduler.lambda$null$0(CdcrReplicatorScheduler.java:81)
~[solr-core-7.7.0.jar:7.7.0 8c831daf4eb41153c25ddb152501ab5bae3ea3d5 -
jimczi - 2019-02-04 23:23:46]
at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
~[solr-solrj-7.7.0.jar:7.7.0 8c831daf4eb41153c25ddb152501ab5bae3ea3d5 -
jimczi - 2019-02-04 23:23:50]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[?:1.8.0_181]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[?:1.8.0_181]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Parts of the Json response to a curl query are arrays, and parts are hashes

2019-10-28 Thread Shawn Heisey

On 10/28/2019 9:30 AM, rhys J wrote:

Will I break Solr if I change this to default to not multi-valued?


If you are only indexing one value in those fields, then setting 
multiValued to false will not break anything.


If an indexing request ever comes in that has more than one value for a 
field that does not have multiValued set to true, that document (and any 
others following it in the batch) will fail to index.  It is likely that 
the reason the default is set to true is to avoid complaints from users 
who DO send more than one value.


In situations where copyField is used to copy more than one field to a 
target, the target must be multiValued, or that indexing will also fail.
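If you do decide to change a field, the Schema API is the safest way to do
it; something like this (collection and field names below are just
examples), followed by a reload and a full reindex:

  curl -X POST -H 'Content-type:application/json' \
    'http://localhost:8983/solr/mycollection/schema' -d '{
    "replace-field": {
      "name": "account_code",
      "type": "text_general",
      "multiValued": false
    }
  }'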


Thanks,
Shawn


Re: Parts of the Json response to a curl query are arrays, and parts are hashes

2019-10-28 Thread rhys J
I forgot to include the fields created through the API:

  [field definitions were stripped by the mailing list archive]

Thanks,

Rhys

On Mon, Oct 28, 2019 at 11:30 AM rhys J  wrote:

>
>
>> Did you reload the core/collection or restart Solr so the new schema
>> would take effect? If it's SolrCloud, did you upload the changes to
>> zookeeper and then reload the collection?  SolrCloud does not use config
>> files on disk.
>>
>
> So I have not done this part yet, but I noticed some things in the
> managed-schema.
>
> The first was this (I did verify that the version of the schema is
> up-to-date; I am doing an out-of-the-box install of the latest Solr release).
>
> I checked all the fields that I created (I will paste them below), and
> they are NOT multi-valued. However, text_general is set to multi-valued as
> a default?
>
> <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
>     <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> Here are some of the fields I created through the API. When I created
> them, I did NOT check the multi-valued box at all. However, when I then go
> to look at the field through the API, it is marked Multi-valued. I am
> assuming this is because of the fieldType definition above? Why is this set
> to default to Multi-valued?
>
> Will I break Solr if I change this to default to not multi-valued?
>
> Thanks,
>
> Rhys
>


Re: Parts of the Json response to a curl query are arrays, and parts are hashes

2019-10-28 Thread rhys J
> Did you reload the core/collection or restart Solr so the new schema
> would take effect? If it's SolrCloud, did you upload the changes to
> zookeeper and then reload the collection?  SolrCloud does not use config
> files on disk.
>

So I have not done this part yet, but I noticed some things in the
managed-schema.

The first was this (I did verify that the version of the schema is
up-to-date; I am doing an out-of-the-box install of the latest Solr release).

I checked all the fields that I created (I will paste them below), and they
are NOT multi-valued. However, text_general is set to multi-valued as a
default?

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Here are some of the fields I created through the API. When I created them,
I did NOT check the multi-valued box at all. However, when I then go to
look at the field through the API, it is marked Multi-valued. I am assuming
this is because of the fieldType definition above? Why is this set to
default to Multi-valued?
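A field's effective properties (including the ones inherited from the
fieldType) can be checked with the Schema API; for example, with placeholder
collection and field names:

  curl "http://localhost:8983/solr/mycollection/schema/fields/account_code?showDefaults=true"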

Will I break Solr if I change this to default to not multi-valued?

Thanks,

Rhys


Re: merge policy & autocommit

2019-10-28 Thread Shawn Heisey

On 10/28/2019 7:23 AM, Danilo Tomasoni wrote:

We have a solr instance with around 40 million docs.

In the bulk import phase we noticed high IO and CPU load, and it looks
like it's related to autocommit, because if I disable autocommit the load
on the system is very low.


I know that disabling autocommit is not recommended, but I'm wondering 
if there is a minimum hardware requirement to make this suggestion 
effective.


What are your settings for autoCommit and autoSoftCommit?  If the 
settings are referring to system properties, have you defined those 
system properties?  Would you be able to restart Solr and then share a 
solr.log file that goes back to that start?


The settings that Solr has shipped with for quite a while are to enable 
autoCommit with a 15 second maxTime, no maxDoc, and openSearcher set to 
false.  The autoSoftCommit setting is not enabled by default.
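For reference, those defaults in the stock solrconfig.xml look roughly like
this:

  <autoCommit>
    <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>

  <autoSoftCommit>
    <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
  </autoSoftCommit>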


These settings work well, though I personally think 15 seconds is 
perhaps too frequent, and like to set it to something like one minute 
instead.


With openSearcher set to false, autoCommit will not affect document 
visibility.  If automatically making index changes visible is desired, 
it is better to configure autoSoftCommit in addition to autoCommit ... 
and super short intervals are not recommended.


Our system is not very powerful in terms of IO read/write speed (around
100 Mbyte/s); is it possible that this relatively low IO performance
combined with


100MB/sec is not what I would call low I/O.  It's the minimum that you 
can expect from modern commodity SATA hard drives, and some of those can 
go even faster.  It's also roughly equivalent to the maximum real-world 
achievable throughput of a gigabit network connection with TCP-based 
protocols.


autocommit will slow down our solr instance incredibly, to the point of
making it not responsive?


If it's configured correctly, autoCommit should have very little effect 
on performance.  Hard commits that do not open a new searcher should 
happen VERY quickly.  It seems very strange to me that disabling a 
correctly configured autoCommit would substantially affect indexing speeds.


Could the same also be true for the merge policy? How can the IO speed
affect the merge policy parameters?


I kept the default merge policy configuration but it looks like it never 
merges segments. How can I know if a merge is happening?


If you have segments that are radically different sizes, then merging is 
happening.  With default settings, merges from the first level should 
produce segments roughly ten times the size of the ones created by 
indexing.  Second level merges will probably produce segments roughly 
100 times the size of the smallest ones.  Segment merging is a normal 
part of Lucene operation, it would be very unusual for it to not occur.
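One quick way to see the segment picture is the Segments Info screen in the
admin UI, or the segments endpoint (the core name here is an example):

  curl "http://localhost:8983/solr/mycore/admin/segments"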


Merging will affect I/O, but it is extremely rare for merging to happen 
super-quickly.  The fastest I have ever seen merging on a single Solr 
core proceed is about 30 megabytes per second, though usually that 
system achieved about 20 megabytes per second.  Merging involves 
considerable computational work, it's not just a straight data copy.


Thanks,
Shawn


merge policy & autocommit

2019-10-28 Thread Danilo Tomasoni

Hello all,

We have a solr instance with around 40 million docs.

In the bulk import phase we noticed high IO and CPU load, and it looks
like it's related to autocommit, because if I disable autocommit the load
on the system is very low.


I know that disabling autocommit is not recommended, but I'm wondering 
if there is a minimum hardware requirement to make this suggestion 
effective.


Our system is not very powerful in terms of IO read/write speed (around
100 Mbyte/s); is it possible that this relatively low IO performance
combined with


autocommit will slow down our solr instance incredibly, to the point of
making it not responsive?


Could the same also be true for the merge policy? How can the IO speed
affect the merge policy parameters?


I kept the default merge policy configuration but it looks like it never 
merges segments. How can I know if a merge is happening?



Thank you

Danilo

--
Danilo Tomasoni

Fondazione The Microsoft Research - University of Trento Centre for 
Computational and Systems Biology (COSBI)
Piazza Manifattura 1,  38068 Rovereto (TN), Italy
tomas...@cosbi.eu
http://www.cosbi.eu
 


Re: Leader node on specific host machines?

2019-10-28 Thread Erick Erickson
There’s the preferredLeader property, see: 
https://lucene.apache.org/solr/guide/6_6/collections-api.html
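The mechanics look roughly like this (collection, shard, and replica names
are placeholders): set the property on the replica you want, then ask Solr
to rebalance:

  curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICAPROP&collection=myColl&shard=shard1&replica=core_node3&property=preferredLeader&property.value=true"
  curl "http://localhost:8983/solr/admin/collections?action=REBALANCELEADERS&collection=myColl"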

That said, this was put in for situations where there were 100s of shards with 
replicas from many shards hosted on any given machine, so it was possible in 
that setup to have 100 or more leaders on a single node.

In the usual case, the leader role doesn’t do very much extra work, and the 
extra work is mostly distributing the incoming documents to the followers 
during indexing (mostly I/O). During query time, the leader has no extra duties 
at all. So if “heavy use” means heavy querying, it shouldn’t make any 
appreciable difference.

I would urge you to have evidence that this was worth the effort before 
spending time on it. And the “preferredLeader” property is just that: a
preference, all things being equal. It's still possible for a leader to be a
different replica; otherwise you'd defeat the whole point of trying for HA.

For TLOG and PULL setups, the leader will always be a TLOG replica, so you 
could strategically place them to get what you want. In this case, the leader 
indeed has a lot more work to do than the follower so it makes more sense.
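One way to do that placement explicitly is to add the replicas yourself with
the Collections API (collection, shard, and node names below are
placeholders):

  curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=myColl&shard=shard1&type=TLOG&node=hostA:8983_solr"
  curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=myColl&shard=shard1&type=PULL&node=hostB:8983_solr"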

Best,
Erick

> On Oct 28, 2019, at 6:13 AM, Koen De Groote  
> wrote:
> 
> Hello,
> 
> I'm looking for a way to configure my collections such that the leader
> nodes of specific collections never share the same host.
> 
> This is a way to prevent the leaders of several large and/or heavy-usage
> collections from ending up on the same machine.
> 
> Is this something I can set in solrconfig.xml? Or are there rules for this?
> 
> Kind regards,
> Koen De Groote



Leader node on specific host machines?

2019-10-28 Thread Koen De Groote
Hello,

I'm looking for a way to configure my collections such that the leader
nodes of specific collections never share the same host.

This is a way to prevent the leaders of several large and/or heavy-usage
collections from ending up on the same machine.

Is this something I can set in solrconfig.xml? Or are there rules for this?

Kind regards,
Koen De Groote