SolrCloud 6.6 stability challenges

2017-11-03 Thread Rick Dig
hello all,
we are trying to run SolrCloud 6.6 in a production setting.
here's our config and issue:
1) 3 nodes, 1 shard, replication factor 3
2) all nodes have 16 GB RAM and 4 cores
3) our production load is about 2,000 requests per minute
4) the index is fairly small: around 400 MB with 300k documents
5) autocommit is currently set to 5 minutes (even though ideally we would
like a smaller interval)
6) the JVM runs with 8 GB Xms and Xmx and the CMS collector
7) all of this runs perfectly OK when indexing isn't happening. as soon as
we start "NRT" indexing, one of the follower nodes goes down within 10 to 20
minutes. from that point on the nodes never recover unless we stop
indexing. the leader is usually the last one to fall.
8) there are maybe 5 to 7 processes indexing at the same time, with document
batch sizes of 500
9) ramBufferSizeMB is 100, maxWarmingSearchers is 5
10) no CPU and/or OOM issues that we can see
11) CPU load does go fairly high, 15 to 20 at times
any help or pointers appreciated

thanks
rick
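
A minimal SolrJ sketch of the kind of batched NRT indexing described in point 8
above; class, collection, ZooKeeper, and field names are placeholders, not the
poster's setup. Relying on commitWithin (or server-side autoSoftCommit) instead
of explicit commits from each of the 5 to 7 indexer processes is one common way
to keep searcher warming under control.

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.util.ArrayList;
import java.util.List;

public class BatchIndexerSketch {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder()
            .withZkHost("zk1:2181,zk2:2181,zk3:2181")   // placeholder ZK ensemble
            .build()) {
      client.setDefaultCollection("mycollection");       // placeholder collection
      List<SolrInputDocument> batch = new ArrayList<>();
      for (int i = 0; i < 500; i++) {                    // batch size from point 8
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-" + i);
        doc.addField("title_s", "document " + i);
        batch.add(doc);
      }
      // commitWithin (ms) lets Solr fold visibility into its own commit schedule
      // instead of each indexer process forcing a commit of its own.
      client.add(batch, 30_000);
    }
  }
}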


Re: Anyone have any comments on current solr monitoring favorites?

2017-11-03 Thread Webster Homer
My company uses Dynatrace for almost everything in production. They have a
plugin for Solr that works with 6.x.

On Thu, Nov 2, 2017 at 4:05 PM, Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

> Hi Robi,
> Did you try Sematext’s SPM? It provides host, JVM and Solr metrics and
> more. We use it for monitoring our Solr instances and for consulting.
>
> Disclaimer - see signature :)
>
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 2 Nov 2017, at 19:35, Walter Underwood  wrote:
> >
> > We use New Relic for JVM, CPU, and disk monitoring.
> >
> > I tried the built-in metrics support in 6.4, but it just didn’t do what
> we want. We want rates and percentiles for each request handler. That gives
> us 95th percentile for textbooks suggest or for homework search results
> page, etc. The Solr metrics didn’t do that. The Jetty metrics didn’t do
> that.
> >
> > We built a dedicated servlet filter that goes in front of the Solr
> webapp and reports metrics. It has some special hacks to handle some weird
> behavior in SolrJ. A request to the “/srp” handler is sent as
> “/select?qt=/srp”, so we normalize that.
> >
> > The metrics start with the cluster name, the hostname, and the
> collection. The rest is generated like this:
> >
> > URL: GET /solr/textbooks/select?q=foo&qt=/auto
> > Metric: textbooks.GET./auto
> >
> > URL: GET /solr/textbooks/select?q=foo
> > Metric: textbooks.GET./select
> >
> > URL: GET /solr/questions/auto
> > Metric: questions.GET./auto
> >
> > So a full metric for the cluster “solr-cloud” and the host “search01"
> would look like “solr-cloud.search01.solr.textbooks.GET./auto.m1_rate”.
> >
> > We send all that to InfluxDB. We’ve configured a template so that each
> part of the metric name is mapped to a field, so we can write efficient
> queries in InfluxQL.
> >
> > Metrics are graphed in Grafana. We have dashboards that mix Cloudwatch
> (for the load balancer) and InfluxDB.
> >
> > I’m still working out the kinks in some of the more complicated queries,
> but the data is all there. I also want to expand the servlet filter to
> report HTTP response codes.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >
> >> On Nov 2, 2017, at 9:30 AM, Petersen, Robert (Contr) <
> robert.peters...@ftr.com> wrote:
> >>
> >> OK I'm probably going to open a can of worms here...  lol
> >>
> >>
> >> In the old, old days I used PSI Probe to monitor Solr running on Tomcat,
> which worked OK on a machine-by-machine basis.
> >>
> >>
> >> Later I had a Grafana dashboard on top of Graphite monitoring, which was
> really nice looking but kind of complicated to set up.
> >>
> >>
> >> Even later I just dropped in a New Relic Java agent, which had Solr
> monitors and a dashboard right out of the box, but it costs money for the
> full tamale.
> >>
> >>
> >> For basic JVM health and Solr QPS and time percentiles, does anyone
> have any favorites or other alternative suggestions?
> >>
> >>
> >> Thanks in advance!
> >>
> >> Robi
> >>
> >
>
>
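
A rough sketch of the kind of timing servlet filter Walter describes above - not
his actual code. The path parsing, the qt=/srp normalization, and the report()
hook are assumptions; wiring the numbers through to a metrics registry and
InfluxDB is left out.

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import java.io.IOException;

public class RequestMetricsFilter implements Filter {
  @Override public void init(FilterConfig config) {}
  @Override public void destroy() {}

  @Override
  public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
      throws IOException, ServletException {
    HttpServletRequest http = (HttpServletRequest) req;
    long start = System.nanoTime();
    try {
      chain.doFilter(req, resp);
    } finally {
      long elapsedMs = (System.nanoTime() - start) / 1_000_000;
      // /solr/<collection>/<handler>  ->  "<collection>.<METHOD>.<handler>"
      String[] parts = http.getRequestURI().split("/"); // e.g. ["", "solr", "textbooks", "select"]
      String collection = parts.length > 2 ? parts[2] : "unknown";
      String handler = parts.length > 3 ? "/" + parts[3] : "/unknown";
      String qt = http.getParameter("qt");
      if ("/select".equals(handler) && qt != null) {
        handler = qt;                                   // normalize /select?qt=/srp to /srp
      }
      String metric = collection + "." + http.getMethod() + "." + handler;
      report(metric, elapsedMs);
    }
  }

  private void report(String metric, long elapsedMs) {
    // Placeholder: hand off to the metrics backend of choice (Dropwizard, InfluxDB, ...).
  }
}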



Fwd: configuring Solr with Tesseract

2017-11-03 Thread Admin eLawJournal
Hi,
I have read that we can use Tesseract with Solr to index image files. I
would like some guidance on setting this up.

Currently, I am using Solr to search my WordPress installation via the
WPSOLR plugin.

I have Solr 6.6 installed on Ubuntu 14.04, and it is working fine with
WordPress.

I have also installed Tesseract but have no clue how to configure it.


I am new to Solr, so I would greatly appreciate detailed step-by-step
instructions.

Thank you very much


Re: update document stuck on: java.net.SocketInputStream.socketRead0

2017-11-03 Thread Nawab Zada Asad Iqbal
Hi,

I added some very liberal connection and socket timeouts to the request
config, and I am seeing a lot of SocketTimeoutExceptions and some
ConnectTimeoutExceptions:

RequestConfig requestConfig = RequestConfig.custom()
    .setConnectionRequestTimeout(10 * 60 * 1000)
    .setConnectTimeout(60 * 1000)
    .setSocketTimeout(3 * 60 * 1000)
    .build();
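
For context, a sketch of attaching a RequestConfig like the one above to the
HttpClient behind SolrJ - assuming Apache HttpClient 4.x and SolrJ's
HttpSolrClient.Builder; the base URL is a placeholder, and this is illustrative
rather than the poster's actual setup.

import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class TimeoutClientSketch {
  static HttpSolrClient buildClient(RequestConfig requestConfig) {
    // Make these timeouts the default for every request this client sends.
    CloseableHttpClient httpClient = HttpClients.custom()
        .setDefaultRequestConfig(requestConfig)
        .build();
    return new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection")
        .withHttpClient(httpClient)
        .build();
  }
}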

I am totally lost on what needs to be fixed here, but it is blocking a lot
of connections for a very long time, and my expected throughput has dropped
to almost half of what it was with Solr 4 and Jetty.
Jetty 9 doesn't support the blocking bio.SocketConnector, and our Solr 4
config (probably Jetty 8) used SocketConnector instead of
nio.SelectChannelConnector and performed better. I am wondering if this
gives some clue to my problem. How should I configure Jetty 9 (which only
offers non-blocking I/O, since SocketConnector is not supported) to at least
recover that performance?

PS: I have also posted this question here:
https://stackoverflow.com/questions/47098816/solr-jetty-9-webserver-sending-a-ton-of-socket-timeouts


Thanks
Nawab


On Thu, Oct 26, 2017 at 7:03 PM, Nawab Zada Asad Iqbal 
wrote:

> Hi,
>
> After the Solr 7 upgrade, I am seeing that my '/update' requests are
> sometimes getting stuck on this:
>
>  - java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], int, int, int) @bci=0 (Compiled frame; information may be imprecise)
>  - java.net.SocketInputStream.read(byte[], int, int, int) @bci=87, line=152 (Compiled frame)
>  - java.net.SocketInputStream.read(byte[], int, int) @bci=11, line=122 (Compiled frame)
>  - org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer() @bci=71, line=166 (Compiled frame)
>  - org.apache.http.impl.io.SocketInputBuffer.fillBuffer() @bci=1, line=90 (Compiled frame)
>  - org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(org.apache.http.util.CharArrayBuffer) @bci=137, line=281 (Compiled frame)
>  - org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(org.apache.http.io.SessionInputBuffer) @bci=16, line=92 (Compiled frame)
>  - org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(org.apache.http.io.SessionInputBuffer) @bci=2, line=62 (Compiled frame)
>  - org.apache.http.impl.io.AbstractMessageParser.parse() @bci=38, line=254 (Compiled frame)
>  - org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader() @bci=8, line=289 (Compiled frame)
>  - org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader() @bci=1, line=252 (Compiled frame)
>  - org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader() @bci=6, line=191 (Compiled frame)
>  - org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(org.apache.http.HttpRequest, org.apache.http.HttpClientConnection, org.apache.http.protocol.HttpContext) @bci=62, line=300 (Compiled frame)
>  - org.apache.http.protocol.HttpRequestExecutor.execute(org.apache.http.HttpRequest, org.apache.http.HttpClientConnection, org.apache.http.protocol.HttpContext) @bci=60, line=127 (Compiled frame)
>  - org.apache.http.impl.client.DefaultRequestDirector.tryExecute(org.apache.http.impl.client.RoutedRequest, org.apache.http.protocol.HttpContext) @bci=198, line=715 (Compiled frame)
>  - org.apache.http.impl.client.DefaultRequestDirector.execute(org.apache.http.HttpHost, org.apache.http.HttpRequest, org.apache.http.protocol.HttpContext) @bci=574, line=520 (Compiled frame)
>  - org.apache.http.impl.client.AbstractHttpClient.execute(org.apache.http.HttpHost, org.apache.http.HttpRequest, org.apache.http.protocol.HttpContext) @bci=344, line=906 (Compiled frame)
>  - org.apache.http.impl.client.AbstractHttpClient.execute(org.apache.http.client.methods.HttpUriRequest, org.apache.http.protocol.HttpContext) @bci=21, line=805 (Compiled frame)
>  - org.apache.http.impl.client.AbstractHttpClient.execute(org.apache.http.client.methods.HttpUriRequest) @bci=6, line=784 (Compiled frame)
>
>
> It seems that I am hitting this issue:
> https://stackoverflow.com/questions/28785085/how-to-prevent-hangs-on-socketinputstream-socketread0-in-java
> Although I will fix the timeout settings in my client, I am curious: what has
> changed in Solr 7 (I am upgrading from Solr 4) that would cause this?
>
>
> Thanks
> Nawab
>


Re: [Parent] doc transformer

2017-11-03 Thread Mikhail Khludnev
It seems you can only use the slow [subquery] transformer for this.
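
A hedged sketch of what that could look like from SolrJ, assuming each child
document also stores its parent's id in a stored field (parent_id_s here) -
block-indexed children do not get such a field automatically, so it would have
to be added at index time. Field names and the URL are placeholders.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ParentViaSubquery {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient solr =
             new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
      // Children of parents matching title_t:solr, via the block join children parser.
      SolrQuery q = new SolrQuery("{!child of=\"type_s:parent\"}title_t:solr");
      q.setFields("id", "comment_t", "parent:[subquery]");
      // For each child row, fetch the document whose id equals that child's parent_id_s.
      q.set("parent.q", "{!terms f=id v=$row.parent_id_s}");
      q.set("parent.fl", "id,title_t");
      q.set("parent.rows", "1");
      QueryResponse rsp = solr.query(q);
      System.out.println(rsp.getResults());
    }
  }
}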

On Mon, Oct 30, 2017 at 7:17 PM, Aurélien MAZOYER <
aurelien.mazo...@francelabs.com> wrote:

> Hi,
>
>
>
> Is there in Solr a kind of [parent] doc transformer (like the [child] doc
> transformer) that can be used to embed parent’s fields in the response of a
> query that uses the block join children query parser?
>
>
>
> Thank you,
>
>
>
> Aurélien MAZOYER
>
>


-- 
Sincerely yours
Mikhail Khludnev


Re: how to ensure that one shard does not get overloaded when we use routing

2017-11-03 Thread Emir Arnautović
Hi Ketan,
I’ll just add that with 4 shards you might just as well skip the bits part - all of a
tenant's documents will end up on a single shard anyway.
Unless you have a lot of projectIds, all with pretty much the same number of
documents, and you always search a single projectId, I would reevaluate using
routing, since it can give you more trouble than benefit.

Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
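
For reference, a minimal SolrJ sketch of the composite-ID routing Ketan describes
below; ids, field names, and addresses are placeholders. With "/2", the top 2 bits
of hash(projectId) are used, so each project's documents cover a quarter of the
hash range - with exactly 4 shards that is a single shard, which is Emir's point above.

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CompositeRoutingSketch {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder()
            .withZkHost("zk1:2181,zk2:2181,zk3:2181")   // placeholder ZK ensemble
            .build()) {
      client.setDefaultCollection("mycollection");       // placeholder collection
      SolrInputDocument doc = new SolrInputDocument();
      // Composite routing key: top 2 bits from hash("proj-42"), the rest from hash("doc-1001").
      doc.addField("id", "proj-42/2!doc-1001");
      doc.addField("projectId_s", "proj-42");
      client.add(doc);
      // Queries for a single project can then be restricted with _route_=proj-42/2!
    }
  }
}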



> On 2 Nov 2017, at 16:15, Erick Erickson  wrote:
> 
> Well, you have to monitor. That's the downside to using this type of
> routing: you're effectively saying "I know enough about my usage to
> predict".
> 
> What do you think you're gaining by using this? Putting all docs from
> a single org on a subset of your servers reduces some part of the
> parallelism you get from sharding. So unless you have a very specific
> use case and some data to back it up I wonder why you even want to try
> to control it like this ;)
> 
> 
> Best,
> Erick
> 
> On Thu, Nov 2, 2017 at 7:51 AM, Ketan Thanki  wrote:
>> Hi,
>> 
>> I have 4 shards and 4 replicas, and I do composite document routing for my
>> unique field 'Id' as mentioned below:
>> e.g.: the tenant bits are used as a "projectId/2!" prefix on the Id.
>>
>> How do we ensure that one shard does not get overloaded when we use this routing?
>> 
>> Regards,
>> Ketan.
>> 



Re: Advice on Stemming in Solr

2017-11-03 Thread Emir Arnautović
Hi Edwin,
Hunspell is a configurable, language-independent library and you can define any
morphology rules. It has been around for a while, and I would not be surprised if
someone has already adjusted the English rules to suit your case.

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
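
The last suggestion in the quoted thread below - index the same content with and
without stemming and boost the unstemmed field - could look roughly like this at
query time. Field names, boosts, and the URL are placeholders, not something
prescribed in the thread.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class ExactOverStemmedSketch {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient solr =
             new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
      SolrQuery q = new SolrQuery("walking");
      q.set("defType", "edismax");
      // The exact (unstemmed) copy of the field gets a higher boost than the
      // stemmed copy, so a literal "walking" outranks stem-only matches on "walk".
      q.set("qf", "title_exact^4 title_stemmed");
      System.out.println(solr.query(q).getResults());
    }
  }
}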



> On 3 Nov 2017, at 04:25, Zheng Lin Edwin Yeo  wrote:
> 
> Hi Emir,
> 
> We are looking to change to HunspellStemFilterFactory. This has a
> dictionary file containing words and applicable flags, and an affix file
> that specifies how these flags will control spell checking.
> Presumably we can control it from those files in HunspellStemFilterFactory?
> 
> Regards,
> Edwin
> 
> 
> On 2 November 2017 at 17:46, Emir Arnautović 
> wrote:
> 
>> Hi Edwin,
>> It seems that it would be best if you did not apply the *ing stemming rule at
>> all. One idea is to trick the stemmer: replace any word that ends with
>> "ing" with some nonexistent character combination, e.g. 'wqx'. You can use
>> solr.PatternReplaceFilterFactory to do that. You can switch it back after
>> stemming if you want to have the proper token in the index.
>> 
>> HTH,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 2 Nov 2017, at 03:23, Zheng Lin Edwin Yeo 
>> wrote:
>>> 
>>> Hi Emir,
>>> 
>>> We do have quite a lot of words that should not be stemmed. Currently, the
>>> KStemFilterFactory is stemming all the non-English words that end with
>>> "ing" as well. There are quite a lot of places and names which end in
>>> "ing", and all of these are being stemmed too, which leads to
>>> inaccurate search results.
>>> 
>>> Regards,
>>> Edwin
>>> 
>>> 
>>> On 1 November 2017 at 18:20, Emir Arnautović <
>> emir.arnauto...@sematext.com>
>>> wrote:
>>> 
 Hi Edwin,
 If the number of words that should not be stemmed is not high, you could
 use KeywordMarkerFilterFactory to flag those words as keywords; that should
 prevent the stemmer from changing them.
 Depending on what you want to achieve, you might not be able to avoid
 using a stemmer at indexing time. If you want to find documents that contain
 only “walking” with the search term “walk”, then you have to stem at index
 time. Cases where you use stemming at query time only are rare and specific.
 If you want to prefer exact matches over stemmed matches, you have to
 index the same content with and without stemming and boost matches on the
 field without stemming.
 
 HTH,
 Emir
 --
 Monitoring - Log Management - Alerting - Anomaly Detection
 Solr & Elasticsearch Consulting Support Training - http://sematext.com/
 
 
 
> On 1 Nov 2017, at 10:11, Zheng Lin Edwin Yeo 
 wrote:
> 
> Hi,
> 
> We are currently using KStemFilterFactory in Solr, but we found that it is
> actually doing stemming on non-English words like "ximenting", which it
> stems to "ximent". This is not what we wanted.
> 
> Another option is to use the HunspellStemFilterFactory, but there are some
> English words like "running" and "walking" that are not being stemmed.
> 
> Would like to check: is it advisable to use stemming at index time? Or
> should we not use stemming at index time, but at query time do a search for
> the stemmed words as well - for example, if the user searches for
> "walking", we also search with "walk", and the exact word "walking" will
> have a higher weight.
> 
> I'm currently using Solr 6.5.1.
> 
> Regards,
> Edwin
 
 
>> 
>>