Re: Solr fails even when ZK quorum has majority

2018-07-23 Thread Michael Braun
Per the exception, this looks like a network / DNS resolution issue,
independent of Solr and Zookeeper code:

Caused by: org.apache.solr.common.SolrException:
java.net.UnknownHostException: ditsearch001.es.com: Name or service not
known

Is this address actually resolvable at the time?
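
A quick way to answer that from the JVM's point of view (the same resolution path the ZooKeeper client hits) is a one-off `InetAddress.getByName` check; the host name below is the one from the stack trace:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Standalone resolvability check for a ZooKeeper host name, mirroring what
// the JVM does before any connection is attempted.
public class ResolveCheck {
    static boolean resolvable(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "ditsearch001.es.com";
        System.out.println(host + " resolvable: " + resolvable(host));
    }
}
```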

On Mon, Jul 23, 2018 at 3:46 PM, Susheel Kumar 
wrote:

> In usual circumstances, when one ZooKeeper node goes down while the other
> two are up, Solr continues to operate. But when one of the ZK machines was
> not reachable, with ping returning the result below, Solr couldn't start.
> See the stack trace below.
>
> ping: cannot resolve ditsearch001.es.com: Unknown host
>
>
> Setup: Solr 6.6.2 and Zookeeper 3.4.10
>
> I had to remove this server name from the ZK_HOST list (solr.in.sh) in
> order to get Solr started. Ideally, as long as a majority of the ensemble
> is reachable, Solr should start.
>
> Has anyone noticed this issue?
>
> Thnx
>
> 2018-07-23 15:30:47.218 INFO  (main) [   ] o.e.j.s.Server
> jetty-9.3.14.v20161028
>
> 2018-07-23 15:30:47.817 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter  ___
> _   Welcome to Apache Solr™ version 6.6.2
>
> 2018-07-23 15:30:47.829 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter / __|
> ___| |_ _   Starting in cloud mode on port 8080
>
> 2018-07-23 15:30:47.830 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter \__
> \/ _ \ | '_|  Install dir: /opt/solr
>
> 2018-07-23 15:30:47.861 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter
> |___/\___/_|_|Start time: 2018-07-23T15:30:47.832Z
>
> 2018-07-23 15:30:47.863 INFO  (main) [   ] o.a.s.s.StartupLoggingUtils
> Property solr.log.muteconsole given. Muting ConsoleAppender named CONSOLE
>
> 2018-07-23 15:30:47.929 INFO  (main) [   ] o.a.s.c.SolrResourceLoader Using
> system property solr.solr.home: /app/solr/data
>
> 2018-07-23 15:30:48.037 ERROR (main) [   ] o.a.s.s.SolrDispatchFilter Could
> not start Solr. Check solr/home property and the logs
>
> 2018-07-23 15:30:48.235 ERROR (main) [   ] o.a.s.c.SolrCore
> null:org.apache.solr.common.SolrException: Error occurred while loading
> solr.xml from zookeeper
>
> at
> org.apache.solr.servlet.SolrDispatchFilter.loadNodeConfig(
> SolrDispatchFilter.java:270)
>
> at
> org.apache.solr.servlet.SolrDispatchFilter.createCoreContainer(
> SolrDispatchFilter.java:242)
>
> at
> org.apache.solr.servlet.SolrDispatchFilter.init(
> SolrDispatchFilter.java:173)
>
> at
> org.eclipse.jetty.servlet.FilterHolder.initialize(FilterHolder.java:137)
>
> at
> org.eclipse.jetty.servlet.ServletHandler.initialize(
> ServletHandler.java:873)
>
> at
> org.eclipse.jetty.servlet.ServletContextHandler.startContext(
> ServletContextHandler.java:349)
>
> at
> org.eclipse.jetty.webapp.WebAppContext.startWebapp(
> WebAppContext.java:1404)
>
> at
> org.eclipse.jetty.webapp.WebAppContext.startContext(
> WebAppContext.java:1366)
>
> at
> org.eclipse.jetty.server.handler.ContextHandler.
> doStart(ContextHandler.java:778)
>
> at
> org.eclipse.jetty.servlet.ServletContextHandler.doStart(
> ServletContextHandler.java:262)
>
> at
> org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:520)
>
> at
> org.eclipse.jetty.util.component.AbstractLifeCycle.
> start(AbstractLifeCycle.java:68)
>
> at
> org.eclipse.jetty.deploy.bindings.StandardStarter.processBinding(
> StandardStarter.java:41)
>
> at
> org.eclipse.jetty.deploy.AppLifeCycle.runBindings(AppLifeCycle.java:188)
>
> at
> org.eclipse.jetty.deploy.DeploymentManager.requestAppGoal(
> DeploymentManager.java:499)
>
> at
> org.eclipse.jetty.deploy.DeploymentManager.addApp(
> DeploymentManager.java:147)
>
> at
> org.eclipse.jetty.deploy.providers.ScanningAppProvider.
> fileAdded(ScanningAppProvider.java:180)
>
> at
> org.eclipse.jetty.deploy.providers.WebAppProvider.
> fileAdded(WebAppProvider.java:458)
>
> at
> org.eclipse.jetty.deploy.providers.ScanningAppProvider$1.fileAdded(
> ScanningAppProvider.java:64)
>
> at org.eclipse.jetty.util.Scanner.reportAddition(Scanner.java:610)
>
> at
> org.eclipse.jetty.util.Scanner.reportDifferences(Scanner.java:529)
>
> at org.eclipse.jetty.util.Scanner.scan(Scanner.java:392)
>
> at org.eclipse.jetty.util.Scanner.doStart(Scanner.java:313)
>
> at
> org.eclipse.jetty.util.component.AbstractLifeCycle.
> start(AbstractLifeCycle.java:68)
>
> at
> org.eclipse.jetty.deploy.providers.ScanningAppProvider.
> doStart(ScanningAppProvider.java:150)
>
> at
> org.eclipse.jetty.util.component.AbstractLifeCycle.
> start(AbstractLifeCycle.java:68)
>
> at
> org.eclipse.jetty.deploy.DeploymentManager.startAppProvider(
> DeploymentManager.java:561)
>
> at
> org.eclipse.jetty.deploy.DeploymentManager.doStart(
> DeploymentManager.java:236)
>
> at
> org.eclipse.jetty.util.component.AbstractLifeCycle.
> start(AbstractLifeCycle.java:68)
>
> a
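
The reason a single bad host can block startup, even with a healthy quorum, is that the ZooKeeper 3.4.x client resolves every host in the connect string eagerly at construction time, so one unresolvable entry fails the whole client. This sketch (an illustration, not Solr code) pre-filters a ZK_HOST-style string the way the workaround above did by hand:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

// Pre-flight filter for a comma-separated "host:port,host:port" connect
// string: keep only entries whose host name currently resolves.
public class ZkHostFilter {
    static List<String> resolvableHosts(String zkHost) {
        List<String> ok = new ArrayList<>();
        for (String entry : zkHost.split(",")) {
            String host = entry.split(":")[0].trim();
            try {
                InetAddress.getByName(host);
                ok.add(entry.trim());
            } catch (UnknownHostException e) {
                System.err.println("Skipping unresolvable ZK host: " + host);
            }
        }
        return ok;
    }

    public static void main(String[] args) {
        String zkHost = "localhost:2181,no-such-host.invalid:2181";
        System.out.println(String.join(",", resolvableHosts(zkHost)));
    }
}
```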

Re: Hardware-Aware Solr Cloud Sharding?

2018-07-16 Thread Michael Braun
This ended up working well with createNodeSet=EMPTY and placing all replicas
manually. Thank you all for the assistance!
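
The "createNodeSet=EMPTY plus manual placement" approach can be sketched as plain Collections API calls (per the Solr 6.6 reference guide). This only builds the URLs; the host, collection, and node names are illustrative:

```java
// Create the collection with no replicas (createNodeSet=EMPTY), then place
// each replica explicitly with ADDREPLICA's "node" parameter.
public class ManualPlacement {
    static String createUrl(String solrBase, String collection, int numShards) {
        return solrBase + "/admin/collections?action=CREATE&name=" + collection
                + "&numShards=" + numShards + "&createNodeSet=EMPTY";
    }

    static String addReplicaUrl(String solrBase, String collection,
                                String shard, String node) {
        return solrBase + "/admin/collections?action=ADDREPLICA&collection="
                + collection + "&shard=" + shard + "&node=" + node;
    }

    public static void main(String[] args) {
        String base = "http://localhost:8983/solr";
        System.out.println(createUrl(base, "big", 3));
        // Put two replicas of shard1 on the beefier node, one on a smaller one.
        System.out.println(addReplicaUrl(base, "big", "shard1", "bignode:8983_solr"));
        System.out.println(addReplicaUrl(base, "big", "shard1", "smallnode:8983_solr"));
    }
}
```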

On Thu, Jun 14, 2018 at 9:28 AM, Jan Høydahl  wrote:

> You could also look into the Autoscaling stuff in 7.x, which can be
> programmed to move shards around based on system load and HW specs on the
> various nodes, so in theory that framework (although still a bit unstable)
> will suggest moving some replicas from weak nodes over to more powerful
> ones. If you "overshard" your system (i.e., with three nodes you create a
> collection with nine shards), there will be three shards per node, and
> Solr can suggest moving one of them off to another server.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > 12. jun. 2018 kl. 18:39 skrev Erick Erickson :
> >
> > In a mixed-hardware situation you can certainly place replicas as you
> > choose. Create a minimal collection or use the special nodeset EMPTY
> > and then place your replicas one-by-one.
> >
> > You can also consider "replica placement rules", see:
> > https://lucene.apache.org/solr/guide/6_6/rule-based-
> replica-placement.html.
> > I _think_ this would be a variant of "rack aware". In this case you'd
> > provide a "snitch" that says something about the hardware
> > characteristics and the rules you'd define would be sensitive to that.
> >
> > WARNING: haven't done this myself so don't have any examples to point
> to
> >
> > Best,
> > Erick
> >
> > On Tue, Jun 12, 2018 at 8:34 AM, Shawn Heisey 
> wrote:
> >> On 6/12/2018 9:12 AM, Michael Braun wrote:
> >>> The way to handle this right now looks to be running additional Solr
> >>> instances on nodes with increased resources to balance the load (so if
> the
> >>> machines are 1x, 1.5x, and 2x, run 2 instances, 3 instances, and 4
> >>> instances, respectively). Has anyone looked into other ways of handling
> >>> this that don't require the additional Solr instance deployments?
> >>
> >> Usually, no.  In most cases, you only want to run one Solr instance per
> >> server.  One Solr instance can handle many individual shard replicas.
> >> If there are more individual indexes on a Solr instance, then it is
> >> likely to be able to take advantage of additional system resources
> >> without running another Solr instance.
> >>
> >> The only time you should run multiple Solr instances is when the heap
> >> requirements for running the required indexes with one instance would be
> >> way too big.  Splitting the indexes between two instances with smaller
> >> heaps might end up with much better garbage collection efficiency.
> >>
> >> https://lucene.apache.org/solr/guide/7_3/taking-solr-to-
> production.html#running-multiple-solr-nodes-per-host
> >>
> >> Thanks,
> >> Shawn
> >>
>
>


Hardware-Aware Solr Cloud Sharding?

2018-06-12 Thread Michael Braun
We have a case of a Solr Cloud cluster with different kinds of nodes - some
may have significant differences in hardware specs (50-100% more
HD/RAM/CPU, etc). Ideally nodes with increased resources could take on more
shard replicas.

It looks like the Collections API (
https://lucene.apache.org/solr/guide/6_6/collections-api.html) supports
only even splitting of shards when using compositeId routing.

The way to handle this right now looks to be running additional Solr
instances on nodes with increased resources to balance the load (so if the
machines are 1x, 1.5x, and 2x, run 2 instances, 3 instances, and 4
instances, respectively). Has anyone looked into other ways of handling
this that don't require the additional Solr instance deployments?

-Michael


Re: 7.2.1 cluster dies within minutes after restart

2018-01-29 Thread Michael Braun
Believe this is reported in https://issues.apache.org/jira/browse/SOLR-10471


On Mon, Jan 29, 2018 at 2:55 PM, Markus Jelsma 
wrote:

> Hello SG,
>
> The default in solr.in.sh is commented so it defaults to the value set in
> bin/solr, which is fifteen seconds. Just uncomment the setting in
> solr.in.sh and your timeout will be thirty seconds.
>
> For Solr itself to really default to thirty seconds, Solr's bin/solr needs
> to be patched to use the correct value.
>
> Regards,
> Markus
>
> -Original message-
> > From:S G 
> > Sent: Monday 29th January 2018 20:15
> > To: solr-user@lucene.apache.org
> > Subject: Re: 7.2.1 cluster dies within minutes after restart
> >
> > Hi Markus,
> >
> > We are in the process of upgrading our clusters to 7.2.1 and I am not
> sure
> > I quite follow the conversation here.
> > Is there a simple workaround to set the ZK_CLIENT_TIMEOUT to a higher
> value
> > in the config (and it's just a default value being wrong/overridden
> > somewhere)?
> > Or is it more severe in the sense that any config set for
> ZK_CLIENT_TIMEOUT
> > by the user is just ignored completely by Solr in 7.2.1 ?
> >
> > Thanks
> > SG
> >
> >
> > On Mon, Jan 29, 2018 at 3:09 AM, Markus Jelsma <
> markus.jel...@openindex.io>
> > wrote:
> >
> > > OK, I applied the patch and it is clear the timeout is 15000. solr.xml
> > > says 30000 if ZK_CLIENT_TIMEOUT is not set, which is by default unset in
> > > solr.in.sh, but set in bin/solr to 15000. So it seems Solr's default is
> > > still 15000, not 30000.
> > >
> > > But, back to my topic. I see we explicitly set it in solr.in.sh to 30000.
> > > To be sure, I applied your patch to a production machine; all our
> > > collections run with 30000. So how would that explain this log line?
> > >
> > > o.a.z.ClientCnxn Client session timed out, have not heard from server
> in
> > > 22130ms
> > >
> > > We also see these with smaller values, e.g. seven seconds. And is this
> > > actually an indicator of the problems we have?
> > >
> > > Any ideas?
> > >
> > > Many thanks,
> > > Markus
> > >
> > >
> > > -Original message-
> > > > From:Markus Jelsma 
> > > > Sent: Saturday 27th January 2018 10:03
> > > > To: solr-user@lucene.apache.org
> > > > Subject: RE: 7.2.1 cluster dies within minutes after restart
> > > >
> > > > Hello,
> > > >
> > > > I grepped for it yesterday and found nothing but 30000 in the
> > > > settings, but judging from the weird timeout value, you may be right.
> > > > Let me apply your patch early next week and check for spurious
> > > > warnings.
> > > >
> > > > Another noteworthy observation for those working on cloud stability
> > > > and recovery: whenever this happens, some nodes are also absolutely
> > > > sure to run OOM. The leaders usually live longest, the replicas don't;
> > > > their heap usage peaks every time, consistently.
> > > >
> > > > Thanks,
> > > > Markus
> > > >
> > > > -Original message-
> > > > > From:Shawn Heisey 
> > > > > Sent: Saturday 27th January 2018 0:49
> > > > > To: solr-user@lucene.apache.org
> > > > > Subject: Re: 7.2.1 cluster dies within minutes after restart
> > > > >
> > > > > On 1/26/2018 10:02 AM, Markus Jelsma wrote:
> > > > > > o.a.z.ClientCnxn Client session timed out, have not heard from
> > > > > > server in 22130ms (although zkClientTimeout is 30000).
> > > > >
> > > > > Are you absolutely certain that there is a setting for
> zkClientTimeout
> > > > > that is actually getting applied?  The default value in Solr's
> example
> > > > > configs is 30 seconds, but the internal default in the code (when
> no
> > > > > configuration is found) is still 15.  I have confirmed this in the
> > > code.
> > > > >
> > > > > Looks like SolrCloud doesn't log the values it's using for things
> like
> > > > > zkClientTimeout.  I think it should.
> > > > >
> > > > > https://issues.apache.org/jira/browse/SOLR-11915
> > > > >
> > > > > Thanks,
> > > > > Shawn
> > > > >
> > > > >
> > > >
> > >
> >
>
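
Markus's suggestion above boils down to one line in solr.in.sh; a config fragment (value in milliseconds):

```sh
# solr.in.sh -- uncomment/set explicitly instead of relying on the
# 15-second fallback hardcoded in bin/solr:
ZK_CLIENT_TIMEOUT="30000"
```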


Re: NullPointerException in PeerSync.handleUpdates

2017-11-22 Thread Michael Braun
I went ahead and resolved the JIRA; it was never seen again by us in later
versions of Solr. There have been a number of bug fixes since the 6.2
release, so I personally recommend updating!
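
Per the linked code, the NPE here was raised from inside a log statement during recovery. A generic illustration of the guard pattern (hypothetical, not the actual PeerSync fix):

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: a message-building helper that a log statement calls
// can itself throw NPE if it dereferences a value that is legitimately null
// during recovery. Guarding the dereference keeps logging from failing the
// operation it was meant to describe.
public class SafeLogging {
    static String describe(List<String> updates) {
        // Broken version would be: updates.size() + " updates" (NPE on null).
        return updates == null ? "no updates" : updates.size() + " updates";
    }

    public static void main(String[] args) {
        System.out.println(describe(null));
        System.out.println(describe(Arrays.asList("a", "b")));
    }
}
```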

On Wed, Nov 22, 2017 at 11:48 AM, Pushkar Raste 
wrote:

> As mentioned in the JIRA, the exception seems to be coming from a log
> statement. The issue was fixed in 6.3; here is the relevant line from 6.3:
> https://github.com/apache/lucene-solr/blob/releases/
> lucene-solr/6.3.0/solr/core/src/java/org/apache/solr/
> update/PeerSync.java#L707
>
>
>
> On Wed, Nov 22, 2017 at 1:18 AM, Erick Erickson 
> wrote:
>
> > Right, if there's no "fixed version" mentioned and the resolution is
> > "unresolved", it's not in the code base at all. But that JIRA is
> > apparently not reproducible, especially on versions more recent than
> > 6.2. Is it possible to test a more recent version? 6.6.2 would be my
> > recommendation.
> >
> > Erick
> >
> > On Tue, Nov 21, 2017 at 9:58 PM, S G  wrote:
> > > My bad. I found it at https://issues.apache.org/jira/browse/SOLR-9453
> > > But I could not find it in CHANGES.txt, perhaps because it's not yet
> > > resolved.
> > >
> > > On Tue, Nov 21, 2017 at 9:15 AM, Erick Erickson <
> erickerick...@gmail.com
> > >
> > > wrote:
> > >
> > >> Did you check the JIRA list? Or CHANGES.txt in more recent versions?
> > >>
> > >> On Tue, Nov 21, 2017 at 1:13 AM, S G 
> wrote:
> > >> > Hi,
> > >> >
> > >> > We are running 6.2 version of Solr and hitting this error
> frequently.
> > >> >
> > >> > Error while trying to recover. core=my_core:java.lang.
> > >> NullPointerException
> > >> > at org.apache.solr.update.PeerSync.handleUpdates(
> > >> PeerSync.java:605)
> > >> > at org.apache.solr.update.PeerSync.handleResponse(
> > >> PeerSync.java:344)
> > >> > at org.apache.solr.update.PeerSync.sync(PeerSync.java:257)
> > >> > at org.apache.solr.cloud.RecoveryStrategy.doRecovery(
> > >> RecoveryStrategy.java:376)
> > >> > at org.apache.solr.cloud.RecoveryStrategy.run(
> > >> RecoveryStrategy.java:221)
> > >> > at java.util.concurrent.Executors$RunnableAdapter.
> > >> call(Executors.java:511)
> > >> > at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> > >> > at org.apache.solr.common.util.ExecutorUtil$
> > >> MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> > >> > at java.util.concurrent.ThreadPoolExecutor.runWorker(
> > >> ThreadPoolExecutor.java:1142)
> > >> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> > >> ThreadPoolExecutor.java:617)
> > >> > at java.lang.Thread.run(Thread.java:745)
> > >> >
> > >> >
> > >> >
> > >> > Is this a known issue and fixed in some newer version?
> > >> >
> > >> >
> > >> > Thanks
> > >> > SG
> > >>
> >
>


Re: Solr 6.6.0 - High CPU during indexing

2017-08-18 Thread Michael Braun
Have you attached JVisualVM or a similar application to the process to
sample where the time is being spent? It can be very helpful for debugging
this sort of problem.
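
When a GUI profiler can't be attached (e.g., a headless server), JMX per-thread CPU times give a crude in-process alternative. A generic sketch using only the JDK, not Solr-specific:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Sum per-thread CPU time for the current JVM; sampling this repeatedly and
// diffing per thread shows which threads are burning the CPU.
public class CpuSample {
    static long totalThreadCpuNanos() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadCpuTimeSupported()) return 0;
        long total = 0;
        for (long id : mx.getAllThreadIds()) {
            long cpu = mx.getThreadCpuTime(id); // -1 if the thread has died
            if (cpu > 0) total += cpu;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("total thread CPU ns: " + totalThreadCpuNanos());
    }
}
```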

On Fri, Aug 18, 2017 at 12:37 PM, Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> Indexing about 15 million documents per day across 100 shards on 45
> servers.  Up until about 350 million documents, each of the solr instances
> was taking up about 1 core (100% CPU).  Recently, they all jumped to 700%.
> Is this normal?  Anything that I can check for?
>
> I don't see anything unusual in the solr logs.  Sample from the GC logs:
>
> ---
>
> 2017-08-18 11:53:15 GC log file created /opt/solr6/server/logs/solr_gc.log.2
> OpenJDK 64-Bit Server VM (25.141-b16) for linux-amd64 JRE (1.8.0_141-b16),
> built on Jul 20 2017 11:14:57 by "mockbuild" with gcc 4.4.7 20120313 (Red
> Hat 4.4.7-18)
> Memory: 4k page, physical 99016188k(796940k free), swap
> 33554428k(32614048k free)
> CommandLine flags: -XX:+AggressiveOpts -XX:CICompilerCount=12
> -XX:ConcGCThreads=4 -XX:G1HeapRegionSize=16777216
> -XX:GCLogFileSize=20971520 -XX:InitialHeapSize=17179869184
> -XX:InitiatingHeapOccupancyPercent=75 -XX:MarkStackSize=4194304
> -XX:MaxDirectMemorySize=3221225472 -XX:MaxGCPauseMillis=300
> -XX:MaxHeapSize=30064771072 -XX:MaxNewSize=18035507200 <(803)%20550-7200>
> -XX:MinHeapDeltaBytes=16777216 -XX:NumberOfGCLogFiles=9
> -XX:OnOutOfMemoryError=/opt/solr6/bin/oom_solr.sh 9100
> /opt/solr6/server/logs -XX:ParallelGCThreads=16 -XX:+ParallelRefProcEnabled
> -XX:+PerfDisableSharedMem -XX:+PrintGC -XX:+PrintGCApplicationStoppedTime
> -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
> -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -XX:-ResizePLAB
> -XX:ThreadStackSize=256 -XX:+UseCompressedClassPointers
> -XX:+UseCompressedOops -XX:+UseG1GC -XX:+UseGCLogFileRotation
> -XX:+UseLargePages
> {Heap before GC invocations=559440 (full 0):
>  garbage-first heap   total 29360128K, used 24944705K [0xc000,
> 0xc1003800, 0x0007c000)
>   region size 16384K, 1075 young (17612800K), 13 survivors (212992K)
>  Metaspace   used 95460K, capacity 97248K, committed 97744K, reserved
> 1134592K
>   class spaceused 11616K, capacity 12104K, committed 12240K, reserved
> 1048576K
> 2017-08-18T11:53:15.985-0400: 522594.835: [GC pause (G1 Evacuation Pause)
> (young)
> Desired survivor size 1132462080 bytes, new threshold 15 (max 15)
> - age   1:   23419920 bytes,   23419920 total
> - age   2:9355296 bytes,   32775216 total
> - age   3:2455384 bytes,   35230600 total
> - age   4:   38246704 bytes,   73477304 total
> - age   5:   47064408 bytes,  120541712 total
> - age   6:   13228864 bytes,  133770576 total
> - age   7:   23990800 bytes,  157761376 total
> - age   8:1031416 bytes,  158792792 total
> - age   9:   17011128 bytes,  175803920 total
> - age  10:7371888 bytes,  183175808 total
> - age  11:6226576 bytes,  189402384 total
> - age  12: 637184 bytes,  190039568 total
> - age  13:   11577864 bytes,  201617432 total
> - age  14:9519224 bytes,  211136656 total
> - age  15: 672304 bytes,  211808960 total
> , 0.0391210 secs]
>[Parallel Time: 32.1 ms, GC Workers: 16]
>   [GC Worker Start (ms): Min: 522594835.0, Avg: 522594835.1, Max:
> 522594835.2, Diff: 0.2]
>   [Ext Root Scanning (ms): Min: 0.5, Avg: 0.8, Max: 2.2, Diff: 1.7,
> Sum: 12.2]
>   [Update RS (ms): Min: 0.9, Avg: 2.3, Max: 3.2, Diff: 2.2, Sum: 36.6]
>  [Processed Buffers: Min: 3, Avg: 4.7, Max: 8, Diff: 5, Sum: 75]
>   [Scan RS (ms): Min: 0.1, Avg: 0.2, Max: 0.4, Diff: 0.3, Sum: 3.0]
>   [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0,
> Sum: 0.1]
>   [Object Copy (ms): Min: 27.7, Avg: 28.3, Max: 28.6, Diff: 0.8, Sum:
> 453.5]
>   [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
>  [Termination Attempts: Min: 1, Avg: 1.3, Max: 2, Diff: 1, Sum: 21]
>   [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.3, Diff: 0.3, Sum:
> 2.4]
>   [GC Worker Total (ms): Min: 31.6, Avg: 31.7, Max: 32.0, Diff: 0.4,
> Sum: 507.9]
>   [GC Worker End (ms): Min: 522594866.7, Avg: 522594866.8, Max:
> 522594867.0, Diff: 0.2]
>[Code Root Fixup: 0.1 ms]
>[Code Root Purge: 0.0 ms]
>[Clear CT: 1.7 ms]
>[Other: 5.2 ms]
>   [Choose CSet: 0.0 ms]
>   [Ref Proc: 2.9 ms]
>   [Ref Enq: 0.1 ms]
>   [Redirty Cards: 0.2 ms]
>   [Humongous Register: 0.1 ms]
>   [Humongous Reclaim: 0.0 ms]
>   [Free CSet: 1.6 ms]
>[Eden: 16.6G(16.6G)->0.0B(16.6G) Survivors: 208.0M->208.0M Heap:
> 23.8G(28.0G)->7371.4M(28.0G)]
> Heap after GC invocations=559441 (full 0):
>  garbage-first heap   total 29360128K, used 7548353K [0xc000,
> 0xc1003800, 0x0007c000)
>   region size 16384K, 13 young (212992K), 13 survivors (212992K)
>  Metaspace   used 95460K, capacity 97248K, committed 97744K, reserved
> 1134592K

Re: Highlighting Performance improvement suggestions required - Solr 6.5.1

2017-08-09 Thread Michael Braun
Have you attached JVisualVM or a similar tool for sampling when Solr is
answering the requests with highlight? What relevant methods are coming up?

On Wed, Aug 9, 2017 at 11:26 AM, sasarun  wrote:

> Hi Amrit,
>
> Thanks for the response. I did go through both, and that is how I landed
> on the unified method for the highlighter.
>
> Thanks,
> Arun
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/Highlighting-Performance-improvement-
> suggestions-required-Solr-6-5-1-tp4349767p4349781.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>