Re: Problem about logging in openshift origin

2017-09-18 Thread Peter Portante
On Mon, Sep 18, 2017 at 2:33 AM, Yu Wei <yu20...@hotmail.com> wrote:

> Hi Peter,
>
> The storage is EmptyDir for es pods.
>

How much storage do you have available for each ES pod to use?  ES can
fill TBs of storage if the amount of logging is high enough.

How big are your ES indices?
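One way to answer that question is to query the `_cat/indices` endpoint from inside an ES pod and sum the document counts. This is a sketch: the certificate paths are the origin-aggregated-logging defaults (verify them in your deployment), and the sample output below is illustrative, not from this thread.

```shell
# From outside the pod, something like:
#
#   oc exec -c elasticsearch "$ES_POD" -- curl -s \
#     --cacert /etc/elasticsearch/secret/admin-ca \
#     --cert   /etc/elasticsearch/secret/admin-cert \
#     --key    /etc/elasticsearch/secret/admin-key \
#     'https://localhost:9200/_cat/indices?v&h=index,docs.count,store.size'
#
# Given output of that shape (sample data), the total document count
# sums with awk:
cat <<'EOF' > /tmp/indices.txt
index                        docs.count store.size
project.foo.2017.09.17           120000      1.2gb
project.bar.2017.09.17            30000    310.5mb
.searchguard.logging-es-abc           5     40.1kb
EOF
awk 'NR > 1 { total += $2 } END { print total }' /tmp/indices.txt   # prints 150005
```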

-peter

> What's the meaning of aos-int-services? I only enabled the logging feature
> during the Ansible installation.
>
>
> Thanks,
>
> Jared, (韦煜)
> Software developer
> Interested in open source software, big data, Linux
> ------
> *From:* Peter Portante <pport...@redhat.com>
> *Sent:* Friday, September 15, 2017 7:20:18 PM
> *To:* Yu Wei
> *Cc:* users@lists.openshift.redhat.com; d...@lists.openshift.redhat.com;
> aos-int-services
> *Subject:* Re: Problem about logging in openshift origin
>
>
>
> On Fri, Sep 15, 2017 at 6:10 AM, Yu Wei <yu20...@hotmail.com> wrote:
>
>> Hi,
>>
>> I setup OpenShift origin 3.6 cluster successfully and enabled metrics and
>> logging.
>>
>> Metrics worked well but logging didn't work.
>>
>> Pod *logging-es-data-master-lf6al5rb-5-deploy* in logging frequently
>> crashed with the logs below:
>>
>> *--> Scaling logging-es-data-master-lf6al5rb-5 to 1 *
>> *--> Waiting up to 10m0s for pods in rc logging-es-data-master-lf6al5rb-5
>> to become ready *
>> *error: update acceptor rejected logging-es-data-master-lf6al5rb-5: pods
>> for rc "logging-es-data-master-lf6al5rb-5" took longer than 600 seconds to
>> become ready*
>>
>> I didn't find any other information. How can I debug such a problem?
>>
> Hi Yu,
>
> Added aos-int-services ...
>
> How many indices do you have in the Elasticsearch instance?
>
> What is the storage configuration for the Elasticsearch pods?
>
> Regards, -peter
>
>
>
>>
>> Thanks,
>>
>> Jared, (韦煜)
>> Software developer
>> Interested in open source software, big data, Linux
>>
>> ___
>> users mailing list
>> users@lists.openshift.redhat.com
>> http://lists.openshift.redhat.com/openshiftmm/listinfo/users
>>
>>
>


Re: Problem about logging in openshift origin

2017-09-15 Thread Peter Portante
On Fri, Sep 15, 2017 at 6:10 AM, Yu Wei  wrote:

> Hi,
>
> I setup OpenShift origin 3.6 cluster successfully and enabled metrics and
> logging.
>
> Metrics worked well but logging didn't work.
>
> Pod *logging-es-data-master-lf6al5rb-5-deploy* in logging frequently
> crashed with the logs below:
>
> *--> Scaling logging-es-data-master-lf6al5rb-5 to 1 *
> *--> Waiting up to 10m0s for pods in rc logging-es-data-master-lf6al5rb-5
> to become ready *
> *error: update acceptor rejected logging-es-data-master-lf6al5rb-5: pods
> for rc "logging-es-data-master-lf6al5rb-5" took longer than 600 seconds to
> become ready*
>
> I didn't find any other information. How can I debug such a problem?
>
Hi Yu,

Added aos-int-services ...

How many indices do you have in the Elasticsearch instance?

What is the storage configuration for the Elasticsearch pods?

Regards, -peter



>
> Thanks,
>
> Jared, (韦煜)
> Software developer
> Interested in open source software, big data, Linux
>
>
>


Re: [Logging] searchguard configuration issue? ["warning", "elasticsearch"], "pid":1, "message":"Unable to revive connection: https://logging-es:9200/"}

2017-07-12 Thread Peter Portante
On Wed, Jul 12, 2017 at 9:28 AM, Stéphane Klein
<cont...@stephane-klein.info> wrote:

>
> 2017-07-12 15:20 GMT+02:00 Peter Portante <pport...@redhat.com>:
>
>> This looks a lot like this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1449378,
>> "Timeout after 30SECONDS while retrieving configuration"
>>
>> What version of Origin are you using?
>>
>>
> Logging image : origin-logging-elasticsearch:v1.5.0
>
> $ oc version
> oc v1.4.1+3f9807a
> kubernetes v1.4.0+776c994
> features: Basic-Auth
>
> Server https://console.tech-angels.net:443
> openshift v1.5.0+031cbe4
> kubernetes v1.5.2+43a9be4
>
> and with 1.4 nodes because of this crazy bug:
> https://github.com/openshift/origin/issues/14092
>
>
>> I found that I had to run the sgadmin script in each ES pod at the same
>> time, and when one succeeds and one fails, just run it again and it worked.
>>
>>
> OK, I'll try that. How can I execute the sgadmin script manually?
>

You can see it in the run.sh script in each pod; look for the invocation
of sgadmin there.
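To locate it concretely, a sketch; the excerpt below only approximates what the sgadmin invocation in the v1.5 run.sh looks like (flag values assumed, check your image), it is not a verbatim copy:

```shell
# Inside each ES pod:
#
#   oc rsh <logging-es-pod> bash
#   grep -n 'sgadmin' ./run.sh
#
# The invocation found there looks roughly like this excerpt:
cat <<'EOF' > /tmp/run.sh.excerpt
/usr/share/elasticsearch/plugins/search-guard-2/tools/sgadmin.sh \
  -cd /opt/app-root/src/sgconfig/ \
  -ks /etc/elasticsearch/secret/searchguard.key \
  -ts /etc/elasticsearch/secret/searchguard.truststore \
  -nhnv -i ".searchguard.$HOSTNAME"
EOF
grep -c 'sgadmin' /tmp/run.sh.excerpt   # prints 1 (one matching line)
```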

-peter



>
> Best regards,
> Stéphane
>


Re: [Logging] searchguard configuration issue? ["warning", "elasticsearch"], "pid":1, "message":"Unable to revive connection: https://logging-es:9200/"}

2017-07-12 Thread Peter Portante
This looks a lot like this BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1449378, "Timeout after
30SECONDS while retrieving configuration"

What version of Origin are you using?

I found that I had to run the sgadmin script in each ES pod at the same
time, and when one succeeds and one fails, just run it again and it worked.

It seems to have to do with the sgadmin script trying to be sure that all
nodes can see the searchguard index, but since we create one per node, if
another node does not have searchguard successfully set up, the current
node's setup will fail.  Retrying at the same time until they all work
seems to be the fix. :(
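That retry-in-parallel workaround can be sketched as a loop. `run_sgadmin` is a simulated stand-in here (it fails on the first attempt, then succeeds); in a real cluster it would wrap something like `oc exec "$pod" -- ...` to invoke sgadmin in each ES pod:

```shell
# Repeat sgadmin across all ES pods until every run succeeds at once.
attempt=0
run_sgadmin() {          # simulated: fails until the second attempt
  [ "$attempt" -ge 2 ]
}
ok=false
while ! $ok; do
  attempt=$((attempt + 1))
  ok=true
  for pod in logging-es-pod-1 logging-es-pod-2; do
    run_sgadmin "$pod" || ok=false   # in reality: oc exec "$pod" -- ...
  done
done
echo "sgadmin succeeded in all pods on attempt $attempt"
```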

-peter

On Wed, Jul 12, 2017 at 9:03 AM, Stéphane Klein  wrote:

> Hi,
>
> Since one day, after ES cluster pods restart, I have this error message
> when I launch logging-es:
>
> $ oc logs -f logging-es-ne81bsny-5-jdcdk
> Comparing the specificed RAM to the maximum recommended for
> ElasticSearch...
> Inspecting the maximum RAM available...
> ES_JAVA_OPTS: '-Dmapper.allow_dots_in_name=true -Xms128M -Xmx4096m'
> Checking if Elasticsearch is ready on https://localhost:9200
> ..Will connect to localhost:9300 ...
> done
> Contacting elasticsearch cluster 'elasticsearch' and wait for YELLOW
> clusterstate ...
> Clustername: logging-es
> Clusterstate: YELLOW
> Number of nodes: 2
> Number of data nodes: 2
> .searchguard.logging-es-ne81bsny-5-jdcdk index does not exists, attempt
> to create it ... done (with 1 replicas, auto expand replicas is off)
> Populate config from /opt/app-root/src/sgconfig/
> Will update 'config' with /opt/app-root/src/sgconfig/sg_config.yml
>SUCC: Configuration for 'config' created or updated
> Will update 'roles' with /opt/app-root/src/sgconfig/sg_roles.yml
>SUCC: Configuration for 'roles' created or updated
> Will update 'rolesmapping' with /opt/app-root/src/sgconfig/sg_roles_mapping.yml
>SUCC: Configuration for 'rolesmapping' created or updated
> Will update 'internalusers' with /opt/app-root/src/sgconfig/sg_internal_users.yml
>SUCC: Configuration for 'internalusers' created or updated
> Will update 'actiongroups' with /opt/app-root/src/sgconfig/sg_action_groups.yml
>SUCC: Configuration for 'actiongroups' created or updated
> Timeout (java.util.concurrent.TimeoutException: Timeout after 30SECONDS
> while retrieving configuration for [config, roles, rolesmapping,
> internalusers, actiongroups](index=.searchguard.logging-es-x39myqbs-1-s5g7c))
> Done with failures
>
> after some time, my ES cluster (2 nodes) is green:
>
> stephane$ oc rsh logging-es-x39myqbs-1-s5g7c bash
> $ curl ... /etc/elasticsearch/secret/admin-cert ... https://localhost:9200/_cluster/health?pretty=true
> {
>   "cluster_name" : "logging-es",
>   "status" : "green",
>   "timed_out" : false,
>   "number_of_nodes" : 2,
>   "number_of_data_nodes" : 2,
>   "active_primary_shards" : 1643,
>   "active_shards" : 3286,
>   "relocating_shards" : 0,
>   "initializing_shards" : 0,
>   "unassigned_shards" : 0,
>   "delayed_unassigned_shards" : 0,
>   "number_of_pending_tasks" : 0,
>   "number_of_in_flight_fetch" : 0,
>   "task_max_waiting_in_queue_millis" : 0,
>   "active_shards_percent_as_number" : 100.0
> }
>
> I have this error in the kibana container:
>
> $ oc logs -f -c kibana logging-kibana-1-jblhl
> {"type":"log","@timestamp":"2017-07-12T12:54:54Z","tags":[
> "warning","elasticsearch"],"pid":1,"message":"No living connections"}
> {"type":"log","@timestamp":"2017-07-12T12:54:57Z","tags":[
> "warning","elasticsearch"],"pid":1,"message":"Unable to revive
> connection: https://logging-es:9200/"}
>
> But from the Kibana container I can access the Elasticsearch server:
>
> $ oc rsh -c kibana logging-kibana-1-jblhl bash
> $ curl https://logging-es:9200/ --cacert /etc/kibana/keys/ca --key
> /etc/kibana/keys/key --cert /etc/kibana/keys/cert
> {
>   "name" : "Adri Nital",
>   "cluster_name" : "logging-es",
>   "cluster_uuid" : "iRo3wOHWSq2bTZskrIs6Zg",
>   "version" : {
> "number" : "2.4.4",
> "build_hash" : "fcbb46dfd45562a9cf00c604b30849a6dec6b017",
> "build_timestamp" : "2017-01-03T11:33:16Z",
> "build_snapshot" : false,
> "lucene_version" : "5.5.2"
>   },
>   "tagline" : "You Know, for Search"
> }
>
> How can I fix this error?
>
> Best regards,
> Stéphane
> --
> Stéphane Klein 
> blog: http://stephane-klein.info
> cv : http://cv.stephane-klein.info
> Twitter: http://twitter.com/klein_stephane
>
>
>


Re: [Logging] What component forward log entries to fluentd input service?

2017-07-11 Thread Peter Portante
On Tue, Jul 11, 2017 at 9:00 AM, Alex Wauck  wrote:
> Last I checked (OpenShift Origin 1.2), fluentd was just slurping up the log
> files produced by Docker.  It can do that because the pods it runs in have
> access to the host filesystem.
>
> On Tue, Jul 11, 2017 at 6:12 AM, Stéphane Klein
>  wrote:
>>
>> Hi,
>>
>> I see here
>> https://github.com/openshift/origin-aggregated-logging/blob/master/fluentd/configs.d/input-post-forward-mux.conf#L2
>> that fluentd logging system use secure_forward input system.
>>
>> My question: what component forwards log entries to the fluentd input service?

The "mux" service is a concentrator of sorts.

Without the mux service, each fluentd pod runs on a host in an
OpenShift cluster collecting logs and sending them to Elasticsearch
directly.  The collectors also have the responsibility of enhancing
the logs collected with the metadata that describes which
pod/container they came from.  This requires connections to the API
server to get that information.

So in a large cluster (200+ nodes, maybe fewer, maybe more), the API
servers are overwhelmed by requests from all the fluentd pods.

With the mux service, the fluentd collection pods only talk to
the mux service and DO NOT talk to the API server; they simply send
the logs they collect to the mux fluentd instance.

The mux fluentd instance in turn talks to the API service to enrich
the logs with the pod/container metadata and then sends them along to
Elasticsearch.

This scales much better.
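In configuration terms, the collector-to-mux hop uses fluentd's secure_forward protocol (the repo link above shows the mux input side). A collector-side output might look roughly like this sketch; the service name, port, cert path, and env var names are assumptions, not taken from the repo:

```conf
# Collector-side fluentd output (sketch): send all collected records to
# the mux service instead of enriching them locally via the API server.
<match **>
  @type secure_forward
  self_hostname "#{ENV['HOSTNAME']}"
  shared_key    "#{ENV['MUX_SHARED_KEY']}"   # assumed env var name
  secure yes
  ca_cert_path /etc/fluent/muxkeys/ca        # assumed cert path
  <server>
    host mux.logging.svc.cluster.local       # assumed service DNS name
    port 24284                               # secure_forward default port
  </server>
</match>
```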

-peter


>>
>> Best regards,
>> Stéphane
>> --
>> Stéphane Klein 
>> blog: http://stephane-klein.info
>> cv : http://cv.stephane-klein.info
>> Twitter: http://twitter.com/klein_stephane
>>
>>
>
>
>
> --
>
> Alex Wauck // Senior DevOps Engineer
>
> E X O S I T E
> www.exosite.com
>
> Making Machines More Human.
>
>
>



Re: Origin-Aggregated-Logging OPS generate 10Go ES data by day, 40000 hits by hours

2017-07-07 Thread Peter Portante
On Fri, Jul 7, 2017 at 9:52 AM, Stéphane Klein
<cont...@stephane-klein.info> wrote:
>
> 2017-07-07 15:51 GMT+02:00 Stéphane Klein <cont...@stephane-klein.info>:
>>
>> 2017-07-07 14:26 GMT+02:00 Peter Portante <pport...@redhat.com>:
>>>
>>> >
>>> > 40,000 hits per hour!
>>>
>>> How are you determining 40,000 hits per hour?
>>>
>>
>> I did a search in Kibana, last hour => 40,000 hits
>
>
> for one node.

Can you share the query you put into Kibana?  And share what version
of origin you are using?  Perhaps this is 1.4 or 1.5?

Finally, can you use the "Discovery" tab in Kibana to view the entire
JSON document for one of the log entries, so I can see the other
metadata?

Thanks, -peter



Re: Origin-Aggregated-Logging OPS generate 10Go ES data by day, 40000 hits by hours

2017-07-07 Thread Peter Portante
On Fri, Jul 7, 2017 at 5:15 AM, Stéphane Klein
 wrote:
> Hi,
>
> Origin-Aggregated-Logging
> (https://github.com/openshift/origin-aggregated-logging) is installed on my
> cluster and I have enabled "OPS" option.
>
> Then, I have two ElasticSearch clusters:
>
> * ES
> * ES-OPS
>
> My issue: OPS logging generates 10 GB of ES data per day!
>
> origin-node log level is set at 0 (errors and warnings only).
>
> These are some logging records:
>
> /usr/bin/dockerd-current --add-runtime
> docker-runc=/usr/libexec/docker/docker-runc-current
> --default-runtime=docker-runc --exec-opt native.cgroupdriver=systemd
> --userland-proxy-path=/usr/libexec/docker/docker-proxy-current
> --selinux-enabled --insecure-registry=172.30.0.0/16 --log-driver=journald
> --storage-driver devicemapper --storage-opt dm.fs=xfs --storage-opt
> dm.thinpooldev=/dev/mapper/cah-docker--pool --storage-opt
> dm.use_deferred_removal=true --storage-opt dm.use_deferred_deletion=true
>
> /usr/lib/systemd/systemd --switched-root --system --deserialize 19
>
> /usr/bin/docker-current run --name origin-node --rm --privileged --net=host
> --pid=host --env-file=/etc/sysconfig/origin-node -v /:/rootfs:ro,rslave -e
> CONFIG_FILE=/etc/origin/node/node-config.yaml -e OPTIONS=--loglevel=0 -e
> HOST=/rootfs -e HOST_ETC=/host-etc -v /var/lib/origin:/var/lib/origin:rslave
> -v /etc/origin/node:/etc/origin/node -v /etc/localtime:/etc/localtime:ro -v
> /etc/machine-id:/etc/machine-id:ro -v /run:/run -v /sys:/sys:rw -v
> /sys/fs/cgroup:/sys/fs/cgroup:rw -v /usr/bin/docker:/usr/bin/docker:ro -v
> /var/lib/docker:/var/lib/docker -v /lib/modules:/lib/modules -v
> /etc/origin/openvswitch:/etc/openvswitch -v
> /etc/origin/sdn:/etc/openshift-sdn -v /var/lib/cni:/var/lib/cni -v
> /etc/systemd/system:/host-etc/systemd/system -v /var/log:/var/log -v
> /dev:/dev --volume=/usr/bin/docker-current:/usr/bin/docker-current:ro
> --volume=/etc/sysconfig/docker:/etc/sysconfig/docker:ro
> openshift/node:v1.4.1
>
> ...
>
> 40,000 hits per hour!

How are you determining 40,000 hits per hour?

What query are you doing to determine the above log entries?

Thanks, -peter

>
> I don't understand why I have all these log records. Is this usual?
>
> How can I fix it?
>
> Best regards,
> Stéphane
> --
> Stéphane Klein 
> blog: http://stephane-klein.info
> cv : http://cv.stephane-klein.info
> Twitter: http://twitter.com/klein_stephane
>
>



Re: split long log records

2017-06-19 Thread Peter Portante
Hi Andre,

This is a hard-coded Docker size.  For background see:

 * https://bugzilla.redhat.com/show_bug.cgi?id=1422008, "[RFE] Fluentd
handling of long log lines (> 16KB) split by Docker and indexed into
several ES documents"
   * And the reason for the original 16 KB limit:
https://bugzilla.redhat.com/show_bug.cgi?id=1335951, "heavy logging
leads to Docker daemon OOM-ing"

The processor that reads the json-file documents for sending to
graylog needs to be endowed with the smarts to handle reconstruction
of those log lines, most likely with some other upper bound (as a
container is not required to emit newlines on stdout or stderr).
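A sketch of that reconstruction, over the json-file record format Andre quotes below: concatenate consecutive "log" fields until one ends in a newline. The sed-based field extraction is illustrative only (it assumes no `","stream` sequence inside the message); a real processor should use a JSON parser.

```shell
# Two json-file records for one logical log line, split at the 16KB
# boundary (sample data; message text is illustrative):
cat <<'EOF' > /tmp/records.json
{"log":"The quick brown fox jumps ov","stream":"stdout","time":"2017-06-19T15:27:33.130524954Z"}
{"log":"er the lazy dog.\n","stream":"stdout","time":"2017-06-19T15:27:33.130636562Z"}
EOF
# Strip the JSON envelope, then buffer fragments until one ends in "\n":
sed -e 's/^{"log":"//' -e 's/","stream.*$//' /tmp/records.json \
  | awk '{ buf = buf $0 }
         /\\n$/ { sub(/\\n$/, "", buf); print buf; buf = "" }'
# prints: The quick brown fox jumps over the lazy dog.
```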

Regards,

-peter

On Mon, Jun 19, 2017 at 11:43 AM, Andre Esser
<andre.es...@voidbridge.com> wrote:
> We use Graylog for log visualisation. However, it turns out that's not
> the culprit. Log entries in the pod's log file are already split into
> chunks of 16KB, like this:
>
> {"log":"The quick brown[...]jumps ov","stream":"stdout",\
> "time":"2017-06-19T15:27:33.130524954Z"}
>
> {"log":"er the lazy dog.\n","stream":"stdout",\
> "time":"2017-06-19T15:27:33.130636562Z"}
>
> So, to cut a long story short, is there any way to increase the size limit
> before a log record gets split into two JSON records?
>
>
>
>
> On 2017-06-19 16:21, Peter Portante wrote:
>>
>> Who set up Graylog for OpenShift?
>>
>> -peter
>>
>> On Mon, Jun 19, 2017 at 11:18 AM, Andre Esser
>> <andre.es...@voidbridge.com> wrote:
>>>
>>> I meant the limit in Graylog. Although I just noticed that it is actually
>>> 16384 (16KB). The line split after 2048 characters only applies on the
>>> web
>>> UI.
>>>
>>> Is this a Graylog limitation and can it be extended?
>>>
>>>
>>> On 2017-06-19 14:21, Jessica Forrester wrote:
>>>>
>>>>
>>>> Are you asking about logs in the web console, the `oc logs` command, or
>>>> in
>>>> Kibana?
>>>>
>>>> On Mon, Jun 19, 2017 at 8:29 AM, Andre Esser <andre.es...@voidbridge.com
>>>> <mailto:andre.es...@voidbridge.com>> wrote:
>>>>
>>>>  Hi,
>>>>
>>>>  In Origin 1.4.1 all log records longer than 2048 characters are
>>>>  split over two lines (longer than 4096 characters over three lines
>>>>  and so on).
>>>>
>>>>  Is there any way to increase this limit?
>>>>
>>>>
>>>>  Thanks,
>>>>
>>>>  Andre
>>>
>>>
>>>



Re: split long log records

2017-06-19 Thread Peter Portante
Who set up Graylog for OpenShift?

-peter

On Mon, Jun 19, 2017 at 11:18 AM, Andre Esser
 wrote:
> I meant the limit in Graylog. Although I just noticed that it is actually
> 16384 (16KB). The line split after 2048 characters only applies on the web
> UI.
>
> Is this a Graylog limitation and can it be extended?
>
>
> On 2017-06-19 14:21, Jessica Forrester wrote:
>>
>> Are you asking about logs in the web console, the `oc logs` command, or in
>> Kibana?
>>
>> On Mon, Jun 19, 2017 at 8:29 AM, Andre Esser > > wrote:
>>
>> Hi,
>>
>> In Origin 1.4.1 all log records longer than 2048 characters are
>> split over two lines (longer than 4096 characters over three lines
>> and so on).
>>
>> Is there any way to increase this limit?
>>
>>
>> Thanks,
>>
>> Andre
>
>



Re: logging-es errors: shards failed

2016-07-15 Thread Peter Portante
Eric, Luke,

Do the logs from the ES instance itself flow into that ES instance?

-peter

On Fri, Jul 15, 2016 at 12:14 PM, Alex Wauck <alexwa...@exosite.com> wrote:
> I'm not sure that I can.  I clicked the "Archive" link for the logging-es
> pod and then changed the query in Kibana to "kubernetes_container_name:
> logging-es-cycd8veb && kubernetes_namespace_name: logging".  I got no
> results, instead getting this error:
>
> Index: unrelated-project.92c37428-11f6-11e6-9c83-020b5091df01.2016.07.12
> Shard: 2 Reason: EsRejectedExecutionException[rejected execution (queue
> capacity 1000) on
> org.elasticsearch.search.action.SearchServiceTransportAction$23@6b1f2699]
> Index: unrelated-project.92c37428-11f6-11e6-9c83-020b5091df01.2016.07.14
> Shard: 2 Reason: EsRejectedExecutionException[rejected execution (queue
> capacity 1000) on
> org.elasticsearch.search.action.SearchServiceTransportAction$23@66b9a5fb]
> Index: unrelated-project.92c37428-11f6-11e6-9c83-020b5091df01.2016.07.15
> Shard: 2 Reason: EsRejectedExecutionException[rejected execution (queue
> capacity 1000) on
> org.elasticsearch.search.action.SearchServiceTransportAction$23@512820e]
> Index: unrelated-project.f38ac6ff-3e42-11e6-ab71-020b5091df01.2016.06.29
> Shard: 2 Reason: EsRejectedExecutionException[rejected execution (queue
> capacity 1000) on
> org.elasticsearch.search.action.SearchServiceTransportAction$23@3dce96b9]
> Index: unrelated-project.f38ac6ff-3e42-11e6-ab71-020b5091df01.2016.06.30
> Shard: 2 Reason: EsRejectedExecutionException[rejected execution (queue
> capacity 1000) on
> org.elasticsearch.search.action.SearchServiceTransportAction$23@2f774477]
>
> When I initially clicked the "Archive" link, I saw a lot of messages with
> the kubernetes_container_name "logging-fluentd", which is not what I
> expected to see.
>
>
> On Fri, Jul 15, 2016 at 10:44 AM, Peter Portante <pport...@redhat.com>
> wrote:
>>
>> Can you go back further in the logs to the point where the errors started?
>>
>> I am thinking about possible Java HEAP issues, or possibly ES
>> restarting for some reason.
>>
>> -peter
>>
>> On Fri, Jul 15, 2016 at 11:37 AM, Lukáš Vlček <lvl...@redhat.com> wrote:
>> > Also looking at this.
>> > Alex, is it possible to investigate if you were having some kind of
>> > network connection issues in the ES cluster (I mean between individual
>> > cluster nodes)?
>> >
>> > Regards,
>> > Lukáš
>> >
>> >
>> >
>> >
>> >> On 15 Jul 2016, at 17:08, Peter Portante <pport...@redhat.com> wrote:
>> >>
>> >> Just catching up on the thread, will get back to you all in a few ...
>> >>
>> >> On Fri, Jul 15, 2016 at 10:08 AM, Eric Wolinetz <ewoli...@redhat.com>
>> >> wrote:
>> >>> Adding Lukas and Peter
>> >>>
>> >>> On Fri, Jul 15, 2016 at 8:07 AM, Luke Meyer <lme...@redhat.com> wrote:
>> >>>>
>> >>>> I believe the "queue capacity" there is the number of parallel
>> >>>> searches
>> >>>> that can be queued while the existing search workers operate. It
>> >>>> sounds like
>> >>>> it has plenty of capacity there and it has a different reason for
>> >>>> rejecting
>> >>>> the query. I would guess the data requested is missing given it
>> >>>> couldn't
>> >>>> fetch shards it expected to.
>> >>>>
>> >>>> The number of shards is a multiple (for redundancy) of the number of
>> >>>> indices, and there is an index created per project per day. So even
>> >>>> for a
>> >>>> small cluster this doesn't sound out of line.
>> >>>>
>> >>>> Can you give a little more information about your logging deployment?
>> >>>> Have
>> >>>> you deployed multiple ES nodes for redundancy, and what are you using
>> >>>> for
>> >>>> storage? Could you attach full ES logs? How many OpenShift nodes and
>> >>>> projects do you have? Any history of events that might have resulted
>> >>>> in lost
>> >>>> data?
>> >>>>
>> >>>> On Thu, Jul 14, 2016 at 4:06 PM, Alex Wauck <alexwa...@exosite.com>
>> >>>> wrote:
>> >>>>>
>> >>>>> When doing searches in Kibana, I get error messages similar to
>> 

Re: logging-es errors: shards failed

2016-07-15 Thread Peter Portante
Can you go back further in the logs to the point where the errors started?

I am thinking about possible Java HEAP issues, or possibly ES
restarting for some reason.

-peter

On Fri, Jul 15, 2016 at 11:37 AM, Lukáš Vlček <lvl...@redhat.com> wrote:
> Also looking at this.
> Alex, is it possible to investigate if you were having some kind of network 
> connection issues in the ES cluster (I mean between individual cluster nodes)?
>
> Regards,
> Lukáš
>
>
>
>
>> On 15 Jul 2016, at 17:08, Peter Portante <pport...@redhat.com> wrote:
>>
>> Just catching up on the thread, will get back to you all in a few ...
>>
>> On Fri, Jul 15, 2016 at 10:08 AM, Eric Wolinetz <ewoli...@redhat.com> wrote:
>>> Adding Lukas and Peter
>>>
>>> On Fri, Jul 15, 2016 at 8:07 AM, Luke Meyer <lme...@redhat.com> wrote:
>>>>
>>>> I believe the "queue capacity" there is the number of parallel searches
>>>> that can be queued while the existing search workers operate. It sounds 
>>>> like
>>>> it has plenty of capacity there and it has a different reason for rejecting
>>>> the query. I would guess the data requested is missing given it couldn't
>>>> fetch shards it expected to.
>>>>
>>>> The number of shards is a multiple (for redundancy) of the number of
>>>> indices, and there is an index created per project per day. So even for a
>>>> small cluster this doesn't sound out of line.
>>>>
>>>> Can you give a little more information about your logging deployment? Have
>>>> you deployed multiple ES nodes for redundancy, and what are you using for
>>>> storage? Could you attach full ES logs? How many OpenShift nodes and
>>>> projects do you have? Any history of events that might have resulted in 
>>>> lost
>>>> data?
>>>>
>>>> On Thu, Jul 14, 2016 at 4:06 PM, Alex Wauck <alexwa...@exosite.com> wrote:
>>>>>
>>>>> When doing searches in Kibana, I get error messages similar to "Courier
>>>>> Fetch: 919 of 2020 shards failed".  Deeper inspection reveals errors like
>>>>> this: "EsRejectedExecutionException[rejected execution (queue capacity 
>>>>> 1000)
>>>>> on
>>>>> org.elasticsearch.search.action.SearchServiceTransportAction$23@14522b8e]".
>>>>>
>>>>> A bit of investigation lead me to conclude that our Elasticsearch server
>>>>> was not sufficiently powerful, but I spun up a new one with four times the
>>>>> CPU and RAM of the original one, but the queue capacity is still only 
>>>>> 1000.
>>>>> Also, 2020 seems like a really ridiculous number of shards.  Any idea 
>>>>> what's
>>>>> going on here?
>>>>>
>>>>> --
>>>>>
>>>>> Alex Wauck // DevOps Engineer
>>>>>
>>>>> E X O S I T E
>>>>> www.exosite.com
>>>>>
>>>>> Making Machines More Human.
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>
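For context on the 2020-shard figure quoted above: with one index per project per day, plus replica copies, shard counts climb fast. A back-of-the-envelope sketch with illustrative numbers (not taken from this thread):

```shell
# Shard count grows as: projects x days x shards_per_index x (1 + replicas)
projects=20 days=10 shards_per_index=5 replicas=1
total=$(( projects * days * shards_per_index * (1 + replicas) ))
echo "$total"   # prints 2000
```

So even a modest cluster retaining a couple of weeks of per-project daily indices reaches thousands of shards, which is why the earlier `_cluster/health` output in this archive shows 1643 primaries and 3286 total shards.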
