[jira] [Commented] (MESOS-10143) Outstanding Offers accumulating
[ https://issues.apache.org/jira/browse/MESOS-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178186#comment-17178186 ]

Puneet Kumar commented on MESOS-10143:
--------------------------------------

We reduced the average latency of Scheduler::resourceOffers from ~50 ms to ~15 ms by eliminating non-essential workload from our Java application that implements the Scheduler callbacks. Since then, we have not observed this issue. Our application is handling ~400 offers per second at an average latency of 15 ms for Scheduler::resourceOffers.

[~bmahler] Thanks for describing how to obtain metrics for the scheduler library.

> Outstanding Offers accumulating
> -------------------------------
>
>                 Key: MESOS-10143
>                 URL: https://issues.apache.org/jira/browse/MESOS-10143
>             Project: Mesos
>          Issue Type: Bug
>          Components: master, scheduler driver
>    Affects Versions: 1.7.0
>         Environment: Mesos Version 1.7.0
> JDK 8.0
>            Reporter: Puneet Kumar
>            Priority: Minor
>
> We manage an Apache Mesos cluster version 1.7.0. We have written a framework
> in Java that schedules tasks to Mesos master at a rate of 300 TPS. Everything
> works fine for almost 24 hours but then outstanding offers accumulate &
> saturate within 15 minutes. Outstanding offers aren't reclaimed by Mesos
> master. We observe "RescindOffer" messages in verbose (GLOG v=3) framework
> logs but outstanding offers don't reduce. New resources aren't offered to
> framework when outstanding offers saturate. We have to restart the scheduler
> to reset outstanding offers to zero.
> Any suggestions to debug this issue are welcome.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
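One common way to keep a callback such as Scheduler::resourceOffers short is to hand the offers to a worker thread and return immediately. This is not necessarily what was done in the comment above, which only says that non-essential work was removed. The sketch below is a minimal illustration under that assumption: it uses plain offer ID strings instead of the Mesos Java types, and the class and method names (ThinOfferCallback, handle, shutdownAndWait) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a scheduler callback class; it does not use the Mesos API.
final class ThinOfferCallback {
    // Single worker keeps offers in arrival order while the callback thread returns quickly.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final AtomicInteger processed = new AtomicInteger();

    // Stand-in for Scheduler::resourceOffers: copy the batch, enqueue it, return.
    void resourceOffers(List<String> offerIds) {
        List<String> batch = new ArrayList<>(offerIds);
        worker.execute(() -> handle(batch));
    }

    // Placeholder for the real work (matching tasks, launching, declining).
    private void handle(List<String> offerIds) {
        processed.addAndGet(offerIds.size());
    }

    int processedCount() {
        return processed.get();
    }

    // Stops accepting new batches and waits for queued ones to finish.
    boolean shutdownAndWait(long timeoutMs) {
        worker.shutdown();
        try {
            return worker.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        ThinOfferCallback callback = new ThinOfferCallback();
        callback.resourceOffers(Arrays.asList("O1", "O2", "O3"));
        callback.shutdownAndWait(5_000);
        System.out.println("Processed offers: " + callback.processedCount());
    }
}
```

A hand-off like this only shortens the callback itself. The worker still has to accept or decline every offer promptly, otherwise the backlog simply moves into the worker's queue and the offers stay outstanding on the master.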
[jira] [Commented] (MESOS-10143) Outstanding Offers accumulating
[ https://issues.apache.org/jira/browse/MESOS-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166505#comment-17166505 ]

Benjamin Mahler commented on MESOS-10143:
-----------------------------------------

[~puneetku287] It looks like the scheduler native library is getting backlogged. This can happen when the scheduler cannot process messages as fast as they arrive from the master. (In your example, Scheduler::resourceOffers took 6.8 ms, which is long.)

To check this the next time it happens, you can hit {{http://IP:PORT/metrics/snapshot}} on the scheduler library. If the library is backlogged, the request should not return, because the scheduler metrics cannot be computed in a timely manner. You can also specify a timeout via {{http://IP:PORT/metrics/snapshot?timeout=10secs}}, in which case you should see a response without the {{scheduler/event_queue_messages}} metric present.

To do this, you may want to fix the port of the scheduler library by setting LIBPROCESS_PORT=X in your environment before instantiating the library.
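The check described above can be scripted. The following is a minimal sketch, assuming the scheduler process was started with LIBPROCESS_PORT set to a fixed value and that the snapshot is returned as a flat JSON object keyed by metric name. The host, the port 9090, and the class and method names are placeholders, not part of Mesos. The key match tolerates slashes escaped as `\/`, which is an assumption about how the endpoint serializes metric names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; only the URL path, the timeout parameter, and the
// metric name come from the comment above.
final class SchedulerMetricsProbe {
    static final String QUEUE_METRIC = "scheduler/event_queue_messages";

    // True if the snapshot JSON contains the event queue metric as a key.
    static boolean hasQueueMetric(String snapshotJson) {
        String normalized = snapshotJson.replace("\\/", "/"); // undo escaped slashes, if any
        return normalized.contains("\"" + QUEUE_METRIC + "\"");
    }

    // Fetches /metrics/snapshot with the server-side timeout parameter.
    static String fetchSnapshot(String host, int port, int timeoutSecs) throws IOException {
        URL url = new URL("http://" + host + ":" + port
                + "/metrics/snapshot?timeout=" + timeoutSecs + "secs");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(5_000);
        // Client read timeout is slightly longer than the server-side timeout.
        conn.setReadTimeout((timeoutSecs + 5) * 1_000);
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        String host = args.length > 0 ? args[0] : "127.0.0.1";       // placeholder host
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9090; // placeholder LIBPROCESS_PORT
        String snapshot = fetchSnapshot(host, port, 10);
        System.out.println(hasQueueMetric(snapshot)
                ? "Queue metric present: the library answered in time."
                : "Queue metric missing: the scheduler library may be backlogged.");
    }
}
```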
[jira] [Commented] (MESOS-10143) Outstanding Offers accumulating
[ https://issues.apache.org/jira/browse/MESOS-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153796#comment-17153796 ]

Puneet Kumar commented on MESOS-10143:
--------------------------------------

[~greggomann], in the following example, 3 offers were sent by the Mesos master at 05:59:07 but received by the scheduler at 06:16:26.

+Logs from Mesos Master:+

I0708 05:59:07.097918 5776 master.cpp:9722] Sending offers [ 64bb9634-4038-448b-8323-a9877f51f524-O10488046, 64bb9634-4038-448b-8323-a9877f51f524-O10488047, 64bb9634-4038-448b-8323-a9877f51f524-O10488048 ] to framework black-falcon-scheduler-puneetku (Black Falcon) at scheduler-a1fe5552-c044-4ff0-800c-9dbae2c27c45@10.162.31.219:40391

+Logs from the Mesos native library (which has Java bindings) after setting GLOG_v=3:+

I0708 06:16:26.478953 80704 sched.cpp:890] Received 3 offers
I0708 06:16:26.478965 80704 pid.cpp:91] Attempting to parse 'slave(1)@10.89.65.85:5051' into a PID
I0708 06:16:26.478971 80704 sched.cpp:900] Saving PID 'slave(1)@10.89.65.85:5051'
I0708 06:16:26.478981 80704 pid.cpp:91] Attempting to parse 'slave(1)@10.89.103.172:5051' into a PID
I0708 06:16:26.478986 80704 sched.cpp:900] Saving PID 'slave(1)@10.89.103.172:5051'
I0708 06:16:26.478993 80704 pid.cpp:91] Attempting to parse 'slave(1)@10.88.84.46:5051' into a PID
I0708 06:16:26.478999 80704 sched.cpp:900] Saving PID 'slave(1)@10.88.84.46:5051'
I0708 06:16:26.485853 80704 sched.cpp:914] Scheduler::resourceOffers took 6.788912ms

+Logs from the scheduler's resourceOffers method, which is implemented in Java:+

08 Jul 2020 06:16:26,479 [INFO] (Thread-14611242) com.blackfalconservice.mesos.scheduler.MesosScheduler$$EnhancerByGuice$$c8e6e266: resourceOffers: Offer Received 64bb9634-4038-448b-8323-a9877f51f524-O10488046,64bb9634-4038-448b-8323-a9877f51f524-O10488047,64bb9634-4038-448b-8323-a9877f51f524-O10488048

All the offers that were sent between 05:59:07 and 06:16:26 were counted as outstanding_offers by the Mesos master. This gap keeps growing from minutes to hours until I restart the scheduler process.

How can I find out where these outstanding offers are queued for minutes before being delivered to the scheduler, and why is there a delay of minutes?
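The delivery delay in the example above can be computed directly from the glog prefixes of the master's "Sending offers" line and the library's "Received 3 offers" line. A small sketch, assuming the clocks on both hosts are synchronized and taking the year 2020 from the Java log line, since glog prefixes omit the year. The class and method names are illustrative only.

```java
import java.time.Duration;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;

// Hypothetical helper for comparing two glog-formatted log lines.
final class GlogDelay {
    // Parses a prefix such as "I0708 05:59:07.097918 ..." (level, MMDD, time).
    static LocalDateTime parseGlogTimestamp(String line, int year) {
        String[] fields = line.trim().split("\\s+", 3);
        int month = Integer.parseInt(fields[0].substring(1, 3));
        int day = Integer.parseInt(fields[0].substring(3, 5));
        LocalTime time = LocalTime.parse(fields[1]); // HH:mm:ss.SSSSSS
        return LocalDate.of(year, month, day).atTime(time);
    }

    public static void main(String[] args) {
        LocalDateTime sent = parseGlogTimestamp(
                "I0708 05:59:07.097918 5776 master.cpp:9722] Sending offers", 2020);
        LocalDateTime received = parseGlogTimestamp(
                "I0708 06:16:26.478953 80704 sched.cpp:890] Received 3 offers", 2020);
        Duration delay = Duration.between(sent, received);
        // 1039 whole seconds between the two lines, i.e. 17 min 19 s.
        System.out.printf("Offer delivery delay: %d min %d s%n",
                delay.toMinutes(), delay.getSeconds() % 60);
    }
}
```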
[jira] [Commented] (MESOS-10143) Outstanding Offers accumulating
[ https://issues.apache.org/jira/browse/MESOS-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152811#comment-17152811 ]

Greg Mann commented on MESOS-10143:
-----------------------------------

[~puneetku287] It's unclear to me from the description whether this is an issue in Mesos or in your scheduler. A more precise description of the framework's behavior during the incidents would help: what does the scheduler do with the offers during this time?

Feel free to find us on Mesos Slack; that might be an easier place to have a synchronous discussion about your issue.