Hi Till,

Thank you for the reply. I have posted some logs in the initial email chain. I think the issue has more to do with the Docker private registry when authorization is involved. I can run the JobManager and the TaskManager in Docker as separate Marathon tasks and have them connect via the RPC port. What I was trying to do is run via the Mesos app master, so that the JobManager itself launches the TaskManagers as part of the framework.
Thank you again,
~ Biswajit

On Fri, Aug 4, 2017 at 3:17 AM, Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Biswajit,
>
> are there any Mesos logs which might help us pinpoint the problem? I've
> actually never run Flink on Mesos with Docker images. But it could be that
> Flink does not set things up properly for running Docker images. I'll try
> to run Flink based on Docker images over the weekend in order to see
> whether I can reproduce the problem.
>
> Cheers,
> Till
>
> On Wed, Aug 2, 2017 at 8:48 PM, Biswajit Das <biswajit...@gmail.com>
> wrote:
>
>> Hi there,
>>
>> I posted this here in the group a few days back, and since then I
>> have been exchanging emails with Eron. Thanks to Eron for all the tips.
>> Now I see this basic auth error. I'm a little confused how the JobManager
>> launched fine while the TaskManager fails to authenticate.
>> Also, the Mesos docs say that by default authenticate is false, so it should
>> not have gone there. Do I have to disable it somewhere inside Flink? I don't
>> see any config or property in the code.
>>
>> This is kind of a blocker for me now for the Mesos deployment. I would
>> really appreciate any input or suggestions.
>>
>> ~ Biswajit
>>
>> ---------- Forwarded message ----------
>> From: Eron Wright <ewri...@live.com>
>> Date: Wed, Aug 2, 2017 at 10:51 AM
>> ------------------------------
>> *From:* Biswajit Das <biswajit...@gmail.com>
>> *Sent:* Wednesday, August 2, 2017 10:19:45 AM
>> *To:* Eron Wright
>> *Subject:* Re: Flink -mesos-app master hang
>>
>> Hi Eron,
>>
>> Good morning. I'm really sorry for flooding you with questions. I'll post
>> this one to the user group as well.
>> I could narrow down the actual error thrown by Mesos: it seems the JM is
>> somehow not able to authenticate. I'm a little confused whether it is a
>> *Docker private registry TLS error* or something else. I have started the
>> slave even with --docker_config; previously I was mostly using docker.tar.gz
>> with the container for private repo authentication.
>>
>> 017-08-02 03:32:54,163 WARN org.apache.flink.mesos.scheduler.TaskMonitor - Mesos task taskmanager-00003 failed unexpectedly.
>> 2017-08-02 03:32:54,163 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - *Mesos task taskmanager-00003 failed, with a TaskManager in launch or registration. State: TASK_FAILED Reason: REASON_CONTAINER_LAUNCH_FAILED (Failed to launch container: Unexpected WWW-Authenticate header format: 'Basic realm="Registry Realm"')*
>> 2017-08-02 03:32:54,163 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Diagnostics for task taskmanager-00003 in state TASK_FAILED :
>>   reason=REASON_CONTAINER_LAUNCH_FAILED
>>   message=Failed to launch container: Unexpected WWW-Authenticate header format: 'Basic realm="Registry Realm"'
>> 2017-08-02 03:32:54,163 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Total number of failed tasks so far: 3
>> 2017-08-02 03:32:54,164 ERROR org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Stopping Mesos session because the number of failed tasks (3) exceeded the maximum failed tasks (2). This number is controlled by the 'mesos.maximum-failed-tasks' configuration setting. By default its the number of requested tasks.
>> 2017-08-02 03:32:54,164 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Shutting down cluster with status FAILED : Stopping Mesos session because the number of failed tasks (3) exceeded the maximum failed tasks (2). This number is controlled by the 'mesos.maximum-failed-tasks' configuration setting. By default its the number of requested tasks.
>> 2017-08-02 03:32:54,164 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Shutting down and unregistering as a Mesos framework.
>> 2017-08-02 03:32:54,171 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Stopping Mesos resource master
>> root@ip-172-31-4-44:/etc/me
>>
>> On Tue, Aug 1, 2017 at 1:53 PM, Eron Wright <ewri...@live.com> wrote:
>>
>>> I think you're on the right track, in trying to configure the docker
>>> image provider. This is on Linux, right, and you definitely restarted the
>>> agents?
>>>
>>> An important difference between the JM and the TM is that the JM is a
>>> task launched by the Marathon framework, whereas the TM is a task launched
>>> by the JM framework. The respective configurations and behaviors are
>>> different. For example, I see that Marathon is launching the JM with the
>>> Docker containerizer, whereas the JM is launching the TM with the Mesos
>>> containerizer (with Docker image provider support). The Mesos
>>> containerizer is more modern and preferred, and I don't think Flink
>>> supports anything else.
>>>
>>> The doc I linked to shows how to launch a docker image-based container
>>> with mesos-execute. Using mesos-execute to verify your cluster
>>> configuration is a good idea, to isolate any issue. For example, see if
>>> you can launch a container using the Mesos containerizer and the Docker
>>> image provider, executing a simple command such as 'sleep'.
>>>
>>> Eron
>>> ------------------------------
>>> *From:* Biswajit Das <biswajit...@gmail.com>
>>> *Sent:* Tuesday, August 1, 2017 10:02:51 AM
>>> *To:* Eron Wright
>>> *Subject:* Re: Flink -mesos-app master hang
>>>
>>> Hi Eron,
>>>
>>> Thank you for the email, I really appreciate your reply.
>>>
>>> That's what is confusing me. I have been running Mesos with containers
>>> on both staging and production for almost a year now, mostly with
>>> Spark/Presto load, everything containerized, on a fairly big cluster.
>>> Here is one of my slave configs.
>>> One interesting part here is that the app master is launched and I can
>>> access the JobManager web UI from the Mesos framework. I can also see that
>>> it has registered itself as the `flink` framework. The only odd thing I'm
>>> seeing is that the TaskManager count shows `0`, although I asked it to
>>> create 2 instances.
>>>
>>> /usr/sbin/mesos-slave --master=zk://XXX/mesos --log_dir=/var/log/mesos
>>> --attributes=environment:dev;agent_role:generic
>>> *--containerizers=docker,mesos*
>>> --executor_registration_timeout=10mins --hostname=XXX
>>> *--image_providers=appc,docker*
>>> --ip=XXX *--isolation=filesystem/linux,docker/runtime*
>>> --resources=ports(*):[0-65535] --work_dir=/var/lib/mesos
>>>
>>> Previously I never had *--image_providers and --isolation*. After
>>> seeing this error I added these two, but they did not help much. I'm
>>> running on Ubuntu with Mesos 1.1.0 and submitting the job with Marathon.
>>>
>>> I have tried toggling the Mesos debug log; there is not much info other
>>> than it got a signal to kill the framework.
>>>
>>> Marathon JSON task (the hostname constraint restricts it to one host for
>>> debugging, as I have a fairly big cluster):
>>>
>>>> {
>>>>   "id": "/flink-app-master",
>>>>   "cmd": null,
>>>>   "cpus": 2,
>>>>   "mem": 4096,
>>>>   "disk": 10000,
>>>>   "instances": 1,
>>>>   "constraints": [
>>>>     ["hostname", "LIKE", "xxx"]
>>>>   ],
>>>>   "acceptedResourceRoles": ["*"],
>>>>   "container": {
>>>>     "type": "DOCKER",
>>>>     "volumes": [],
>>>>     "docker": {
>>>>       "image": "docker.xx.xx/flink:1.8.0",
>>>>       "network": "HOST",
>>>>       "portMappings": [],
>>>>       "privileged": false,
>>>>       "parameters": [],
>>>>       "forcePullImage": false
>>>>     }
>>>>   },
>>>>   "env": {
>>>>     "MESOS_MASTER": "zk://XX/mesos"
>>>>   },
>>>>   "portDefinitions": [
>>>>     {
>>>>       "port": 9081,
>>>>       "protocol": "tcp",
>>>>       "name": "default",
>>>>       "labels": {}
>>>>     }
>>>>   ],
>>>>   "uris": ["file:///etc/docker.tar.gz"],
>>>>   "fetch": [
>>>>     {
>>>>       "uri": "file:///etc/docker.tar.gz",
>>>>       "extract": true,
>>>>       "executable": false,
>>>>       "cache": false
>>>>     }
>>>>   ]
>>>> }
>>>
>>> On Tue, Aug 1, 2017 at 7:22 AM, Eron Wright <ewri...@live.com> wrote:
>>>
>>>> From the error message it seems that your Mesos cluster doesn't have
>>>> the docker image provisioner installed. The message originates from Mesos
>>>> anyway, so the problem lies there. Note that docker image support is
>>>> provided on Linux only. You can also use the Flink on Mesos support
>>>> without images, if you make sure that JAVA_HOME is set on all executors.
>>>>
>>>> Hope this helps!
>>>>
>>>> http://mesos.apache.org/documentation/latest/container-image/
>>>>
>>>> From: Biswajit Das
>>>> Sent: Tuesday, August 1, 1:24 AM
>>>> Subject: Re: Flink -mesos-app master hang
>>>> To: ewri...@live.com
>>>>
>>>> Hi Eron, I came across some of your comments in JIRA and wanted
>>>> to clarify this ^^. I'm running a little clueless here; any pointers on
>>>> where to look?
>>>>
>>>> -----------------------------------------------
>>>> 2017-08-01 07:26:34,688 INFO org.apache.flink.mesos.scheduler.LaunchCoordinator - Waiting for more offers; 1 task(s) are not yet launched.
>>>> 2017-08-01 07:26:34,717 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Launching Mesos task taskmanager-00039 on host 172.31.5.212.
>>>> 2017-08-01 07:26:34,731 WARN org.apache.flink.mesos.scheduler.TaskMonitor - Mesos task taskmanager-00039 failed unexpectedly.
>>>> *2017-08-01 07:26:34,733 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Mesos task taskmanager-00039 failed, with a TaskManager in launch or registration. State: TASK_FAILED Reason: REASON_CONTAINER_LAUNCH_FAILED (Failed to launch container: Unsupported container image type: DOCKER)*
>>>> 2017-08-01 07:26:34,733 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Diagnostics for task taskmanager-00039 in state TASK_FAILED :
>>>>   reason=REASON_CONTAINER_LAUNCH_FAILED
>>>>   message=Failed to launch container: Unsupported container image type: DOCKER
>>>> 2017-08-01 07:26:34,733 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Total number of failed tasks so far: 3
>>>> 2017-08-01 07:26:34,734 ERROR org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Stopping Mesos session because the number of failed tasks (3) exceeded the maximum failed tasks (2). This number is controlled by the 'mesos.maximum-failed-tasks' configuration setting. By default its the number of requested tasks.
>>>> 2017-08-01 07:26:34,734 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Shutting down cluster with status FAILED : Stopping Mesos session because the number of failed tasks (3) exceeded the maximum failed tasks (2). This number is controlled by the 'mesos.maximum-failed-tasks' configuration setting. By default its the number of requested tasks.
>>>> 2017-08-01 07:26:34,734 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Shutting down and unregistering as a Mesos framework.
>>>> 2017-08-01 07:26:34,745 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Stopping Mesos resource master
>>>> 2017-08-01 07:26:34,745 INFO org.apache.f
>>>> ---------------------------------------------------
>>>>
>>>> Thank you in advance.
>>>> ~Biswajit
>>>>
>>>> On Sun, Jul 30, 2017 at 12:42 PM, Biswajit Das <biswajit...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi All,
>>>> I'm trying to run a Flink Docker image from Marathon with the Mesos app
>>>> master. I can see that it goes into a continuous loop and fails to launch
>>>> the TaskManager. If I go to the Mesos master UI, I can see the JobManager
>>>> web UI with zero TaskManagers.
>>>>
>>>> I have checked pretty much every possible log, starting from docker.log
>>>> on the Ubuntu machine and the Mesos master/slave logs. There is pretty
>>>> much no information other than the failed task; the only relevant output
>>>> is the Flink log below. However, I'm able to run the same Docker image if
>>>> I run the JobManager and TaskManager by themselves in Marathon and let
>>>> them connect via the JobManager RPC port.
>>>>
>>>> For the Mesos config, I'm using the details below from the yml:
>>>> mesos.master: ${MESOS_MASTER}
>>>> mesos.failover-timeout: 60
>>>> mesos.initial-tasks: ${INITIAL_TASK_MANAGERS}
>>>> mesos.resourcemanager.tasks.mem: ${RESOURCEMANAGER_TASKS_MEM:-4096}
>>>> mesos.resourcemanager.tasks.cpus: ${RESOURCEMANAGER_TASKS_CPU:-1}
>>>> mesos.resourcemanager.tasks.container.type: docker
>>>> mesos.resourcemanager.tasks.container.image.name: ${IMAGE_NAME}
>>>>
>>>> ---------------------------
>>>> 07-30 02:05:48,351 WARN org.apache.flink.mesos.scheduler.TaskMonitor - Mesos task taskmanager-00002 failed unexpectedly.
>>>> 2017-07-30 02:05:48,352 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Mesos task taskmanager-00002 failed, with a TaskManager in launch or registration. State: TASK_FAILED Reason: REASON_COMMAND_EXECUTOR_FAILED (Container exited with status 127)
>>>> -----------------------------------------------------
>>>>
>>>> Please let me know if anyone has any pointers for debugging this further.
>>>>
>>>> ~ Biswajit
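
A side note on the `Unexpected WWW-Authenticate header format` failure quoted in this thread: the registry answered the image pull with a `Basic` challenge. One possible reading is that the image puller on the agent expected a token-style (`Bearer`) challenge instead; that is an assumption, not something confirmed anywhere in the thread. As a minimal sketch for checking which scheme a registry advertises, the Python below parses a challenge string. The function name `parse_www_authenticate` and the sample `Bearer` values are hypothetical and only for illustration; this is not how Mesos itself parses the header.

```python
import re


def parse_www_authenticate(header):
    """Split a WWW-Authenticate challenge into (scheme, params).

    Illustrative helper only; it handles a single challenge of the form
    'Scheme key="value",key2="value2"'.
    """
    scheme, _, rest = header.strip().partition(" ")
    params = {}
    # Match key="quoted value" or key=bare-token pairs.
    for match in re.finditer(r'(\w+)=(?:"([^"]*)"|([^\s,]+))', rest):
        key = match.group(1)
        quoted, bare = match.group(2), match.group(3)
        params[key] = quoted if quoted is not None else bare
    return scheme, params


if __name__ == "__main__":
    # Challenge string copied from the log in this thread.
    print(parse_www_authenticate('Basic realm="Registry Realm"'))
    # Hypothetical token-style challenge, for comparison only.
    print(parse_www_authenticate(
        'Bearer realm="https://auth.example.invalid/token",'
        'service="registry.example.invalid"'
    ))
```

The header itself can be read from the 401 response a Docker v2 registry returns for an unauthenticated request to its `/v2/` endpoint, so the check can be done from any host that can reach the registry.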