To reduce that time, you will indeed want to talk directly to the scheduler. That will require rolling up your sleeves a bit and setting up a Thrift client against our API (based on api.thrift [1]), since you will need to specify your tasks in a format the Thermos executor can understand. It turns out this is JSON data, so it should not be *too* prohibitive.
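To give a rough sense of what "JSON the executor understands" means mechanically, here's a minimal Python sketch of building and serializing a task description. To be clear: the field names and the build_task_config helper below are hypothetical placeholders of mine, not Aurora's actual schema — the authoritative structure is defined by api.thrift and the Thermos executor.

```python
import json

def build_task_config(name, cmdline, cpus, ram_mb):
    """Assemble a minimal job description as a plain dict.

    NOTE: every field name here is an illustrative placeholder, NOT the
    real schema -- the authoritative structure is whatever api.thrift
    and the Thermos executor define.
    """
    return {
        "task": {
            "name": name,
            "processes": [{"name": name, "cmdline": cmdline}],
            "resources": {"cpu": cpus, "ram": ram_mb * 1024 * 1024},
        }
    }

# Serializing is cheap; build the dict once per job and send it straight
# to the scheduler instead of going through the full client stack.
payload = json.dumps(build_task_config("hello", "echo hello", 0.1, 16))
```

The point is that assembling and dumping the JSON itself is microseconds of work; the per-job cost you're seeing lives in the client machinery around it.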
However, there is another technical limitation you will hit at the submission rate you are after: the scheduler is backed by a durable store whose write latency is, at minimum, the time required to fsync.

[1]
https://github.com/apache/incubator-aurora/blob/master/api/src/main/thrift/org/apache/aurora/gen/api.thrift

-=Bill

On Wed, Feb 11, 2015 at 11:46 AM, Hussein Elgridly <[email protected]>
wrote:

> Hi folks,
>
> I'm looking at a use case that involves submitting potentially hundreds
> of jobs a second to our Mesos cluster. My tests show that the Aurora
> client is taking 1-2 seconds for each job submission, and that I can run
> about four client processes in parallel before they peg the CPU at 100%.
> I need more throughput than this!
>
> Squashing jobs down to the Process or Task level doesn't really make
> sense for our use case. I'm aware that with some shenanigans I can batch
> jobs together using job instances, but that's a lot of work on my current
> timeframe (and of questionable utility, given that the jobs certainly
> won't have identical resource requirements).
>
> What I really need is (at least) an order-of-magnitude speedup in being
> able to submit jobs to the Aurora scheduler (via the client or
> otherwise).
>
> Conceptually, adding a job to a queue doesn't seem like it should take a
> couple of seconds, so I'm baffled as to why it's taking so long. As an
> experiment, I wrapped the call to client.execute() in
> client.py:proxy_main in cProfile and called aurora job create with a very
> simple test job.
>
> Results of the profile are in the Gist below:
>
> https://gist.github.com/helgridly/b37a0d27f04a37e72bb5
>
> Out of a 0.977s profile time, the two things that stick out to me are:
>
> 1. 0.526s spent in Pystachio for a job that doesn't use any templates
> 2. 0.564s spent in create_job, presumably talking to the scheduler (and
> setting up the machinery for doing so)
>
> I imagine I can sidestep #1 with a check for "{{" in the job file and
> bypass Pystachio entirely. Can I also skip the Aurora client entirely and
> talk directly to the scheduler? If so, what does that entail, and are
> there any associated risks?
>
> Thanks,
> -Hussein
>
> Hussein Elgridly
> Senior Software Engineer, DSDE
> The Broad Institute of MIT and Harvard
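P.S. To put a number on the fsync floor I mentioned above: here's a quick, illustrative sketch for measuring the small-write fsync latency on a given box. Nothing Aurora-specific — it just shows the per-write latency any durable store on that disk is bound by.

```python
import os
import tempfile
import time

def fsync_floor(samples=5):
    """Return the fastest observed small write+fsync cycle, in seconds.

    This approximates the per-write latency floor of any durable store
    on this disk: the scheduler cannot persist a job faster than this.
    """
    fd, path = tempfile.mkstemp()
    try:
        best = float("inf")
        for _ in range(samples):
            start = time.perf_counter()
            os.write(fd, b"x" * 128)
            os.fsync(fd)
            best = min(best, time.perf_counter() - start)
        return best
    finally:
        os.close(fd)
        os.unlink(path)

floor = fsync_floor()
```

Whatever number that prints on your storage is a hard lower bound on per-job persistence time unless writes are batched.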
