I dug into TRequestsTransport and I get it now. Sending raw bytes across a socket is not the same as doing an HTTP POST with said bytes stuffed in the body!
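To make that distinction concrete, here is a minimal stdlib-only sketch of wrapping an already-serialized Thrift message in an HTTP POST instead of writing it to a bare socket. The `/api` path and the `application/x-thrift` content type are assumptions on my part, not taken from the Aurora docs:

```python
import urllib.request

# Hypothetical scheduler endpoint; the real Aurora Thrift servlet path may differ.
SCHEDULER_URL = "http://localhost:8081/api"

def build_thrift_http_request(payload: bytes) -> urllib.request.Request:
    """Wrap serialized Thrift bytes in an HTTP POST, rather than
    writing them straight to a socket as TSocket does."""
    return urllib.request.Request(
        SCHEDULER_URL,
        data=payload,
        headers={"Content-Type": "application/x-thrift"},
        method="POST",
    )

# The same JSON-protocol bytes thriftpy was producing would go in the body:
payload = (b'{"metadata": {"name": "getJobSummary", "seqid": 0, '
           b'"ttype": 1, "version": 1}, "payload": {}}')
req = build_thrift_http_request(payload)
# Actually sending it would then be: urllib.request.urlopen(req).read()
```

The point is only the shape of the exchange: request line, headers, then the Thrift payload as the body, with the server's reply coming back as an HTTP response rather than raw bytes on the same socket.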
I guess I too will be rolling my own HTTP transport...

Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard

On 16 March 2015 at 18:44, Hussein Elgridly <[email protected]> wrote:

> So this has now bubbled back to the top of my TODO list and I'm actively
> working on it. I am entirely new to Thrift, so please forgive the newbie
> questions...
>
> I would like to talk to the Aurora scheduler directly from my (Python)
> application using Thrift. Since I'm on Python 3.4, I've had to use thriftpy:
> https://github.com/eleme/thriftpy
>
> As far as I can tell, the following should work (by default, thriftpy uses
> a TBufferedTransport around a TSocket):
>
> ---
> import thriftpy
> import thriftpy.rpc
>
> aurora_api = thriftpy.load("api.thrift")
>
> client = thriftpy.rpc.make_client(
>     aurora_api.AuroraSchedulerManager,
>     host="localhost", port=8081,
>     proto_factory=thriftpy.protocol.TJSONProtocolFactory())
>
> print(client.getJobSummary())
> ---
>
> Obviously I wouldn't be writing this email if it did work :) It hangs.
>
> I jumped into pdb and found it was sending the following payload:
>
> b'\x00\x00\x00\\{"metadata": {"name": "getJobSummary", "seqid": 0,
> "ttype": 1, "version": 1}, "payload": {}}'
>
> to a socket that looked like this:
>
> <socket.socket fd=3, family=AddressFamily.AF_INET, type=2049, proto=0,
> laddr=('<localhost's_private_ip>', 49167),
> raddr=('<localhost's_private_ip>', 8081)>
>
> ...but was waiting forever to receive any data. Adding a timeout just
> triggered the timeout.
>
> I'm stumped. Any clues?
>
>
> Hussein Elgridly
> Senior Software Engineer, DSDE
> The Broad Institute of MIT and Harvard
>
>
> On 12 February 2015 at 04:15, Erb, Stephan <[email protected]> wrote:
>
>> Hi Hussein,
>>
>> we also had slight performance problems when talking to Aurora. We ended
>> up using the existing python client directly in our code (see
>> apache.aurora.client.api.__init__.py).
>> This allowed us to reuse the api object and its scheduler connection,
>> dropping a connection latency of about 0.3-0.4 seconds per request.
>>
>> Best Regards,
>> Stephan
>>
>> ________________________________________
>> From: Bill Farner <[email protected]>
>> Sent: Wednesday, February 11, 2015 9:29 PM
>> To: [email protected]
>> Subject: Re: Speeding up Aurora client job creation
>>
>> To reduce that time you will indeed want to talk directly to the
>> scheduler. This will definitely require you to roll up your sleeves a
>> bit and set up a thrift client to our api (based on api.thrift [1]),
>> since you will need to specify your tasks in a format that the thermos
>> executor can understand. Turns out this is JSON data, so it should not
>> be *too* prohibitive.
>>
>> However, there is another technical limitation you will hit for the
>> submission rate you are after. The scheduler is backed by a durable
>> store whose write latency is at minimum the amount of time required to
>> fsync.
>>
>> [1]
>> https://github.com/apache/incubator-aurora/blob/master/api/src/main/thrift/org/apache/aurora/gen/api.thrift
>>
>> -=Bill
>>
>> On Wed, Feb 11, 2015 at 11:46 AM, Hussein Elgridly
>> <[email protected]> wrote:
>>
>> > Hi folks,
>> >
>> > I'm looking at a use case that involves submitting potentially
>> > hundreds of jobs a second to our Mesos cluster. My tests show that
>> > the aurora client is taking 1-2 seconds for each job submission, and
>> > that I can run about four client processes in parallel before they
>> > peg the CPU at 100%. I need more throughput than this!
>> >
>> > Squashing jobs down to the Process or Task level doesn't really make
>> > sense for our use case. I'm aware that with some shenanigans I can
>> > batch jobs together using job instances, but that's a lot of work on
>> > my current timeframe (and of questionable utility given that the jobs
>> > certainly won't have identical resource requirements).
>> >
>> > What I really need is (at least) an order of magnitude speedup in
>> > terms of being able to submit jobs to the Aurora scheduler (via the
>> > client or otherwise).
>> >
>> > Conceptually it doesn't seem like adding a job to a queue should be a
>> > thing that takes a couple of seconds, so I'm baffled as to why it's
>> > taking so long. As an experiment, I wrapped the call to
>> > client.execute() in client.py:proxy_main in cProfile and called
>> > aurora job create with a very simple test job.
>> >
>> > Results of the profile are in the Gist below:
>> >
>> > https://gist.github.com/helgridly/b37a0d27f04a37e72bb5
>> >
>> > Out of a 0.977s profile time, the two things that stick out to me are:
>> >
>> > 1. 0.526s spent in Pystachio for a job that doesn't use any templates
>> > 2. 0.564s spent in create_job, presumably talking to the scheduler
>> > (and setting up the machinery for doing so)
>> >
>> > I imagine I can sidestep #1 with a check for "{{" in the job file and
>> > bypass Pystachio entirely. Can I also skip the Aurora client entirely
>> > and talk directly to the scheduler? If so, what does that entail, and
>> > are there any risks associated?
>> >
>> > Thanks,
>> > -Hussein
>> >
>> > Hussein Elgridly
>> > Senior Software Engineer, DSDE
>> > The Broad Institute of MIT and Harvard
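A footnote on the payload captured in pdb earlier in the thread: the leading b'\x00\x00\x00\\' is a 4-byte big-endian length prefix (backslash is 0x5c, i.e. 92, the byte length of the JSON that follows), which is socket-level framing an HTTP server will never answer. A quick sketch of decoding it:

```python
import struct

# Payload exactly as captured in pdb: 4-byte big-endian length, then JSON.
payload = (b'\x00\x00\x00\\'
           b'{"metadata": {"name": "getJobSummary", "seqid": 0, '
           b'"ttype": 1, "version": 1}, "payload": {}}')

(length,) = struct.unpack(">I", payload[:4])  # 0x5c == 92
body = payload[4 : 4 + length]

print(length)       # 92
print(body[:12])    # b'{"metadata":'
```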
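On Stephan's suggestion: the win comes from paying connection setup once instead of per request. A toy illustration of the pattern, using a made-up stand-in class rather than the real client under apache.aurora.client.api:

```python
class SchedulerConnection:
    """Stand-in for a client whose construction is expensive (~0.3-0.4s)."""
    instances_created = 0

    def __init__(self):
        # Pretend this is the costly connection/auth handshake.
        SchedulerConnection.instances_created += 1

    def submit(self, job):
        return "submitted " + job

# Anti-pattern: a fresh connection for every job, as when shelling out
# to the CLI once per submission.
for job in ["job-a", "job-b", "job-c"]:
    SchedulerConnection().submit(job)
assert SchedulerConnection.instances_created == 3

# Pattern: build the client once and reuse it for every submission.
SchedulerConnection.instances_created = 0
conn = SchedulerConnection()
results = [conn.submit(job) for job in ["job-a", "job-b", "job-c"]]
assert SchedulerConnection.instances_created == 1
```

With hundreds of submissions a second, amortizing that per-request setup is the first-order saving before any protocol-level work.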
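For anyone wanting to reproduce the profiling step from the original mail, this is the general shape of wrapping a call in cProfile, using a dummy function here in place of the real client.execute():

```python
import cProfile
import io
import pstats

def create_job():
    # Stand-in for the client.execute() call being profiled.
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
create_job()
profiler.disable()

# Sort by cumulative time to see where the wall time actually goes.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
print(report)
```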
