As I think I mentioned to some of you earlier, I’ve been having real trouble 
holding together a handin server for a class of 100 students all writing Typed 
Racket code. In order to dig deeper, I set up a test machine and a repeatable 
test harness, and I discovered some things that really surprised me; perhaps 
there are easy fixes for some of these, perhaps not.

I’m running the handin server on a 2-core, 4 GB Linode server. There’s nothing 
else running on the machine, but since I’m running only a single Racket process, 
I don’t expect it to use the other core at all.

Under ideal circumstances (a totally un-loaded server), it looks like a correct 
submission should run in about 10-12 seconds end-to-end.

My peak loads are around 200 submissions per hour. To model an interval on the 
high end, I simulated a 500-submissions-per-hour rate by randomly selecting 25 
delays in a 3-minute period, and starting (on a different, remote machine) a 
thread for each of these that waits the appropriate time and then tries to log 
in and submit the code. Each thread logs in as a different user. The server is 
configured with a two-minute timeout. Every submission is of my sample 
solution, which passes all of the tests.
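For concreteness, here’s a rough Python sketch of the load-generation side 
(names are illustrative, not the actual harness): pick 25 random offsets in a 
3-minute window, then start one thread per simulated user that sleeps until its 
offset and submits.

```python
import random
import threading
import time

def make_delays(n=25, window_s=180.0, rng=None):
    """n random start offsets in a window_s-second period.
    25 in 3 minutes corresponds to a 500-per-hour rate."""
    rng = rng or random.Random()
    return sorted(rng.uniform(0.0, window_s) for _ in range(n))

def run_burst(submit, delays):
    """One thread per user: wait for its offset, then call submit.
    `submit` stands in for the log-in-and-submit step."""
    def worker(user_id, delay):
        time.sleep(delay)
        submit(user_id)
    threads = [threading.Thread(target=worker, args=(i, d))
               for i, d in enumerate(delays)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```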

My basic metric is “successes per minute”; that is, ignoring all of the 
timeouts, how many can we get through in each minute. At more than 8 
submissions per minute of offered load, against 10-12 seconds per check, 
there’s absolutely no way for the server to keep up with the test load, so some 
failures are inevitable.
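For what it’s worth, the arithmetic behind that claim, assuming the 
10-12-second per-check figure from above:

```python
rate_per_min = 500 / 60    # offered load: ~8.33 submissions per minute
best_case = 60 / 10        # 6.0 checks/minute if each takes 10 seconds
worst_case = 60 / 12       # 5.0 checks/minute at 12 seconds each
# Sequential single-core capacity (5-6/min) is well below the load.
assert rate_per_min > best_case > worst_case
```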

As I think I may have mentioned before, the base results are discouraging; the 
handin server successfully checks 3.16 per minute, but (okay, I do have other 
metrics) at the end, nearly all of them are skating on the edge of the timeout, 
so that many students are waiting for fully two minutes and then being told 
that their program might have a bug that makes it loop forever.

The first thing I did was just to put a single semaphore in place with an 
initial value of 1 and a fail-thunk that tells the user that there are too many 
submissions being processed. So: at most one at a time, with fast fail. 
Surprisingly, this did about as well as the base model, and the response times 
were, unsurprisingly, much, much shorter. Of course, 15 of the 25 submissions 
were rejected. Also, students here aren’t being told that their code might be 
bad; they’re just being told that the server is busy, which is arguably better. 
On the other hand, failed submissions aren’t even stored by the server; I could 
perhaps tighten the code protected by the semaphore to solve this.
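In Python terms (a model of the call-with-semaphore-plus-fail-thunk pattern, 
not the handin server’s actual code; the function names are made up), the 
change looks roughly like this:

```python
import threading

# Initial value = maximum simultaneous checks; 1 means one at a time.
checker_slots = threading.Semaphore(1)

def handle_submission(check, reject):
    """Fail-fast variant: if no slot is free, run the reject
    (fail-thunk) path immediately instead of queueing."""
    if not checker_slots.acquire(blocking=False):
        return reject("too many submissions being processed; try again")
    try:
        return check()
    finally:
        checker_slots.release()
```

Raising the semaphore’s initial value is then the knob for allowing more 
simultaneous checks.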

The next thing I did was to change the call-with-semaphore to comment out the 
fail-thunk, so that later submissions would just wait politely until the 
earlier ones were done.
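In the same hypothetical Python model, dropping the fail-thunk turns this into 
a plain blocking acquire: every submission waits its turn, holding its SSL 
connection and any buffered state alive for the whole wait.

```python
import threading

checker_slots = threading.Semaphore(1)

def handle_submission_blocking(check):
    """No fail-thunk: block until a slot frees up, then run the check.
    Queued submissions keep their connections open while they wait."""
    with checker_slots:
        return check()
```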

Surprisingly, this did *terribly*. The server only processed 1.9 submissions 
per minute, and only successfully got through 9 submissions. The rest, as in 
the initial scenario, were put on ice for longer and longer periods of time. It 
didn’t even start out well; the very first ones were already horribly slow. 

I thought about this for a while, and it appears that the delayed 
submissions (the ones waiting for the semaphore) are somehow consuming a huge 
amount of some resource, and I’d have to guess it’s memory. I can’t really see 
why this would be the case; each one has only an open SSL connection and a 
20 KB file to process. In fact, I can see that the log is periodically dumping 
a chunk of 30 lines corresponding to waiting processes, where one is of size 
(say) 30 megabytes and all of the other waiting ones are tiny (20 KB, 22 KB, 
something like that). What I’m trying to figure out is why having waiting 
processes would be so expensive.

Very strange (to me).

It looks like the best setting for me right now is to set a limit of no more 
than two simultaneous submissions, and definitely to use a fail-thunk to drop 
submissions above this limit. This brings me up to almost 5 processed per 
minute, with mostly reasonable response times.

The *real* solution, of course, is to actually shorten the processing times, 
either by multi-threading or by speeding up Typed Racket (TR) compilation, 
which consumes something like 80-85% of the testing time.

In case you want to see the 10-line change I actually made to the server, the 
diff is here:

https://github.com/jbclements/handin/commit/efb0e860e0fc675310229b333bc77ddd0bbe171b

Thanks for reading. Any suggestions appreciated!

Best,

John Clements
