Folks,

I have done a lot of experiments over the past few weeks and came to a few interesting conclusions. First some background, then issues, solutions and conclusions.

I wrote a test harness for a poker server that understands the different binary packets and can send and receive them. The harness launches each "script" in a separate unbound thread that connects to the server via TCP and does its work.
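To make the threading model concrete, here is a minimal sketch of launching each script in its own unbound thread with forkIO and collecting the outcomes. The name runScripts and the idea of representing a script as a plain IO action are assumptions for illustration, not the harness's actual code.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Control.Exception (SomeException, try)

-- | Run every script in its own unbound (forkIO) thread and wait for all
-- of them. A script that throws yields a Left instead of killing the run.
-- (Hypothetical helper; a real script would connect over TCP first.)
runScripts :: [IO a] -> IO [Either SomeException a]
runScripts scripts = do
  slots <- mapM launch scripts   -- start all threads first
  mapM takeMVar slots            -- then wait for each result in order
  where
    launch :: IO a -> IO (MVar (Either SomeException a))
    launch script = do
      slot <- newEmptyMVar
      _ <- forkIO (try script >>= putMVar slot)
      return slot
```

Each script gets one MVar as its result slot, so a failing script shows up as a Left value rather than bringing down the whole run.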

The main goals of the project were: easy scripting, very high number of connections from the harness (a few thousand) and running on Windows. I develop on Mac OSX but have a Windows machine for testing and to run the poker server.

Another key goal was to support the server encryption. SSL encryption is done in a weird way that requires attaching read/write OpenSSL BIOs to the SSL descriptor so that SSL encrypts to/from memory. Encrypted chunks are then taken from the BIOs and sent as payload in server packets.

Overall, I probably spent about 4 weeks writing the server and about 2 more weeks grappling with the various issues. The issues centered around 1) the program trashing memory like no tomorrow, 2) intermittent crashes on Windows and 3) not being able to launch a high number of connections on Windows before crashing.

I significantly improved trashing of memory by switching to plain Haskell structures from nested lists of wxHaskell-style properties (attr := value). Intermittent crashes were harder to troubleshoot, especially given that things were running smoothly on Mac OSX.

Stack traces pointed into libcrypto (part of OpenSSL) and thus to the BIOs that I was allocating. I guessed that OpenSSL was maxing out some resources and closed the leak by explicitly freeing the SSL descriptor, which freed the associated BIO structures. Then things got weirder as my program started crashing in a different place entirely, with stack traces like this:

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x3139322e
0x0027c174 in s8j1_info ()
(gdb) where
#0  0x0027c174 in s8j1_info ()
#1  0x0021c9f4 in StgRunIsImplementedInAssembler () at StgCRun.c:576
#2  0x0021cdc4 in schedule (mainThread=0x1100360, initialCapability=0x308548) at Schedule.c:932
#3  0x0021dd6c in waitThread_ (m=0x1100360, initialCapability=0x0) at Schedule.c:2156
#4  0x0021dc50 in scheduleWaitThread (tso=0x13c0000, ret=0x0, initialCapability=0x0) at Schedule.c:2050
#5  0x00219548 in rts_evalLazyIO (p=0x29b47c, ret=0x0) at RtsAPI.c:459
#6  0x001e4768 in main (argc=2262116, argv=0x308548) at Main.c:104

I took waitThread_ as a clue and started digging deeper.

Whenever I connect to the server or send a command I wait for X seconds, and if the connection is not made or the desired command is not received I throw an exception, which fails the script. I implemented the timeout combinator a couple of different ways, including the one in the Asynchronous Exceptions paper, but it did not help. I think the issue has to do with killing threads that are using FFI. Although I'm killing threads that call the Haskell connectTo, hGetBuf, etc., I think it's still FFI underneath.
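For reference, this is roughly the shape of a thread-killing timeout combinator. It is a minimal sketch, not the code from the paper or from my harness, and the name timeoutAfter is made up for illustration. The action runs in a worker thread that gets killed once the timer fires, and killing a thread that may be sitting in a foreign call is exactly the part I suspect.

```haskell
import Control.Concurrent (forkIO, killThread, threadDelay)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (SomeException, throwIO, try)

-- | Run an action, giving up after the given number of microseconds.
-- Returns Nothing on timeout; rethrows the action's own exception.
-- (Illustrative sketch; does not handle exceptions thrown at the caller.)
timeoutAfter :: Int -> IO a -> IO (Maybe a)
timeoutAfter micros action = do
  result <- newEmptyMVar
  -- Worker: run the action and report success or failure.
  worker <- forkIO (try action >>= putMVar result . Just)
  -- Timer: report a timeout once the delay has elapsed.
  timer <- forkIO (threadDelay micros >> putMVar result Nothing)
  outcome <- takeMVar result     -- whichever thread finishes first wins
  killThread worker              -- asynchronous exception to the loser(s)
  killThread timer
  case outcome of
    Nothing        -> return Nothing
    Just (Left e)  -> throwIO (e :: SomeException)
    Just (Right a) -> return (Just a)
```

Note that killThread only takes effect once the target thread is back in Haskell code, so a worker blocked inside a C call is not interrupted immediately.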

I disposed of timeouts entirely, leaving connectTo as it is and using hWaitForInput on my socket handle to simulate timeouts. This improved things tremendously and I'm now able to run a few thousand unbound script threads on Windows with OpenSSL FFI and everything.
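A minimal sketch of the hWaitForInput approach, assuming the packet length is known up front. The helper name readWithin is hypothetical. No thread is ever killed here: the reading thread itself waits for input for a bounded time and then decides what to do.

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Marshal.Array (peekArray)
import System.IO  -- Handle, hWaitForInput, hGetBuf

-- | Wait up to the given number of milliseconds for input on the handle,
-- then read up to len bytes. Returns Nothing if nothing arrived in time.
-- hWaitForInput throws an EOF error if the peer has closed the connection.
-- (Hypothetical helper; hGetBuf still blocks until len bytes or EOF.)
readWithin :: Handle -> Int -> Int -> IO (Maybe [Word8])
readWithin h millis len = do
  ready <- hWaitForInput h millis
  if not ready
    then return Nothing
    else allocaBytes len $ \buf -> do
      n <- hGetBuf h buf len       -- number of bytes actually read
      Just <$> peekArray n buf
```

A script can then treat Nothing as a failed expectation and throw its own exception from within the same thread.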

Memory usage is still higher than I would have liked and crashes in OpenSSL still happen when the number of threads/memory usage is really high so there's still room for improvement. I should probably go back to using a foreign finalizer (SSL_free) on the SSL descriptors rather than freeing them explicitly as the freeing does not happen if a script fails mid-way.
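The finalizer idea could look like the sketch below. To keep it runnable without linking OpenSSL, a malloc'ed buffer with finalizerFree stands in for the SSL descriptor with SSL_free. Descriptor, newDescriptor and withDescriptor are made-up names for illustration.

```haskell
import Data.Word (Word8)
import Foreign.ForeignPtr (ForeignPtr, newForeignPtr, withForeignPtr)
import Foreign.Marshal.Alloc (finalizerFree, mallocBytes)
import Foreign.Ptr (Ptr)

-- | Stand-in for an SSL descriptor (assumption: the real thing would wrap
-- a pointer returned by SSL_new instead of a malloc'ed buffer).
newtype Descriptor = Descriptor (ForeignPtr Word8)

-- | Allocate the resource and attach a finalizer right away, so it is
-- released by the garbage collector even if a script fails mid-way.
-- With OpenSSL the finalizer would be imported along the lines of
--   foreign import ccall "&SSL_free" sslFree :: FunPtr (Ptr a -> IO ())
newDescriptor :: IO Descriptor
newDescriptor = do
  raw <- mallocBytes 64
  Descriptor <$> newForeignPtr finalizerFree raw

-- | Use the raw pointer; the ForeignPtr is kept alive for the duration.
withDescriptor :: Descriptor -> (Ptr Word8 -> IO a) -> IO a
withDescriptor (Descriptor fp) = withForeignPtr fp
```

The trade-off is that finalizers run whenever the garbage collector gets around to them, so this does not bound the number of live descriptors the way explicit freeing does.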

I'm quite satisfied with my first Haskell project. I love Haskell and will continue hacking away with it. This list is invaluable in the depth of offered help whereas #haskell (IRC) is invaluable when speed matters. I'm quite amazed at the things I have been able to do, the expressiveness of Haskell and the clean looks.

Clean looks can be deceptive, though, as they can hide code of amazing complexity. Fundeps, existential types, HList take a while to grasp. Also, I feel somewhat like a pioneer and I definitely got more than a fair share of arrows in my back.

I had GHC run out of memory during compilation (fixed by SPJ) and had it quit midway through compilation with an error about generated extents being too large in assembler code. I had GHC crash at runtime with an error like "fromJust not returning Just, this could not be happening!". Yesterday's error topped them all:

internal error: update_fwd: unknown/strange object  0
   Please report this as a bug to glasgow-haskell-bugs@haskell.org,
   or http://www.sourceforge.net/projects/ghc/

I think I got this when using +RTS -C0 -c.

Overall, the experience with Haskell has been exhilarating and I'm already preparing to use it on my next projects like detecting collusion in poker as well as rake optimization (Dazzle paper very helpful here!). Still, I think that GHC can be a bit rough around the edges and I would think twice about writing high-performance network apps with it.

        Thanks, Joel

P.S. The Glasgow Distributed Haskell (GdH) people are supposed to have a mailing list and I would love to share my findings with them but I could not find the mailing list itself.

--
http://wagerlabs.com/

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe