Begin forwarded message: > From: Brian Warner <[email protected]> > Date: June 3, 2009 5:22:23 AM MDT > To: [email protected] > Subject: [tahoe-dev] pycryptopp vs ARM > Reply-To: [email protected] > > > I did some digging into the mysterious pycrypto++ corruption > failures on > Zandr's ARM-based linkstation. The executive summary: crypto+ > +-5.5.2 is > broken, 5.6.0 probably fixes it, the problem won't appear on x86 or > most > other processors, there's a workaround but it's too ugly to > contemplate. > > > == The problem with crypto++-5.5.2 and ARM == > > The main workhorse of AES-CTR_Mode is in strcipr.cpp, > AdditiveCipherTemplate<S>::ProcessData. This is responsible for > each call to > AES.ProcessData(), which means it is effectively generating a > stream cipher > that looks like: > > 16 bytes of AES.encode(00000) ("00000" means 128 bits of zeros) > 16 bytes of AES.encode(00001) (127 bits of zeros and a single > "one" bit) > 16 bytes of AES.encode(00002) > etc.. > > and XORing that stream cipher with the input data to produce the > output data. > Each call to ProcessData() can involve an arbitrary amount of data, > but > always works with adjacent spans of the stream, so it is > responsible for > keeping track of how many bytes have been processed already: an > offset from > the beginning of the stream. > > It remembers this offset in two pieces. The high-order piece is the > number of > 16-byte blocks which have been encoded, stored in "m_counterArray", > as a > 128-bit/16-byte number (remember that we're using AES here, so the > keysize > can be either 128-bits or 256-bits, but the blocksize is always > 128-bits/16-bytes). > > The low-order piece is stored backwards in "m_leftOver": upon entry to > ProcessData, if we're sitting at offset=0,16,32, then m_leftOver=0, > but if > we're sitting at offset=1,17,33 then m_leftOver=15, and if we're at > offset=15,31,47 then m_leftOver=1. > > The AES object has a buffer named "m_buffer" which is used to hold the > leftover stream cipher blocks between calls to ProcessData. It > always holds > exactly 16 bytes, and always updates these bytes as a unit. m_buffer > [0] holds > the stream byte that corresponds to offset=0,16,32 . m_buffer[1] > holds the > byte that is used for offset=1,17,33. > > m_leftOver then tells you which of the right-most bytes in m_buffer > [] are > waiting to be used. If m_leftOver=3, then m_buffer[13..15] are > waiting to be > XORed to provide offset=13..15, or offset=29..31 . > > ProcessData() has three phases: > 1: use up any leftover stream data sitting in m_buffer from last time > 2: process as many full 16-byte blocks as possible > 3: process a trailing partial block, if any, putting the leftover > stream in > m_buffer > > Phase 1 happens if m_leftOver>0 and works by just XORing the right > offset in > m_buffer with the input data, and incrementing all the pointers. It > does this > one byte at a time: not as efficient as you might like, but it's > never being > used for more than 15 bytes per operation. > > Phase 2 has two forms: some operations can handle multple blocks at > once very > efficiently (remember that ProcessData() is pretty generic and is > used by > lots of different codepaths), so it sets up the underlying > operation (i.e. > AES256) to encrypt-and-XOR a couple of blocks and says Go. The > operation code > that is passed in to OperateKeystream() includes a note that says > whether the > input and output string pointers are aligned, but many operations > (including > modes.cpp:CTR_ModePolicy::OperateKeystream) ignore this note. > > If the first form couldn't be used, it falls back to the second > form, which > has a while(length>=bufferByteSize) loop and explictly writes the > keystream > into m_buffer, XORs it with the input string into the output > string, and > increments all the pointers. This XOR loop will run one byte at a > time if its > arguments aren't aligned. > > Phase 3 (which only happens if the remaining length is nonzero) > writes the > keystream into m_buffer, XORs just the partial block (one byte at a > time), > then sets m_leftOver so that the next time through Phase 1 will use > the > previously generated keystream. A given keystream block is only > generated > once. > > Note that Phase 1 and Phase 3 (and the second form of Phase 2, but > not the > first) all use byte-at-a-time loops. Also note that Phase 2 is used > for the > bulk of the data, so it wants to be as fast as possible. > > == The Gun On The Mantle In Act One == > > All full-size CPUs like big fat memory reads, for efficiency: > they've got 32- > or 64- bit memory busses, and read whole cache lines at a time. > They prefer > aligned memory reads: x=*((int32_t*)0x0), because unaligned reads like > x=*((int32_t*)0x1) or x=*((int32_t*)0x2) actually require the > processor to > perform two reads (the first of 0x0-0x3, the second of 0x4-0x7) and > then > shuffle bytes around to get them into the right place. Memory > writes behave > similarly. > > (microcontrollers, at least those with an 8-bit bus, don't care about > alignment. In fact, many of them don't even have 16- or 32- bit > operations, > so the issue never comes up). > > Some processors will begrudgingly perform unaligned reads for you, but > they'll be slow and grouchy about it. Some will throw a hardware > exception, > which might be caught by the kernel and emulated (meaning it'll > work, but now > the kernel is grouchy too, and things are really really slow), or > might be > delivered as a SIGILL to your program (which will probably kill it). > > The ARM processor has an exciting quirk. In certain configurations > (which > depend upon what the kernel wants to do), it uses a third mode, in > which > unaligned reads cause the specific byte to be loaded correctly but the > remaining bytes get shifted around. This effectively corrupts most > of the > loaded word. http://www.aleph1.co.uk/chapter-10-arm-structured- > alignment-faq > has a good writeup on the issue: their basic example is (remember > these are > little-endian): > > Address: 0 1 2 3 4 5 6 7 > Value : 10 21 66 23 ab 5e 9c 1d > > Using *(unsigned long*)2 would give: > > on x86: 0x5eab2366 > on ARM: 0x21102366 > > Similar things happen if you try to do an unaligned write: instead of > modifying bytes [2..5], you'll modify bytes [0..3] with some > surprisingly > rotated version of what you were writing. Byte [2] might get the > right thing, > but byte[0] will be clobbered. > > == The Gun On The Mantle In Act Two == > > The function in Phase 2 which generates the stream data bottoms out in > rijndael.cpp's Rijndael::Enc::ProcessAndXorBlock, which takes three > (byte *) > pointers (plus the internal key, already prepared and turned into > subkeys): > the input block (i.e. the counter), the XOR block (i.e. the > plaintext), and > the output block (i.e. the ciphertext). Most of the code in > rijndael.cpp is > x86 assembly code, but when that's disabled (hint: use M-x > cpp-highlight-buffer to make emacs colorize or hide code inside > specific > conditionals, like !defined(CRYPTOPP_X86_ASM_AVAILABLE) ) the part > that's > left is the regular C AES implementation. It shuffles a bunch of > data around > internally and then does something like this: > > word32 *const obw = (word32 *)outBlock; > const word32 *const xbw = (const word32 *)xorBlock; > > obw[0] = tbw[0] ^ xbw[0] ^ rk[0]; > > That means it's taking the xorBlock plaintext pointer (a byte*), > pretending > it's really a 4-byte (word32*), and then reading from it one word > at a time. > If xorBlock was not actually a multiple of 4 bytes, then this will > be an > unaligned read. The write is doing the same thing, using outBlock > and casting > it to a word32* and then writing words into it, so you can get an > unaligned > write from this. This doesn't occur on x86 because the assembly > version gets > used instead of the C version (I assume that either the assembly > handles > misalignment explicitly, or that the x86 behaves correctly in the > face of > unaligned accesses). > > == The Gun Is Fired In Act Three == > > The sequence of operations that triggers a pycryptopp failure on > ARM is when > an AES encryption instance is used to process two chunks of data in > which the > first chunk is less than 16 bytes long and the two chunks combined > are at > least 32 bytes long. The python code looks something like this: > > expected = AES(key).process(plaintext) > e2 = AES(key) > got = e2.process(plaintext[:9]) + e2.process(plaintext[9:9+23]) > assert expected == got # fails > > In particular, on ARM we see something like: > 09,23: 0551d7974ff1d9e29c > 9d883746f146a6,ffc9ce5ecaa9abd890da470e46000000 <--GOT > 09,23: 0551d7974ff1d9e29c > 9d883746080343,f15abafbc9ca5ac6a9a7eeaada790a42 <--EXPECTED > > (where the space is the break between the 9-byte call and the 23- > byte call, > and the comma is where the 16-byte blocksize lands). > > We also see failures with other length combinations: 9+24, 10+22, 10 > +23, > 10+24, 15+17, etc. > > Let's look specifically at 9+23. The first call to process() will > skip Phase > 1 (since this is the first time we've used this object), will skip > Phase 2 > (since we're not processing a full block), and only run Phase 3. > Phase 3 will > generate the stream data for offset=0..15 and store it in m_buffer, > then XOR > the first 9 bytes of that to produce the ciphertext, and store it > in the > output pointer. Note that pycryptopp/Python allocates a new string > for each > call to ProcessData(), and those strings are always 4-byte aligned. > Phase 3 > sets m_leftOver to 7, since we've only used 9 bytes out of the 16 in > m_buffer. This works fine. The 9 bytes of ciphertext > (0551d7974ff1d9e29c) > returned by process() are correct. The output buffer so far: > > 09,23: 0551d7974ff1d9e29c > > The second call to process() wants to process 23 bytes of > plaintext. It > allocates a Python string for the result, let's pretend it gets > address > 0x1000 (it will generally be 4-byte aligned, so it could just as > easily be at > 0x1004 or 0x1008 or 0x100c). Phase 1 will see that m_leftOver=7, so > it does > byte-wise XORs the remaining stream data from m_buffer into the > output and > gets 9d883746080343. So far, so good. Phase 1 finishes by > incrementing the > output pointer, in this case 0x1000+7=0x1007, and decrementing the > remaining > length to be processed, 23-7=16. The output buffer now has: > > 09,23: 0551d7974ff1d9e29c 9d883746080343 > > Phase 2 wants to work with whole blocks, and since length>=16, it > gets to > run. It calls OperateKeystream() and tells it to encrypt-and-XOR > data using a > ciphertext target pointer (outblock) of... 0x1007. Here is the > problem. That > 32-bit cast and obw[0]= dereference causes an unaligned access, and > on ARM > you wind up with writes to [0x1004-0x1007] instead of > [0x1007-0x100a]. This > clobbers the bytes which were already written, giving us: > > 09,23: 0551d7974ff1d9e29c 9d883746080343 > xxxxxxxx <- expected write > f146a6,ff <- actual write > 09,23: 0551d7974ff1d9e29c 9d883746f146a6,ff <- resulting data > > The rest of Phase 2 does more damage. I think the unalignedness of the > plaintext pointer (xorBlock) is affecting stuff too, which is why the > corrupted ciphertext looks both shifted and bit-flipped from the > expected > value. > > I instrumented the ProcessData() code in cryptopp-5.5.2 to confirm > that the > last byte of the output buffer was correct (0x43) at the end of > Phase 1 and > then clobbered (0xa6) at the end of Phase 2. > > This corruption doesn't happen if Phase 1 didn't run, since then > the pointers > are still equal to the original (4-byte aligned) Python string > buffer, which > means that if your call to ProcessData() leaves it on a 16-byte > boundary, > then you're fine. > > It also doesn't happen unless Phase 2 runs, which requires that > there be at > least 16 bytes left to process (so it can do a full block). And it > only > happens on ARM where aligned accesses give you this weird > corruption behavior > (on other chips you might get a trap or a hard-to-spot performance > problem). > > Calling ProcessData() only once per SigningKey instance won't > trigger it. > Always calling ProcessData() with short strings (<16 bytes) won't > trigger it, > and always calling it with multiples of 16-byte strings won't > trigger it. > > == Episode 4: A New Hope == > > cryptopp-5.6.0 changes the AES implementation considerably. I > haven't traced > through it enough to be sure, but I suspect that they've fixed the > problem, > because of a new #define CRYPTOPP_ALLOW_UNALIGNED_DATA_ACCESS > (which is > conservatively only set on x86, x64, and PowerPC). In addition, the > vital > write in rijndael.cpp now looks like: > > Block::Put(xorBlock, outBlock)(tbw[0]^rk[0])(tbw[1]^rk[1])(tbw[2] > ^rk[2])(tbw[3]^rk[3]); > > and is done without casting xorBlock or outBlock to word32*. I > can't find > where Block::Put is defined, but I'm encouraged that this will fix the > problem. The terse release notes for 5.6.0 don't mention > intentionally fixing > any problem like this. > > I've got a library compiling now to test this. I'll have to let it run > overnight.. compiling libcrypto++ on Zandr's linkstation takes > about 3 hours. > (incidentally, it runs a lot faster if you edit the makefile to use > -O0). > > What this means is that libcrypto++-5.5.2 has a grave bug on ARM, > which is > probably fixed in the current libcryptopp++-5.6.0 . If we really > wanted to, > pycryptopp could work around the bug, by keeping track of how many > bytes we'd > submitted to ProcessData() and splitting up the input data (at most > one split > per call to .process) such that we either tell ProcessData() to > start at a > 16-byte boundary, or we give it less than 16 bytes of data to work > with. > Something like: > > def process(self, plaintext): > if len(plaintext) < 16 or self.counter%16==0: > self.counter += len(plaintext) > return self.e.ProcessData(plaintext) > gap = 16 - (self.counter%16) > return self.process(plaintext[:gap]) + self.process > (plaintext[gap:]) > > That ought to sound crazy to you. > > Lenny has 5.5.2, and while we should develop a minimal (C++) > demonstration > case and file a "serious" or "grave" debian/lenny bug on libcrypto++7 > (specific to arm/armel), I think it's unlikely that we could > convince them to > issue a security fix and upgrade stable all the way to 5.6.0 for > just one > architecture. So ARM boxes that want to use pycryptopp with > --disable-embedded-cryptopp will need to upgrade to sid (which has > 5.6.0). > Fortunately, pycryptopp.deb isn't going to go into Lenny anyways > (since it's > already shipped), so this is less of an issue for the > get-pycryptopp-into-Debian effort, just for people who want to use > pycryptopp > in the syslib form on a lenny/stable ARM box. We may want to consider > building a backported cryptopp-5.6.0 for these folks. > > The embedded-cryptopp form should work (assuming my hopes for > Block::Put are > justified), because Zooko recently upgraded the embedded copy to > 5.6.0 . > > We should also look at crypto++'s test suite and make sure there's > a case > which uses a single encryption object to two ProcessData operations > (with > various lengths) so this case is being exercised right from the > source. I've > attached the python equivalent below. > > cheers, > -Brian > > > #! /usr/bin/python > > from pycryptopp.cipher.aes import AES > from binascii import a2b_hex, b2a_hex > > key = a2b_hex > ("54b0789aeeddb6b3351e6d7b8d79d8d489582376ab1b322dd3362ccbfdb82f7a") > assert len(key) == 32 > plaintext = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" > assert len(plaintext) == 52 > expected_ciphertext = AES(key=key).process(plaintext) > #print b2a_hex(expected_ciphertext[:16]), b2a_hex > (expected_ciphertext[16:32]), b2a_hex(expected_ciphertext[32:48]) > expected_s = > "0551d7974ff1d9e29c9d883746080343"+"f15abafbc9ca5ac6a9a7eeaada790a42"+ > "73a45de485a97ecd52d79fe702a07a67"+"9b8d73ad" > assert b2a_hex(expected_ciphertext) == expected_s > > errors = [] > for i in range(1, 25): > for j in range(1, 25): > p = AES(key=key) > a = p.process(plaintext[:i]) > assert len(a) == i > b = p.process(plaintext[i:i+j]) > assert len(b) == j > c = a+b > if c != expected_ciphertext[:i+j]: > errors.append((i,j,a,b,expected_ciphertext > [:i],expected_ciphertext[i:i+j])) > for (i,j,a,b,ea,eb) in errors: > print "%02d,%02d: %s %s <--GOT" % (i, j, b2a_hex(a), b2a_hex(b)) > print "%02d,%02d: %s %s <--EXPECTED" % (i, j, b2a_hex(ea), > b2a_hex(eb)) > print > if not errors: > print "no errors detected" > _______________________________________________ > tahoe-dev mailing list > [email protected] > http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev
--~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the "Crypto++ Users" Google Group. To unsubscribe, send an email to [email protected]. More information about Crypto++ and this group is available at http://www.cryptopp.com. -~----------~----~----~----~------~----~------~--~---
