Re: Heap32Next performance awful on 64-bit Win7 (Was: CryptoAPI calls failing in rand_win on Windows 7)
Ger Hobbelt g...@hobbelt.com wrote:

> Odd question maybe, but does the API call slow down too when traversing other heaps (which carry fewer items)?

Yes. This surprised me, but Heap32Next takes the same amount of time to execute when traversing the 2nd heaplist (which has 15 items) as it does the 1st heaplist (which has a million items).

> Are those time-per-API-call numbers averaged or does /each/ Heap32Next call take this long?!

Each and every call takes the same long amount of time. To me, this indicates that the time is not actually spent *finding* the next heap entry (as if we were traversing a linked list to get to our destination), but in allocating (to the nearest 2^N) space for and/or recording info about every heap entry in every heap list.

> an adjustment to keep the rand collecting scan within reasonable bounds is well feasible (no hard upper limit, though, because, ah, 'granularity' there is the time one (slowest) API call takes, no matter how the solution is coded).

It would definitely be easy to constrain the number of heap entries checked even further, based on time spent in the inner loop, but doesn't that run into this:

> Oh yeah, to answer one Q in first post: it's not a very smart idea to strip out entropy collecting code sections ...

If we limited the inner loop to 1 second as we do the outer loop, we'd effectively be cutting out (in this case) 79 of the usual 80 bytes of entropy which, as you say, makes one trepidatious. RAND_poll appears to gather randomly varying amounts of entropy, basically what it can grab in a few seconds. Is there a known minimum effective amount of entropy? The ideal thing is to add another source of entropy to compensate, but that's not something that's within my capabilities or time limits right now.

__
OpenSSL Project http://www.openssl.org
User Support Mailing List openssl-users@openssl.org
Automated List Manager majord...@openssl.org
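The time-based cut-off for the inner loop discussed in this message could look roughly like the sketch below. It is portable C, not the actual rand_win.c code: `next_entry_fn` and `tick_ms_fn` are hypothetical stand-ins for Heap32Next and GetTickCount, and `sim_heap` is a made-up simulation in which every call costs a fixed number of milliseconds. The clock is only checked between calls, so the granularity is one (slowest) call, as noted in the quoted text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for Heap32Next and GetTickCount, passed in as
 * function pointers so the sketch stays portable and testable. */
typedef int (*next_entry_fn)(void *ctx);     /* returns 0 when the list ends */
typedef uint32_t (*tick_ms_fn)(void *ctx);   /* millisecond tick counter */

/* Visit at most max_entries entries, stopping once budget_ms has elapsed.
 * The first entry is assumed to be already fetched by the caller, so it is
 * always counted. Returns the number of entries visited. */
size_t walk_with_budget(next_entry_fn next, tick_ms_fn ticks, void *ctx,
                        size_t max_entries, uint32_t budget_ms)
{
    assert(next != NULL && ticks != NULL);
    if (max_entries == 0)
        return 0;

    uint32_t start = ticks(ctx);
    size_t visited = 1;

    while (visited < max_entries) {
        /* Unsigned subtraction stays correct if the counter wraps. */
        if ((uint32_t)(ticks(ctx) - start) >= budget_ms)
            break;
        if (!next(ctx))
            break;
        visited++;
    }
    return visited;
}

/* Simulated heap list (an assumption for illustration only): every call
 * to sim_next advances the fake clock by cost_ms. */
struct sim_heap {
    uint32_t now_ms;
    uint32_t cost_ms;
    size_t remaining;   /* entries still available after the first one */
};

uint32_t sim_ticks(void *ctx)
{
    return ((struct sim_heap *)ctx)->now_ms;
}

int sim_next(void *ctx)
{
    struct sim_heap *h = ctx;
    h->now_ms += h->cost_ms;
    if (h->remaining == 0)
        return 0;
    h->remaining--;
    return 1;
}
```

With a simulated cost of 1500 ms per call and a 1000 ms budget, `walk_with_budget` returns 2 (the first entry plus one more), which illustrates the concern raised above: the bound works, but nearly all of the 80 entries are lost.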
Re: Heap32Next performance awful on 64-bit Win7 (Was: CryptoAPI calls failing in rand_win on Windows 7)
On Fri, Nov 13, 2009 at 6:34 PM, James Baker j...@j-baker.org wrote:

> [...] Each and every call takes the same long amount of time. To me, this indicates that the time spent is not actually spent *finding* the next heap entry (as if we were traversing a linked list to get to our destination), but in allocating (to the nearest 2^N) space for and/or recording info about every heap entry in every heap list.

[censored] eh... Gosh! The only 'close to reasonable' cause for this I can come up with is someone leaving some 'scan all to verify everything on each invocation' diag code in there. Maybe it's doing a full snapshot every time. Whatever.

So, yes, cutting short the heaplist scan based on timing results obtained over one or a few API calls is the way to go IMO; yes, when your machine has been heaplist-treated like that, it effectively cuts out 99.9% of the (semi-)entropy that might have been waiting for us in the heaplist, so that's not very nice.

I know the whole entropy gathering business there comes with some handwaving anyway as nothing is 'hard' - after all, all the 'entropy' gathered in the scan can be theorized to be deterministic in some /very/ complex way (if one assumes no external influences like humans and such touch the machine up to then, knowing all the context and innards, etc. etc.), but it's all about making it bloody hard for attackers to make it close to deterministic within the confines of a reasonable amount of effort spent. So every source of possible (semi-)entropy counts. Hence there are never going to be hard numbers on 'minimum effective amount' of gathering effort or other detail of the scanning/gathering process -- unless someone includes provable random hardware sources with standard motherboards some day; maybe an on-board QRBG121, say. ;-)

So my remark about please not stripping out gathering code sections should be read in that light: when a code section /can/ cause trouble for someone, it should not be discarded for /everybody/.
That's the best we can do for now. Yep, it doesn't make the handwaving any less, but at least the impact is minimized to fringe cases (if I may call this huge-number heaplist thing a 'fringe case'?).

There /is/ another possible source of entropy available nowadays, at least for a lot of folks: with the existence of on-board analog audio and ubiquitous DirectSound support (which goes for Windows 7 as well) it would be a nice thing to take a few seconds of direct line and mic channel 'sound' and turn that into a few hashes to feed to the randomness collector.

Input and mic lines can be 'noisy' due to noise in the open analog circuits and A/D when the volumes are not moved down to 0%. One can see this happen when recording mic line sound without a mic attached in sound editors like Adobe Audition: the VU bars show the very tiny line noise in the sampled signal, and when you amplify such a sample (say, 'normalize' the track), the actual analog hardware noise is clearly visible. It's not perfect white noise, but at least it's grey or pink and it's got entropy, yes sir.

The MIC line is favourable in this regard as its 'signal-to-noise' ratio is lower than for 'line', almost everywhere, mostly due to higher amplification ratios in the analog section there, to ensure the tiny microphone signal makes it through the A/D quantization with the least amount of quality loss. Wherever a sound engineer would curse the noise (analog and A/D quantization), we want to be in on the show.

Of course, there are quibbles like the gatherer then temporarily 'occupying' the DirectSound I/O, which other apps may not like all that much, so it's not all that easy in that regard, but it'd be a nice addition to the gatherer. Hm, when there's some time available, I should have a look at that. Unless someone likes to beat me to it, that is.
;-))

(Note to self: see how we can grab DirectSound channels with 'share device with other applications' enabled, like some of the sound apps can do out there. Second note: traverse DirectSound devices and grab from each for extra noise yumminess.)

(Note: the same trick applies to sampled analog video, but there's fewer folks with analog video-in, and analog video is becoming a bit of a rare species itself with video broadcasts becoming digital end-to-end nowadays. sigh ;-) )

> If we limited the inner loop to 1 second as we do the outer loop, we'd effectively be cutting out (in this case) 79 of the usual 80 bytes of entropy which, as you say, makes one trepidatious. RAND_poll appears to gather randomly varying amounts of entropy, basically what it can grab in a few seconds. Is there a minimum effective amount of entropy that is known? The ideal thing is to add another source of entropy to compensate, but that's not something that's within my capabilities or time limits right now.

For both parts, see above. (What's bad for one should not disappear for everyone + analog audio noise)

-- Met vriendelijke
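As a rough illustration of the audio-noise idea in this message, the sketch below packs the least significant bit of each 16-bit PCM sample into bytes. This is a deliberately simpler stand-in for the "turn that into a few hashes" step, not that step itself, and it says nothing about how the samples are captured (no DirectSound calls are shown). The function name `pack_sample_lsbs` is hypothetical. A real gatherer would more likely hash the raw capture buffer or hand it to the pool via something like RAND_add with a conservative entropy estimate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Pack the least significant bit of each 16-bit PCM sample into out[],
 * 8 samples per output byte, lowest bit first. Leftover samples that do
 * not fill a whole byte are ignored. Returns the number of bytes written,
 * which never exceeds out_cap. */
size_t pack_sample_lsbs(const int16_t *samples, size_t n_samples,
                        uint8_t *out, size_t out_cap)
{
    assert(samples != NULL && out != NULL);

    size_t n_bytes = n_samples / 8;
    if (n_bytes > out_cap)
        n_bytes = out_cap;

    memset(out, 0, n_bytes);
    for (size_t i = 0; i < n_bytes * 8; i++) {
        /* Go through uint16_t so negative samples behave predictably. */
        uint8_t bit = (uint8_t)((uint16_t)samples[i] & 1u);
        out[i / 8] |= (uint8_t)(bit << (i % 8));
    }
    return n_bytes;
}
```

The low bit is where quantization and analog noise would be expected to dominate when the input is otherwise silent, but how much real entropy it carries depends entirely on the hardware, so any entropy credit given to such bytes is an assumption that would need measuring.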
Re: CryptoAPI calls failing in rand_win on Windows 7
James Baker wrote:

> The problem does occur with full admin privileges.

To be 100% clear, this is full admin with no UAC? UAC will drop privilege of an app seemingly running as 'administrator'.
Heap32Next performance awful on 64-bit Win7 (Was: CryptoAPI calls failing in rand_win on Windows 7)
Punchline: The time taken by a call to Heap32Next on 64-bit Windows 7 SCALES (roughly linearly?) with the number of heap entries in the heap list. This seems to be a serious problem that would affect (at least) most 32-bit-compiled OpenSSL users on 64-bit Win7.

I've cleared my accusation against the CryptoAPI functions - those are working fine. The time is taken up by Heap32Next, even though good == 1 and stoptime is set. The 1-second constraint on the number of heaplists walked is ineffective because the time is all spent in the inner loop, walking the first 80 heap entries in the first heaplist.

By the time I got up to 4 million (2-byte) heap objects in my test harness, each Heap32Next call was taking multiple seconds. It is not the overall size of the heap that counts, but the number of heap objects. The performance of each Heap32Next (the 1st versus the 80th) is roughly the same. I do not know whether the problem is specific to only 64-bit Win7 (due to WoW), or whether it applies to all Windows 7 versions.

What then is the fix? Sure, this may be a Windows problem, but letting RAND_poll take dozens to hundreds of seconds is obviously not acceptable. This problem is sort of related to the previous 'heap walking is slooow' threads on this list dealing with lines ~500-515 in rand_win.c, but we can no longer get 80 entries from the first list in anything near 1 second.

What would the cryptographic effect (on the entropy of the randomness pool) be of cutting the heap traversal entirely (i.e. cutting 80 bytes of entropy) - is that cryptographically acceptable? Is there some alternate way of traversing large heaps, or some alternate source of entropy we could turn to?

I have a single cpp repro file with a slightly chopped-down RAND_poll ripped out of rand_win.c that I could pass on to any OpenSSL developer/contributor.

Thanks,
James

my debugging output:
stoptime: 851485984
Got heaplist_first.
heap1st tickcount: 851624250
Exiting RAND_poll

On Wed, Nov 11, 2009 at 4:50 PM, James Baker j...@j-baker.org wrote:

> [...]
Re: Heap32Next performance awful on 64-bit Win7 (Was: CryptoAPI calls failing in rand_win on Windows 7)
I've confirmed my linear performance conjecture w/r/t heap objects. Pretty pictures graphing my results are here:
http://thenewjamesbaker.blogspot.com/2009/11/performance-of-heap32next-on-64-bit.html

On Thu, Nov 12, 2009 at 11:50 AM, James Baker j...@j-baker.org wrote:

> [...]
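One way to check a "roughly linear" conjecture like this against measurements is an ordinary least-squares line fit of time per call against number of heap objects. The sketch below is a generic fit and is not taken from the repro file or the blog post. The function name `fit_line` is hypothetical, and no measured values from this thread are embedded in it.

```c
#include <assert.h>
#include <stddef.h>

/* Ordinary least-squares fit of y = intercept + slope * x.
 * x[i] could be the number of heap objects and y[i] the measured
 * milliseconds per call. Returns 0 on success, -1 if there are fewer
 * than two points or all x values are identical. */
int fit_line(const double *x, const double *y, size_t n,
             double *slope, double *intercept)
{
    assert(x != NULL && y != NULL && slope != NULL && intercept != NULL);
    if (n < 2)
        return -1;

    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (size_t i = 0; i < n; i++) {
        sx += x[i];
        sy += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }

    double denom = (double)n * sxx - sx * sx;
    if (denom == 0.0)
        return -1;   /* vertical line: slope undefined */

    *slope = ((double)n * sxy - sx * sy) / denom;
    *intercept = (sy - *slope * sx) / (double)n;
    return 0;
}
```

If the scaling really is linear, the fitted slope gives the extra time per call added by each additional heap object, and the residuals around the line should show no obvious curve.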
Re: Heap32Next performance awful on 64-bit Win7 (Was: CryptoAPI calls failing in rand_win on Windows 7)
Odd question maybe, but does the API call slow down too when traversing other heaps (which carry fewer items)? I assume not, but I ask since you tested this and I don't see that aspect in your blog.

(Pondering what can be done here; when the answer is 'no' to the previous question, it means the only way out is to 'measure' each HeapFirst/Next to see if it is a 'slow' one (plus of course watch the total time spent in the outer loop). There's no way to get the total number of heap blocks up front, so we're somehow stuck with 'seeing what happens while we traverse' one way or another, so checking after only a few API calls whether it registers on the clock()/ticks radar or not might work out...)

Which leads to the second question regarding your values: are those time-per-API-call numbers averaged or does /each/ Heap32Next call take this long?! (I assume here the first ones are faster and time spent increases gradually while the list is traversed, but again, that's only an assumption, with no observation data to aye or nay that yet.)

If the initial calls are faster, then the solution is still kind of the same, but needs a little further thought; a hacky 'check first N for time spent' won't work.

(just thinking out loud here... slap self!)

Aw, heck, this is doing things the wrong way around anyway: whether those two assumptions are correct or not, the scanner code shouldn't depend on them anyhow and should be able to cope with either one; an adjustment to keep the rand collecting scan within reasonable bounds is well feasible (no hard upper limit, though, because, ah, 'granularity' there is the time one (slowest) API call takes, no matter how the solution is coded).

On Fri, Nov 13, 2009 at 2:38 AM, James Baker j...@j-baker.org wrote:

> I've confirmed my linear performance conjecture w/r/t heap objects. Click here to see pretty pictures graphing my results:

Oh yeah, to answer one Q in first post: it's not a very smart idea to strip out entropy collecting code sections; it's the slow way to arrive at an undesirably predictable random generator as you take away a chance to introduce some entropy, one scanner part at a time. It's perfectly okay to /add/ other sources, such as noise input from audio sources, etc. (A/D converter and analog h/w noise) but taking out should be done with trepidation. There are enough horror cases about the ones that have gone down that road before, so, unless there's no other way, no need to add to that collection. ;-)

--
Met vriendelijke groeten / Best regards,
Ger Hobbelt
--
web: http://www.hobbelt.com/ http://www.hebbut.net/
mail: g...@hobbelt.com
mobile: +31-6-11 120 978
--
Re: CryptoAPI calls failing in rand_win on Windows 7
It's not the CryptoAPI calls that are taking time - nearly all of the time is spent within Heap32Next. Thus my hypothesis is that CryptAcquireContextW or CryptGenRandom is failing, causing 'good' to be 0 and the heap traversal to be unbounded.

I see the entrycnt = 80 constraint on walking the length of each heaplist, but there is no bound on the outer while loop calling Heap32ListNext? You say that the very first block of heap is retrieved when good is 0 - is that because GetTickCount() < stoptime is supposed to be a short-circuit when stoptime == 0? (It's not - perhaps I should examine next whether GetTickCount is malfunctioning, or returning a signed negative int for comparison.)

The problem does occur with full admin privileges.

I might speculate about the effect the WoW layer has on using the Heap32* functions, but my investigation so far is focused on why the traversal isn't bounded (i.e. the CryptoAPI -- good relationship), as 4 seconds (1 each for heap/process/thread/module) would be tolerable.

I have not yet written a standalone C program that simulates the same CryptoAPI call sequence. If no one on this list can say 'Yes, the RAND_poll CryptoAPI calls work on Windows 7', this will be my next step.

Thanks,
James

On Sun, Nov 8, 2009 at 6:36 AM, sandeep kiran p sandeepkir...@gmail.com wrote:

> [...]
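On the question of whether the tick comparison itself could misbehave: GetTickCount returns an unsigned 32-bit millisecond counter that wraps around after roughly 49.7 days, so comparing against an absolute deadline can go wrong near the wrap, while comparing elapsed time computed by unsigned subtraction does not. The sketch below contrasts the two forms in portable C. Both function names are hypothetical, and the first form is only an approximation of the kind of check discussed here, not a quotation of rand_win.c.

```c
#include <stdint.h>

/* Absolute-deadline form: "expired" once now reaches stoptime.
 * If start + budget wrapped past zero, stoptime is a small number and
 * this reports expiry immediately. With stoptime left at 0 it is
 * always expired, which acts as a short-circuit. */
int deadline_expired(uint32_t now, uint32_t stoptime)
{
    return now >= stoptime;
}

/* Elapsed-time form: unsigned subtraction is performed modulo 2^32, so
 * the result is the true elapsed time even if the counter wrapped once
 * between start and now. */
int budget_expired(uint32_t start, uint32_t now, uint32_t budget_ms)
{
    return (uint32_t)(now - start) >= budget_ms;
}
```

For example, with a start tick of 0xFFFFFF00 and a 1000 ms budget, the absolute deadline wraps to 0x2E8, so `deadline_expired` is true on the very first check, whereas `budget_expired` stays false until 1000 ms have really passed.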
Re: CryptoAPI calls failing in rand_win on Windows 7
> RAND_poll runs very quickly with a near-empty heap.

Do you mean that the calls to Heap32First, Heap32Next, Heap32ListFirst, Heap32ListNext are failing? Can you check the return values from these calls (using GetLastError)? In any case, the heap traversals are bounded by the 1 sec limit. Even if the variable good is 0, the very first block of heap allocated by the current process is retrieved. Can you specify exactly which CryptoAPI call is taking so much time?

-Sandeep

On Fri, Nov 6, 2009 at 11:45 AM, James Baker j...@j-baker.org wrote:

> Background: Testing a Ruby app on 64-bit Windows 7 Ultimate, I found that OpenSSL::PKey::RSA.generate() was taking 98 seconds. Jumping to C, sampling showed that the great majority of this time was spent in Heap32Next, which led me to the heap list and heap walking section of RAND_poll in crypto/rand/rand_win.c.
>
> The heap walking (and thread and module walking) are limited to 1s unless the variable good is set, and advapi32.dll is loaded, which means that the 'poll the CryptoAPI PRNG' step, using the conjunction of CryptAcquireContextW and CryptGenRandom, must be failing. The 98 seconds comes from walking the contents of the heap after loading a Rails environment - RAND_poll runs very quickly with a near-empty heap.
>
> Are the CryptoAPI calls ever expected to fail under any Windows platform, or is this the abnormality? I'm not aware of any changes in Win7 that would break those calls (though I'm investigating whether something permission/security-related is in play here), but I'm not aware of much about Win7 in general. I also don't see any Win7-related changes in the OpenSSL changelog - has this platform been validated already?
>
> Thanks,
> James