[issue37168] Decimal divisions sometimes 10x or 100x too large
Phil Frost added the comment: Not yet. I had a hell of a time getting gdb to follow gunicorn forks so it would watch the workers but not also every other thing using popen or os.system(). And the lldb package on Alpine doesn't seem to work. So we're currently in the process of testing a Debian container for production, hoping the issue just goes away. If not, I'll continue digging in with a debugger. -- nosy: +Phil Frost2 ___ Python tracker <https://bugs.python.org/issue37168> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37168] Decimal divisions sometimes 10x or 100x too large
Phil Frost added the comment: I'm afraid that won't be of much use since the object is statically allocated. Besides, the trouble isn't finding what created the object, but what mutated an int object which should be immutable. -- ___ Python tracker <https://bugs.python.org/issue37168> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37168] Decimal divisions sometimes 10x or 100x too large
Phil Frost added the comment: Probably too many C extensions to feasibly audit them all. Also we can't rule out something busted in Alpine. So I'm going to set a watchpoint on small_ints[1]->ob_ival with a little gdb script to dump core right when it changes; hopefully inspecting that backtrace will lead directly to the culprit. -- ___ Python tracker <https://bugs.python.org/issue37168> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37168] Decimal divisions sometimes 10x or 100x too large
Phil Frost added the comment: So this is uminteresting (gdb) p small_ints[0] $2 = (PyIntObject *) 0x558ce65df528 (gdb) p *small_ints[0] $3 = {ob_refcnt = 18, ob_type = 0x7fe019b694c0 , ob_ival = -5} (gdb) p *small_ints[1] $4 = {ob_refcnt = 65, ob_type = 0x7fe019b694c0 , ob_ival = -3} (gdb) p *small_ints[2] $5 = {ob_refcnt = 204, ob_type = 0x7fe019b694c0 , ob_ival = -3} (gdb) p *small_ints[3] $6 = {ob_refcnt = 1872, ob_type = 0x7fe019b694c0 , ob_ival = -2} When is 1/1=10? When -5 + 1 = -3. So now I suppose the question is how this comes to be. -- ___ Python tracker <https://bugs.python.org/issue37168> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37168] Decimal divisions sometimes 10x or 100x too large
Phil Frost added the comment: > Alpine Linux apparently uses musl. Is that well supported and tested? Heh. Not really. Switching to a Debian container is probably just about as much work as building Python from source. Would that be preferable? -- ___ Python tracker <https://bugs.python.org/issue37168> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37168] Decimal divisions sometimes 10x or 100x too large
Phil Frost added the comment: skrah: Yes, that's correct. Since I can only produce this bug in production it will take me some days to build and validate a source build. But absent any better ideas, I will try. tim.peters: I've observed this bug across hundreds of EC2 hosts, in dozens of code paths, with all kinds of inputs. Moreover, the hosts aren't displaying any other symptoms of hardware failure such as random segfaults or mysteriously corrupted data. I've also deeply investigated two cores now which show specifically that `exp` seems to get 2 added when it should have been 1. I have a hard time explaining how a hardware failure can cause precisely the same failure so reliably. So I doubt hardware is to blame. Although, it does seem the issue occurs in "clumps" on individual hosts. So we might go 10 hours without seeing the issue, then it may happen 5 times within 30 minutes on one host. We might observe 1 or 2 more such clumps on the same host until the next deploy of the application, at which point all the containers are replaced with fresh ones. So this suggests there is some ephemeral state within a host that creates a propensity for the issue. I've also been unable to reproduce the problem in a development environment, even when that development environment is using the same kernel, instance class, and docker container as production. So I suspect the bug is precipitated by some particular concurrency or interaction that I haven't been able to replicate. -- ___ Python tracker <https://bugs.python.org/issue37168> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37168] Decimal divisions sometimes 10x or 100x too large
Phil Frost added the comment: I should note that while we do observe this issue with Decimal, my troubleshooting so far indicates the logic in decimal.py is fine. It seems instead the instructions in decimal.py aren't being correctly executed. I'm not sure if the problem is limited to only Decimal, or if we just notice it there because it's correlated with financial code, which when encountering the bug, introduces an error which is eventually caught by the accounting department. There could very well be errors happening at least frequently elsewhere which simply go unnoticed. -- ___ Python tracker <https://bugs.python.org/issue37168> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37168] Decimal divisions sometimes 10x or 100x too large
New submission from Phil Frost : We've observed instances of Decimal divisions being incorrect a few times per day in a web application serving millions of requests daily. I've been unable to reproduce the issue but have been investigating core dumps which suggest either some issue in the C Python implementation or its execution environment. Python 2.7.15 on Alpine Linux Docker containers. A slightly modified decimal.py is attached. It's been modified to check that the answer is approximately correct and dumps core if not. It also has a few local variables added to provide additional insight. I've annotated the file with values from the core dump and my reasoning about what should happen. The crux of the problem is the loop at line 1389. There are 3 ways to determine how many times this loop executed: 1. the number of zeros removed from coeff 2. the number of times exp is incremented 3. the number of times division_counter is incremented Oddly, #1 and #3 indicate the loop happened 27 times, while #2 indicates the loop happened 28 times. One possible explanation (which makes about as much sense as any other) is that `exp += 1` sometimes adds 2 instead of 1. This means the loop happens 1 time fewer than it should, leaving `coeff` 10 times bigger than it should properly be. Or (very rarely) this happens twice and the result is 100 times bigger. I find it very odd that something as basic as `+= 1` should not work, but at the moment it is the best explanation I have. Unfortunately I can't share the core dump, as I've only been able to observe the issue in production so the core contains private information. But I'd welcome any ideas for further exploration. Or perhaps someone is aware of an existing bug that would explain this behavior. -- components: Interpreter Core files: decimal.py messages: 344772 nosy: Phil Frost priority: normal severity: normal status: open title: Decimal divisions sometimes 10x or 100x too large versions: Python 2.7 Added file: https://bugs.python.org/file48396/decimal.py ___ Python tracker <https://bugs.python.org/issue37168> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com