Re: Correct parsers for bounded integral values

Viktor Dukhovni Sun, 20 Jul 2025 13:09:17 -0700

On Sun, Jul 20, 2025 at 09:12:20PM +0200, Stefan Klinger wrote:

> I'd like to bring to your attention a discussion that I have started
> over at Haskell-cafe [1].  I was complaining about the silent overflow
> of parsers for bounded integers:
> 
>     > read "298" :: Word8
>     42


FWIW, there haven't AFAIK been any complaints about ByteString's readInt,
readWord, readInteger, readNatural and the various sized variants having
overflow checks.  But these have always been more like `reads` than
`read`, returning `Maybe (a, ByteString)`, so they are perhaps somewhat
more oriented towards detecting unexpected excess input and, for some
time now, range overflow.  So there's some precedent for overflow
checking, but...
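
To make the `reads`-like shape concrete, here is a rough sketch using
`readInt` from Data.ByteString.Char8.  The names `sizeField`, `tooBig`
and `noDigits` are made up for illustration, and the overflow result
assumes a bytestring release that performs the range check (0.11 or
later, as far as I know).

```haskell
import qualified Data.ByteString.Char8 as BC

-- Leading digits are consumed and the unparsed remainder is returned,
-- so the caller can decide whether trailing input is acceptable.
sizeField :: Maybe (Int, BC.ByteString)
sizeField = BC.readInt (BC.pack "1000000 BODY=8BITMIME")

-- Far too large for a 64-bit Int.  Assumption: with a range-checking
-- bytestring release this is Nothing rather than a wrapped value.
tooBig :: Maybe (Int, BC.ByteString)
tooBig = BC.readInt (BC.pack "99999999999999999999999")

-- Input with no leading digits is also a failure.
noDigits :: Maybe (Int, BC.ByteString)
noDigits = BC.readInt (BC.pack "abc")
```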

It is also fair to point out that once an Int or other bounded integral
type is read, arithmetic with that type (addition, subtraction and
multiplication) silently overflows.  And so silent overflow in `read`
is not inconsistent with the type's semantics.
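
A small illustration of that wrapping behaviour, as observed with GHC
(the value names are only for this example):

```haskell
import Data.Word (Word8)

-- Addition wraps modulo 2^8: 200 + 100 = 300, and 300 - 256 = 44.
sumW8 :: Word8
sumW8 = 200 + 100

-- Stepping past the largest Int wraps around to the smallest one.
succMaxInt :: Int
succMaxInt = maxBound + 1

-- `read` wraps the same way: 298 - 256 = 42, as in the quoted example.
readW8 :: Word8
readW8 = read "298"
```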

If converting strings to numbers is in support of string-oriented
network protocols (e.g. the SIZE ESMTP extension), then one really
should make an effort to avoid silent overflow, but in that context the
various ByteString read methods are already available.

That said, if various middleware libraries hide overflows, because under
the covers they're using `read`, that could be a problem, so we do want
the ecosystem at large to make sensible choices about when silent
overflow may or may not be appropriate.  Perhaps that means having
both wrapping and overflow-checked implementations available, and
clear docs with each about its behaviour and the corresponding
alternative.
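
As a minimal sketch of what such a pair could look like using only
base: `readWrapping` and `readChecked` are hypothetical names, not
existing library functions.  The checked variant goes through Integer
and `toIntegralSized`, which is simple but not tuned for speed, and it
accepts whatever Haskell lexical syntax `readMaybe` accepts.

```haskell
import Data.Bits (Bits, toIntegralSized)
import Data.Word (Word8)
import Text.Read (readMaybe)

-- Wrapping variant: plain `read`, which silently reduces out-of-range
-- input modulo the size of the target type.
readWrapping :: Read a => String -> a
readWrapping = read

-- Checked variant (hypothetical helper): parse as an unbounded Integer,
-- then narrow, returning Nothing when the value does not fit.
readChecked :: (Integral a, Bits a) => String -> Maybe a
readChecked s = do
    n <- readMaybe s :: Maybe Integer
    toIntegralSized n

wrapped :: Word8
wrapped = readWrapping "298"

checked :: Maybe Word8
checked = readChecked "298"
```

Here `wrapped` silently becomes 42 while `checked` is Nothing, so the
caller chooses the behaviour explicitly.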

> I find this unsatisfying, and I have demonstrated a solution [2] that
> seems correct and performant.

A few quick observations about [2]:

    - It disallows an explicit leading "+" (just like `read`, but
      perhaps that should be tolerated).

    - It disallows multiple leading zeros, perhaps these should be
      tolerated.

    - It disallows "-0"; perhaps this should be tolerated, as well
      as "-0000", "-000001", ...  (With lazy ByteStrings, whose input
      might never terminate, there is a generous but sensible limit on
      the number of leading zeros allowed.)

    - One way to avoid difficulties with handling negative minBound is
      to parse signed values via the corresponding unsigned type, which
      can accommodate `-minBound` as a positive value, and then negate
      the final result.  This makes it possible to share the low-level
      digit-by-digit code between the positive and negative cases.
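
The last point can be sketched roughly as follows, for Int8 parsed via
Word8.  This is not the code from [2] or from bytestring; `digitsUpTo`
and `parseInt8` are hypothetical names, and the sketch also tolerates a
leading "+", leading zeros and "-0" as suggested above.

```haskell
import Data.Char (isDigit, ord)
import Data.Int (Int8)
import Data.Word (Word8)

-- Shared digit loop: accumulate ASCII decimal digits into a Word8,
-- failing if the value would exceed `limit`.  Returns the value and
-- the unconsumed input.
digitsUpTo :: Word8 -> String -> Maybe (Word8, String)
digitsUpTo limit s0 = case span isDigit s0 of
    ([], _)    -> Nothing                  -- no digits at all
    (ds, rest) -> fmap (\n -> (n, rest)) (go 0 ds)
  where
    (q, r) = limit `quotRem` 10
    go acc []     = Just acc
    go acc (c:cs)
      -- Equivalent to acc * 10 + d > limit, but tested before the
      -- multiplication so the accumulator itself never wraps.
      | acc > q || (acc == q && d > r) = Nothing
      | otherwise                      = go (acc * 10 + d) cs
      where d = fromIntegral (ord c - ord '0')

-- The magnitude of minBound (128) fits in Word8, so the negative case
-- reuses the same loop with a limit that is larger by one.
parseInt8 :: String -> Maybe (Int8, String)
parseInt8 ('-':s) = do
    (n, rest) <- digitsUpTo 128 s
    -- Negate in the unsigned type, then reinterpret as two's complement.
    Just (fromIntegral (negate n), rest)
parseInt8 ('+':s) = parsePositive s
parseInt8 s       = parsePositive s

parsePositive :: String -> Maybe (Int8, String)
parsePositive s = do
    (n, rest) <- digitsUpTo 127 s
    Just (fromIntegral n, rest)
```

With this, "-128" parses to minBound while "-129" and "128" are
rejected, and only the limit differs between the two signs.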

If parsing of Integer and Natural is also in scope, I would expect that
it avoids doing multi-precision arithmetic for each digit, instead
parsing groups of digits into ~Word-sized blocks and merging the blocks
hierarchically with only a logarithmic number of MP multiplies.
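
A rough sketch of that blocking idea, on String input for brevity.  The
names (`parseDigits`, `chunksLE`, `merge`) and the block size of 18
digits are my own choices for illustration, not how any particular
library does it.  Each block is parsed with plain Word64 arithmetic,
and adjacent blocks are then combined in pairwise rounds, squaring the
base after each round.

```haskell
import Data.Char (isDigit, ord)
import Data.List (foldl')
import Data.Word (Word64)

-- 18 decimal digits always fit in a Word64 (10^18 < 2^64).
chunkDigits :: Int
chunkDigits = 18

-- Parse a non-empty string of ASCII digits into an Integer.
parseDigits :: String -> Maybe Integer
parseDigits ds
  | null ds || not (all isDigit ds) = Nothing
  | otherwise = Just (merge (10 ^ chunkDigits) (map block (chunksLE ds)))

-- Split into blocks of chunkDigits digits, least significant block
-- first; only the most significant block may be shorter.
chunksLE :: String -> [String]
chunksLE = map reverse . chunksOf . reverse
  where
    chunksOf [] = []
    chunksOf xs = let (h, t) = splitAt chunkDigits xs in h : chunksOf t

-- Machine-word arithmetic only: no multi-precision work per digit.
block :: String -> Integer
block = toInteger . foldl' step (0 :: Word64)
  where step acc c = acc * 10 + fromIntegral (ord c - ord '0')

-- Combine neighbouring blocks (low, high) into low + high * base,
-- then repeat with base^2 until a single value remains.
merge :: Integer -> [Integer] -> Integer
merge _    []  = 0
merge _    [x] = x
merge base xs  = merge (base * base) (pairUp xs)
  where
    pairUp (lo:hi:rest) = (lo + hi * base) : pairUp rest
    pairUp rest         = rest
```

The list is kept least-significant-first so that an odd block left
over at the end of a round is always the most significant one and
needs no adjustment.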

-- 
    Viktor.
