On Tue, Feb 24, 2026 at 16:21, youkidearitai <[email protected]> wrote:
>
> On Tue, Feb 24, 2026 at 11:38, Kentaro Takeda <[email protected]> wrote:
> >
> > Hi Yuya,
> >
> > I think this is a good idea. While spec compliance is generally desirable, 
> > DoS via unbounded grapheme clusters is a real threat, and it's reasonable 
> > for a language-level implementation to impose practical limits that the 
> > Unicode spec itself doesn't define. This kind of gap between a 
> > general-purpose spec and a concrete implementation is not unusual.
> >
> > The default of 32 code points sounds sensible given that natural language 
> > grapheme clusters top out well below that.
> >
> > One minor note: it might help to clarify the intended behavior of 
> > `grapheme_limit_codepoints` a bit more — for instance, whether it is meant 
> > as a validation check (returning false when a cluster exceeds the limit) or 
> > something else.
> >
> > Regards,
> > Kentaro Takeda
> >
> >
> > On Mon, Feb 23, 2026 at 20:28, youkidearitai <[email protected]> wrote:
> >>
> >> Hi, Internals
> >>
> >> I noticed that UAX #29 places no limit on the number of code points in a grapheme cluster.
> >> https://www.unicode.org/reports/tr29/
> >>
> >> An ICU issue also confirmed that Unicode defines no such limit.
> >> https://unicode-org.atlassian.net/browse/ICU-23302
> >>
> >> This means a single grapheme cluster can contain an arbitrary number of code points,
> >> which can crash a program because computer resources are limited.
> >>
> >> For example, this command writes about 200 MB to emoji_bomb.txt, all of it a single grapheme cluster:
> >> ```
> >> php -r 'echo(mb_trim(str_repeat("\u{200d}\u{1f468}\u{200d}\u{1f466}\u{200d}\u{1f466}", 10000000), "\u{200d}"));' -d memory_limit=600M > emoji_bomb.txt
> >> ```
> >> (PLEASE BE CAREFUL WHEN OPENING emoji_bomb.txt BECAUSE IT MAY CRASH THE PROGRAM)
> >>
> >> So I think we (php-src, at the programming-language level) need to add a new
> >> function that enforces a custom limit.
> >> My idea is below:
> >>
> >> ```
> >> grapheme_limit_codepoints(string $str, int $max_codepoints = 32): bool
> >> ```
> >>
> >> I don't have a strong opinion on 32 as the default for $max_codepoints.
> >> However, 32 code points should be enough for a grapheme cluster, because
> >> the longest one in a human language is probably Hakṣhmalawarayaṁ(ཧ), at
> >> 9 code points.
> >>
> >> If more code points per grapheme cluster are needed,
> >> userland can increase $max_codepoints.
> >>
> >> Please also see my Speaker Deck slides.
> >> https://speakerdeck.com/youkidearitai/limit-of-code-point-for-grapheme-cluster
> >>
> >> What do you think about this idea?
> >>
> >> Regards
> >> Yuya
> >>
> >> --
> >> ---------------------------
> >> Yuya Hamada (tekimen)
> >> - https://tekitoh-memdhoi.info
> >> - https://github.com/youkidearitai
> >> -----------------------------
>
> Hi, Kentaro
>
> Thank you very much for your feedback.
>
> > One minor note: it might help to clarify the intended behavior of 
> > `grapheme_limit_codepoints` a bit more — for instance, whether it is meant 
> > as a validation check (returning false when a cluster exceeds the limit) or 
> > something else.
>
> Okay, here is an example.
>
> ```
> // $_POST['text'] holds some input string.
> // Reject input with too many code points in a single grapheme cluster.
> if (grapheme_limit_codepoints($_POST['text'], 32) !== true) {
>    throw new InvalidArgumentException("Found invalid or too many code points in a grapheme cluster");
> }
>
> // Validate the length in grapheme clusters.
> if (grapheme_strlen($_POST['text']) > 100) {
>   throw new InvalidArgumentException("Input is longer than 100 graphemes");
> }
>
> // do anything...
> ```
> The intention is to count graphemes correctly while avoiding DoS.
> I also want to address the problem described in
> https://github.com/symfony/symfony/pull/13527 for the grapheme_strlen
> function.
>
> Feel free to comment further.
> Regards
> Yuya.
>
> --
> ---------------------------
> Yuya Hamada (tekimen)
> - https://tekitoh-memdhoi.info
> - https://github.com/youkidearitai
> -----------------------------

Hi, Internals

I created a PoC and an RFC.
https://github.com/php/php-src/pull/21311
https://wiki.php.net/rfc/grapheme_limit_codepoints

I tried asking Unicode to add a limit on the number of code points per
grapheme cluster to UAX #29.
Unicode may adopt my suggestion if it makes sense to them. However, I
don't know what will happen.

Anyway, I think it makes sense to limit the number of code points per grapheme cluster on the PHP side.
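
To make the intended semantics concrete, here is a minimal userland sketch
of such a check, assuming the intl and mbstring extensions are loaded. The
name grapheme_limit_codepoints_sketch is hypothetical and this is not the
code from the PoC; it only approximates the proposed behavior with
IntlBreakIterator, and returning false for invalid UTF-8 is an assumption.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical userland approximation of the proposed
 * grapheme_limit_codepoints(); not the code from the PoC.
 * Returns true only if no grapheme cluster in $str has more than
 * $maxCodepoints code points.
 */
function grapheme_limit_codepoints_sketch(string $str, int $maxCodepoints = 32): bool
{
    if ($maxCodepoints < 1) {
        throw new ValueError('$maxCodepoints must be at least 1');
    }
    // Assumption: invalid UTF-8 counts as a failed check.
    if (!mb_check_encoding($str, 'UTF-8')) {
        return false;
    }

    // The character instance breaks at grapheme cluster boundaries;
    // the offsets it reports are byte offsets into the UTF-8 string.
    $it = IntlBreakIterator::createCharacterInstance();
    $it->setText($str);

    $start = $it->first();
    while (($end = $it->next()) !== IntlBreakIterator::DONE) {
        // Extract one cluster and count its code points.
        $cluster = substr($str, $start, $end - $start);
        if (mb_strlen($cluster, 'UTF-8') > $maxCodepoints) {
            return false;
        }
        $start = $end;
    }
    return true;
}

// "a" followed by 40 combining acute accents: 41 code points, one cluster.
$long = 'a' . str_repeat("\u{301}", 40);
var_dump(grapheme_limit_codepoints_sketch('hello'));   // bool(true)
var_dump(grapheme_limit_codepoints_sketch($long));     // bool(false)
var_dump(grapheme_limit_codepoints_sketch($long, 64)); // bool(true)
```

Note that this sketch copies each cluster with substr() before counting, so
it does not save memory on an input like emoji_bomb.txt. A native
implementation could presumably stop as soon as the limit is exceeded.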

Feel free to comment.

Regards
Yuya

-- 
---------------------------
Yuya Hamada (tekimen)
- https://tekitoh-memdhoi.info
- https://github.com/youkidearitai
-----------------------------
