* Michael Ludwig <michael.lud...@xing.com> [2010-04-08 09:25]:
> >since upgrading a string increases memory consumption and can
> >significantly slow down regex matches against it.
>
> Is it some copying behind the scenes that increases memory
> consumption?

Just the simple fact that some characters take multiple bytes to
encode in the UTF8-based format.

> Why does that have the potential to significantly slow down
> regex matches?

Because one byte and one character are no longer the same thing,
so if you know you want the 17th character in the string, you
can’t say where in memory that is. You have to scan the string.

This sort of access pattern is rare in practice – most operations
either just copy the entire string or scan over it one character
at a time. But the regex engine is one of those things that
sometimes needs to jump around in the string rather than merely
scanning linearly. (Perl’s regex engine does some caching to
avoid the worst penalties with this, but that in itself also
causes slowdown, so there’s a balance to strike.)

> Does that mean that when doing lots of matching, it might be
> preferable to use byte strings and byte semantics, not
> character strings and character semantics?

Almost all of the time the performance cost is negligible and not
worth sweating at the application code level.

Trying to work on text using byte semantics is a recipe for
massive headaches, and an invitation for bugs. It’s doable if you
are careful and disciplined, absolutely. But why punish yourself?
You gain little, at significant effort.

On 5.12, though, you can get a tiny potential improvement en
passant, with basically zero effort. In that case – and only in
that case: why not? The gain is small, but so is the cost.

In the other direction, that doesn’t translate. Don’t go
micro-optimising your code for this.

> >Under older perls, it’s a question of getting the wrong
> >results in less time and memory, so there’s not an option.
>
> Wrong results? Could you clarify? Thanks :-)

Well, you get Latin-1 semantics, e.g. upper-/lowercasing will
ignore accented characters that fall outside the Latin-1 charset.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>
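
P.S. A minimal sketch of the two points about upgraded strings made
above: the extra memory, and the need to scan to find the Nth
character. The sample string and the helper byte_offset_of_char are
made up for illustration, and the helper assumes well-formed UTF-8
input. It only imitates the kind of scan involved and is not how
perl itself does it.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length without enabling byte semantics

# Sample string (an assumption for this sketch): 5 characters, one of
# them "é" (code point 0xE9), which needs two bytes once upgraded.
my $native = "caf\x{e9}s";
utf8::downgrade($native);    # internal format: one byte per character

my $upgraded = $native;
utf8::upgrade($upgraded);    # internal format: UTF8-based

# Both copies are the same string as far as character semantics go ...
print length($native), " ", length($upgraded), "\n";                  # 5 5
# ... but the upgraded copy occupies more bytes internally.
print bytes::length($native), " ", bytes::length($upgraded), "\n";    # 5 6

# Hypothetical helper: byte offset of the character at 0-based index $n
# in a buffer of UTF-8 octets. It has to walk every preceding character,
# because each one may be 1 to 4 bytes long.
sub byte_offset_of_char {
    my ($octets, $n) = @_;
    my $pos = 0;
    for (1 .. $n) {
        my $lead = ord substr $octets, $pos, 1;
        $pos += $lead < 0x80 ? 1
              : $lead < 0xE0 ? 2
              : $lead < 0xF0 ? 3
              :                4;
    }
    return $pos;
}

my $octets = "caf\x{e9}s";
utf8::encode($octets);       # now a plain byte string of UTF-8 octets

# Character 4 ("s") sits at byte 5, because "é" took two bytes.
print byte_offset_of_char($octets, 4), "\n";                          # 5
```

With one byte per character the offset of character N is simply N.
With the variable-width format it costs a scan over everything before
it, which is what hurts code that jumps around in the string.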