Re: Correct handling of case in unicode and regexps

MRAB Sat, 23 Feb 2013 12:26:49 -0800

On 2013-02-23 18:57, Devin Jeanpierre wrote:

On Sat, Feb 23, 2013 at 1:12 PM, MRAB <[email protected]>
wrote:

The basic rule is that a series of characters in the regex must
match a series of characters in the text, with no partial matches
in either.


For example, 'ss' can match 'ß', but 's' can't match 'ß' because
that would be matching part of 'ß'.

In a regex like 's+', you're asking it to match one or more
repetitions of 's', but that would mean that 's' would have to
match part of 'ß' in the first iteration and the remainder of 'ß'
in the second iteration.


That makes sense. I'll have to think about this and run some tests
through regex, as well.

Although it's theoretically possible to do that, the code is
already difficult enough. The cost outweighs the potential
benefit.

If you'd like to have a go at implementing it, the code _is_ open
source. :-)


Actually, the reason it's relevant to me is that I'm reimplementing
the re module using a more automata theoretic approach (it's my
second attack at the problem). Also, I've read the _sre source code
and it's unpleasant. Is regex much better?

I like to think so! ;-)

Part of the problem may be that it also supports fuzzy (approximate)
matching.

At least the way I'm planning on going about it, supporting this is
easier, as long as one can figure out what it means to match halfway
inside a ß. Since case folding is a homomorphism*, I can case fold
the regex** and case fold the input and then I'm done. Case folding
of the input can be done character by character, and to emulate the
regex module behavior I'd need to check at certain places whether or
not I'm in the middle of a casefolding expansion, and fail if so. On
the other hand, if I don't emulate the regex module's behavior in at
least some cases, I'd need to figure out what the value of a match
of 's' against 'ß' would be.

It seems like a reasonable restriction to me that start and end
positions should be integers, i.e. forbid partial matches within a
character.

This means that matching '(ss)' against 'ß' would be OK, as would '(s+)'
against 'ß', but '(s)' or '(s)(s)' against 'ß' would not, otherwise you
could get:

match('(s)', 'ß').span(0)

(0, 0.5)

match('(s)(s)', 'ß').span(0, 1, 2)

((0, 1), (0, 0.5), (0.5, 1)).

[*]  i.e. it can be done character by character (see Unicode 3.13
Default Case Algorithms) [**] Not as trivial as it sounds, but still
 easy. [ßa-z] goes to e.g. [a-z]|ss (not [ssa-z]).

--
http://mail.python.org/mailman/listinfo/python-list

Re: Correct handling of case in unicode and regexps

Reply via email to