Diez B. Roggisch wrote:
> [EMAIL PROTECTED] wrote:
>> Hello
>>
>> I am looking for Python code that takes as input a list of strings
>> (mostly similar, but not necessarily, and rather short: say no longer
>> than 50 chars) and that computes and outputs the Python regular
>> expression that matches these string values (not necessarily strictly;
>> perhaps the code is able to determine patterns, i.e. families of
>> strings...).
>
> For matching the exact set, of course an alternation of the strings can
> be used. However, this is of limited use, because then a simple string
> find will suffice.
>
> In general, this can't be done. It might be possible that the underlying
> structure of the language isn't regular at all - for example, the simple
> language of palindromes isn't.
>
> And even if it were - the search space is just too large to explore. You
> could generate a bazillion matching expressions, but .* will always
> match - so how do you prevent the generator from converging towards that?
>
> If all you have are those strings, you are better off trying to infer
> some structure yourself.
>
> Diez
Here is how regular expressions raise confusion. One starts with a
relatively simple matching problem and before long finds oneself in a
"search space too large to explore", having to formulate the right
expression out of "a bazillion" permutations, in a syntax replete with
potential traps: backslashes, special characters, precedence order, etc.
The fact emerges (once more) that a regular expression does not spare one
the effort of acquiring a crystal-clear conception of what needs to be
done, as an absolute prerequisite of writing code that can in the end be
relied upon to do what it is supposed to do. And a "space too large to
explore" is hardly the working environment in which one should be eager
to acquire that crystal-clear conception. Yet regular expressions seem to
exert a strange attraction and are often sought for problems that don't
really require them. I suspect this is because regular-expression parsers
are made by highly qualified experts, and so they recommend themselves to
the programmer with a less than perfect understanding of a task as a sort
of magic bullet that "does it right" all by itself. Regex coding as a
substitute for problem analysis, so to speak ...
Most OPs are rather sketchy and leave it up to their readers to guess
purpose, requirements, incidentals, environmentals,
restrictions, etc. Accordingly, threads often veer off topic very quickly into
lofty realms of academic interest, the OP never to be
heard from again. So I would suggest that the OP explain what he intends to do
with his regular expression. A lot depends on it.
In the interim I'd like to resize the search space to a manageable
dimension with the working hypothesis that the number of useless
permutations of the kind being discussed is a bazillion minus one,
because, fortunately, there is only one that makes sense. It follows
these two rules: 1. First come, first served. 2. Of two contending
targets, the longer prevails. That is not to say other rules never make
sense. It is to say that in ninety-nine percent of cases these two rules
apply, and that the remaining one percent can also be understood in terms
of these two rules plus exception rules.
Chapter 3.1 of the SE doc visualizes the point by replacing a set of
overlapping targets with their upper-cased selves:

>>> import SE   # third-party stream-editor module the example is built on
>>> overlapping_targets = 'be=BE being=BEING been=BEEN bee=BEE belong=BELONG long=LONG longer=LONGER'
>>> high_strung_bee_story = "There was a bee belonging to hive nine longing to be a beetle and thinking that being a bee was okay, but she had been a bee long enough and wouldn't be one much longer."
>>> SE.SE(overlapping_targets)(high_strung_bee_story)
"There was a BEE BELONGing to hive nine LONGing to BE a BEEtle and thinking that BEING a BEE was okay, but she had BEEN a BEE LONG enough and wouldn't BE one much LONGER."
The matches appear correct, though BELONGing, LONGing and BEEtle
exemplify unintended matches. These could be avoided by refining the
target set, and the partially successful result points out the direction
the refinement has to take. Precedence being unrelated to the order of
the definitions, editing the set is carefree in that it imposes no
precedence considerations. The benefit in terms of modularity is
substantial, as rule sets can be combined (see the sketch below).
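For instance, a minimal sketch of such a refinement, on the assumption
(suggested by the example above) that an SE definition set is just a
whitespace-separated string of target=REPLACEMENT pairs, so that two
sets combine by plain concatenation. The identity rules for 'beetle',
'belonging' and 'longing' are my own illustration, not from the SE doc;
if the longer-target rule works as described, they should shadow the
shorter targets embedded in those words (output omitted):

>>> # hypothetical exception set: map the problem words to themselves
>>> exceptions = 'beetle=beetle belonging=belonging longing=longing'
>>> SE.SE(overlapping_targets + ' ' + exceptions)(high_strung_bee_story)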
Doing the example with a regular expression:

>>> import re
>>> r = re.compile('be|being|been|bee|belong|long|longer')
>>> r.findall(high_strung_bee_story)
['be', 'be', 'long', 'long', 'be', 'be', 'be', 'be', 'be', 'be', 'long', 'be', 'long']
The regular expression also applies rule one, first come, first served,
but applies a different rule two, defining precedence by the order in
which the targets line up in the expression, left to right. To implement
the sensible SE rule, the patterns should be lined up reverse-sorted,
which places the longer ones to the left:
>>> r = re.compile('longer|long|belong|being|been|bee|be')
>>> r.findall(high_strung_bee_story)
['bee', 'belong', 'long', 'be', 'bee', 'being', 'bee', 'been', 'bee', 'long', 'be', 'longer']
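Since the OP starts from a list of strings, the longest-first ordering
need not be done by hand. A minimal sketch (the names 'targets' and
'pattern' are mine; re.escape guards against the backslash and
special-character traps mentioned above):

>>> targets = ['be', 'being', 'been', 'bee', 'belong', 'long', 'longer']
>>> pattern = '|'.join(sorted((re.escape(t) for t in targets), key=len, reverse=True))
>>> pattern
'belong|longer|being|been|long|bee|be'
>>> re.findall(pattern, high_strung_bee_story)
['bee', 'belong', 'long', 'be', 'bee', 'being', 'bee', 'been', 'bee', 'long', 'be', 'longer']

Sorting the alternatives longest-first makes Python's leftmost-first
alternation behave like the SE rule, since no literal can then be
preceded in the pattern by a shorter prefix of itself.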
Which of the two is correct? Without knowing the purpose of the search it
is impossible to say. I expect it to be the second one. But only the OP
knows, and he could help his helpers help by letting them know what he is
up to.
Frederic