Re: Regular expression to capture model numbers

2009-04-24 Thread Piet van Oostrum
> John Machin  (JM) wrote:

>JM> On Apr 24, 1:29 am, Piet van Oostrum  wrote:

>>> obj = re.compile(r'(?:[a-z]+[-0-9]|[0-9]+[-a-z]|-+[0-9a-z])[-0-9a-z]*', re.I)

>JM> Understandable and maintainable, I don't think. Suppose that instead
>JM> the first character is limited to being alphabetic. You have to go
>JM> through the whole process of elaborating the possibilities again, and I
>JM> don't consider that process qualifies as "express[ing] complicated
>JM> conditions like that".

No, I don't think regular expressions are the best tool for this kind
of test. I just wanted to show that it *could* be done. By the way,
your additional hypothetical requirement that the first character should
be alphabetic just makes it easier: only the first alternative remains.
On the other hand, suppose you required that the pattern must not end
in a hyphen: then it becomes even uglier. And if there must never be two
hyphens in a row, I wouldn't even think of using a re, although
theoretically it would be possible.

Translating these requirements into re's is not `composable'.
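
For comparison, here is a minimal sketch of how such rules compose when
they are checked in code instead of being folded into one pattern. The
names (ALLOWED, RULES, is_model_number and the rule functions) are
illustrative and not from the thread, and the rule set is simply the
hypothetical requirements mentioned above:

```python
import re
import string

# The regular expression only checks the allowed alphabet.
ALLOWED = re.compile(r'[-0-9A-Za-z]+')

def has_two_classes(text):
    """True if at least two of letters, digits and hyphen occur."""
    return (any(c in string.ascii_letters for c in text)
            + any(c in string.digits for c in text)
            + ('-' in text)) >= 2

def no_trailing_hyphen(text):
    return not text.endswith('-')

def no_double_hyphen(text):
    return '--' not in text

# Adding or dropping a requirement means editing this list only.
RULES = [has_two_classes, no_trailing_hyphen, no_double_hyphen]

def is_model_number(text, rules=RULES):
    """Check the alphabet first, then every rule independently."""
    return ALLOWED.fullmatch(text) is not None and all(
        rule(text) for rule in rules)

print(is_model_number('Test123AB-x'))  # True
print(is_model_number('A123-'))        # False: ends in a hyphen
print(is_model_number('A--1'))         # False: two hyphens in a row
```

Each requirement is a separate function, so changing one does not force
the others to be rederived.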
-- 
Piet van Oostrum 
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org
--
http://mail.python.org/mailman/listinfo/python-list


Re: Regular expression to capture model numbers

2009-04-23 Thread John Machin
On Apr 24, 1:29 am, Piet van Oostrum  wrote:
> > John Machin  (JM) wrote:
> >JM> On Apr 23, 8:01 am, krishnaposti...@gmail.com wrote:
> >>> Requirements:
> >>>   The text must contain a combination of numbers, alphabets and hyphen
> >>> with at least two of the three elements present.
> >JM> Unfortunately(?), regular expressions can't express complicated
> >JM> conditions like that.
>
> Yes, they can but it is not pretty.
>
> The pattern must start with a letter, a digit or a hyphen.
>
> If it starts with a letter, for example, there must be at least a hyphen
> or a digit somewhere. So let us concentrate on the first one of these
> that occurs in the string. Then the preceding things are only letters
> and after it can be any combination of letters, digits and hyphens. So
> the pattern for this is (when we write L for letters, and d for digits):
>
> L+[-d][-Ld]*.
>
> Similarly for strings starting with a digit and with a hyphen. Now
> replacing L with [A-Za-z] and d with [0-9] or \d and factoring out the
> [-Ld]* which is common to all 3 cases you get:
>
> (?:[A-Za-z]+[-0-9]|[0-9]+[-A-Za-z]|-+[0-9A-Za-z])[-0-9A-Za-z]*
>
> >>> obj = re.compile(r'(?:[A-Za-z]+[-0-9]|[0-9]+[-A-Za-z]|-+[0-9A-Za-z])[-0-9A-Za-z]*')
> >>> re.findall(obj, 'TestThis;1234;Test123AB-x')
>
> ['Test123AB-x']
>
> Or you can use re.I and mention only one case of letters:
>
> obj = re.compile(r'(?:[a-z]+[-0-9]|[0-9]+[-a-z]|-+[0-9a-z])[-0-9a-z]*', re.I)

Understandable and maintainable, I don't think. Suppose that instead
the first character is limited to being alphabetic. You have to go
through the whole process of elaborating the possibilities again, and I
don't consider that process qualifies as "express[ing] complicated
conditions like that".



Re: Regular expression to capture model numbers

2009-04-23 Thread Piet van Oostrum
> John Machin  (JM) wrote:

>JM> On Apr 23, 8:01 am, krishnaposti...@gmail.com wrote:

>>> Requirements:
>>>   The text must contain a combination of numbers, alphabets and hyphen
>>> with at least two of the three elements present.

>JM> Unfortunately(?), regular expressions can't express complicated
>JM> conditions like that.

Yes, they can but it is not pretty.

The pattern must start with a letter, a digit or a hyphen. 

If it starts with a letter, for example, there must be at least a hyphen
or a digit somewhere. So let us concentrate on the first one of these
that occurs in the string. Then the preceding things are only letters
and after it can be any combination of letters, digits and hyphens. So
the pattern for this is (when we write L for letters, and d for digits):

L+[-d][-Ld]*.

Similarly for strings starting with a digit and with a hyphen. Now
replacing L with [A-Za-z] and d with [0-9] or \d and factoring out the
[-Ld]* which is common to all 3 cases you get:

(?:[A-Za-z]+[-0-9]|[0-9]+[-A-Za-z]|-+[0-9A-Za-z])[-0-9A-Za-z]*

>>> obj = re.compile(r'(?:[A-Za-z]+[-0-9]|[0-9]+[-A-Za-z]|-+[0-9A-Za-z])[-0-9A-Za-z]*')
>>> re.findall(obj, 'TestThis;1234;Test123AB-x')
['Test123AB-x']

Or you can use re.I and mention only one case of letters:

obj = re.compile(r'(?:[a-z]+[-0-9]|[0-9]+[-a-z]|-+[0-9a-z])[-0-9a-z]*', re.I)
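
One way to gain some confidence in the derivation is to compare the
pattern against a direct count of character classes for every short
string over a small alphabet. This is only a sketch: the helper names
and the four-character test alphabet are chosen for illustration and do
not come from the thread.

```python
import itertools
import re

PATTERN = re.compile(
    r'(?:[a-z]+[-0-9]|[0-9]+[-a-z]|-+[0-9a-z])[-0-9a-z]*', re.I)

def class_count(text):
    """Count how many of letters, digits and hyphen occur in text."""
    return (any(c in 'aZ' for c in text)
            + any(c in '9' for c in text)
            + ('-' in text))

def agrees(text):
    """True if a full match happens exactly when two classes occur."""
    return (PATTERN.fullmatch(text) is not None) == (class_count(text) >= 2)

# Every string of length 1 to 5 over a letter in each case, a digit
# and a hyphen (1364 strings in total).
ALPHABET = 'aZ9-'
mismatches = [
    ''.join(chars)
    for length in range(1, 6)
    for chars in itertools.product(ALPHABET, repeat=length)
    if not agrees(''.join(chars))
]
print(mismatches)                                    # []
print(PATTERN.findall('TestThis;1234;Test123AB-x'))  # ['Test123AB-x']
```

An empty list of mismatches is not a proof, but it does exercise all
three alternatives of the pattern.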
-- 
Piet van Oostrum 
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org


Re: Regular expression to capture model numbers

2009-04-22 Thread Aahz
In article <1bbafe6d-e3bc-4d90-8a0a-0ca82808b...@d14g2000yql.googlegroups.com>,
  wrote:
>
>My quick attempt is below:
>obj = re.compile(r'\b[0-9|a-zA-Z]+[\w-]+')
>>>> re.findall(obj, 'TestThis;1234;Test123AB-x')
>['TestThis', '1234', 'Test123AB-x']
>
>This is not working.

What isn't working?  Why not just split() on ";"?  You need to define
your problem more precisely if you want us to help.
-- 
Aahz (a...@pythoncraft.com)   <*> http://www.pythoncraft.com/

"If you think it's expensive to hire a professional to do the job, wait
until you hire an amateur."  --Red Adair


Re: Regular expression to capture model numbers

2009-04-22 Thread John Machin
On Apr 23, 8:01 am, krishnaposti...@gmail.com wrote:
> My quick attempt is below:
> obj = re.compile(r'\b[0-9|a-zA-Z]+[\w-]+')

1. Provided the remainder of the pattern is greedy and it will be used
only for findall, the \b seems pointless.

2. What is the "|" for? Inside a character class, | has no special
meaning, and will match a literal "|" character (which isn't part of
your stated requirement).

3. \w will match underscore "_" ... not in your requirement.

4. Re [\w-] : manual says "If you want to include a ']' or a '-'
inside a set, precede it with a backslash, or place it as the first
character" which IIRC is the usual advice given with just about any
regex package -- actually, placing it at the end works but relying on
undocumented behaviour when there are alternatives that are as easy to
use and are documented is not a good habit to get into :-)

5. You have used "+" twice; does this mean a minimum length of 2 is
part of your requirement?
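
Points 2 to 4 can be checked with a few lines. This is a small
illustration with made-up sample strings, not a full test of the
original pattern:

```python
import re

# Point 2: inside a character class, "|" is an ordinary character.
pipe_result = re.findall(r'[0-9|a-zA-Z]+', 'ab|12;cd')
print(pipe_result)        # ['ab|12', 'cd']

# Point 3: \w matches the underscore as well as letters and digits.
underscore_result = re.findall(r'\w+', 'AB_12-x')
print(underscore_result)  # ['AB_12', 'x']

# Point 4: a hyphen placed first in the set, or escaped, is a literal
# hyphen. Spelling out the letters and digits also excludes "_".
hyphen_first = re.findall(r'[-A-Za-z0-9]+', 'AB_12-x')
hyphen_escaped = re.findall(r'[A-Za-z0-9\-]+', 'AB_12-x')
print(hyphen_first)       # ['AB', '12-x']
print(hyphen_escaped)     # ['AB', '12-x']
```

Current Python documentation for the re module also describes a hyphen
placed last in a set as literal, so that form is no longer undocumented.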

>   >>> re.findall(obj, 'TestThis;1234;Test123AB-x')
>
> ['TestThis', '1234', 'Test123AB-x']
>
> This is not working.
>
> Requirements:
>   The text must contain a combination of numbers, alphabets and hyphen
> with at least two of the three elements present.

Unfortunately(?), regular expressions can't express complicated
conditions like that.

> I can use it to set
> min length using ) {}

I presume that you mean enforcing a minimum length of (say) 4 by using
{4,} in the pattern ...

You are already faced with the necessity of filtering out unwanted
matches programmatically. You might as well leave the length check
until then.

So: first let's establish what the pattern should be, ignoring the "2
or more out of 3 classes" rule and the length rule.

First character: Digits? Maybe not. Hyphen? Probably not.
Last character: Hyphen? Probably not.
Other characters: Any of (ASCII) letters, digits, hyphen.

So based on my guesses for answers to the above questions, the pattern
should be r"[A-Za-z][-A-Za-z0-9]*[A-Za-z0-9]"

Note: this assumes that your data is impeccably clean, and there isn't
any such data outside textbooks. You may wish to make the pattern less
restrictive, so that you can pick up probable mistakes like
"A123- 456" instead of "A123-456".

Checking a candidate returned by findall could be done something like
this:

# initial setup:
import string
alpha_set = set(string.ascii_letters)
digit_set = set('1234567890')
min_len = 4 # for example

# each candidate:
cand_set = set(cand)
ok = len(cand) >= min_len and (
   bool(cand_set & alpha_set)
   + bool(cand_set & digit_set)
   + bool('-' in cand_set)
   ) >= 2
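
Put together with the guessed pattern from above, the whole thing might
look like the sketch below. The function names and the min_len default
are assumptions for illustration; the pattern and the set-based check
are the ones described in this message.

```python
import re
import string

# Guessed pattern: starts with a letter, ends with a letter or digit.
CANDIDATE = re.compile(r'[A-Za-z][-A-Za-z0-9]*[A-Za-z0-9]')
ALPHA_SET = set(string.ascii_letters)
DIGIT_SET = set(string.digits)

def is_acceptable(cand, min_len=4):
    """Apply the length rule and the "2 or more out of 3 classes" rule."""
    cand_set = set(cand)
    classes = (bool(cand_set & ALPHA_SET)
               + bool(cand_set & DIGIT_SET)
               + ('-' in cand_set))
    return len(cand) >= min_len and classes >= 2

def model_numbers(text, min_len=4):
    """Find candidates with the pattern, then filter them in code."""
    return [cand for cand in CANDIDATE.findall(text)
            if is_acceptable(cand, min_len)]

print(CANDIDATE.findall('TestThis;1234;Test123AB-x'))
# ['TestThis', 'Test123AB-x']
print(model_numbers('TestThis;1234;Test123AB-x'))
# ['Test123AB-x']
```

Note that findall can start a match in the middle of a token: with this
pattern, '12AB-3' yields the candidate 'AB-3'. Whether that is wanted
depends on the answers to the first-character questions above.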

HTH,
John