On 2023-11-17 01:15, Mike Dewhirst via Python-list wrote:
On 15/11/2023 3:08 pm, MRAB via Python-list wrote:
On 2023-11-15 03:41, Mike Dewhirst via Python-list wrote:
On 15/11/2023 10:25 am, MRAB via Python-list wrote:
On 2023-11-14 23:14, Mike Dewhirst via Python-list wrote:
I'd like to improve the code below, which works. It feels clunky to me.
I need to clean up user-uploaded files whose size I don't know in advance.
After cleaning they might be as big as 1 MB, but that would be super rare.
Perhaps only for testing.
I'm extracting CAS numbers and here is the pattern: xx-xx-x up to
xxxxxxx-xx-x, e.g. 1012300-77-4
import re

def remove_alpha(txt):
    r""" r'[^0-9\- ]':
    [^...]: Match any character that is not in the specified set.
    0-9: Match any digit.
    \: Escape character.
    -: Match a hyphen.
    Space: Match a space.
    """
    # strip everything except digits, hyphens and spaces
    cleaned_txt = re.sub(r'[^0-9\- ]', '', txt)
    bits = cleaned_txt.split()
    pieces = []
    for bit in bits:
        # minimum size of a CAS number is 7 so drop smaller clumps of digits
        pieces.append(bit if len(bit) > 6 else "")
    return " ".join(pieces)
Many thanks for any hints
Why don't you use re.findall?
re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)
I think I can see what you did there but it won't make sense to me - or
whoever looks at the code - in future.
That answers your specific question. However, I am in awe of people who
can just "do" regular expressions and I thank you very much for what
would have been a monumental effort had I tried it.
That little re.sub() came from ChatGPT and I can understand it without
too much effort because it came documented.
I suppose ChatGPT is the answer to this thread. Or everything. Or
will be.
\b          Word boundary
[0-9]{2,7}  2..7 digits
-           "-"
[0-9]{2}    2 digits
-           "-"
[0-9]{2}    2 digits
\b          Word boundary
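One way to keep that breakdown next to the code, so it still makes sense later, is to write the pattern with re.VERBOSE and put the notes in as comments. This is only a sketch: the names CAS_PATTERN and find_cas_numbers are made up for illustration, and the last group is assumed to be a single digit, following the xx-xx-x format and the 1012300-77-4 example given at the top of the thread (the pattern as posted has [0-9]{2} there).

```python
import re

# Assumption: the final group is one digit, per the xx-xx-x format stated
# at the top of the thread. The pattern as posted uses [0-9]{2} instead.
CAS_PATTERN = re.compile(
    r"""
    \b            # word boundary
    [0-9]{2,7}    # 2..7 digits
    -             # "-"
    [0-9]{2}      # 2 digits
    -             # "-"
    [0-9]         # 1 digit (assumed, see note above)
    \b            # word boundary
    """,
    re.VERBOSE,
)

def find_cas_numbers(txt):
    """Return every CAS-like number found in txt, in order of appearance."""
    return CAS_PATTERN.findall(txt)
```

For example, find_cas_numbers("Water 7732-18-5 and 1012300-77-4.") should give ['7732-18-5', '1012300-77-4'].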
The "word boundary" thing is to stop it matching where there are
letters or digits right next to the digits.
For example, if the text contained, say, "123456789-12-1234", you
wouldn't want it to match because there are more than 7 digits at the
start and more than 2 digits at the end.
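To see what the word boundaries buy you, here is a small comparison of the same pattern with and without \b. It is a sketch under one assumption: the last group is taken to be a single digit, as in the xx-xx-x format and the 1012300-77-4 example from the original question, rather than the [0-9]{2} of the posted pattern. The names WITH_BOUNDARIES, WITHOUT_BOUNDARIES and sample are made up for the demo.

```python
import re

# Assumption: single final digit, per the xx-xx-x format in the question.
WITH_BOUNDARIES = r"\b[0-9]{2,7}-[0-9]{2}-[0-9]\b"
WITHOUT_BOUNDARIES = r"[0-9]{2,7}-[0-9]{2}-[0-9]"

sample = "good 1012300-77-4, bad 123456789-12-1234"

# With \b the over-long clump is rejected outright.
strict = re.findall(WITH_BOUNDARIES, sample)

# Without \b the regex happily matches a slice from the middle of the clump.
loose = re.findall(WITHOUT_BOUNDARIES, sample)

print(strict)  # ['1012300-77-4']
print(loose)   # ['1012300-77-4', '3456789-12-1']
```

The second result shows the failure mode: without the boundaries, a plausible-looking but bogus number is carved out of a longer run of digits.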
Thanks
I know I should invest some brainspace in re. Many years ago at a Perl
conference I did buy a coffee mug completely covered with a regex cheat
sheet. It currently holds pens and pencils on my desk. And spiders, now I
look closely!
Then I took up Python and re is different.
Maybe I'll have another look ...
The patterns themselves aren't that different; Perl's regexes just have
more features than the re module's.