I'd like to improve the code below, which works. It feels clunky to me.

I need to clean up user-uploaded files whose size I don't know in advance.

After cleaning, they might be as big as 1 MB, but that would be very rare, perhaps only in testing.

I'm extracting CAS numbers, which follow the pattern xx-xx-x up to xxxxxxx-xx-x, e.g. 1012300-77-4.

import re


def remove_alpha(txt):
    r"""Strip everything except digits, hyphens and spaces, then drop short clumps.

    The pattern r'[^0-9\- ]' breaks down as:

    [^...]: Match any character that is not in the specified set.
    0-9: Match any digit.
    \: Escape character.
    -: Match a hyphen.
    Space: Match a space.
    """
    cleaned_txt = re.sub(r'[^0-9\- ]', '', txt)
    bits = cleaned_txt.split()
    pieces = []
    for bit in bits:
        # minimum size of a CAS number is 7 so drop smaller clumps of digits
        pieces.append(bit if len(bit) > 6 else "")
    return " ".join(pieces)
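
For illustration, the xx-xx-x up to xxxxxxx-xx-x pattern described above can also be matched directly with a single regular expression. This is only a minimal sketch, not what the code above does: CAS_PATTERN and find_cas_numbers are hypothetical names, and it assumes a CAS number is never directly adjacent to another digit or hyphen in the uploaded text.

```python
import re

# Hypothetical pattern: 2 to 7 digits, then 2 digits, then 1 digit,
# separated by hyphens. The lookarounds stop a match from starting or
# ending inside a longer run of digits or hyphens.
CAS_PATTERN = re.compile(r"(?<![\d-])\d{2,7}-\d{2}-\d(?![\d-])")


def find_cas_numbers(txt):
    """Return all CAS-shaped substrings of txt, in order of appearance."""
    return CAS_PATTERN.findall(txt)


print(find_cas_numbers("Acetone (67-64-1), water 7732-18-5, ref 1012300-77-4."))
```

Unlike remove_alpha, this sketch does not join digits that are separated by letters, so it only finds CAS numbers that already appear intact in the text.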


Many thanks for any hints

Cheers

Mike
--
https://mail.python.org/mailman/listinfo/python-list
