Hi Sia, 

Thanks for the problem! I hope you find these examples understandable.

Below, find an inflexible but fairly fast single-digit method and a slower (but 
still respectable) multi-digit method that copes with entirely absent digits 
after +/- and multi-digit skips such as 12 or 37 or 186 following letters 
rather than just 3 or 5 or 9. Not knowing your problem domain I'm not sure if 
these larger skips are likely or even possible, nor whether it's likely that 
there will be no digit.

Neither method flags the edge case where there are not enough letters to skip 
before the next +/-. They just swallow that chunk of the string entirely 
without error. Is that a problem?

Python 2.7 or later.

from string import maketrans
from itertools import takewhile

# function used by takewhile to detect digits
def is_digit(s): return s.isdigit()

class redux:

    def __init__(self):
       intab = '+-'
       outtab = '  '
       self.trantab = maketrans(intab, outtab)


    # simple-minded, fast, 1-digit algorithm
    def reduce_plusminus(self, s):
        list_form = [r[int(r[0]) + 1:] if r[0].isdigit() else r
                    for r
                    in s.translate(self.trantab).split()]
        return ''.join(list_form)

    # multi-digit algorithm
    def reduce_plusminus_multi_digit(self, s):
        chunks = s.translate(self.trantab).split()
        digits = [list(takewhile(is_digit, r))
                   for r
                   in chunks]
        # zero case ('else 0') is for missing digit(s)
        numbers = [int(''.join(r)) if r else 0
                   for r
                    in digits]
        # how far to skip (number and skipped letters)
        skips = [len(dig) + num
                 for dig, num
                 in zip(digits, numbers)]
        return ''.join([sequence[skipover:]
                        for skipover, sequence
                        in zip(skips, chunks)])

if __name__ == "__main__":
    p = redux()
    print (p.reduce_plusminus(".+3ACG.+5CAACG.+3ACG.+3ACG"))
    print (p.reduce_plusminus("tA.-5AG.-2AG,-2ag"))
    print ('multi-digit...')
    print (p.reduce_plusminus_multi_digit(".+3ACG.+5CAACG.+3ACG.+3ACG"))
    print 
(p.reduce_plusminus_multi_digit(".+12ACGACGACGACG.+5CAACG.+3ACG.+3ACG"))
    print 
(p.reduce_plusminus_multi_digit(".+12ACGACGACGACG.+5CAACG.+ACG.+3ACG"))

    for n in range(100000): 
p.reduce_plusminus_multi_digit(".+12ACGACGACGACG.+5CAACG.+3ACG.+3ACG")
    
    
Note that the multi-line version above tolerates missing digits: if the number 
is missing after the '+/-' it doesn't skip any letters. 

Explanation:

String slicing is good in Python, but list comprehensions are often the most 
powerful and efficient way to transform data. If you've come from another 
language or you're a newb to programming, it's well worth getting grips with 
list comprehensions [a for b in lst].

First, +/- are converted to spaces. Splitting the string into a list, breaking 
at spaces, effectively swallows the +/- characters and leaves us with chunks of 
string from which to drop the specified number of letters. Having dropped the 
letters, we convert the transformed list back into a string.

The single-digit version just assumes that the number is always present and 
always single-digit. If you allow multi-digits be ready to skip those multiple 
digits, as well as skipping letters, adding another stage to the calculation. 
E.g. in: 

+3ACG. -> . 

you skip 1 + 3 characters, 1 for the digit, 3 for the following letters as 
specified by the digit 3. In: 

-11ACGACGACGACG. -> G. 

You skip 2 + 11 characters, 2 characters in "11" and 11 characters following. 
And incidentally in: 

+ACG. -> ACG. 

there's no digit, so you skip 0 digits + 0 letters. If a lack of digits is a 
data error then it would be easy to test for-- just look for an empty list in 
the 'digits' list.

For each chunk (stuff between +/-) I use takewhile to separate the zero or more 
initial digits from the following letters. itertools.takewhile is a nice tool 
for matching stuff until the first time a condition ceases to hold. In this 
case I'm matching digits up until the first non-digit.

If takewhile doesn't find any digits at the start of the sequence, it returns 
the empty list []. ''.join(list) swallows empty lists so takewhile and 
''.join() do behind-the-scenes leg-work for me in detecting the case of absent 
digit(s).

For my money speed isn't bad for these two methods (again, I don't know your 
domain.) On my (ancient) hardware the 1-digit version does over 100,000/s for 
your longer string, the multi-digit about 35,000/s.

HTH,

Nick

On Saturday, 5 January 2013 19:35:26 UTC+11, Sia  wrote:
> I have strings such as:
> 
> 
> 
> tA.-2AG.-2AG,-2ag
> 
> or
> 
> .+3ACG.+5CAACG.+3ACG.+3ACG
> 
> 
> 
> The plus and minus signs are always followed by a number (say, i). I want 
> python to find each single plus or minus, remove the sign, the number after 
> it and remove i characters after that. So the two strings above become:
> 
> 
> 
> tA..,
> 
> and
> 
> ...
> 
> 
> 
> How can I do that?
> 
> Thanks.
-- 
http://mail.python.org/mailman/listinfo/python-list

Reply via email to