Re: [Tutor] Removing control characters

2009-02-19 Thread Mark Tolonen


"Kent Johnson"  wrote in message 
news:1c2a2c590902191500y71600feerff0b73a88fb49...@mail.gmail.com...

> On Thu, Feb 19, 2009 at 5:41 PM, Dinesh B Vadhia wrote:
>> Okay, here is a combination of Mark's suggestions and yours:
>>
>> # replace unwanted chars in string s with " "
>> t = "".join([(" " if n in c else n) for n in s if n not in c])
>> t
>> 'Product ConceptsHard candy with an innovative twist, Internet Archive:
>> Wayback Machine. [online] Mar. 25, 2004. Retrieved from the Internet
>> http://www.confectionery-innovations.com>.'
>>
>> This last bit doesn't work ie. replacing the unwanted chars with " " - eg.
>> 'ConceptsHard'.  What's missing?
>
> The "if n not in c" at the end of the list comp rejects the unwanted
> characters from the result immediately. What you wrote is the same as
> t = "".join([n for n in s if n not in c])
>
> because "n in c" will never be true in the first conditional.
>
> BTW if you care about performance, this is the wrong approach. At
> least use a set for c; better would be to use translate().


Sorry, I didn't catch the "replace with space" part.  Kent is right, 
translate is what you want.  The join is still nice for making the 
translation table:

>>> import string
>>> table = ''.join(' ' if n < 32 or n > 126 else chr(n) for n in xrange(256))
>>> string.translate('here is\x01my\xffstring', table)
'here is my string'

-Mark
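
For readers on Python 3, where xrange() and the string.translate() function no longer exist, the same table idea might be sketched roughly as below. This is only an illustration: the names replace_table and clean are made up here, and the dict-based table is an assumption about how the snippet could be ported, not something posted in the thread.

```python
# Python 3 sketch (assumption: a port of the Python 2 snippet above).
# Map every code point below 256 that lies outside printable ASCII
# (32..126) to a single space.
replace_table = {n: ' ' for n in range(256) if n < 32 or n > 126}


def clean(text):
    """Return text with the mapped characters replaced by spaces."""
    return text.translate(replace_table)


result = clean('here is\x01my\xffstring')
print(result)  # here is my string
```

Code points of 256 and above are not in the table, so they pass through unchanged; a wider range would be needed to cover them.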


___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Removing control characters

2009-02-19 Thread Kent Johnson
On Thu, Feb 19, 2009 at 5:41 PM, Dinesh B Vadhia
 wrote:
> Okay, here is a combination of Mark's suggestions and yours:

>> # replace unwanted chars in string s with " "
>> t = "".join([(" " if n in c else n) for n in s if n not in c])
>> t
> 'Product ConceptsHard candy with an innovative twist, Internet Archive:
> Wayback Machine. [online] Mar. 25, 2004. Retrieved from the Internet  http://www.confectionery-innovations.com>.'
>
> This last bit doesn't work ie. replacing the unwanted chars with " " - eg.
> 'ConceptsHard'.  What's missing?

The "if n not in c" at the end of the list comp rejects the unwanted
characters from the result immediately. What you wrote is the same as
t = "".join([n for n in s if n not in c])

because "n in c" will never be true in the first conditional.

BTW if you care about performance, this is the wrong approach. At
least use a set for c; better would be to use translate().

Kent
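
The difference between filtering and replacing can be seen in a small sketch. It is written for Python 3 and uses a shortened version of the string from the thread; the names unwanted, filtered and replaced are illustrative only.

```python
# Unwanted characters: everything below 256 outside printable ASCII.
# A set gives fast membership tests.
unwanted = set(chr(n) for n in range(256) if n < 32 or n > 126)

s = "Product Concepts\xe2\x80\x94Hard candy"

# A trailing "if" filters: unwanted characters never reach the output.
filtered = "".join([n for n in s if n not in unwanted])
print(repr(filtered))  # 'Product ConceptsHard candy'

# A conditional expression with no trailing "if" replaces instead.
replaced = "".join([(" " if n in unwanted else n) for n in s])
print(repr(replaced))  # 'Product Concepts   Hard candy'
```

The three unwanted characters each become one space, so the replaced string keeps the original length.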


Re: [Tutor] Removing control characters

2009-02-19 Thread Dinesh B Vadhia
Okay, here is a combination of Mark's suggestions and yours:

> # string of all chars
> a = ''.join([chr(n) for n in range(256)])
> a
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'

> # string of wanted chars
> b = ''.join([n for n in a if ord(n) >= 32 and ord(n) <= 126])
> b
' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~'

> # string of unwanted chars > ord(126)
> c = ''.join([n for n in a if ord(n) < 32 or ord(n) > 126])
> c
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'

> # the string to process
> s = "Product Concepts\xe2\x80\x94Hard candy with an innovative twist, Internet Archive: Wayback Machine. [online] Mar. 25, 2004. Retrieved from the Internet http://www.confectionery-innovations.com>."

> # replace unwanted chars in string s with " "
> t = "".join([(" " if n in c else n) for n in s if n not in c])
> t
'Product ConceptsHard candy with an innovative twist, Internet Archive: Wayback Machine. [online] Mar. 25, 2004. Retrieved from the Internet http://www.confectionery-innovations.com>.'

This last bit doesn't work ie. replacing the unwanted chars with " " - eg. 
'ConceptsHard'.  What's missing?

Dinesh





Re: [Tutor] Removing control characters

2009-02-19 Thread Kent Johnson
On Thu, Feb 19, 2009 at 2:25 PM, Dinesh B Vadhia
 wrote:

> # 3) Replacing a set of characters with a single character ie.
>
> for c in str:
>     if c in set:
>         string.replace (c, r)
>
> to give
>
>> 'Chris Perkins : $$$-'
> My solution is:
>
> print ''.join[string.replace(c, r) for c in str if c in set]

With the syntax corrected this will not do what you want; the "if c in
set" filters the characters in the result, so the result will contain
only the replacement characters. You would need something like
''.join([ (r if c in set else c) for c in str])

Note that both 'set' and 'str' are built-in names and therefore poor
choices for variable names.

Kent
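
A minimal Python 3 sketch of the replacement form, using names that do not shadow the built-ins. The names text, digits and replacement are chosen here for illustration; the data is the example from the original question.

```python
text = 'Chris Perkins : 224-7992'
digits = set('0123456789')   # characters to replace
replacement = '$'

# No trailing "if": every character is kept, and only the ones found
# in digits are swapped for the replacement character.
result = ''.join([(replacement if c in digits else c) for c in text])
print(result)  # Chris Perkins : $$$-$$$$
```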


Re: [Tutor] Removing control characters

2009-02-19 Thread Marc Tompkins
On Thu, Feb 19, 2009 at 11:25 AM, Dinesh B Vadhia  wrote:

> My solution is:
>
> print ''.join[string.replace(c, r) for c in str if c in set]
>
> But, this returns a syntax error.  Any idea why?
>

Probably because you didn't use parentheses - join() is a function.

-- 
www.fsrtechnologies.com
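
The syntax error can be reproduced without crashing the script by handing the broken source to compile(). This is a Python 3 sketch; broken_source, is_syntax_error and kept are names invented for the example.

```python
text = 'Chris Perkins : 224-7992'
digits = '0123456789'

# Square brackets directly after join are parsed as a subscript, and a
# bare comprehension is not allowed there, so this source cannot compile.
broken_source = "''.join[c for c in text if c in digits]"
try:
    compile(broken_source, '<example>', 'eval')
    is_syntax_error = False
except SyntaxError:
    is_syntax_error = True
print(is_syntax_error)  # True

# With the call parentheses in place the list comprehension is an
# ordinary argument to join().
kept = ''.join([c for c in text if c in digits])
print(kept)  # 2247992
```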


Re: [Tutor] Removing control characters

2009-02-19 Thread Dinesh B Vadhia
At the bottom of the link http://code.activestate.com/recipes/303342/ there are 
list comprehensions for string manipulation ie.

import string

str = 'Chris Perkins : 224-7992'
set = '0123456789'
r = '$'

# 1) Keeping only a given set of characters.

print  ''.join([c for c in str if c in set])

> '2247992'

# 2) Deleting a given set of characters.

print  ''.join([c for c in str if c not in set])

> 'Chris Perkins : -'

The missing one is

# 3) Replacing a set of characters with a single character ie.

for c in str:
    if c in set:
        string.replace (c, r)

to give

> 'Chris Perkins : $$$-'

My solution is:

print ''.join[string.replace(c, r) for c in str if c in set]

But, this returns a syntax error.  Any idea why?

Ta!

Dinesh






Re: [Tutor] Removing control characters

2009-02-19 Thread Mark Tolonen
A regex isn't always the best solution:

>>> a=''.join(chr(n) for n in range(256))
>>> a
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
>>> b=''.join(n for n in a if ord(n) >= 32 and ord(n) <= 126)
>>> b
' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~'

-Mark
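
If the goal is to delete the unwanted characters rather than collect the wanted ones, a translate-based variant is one option in Python 3. This is a sketch under that assumption; unwanted, delete_table and stripped are names made up for the example.

```python
# Everything below 256 outside printable ASCII (32..126).
unwanted = ''.join(chr(n) for n in range(256) if n < 32 or n > 126)

# The third argument of str.maketrans lists characters to delete.
delete_table = str.maketrans('', '', unwanted)

stripped = 'here is\x01my\xffstring'.translate(delete_table)
print(stripped)  # here ismystring
```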

  "Dinesh B Vadhia"  wrote in message 
news:col103-ds55714842811febeb4a97ca3...@phx.gbl...
  I want a regex to remove control characters (< chr(32) and > chr(126)) from 
strings ie.

  line = re.sub(r"[^a-z0-9-';.]", " ", line)   # replace all chars NOT A-Z, 
a-z, 0-9, [-';.] with " " 

  1.  What is the best way to include all the required chars rather than list 
them all within the r"" ?
  2.  How do you handle the inclusion of the quotation mark " ?

  Cheers

  Dinesh






Re: [Tutor] Removing control characters

2009-02-19 Thread Kent Johnson
On Thu, Feb 19, 2009 at 10:14 AM, Dinesh B Vadhia
 wrote:
> I want a regex to remove control characters (< chr(32) and > chr(126)) from
> strings ie.
>
> line = re.sub(r"[^a-z0-9-';.]", " ", line)   # replace all chars NOT A-Z,
> a-z, 0-9, [-';.] with " "
>
> 1.  What is the best way to include all the required chars rather than list
> them all within the r"" ?

You have to list either the chars you want, as you have done, or the
ones you don't want. You could use
r'[\x00-\x1f\x7f-\xff]' or
r'[^\x20-\x7e]'

> 2.  How do you handle the inclusion of the quotation mark " ?

Use \", that works even in a raw string.

By the way string.translate() is likely to be faster for this purpose
than re.sub(). This recipe might help:
http://code.activestate.com/recipes/303342/

Kent
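
Both answers can be tried out in a short Python 3 sketch. The sample strings and the names NON_PRINTABLE, WITH_QUOTE, cleaned and quoted are made up for illustration; the second pattern is the character class from the question with \" added and the hyphen escaped.

```python
import re

# Anything outside the printable ASCII range 0x20..0x7e.
NON_PRINTABLE = re.compile(r'[^\x20-\x7e]')
cleaned = NON_PRINTABLE.sub(' ', 'here is\x01my\xffstring')
print(cleaned)  # here is my string

# \" keeps the double-quoted raw string literal open; the regex engine
# reads the backslash-quote pair inside the class as a literal quote.
WITH_QUOTE = re.compile(r"[^a-z0-9\-';.\"]")
quoted = WITH_QUOTE.sub(' ', 'say "hi" now!')
print(repr(quoted))  # 'say "hi" now '
```

Note that the space characters in the second sample are themselves outside the allowed class, so they are replaced by spaces and look unchanged.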