Re: Stripping ASCII codes when parsing

2005-10-17 Thread Tony Nelson
In article <[EMAIL PROTECTED]>,
 David Pratt <[EMAIL PROTECTED]> wrote:

> This is very nice :-)  Thank you Tony.  I think this will be the way to
> go.  My concern ATM is where it will be best to convert to Unicode. After
> this the data will go into a dict, through a few processes, and into a
> database. Because the input source has no explicit encoding, I believe I
> will have to assume ISO-8859-1, but it could well be cp1252 for the most
> part (because the format says no ASCII (0-30) but allows chars 128-254,
> and because most are Windows users).  I am thinking to strip these
> characters and validate the text, then convert to Unicode (utf-8) so it
> is Unicode in the dict.  Then when I perform these other processes it
> should be uniform, and it will go into the database as Unicode.  I think
> this should be ok.

Definitely "".translate() then unicode().  See the docs for 
"".translate().  As far as charset, well, if you can't know in advance 
you'll want to have some way to configure it for when it's wrong.  Also, 
maybe 255 is not allowed and should be checked for?
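
To make that order concrete, here is a minimal sketch in present-day
Python 3 terms, where the raw input is bytes and unicode() has become
bytes.decode().  The function name clean_and_decode, the encoding
parameter with its cp1252 default, and the reject_255 switch are all
assumptions for illustration, not anything the format itself defines.

```python
# Bytes to delete: control codes 0-30 plus the pipe (124), as the format advises.
DELETE_BYTES = bytes(range(31)) + b"|"


def clean_and_decode(raw, encoding="cp1252", reject_255=False):
    """Strip unwanted bytes first, then decode with a configurable charset.

    `encoding` and `reject_255` are hypothetical knobs for this sketch.
    """
    if reject_255 and b"\xff" in raw:
        raise ValueError("byte 255 is not allowed in this format")
    # A table of None means "no translation, only deletion".
    stripped = raw.translate(None, DELETE_BYTES)
    return stripped.decode(encoding)


text = clean_and_decode(b"caf\xe9|\x00 ok\n")
```

Switching to encoding="latin-1" is then a one-argument change; the two
charsets only disagree about bytes 128-159.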

TonyN.:'[EMAIL PROTECTED]
  '  
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Stripping ASCII codes when parsing

2005-10-17 Thread Erik Max Francis
David Pratt wrote:

> I am working with a text format that advises to strip any ascii control 
> characters (0 - 30) as part of parsing data and also the ascii pipe 
> character (124) from the data. I think many of these characters are 
> from a different time. Since I have never seen most of these characters 
> in text I am not sure how these first 30 control characters are all 
> represented (other than say tab (\t), newline(\n), line return(\r) ) so 
> what should I do to remove these characters if they are ever 
> encountered. Many thanks.

Use ''.translate.  Pass in the identity mapping for the first argument, 
and for the second parameter, specify the list of all the characters you 
wish to delete.  This would probably be something like:

IDENTITY_MAP = ''.join([chr(x) for x in range(256)])
BAD_MAP = ''.join([chr(x) for x in range(32) + [124])

aNewString = aString.translate(IDENTITY_MAP, BAD_MAP)

Note that ASCII 31 is also a control character (US).
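
For anyone reading this on a current Python 3 interpreter, where range()
no longer returns a list and str.translate() takes a single table, a
rough equivalent might look like the sketch below.  The name strip_bad
is made up for the example; range(32) covers 0-31, so US is removed too.

```python
# Characters to delete: ASCII control codes 0-31 plus the pipe (124).
BAD_CHARS = "".join(chr(x) for x in list(range(32)) + [124])

# With two empty strings, maketrans() builds a table that only deletes.
DELETE_TABLE = str.maketrans("", "", BAD_CHARS)


def strip_bad(a_string):
    """Return a_string without control characters or '|' (hypothetical helper)."""
    return a_string.translate(DELETE_TABLE)


a_new_string = strip_bad("a\x1fb|c\td")
```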

-- 
Erik Max Francis && [EMAIL PROTECTED] && http://www.alcyone.com/max/
San Jose, CA, USA && 37 20 N 121 53 W && AIM erikmaxfrancis
   The believer is happy; the doubter is wise.
   -- (an Hungarian proverb)


Re: Stripping ASCII codes when parsing

2005-10-17 Thread David Pratt
This is very nice :-)  Thank you Tony.  I think this will be the way to
go.  My concern ATM is where it will be best to convert to Unicode. After
this the data will go into a dict, through a few processes, and into a
database. Because the input source has no explicit encoding, I believe I
will have to assume ISO-8859-1, but it could well be cp1252 for the most
part (because the format says no ASCII (0-30) but allows chars 128-254,
and because most are Windows users).  I am thinking to strip these
characters and validate the text, then convert to Unicode (utf-8) so it
is Unicode in the dict.  Then when I perform these other processes it
should be uniform, and it will go into the database as Unicode.  I think
this should be ok.

Regards,
David

On Monday, October 17, 2005, at 01:48 PM, Tony Nelson wrote:

> In article <[EMAIL PROTECTED]>,
>  David Pratt <[EMAIL PROTECTED]> wrote:
>
>> I am working with a text format that advises to strip any ascii  
>> control
>> characters (0 - 30) as part of parsing data and also the ascii pipe
>> character (124) from the data. I think many of these characters are
>> from a different time. Since I have never seen most of these  
>> characters
>> in text I am not sure how these first 30 control characters are all
>> represented (other than say tab (\t), newline(\n), line return(\r) )  
>> so
>> what should I do to remove these characters if they are ever
>> encountered. Many thanks.
>
> Most of those characters are hard to see.
>
> Represent arbitrary characters in a string in hex: "\x00\x01\x02" or
> with chr(n).
>
> If you just want to remove some characters, look into "".translate().
>
> nullxlate = "".join([chr(n) for n in xrange(256)])
> delchars = nullxlate[:31] + chr(124)
> outputstr = inputstr.translate(nullxlate, delchars)
> TonyN.:'[EMAIL PROTECTED]
>   '


Re: Stripping ASCII codes when parsing

2005-10-17 Thread David Pratt
Hi Steve.  My plan is to parse the data removing the control characters  
and validate the data as records are being added to a dictionary. I am  
going to Unicode after this step but before it gets into storage (in  
which case I think the translate method could work well).

The encoding itself is not explicit for this data except to say that it  
is ASCII and that besides not using chars 0-30, ASCII 128-254 is  
permitted. I am not certain whether I should assume cp1252 or  
ISO-8859-1. I can't say that everyone is using Windows although likely  
vast majority for sure.

Would you think it safer to convert to Unicode before or after the stage
that seeks out control characters and validates? My validations are
relatively simple: ensure that where I am expecting a date, integer,
string etc. the data is what it is supposed to be (since the next stage
is the database), unify whitespace, remove control characters, and check
for SQL strings in the data to prevent any stupid things from happening
if someone wanted to be malicious.

Regards,
David

On Monday, October 17, 2005, at 12:49 PM, Steve Holden wrote:

> David Pratt wrote:
> [about ord(), chr() and stripping control characters]
>> Many thanks Steve. This is good information. I think this should work
>> fine. I was doing a string.replace in a cleanData() method with the
>> following characters but don't know if that would have done it. This
>> contains all the control characters that I really know about in normal
>> use. ord(c) < 32 sounds like a much better way to go and  
>> comprehensive.
>>   So I guess instead of string.replace, I should do a...  for char
>> in ...and check each character, correct? - or is there a
>> better way of eliminating these other than reading a string
>> character by character.
>>
>> '\a','\b','\e','\f','\n','\r','\t','\v','|'
>>
>
> There are a number of different things you might want to try. One is
> translate() which, given a string and a translate table, will perform
> the translation all in one go. For example:
>
>   >>> delchars = "".join(chr(i) for i in range(32)) + "|"
>   >>> print repr(delchars)
> '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\
> x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f|'
>   >>> nultxfrm = "".join(chr(i) for i in range(256))
>   >>>
>
> So delchars is a string of the characters you want to remove, and
> nultxfrm is a 256-character string where nultxfrm[n] == chr(n) - this
> performs no translation at all. So then
>
>  s = s.translate(nultxfrm, delchars)
>
> will remove all the "illegal" characters from s.
>
> Note that I am sort-of cheating here, as this is only going to work for
> 8-bit characters. Once Unicode enters the picture all bets are off.
>
> regards
>   Steve
> -- 
> Steve Holden   +44 150 684 7255  +1 800 494 3119
> Holden Web LLC www.holdenweb.com
> PyCon TX 2006  www.python.org/pycon/
>


Re: Stripping ASCII codes when parsing

2005-10-17 Thread Tony Nelson
In article <[EMAIL PROTECTED]>,
 David Pratt <[EMAIL PROTECTED]> wrote:

> I am working with a text format that advises to strip any ascii control 
> characters (0 - 30) as part of parsing data and also the ascii pipe 
> character (124) from the data. I think many of these characters are 
> from a different time. Since I have never seen most of these characters 
> in text I am not sure how these first 30 control characters are all 
> represented (other than say tab (\t), newline(\n), line return(\r) ) so 
> what should I do to remove these characters if they are ever 
> encountered. Many thanks.

Most of those characters are hard to see.

Represent arbitrary characters in a string in hex: "\x00\x01\x02" or 
with chr(n).

If you just want to remove some characters, look into "".translate().  

nullxlate = "".join([chr(n) for n in xrange(256)])
delchars = nullxlate[:31] + chr(124)
outputstr = inputstr.translate(nullxlate, delchars)

TonyN.:'[EMAIL PROTECTED]
  '  


Re: Stripping ASCII codes when parsing

2005-10-17 Thread Steve Holden
David Pratt wrote:
[about ord(), chr() and stripping control characters]
> Many thanks Steve. This is good information. I think this should work 
> fine. I was doing a string.replace in a cleanData() method with the 
> following characters but don't know if that would have done it. This 
> contains all the control characters that I really know about in normal 
> use. ord(c) < 32 sounds like a much better way to go and comprehensive. 
>   So I guess instead of string.replace, I should do a...  for char 
> in ...and check each character, correct? - or is there a 
> better way of eliminating these other than reading a string 
> character by character.
> 
> '\a','\b','\e','\f','\n','\r','\t','\v','|'
> 

There are a number of different things you might want to try. One is 
translate() which, given a string and a translate table, will perform 
the translation all in one go. For example:

  >>> delchars = "".join(chr(i) for i in range(32)) + "|"
  >>> print repr(delchars)
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\
x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f|'
  >>> nultxfrm = "".join(chr(i) for i in range(256))
  >>>

So delchars is a string of the characters you want to remove, and 
nultxfrm is a 256-character string where nultxfrm[n] == chr(n) - this 
performs no translation at all. So then

 s = s.translate(nultxfrm, delchars)

will remove all the "illegal" characters from s.

Note that I am sort-of cheating here, as this is only going to work for 
8-bit characters. Once Unicode enters the picture all bets are off.
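
One possible way to handle the Unicode case, sketched for Python 3 where
str is already Unicode: translate() accepts a dict keyed by code point,
and mapping a code point to None deletes it.  Treating Unicode category
"Cc" as the set of "illegal" characters is an assumption made for this
sketch, and strip_controls is a hypothetical name.

```python
import unicodedata

# Map every control code point (category "Cc") and the pipe to None.
# All "Cc" characters sit below U+00A0, so scanning range(0xA0) is enough.
DELETE_MAP = {
    cp: None
    for cp in range(0xA0)
    if unicodedata.category(chr(cp)) == "Cc" or cp == ord("|")
}


def strip_controls(s):
    """Delete control characters and '|' from a Unicode string."""
    return s.translate(DELETE_MAP)


s = strip_controls("na\u00efve\x00|\x85 text\x7f")
```

Note that this goes further than the 0-31 rule: it also drops DEL and
the C1 range 128-159, which is where ISO-8859-1 puts bytes that cp1252
treats as printable, so it may be too aggressive for data decoded as
ISO-8859-1.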

regards
  Steve
-- 
Steve Holden   +44 150 684 7255  +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006  www.python.org/pycon/



Re: Stripping ASCII codes when parsing

2005-10-17 Thread David Pratt
Many thanks Steve. This is good information. I think this should work 
fine. I was doing a string.replace in a cleanData() method with the 
following characters but don't know if that would have done it. This 
contains all the control characters that I really know about in normal 
use. ord(c) < 32 sounds like a much better way to go and comprehensive. 
  So I guess instead of string.replace, I should do a...  for char 
in ...and check each character, correct? - or is there a 
better way of eliminating these other than reading a string 
character by character.

'\a','\b','\e','\f','\n','\r','\t','\v','|'

Regards,
David


On Monday, October 17, 2005, at 06:04 AM, Steve Holden wrote:

> David Pratt wrote:
>> I am working with a text format that advises to strip any ascii 
>> control
>> characters (0 - 30) as part of parsing data and also the ascii pipe
>> character (124) from the data. I think many of these characters are
>> from a different time. Since I have never seen most of these 
>> characters
>> in text I am not sure how these first 30 control characters are all
>> represented (other than say tab (\t), newline(\n), line return(\r) ) 
>> so
>> what should I do to remove these characters if they are ever
>> encountered. Many thanks.
>
> You will find the ord() function useful: control characters all have
> ord(c) < 32.
>
> You can also use the chr() function to return a character whose ord() 
> is
> a specific value, and you can use hex escapes to include arbitrary
> control characters in string literals:
>
>    myString = "\x00\x01\x02"
>
> regards
>   Steve
> -- 
> Steve Holden   +44 150 684 7255  +1 800 494 3119
> Holden Web LLC www.holdenweb.com
> PyCon TX 2006  www.python.org/pycon/
>


Re: Stripping ASCII codes when parsing

2005-10-17 Thread Steve Holden
David Pratt wrote:
> I am working with a text format that advises to strip any ascii control 
> characters (0 - 30) as part of parsing data and also the ascii pipe 
> character (124) from the data. I think many of these characters are 
> from a different time. Since I have never seen most of these characters 
> in text I am not sure how these first 30 control characters are all 
> represented (other than say tab (\t), newline(\n), line return(\r) ) so 
> what should I do to remove these characters if they are ever 
> encountered. Many thanks.

You will find the ord() function useful: control characters all have 
ord(c) < 32.

You can also use the chr() function to return a character whose ord() is 
a specific value, and you can use hex escapes to include arbitrary 
control characters in string literals:

   myString = "\x00\x01\x02"
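
As a quick illustration of the ord() test, a character-by-character
filter could be sketched like this (Python 3 syntax; drop_controls is
just a name picked for the example):

```python
def drop_controls(text):
    """Keep only characters whose code is 32 or above (hypothetical helper)."""
    return "".join(c for c in text if ord(c) >= 32)


my_string = "\x00\x01\x02"
codes = [ord(c) for c in my_string]   # the numeric code of each character
tab = chr(9)                          # chr() is the inverse of ord()
cleaned = drop_controls("one\ttwo\x02\n")
```

This is simple, but it does one Python-level test per character; for
long inputs a translate() table is usually the faster route.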

regards
  Steve
-- 
Steve Holden   +44 150 684 7255  +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006  www.python.org/pycon/
