Re: Filtering out non-readable characters
    def StripNoPrint(self, S):
        from string import printable
        return ''.join([ch for ch in S if ch in printable])

Adriaan Renting      | Email: [EMAIL PROTECTED]
ASTRON               | Phone: +31 521 595 217
P.O. Box 2           | GSM:   +31 6 24 25 17 28
NL-7990 AA Dwingeloo | FAX:   +31 521 597 332
The Netherlands      | Web:   http://www.astron.nl/~renting/

MKoool [EMAIL PROTECTED] wrote (07/16/05 2:33 AM):
> I have a file with binary and ascii characters in it. I massage the data
> and convert it to a more readable format, however it still comes up with
> some binary characters mixed in. I'd like to write something to just
> replace all non-printable characters with '' (I want to delete
> non-printable characters).
>
> I am having trouble figuring out an easy python way to do this... is the
> easiest way to just write some regular expression that does something
> like replace [^\p] with ''?
>
> Or is it better to go through every character and do ord(character),
> check the ascii values?
>
> What's the easiest way to do something like this?
>
> thanks

-- http://mail.python.org/mailman/listinfo/python-list
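A minimal sketch of the same filtering idea as a standalone function. The name strip_non_printable and the frozenset lookup are assumptions added for illustration, not part of the posted method; the generator expression form assumes Python 2.4 or later.

```python
# Sketch only: strip_non_printable is a hypothetical name, not from the thread.
import string

# A frozenset gives constant-time membership tests instead of scanning the string.
PRINTABLE = frozenset(string.printable)

def strip_non_printable(text):
    """Return text with every character outside string.printable removed."""
    return ''.join(ch for ch in text if ch in PRINTABLE)

cleaned = strip_non_printable('abc\x00def\x07\n')
print(repr(cleaned))
```

The newline survives because string.printable includes whitespace; if that is not wanted, the allowed set would need to be narrowed.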
Re: Filtering out non-readable characters
Adriaan Renting wrote:
> def StripNoPrint(self, S):
>     from string import printable
>     return ''.join([ch for ch in S if ch in printable])
>
> MKoool [EMAIL PROTECTED] wrote (07/16/05 2:33 AM):
>> I have a file with binary and ascii characters in it. I massage the data
>> and convert it to a more readable format, however it still comes up with
>> some binary characters mixed in. I'd like to write something to just
>> replace all non-printable characters with '' (I want to delete
>> non-printable characters).
>>
>> I am having trouble figuring out an easy python way to do this... is the
>> easiest way to just write some regular expression that does something
>> like replace [^\p] with ''?
>>
>> Or is it better to go through every character and do ord(character),
>> check the ascii values?
>>
>> What's the easiest way to do something like this?
>>
>> thanks

I'd consider using the string's translate() method for this. Provide it with two arguments: the first should be a string of the 256 characters with ordinals 0 to 255 (because you won't be changing any characters, you need a translate table that effects the null transformation), and the second should be a string containing all the characters you want to remove.

So

    tt = ''.join([chr(i) for i in range(256)])

generates the null translate table quite easily. Then

    import string
    ds = tt.translate(tt, string.printable)

sets ds to be all the non-printable characters (according to the string module, anyway).

Now you should be able to remove the non-printable characters from s by writing

    s = s.translate(tt, ds)

regards
 Steve
-- 
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
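For readers on Python 3, a rough equivalent of the translate() recipe can be sketched with bytes objects, which are the closest match to the 8-bit strings discussed here. This is an adaptation under that assumption, not the code as posted; the names tt, ds and s mirror the message.

```python
import string

# Identity table: byte i maps to byte i (the "null transformation").
tt = bytes(range(256))

# Delete the printable bytes from the identity table; what is left
# is exactly the set of non-printable bytes.
printable = string.printable.encode('ascii')
ds = tt.translate(tt, printable)

# Remove every non-printable byte from the data.
s = b'abc\x00def\xff\n'
s = s.translate(tt, ds)
print(s)
```

The table is built once and can be reused for any number of strings, which is the main attraction over a per-character loop.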
Re: Filtering out non-readable characters
Steve Holden wrote:
> tt = ''.join([chr(i) for i in range(256)])

Or:

    tt = string.maketrans('', '')

STeVe
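A quick check that the two spellings agree, sketched for Python 3, where the equivalent of string.maketrans for 8-bit data is bytes.maketrans (an adaptation; the thread itself uses the Python 2 string module).

```python
# Hand-built identity table, one byte per ordinal 0..255.
by_hand = bytes(range(256))

# The same table obtained by asking maketrans to remap nothing.
by_maketrans = bytes.maketrans(b'', b'')

print(by_hand == by_maketrans)
```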
Re: Filtering out non-readable characters
On 17 Jul 2005 18:32:28 -0700, Raymond Hettinger [EMAIL PROTECTED] wrote:
> [George Sakkis]
>> It's only obvious in the sense that _after_ you see this idiom, you can
>> go back to the docs and realize it's not doing something special; OTOH
>> if you haven't seen it, it's not at all the obvious solution to "how do
>> I get the first 256 characters". So IMO it should be mentioned, given
>> that string.translate often operates on the identity table. I think a
>> single sentence is adequate for the reference docs.
>
> For Py2.5, I've accepted a feature request to allow string.translate's
> first argument to be None and then run as if an identity string had been
> provided.

My news service has been timing out on postings, but I had a couple that made reference to that ;-) Maybe this post will get through.

Regards,
Bengt Richter
Re: Filtering out non-readable characters
On Sat, 16 Jul 2005 23:28:02 -0400, George Sakkis [EMAIL PROTECTED] wrote:
> Peter Hansen [EMAIL PROTECTED] wrote:
>> George Sakkis wrote:
>>> Peter Hansen [EMAIL PROTECTED] wrote:
>>>> Where did you learn that, George?
>>>
>>> Actually I first read about this in the Cookbook; there are two or
>>> three recipes related to string.translate. As for string.maketrans, it
>>> doesn't do anything special for empty string arguments: ...
>>
>> I guess so. I was going to offer to suggest a new paragraph on that
>> usage for the docs, but as you and Jp both seem to think the behaviour
>> is obvious, I conclude it's just me so I suppose I shouldn't bother.
>
> It's only obvious in the sense that _after_ you see this idiom, you can
> go back to the docs and realize it's not doing something special; OTOH
> if you haven't seen it, it's not at all the obvious solution to "how do
> I get the first 256 characters". So IMO it should be mentioned, given
> that string.translate often operates on the identity table. I think a
> single sentence is adequate for the reference docs.

I would suggest changing

    maketrans(from, to)
        Return a translation table suitable for passing to translate() or
        regex.compile(), that will map each character in from into the
        character at the same position in to; from and to must have the
        same length.

to something that would make the idiom more easily inferable, e.g.,

    maketrans(from, to)
        Return a translation table suitable for passing to translate() or
        regex.compile(), that will map each character in from into the
        character at the same position in to, while leaving all characters
        other than those in from unchanged; from and to must have the
        same length.

Meanwhile, if my Python feature request #1193128 on SourceForge gets implemented, we'll be able to write

    s.translate(None, badchars)

instead of having to build an identity table to pass as the first argument. Maybe 2.5? (Not being pushy ;-)

Regards,
Bengt Richter
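The None-as-table spelling can be tried on a current interpreter. The sketch below assumes Python 3 and uses bytes, where translate(None, delete) is accepted; badchars and s are illustrative values, not data from the thread.

```python
# Bytes to delete; no identity table has to be built by hand.
badchars = b'\x00\x07\x1b'

s = b'he\x00llo\x1b wor\x07ld'

# Passing None as the table means "leave every byte as it is",
# so only the deletion argument has any effect.
cleaned = s.translate(None, badchars)
print(cleaned)
```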
Re: Filtering out non-readable characters
On Sun, 17 Jul 2005 15:42:08 -0600, Steven Bethard [EMAIL PROTECTED] wrote:
> Bengt Richter wrote:
>> Thanks for the nudge. Actually, I know about generator expressions, but
>> at some point I must have misinterpreted some bug in my code to mean
>> that join in particular didn't like generator expression arguments, and
>> wanted lists.
>
> I suspect this is bug 905389 [1]:
>
> >>> def gen():
> ...     yield 1
> ...     raise TypeError('from gen()')
> ...
> >>> ''.join([x for x in gen()])
> Traceback (most recent call last):
>   File "<interactive input>", line 1, in ?
>   File "<interactive input>", line 3, in gen
> TypeError: from gen()
> >>> ''.join(x for x in gen())
> Traceback (most recent call last):
>   File "<interactive input>", line 1, in ?
> TypeError: sequence expected, generator found
>
> I run into this every month or so, and have to remind myself that it
> means that my generator is raising a TypeError, not that join doesn't
> accept generator expressions...
>
> STeVe
>
> [1] http://www.python.org/sf/905389

That must have been it, thanks.

Regards,
Bengt Richter
Re: Filtering out non-readable characters
On 15 Jul 2005 17:33:39 -0700, MKoool [EMAIL PROTECTED] wrote:
> I have a file with binary and ascii characters in it. I massage the data
> and convert it to a more readable format, however it still comes up with
> some binary characters mixed in. I'd like to write something to just
> replace all non-printable characters with '' (I want to delete
> non-printable characters).
>
> I am having trouble figuring out an easy python way to do this... is the
> easiest way to just write some regular expression that does something
> like replace [^\p] with ''?
>
> Or is it better to go through every character and do ord(character),
> check the ascii values?
>
> What's the easiest way to do something like this?
>
> thanks

Easiest way is to open the file with EdXor (freeware editor), select all, Format > Wipe Non-Ascii.

OK, it's not Python, but it's the easiest.
Re: Filtering out non-readable characters
On Tue, 19 Jul 2005 20:28:31 +1200, Ross wrote:
> On 15 Jul 2005 17:33:39 -0700, MKoool [EMAIL PROTECTED] wrote:
>> I have a file with binary and ascii characters in it. I massage the data
>> and convert it to a more readable format, however it still comes up with
>> some binary characters mixed in. I'd like to write something to just
>> replace all non-printable characters with '' (I want to delete
>> non-printable characters).
>>
>> I am having trouble figuring out an easy python way to do this... is the
>> easiest way to just write some regular expression that does something
>> like replace [^\p] with ''?
>>
>> Or is it better to go through every character and do ord(character),
>> check the ascii values?
>>
>> What's the easiest way to do something like this?
>>
>> thanks
>
> Easiest way is open the file with EdXor (freeware editor), select all,
> Format > Wipe Non-Ascii.
>
> Ok it's not python, but it's the easiest.

 1 Open Internet Explorer
 2 Go to Google
 3 Search for EdXor
 4 Browser locks up
 5 Force quit with ctrl-alt-del
 6 Run anti-virus program
 7 Download new virus definitions
 8 Remove viruses
 9 Run anti-spyware program
10 Download new definitions
11 Remove spyware
12 Open Internet Explorer
13 Download Firefox
14 Install Firefox
15 Open Firefox
16 Go to Google
17 Search for EdXor
18 Download application
19 Run installer
20 Reboot
21 Run EdXor
22 Open file
23 Select all
24 Select Format > Wipe Non-ASCII
25 Select Save
26 Quit EdXor

Hmmm. Perhaps not *quite* the easiest way :-)

-- 
Steven.
Re: Filtering out non-readable characters
Peter Hansen wrote:
> >>> ''.join(chr(c) for c in range(65, 91))
> 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

Wouldn't this be a candidate for making the Python language stricter?

Do you remember old Python versions treating l.append(n1,n2) the same way as l.append((n1,n2))? I'm glad this is forbidden now.

Ciao, Michael.
Re: Filtering out non-readable characters
Michael Ströder wrote:
> Peter Hansen wrote:
>> >>> ''.join(chr(c) for c in range(65, 91))
>> 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>
> Wouldn't this be a candidate for making the Python language stricter?

Why would that be true? I believe str.join() takes any iterable, and a generator (as returned by a generator expression) certainly qualifies.

-Peter
Re: Filtering out non-readable characters
Michael Ströder wrote:
> Peter Hansen wrote:
>> >>> ''.join(chr(c) for c in range(65, 91))
>> 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>
> Wouldn't this be a candidate for making the Python language stricter?
>
> Do you remember old Python versions treating l.append(n1,n2) the same
> way like l.append((n1,n2)). I'm glad this is forbidden now.

That wasn't a syntax issue; it was an API issue. list.append() allowed multiple arguments and interpreted them as if they were a single tuple. That was confusing and unnecessary. Allowing generator expressions to forgo extra parentheses where they aren't required is something different and, in my opinion, a good thing.

-- 
Robert Kern
[EMAIL PROTECTED]

"In the fields of hell where the grass grows high
 Are the graves of dreams allowed to die."
  -- Richard Harter
Re: Filtering out non-readable characters
Steven D'Aprano wrote:
> On Sat, 16 Jul 2005 16:42:58 -0400, Peter Hansen wrote:
>> Come on, Steven. Don't tell us you didn't have access to a Python
>> interpreter to check before you posted:
>
> Er, as I wrote in my post:
>
> "Steven
> who is still using Python 2.3, and probably will be for quite some time"

Sorry, missed that! I don't generally notice signatures much, partly because Thunderbird is smart enough to grey them out (the main text is displayed as black, quoted material in blue, and signatures in a light gray).

I don't have a firm answer (though I suspect the language reference does) about when dedicated parentheses are required around a generator expression. I just know that, so far, they just work when I want them to. Like most of Python. :-)

-Peter
Re: Filtering out non-readable characters
Bengt Richter wrote:
> Thanks for the nudge. Actually, I know about generator expressions, but
> at some point I must have misinterpreted some bug in my code to mean
> that join in particular didn't like generator expression arguments, and
> wanted lists.

I suspect this is bug 905389 [1]:

    >>> def gen():
    ...     yield 1
    ...     raise TypeError('from gen()')
    ...
    >>> ''.join([x for x in gen()])
    Traceback (most recent call last):
      File "<interactive input>", line 1, in ?
      File "<interactive input>", line 3, in gen
    TypeError: from gen()
    >>> ''.join(x for x in gen())
    Traceback (most recent call last):
      File "<interactive input>", line 1, in ?
    TypeError: sequence expected, generator found

I run into this every month or so, and have to remind myself that it means that my generator is raising a TypeError, not that join doesn't accept generator expressions...

STeVe

[1] http://www.python.org/sf/905389
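To see which exception actually reaches the caller on a given interpreter, the session above can be turned into a small script. This is a sketch assuming a current Python 3, where the generator's own TypeError is expected to surface rather than the misleading "sequence expected" message shown for 2.4; the variable name message is illustrative.

```python
def gen():
    yield 'a'
    # Simulates a bug inside the generator that happens to raise TypeError.
    raise TypeError('from gen()')

try:
    ''.join(x for x in gen())
except TypeError as exc:
    # Capture the text so it is clear where the error really came from.
    message = str(exc)

print(message)
```

If the printed text mentions gen(), the fault lies in the generator body and not in join's handling of generator expressions.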
Re: Filtering out non-readable characters
[George Sakkis]
> It's only obvious in the sense that _after_ you see this idiom, you can
> go back to the docs and realize it's not doing something special; OTOH
> if you haven't seen it, it's not at all the obvious solution to "how do
> I get the first 256 characters". So IMO it should be mentioned, given
> that string.translate often operates on the identity table. I think a
> single sentence is adequate for the reference docs.

For Py2.5, I've accepted a feature request to allow string.translate's first argument to be None and then run as if an identity string had been provided.

Raymond Hettinger
Re: Filtering out non-readable characters
Wow, that was the most thorough answer to a comp.lang.python question since the Martellibot got busy in the search business.
Re: Filtering out non-readable characters
Bengt Richter wrote:
> identity = ''.join([chr(i) for i in xrange(256)])
> unprintable = ''.join([c for c in identity if c not in string.printable])

And note that with Python 2.4, in each case the above square brackets are unnecessary (though harmless), because of the arrival of generator expressions in the language. (Bengt knows this already, of course, but his brain is probably resisting the reprogramming. :-) )

-Peter
Re: Filtering out non-readable characters
On Sat, 16 Jul 2005 10:25:29 -0400, Peter Hansen wrote:
> Bengt Richter wrote:
>> identity = ''.join([chr(i) for i in xrange(256)])
>> unprintable = ''.join([c for c in identity if c not in string.printable])
>
> And note that with Python 2.4, in each case the above square brackets
> are unnecessary (though harmless), because of the arrival of generator
> expressions in the language.

But to use generator expressions, wouldn't you need an extra pair of round brackets? E.g.

    identity = ''.join( ( chr(i) for i in xrange(256) ) )

with the extra spaces added for clarity. That is, the brackets after join make the function call, and the nested brackets make the generator. That, at least, is my understanding.

-- 
Steven
who is still using Python 2.3, and probably will be for quite some time
Re: Filtering out non-readable characters
Steven D'Aprano wrote:
> On Sat, 16 Jul 2005 10:25:29 -0400, Peter Hansen wrote:
>> Bengt Richter wrote:
>>> identity = ''.join([chr(i) for i in xrange(256)])
>>
>> And note that with Python 2.4, in each case the above square brackets
>> are unnecessary (though harmless), because of the arrival of generator
>> expressions in the language.
>
> But to use generator expressions, wouldn't you need an extra pair of
> round brackets?
>
> eg identity = ''.join( ( chr(i) for i in xrange(256) ) )

Come on, Steven. Don't tell us you didn't have access to a Python interpreter to check before you posted:

    c:\>python
    Python 2.4 (#60, Nov 30 2004, 11:49:19) [MSC v.1310 32 bit (Intel)] on win32
    >>> ''.join(chr(c) for c in range(65, 91))
    'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

-Peter
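The three spellings discussed in this exchange can be compared side by side. A small sketch, assuming Python 2.4 or later (it also runs unchanged on Python 3); the variable names are illustrative.

```python
# Generator expression as the sole argument: the call's own parentheses suffice.
bare = ''.join(chr(n) for n in range(65, 91))

# Same generator expression with its own, redundant parentheses.
wrapped = ''.join((chr(n) for n in range(65, 91)))

# List comprehension: builds the whole list before joining.
listed = ''.join([chr(n) for n in range(65, 91)])

print(bare)
print(bare == wrapped == listed)
```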
Re: Filtering out non-readable characters
On Sat, 16 Jul 2005 10:25:29 -0400, Peter Hansen [EMAIL PROTECTED] wrote:
> Bengt Richter wrote:
>> identity = ''.join([chr(i) for i in xrange(256)])
>> unprintable = ''.join([c for c in identity if c not in string.printable])
>
> And note that with Python 2.4, in each case the above square brackets
> are unnecessary (though harmless), because of the arrival of generator
> expressions in the language. (Bengt knows this already, of course, but
> his brain is probably resisting the reprogramming. :-) )

Thanks for the nudge. Actually, I know about generator expressions, but at some point I must have misinterpreted some bug in my code to mean that join in particular didn't like generator expression arguments, and wanted lists.

Actually it seems to like anything at all that can be iterated to produce a sequence of strings. So I'm glad to find that join is fine after all, and to get that misap[com?:-)]prehension out of my mind ;-)

Regards,
Bengt Richter
Re: Filtering out non-readable characters
Bengt Richter [EMAIL PROTECTED] wrote:
> identity = ''.join([chr(i) for i in xrange(256)])
> unprintable = ''.join([c for c in identity if c not in string.printable])

Or equivalently:

    >>> identity = string.maketrans('','')
    >>> unprintable = identity.translate(identity, string.printable)

George
Re: Filtering out non-readable characters
On Sat, 16 Jul 2005 19:01:50 -0400, Peter Hansen [EMAIL PROTECTED] wrote:
> George Sakkis wrote:
>> Bengt Richter [EMAIL PROTECTED] wrote:
>>> identity = ''.join([chr(i) for i in xrange(256)])
>>
>> Or equivalently:
>> identity = string.maketrans('','')
>
> Wow! That's handy, not to mention undocumented. (At least in the string
> module docs.) Where did you learn that, George?

http://python.org/doc/lib/node109.html

> -Peter
Re: Filtering out non-readable characters
Jp Calderone wrote:
> On Sat, 16 Jul 2005 19:01:50 -0400, Peter Hansen [EMAIL PROTECTED] wrote:
>> George Sakkis wrote:
>>> identity = string.maketrans('','')
>>
>> Wow! That's handy, not to mention undocumented. (At least in the
>> string module docs.) Where did you learn that, George?
>
> http://python.org/doc/lib/node109.html

Perhaps I was unclear. I thought it would be obvious that I knew where to find the docs for maketrans(), but that the particular behaviour shown (i.e. arguments of '' having that effect) was undocumented in that page.

-Peter
Re: Filtering out non-readable characters
Peter Hansen [EMAIL PROTECTED] wrote:
> Jp Calderone wrote:
>> On Sat, 16 Jul 2005 19:01:50 -0400, Peter Hansen [EMAIL PROTECTED] wrote:
>>> George Sakkis wrote:
>>>> identity = string.maketrans('','')
>>>
>>> Wow! That's handy, not to mention undocumented. (At least in the
>>> string module docs.) Where did you learn that, George?
>>
>> http://python.org/doc/lib/node109.html
>
> Perhaps I was unclear. I thought it would be obvious that I knew where
> to find the docs for maketrans(), but that the particular behaviour
> shown (i.e. arguments of '' having that effect) was undocumented in
> that page.
>
> -Peter

Actually I first read about this in the Cookbook; there are two or three recipes related to string.translate. As for string.maketrans, it doesn't do anything special for empty string arguments:

    maketrans(from, to)
        Return a translation table suitable for passing to translate() or
        regex.compile(), that will map each character in from into the
        character at the same position in to; from and to must have the
        same length.

So if from and to are empty, maketrans will map zero characters, hence the identity. It's not the only way to get the identity translation table, by the way:

    >>> string.maketrans('', '') == string.maketrans('a', 'a') == string.maketrans('hello', 'hello')
    True

George
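The reasoning above (start from the identity, override only the characters named in from) can be modelled in a few lines. make_table is a hypothetical helper written for illustration, not the library's implementation, and the sketch assumes Python 3 bytes, comparing against bytes.maketrans.

```python
def make_table(frm, to):
    """Pure-Python model of the documented maketrans behaviour (hypothetical helper)."""
    if len(frm) != len(to):
        raise ValueError('from and to must have the same length')
    table = list(range(256))       # start from the identity mapping
    for src, dst in zip(frm, to):  # iterating bytes yields integers
        table[src] = dst           # override only the bytes listed in frm
    return bytes(table)

identity = make_table(b'', b'')

print(identity == bytes.maketrans(b'', b''))
print(make_table(b'hello', b'hello') == identity)
print(make_table(b'ab', b'xy') == bytes.maketrans(b'ab', b'xy'))
```

With empty arguments the loop body never runs, so the identity table comes back untouched; mapping characters onto themselves has the same net effect.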
Re: Filtering out non-readable characters
George Sakkis wrote:
> Peter Hansen [EMAIL PROTECTED] wrote:
>> Where did you learn that, George?
>
> Actually I first read about this in the Cookbook; there are two or three
> recipes related to string.translate. As for string.maketrans, it doesn't
> do anything special for empty string arguments: ...

I guess so. I was going to offer to suggest a new paragraph on that usage for the docs, but as you and Jp both seem to think the behaviour is obvious, I conclude it's just me so I suppose I shouldn't bother.

-Peter
Re: Filtering out non-readable characters
Peter Hansen [EMAIL PROTECTED] wrote:
> George Sakkis wrote:
>> Peter Hansen [EMAIL PROTECTED] wrote:
>>> Where did you learn that, George?
>>
>> Actually I first read about this in the Cookbook; there are two or
>> three recipes related to string.translate. As for string.maketrans, it
>> doesn't do anything special for empty string arguments: ...
>
> I guess so. I was going to offer to suggest a new paragraph on that
> usage for the docs, but as you and Jp both seem to think the behaviour
> is obvious, I conclude it's just me so I suppose I shouldn't bother.

It's only obvious in the sense that _after_ you see this idiom, you can go back to the docs and realize it's not doing something special; OTOH if you haven't seen it, it's not at all the obvious solution to "how do I get the first 256 characters". So IMO it should be mentioned, given that string.translate often operates on the identity table. I think a single sentence is adequate for the reference docs.

George
Re: Filtering out non-readable characters
On Sat, 16 Jul 2005 16:42:58 -0400, Peter Hansen wrote:
> Steven D'Aprano wrote:
>> On Sat, 16 Jul 2005 10:25:29 -0400, Peter Hansen wrote:
>>> Bengt Richter wrote:
>>>> identity = ''.join([chr(i) for i in xrange(256)])
>>>
>>> And note that with Python 2.4, in each case the above square brackets
>>> are unnecessary (though harmless), because of the arrival of generator
>>> expressions in the language.
>>
>> But to use generator expressions, wouldn't you need an extra pair of
>> round brackets?
>>
>> eg identity = ''.join( ( chr(i) for i in xrange(256) ) )
>
> Come on, Steven. Don't tell us you didn't have access to a Python
> interpreter to check before you posted:

Er, as I wrote in my post:

"Steven
who is still using Python 2.3, and probably will be for quite some time"

So, no, I didn't have access to a Python interpreter running version 2.4.

I take it then that generator expressions work quite differently than list comprehensions? The equivalent implied delimiters for a list comprehension would be something like this:

    >>> L = [1, 2, 3]
    >>> L[ i for i in range(2) ]
      File "<stdin>", line 1
        L[ i for i in range(2) ]
             ^
    SyntaxError: invalid syntax

which is a very different result from:

    >>> L[ [i for i in range(2)] ]
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    TypeError: list indices must be integers

In other words, a list comprehension must have the [ ] delimiters to be recognised as a list comprehension, EVEN IF the square brackets are there from some other element. But a generator expression doesn't care where the round brackets come from, so long as they are there: they can be part of the function call.

I hope that makes sense to you.

-- 
Steven
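The parenthesis rule described above can be probed without a 2.4 interpreter by asking the compiler whether each form parses. The helper compiles() is a hypothetical name for this sketch, which assumes Python 3; nothing is executed, so L does not need to exist.

```python
def compiles(source):
    """Return True if source parses as an expression (hypothetical helper)."""
    try:
        compile(source, '<sketch>', 'eval')
        return True
    except SyntaxError:
        return False

# Sole argument: the call's parentheses double as the generator's.
sole_argument = compiles("''.join(c for c in 'abc')")

# With a second argument the generator expression needs its own pair.
second_arg_bare = compiles("sum(x for x in range(3), 10)")
second_arg_wrapped = compiles("sum((x for x in range(3)), 10)")

# Subscript brackets do not double as list-comprehension brackets.
subscript_bare = compiles("L[i for i in range(2)]")
# This one parses; it would only fail later, at run time, with a TypeError.
subscript_nested = compiles("L[[i for i in range(2)]]")

print(sole_argument, second_arg_bare, second_arg_wrapped)
print(subscript_bare, subscript_nested)
```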
Filtering out non-readable characters
I have a file with binary and ascii characters in it. I massage the data and convert it to a more readable format, however it still comes up with some binary characters mixed in. I'd like to write something to just replace all non-printable characters with '' (I want to delete non-printable characters).

I am having trouble figuring out an easy python way to do this... is the easiest way to just write some regular expression that does something like replace [^\p] with ''?

Or is it better to go through every character and do ord(character), check the ascii values?

What's the easiest way to do something like this?

thanks
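For the regular-expression route asked about above: the standard re module has no \p class, so one possible sketch builds a negated character class from string.printable instead. The names NON_PRINTABLE and strip_with_regex are assumptions for illustration, and the example assumes Python 3.

```python
import re
import string

# re.escape guards characters such as ], ^, - and \ that would otherwise
# have special meaning inside the character class.
NON_PRINTABLE = re.compile('[^' + re.escape(string.printable) + ']')

def strip_with_regex(text):
    """Delete every character that is not in string.printable."""
    return NON_PRINTABLE.sub('', text)

result = strip_with_regex('ab\x00c\x7f\u00e9\n')
print(repr(result))
```

Anything outside the ASCII printable set is dropped, including accented letters, which may or may not be what is wanted for a given file.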