It's customary to copy the list on answers, so everyone who runs into
the same issue can benefit, too.
On 20-Nov-11 11:38, dave selby wrote:
It came from some automated HTML generation app ... I just had the
idea of looking at it with ghex .... every other character is \00
!!!!, that's mad. OK, will try and replace('\00', '') in the string
before splitting
Those bytes are there for a reason, it's not mad. It's using wide
characters, possibly due to Unicode encoding. If there are special
characters involved (multinational applications or whatever), you'll
destroy them by killing the null bytes, because that doesn't handle the
case of the high-order byte being something other than zero.
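To make that concrete, here is a minimal Python 3 sketch. The sample text and the little-endian UTF-16 encoding are assumptions for illustration; the thread does not say which encoding the file actually uses.

```python
# Hypothetical sample: "Trial €5" encoded as UTF-16 little-endian, no BOM.
raw = "Trial \u20ac5".encode("utf-16-le")

# Stripping the zero bytes looks fine for the plain ASCII letters...
stripped = raw.replace(b"\x00", b"")

# ...but the euro sign is U+20AC, stored as the two bytes AC 20.
# Neither byte is zero, so both survive and turn into junk.
print(stripped)   # b'Trial \xac 5'

# Decoding with the right codec keeps the character intact.
decoded = raw.decode("utf-16-le")
print(decoded)    # Trial €5
```

The stripped result has a stray \xac and an extra space where the euro sign used to be, which is the kind of damage described above.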
Check out Python's Unicode handling, and character set encode/decode
features for a robust way to translate the output you're getting.
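A rough sketch of the decode-first approach, in Python 3. The sample HTML and the choice of UTF-16 with a byte-order mark are assumptions; if the real file uses a different encoding, the codec name would need to change accordingly.

```python
import re

# Hypothetical stand-in for the file contents: HTML saved as UTF-16 with a BOM.
raw = "<html><b>Trial</b> run</html>".encode("utf-16")

# Decode the bytes to text first. The "utf-16" codec reads the BOM
# at the start of the data to work out the byte order.
html = raw.decode("utf-16")

# Now the tag-stripping split works on characters rather than raw bytes.
# Empty pieces between adjacent tags are dropped.
text = [piece for piece in re.split(r"<.*?>", html) if piece]

print(text)                 # ['Trial', ' run']
print(text.index("Trial"))  # 0
```

Once the data is decoded, text.index(...) finds the string because the list holds ordinary text with no \x00 bytes in between.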
Cheers
Dave
On 20 November 2011 19:15, Steve Willoughby<st...@alchemy.com> wrote:
Where did the string come from? It looks at first glance like you have two
bytes for each character instead of the one you expect. Is this perhaps a
Unicode string instead of ASCII?
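One rough way to check, sketched in Python 3. The helper name guess_utf16 and its zero-counting heuristic are made up for illustration; it only gives a hint for mostly-ASCII text and is no substitute for knowing how the file was written.

```python
import codecs

def guess_utf16(data: bytes) -> str:
    """Hypothetical helper: guess whether data looks like UTF-16."""
    # A byte-order mark at the start is the strongest hint.
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le (BOM)"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be (BOM)"

    half = len(data) // 2
    if half == 0:
        return "unknown"

    # For mostly-ASCII text, every second byte is zero.
    # Which half holds the zeros suggests the byte order.
    even_zeros = data[0::2].count(b"\x00")
    odd_zeros = data[1::2].count(b"\x00")
    if odd_zeros > half // 2 and even_zeros == 0:
        return "utf-16-le (no BOM)"
    if even_zeros > half // 2 and odd_zeros == 0:
        return "utf-16-be (no BOM)"
    return "unknown"

print(guess_utf16("Trial".encode("utf-16-le")))  # utf-16-le (no BOM)
print(guess_utf16(b"Trial"))                     # unknown
```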
On 2011/11/20, at 10:28, dave selby<dave6...@gmail.com> wrote:
Hi All,
I have a long string which is an HTML file. I strip the HTML tags away
and make a list with
text = re.split('<.*?>', HTML)
I then tried to search for a string with text.index(...) but it was
not found. Printing HTML to a terminal, I get what I expect: a block of
tags and text. But when I split the HTML and print text, I get loads of
\x00T\x00r\x00i\x00a\x00, i.e. I get \x00 breaking up every character.
Any idea what is happening and how to get back to a list of ASCII strings?
Cheers
Dave
_______________________________________________
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
--
Steve Willoughby / st...@alchemy.com
"A ship in harbor is safe, but that is not what ships are built for."
PGP Fingerprint 4615 3CCE 0F29 AE6C 8FF4 CA01 73FE 997A 765D 696C