Re: [Tutor] \x00T\x00r\x00i\x00a\x00 ie I get \x00 breaking up every character ?

2011-11-21 Thread Steven D'Aprano

Dave Angel wrote:
> On 11/20/2011 04:45 PM, Steven D'Aprano wrote:
>> Something in the tool chain before it reached Python has saved it
>> using a wide (four byte) encoding, most likely UTF-16 as that is
>> widely used by Windows and Java. With the right settings, it could
>> take as little as opening the file in Notepad, then clicking Save.
>
> UTF-16 is a two byte format.  That's typically what Windows uses for
> Unicode.  It's Unices that are more likely to use a four-byte format.


Oops, you're right of course, two bytes, not four:

py> u'M'.encode('utf-16BE')
'\x00M'

I was thinking of four hex digits:

py> u'M'.encode('utf-16BE').encode('hex')
'004d'
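
For anyone following along on Python 3, where str.encode('hex') is no longer available, a rough equivalent of the two checks above might look like the sketch below. The UTF-32 comparison is an addition for illustration and was not part of the original session. Note that UTF-16 uses two bytes per code unit, so characters outside the Basic Multilingual Plane take four bytes.

```python
# Python 3 sketch of the checks above (the original session is Python 2).
encoded = "M".encode("utf-16-be")    # big-endian, no byte-order mark
print(encoded)                       # a zero byte followed by 'M'
print(encoded.hex())                 # the same two bytes shown as four hex digits

# A genuinely four-byte encoding, for comparison (not from the thread).
wide = "M".encode("utf-32-be")
print(len(encoded), len(wide))
```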




--
Steven
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] \x00T\x00r\x00i\x00a\x00 ie I get \x00 breaking up every character ?

2011-11-20 Thread Dave Angel

On 11/20/2011 04:45 PM, Steven D'Aprano wrote:
> Something in the tool chain before it reached Python has saved it
> using a wide (four byte) encoding, most likely UTF-16 as that is
> widely used by Windows and Java. With the right settings, it could
> take as little as opening the file in Notepad, then clicking Save.

UTF-16 is a two byte format.  That's typically what Windows uses for
Unicode.  It's Unices that are more likely to use a four-byte format.


--

DaveA



Re: [Tutor] \x00T\x00r\x00i\x00a\x00 ie I get \x00 breaking up every character ?

2011-11-20 Thread Steven D'Aprano

dave selby wrote:
> I split the HTML and print text and I get loads of
>
> \x00T\x00r\x00i\x00a\x00  ie I get \x00 breaking up every character.
>
> Any idea what is happening and how to get back to a list of ascii strings ?



How did you generate the HTML file? What other applications have you 
used to save the document?


Something in the tool chain before it reached Python has saved it using 
a wide (four byte) encoding, most likely UTF-16 as that is widely used 
by Windows and Java. With the right settings, it could take as little as 
opening the file in Notepad, then clicking Save.


If this isn't making sense to you, you should read this:

http://www.joelonsoftware.com/articles/Unicode.html

If my guess is right that the file is UTF-16, then you can "fix" it by 
doing this:



# Untested.
f = open("my_html_file.html", "rb")  # binary mode: read the raw bytes untouched
text = f.read().decode("utf-16")  # convert bytes to text
f.close()
data = text.encode("ascii")  # If this fails, try "latin-1" instead
f = open("my_html_file2.html", "wb")  # write bytes back to disk
f.write(data)
f.close()

Once you've inspected the re-written file my_html_file2.html and it is 
okay to your satisfaction, you can delete the original one.
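
On Python 3 the same idea can be sketched by letting open() do the decoding and encoding. This is only an illustration under assumptions: the helper name reencode is hypothetical, the file really is UTF-16, and the output is written as UTF-8 rather than ASCII so that non-ASCII characters do not raise an error. The self-check uses throwaway files, not a real HTML document.

```python
import os
import tempfile

def reencode(src_path, dst_path, src_encoding="utf-16", dst_encoding="utf-8"):
    """Read src_path as text in src_encoding and write it out in dst_encoding."""
    # newline="" turns off newline translation in both directions.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)
    return text

# Self-check with temporary files (the paths here are made up for the example).
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "my_html_file.html")
dst = os.path.join(tmp_dir, "my_html_file2.html")
with open(src, "wb") as f:
    f.write("<p>Trial</p>".encode("utf-16"))   # this codec writes a byte-order mark
result = reencode(src, dst)
print(result)
```

If the file has no byte-order mark, the "utf-16" codec falls back to the platform's native byte order, so "utf-16-le" or "utf-16-be" may need to be passed explicitly.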



--
Steven


Re: [Tutor] \x00T\x00r\x00i\x00a\x00 ie I get \x00 breaking up every character ?

2011-11-20 Thread Steve Willoughby
It's customary to copy the list with answers, so everyone can benefit 
who may run into the same issue, too.


On 20-Nov-11 11:38, dave selby wrote:
> It came from some automated HTML generation app ... I just had the
> idea of looking at it with ghex: every other character is \00,
> that's mad. OK, will try and replace('\00', '') in the string
> before splitting


Those bytes are there for a reason; it's not mad.  The file is using wide
characters, possibly due to Unicode encoding.  If there are special
characters involved (multinational applications or whatever), you'll
destroy them by killing the null bytes, and you won't handle the case of
that high-order byte being something other than zero.


Check out Python's Unicode handling, and character set encode/decode 
features for a robust way to translate the output you're getting.





> Cheers
>
> Dave
>
> On 20 November 2011 19:15, Steve Willoughby wrote:
>> Where did the string come from?  It looks at first glance like you have two
>> bytes for each character instead of the one you expect.  Is this perhaps a
>> Unicode string instead of ASCII?
>>
>> Sent from my iPad
>>
>> On 2011/11/20, at 10:28, dave selby wrote:
>>> Hi All,
>>>
>>> I have a long string which is an HTML file, I strip the HTML tags away
>>> and make a list with
>>>
>>> text = re.split('<.*?>', HTML)
>>>
>>> I then tried to search for a string with text.index(...) but it was
>>> not found, printing HTML to a terminal I get what I expect, a block of
>>> tags and text, I split the HTML and print text and I get loads of
>>>
>>> \x00T\x00r\x00i\x00a\x00  ie I get \x00 breaking up every character.
>>>
>>> Any idea what is happening and how to get back to a list of ascii strings ?
>>>
>>> Cheers
>>>
>>> Dave
>>>
>>> --
>>>
>>> Please avoid sending me Word or PowerPoint attachments.
>>> See http://www.gnu.org/philosophy/no-word-attachments.html
--
Steve Willoughby / st...@alchemy.com
"A ship in harbor is safe, but that is not what ships are built for."
PGP Fingerprint 4615 3CCE 0F29 AE6C 8FF4 CA01 73FE 997A 765D 696C


Re: [Tutor] \x00T\x00r\x00i\x00a\x00 ie I get \x00 breaking up every character ?

2011-11-20 Thread Steve Willoughby

On 20-Nov-11 12:04, Sarma Tangirala wrote:
> Would the html parser library in python be a better idea as opposed to
> using split? That way you have greater control over what is in the html.


Absolutely. And it would handle improper HTML (like unmatched brackets) 
gracefully where the split will just do the wrong thing.
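
As a rough illustration of the parser approach, here is a Python 3 sketch using the standard library's html.parser module (named HTMLParser in Python 2). The class name TextCollector and the sample markup are made up for the example, and it assumes the bytes are UTF-16, so they are decoded before parsing.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect the text found between tags, skipping whitespace-only runs."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text outside of tags.
        if data.strip():
            self.chunks.append(data.strip())

# Made-up sample standing in for the generated HTML file.
raw = "<html><body><h1>Trial</h1><p>Some <b>bold</b> text</p></body></html>".encode("utf-16")

collector = TextCollector()
collector.feed(raw.decode("utf-16"))   # decode first, then parse
collector.close()
print(collector.chunks)
```

With the text decoded up front, a later search such as collector.chunks.index("Trial") compares like with like instead of tripping over embedded null bytes.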







--
Steve Willoughby / st...@alchemy.com
"A ship in harbor is safe, but that is not what ships are built for."
PGP Fingerprint 4615 3CCE 0F29 AE6C 8FF4 CA01 73FE 997A 765D 696C


Re: [Tutor] \x00T\x00r\x00i\x00a\x00 ie I get \x00 breaking up every character ?

2011-11-20 Thread Sarma Tangirala
Would the html parser library in python be a better idea as opposed to
using split? That way you have greater control over what is in the html.


Re: [Tutor] \x00T\x00r\x00i\x00a\x00 ie I get \x00 breaking up every character ?

2011-11-20 Thread Steve Willoughby
Where did the string come from?  It looks at first glance like you have two 
bytes for each character instead of the one you expect.  Is this perhaps a 
Unicode string instead of ASCII?
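
One quick way to test that suspicion is to look at the raw bytes themselves. The Python 3 sketch below is a heuristic only: the function name looks_like_utf16 and the 64-byte sample size are arbitrary choices for illustration, and it can be fooled (for example, a UTF-32 little-endian byte-order mark begins with the same two bytes as the UTF-16 one).

```python
import codecs

def looks_like_utf16(data):
    """Heuristic: a UTF-16 byte-order mark, or a zero in every other byte of a sample."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return True
    sample = data[:64]
    if len(sample) < 4:
        return False
    evens, odds = sample[0::2], sample[1::2]
    # Mostly-ASCII text in UTF-16 has a zero byte in every even or every odd position.
    return evens.count(0) == len(evens) or odds.count(0) == len(odds)

print(looks_like_utf16(b"\x00T\x00r\x00i\x00a"))
print(looks_like_utf16(b"Trial text"))
```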

Sent from my iPad
