Re: UTF-8 and latin1
On Wed, 26 Oct 2022 at 05:09, Barry Scott wrote:
>
> > On 25 Oct 2022, at 11:16, Stefan Ram wrote:
> >
> > r...@zedat.fu-berlin.de (Stefan Ram) writes:
> >> You can let Python guess the encoding of a file.
> >> def encoding_of( name ):
> >>     path = pathlib.Path( name )
> >>     for encoding in( "utf_8", "cp1252", "latin_1" ):
> >>         try:
> >>             with path.open( encoding=encoding, errors="strict" )as file:
> >
> > I also read a book which claimed that the tkinter.Text
> > widget would accept bytes and guess whether these are
> > encoded in UTF-8 or "ISO 8859-1" and decode them
> > accordingly. However, today I found that here it does
> > accept bytes but it always guesses "ISO 8859-1".
>
> The best you can do is assume that if the text cannot decode as utf-8 it may
> be 8859-1.

Except when it's Windows-1252.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
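To make the Windows-1252 caveat concrete, here is a small sketch (not from the thread) of where the two encodings part ways. As far as I know they differ only in the 0x80 to 0x9F range; the sample bytes are made up for illustration.

```python
# Hypothetical sample: bytes as a Windows program might write curly quotes.
raw = b"\x93hello\x94"

as_cp1252 = raw.decode("cp1252")    # 0x93/0x94 are curly double quotes
as_latin1 = raw.decode("latin-1")   # 0x93/0x94 become C1 control characters

print(repr(as_cp1252))
print(repr(as_latin1))

# From 0xA0 upwards the two encodings agree byte for byte.
upper = bytes(range(0xA0, 0x100))
same_upper = upper.decode("cp1252") == upper.decode("latin-1")
print(same_upper)
```

Both decodes succeed, which is why a try/except chain cannot tell the two apart on its own: latin-1 accepts every byte, so it only ever serves as a last resort.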
Re: UTF-8 and latin1
> On 25 Oct 2022, at 11:16, Stefan Ram wrote:
>
> r...@zedat.fu-berlin.de (Stefan Ram) writes:
>> You can let Python guess the encoding of a file.
>> def encoding_of( name ):
>>     path = pathlib.Path( name )
>>     for encoding in( "utf_8", "cp1252", "latin_1" ):
>>         try:
>>             with path.open( encoding=encoding, errors="strict" )as file:
>
> I also read a book which claimed that the tkinter.Text
> widget would accept bytes and guess whether these are
> encoded in UTF-8 or "ISO 8859-1" and decode them
> accordingly. However, today I found that here it does
> accept bytes but it always guesses "ISO 8859-1".

The best you can do is assume that if the text cannot decode as utf-8 it may
be 8859-1.

Barry

>
> main.py
>
> import tkinter
>
> text = tkinter.Text()
> text.insert( tkinter.END, "AÄäÖöÜüß".encode( encoding='ISO 8859-1' ))
> text.insert( tkinter.END, "AÄäÖöÜüß".encode( encoding='UTF-8' ))
> text.pack()
> print( text.get( "1.0", "end" ))
>
> output
>
> AÄäÖöÜüßAÃäÃöÃüà
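The garbled second half of that output can be reproduced without a GUI. This is only a sketch of the effect, assuming the widget really does treat the bytes as ISO 8859-1 as reported; it does not call tkinter at all.

```python
original = "AÄäÖöÜüß"

# UTF-8 bytes misread as ISO 8859-1: every non-ASCII character becomes two.
garbled = original.encode("utf-8").decode("iso-8859-1")
print(len(original), len(garbled))

# The damage is reversible as long as no bytes were dropped along the way.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired == original)
```

The round trip works because ISO 8859-1 maps every byte value to exactly one code point, so nothing is lost until some later step strips the control characters.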
Re: UTF-8 and latin1
On Thu, 18 Aug 2022 11:33:59 -0700, Tobiah declaimed the following:

>So how does this break down? When a person enters
>Montréal, Quebéc into a form field, what are they
>doing on the keyboard to make that happen? As the
>string sits there in the text box, is it latin1, or utf-8
>or something else? How does the browser know what
>sort of data it has in that text box?
>

If this were my ancient Amiga -- most of the accented characters in
ISO-Latin-1 were entered by using one of the meta/alt keys simultaneously
with one of five or six designated "dead keys" (in days of typewriters, a
dead key was one that did not advance the carriage to the next character
space). The dead key indicated which accent mark was to be applied to the
subsequent "regular" character.

On Windows, many of the characters might be entered using Alt plus a
numeric code (where the digits are keys on the numeric pad!) (such as
1254 => µ).

As for what the browser receives? Unless the browser is asking for raw
key codes and translating them internally to some encoding, it is likely
receiving characters in whatever encoding has been defined for the
computer/OS (Windows, most likely CP1252, which is a superset of latin-1
as I recall).

Whether the browser then re-encodes that to UTF-8 is something I can't
answer.

--
Wulfraed Dennis Lee Bieber AF6VN
wlfr...@ix.netcom.com http://wlfraed.microdiversity.freeddns.org/
Re: UTF-8 and latin1
On Fri, 19 Aug 2022 at 08:15, Tobiah wrote:
>
> > You configure the web server to send:
> >
> > Content-Type: text/html; charset=...
> >
> > in the HTTP header when it serves HTML files.
>
> So how does this break down? When a person enters
> Montréal, Quebéc into a form field, what are they
> doing on the keyboard to make that happen? As the
> string sits there in the text box, is it latin1, or utf-8
> or something else? How does the browser know what
> sort of data it has in that text box?
>

As it sits there in the text box, it is *a text string*.

When it gets sent to the server, the encoding is defined by the
browser (with reference to the server's specifications) and identified
in a request header.

The server should then receive that and interpret it as a text string.

Encodings should ONLY be relevant when data is stored in files or
transmitted across a network etc, and the rest of the time, just think
in Unicode.

Also - migrate to Python 3, your life will become a lot easier.

ChrisA
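A rough Python 3 sketch of "encodings only at the edges" might look like this. The function name, the `charset` parameter and the sample value are all invented for illustration; a real web framework would normally do the decoding for you.

```python
def handle_form_value(raw: bytes, charset: str = "utf-8") -> bytes:
    """Decode on the way in, work on str, encode on the way out."""
    text = raw.decode(charset)      # bytes -> text at the input boundary
    text = text.strip().title()     # all processing happens on text
    return text.encode("utf-8")     # text -> bytes at the output boundary

# Hypothetical submission that arrived as latin-1 bytes.
reply = handle_form_value("  montréal ".encode("latin-1"), charset="latin-1")
print(reply)
```

Between the decode and the encode the code never needs to know which encoding the bytes arrived in.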
Re: UTF-8 and latin1
On 2022-08-18, Tobiah wrote:
>> You configure the web server to send:
>>
>> Content-Type: text/html; charset=...
>>
>> in the HTTP header when it serves HTML files.
>
> So how does this break down? When a person enters
> Montréal, Quebéc into a form field, what are they
> doing on the keyboard to make that happen?

It depends on what keyboard they have. Using a standard UK or US
("qwerty") keyboard and Windows you should be able to type "é" by
holding down the 'Alt' key to the right of the spacebar, and typing
'e'. If they're using a French ("azerty") keyboard then I think they
can enter it by holding 'shift' and typing '2'.

> As the string sits there in the text box, is it latin1, or utf-8
> or something else?

That depends on which browser you're using. I think it's quite likely
it will use UTF-32 (i.e. fixed-width 32 bits per character).

> How does the browser know what sort of data it has in that text box?

It's a text box, so it knows it's text.
Re: UTF-8 and latin1
> You configure the web server to send:
>
> Content-Type: text/html; charset=...
>
> in the HTTP header when it serves HTML files.

So how does this break down? When a person enters
Montréal, Quebéc into a form field, what are they
doing on the keyboard to make that happen? As the
string sits there in the text box, is it latin1, or utf-8
or something else? How does the browser know what
sort of data it has in that text box?
Re: UTF-8 and latin1
On 2022-08-18, Tobiah wrote:
>> Generally speaking browser submissions were/are supposed to be sent
>> using the same encoding as the page, so if you're sending the page
>> as "latin1" then you'll see that a fair amount I should think. If you
>> send it as "utf-8" then you'll get 100% utf-8 back.
>
> The only trick I know is to use <meta charset="utf-8">. Would
> that 'send' the post as utf-8? I always expected it had more
> to do with the way the user entered the characters. How do
> they by the way, enter things like Montréal, Quebéc. When they
> enter that into a text box on a web page can we say it's in
> a particular encoding at that time? At submit time?

You configure the web server to send:

    Content-Type: text/html; charset=...

in the HTTP header when it serves HTML files. Another way is to put:

    <meta http-equiv="Content-Type" content="text/html; charset=...">

or:

    <meta charset="...">

in the <head> section of your HTML document. The HTML "standard"
nowadays says that you are only allowed to use the "utf-8" encoding,
but if you use another encoding then browsers will generally use that
as both the encoding to use when reading the HTML file and the encoding
to use when submitting form data.
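On the receiving side, one way to read the charset back out of a Content-Type value is the standard library's email header parser. This is only a sketch: `charset_of` is a made-up helper, and the UTF-8 default is an assumption for when no charset is given.

```python
from email.message import Message

def charset_of(content_type: str, default: str = "utf-8") -> str:
    """Return the charset parameter of a Content-Type value, lower-cased."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns failobj when no charset parameter exists.
    return msg.get_content_charset(failobj=default)

print(charset_of("text/html; charset=ISO-8859-1"))
print(charset_of("text/html"))
```

Whatever comes back is only what the sender claimed, so strict decoding can still fail on the actual bytes.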
Re: UTF-8 and latin1
On 2022-08-17, Barry wrote:
>> On 17 Aug 2022, at 18:30, Jon Ribbens via Python-list
>> wrote:
>> On 2022-08-17, Tobiah wrote:
>>> I get data from various sources; client emails, spreadsheets, and
>>> data from web applications. I find that I can do
>>> some_string.decode('latin1')
>>> to get unicode that I can use with xlsxwriter,
>>> or put in the header of a web page to display
>>> European characters correctly. But normally UTF-8 is recommended as
>>> the encoding to use today. latin1 works correctly more often when I
>>> am using data from the wild. It's frustrating that I have to play
>>> a guessing game to figure out how to use incoming text. I'm just wondering
>>> if there are any thoughts. What if we just globally decided to use utf-8?
>>> Could that ever happen?
>>
>> That has already been decided, as much as it ever can be. UTF-8 is
>> essentially always the correct encoding to use on output, and almost
>> always the correct encoding to assume on input absent any explicit
>> indication of another encoding. (e.g. the HTML "standard" says that
>> all HTML files must be UTF-8.)
>>
>> If you are finding that your specific sources are often encoded with
>> latin-1 instead then you could always try something like:
>>
>>     try:
>>         text = data.decode('utf-8')
>>     except UnicodeDecodeError:
>>         text = data.decode('latin-1')
>>
>> (I think latin-1 text will almost always fail to be decoded as utf-8,
>> so this would work fairly reliably assuming those are the only two
>> encodings you see.)
>
> Only if a reserved byte is used in the string.
> It will often work in either.

Because it's actually ASCII and hence there's no difference between
interpreting it as utf-8 or iso-8859-1? In which case, who cares?

> For web pages it cannot be assumed that markup saying it’s utf-8 is
> correct. Many pages are in fact cp1252. Usually you find out because
> of a smart quote that is 0xa0 in cp1252 and illegal in utf-8.

Hence what I said above.

But if a source explicitly states an encoding and it's false then
these days I see little need for sympathy.
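Both halves of this exchange are easy to check. The sketch below uses invented sample bytes; my understanding is that cp1252 puts the curly quotes at 0x91 to 0x94 and the no-break space at 0xa0, and a lone byte from either range is not valid UTF-8.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# Pure ASCII decodes identically either way, so the guess does not matter.
ascii_only = b"plain old text"
same = ascii_only.decode("utf-8") == ascii_only.decode("latin-1")
print(ascii_only.isascii(), same)

# Single high bytes from a cp1252 page are rejected by the UTF-8 decoder.
quotes_ok = is_valid_utf8(b"\x93quoted\x94")   # cp1252 curly quotes
nbsp_ok = is_valid_utf8(b"10\xa0km")           # cp1252/latin-1 no-break space
print(quotes_ok, nbsp_ok)
```

So a mislabelled cp1252 page gives itself away at its first non-ASCII byte, provided the decode is done with strict error handling.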
Re: UTF-8 and latin1
> Generally speaking browser submissions were/are supposed to be sent
> using the same encoding as the page, so if you're sending the page
> as "latin1" then you'll see that a fair amount I should think. If you
> send it as "utf-8" then you'll get 100% utf-8 back.

The only trick I know is to use <meta charset="utf-8">. Would
that 'send' the post as utf-8? I always expected it had more
to do with the way the user entered the characters. How do
they by the way, enter things like Montréal, Quebéc. When they
enter that into a text box on a web page can we say it's in
a particular encoding at that time? At submit time?
Re: UTF-8 and latin1
On 2022-08-17, Tobiah wrote:
>> That has already been decided, as much as it ever can be. UTF-8 is
>> essentially always the correct encoding to use on output, and almost
>> always the correct encoding to assume on input absent any explicit
>> indication of another encoding. (e.g. the HTML "standard" says that
>> all HTML files must be UTF-8.)
>
> I got an email from a client with blast text that
> was in French with stuff like: Montréal, Quebéc.
> latin1 did the trick.

There's no accounting for the Québécois. They think they speak French.

> Also, whenever I get a spreadsheet from a client and save as .csv,
> or take browser data through PHP, it always seems to work with latin1,
> but not UTF-8.

That depends on how you "saved as .csv" and what you did with PHP.
Generally speaking browser submissions were/are supposed to be sent
using the same encoding as the page, so if you're sending the page
as "latin1" then you'll see that a fair amount I should think. If you
send it as "utf-8" then you'll get 100% utf-8 back.
Re: UTF-8 and latin1
On 18/08/2022 03.33, Stefan Ram wrote:
> Tobiah writes:
>> I get data from various sources; client emails, spreadsheets, and
>> data from web applications. I find that I can do
>> some_string.decode('latin1')
>
> Strings have no "decode" method. ("bytes" objects do.)
>
>> to get unicode that I can use with xlsxwriter,
>> or put in the header of a web page to display
>> European characters correctly.
>
> |You should always use the UTF-8 character encoding. (Remember
> |that this means you also need to save your content as UTF-8.)
> World Wide Web Consortium (W3C) (2014)
>
>> am using data from the wild. It's frustrating that I have to play
>> a guessing game to figure out how to use incoming text. I'm just wondering
>
> You can let Python guess the encoding of a file.
>
> def encoding_of( name ):
>     path = pathlib.Path( name )
>     for encoding in( "utf_8", "cp1252", "latin_1" ):
>         try:
>             with path.open( encoding=encoding, errors="strict" )as file:
>                 text = file.read()
>                 return encoding
>         except UnicodeDecodeError:
>             pass
>     return None
>
>> if there are any thoughts. What if we just globally decided to use utf-8?
>> Could that ever happen?
>
> That decision has been made long ago.

Unfortunately, much of our data was collected long before then - and as
we've discovered, the OP is still living in Python 2 times.

What about if the path "name" (above) is not in utf-8?
e.g. the OP's Montréal in Latin1, as Montréal.txt or Montréal.rpt

--
Regards,
=dn
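For anyone who wants to try the quoted function, here is a self-contained variation. It is a sketch rather than Stefan's exact code: it reads the file as bytes once and decodes in memory, and the temporary file and its contents are invented for the demonstration.

```python
import pathlib
import tempfile

def encoding_of(name, candidates=("utf_8", "cp1252", "latin_1")):
    """Return the first candidate that decodes the whole file, else None."""
    data = pathlib.Path(name).read_bytes()
    for encoding in candidates:
        try:
            data.decode(encoding, errors="strict")
        except UnicodeDecodeError:
            continue
        return encoding
    return None

with tempfile.TemporaryDirectory() as tmp:
    sample = pathlib.Path(tmp) / "sample.txt"
    sample.write_bytes("Montréal".encode("latin_1"))  # written as latin-1
    guess = encoding_of(sample)

print(guess)
```

The file was written as latin-1 but the answer comes back as cp1252, because that is simply the first candidate that does not raise. Since latin_1 accepts any byte sequence, the final `return None` is only reachable if the candidate list is changed.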
Re: UTF-8 and latin1
> On 17 Aug 2022, at 18:30, Jon Ribbens via Python-list
> wrote:
>
> On 2022-08-17, Tobiah wrote:
>> I get data from various sources; client emails, spreadsheets, and
>> data from web applications. I find that I can do
>> some_string.decode('latin1')
>> to get unicode that I can use with xlsxwriter,
>> or put in the header of a web page to display
>> European characters correctly. But normally UTF-8 is recommended as
>> the encoding to use today. latin1 works correctly more often when I
>> am using data from the wild. It's frustrating that I have to play
>> a guessing game to figure out how to use incoming text. I'm just wondering
>> if there are any thoughts. What if we just globally decided to use utf-8?
>> Could that ever happen?
>
> That has already been decided, as much as it ever can be. UTF-8 is
> essentially always the correct encoding to use on output, and almost
> always the correct encoding to assume on input absent any explicit
> indication of another encoding. (e.g. the HTML "standard" says that
> all HTML files must be UTF-8.)
>
> If you are finding that your specific sources are often encoded with
> latin-1 instead then you could always try something like:
>
>     try:
>         text = data.decode('utf-8')
>     except UnicodeDecodeError:
>         text = data.decode('latin-1')
>
> (I think latin-1 text will almost always fail to be decoded as utf-8,
> so this would work fairly reliably assuming those are the only two
> encodings you see.)

Only if a reserved byte is used in the string.
It will often work in either.

For web pages it cannot be assumed that markup saying it’s utf-8 is
correct. Many pages are in fact cp1252. Usually you find out because
of a smart quote that is 0xa0 in cp1252 and illegal in utf-8.

Barry

>
> Or you could use something fancy like https://pypi.org/project/chardet/
>
Re: UTF-8 and latin1
> That has already been decided, as much as it ever can be. UTF-8 is
> essentially always the correct encoding to use on output, and almost
> always the correct encoding to assume on input absent any explicit
> indication of another encoding. (e.g. the HTML "standard" says that
> all HTML files must be UTF-8.)

I got an email from a client with blast text that
was in French with stuff like: Montréal, Quebéc.
latin1 did the trick.

Also, whenever I get a spreadsheet from a client and save as .csv,
or take browser data through PHP, it always seems to work with latin1,
but not UTF-8.
Re: UTF-8 and latin1
On 8/17/22 08:33, Stefan Ram wrote:
> Tobiah writes:
>> I get data from various sources; client emails, spreadsheets, and
>> data from web applications. I find that I can do
>> some_string.decode('latin1')
>
> Strings have no "decode" method. ("bytes" objects do.)

I'm using 2.7. Maybe that's why.

Toby
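For readers on Python 3, the difference being discussed looks roughly like this. The sample bytes are made up; the point is only which type carries which method.

```python
data = b"Montr\xe9al"            # bytes, e.g. as read from a file or socket
text = data.decode("latin1")     # str (what Python 2.7 called unicode)

# In Python 3 only bytes can be decoded and only str can be encoded.
bytes_can_decode = hasattr(bytes, "decode")
str_can_decode = hasattr(str, "decode")
print(bytes_can_decode, str_can_decode)
print(text)
```

In 2.7 a plain `str` was a byte string, which is why `some_string.decode('latin1')` works there.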
Re: UTF-8 and latin1
On 2022-08-17, Tobiah wrote:
> I get data from various sources; client emails, spreadsheets, and
> data from web applications. I find that I can do
> some_string.decode('latin1')
> to get unicode that I can use with xlsxwriter,
> or put in the header of a web page to display
> European characters correctly. But normally UTF-8 is recommended as
> the encoding to use today. latin1 works correctly more often when I
> am using data from the wild. It's frustrating that I have to play
> a guessing game to figure out how to use incoming text. I'm just wondering
> if there are any thoughts. What if we just globally decided to use utf-8?
> Could that ever happen?

That has already been decided, as much as it ever can be. UTF-8 is
essentially always the correct encoding to use on output, and almost
always the correct encoding to assume on input absent any explicit
indication of another encoding. (e.g. the HTML "standard" says that
all HTML files must be UTF-8.)

If you are finding that your specific sources are often encoded with
latin-1 instead then you could always try something like:

    try:
        text = data.decode('utf-8')
    except UnicodeDecodeError:
        text = data.decode('latin-1')

(I think latin-1 text will almost always fail to be decoded as utf-8,
so this would work fairly reliably assuming those are the only two
encodings you see.)

Or you could use something fancy like https://pypi.org/project/chardet/
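Wrapped up as a function, the try/except suggestion might look like the sketch below (Python 3.9 or later for the annotation). The name `decode_guess` and the choice to return the encoding alongside the text are my own additions, not part of the original suggestion.

```python
def decode_guess(data: bytes) -> tuple[str, str]:
    """Try UTF-8 first, then fall back to latin-1 (which never fails)."""
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return data.decode("latin-1"), "latin-1"

from_utf8 = decode_guess("Montréal".encode("utf-8"))
from_latin1 = decode_guess("Montréal".encode("latin-1"))
print(from_utf8, from_latin1)
```

Returning the encoding that worked makes it easy to log how often the fallback is taken, which is a cheap way to find out which sources are still sending latin-1.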