RE: Changing filenames from Greeklish = Greek (subprocess complain)
To: python-list@python.org
From: breamore...@yahoo.co.uk
Subject: Re: Changing filenames from Greeklish = Greek (subprocess complain)
Date: Sun, 2 Jun 2013 15:51:31 +0100

[...]

Steve is going for the pink ball - and for those of you who are watching in black and white, the pink is next to the green. Snooker commentator 'Whispering' Ted Lowe.

Mark Lawrence
-- http://mail.python.org/mailman/listinfo/python-list
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Sunday 02 June 2013 13:10:30 Chris Angelico did opine:

> On Mon, Jun 3, 2013 at 2:21 AM, Νικόλαος Κούρας nikos.gr...@gmail.com wrote:
>> Paying for someone to just remove a dash to get the script working is too much to ask for
>
> One dash: 1c
> Knowing where to remove it: $99.99
> Total bill: $100.00
>
> Knowing that it ought really to be utf8mb4 and giving hints that the docs should be read rather than just taking this simple example and plugging it in: Priceless.
>
> ChrisA

Chuckle. Chris, I do believe you have topped yourself. Love it.

Cheers, Gene
--
There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order. -Ed Howdershelt (Author)
My web page: http://coyoteden.dyndns-free.com:85/gene is up!
My views http://www.armchairpatriot.com/What%20Has%20America%20Become.shtml
Unnamed Law: If it happens, it must be possible.
A pen in the hand of this president is far more dangerous than 200 million guns in the hands of law-abiding citizens.
RE: Changing filenames from Greeklish = Greek (subprocess complain)
' Server: ApacheBooster/1.6' isn't a signature of httpd. I think you are really running something different.

From: nob...@nowhere.com
Subject: Re: Changing filenames from Greeklish = Greek (subprocess complain)
Date: Tue, 4 Jun 2013 14:01:48 +0100
To: python-list@python.org

On Tue, 04 Jun 2013 00:58:42 -0700, Νικόλαος Κούρας wrote:

> On Tuesday, 4 June 2013 10:39:08 AM UTC+3, Nobody wrote:
>> Chrome didn't choose ISO-8859-1, the server did; the HTTP response says:
>> Content-Type: text/html;charset=ISO-8859-1
>
> From where do you see this

$ wget -S -O - http://superhost.gr/data/apps/
--2013-06-04 14:00:10-- http://superhost.gr/data/apps/
Resolving superhost.gr... 82.211.30.133
Connecting to superhost.gr|82.211.30.133|:80... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Server: ApacheBooster/1.6
  Date: Tue, 04 Jun 2013 13:00:19 GMT
  Content-Type: text/html;charset=ISO-8859-1
  Transfer-Encoding: chunked
  Connection: keep-alive
  Vary: Accept-Encoding
  X-Cacheable: YES
  X-Varnish: 2000177813
  Via: 1.1 varnish
  age: 0
  X-Cache: MISS
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On 06/10/2013 01:11 AM, Νικόλαος Κούρας wrote:

> On Monday, 10 June 2013 10:51:34 AM UTC+3, Larry Hudson wrote:
>>> I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.
>>
>> 0 - 127, yes.
>> 128 - 255 - one byte of a multibyte code.
>
> You mean that in utf-8, for 1 character to be stored, we need 2 bytes? I am still having trouble understanding this.

UTF-8 characters are encoded in different sizes, NOT a single fixed number of bytes. The high _bits_ of the first byte define the number of bytes of the individual character code. (I'm copying this from Wikipedia...)

0xxxxxxx - 1 byte
110xxxxx - 2 bytes
1110xxxx - 3 bytes
11110xxx - 4 bytes
111110xx - 5 bytes
1111110x - 6 bytes

Notice that in the 1-byte version, since the high bit is always 0, only 7 bits are available for the character code, and this is the standard 0-127 ASCII (and ASCII-compatible) code set.

> Since 2^8 = 256, utf-8 would need 1 byte to store the 1st 256 characters, but instead it is using 1 byte up to the first 127 values and then 2 bytes for anything above. Why?

As I indicated above, one bit is reserved as a flag to indicate that the code is a one-byte code and not a multibyte code, so only 7 bits are available for the actual 1-byte (ASCII) code.
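The lead-byte table above can be turned into a small check. This is only a sketch, not code from the thread: the function name `utf8_sequence_length` is made up for illustration, and it follows the current UTF-8 definition (RFC 3629), which stops at 4 bytes, so the 5- and 6-byte forms from the older table are rejected.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the length of a UTF-8 sequence, judged only by its first byte.

    Hypothetical helper for illustration; it does not reject every invalid
    lead byte (for example 0xC0 and 0xC1 are accepted here).
    """
    if first_byte < 0x80:        # 0xxxxxxx: one-byte (ASCII) code
        return 1
    if first_byte < 0xC0:        # 10xxxxxx: continuation byte, never a start
        raise ValueError("continuation byte cannot start a sequence")
    if first_byte < 0xE0:        # 110xxxxx
        return 2
    if first_byte < 0xF0:        # 1110xxxx
        return 3
    if first_byte < 0xF8:        # 11110xxx
        return 4
    raise ValueError("5- and 6-byte forms are not valid in modern UTF-8")


# 'a', Greek alpha, the euro sign, and one character outside the BMP
for ch in ("a", "\u03b1", "\u20ac", "\U0001F600"):
    encoded = ch.encode("utf-8")
    # The first byte alone predicts how long the whole sequence is.
    assert utf8_sequence_length(encoded[0]) == len(encoded)
    print(ascii(ch), [format(b, "08b") for b in encoded])
```

The printed bit patterns make the flag bits visible: every byte after the first starts with `10`.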
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Sunday, 9 June 2013 3:31:44 PM UTC+3, Steven D'Aprano wrote:

> py> c = 'α'
> py> ord(c)
> 945

The number 945 is the character 'α's ordinal value in the Unicode charset, correct?

The command in the Python interactive session to show me how many bytes this character will take upon encoding to utf-8 is:

>>> s = 'α'
>>> s.encode('utf-8')
b'\xce\xb1'

I see that the encoding of this char takes 2 bytes. But why two exactly? How do I calculate how many bits are needed to store this char into bytes?

Trying to do the same here, but it gave me no bytes back:

>>> s = 'a'
>>> s.encode('utf-8')
b'a'

> py> c.encode('utf-8')
> b'\xce\xb1'

2 bytes here. Why 2?

> py> c.encode('utf-16be')
> b'\x03\xb1'

2 bytes here also, but why 3 different bytes? The ordinal value of char 'a' is the same in Unicode. The encoding system just takes the ordinal value and encodes it, but since it uses 2 bytes, should these 2 bytes be the same?

> py> c.encode('utf-32be')
> b'\x00\x00\x03\xb1'

Every char here takes exactly 4 bytes to be stored. Okay.

> py> c.encode('iso-8859-7')
> b'\xe1'

And also, does '\x' mean that the value is being represented in hex? And when I bin(6) I see '0b110'; I should expect to see 8 bits of 1s and 0s. What is the 'b' trying to say?
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On 06/09/2013 03:37 AM, Νικόλαος Κούρας wrote:

> I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.

NO!! 0 - 127, yes. 128 - 255 - one byte of a multibyte code. That's why the decode fails: it sees it as incomplete data, so it can't do anything with it.

> A surrogate pair is like hitting, for example, Ctrl-A, which means it is a combination character that consists of 2 different characters? Is this what a surrogate is? A pair of 2 chars?

You're confusing character encodings with the way NON-CHARACTER keys on the KEYBOARD are encoded (function keys, arrow keys and such). These are NOT text characters but KEYBOARD key codes. These are NOT text codes and are entirely different and not related to any character encoding. How programs interpret and use these codes depends entirely on the individual programs. There are common conventions on how many are used, but there are no standards.

Also, the control codes are the first 32 values of the ASCII (and ASCII-compatible) character set and are not multi-character key codes like the keyboard non-character keys. However, there are a few keyboard keys that actually produce control codes. A few examples:

Return/Enter - Ctrl-M
Tab - Ctrl-I
Backspace - Ctrl-H

> So character 'A' - 65 (in decimal, as used in the charset's table) - 01000001 (as binary stored on disk) - 0x41 (as hex, when we open the file with a hex editor)

You are trying to put too much meaning into this. The value stored on disk, in memory, or wherever is binary bits, nothing else. How you describe the value, in decimal, in octal, in hex, in base-12, or... is totally irrelevant. These are simply different ways of describing or naming these numeric values. It's the same as saying 3 in English is three, in Spanish is tres, in German is drei... (I don't know Greek, sorry.) No matter what you call it, it is still the numeric integer value that is between 2 and 4.
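To make the "same number, different names" point concrete, here is a short sketch (an illustration, not part of the original message) showing that decimal, binary, octal and hex literals all denote the same integer, and that the control codes mentioned above are just small ASCII values.

```python
n = 65

# Four spellings of one and the same integer.
assert 0b01000001 == 0o101 == 0x41 == n
assert chr(n) == "A"
print(n, bin(n), oct(n), hex(n))

# Ctrl-<letter> traditionally yields the letter's code with the upper bits
# cleared, which is why Enter, Tab and Backspace map to Ctrl-M, Ctrl-I, Ctrl-H.
def control_code(letter: str) -> int:
    return ord(letter) & 0x1F

assert control_code("M") == ord("\r") == 13   # Return/Enter
assert control_code("I") == ord("\t") == 9    # Tab
assert control_code("H") == ord("\b") == 8    # Backspace
```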
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Monday, 10 June 2013 10:51:34 AM UTC+3, Larry Hudson wrote:

>> I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.
>
> 0 - 127, yes.
> 128 - 255 - one byte of a multibyte code.

You mean that in utf-8, for 1 character to be stored, we need 2 bytes? I am still having trouble understanding this.

Since 2^8 = 256, utf-8 would need 1 byte to store the 1st 256 characters, but instead it is using 1 byte up to the first 127 values and then 2 bytes for anything above. Why?
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On 10.06.2013 09:10, nagia.rets...@gmail.com wrote:

> On Sunday, 9 June 2013 3:31:44 PM UTC+3, Steven D'Aprano wrote:
>> py> c = 'α'
>> py> ord(c)
>> 945
>
> The number 945 is the character 'α's ordinal value in the Unicode charset, correct?

Yes, the Unicode character set is just a big list of characters. The character at position 945 in that list (counting from 0) happens to be 'α'.

> The command in the Python interactive session to show me how many bytes this character will take upon encoding to utf-8 is:
>
> >>> s = 'α'
> >>> s.encode('utf-8')
> b'\xce\xb1'
>
> I see that the encoding of this char takes 2 bytes. But why two exactly?

That's how the encoding is designed. Haven't you read the Wikipedia article which was already mentioned several times?

> How do I calculate how many bits are needed to store this char into bytes?

You need to understand how UTF-8 works. Read the Wikipedia article.

> Trying to do the same here, but it gave me no bytes back:
>
> >>> s = 'a'
> >>> s.encode('utf-8')
> b'a'

The encode method returns a bytes object. Its length will tell you how many bytes there are:

>>> len(b'a')
1
>>> len(b'\xce\xb1')
2

The Python interpreter will represent all values below 128 as ASCII characters if they are printable:

>>> ord(b'a')
97
>>> hex(97)
'0x61'
>>> b'\x61' == b'a'
True

The Python designers have decided to use b'a' instead of b'\x61'.

> py> c.encode('utf-8')
> b'\xce\xb1'
>
> 2 bytes here. Why 2?

Same as your first question.

> py> c.encode('utf-16be')
> b'\x03\xb1'
>
> 2 bytes here also, but why 3 different bytes? The ordinal value of char 'a' is the same in Unicode. The encoding system just takes the ordinal value and encodes it, but since it uses 2 bytes, should these 2 bytes be the same?

'utf-16be' is a different encoding scheme, thus it uses other rules to determine how each character is translated into a byte sequence.

> py> c.encode('iso-8859-7')
> b'\xe1'
>
> And also, does '\x' mean that the value is being represented in hex? And when I bin(6) I see '0b110'; I should expect to see 8 bits of 1s and 0s. What is the 'b' trying to say?

'\x' is an escape sequence and means that the following two characters should be interpreted as a number in hexadecimal notation (see also the table of allowed escape sequences: http://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals ).

'0b' tells you that the number is printed in binary notation. Leading zeros are usually discarded when a number is printed:

>>> bin(70)
'0b1000110'
>>> 0b100110 == 0b00100110
True

It's the same with decimal notation. You wouldn't say 00123 is different from 123, would you?

Bye, Andreas
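As a small illustration of the points above (a sketch, not part of the original reply): `len()` counts characters on a `str` and bytes on a `bytes` object, and `format()` can be asked to keep the leading zeros that `bin()` drops.

```python
c = "\u03b1"                 # the Greek letter alpha
data = c.encode("utf-8")

# One character, but two bytes once encoded as UTF-8.
assert len(c) == 1
assert len(data) == 2

# b'a' and b'\x61' are two spellings of the same one-byte object.
assert b"\x61" == b"a"

# bin() drops leading zeros; format() with a width keeps them.
assert bin(6) == "0b110"
assert format(6, "08b") == "00000110"

# Each byte of the encoded character, shown as a full 8-bit pattern.
bits = [format(byte, "08b") for byte in data]
print(bits)
```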
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Monday, 10 June 2013 11:15:38 AM UTC+3, Andreas Perstinger wrote:

What is the difference between len('nikos') and len(b'nikos')? The first being the length of the string nikos in characters, while the second is the length of a ???

> The Python interpreter will represent all values below 128 as ASCII characters if they are printable:
>
> >>> ord(b'a')
> 97
> >>> hex(97)
> '0x61'
> >>> b'\x61' == b'a'
> True
>
> The Python designers have decided to use b'a' instead of b'\x61'.

b'a' and b'\x61' are the bytestrings of char 'a' after utf-8 encoding?

This ord(b'a') should give an error in my opinion: ord('a') should return the ordinal value of char 'a', not ord(b'a').
Re: Changing filenames from Greeklish = Greek (subprocess complain)
> >>> s = 'α'
> >>> s.encode('utf-8')
> b'\xce\xb1'

'b' stands for binary, right? b'\xce\xb1' = we are looking at a byte in a hexadecimal format? If yes, how could we see it in binary and decimal representation?

> I see that the encoding of this char takes 2 bytes. But why two exactly? How do I calculate how many bits are needed to store this char into bytes?

Because utf-8 takes 1 to 4 bytes to encode characters.

Since 2^8 = 256, utf-8 should store the first 256 chars of the Unicode charset using 1 byte. Also, since 2^16 = 65536, utf-8 should store the first 65536 chars of the Unicode charset using 2 bytes, and so on. But I know that this is not the case. But I don't understand why.

> >>> s = 'a'
> >>> s.encode('utf-8')
> b'a'

utf-8 takes ASCII as it is, as 1 byte. They are the same.

EBCDIC and ASCII and Unicode are character sets, correct? iso-8859-1, iso-8859-7, utf-8, utf-16, utf-32 and so on are encoding methods, right?
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On 10.06.2013 11:59, Νικόλαος Κούρας wrote:

> >>> s = 'α'
> >>> s.encode('utf-8')
> b'\xce\xb1'
>
> 'b' stands for binary, right?

No, here it stands for bytes: http://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals

> b'\xce\xb1' = we are looking at a byte in a hexadecimal format?

No, b'\xce\xb1' represents a bytes object containing 2 bytes. Yes, each byte is represented in hexadecimal format.

> If yes, how could we see it in binary and decimal representation?

>>> s = b'\xce\xb1'
>>> s[0]
206
>>> bin(s[0])
'0b11001110'
>>> s[1]
177
>>> bin(s[1])
'0b10110001'

A bytes object is a sequence of bytes (= integer values) and supports indexing. http://docs.python.org/3/library/stdtypes.html#bytes

> Since 2^8 = 256, utf-8 should store the first 256 chars of the Unicode charset using 1 byte. Also, since 2^16 = 65536, utf-8 should store the first 65536 chars of the Unicode charset using 2 bytes, and so on. But I know that this is not the case. But I don't understand why.

Because your method doesn't work. If you use all possible 256 bit-combinations to represent a valid character, how do you decide where to stop in a sequence of bytes?

> >>> s = 'a'
> >>> s.encode('utf-8')
> b'a'
>
> utf-8 takes ASCII as it is, as 1 byte. They are the same.
>
> EBCDIC and ASCII and Unicode are character sets, correct? iso-8859-1, iso-8859-7, utf-8, utf-16, utf-32 and so on are encoding methods, right?

Look at http://www.unicode.org/glossary/ for an explanation of all the terms.

Bye, Andreas
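The question "why exactly two bytes for 'α'?" can be worked through by hand. The sketch below is an illustration only; `encode_two_byte` is a hypothetical helper covering just the two-byte range of UTF-8 (code points 0x80 to 0x7FF). Code point 945 needs 10 bits, which is more than the 7 payload bits of a one-byte code but fits in the 11 payload bits of the `110xxxxx 10xxxxxx` form.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in the range 0x80..0x7FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point is outside the two-byte UTF-8 range")
    lead = 0b11000000 | (code_point >> 6)          # 110 + top 5 payload bits
    cont = 0b10000000 | (code_point & 0b00111111)  # 10  + low 6 payload bits
    return bytes([lead, cont])


alpha = ord("\u03b1")                # 945
assert alpha.bit_length() == 10      # too big for the 7 bits of a 1-byte code
assert encode_two_byte(alpha) == "\u03b1".encode("utf-8") == b"\xce\xb1"

# The flag bits are what the plain "2**8 characters per byte" scheme lacks:
# a decoder can tell a lead byte (11...) from a continuation byte (10...).
print(2 ** 8, 2 ** 16)
```

Because bytes 0x80 to 0xFF are reserved for multibyte sequences, a decoder always knows whether it is at the start or in the middle of a character.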
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Mon, 10 Jun 2013 00:10:38 -0700, nagia.retsina wrote:

> On Sunday, 9 June 2013 3:31:44 PM UTC+3, Steven D'Aprano wrote:
>> py> c = 'α'
>> py> ord(c)
>> 945
>
> The number 945 is the character 'α's ordinal value in the Unicode charset, correct?

Correct.

> The command in the Python interactive session to show me how many bytes this character will take upon encoding to utf-8 is:
>
> >>> s = 'α'
> >>> s.encode('utf-8')
> b'\xce\xb1'
>
> I see that the encoding of this char takes 2 bytes. But why two exactly?

Because that's how UTF-8 works. If it was a different encoding, it might be 4 bytes, or 2, or 1, or 101, or 7, or 3. But it is UTF-8, so it takes 2 bytes. If you want to understand how UTF-8 works, look it up on Wikipedia.

> How do I calculate how many bits are needed to store this char into bytes?

Every byte is made of 8 bits. There are two bytes. So multiply 8 by 2.

> Trying to do the same here, but it gave me no bytes back:
>
> >>> s = 'a'
> >>> s.encode('utf-8')
> b'a'

There is a byte there. The byte is printed by Python as b'a', which in my opinion is a design mistake. That makes it look like a string, but it is not a string, and would be better printed as b'\x61'. But regardless of the display, it is still a single byte.

>> py> c.encode('utf-8')
>> b'\xce\xb1'
>
> 2 bytes here. Why 2?

Because that's how UTF-8 works.

>> py> c.encode('utf-16be')
>> b'\x03\xb1'
>
> 2 bytes here also, but why 3 different bytes?

Because it is a different encoding.

> The ordinal value of char 'a' is the same in Unicode.

The same as what?

> The encoding system just takes the ordinal value and encodes it, but since it uses 2 bytes, should these 2 bytes be the same?

No. That's like saying that since a dog in Germany has four legs and one head, and a dog in France has four legs and one head, dog should be spelled Hund in both Germany and France. Different encodings are like different languages. They spell the same word differently.

>> py> c.encode('utf-32be')
>> b'\x00\x00\x03\xb1'
>
> Every char here takes exactly 4 bytes to be stored. Okay.
>
>> py> c.encode('iso-8859-7')
>> b'\xe1'
>
> And also, does '\x' mean that the value is being represented in hex? And when I bin(6) I see '0b110'; I should expect to see 8 bits of 1s and 0s. What is the 'b' trying to say?

b for Binary. Just like 0o1234 uses octal, o for Octal. And 0x123EF uses hexadecimal, x for heXadecimal.

-- Steven
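The "different languages spell the same word differently" analogy can be checked directly. This sketch (an illustration, not from the original message) encodes the same character with the four encodings used in the thread and confirms that each one round-trips back to the same character.

```python
c = "\u03b1"   # Greek small letter alpha, code point 945

expected = {
    "utf-8": b"\xce\xb1",
    "utf-16be": b"\x03\xb1",
    "utf-32be": b"\x00\x00\x03\xb1",
    "iso-8859-7": b"\xe1",
}

for encoding, raw in expected.items():
    data = c.encode(encoding)
    assert data == raw                 # same character, different bytes
    assert data.decode(encoding) == c  # each encoding reads its own bytes back
    print(encoding, data, len(data) * 8, "bits")

# The code point itself never changes; only its byte spelling does.
assert ord(c) == 945
```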
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Monday, 10 June 2013 2:59:03 PM UTC+3, Steven D'Aprano wrote:

> On Mon, 10 Jun 2013 00:10:38 -0700, nagia.retsina wrote:
>> On Sunday, 9 June 2013 3:31:44 PM UTC+3, Steven D'Aprano wrote:
>>> py> c = 'α'
>>> py> ord(c)
>>> 945
>>
>> The number 945 is the character 'α's ordinal value in the Unicode charset, correct?
>
> Correct.
>
>> The command in the Python interactive session to show me how many bytes this character will take upon encoding to utf-8 is:
>>
>> >>> s = 'α'
>> >>> s.encode('utf-8')
>> b'\xce\xb1'
>>
>> I see that the encoding of this char takes 2 bytes. But why two exactly?
>
> Because that's how UTF-8 works. If it was a different encoding, it might be 4 bytes, or 2, or 1, or 101, or 7, or 3. But it is UTF-8, so it takes 2 bytes. If you want to understand how UTF-8 works, look it up on Wikipedia.
>
>> How do I calculate how many bits are needed to store this char into bytes?
>
> Every byte is made of 8 bits. There are two bytes. So multiply 8 by 2.
>
>> Trying to do the same here, but it gave me no bytes back:
>>
>> >>> s = 'a'
>> >>> s.encode('utf-8')
>> b'a'
>
> There is a byte there. The byte is printed by Python as b'a', which in my opinion is a design mistake. That makes it look like a string, but it is not a string, and would be better printed as b'\x61'. But regardless of the display, it is still a single byte.

Perhaps, up to the 127 ASCII chars, Python thinks it is better for a human to read the character representation of the stored byte instead of the hex. Just a guess.

> Just like 0o1234 uses octal, o for Octal. And 0x123EF uses hexadecimal, x for heXadecimal.

Why the leading zero before octal's 'o' and hex's 'x' and binary's 'b'?

I am not going to tire you any more, because I have exhausted myself for two days now trying to get my head around this. Please confirm I have understood correctly. I did, but the docs confuse me even more. Can you please put it simply?

Unicode, as I understand it, was created out of the need for a bigger character set compared to ASCII, which could hold up to 128 chars (and extended versions of it up to 256), one that would be able to hold all the world's symbols.

ASCII and Unicode are character sets. Everything else seems to be an encoding system that works upon those character sets.

If what I said is true, the last thing that still confuses me is that iso-8859-7 (256 chars) seems like a character set and an encoding method too. Can it be both, or is iso-8859-7 an encoding method of the Unicode character set, similar to how UTF-8 is also an encoding method of Unicode?
Re: Changing filenames from Greeklish = Greek (subprocess complain)
- A coding scheme works with three sets. A *unique* set of CHARACTERS, a *unique* set of CODE POINTS and a *unique* set of ENCODED CODE POINTS, unicode or not. The relation between the set of characters and the set of the code points is a *human* table, created with a sheet of paper and a pencil, a deliberate choice of characters with integers as labels. The relation between the set of the code points and the set of encoded code points is a mathematical operation. In the case of an 8bits coding scheme, like iso-XXX, this operation is a no-op, the relation is an identity. Shortly: set of code points == set of encoded code points. In the case of unicode, The Unicode consortium endorses three such mathematical operations called UTF-8, UTF-16 and UTF-32 where UTF means Unicode Transformation Format, a confusing wording meaning at the same time, the process and the result of the process. This Unicode Transformation does not produce bytes, it produces words/chunks/tokens of *bits* with lengths 8, 16, 32, called Unicode Transformation Units (from this the names UTF-8, -16, -32). At this level, only a structure has been defined (there is no computing). Very important, an healthy coding scheme works conceptually only with this *unique set of encoded code points, not with bytes, characters or code points. The last step, the machine implementation: it is up to the processor, the compiler, the language to implement all these Unicode Transformation Units with of course their related specifities: char, w_char, int, long, endianess, rune (Go language), ... Not too over-simplified or not too over-complicated and enough to understand one, if not THE, design mistake of the flexible string representation. jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Monday, June 10, 2013 3:48:08 PM UTC-4, jmfauth wrote:

> - A coding scheme works with three sets. A *unique* set of CHARACTERS, a *unique* set of CODE POINTS and a *unique* set of ENCODED CODE POINTS, unicode or not.
>
> The relation between the set of characters and the set of the code points is a *human* table, created with a sheet of paper and a pencil, a deliberate choice of characters with integers as labels.
>
> The relation between the set of the code points and the set of encoded code points is a mathematical operation. In the case of an 8bits coding scheme, like iso-XXX, this operation is a no-op, the relation is an identity. Shortly: set of code points == set of encoded code points.
>
> In the case of unicode, The Unicode consortium endorses three such mathematical operations called UTF-8, UTF-16 and UTF-32 where UTF means Unicode Transformation Format, a confusing wording meaning at the same time, the process and the result of the process. This Unicode Transformation does not produce bytes, it produces words/chunks/tokens of *bits* with lengths 8, 16, 32, called Unicode Transformation Units (from this the names UTF-8, -16, -32). At this level, only a structure has been defined (there is no computing).

This is a really good description of the issues involved with character sets and encodings, thanks.

> Very important, an healthy coding scheme works conceptually only with this *unique set of encoded code points, not with bytes, characters or code points.

You don't explain why it is important to work with encoded code points. What's wrong with working with code points?

> The last step, the machine implementation: it is up to the processor, the compiler, the language to implement all these Unicode Transformation Units with of course their related specifities: char, w_char, int, long, endianess, rune (Go language), ...
>
> Not too over-simplified or not too over-complicated and enough to understand one, if not THE, design mistake of the flexible string representation.
>
> jmf

Again you've made the claim that the flexible string representation is a mistake. But you haven't said WHY. I can't tell if you are trolling us, or are deluded, or genuinely don't understand what you are talking about. Some day you might explain yourself. I look forward to it.

--Ned.
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Sun, 09 Jun 2013 07:46:40 +0300, Νικόλαος Κούρας wrote:

> Why does every character in a character set need to be associated with a numeric value?

Because computers are digital, not analog, and because bytes are numbers. Here are a few of the 256 possible bytes, written in binary, decimal and hexadecimal:

0b00000000    0  0x00
0b00000001    1  0x01
0b00000010    2  0x02
[...]
0b01111111  127  0x7F
0b10000000  128  0x80
[...]
0b11111110  254  0xFE
0b11111111  255  0xFF

EVERYTHING in computers is numbers, because everything is stored as bytes. Text is stored as bytes. Sound files are stored as bytes. Images are stored as bytes. Programs are stored as bytes. So everything is being stored as numbers. But the *meaning* we give to those numbers depends on what we do with them, whether we treat them as characters, bitmapped images, floating point values, or something else.

Once we decide we want to store the character A as bytes, we need to decide which number it should be. That is the job of the charset. ASCII:

65 -- 'A'
66 -- 'B'
67 -- 'C'

etc.

> I mean, couldn't we just have character sets that wouldn't have numeric associations, like:
>
> 'A' = encoding process (i.e. utf-8) = bytes
> bytes = decoding process (i.e. utf-8) = character 'A'

No. How would you store it in a computer's memory, or on a hard drive? By carving a tiny, microscopic A onto the hard drive? How would you read it back?

It is theoretically possible to build an analog computer, out of clockwork, or water flowing through pipes, or something, but nobody really bothers because it is much harder and not very useful.

> An ordinal = ordered numbers like 7, 8, 9, 10 and so on?

Yes.

> Since 1 byte can hold up to 256 chars, why does utf-8 not use 1 byte for values up to 256?

Because then how do you tell when you need one byte, and when you need two? If you read two bytes, and see 0x4C 0xFA, does that mean two characters, with ordinal values 0x4C and 0xFA, or one character with ordinal value 0x4CFA?

UTF-8 solves this problem by reserving some values to mean "this byte, on its own", and others to mean "this byte, plus the next byte, together", and so forth, up to four bytes. If you look up UTF-8 on Wikipedia, you will see more about this.

> UTF-8 and UTF-16 and UTF-32: I thought the number beside UTF- was to declare how many bits the character set was using to store a character on the hdd, no?

Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values. UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit values to make a surrogate pair. UTF-8 uses 8-bit values, but sometimes it combines two, three or four of them to represent a single code point.

>> Narrow Unicode uses two bytes per character. Since two bytes is only enough for about 65,000 characters, not 1,000,000+, the rest of the characters are stored as pairs of two-byte surrogates.
>
> Can you please explain this line, "the rest of the characters are stored as pairs of two-byte surrogates", more simply so that I can understand it? I'm still having trouble understanding what a surrogate is.

Look up UTF-16 and surrogate pair on Wikipedia. But basically, there are 65000+ different possible 16-bit values available for UTF-16 to use. Some of those values are reserved to mean "this value is not a character, it is half of a surrogate pair". Since they are *pairs*, they must always come in twos. A surrogate pair makes up a valid character. Half of a surrogate pair, on its own, is an error.

A lot of this complexity is because of historical reasons. For example, when Unicode was first invented, there was only room for about 65 thousand characters, and a fixed 16 bits was all you needed. But it was soon learned that 65 thousand was not enough (there are more than 65,000 Asian characters alone!) and so UTF-16 developed the trick with surrogate pairs to cover the extras.

[...]

> When the locale of a Linux system is set to utf-8, that would mean that Linux applications should try to encode strings onto the hdd using the system's default encoding, utf-8, and read them back from bytes also using utf-8. Is that correct?

Yes.

-- Steven
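The surrogate-pair trick described above is plain arithmetic. The sketch below is an illustration (the function name `to_surrogate_pair` is made up): it splits a code point above U+FFFF into its two 16-bit halves and checks the result against Python's own UTF-16 encoder.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000           # 20 bits remain to be stored
    high = 0xD800 + (offset >> 10)          # top 10 bits  -> 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)         # low 10 bits  -> 0xDC00..0xDFFF
    return high, low


code_point = 0x1F600                        # a character outside the 16-bit range
high, low = to_surrogate_pair(code_point)
assert (high, low) == (0xD83D, 0xDE00)

# The two 16-bit units, written big-endian, match Python's UTF-16 encoder.
manual = high.to_bytes(2, "big") + low.to_bytes(2, "big")
assert manual == chr(code_point).encode("utf-16-be")

# A character inside the 16-bit range is a single unit, no pair needed.
assert len("\u03b1".encode("utf-16-be")) == 2
```

The reserved ranges 0xD800 to 0xDFFF are exactly the "not a character, half of a pair" values mentioned in the message.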
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Sat, 08 Jun 2013 22:09:57 -0700, nagia.retsina wrote:

> ord('A') would give me the mapping of this char, the number 65, while chr(65) would output the char 'A' likewise.

Correct. Python uses Unicode, where code point 65 (ordinal value 65) means letter A.

There are older encodings. For example, a very old one, used on IBM mainframes, is EBCDIC, where ordinal value 65 means the letter â, and the letter A has ordinal value 193.

> What would happen if we try to re-encode bytes on the disk? Like trying:
>
> s = "νίκος"
> utf8_bytes = s.encode('utf-8')
> greek_bytes = utf8_bytes.encode('iso-8859-7')
>
> Can we re-encode twice or as many times as we want and then decode back respectively?

Of course. Bytes have no memory of where they came from, or what they are used for. All you are doing is flipping bits on a memory chip, or on a hard drive. So long as *you* remember which encoding is the right one, there is no problem. If you forget, and start using the wrong one, you will get garbage characters, mojibake, or errors.

[...]

> And also, is there a difference between encoding and compressing?

Of course. They are totally unrelated.

> Isn't the latter using some form of encoding to take a string or bytes and make it take less space on disk?

Correct, except forget about encoding. It's not relevant (except, maybe, in a mathematical sense) and will just confuse you.

-- Steven
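One detail worth spelling out with a sketch (not from the original message): in Python 3 a `bytes` object has no `.encode()` method, so "re-encoding" means decoding back to text with the encoding that produced the bytes, then encoding that text again. Decoding with the wrong encoding is what produces mojibake.

```python
assert ord("A") == 65
assert chr(65) == "A"

s = "\u03bd\u03af\u03ba\u03bf\u03c2"        # the Greek name from the question
utf8_bytes = s.encode("utf-8")

# Re-encode: bytes -> text (using the encoding they were written with)
# -> bytes in the new encoding.
text = utf8_bytes.decode("utf-8")
greek_bytes = text.encode("iso-8859-7")

assert len(utf8_bytes) == 10                # 2 bytes per Greek letter in UTF-8
assert len(greek_bytes) == 5                # 1 byte per letter in ISO-8859-7
assert greek_bytes.decode("iso-8859-7") == s

# Forgetting which encoding was used gives garbage, not an error, here:
mojibake = utf8_bytes.decode("iso-8859-7")
assert mojibake != s and len(mojibake) == 10
print(ascii(mojibake))
```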
Re: Changing filenames from Greeklish = Greek (subprocess complain)
Thanks Steven, I'll read them in a bit. When I have read them, can you perhaps tell me what's wrong and why I am still getting decode issues?

[CODE]
# =
# If user downloaded a file, thank the user !!!
# =
if filename:
    #update file counter if cookie does not exist
    if not nikos:
        cur.execute('''UPDATE files SET hits = hits + 1, host = %s, lastvisit = %s WHERE url = %s''', (host, lastvisit, filename) )

    print('''h2font color=blueΤο αρχείο font color=red %s font color=blueκατεβαίνει!''' % filename )
    print('''brimg src=/data/images/thanks.gif''')
    print('''brbrbrh3font color=blueΚαι τώρα Tetris μέχρι να ολοκληρωθεί :-)''' )
    print('''brobject classid=clsid:d27cdb6e-ae6d-11cf-96b8-44455354 codebase=http://fpdownload.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,0,0; width=450 height=300param name=menu value=false /param name=movie value=http://www.fugly.com/f/1e6d8cd7b905f4e1bf72; /param name=quality value=high /embed src=http://www.fugly.com/f/1e6d8cd7b905f4e1bf72; AllowScriptAccess=always menu=false quality=high width=450 height=300 name=FuglyGame align=middle type=application/x-shockwave-flash pluginspage=http://www.macromedia.com/go/getflashplayer; //object''')
    print( '''meta http-equiv=REFRESH content=2;/data/apps/%s''' % filename )
    sys.exit(0)

# =
# Display download button for each file and download it on click
# =
print('''body background='/data/images/star.jpg' centerimg src='/data/images/download.gif'brbr table border=5 cellpadding=5 bgcolor=green ''')

#
# Collect directory and its filenames as bytes
path = b'/home/nikos/public_html/data/apps/'
files = os.listdir( path )

for filename in files:
    # Compute 'path/to/filename'
    filepath_bytes = path + filename

    for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
        try:
            filepath = filepath_bytes.decode( encoding )
        except UnicodeDecodeError:
            continue

        # Rename to something valid in UTF-8
        if encoding != 'utf-8':
            os.rename( filepath_bytes, filepath.encode('utf-8') )

        assert os.path.exists( filepath )
        break
    else:
        # This only runs if we never reached the break
        raise ValueError( 'unable to clean filename %r' % filepath_bytes )

#
# Collect filenames of the path dir as strings
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for filename in filenames:
    try:
        # Check the presence of a file against the database and insert if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
        data = cur.fetchone()

        if not data:
            # First time for file; primary key is automatic, hit is defaulted
            print( 'iam here', filename + '\n' )
            cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
    except pymysql.ProgrammingError as e:
        print( repr(e) )

#
# Collect filenames of the path dir as strings
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )
filepaths = set()

# Build a set of 'path/to/filename' based on the objects of path dir
for filename in filenames:
    filepaths.add( filename )

# Delete spurious
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()

# Check database's filenames against path's filenames
for rec in data:
    if rec not in filepaths:
        cur.execute('''DELETE FROM files WHERE url = %s''', rec )
[/CODE]

When trying to run it, it is still erroring out:

[CODE]
[Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173] Original exception was:, referer: http://superhost.gr/
[Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173] Traceback (most recent call last):, referer: http://superhost.gr/
[Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173] File "/home/nikos/public_html/cgi-bin/files.py", line 83, in <module>, referer: http://superhost.gr/
[Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173] assert os.path.exists(
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On 09Jun2013 06:25, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote:
| [... heaps of useful explanation ...]
| When locale to linux system is set to utf-8 that would mean that the
| linux applications, should try to encode string into hdd by using
| system's default encoding to utf-8 and read them back from bytes by
| also using utf-8. Is that correct?
|
| Yes.

Although I'd point out that only applications that care about text as _text_ need to consider Unicode and the encoding. A command like mv does not care. You type the command and mv receives byte strings as its arguments, so it is doing straightforward byte-level file renames. It does not care or even know about encodings.

In this scenario, really it is the terminal program (e.g. PuTTY) which cares about text (what you type, and what gets displayed). It is because of mismatches between your terminal's locale settings and the encoding that was chosen for the filenames that you get garbage listings, one way or another.

Cheers,
--
Cameron Simpson c...@zip.com.au

But then, I'm only 50. Things may well get a bit much for me when I reach the gasping heights of senile decrepitude of which old Andy Woodward speaks with such feeling.
  - Chris Malcolm, c...@uk.ac.ed.aifh, DoD #205
Re: Changing filenames from Greeklish = Greek (subprocess complain)
I'm sorry, I posted unnecessary code by mistake. Here is the correct one that produced the above error:

    # Collect directory and its filenames as bytes
    path = b'/home/nikos/public_html/data/apps/'
    files = os.listdir( path )

    for filename in files:
        # Compute 'path/to/filename'
        filepath_bytes = path + filename
        for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
            try:
                filepath = filepath_bytes.decode( encoding )
            except UnicodeDecodeError:
                continue

            # Rename to something valid in UTF-8
            if encoding != 'utf-8':
                os.rename( filepath_bytes, filepath.encode('utf-8') )
            assert os.path.exists( filepath )
            break
        else:
            # This only runs if we never reached the break
            raise ValueError( 'unable to clean filename %r' % filepath_bytes )

    # Collect filenames of the path dir as strings
    filenames = os.listdir( '/home/nikos/public_html/data/apps/' )

    # Load'em
    for filename in filenames:
        try:
            # Check the presence of a file against the database and insert if it doesn't exist
            cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
            data = cur.fetchone()

            if not data:
                # First time for file; primary key is automatic, hit is defaulted
                print( "iam here", filename + '\n' )
                cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
        except pymysql.ProgrammingError as e:
            print( repr(e) )

    # Collect filenames of the path dir as strings
    filenames = os.listdir( '/home/nikos/public_html/data/apps/' )
    filepaths = set()

    # Build a set of 'path/to/filename' based on the objects of path dir
    for filename in filenames:
        filepaths.add( filename )

    # Delete spurious
    cur.execute('''SELECT url FROM files''')
    data = cur.fetchall()

    # Check database's filenames against path's filenames
    for rec in data:
        if rec not in filepaths:
            cur.execute('''DELETE FROM files WHERE url = %s''', rec )
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Sun, 09 Jun 2013 00:00:53 -0700, nagia.retsina wrote:

> path = b'/home/nikos/public_html/data/apps/'
> files = os.listdir( path )
>
> for filename in files:
>     # Compute 'path/to/filename'
>     filepath_bytes = path + filename
>     for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
>         try:
>             filepath = filepath_bytes.decode( encoding )
>         except UnicodeDecodeError:
>             continue
>
>         # Rename to something valid in UTF-8
>         if encoding != 'utf-8':
>             os.rename( filepath_bytes, filepath.encode('utf-8') )
>         assert os.path.exists( filepath )
>         break
>     else:
>         # This only runs if we never reached the break
>         raise ValueError( 'unable to clean filename %r' % filepath_bytes )

Editing the traceback to get rid of unnecessary noise from the logging:

Traceback (most recent call last):
  File "/home/nikos/public_html/cgi-bin/files.py", line 83, in <module>
    assert os.path.exists( filepath )
  File "/usr/local/lib/python3.3/genericpath.py", line 18, in exists
    os.stat(path)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 34-37: ordinal not in range(128)

> Why am I still receiving Unicode decode errors? With the help of you guys
> we have written a procedure just to avoid this kind of decoding issue
> and rename all greek_byted_filenames to utf-8_byted.

That's a very good question. It works for me when I test it, so I cannot explain why it fails for you.

Please try this: log into the Linux server, and then start up a Python interactive session by entering:

    python3.3

at the $ prompt. Then, at the >>> prompt, enter these lines of code. You can copy and paste them:

    import os, sys
    print(sys.version)
    s = ('\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER BETA}'
         '\N{GREEK SMALL LETTER GAMMA}\N{GREEK SMALL LETTER DELTA}'
         '\N{GREEK SMALL LETTER EPSILON}')
    print(s)
    filename = '/tmp/' + s
    open(filename, 'w')
    os.path.exists(filename)

Copy and paste the results back here please.

> Is it the assert that fails? Do we have some logic error someplace I don't see?

Please read the error message. Does it say AssertionError? If it says AssertionError, then the assert has failed. If it says something else, the code failed before the assert could run.

--
Steven
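One possible reading of the traceback above is that the str path was being encoded with an ASCII filesystem encoding inside os.stat(). The following is only a diagnostic sketch under that assumption, not a confirmed diagnosis: it prints the encoding Python uses for filenames and shows that a bytes path needs no implicit encoding step. The temporary directory and the variable names are invented for the illustration.

```python
import os
import sys
import tempfile

# The codec Python applies when a str path is handed to the OS.
# If this prints 'ascii', str paths containing Greek letters cannot be encoded.
print(sys.getfilesystemencoding())

with tempfile.TemporaryDirectory() as tmp:
    # Build the whole path as bytes, so no implicit encoding is involved.
    name_bytes = 'αβγδε'.encode('utf-8')
    path_bytes = os.path.join(os.fsencode(tmp), name_bytes)

    # open() and os.path.exists() both accept bytes paths.
    with open(path_bytes, 'wb'):
        pass
    exists_as_bytes = os.path.exists(path_bytes)

print(exists_as_bytes)
```

If the bytes version works where the str version raises, the environment the script runs in (for example the locale seen by the web server) would be the next thing to inspect.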
Re: Changing filenames from Greeklish = Greek (subprocess complain)
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes:

> On Sat, 08 Jun 2013 22:09:57 -0700, nagia.retsina wrote:
>> chr('A') would give me the mapping of this char, the number 65 while
>> ord(65) would output the char 'A' likewise.
>
> Correct. Python uses Unicode, where code-point 65 (ordinal value 65)
> means letter A.

Actually, that's the other way around:

    >>> chr(65)
    'A'
    >>> ord('A')
    65

>> What would happen if we we try to re-encode bytes on the disk? like
>> trying:
>>
>> s = 'νίκος'
>> utf8_bytes = s.encode('utf-8')
>> greek_bytes = utf_bytes.encode('iso-8869-7')
>>
>> Can we re-encode twice or as many times we want and then decode back
>> respectively lke?
>
> Of course. Bytes have no memory of where they came from, or what they are
> used for. All you are doing is flipping bits on a memory chip, or on a
> hard drive. So long as *you* remember which encoding is the right one,
> there is no problem. If you forget, and start using the wrong one, you
> will get garbage characters, mojibake, or errors.

Uhm, no: encode transforms a Unicode string into an array of bytes, decode does the opposite transformation. You cannot do the former on an arbitrary array of bytes:

    >>> s = 'νίκος'
    >>> utf8_bytes = s.encode('utf-8')
    >>> greek_bytes = utf8_bytes.encode('iso-8869-7')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'bytes' object has no attribute 'encode'

ciao, lele.
--
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
l...@metapensiero.it  |                 -- Fortunato Depero, 1929.
Re: Changing filenames from Greeklish = Greek (subprocess complain)
Steven wrote: Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for values up to 256? Because then how do you tell when you need one byte, and when you need two? If you read two bytes, and see 0x4C 0xFA, does that mean two characters, with ordinal values 0x4C and 0xFA, or one character with ordinal value 0x4CFA? I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256. UTF-8 and UTF-16 and UTF-32 I though the number beside of UTF- was to declare how many bits the character set was using to store a character into the hdd, no? Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values. UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit values to make a surrogate pair. A surrogate pair is like itting for example Ctrl-A, which means is a combination character that consists of 2 different characters? Is this what a surrogate is? a pari of 2 chars? UTF-8 uses 8-bit values, but sometimes it combines two, three or four of them to represent a single code-point. 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65) 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is 127 ) 'a chinese ideogramm' to be utf8 encoded needs 4 byte to be stored ? (since ordinal 65000 ) The amount of bytes needed to store a character solely depends on the character's ordinal value in the Unicode table? UTF-8 solves this problem by reserving some values to mean this byte, on its own, and others to mean this byte, plus the next byte, together, and so forth, up to four bytes. Some of the utf-8 bits that are used to represent a character's ordinal value are actually been also used to seperate or join the ordinal values themselves? Can you give an example please? How there are beign seperated? Computers are digital and work with numbers. 
So character 'A' - 65 (in decimal uses in charset's table) - 01011100 (as binary stored in disk) - 0xEF (as hex, when we open the file with a hex editor) Is this how the thing works? (above values are fictional) -- http://mail.python.org/mailman/listinfo/python-list
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Sunday, 9 June 2013 11:02:48 AM UTC+3, Cameron Simpson wrote:

> In this scenario, really it is the Terminal program (eg Putty) which
> cares about text (what you type, and what gets displayed). It is because
> of mismatches between your Terminal local settings and the encoding that
> was chosen for the filenames that you get garbage listings, one way or
> another.

Can you please give an example that shows a string being Greek-ISO encoded and then UTF-8 decoded and presented back:

1. properly
2. as garbage (I know it means trash, but I don't know what a garbage char is)
3. as an error
Re: Changing filenames from Greeklish = Greek (subprocess complain)
Τη Κυριακή, 9 Ιουνίου 2013 11:55:43 π.μ. UTC+3, ο χρήστης Lele Gaifax έγραψε: Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes: On Sat, 08 Jun 2013 22:09:57 -0700, nagia.retsina wrote: chr('A') would give me the mapping of this char, the number 65 while ord(65) would output the char 'A' likewise. Correct. Python uses Unicode, where code-point 65 (ordinal value 65) means letter A. Actually, that's the other way around: chr(65) 'A' ord('A') 65 What would happen if we we try to re-encode bytes on the disk? like trying: s = νίκος utf8_bytes = s.encode('utf-8') greek_bytes = utf_bytes.encode('iso-8869-7') Can we re-encode twice or as many times we want and then decode back respectively lke? Of course. Bytes have no memory of where they came from, or what they are used for. All you are doing is flipping bits on a memory chip, or on a hard drive. So long as *you* remember which encoding is the right one, there is no problem. If you forget, and start using the wrong one, you will get garbage characters, mojibake, or errors. Uhm, no: encode transforms a Unicode string into an array of bytes, decode does the opposite transformation. You cannot do the former on an arbitrary array of bytes: s = νίκος utf8_bytes = s.encode('utf-8') greek_bytes = utf8_bytes.encode('iso-8869-7') Traceback (most recent call last): File stdin, line 1, in module AttributeError: 'bytes' object has no attribute 'encode' So something encoded into bytes cannot be re-encoded to some other bytes. How about a string i wonder? s = νίκος what_are these_bytes = s.encode('iso-8869-7').encode(utf-8') -- http://mail.python.org/mailman/listinfo/python-list
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On 09Jun2013 02:00, =?utf-8?B?zp3Or866zr/PgiDOk866z4EzM866?= nikos.gr...@gmail.com wrote: | Steven wrote: | Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for | values up to 256? | | Because then how do you tell when you need one byte, and when you need | two? If you read two bytes, and see 0x4C 0xFA, does that mean two | characters, with ordinal values 0x4C and 0xFA, or one character with | ordinal value 0x4CFA? | | I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256. Then it would not be UTF-8. UTF-8 will encode an Unicode codepoint. Your suggestion will not. I'd point out that if you did this, you'd be back in the same situation you just encountered with ASCII: the first above-255 value would raise a UnicodeEncodeError (an error which does not even exist at present:-) | UTF-8 and UTF-16 and UTF-32 | I though the number beside of UTF- was to declare how many bits the | character set was using to store a character into the hdd, no? | | Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values. | UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit | values to make a surrogate pair. | | A surrogate pair is like itting for example Ctrl-A, which means is a combination character that consists of 2 different characters? | Is this what a surrogate is? a pari of 2 chars? Essentially. The combination represents a code point. | UTF-8 uses 8-bit values, but sometimes | it combines two, three or four of them to represent a single code-point. | | 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65) | 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is 127 ) | 'a chinese ideogramm' to be utf8 encoded needs 4 byte to be stored ? (since ordinal 65000 ) | | The amount of bytes needed to store a character solely depends on the character's ordinal value in the Unicode table? Essentially. 
You can read up on the exact process in Wikipedia or the Unicode Standard. Cheers, -- Cameron Simpson c...@zip.com.au The most annoying thing about being without my files after our disc crash was discovering once again how widespread BLINK was on the web. -- http://mail.python.org/mailman/listinfo/python-list
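The "essentially" above can be made concrete. This is a small sketch (the function name utf8_length is made up for the illustration) of how the number of UTF-8 bytes follows from the code point's numeric range, checked against what Python's own encoder produces:

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses for one Unicode code point."""
    if code_point < 0x80:       # ASCII range
        return 1
    if code_point < 0x800:      # includes Greek, e.g. U+03B1
        return 2
    if code_point < 0x10000:    # rest of the Basic Multilingual Plane
        return 3
    return 4                    # everything up to U+10FFFF


# Compare the rule with the real encoder for a few sample characters.
for ch in ('A', 'α', '中', '\U00010330'):
    assert utf8_length(ord(ch)) == len(ch.encode('utf-8'))
    print(repr(ch), ord(ch), utf8_length(ord(ch)))
```

Note that the thresholds are ranges of code points, not "one byte per 256 characters": only the first 128 code points fit in a single byte.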
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Sunday, 9 June 2013 11:15:07 AM UTC+3, Steven D'Aprano wrote:

> Please try this: log into the Linux server, and then start up a Python
>
> import os, sys
> print(sys.version)
> s = ('\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER BETA}'
>      '\N{GREEK SMALL LETTER GAMMA}\N{GREEK SMALL LETTER DELTA}'
>      '\N{GREEK SMALL LETTER EPSILON}')
> print(s)
> filename = '/tmp/' + s
> open(filename, 'w')
> os.path.exists(filename)
>
> Copy and paste the results back here please.

Of course, here it is:

root@nikos [/home/nikos/www/cgi-bin]# python
Python 3.3.2 (default, Jun 3 2013, 16:18:05)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, sys
>>> print(sys.version)
3.3.2 (default, Jun 3 2013, 16:18:05)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-3)]
>>> s = ('\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER BETA}'
... '\N{GREEK SMALL LETTER GAMMA}\N{GREEK SMALL LETTER DELTA}'
... '\N{GREEK SMALL LETTER EPSILON}')
>>> print(s)
αβγδε
>>> filename = '/tmp/' + s
>>> open(filename, 'w')
<_io.TextIOWrapper name='/tmp/αβγδε' mode='w' encoding='UTF-8'>
>>> os.path.exists(filename)
True
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On 09Jun2013 08:15, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: | On Sun, 09 Jun 2013 00:00:53 -0700, nagia.retsina wrote: | path = b'/home/nikos/public_html/data/apps/' | files = os.listdir( path ) | | for filename in files: | # Compute 'path/to/filename' | filepath_bytes = path + filename | for encoding in ('utf-8', 'iso-8859-7', 'latin-1'): | try: | filepath = filepath_bytes.decode( encoding ) | except UnicodeDecodeError: | continue | | # Rename to something valid in UTF-8 | if encoding != 'utf-8': | os.rename( filepath_bytes, | filepath.encode('utf-8') ) | assert os.path.exists( filepath ) | break | else: | # This only runs if we never reached the break |raise ValueError( | 'unable to clean filename %r' % filepath_bytes ) | | Editing the traceback to get rid of unnecessary noise from the logging: | | Traceback (most recent call last): | File /home/nikos/public_html/cgi-bin/files.py, line 83, in module | assert os.path.exists( filepath ) | File /usr/local/lib/python3.3/genericpath.py, line 18, in exists | os.stat(path) | UnicodeEncodeError: 'ascii' codec can't encode characters in position | 34-37: ordinal not in range(128) | | Why am i still receing unicode decore errors? With the help of you guys | we have writen a prodecure just to avoid this kind of decoding issues | and rename all greek_byted_filenames to utf-8_byted. | | That's a very good question. It works for me when I test it, so I cannot | explain why it fails for you. If he's lucky the UnicodeEncodeError occurred while trying to print an error message, printing a greek Unicode string in the error with ASCII as the output encoding (default when not a tty IIRC). Cheers, -- Cameron Simpson c...@zip.com.au I generally avoid temptation unless I can't resist it. - Mae West -- http://mail.python.org/mailman/listinfo/python-list
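The scenario Cameron describes (printing Greek text to a stream whose encoding is ASCII) can be reproduced without touching the real stdout. This is only a sketch of that one hypothesis, using in-memory streams as stand-ins for a non-tty output; it does not show that this is what happened on the server.

```python
import io

# A text stream that, like the hypothesised stdout, only accepts ASCII.
ascii_out = io.TextIOWrapper(io.BytesIO(), encoding='ascii', newline='\n')

try:
    print('αβγδε', file=ascii_out)
    ascii_out.flush()
    failed = False
except UnicodeEncodeError:
    # Same exception type as in the traceback from files.py.
    failed = True

# With an error handler, the same stream escapes what it cannot encode.
safe_raw = io.BytesIO()
safe_out = io.TextIOWrapper(safe_raw, encoding='ascii',
                            errors='backslashreplace', newline='\n')
print('αβγδε', file=safe_out)
safe_out.flush()
escaped = safe_raw.getvalue()

print(failed, escaped)
```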
Re: Changing filenames from Greeklish = Greek (subprocess complain)
Νικόλαος Κούρας nikos.gr...@gmail.com writes: Τη Κυριακή, 9 Ιουνίου 2013 11:55:43 π.μ. UTC+3, ο χρήστης Lele Gaifax έγραψε: Uhm, no: encode transforms a Unicode string into an array of bytes, decode does the opposite transformation. You cannot do the former on an arbitrary array of bytes: s = νίκος utf8_bytes = s.encode('utf-8') greek_bytes = utf8_bytes.encode('iso-8869-7') Traceback (most recent call last): File stdin, line 1, in module AttributeError: 'bytes' object has no attribute 'encode' So something encoded into bytes cannot be re-encoded to some other bytes. How about a string i wonder? s = νίκος what_are these_bytes = s.encode('iso-8869-7').encode(utf-8') Ignoring the usual syntax error, this is just a variant of the code I posted: “s.encode('iso-8869-7')” produces a bytes instance which *cannot* be re-encoded again in whatever encoding. ciao, lele. -- nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia. l...@metapensiero.it | -- Fortunato Depero, 1929. -- http://mail.python.org/mailman/listinfo/python-list
Re: Changing filenames from Greeklish = Greek (subprocess complain)
Τη Κυριακή, 9 Ιουνίου 2013 12:12:36 μ.μ. UTC+3, ο χρήστης Cameron Simpson έγραψε: On 09Jun2013 02:00, =?utf-8?B?zp3Or866zr/PgiDOk866z4EzM866?= nikos.gr...@gmail.com wrote: | Steven wrote: | Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for | values up to 256? | | Because then how do you tell when you need one byte, and when you need | two? If you read two bytes, and see 0x4C 0xFA, does that mean two | characters, with ordinal values 0x4C and 0xFA, or one character with | ordinal value 0x4CFA? | | I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256. Then it would not be UTF-8. UTF-8 will encode an Unicode codepoint. Your suggestion will not. I dont follow. | UTF-8 and UTF-16 and UTF-32 | I though the number beside of UTF- was to declare how many bits the | character set was using to store a character into the hdd, no? | | Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values. | UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit | values to make a surrogate pair. | | A surrogate pair is like itting for example Ctrl-A, which means is a combination character that consists of 2 different characters? | Is this what a surrogate is? a pari of 2 chars? Essentially. The combination represents a code point. | UTF-8 uses 8-bit values, but sometimes | it combines two, three or four of them to represent a single code-point. | | 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65) | 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is 127 ) | 'a chinese ideogramm' to be utf8 encoded needs 4 byte to be stored ? (since ordinal 65000 ) | | The amount of bytes needed to store a character solely depends on the character's ordinal value in the Unicode table? Essentially. You can read up on the exact process in Wikipedia or the Unicode Standard. When you say essentially means you agree with my statements? 
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Sunday, 9 June 2013 12:20:58 PM UTC+3, Lele Gaifax wrote:

>> How about a string i wonder?
>> s = 'νίκος'
>> what_are these_bytes = s.encode('iso-8869-7').encode(utf-8')
>
> Ignoring the usual syntax error, this is just a variant of the code I
> posted: s.encode('iso-8869-7') produces a bytes instance which
> *cannot* be re-encoded again in whatever encoding.

    s = 'a'
    s = s.encode('iso-8859-7').decode('utf-8')
    print( s )
    a

(we got the original character back)

    s = 'α'
    s = s.encode('iso-8859-7').decode('utf-8')
    print( s )
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 0: unexpected end of data

Why this error? Is it because the ordinal value of 'α' is > 127?
Re: Changing filenames from Greeklish = Greek (subprocess complain)
Τη Κυριακή, 9 Ιουνίου 2013 12:14:12 μ.μ. UTC+3, ο χρήστης Νικόλαος Κούρας έγραψε: Τη Κυριακή, 9 Ιουνίου 2013 11:15:07 π.μ. UTC+3, ο χρήστης Steven D'Aprano έγραψε: Please try this: log into the Linux server, and then start up a Python import os, sys print(sys.version) s = ('\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER BETA}' '\N{GREEK SMALL LETTER GAMMA}\N{GREEK SMALL LETTER DELTA}' '\N{GREEK SMALL LETTER EPSILON}') print(s) filename = '/tmp/' + s open(filename, 'w') os.path.exists(filename) Copy and paste the results back here please. Of course: here it is: root@nikos [/home/nikos/www/cgi-bin]# python Python 3.3.2 (default, Jun 3 2013, 16:18:05) [GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux Type help, copyright, credits or license for more information. import os, sys print(sys.version) 3.3.2 (default, Jun 3 2013, 16:18:05) [GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] s = ('\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER BETA}' ... '\N{GREEK SMALL LETTER GAMMA}\N{GREEK SMALL LETTER DELTA}' ... '\N{GREEK SMALL LETTER EPSILON}') print(s) αβγδε filename = '/tmp/' + s open(filename, 'w') _io.TextIOWrapper name='/tmp/αβγδε' mode='w' encoding='UTF-8' os.path.exists(filename) True I dont much but it lloks correct to me, but then agian why this error? -- http://mail.python.org/mailman/listinfo/python-list
Re: Changing filenames from Greeklish = Greek (subprocess complain)
I k nwo i have been a pain in the ass these days but this is the lats explanation i want from you, just to understand it completely. Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for values up to 256? Because then how do you tell when you need one byte, and when you need two? If you read two bytes, and see 0x4C 0xFA, does that mean two characters, with ordinal values 0x4C and 0xFA, or one character with ordinal value 0x4CFA? I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256. UTF-8 and UTF-16 and UTF-32 I though the number beside of UTF- was to declare how many bits the character set was using to store a character into the hdd, no? Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values. UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit values to make a surrogate pair. A surrogate pair is like itting for example Ctrl-A, which means is a combination character that consists of 2 different characters? Is this what a surrogate is? a pari of 2 chars? UTF-8 uses 8-bit values, but sometimes it combines two, three or four of them to represent a single code-point. 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65) 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is 127 ) 'a chinese ideogramm' to be utf8 encoded needs 4 byte to be stored ? (since ordinal 65000 ) The amount of bytes needed to store a character solely depends on the character's ordinal value in the Unicode table? UTF-8 solves this problem by reserving some values to mean this byte, on its own, and others to mean this byte, plus the next byte, together, and so forth, up to four bytes. Some of the utf-8 bits that are used to represent a character's ordinal value are actually been also used to seperate or join the ordinal values themselves? Can you give an example please? How there are beign seperated? Computers are digital and work with numbers. 
So character 'A' - 65 (in decimal uses in charset's table) - 01011100 (as binary stored in disk) - 0xEF (as hex, when we open the file with a hex editor) Is this how the thing works? (above values are fictional) -- http://mail.python.org/mailman/listinfo/python-list
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Sun, 09 Jun 2013 10:55:43 +0200, Lele Gaifax wrote:

> Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes:
>
>> On Sat, 08 Jun 2013 22:09:57 -0700, nagia.retsina wrote:
>>> chr('A') would give me the mapping of this char, the number 65 while
>>> ord(65) would output the char 'A' likewise.
>>
>> Correct. Python uses Unicode, where code-point 65 (ordinal value 65)
>> means letter A.
>
> Actually, that's the other way around:
>
> >>> chr(65)
> 'A'
> >>> ord('A')
> 65

/facepalm

Of course you are right.

>>> What would happen if we we try to re-encode bytes on the disk? like
>>> trying:
>>>
>>> s = 'νίκος'
>>> utf8_bytes = s.encode('utf-8')
>>> greek_bytes = utf_bytes.encode('iso-8869-7')
>>>
>>> Can we re-encode twice or as many times we want and then decode back
>>> respectively lke?
>>
>> Of course. [...]
>
> Uhm, no: encode transforms a Unicode string into an array of bytes,
> decode does the opposite transformation. You cannot do the former on an
> arbitrary array of bytes:

And two for two. I misunderstood Nikos' question. As you point out, no, Python 3 will not allow you to re-encode bytes. You must first decode them to a string, then encode that string using a different encoding.

(I thought that this was what Nikos actually meant, but on re-reading his question more closely, that's not actually what he asked.)

Sorry for any confusion.

--
Steven
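The "decode first, then encode" rule can be sketched as follows. The helper name transcode is invented for the illustration; the only real API used is bytes.decode and str.encode.

```python
def transcode(data, source_encoding, target_encoding):
    """Convert bytes from one encoding to another by going through str."""
    text = data.decode(source_encoding)   # bytes -> str
    return text.encode(target_encoding)   # str -> bytes


greek_bytes = 'νίκος'.encode('iso-8859-7')            # one byte per letter
utf8_bytes = transcode(greek_bytes, 'iso-8859-7', 'utf-8')

print(len(greek_bytes), len(utf8_bytes))
print(utf8_bytes.decode('utf-8'))
```

This only works when the source encoding named is the one the bytes were really written in; the bytes themselves carry no record of it.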
Re: Changing filenames from Greeklish = Greek (subprocess complain)
Please tell me that this can actually be solved. I am willing to try anything for 'files.py' to load properly. Everything works as expected on my website; I have managed to correct pelatologio.py and koukos.py. This is the last thing the website needs: for files.py to load so users can grab important files in Greek format.
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On 09.06.2013 11:38, Νικόλαος Κούρας wrote:

> s = 'α'
> s = s.encode('iso-8859-7').decode('utf-8')
> print( s )
>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 0:
> unexpected end of data
>
> Why this error? because 'a' ordinal value > 127 ?

    >>> s = 'α'
    >>> s.encode('iso-8859-7')
    b'\xe1'
    >>> bin(0xe1)
    '0b11100001'

Now look at the table on https://en.wikipedia.org/wiki/UTF-8#Description to find out how many bytes a UTF-8 decoder expects when it reads that value.

Bye, Andreas
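The table lookup Andreas suggests can be sketched in a few lines. The function name expected_length is made up here, and the sketch looks only at the bit pattern of the first byte; it ignores the fact that a few lead values matching these patterns (0xC0, 0xC1, 0xF5 and above) are never valid in real UTF-8.

```python
def expected_length(lead_byte):
    """How many bytes a UTF-8 decoder expects, judging by the first byte."""
    if lead_byte < 0x80:              # 0xxxxxxx: a single-byte character
        return 1
    if 0xC0 <= lead_byte < 0xE0:      # 110xxxxx: start of a 2-byte sequence
        return 2
    if 0xE0 <= lead_byte < 0xF0:      # 1110xxxx: start of a 3-byte sequence
        return 3
    if 0xF0 <= lead_byte < 0xF8:      # 11110xxx: start of a 4-byte sequence
        return 4
    return None                       # 10xxxxxx continuation byte, or invalid


greek = 'α'.encode('iso-8859-7')      # the single byte 0xE1
needed = expected_length(greek[0])
print(needed, len(greek))
```

Byte 0xE1 announces a three-byte sequence, but the data ends after one byte, which matches the "unexpected end of data" wording of the error.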
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Sun, 09 Jun 2013 02:00:46 -0700, Νικόλαος Κούρας wrote:

> Steven wrote:
>>> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
>>> values up to 256?
>>
>> Because then how do you tell when you need one byte, and when you need
>> two? If you read two bytes, and see 0x4C 0xFA, does that mean two
>> characters, with ordinal values 0x4C and 0xFA, or one character with
>> ordinal value 0x4CFA?
>
> I mean utf-8 could use 1 byte for storing the 1st 256 characters. I
> meant up to 256, not above 256.

Think about it. Draw up a big table of one million plus characters:

    Ordinal  Character
    0        NUL control code
    1        SOH control code
    ...
    84       LATIN CAPITAL LETTER T
    85       LATIN CAPITAL LETTER U
    ...
    255      LATIN SMALL LETTER Y WITH DIAERESIS
    256      LATIN CAPITAL LETTER A WITH MACRON
    ...
    8485     OUNCE SIGN

and so on, all the way to 1114111.

Now, suppose you read a file, and see two bytes, shown in decimal: 84 followed by 85, or in hexadecimal, 0x54 followed by 0x55. How do you tell whether that means two characters, T followed by U, or a single character, ℥ (OUNCE SIGN)?

With UTF-32, you can, because every value takes exactly the same space. So a T followed by a U is:

    0x00000054 0x00000055

while a single ℥ is:

    0x00002125

and it is easy to tell them apart: each block of 4 bytes is exactly one character. But notice how many NUL bytes there are? In the three characters shown, there are eight NUL bytes. Most text will be filled with NUL bytes, which is very wasteful.

UTF-8 is designed to be compact, and also to be backwards-compatible with ASCII. Characters which are in ASCII will be a single byte, so there are no null bytes used for padding (except for NUL itself, of course). So the three characters TU℥ will be:

    0x54 0x55 0xE2 0x84 0xA5

Five bytes in total, instead of 12 for UTF-32. But the only tricky part is that the character with ordinal value 0xE2 (decimal 226, â) cannot be encoded as the single byte 0xE2, otherwise you would mistake the three bytes 0xE284A5 as starting with 'â' followed by two more characters.

And indeed, 'â' is encoded as two bytes:

    0xC3 0xA2

Likewise, the character with ordinal value 0xC3 (decimal 195, Ã) is also encoded as two bytes:

    0xC3 0x83

And so on. This way, there is never any confusion as to whether (say) three bytes are three one-byte characters, or one three-byte character.

>>> UTF-8 and UTF-16 and UTF-32
>>> I thought the number beside UTF- was to declare how many bits the
>>> character set was using to store a character into the hdd, no?
>>
>> Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
>> UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
>> values to make a surrogate pair.
>
> A surrogate pair is like hitting for example Ctrl-A, which means it is a
> combination character that consists of 2 different characters? Is this
> what a surrogate is? a pair of 2 chars?

Yes, a surrogate pair is a pair of two characters. But they're not *real* characters. They don't exist in any human language. They are just values that tell the program "these go together, and count as a single character".

(This is why Unicode prefers to talk about *code points* rather than characters. Some code points are characters, and some are not.)

>> UTF-8 uses 8-bit values, but sometimes it combines two, three or four
>> of them to represent a single code-point.
>
> 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)

Correct.

> 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is
> > 127 )

That looks like two characters to me, 'α' followed by '΄'. That will take 4 bytes, two for 'α' and two for '΄'.

> 'a chinese ideogramm' to be utf8 encoded needs 4 byte to be stored ?
> (since ordinal > 65000 )

Not necessarily four bytes. Could be three. Depends on the ideogram.

> The amount of bytes needed to store a character solely depends on the
> character's ordinal value in the Unicode table?

Yes.

>> UTF-8 solves this problem by reserving some values to mean "this byte,
>> on its own", and others to mean "this byte, plus the next byte,
>> together", and so forth, up to four bytes.
>
> Some of the utf-8 bits that are used to represent a character's ordinal
> value are actually being also used to separate or join the ordinal values
> themselves? Can you give an example please? How are they being
> separated?

Did you look up UTF-8 on Wikipedia like I suggested?

> Computers are digital and work with numbers. So character 'A' -> 65 (in
> decimal uses in charset's table) -> 01011100 (as binary stored in disk)
> -> 0xEF (as hex, when we open the file with a hex editor)
>
> Is this how the thing works? (above values are fictional)

You can check this in Python:

    py> c = 'A'
    py> ord(c)
    65
    py> bin(65)
    '0b1000001'
    py> hex(65)
    '0x41'
    py> c = 'α'
    py> ord(c)
    945
    py> c.encode('utf-8')
    b'\xce\xb1'
    py> c.encode('utf-16be')
    b'\x03\xb1'
    py> c.encode('utf-32be')
    b'\x00\x00\x03\xb1'
    py> c.encode('iso-8859-7')
    b'\xe1'

--
Steven
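As an illustration of how some bits mark the sequence and the rest carry the ordinal value, here is a hand-rolled sketch for the three-byte case only, applied to the OUNCE SIGN example from the message above. The function name encode_three_byte is invented for this note, and the sketch does not reject the surrogate range (U+D800 to U+DFFF) the way a real encoder does.

```python
def encode_three_byte(code_point):
    """Encode a code point in the range U+0800..U+FFFF as UTF-8 by hand."""
    assert 0x800 <= code_point <= 0xFFFF
    return bytes([
        0xE0 | (code_point >> 12),           # 1110xxxx: marker + top 4 bits
        0x80 | ((code_point >> 6) & 0x3F),   # 10xxxxxx: marker + middle 6 bits
        0x80 | (code_point & 0x3F),          # 10xxxxxx: marker + low 6 bits
    ])


by_hand = encode_three_byte(0x2125)          # U+2125 OUNCE SIGN
print(by_hand, '℥'.encode('utf-8'))
```

The leading 1110 and 10 bit patterns are the "reserved" part; the remaining 4 + 6 + 6 bits hold the code point itself.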
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Sun, 09 Jun 2013 19:16:06 +1000, Cameron Simpson wrote:

> If he's lucky the UnicodeEncodeError occurred while trying to print an error message,

That's not what happens at the interactive console:

py> assert os.path.exists('Ж1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError

> printing a greek Unicode string in the error with ASCII as the output encoding (default when not a tty IIRC).

An interesting thought. How would we test that?

-- 
Steven
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Sun, 09 Jun 2013 02:38:13 -0700, Νικόλαος Κούρας wrote:

> s = 'α'
> s = s.encode('iso-8859-7').decode('utf-8')
>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 0: unexpected end of data
>
> Why this error? because 'a' ordinal value > 127 ?

Look at it this way... consider encoding and decoding to be like translating from one language to another.

Suppose you start with the English word "street". You encode it to German by looking it up in an English-To-German dictionary:

    street -> Straße

Then you decode the German by looking "Straße" up in a German-To-English dictionary:

    Straße -> street

and everything is good. But suppose that after encoding the English to German, you get confused, and think that it is Italian, not German. So when it comes to decoding, you try to look up 'Straße' in an Italian-To-English dictionary, and discover that there is no such thing as the letter ß in Italian. So you cannot look the word up, and you get frustrated and shout "this is rubbish, there's no such thing as ß, that's not a letter!" Not in Italian, but it is a perfectly good letter in German. But you're looking it up in the wrong dictionary.

Same thing with UTF-8. You encoded the string 'α' by looking it up in the "Unicode To ISO-8859-7 bytes" dictionary. Then you try to decode it by looking for those bytes in the "UTF-8 bytes To Unicode" dictionary. But you can't find byte 0xe1 on its own in UTF-8 bytes, so Python shouts "this is rubbish, there's no such thing as 0xe1 on its own in UTF-8!" and raises UnicodeDecodeError.

Sometimes you don't get an exception. Suppose that you are encoding from French to German:

    qui -> die (both words mean "who" in English)

Now if you get confused, and decode the word 'die' by looking it up in an English-To-French dictionary, instead of German-To-French, you get:

    die -> mourir

So instead of getting 'qui' back again, you get 'mourir'. This is like mojibake: the results are garbage, but there is no exception raised to warn you.
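The two failure modes in the dictionary analogy can be reproduced in a few lines. This is only an illustrative sketch, with variable names chosen for the example: decoding with the wrong codec either fails loudly (UTF-8) or silently yields mojibake (Latin-1), while decoding with the matching codec round-trips.

```python
greek_bytes = 'α'.encode('iso-8859-7')   # the single byte 0xE1

# Wrong dictionary, loud failure: in UTF-8 the byte 0xE1 announces a
# three-byte sequence, but nothing follows it.
try:
    greek_bytes.decode('utf-8')
    failure = None
except UnicodeDecodeError as exc:
    failure = exc.reason

# Wrong dictionary, silent failure: Latin-1 assigns a character to every
# byte, so this never raises; it just gives the wrong letter.
mojibake = greek_bytes.decode('latin-1')

# Right dictionary: the original character comes back.
round_trip = greek_bytes.decode('iso-8859-7')

print(failure, ascii(mojibake), ascii(round_trip))
```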
-- 
Steven
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Sunday, 9 June 2013 3:36:51 PM UTC+3, Steven D'Aprano wrote:

>> printing a greek Unicode string in the error with ASCII as the output encoding (default when not a tty IIRC).
>
> An interesting thought. How would we test that?

Please elaborate on this for me. I didn't understand what you are trying to say, or your assumption about why I am still getting decode issues.
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Sun, Jun 9, 2013 at 2:38 AM, Νικόλαος Κούρας nikos.gr...@gmail.com wrote:
> On Sunday, 9 June 2013 12:20:58 PM UTC+3, Lele Gaifax wrote:
>
>>> How about a string i wonder?
>>> s = "νίκος"
>>> what_are these_bytes = s.encode('iso-8869-7').encode(utf-8')
>>
>> Ignoring the usual syntax error, this is just a variant of the code I posted: "s.encode('iso-8869-7')" produces a bytes instance which *cannot* be re-encoded again in whatever encoding.
>
> s = 'a'
> s = s.encode('iso-8859-7').decode('utf-8')
> print( s )
>
> a (we got the original character back)
>
> s = 'α'
> s = s.encode('iso-8859-7').decode('utf-8')
> print( s )
>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 0: unexpected end of data
>
> Why this error? because 'a' ordinal value > 127 ?

No. You get that error because the bytes are not encoded in UTF-8. They are encoded in ISO-8859-7. For ASCII strings (ord(x) < 128), ISO-8859-7 and UTF-8 look exactly the same. For anything else, they are different. If you were to try to decode it as ISO-8859-1, it would succeed, but you would get the character á back instead of α.

You're misunderstanding the decode function. Decode doesn't turn it into a string with the specified encoding. It takes it *from* the bytes in the specified encoding and turns it into Python's internal string representation. In Python 3.3, that encoding doesn't even have a name because it's not a standard encoding. So you want the decode argument to match the encode argument.
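As a sketch of what "the decode argument must match the encode argument" means in practice: in Python 3 a bytes object has no encode method at all, so converting Greek ISO bytes to UTF-8 bytes is always a decode (with the encoding the bytes are really in) followed by an encode. The variable names below are chosen for the example only.

```python
text = '\u03bd\u03af\u03ba\u03bf\u03c2'   # the string "νίκος"

greek = text.encode('iso-8859-7')          # str -> bytes, one byte per letter

# bytes cannot be "re-encoded": there is no bytes.encode in Python 3.
has_encode = hasattr(greek, 'encode')

# Transcoding goes through str: decode with the matching codec, then
# encode with the target codec.
utf8 = greek.decode('iso-8859-7').encode('utf-8')

print(has_encode, len(greek), len(utf8))
```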
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Sun, Jun 9, 2013 at 2:20 AM, Νικόλαος Κούρας nikos.gr...@gmail.com wrote:
> On Sunday, 9 June 2013 12:12:36 PM UTC+3, Cameron Simpson wrote:
>> On 09Jun2013 02:00, Νίκος Γκρ33κ nikos.gr...@gmail.com wrote:
>> | Steven wrote:
>> | >> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
>> | >> values up to 256?
>> | >
>> | > Because then how do you tell when you need one byte, and when you need
>> | > two? If you read two bytes, and see 0x4C 0xFA, does that mean two
>> | > characters, with ordinal values 0x4C and 0xFA, or one character with
>> | > ordinal value 0x4CFA?
>> |
>> | I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.
>>
>> Then it would not be UTF-8. UTF-8 will encode any Unicode codepoint. Your suggestion will not.
>
> I dont follow.

The point of the UTF formats is that they can encode any of the 1.1 million codepoints available in Unicode. Your suggestion can only encode 256 code points. We have that encoding already: it's called Latin-1, and it can't encode any of your Greek characters (which is why ISO-8859-7 exists, which can encode the Greek characters but not the accented Latin ones).

If you were to use the whole byte to store the first 256 characters, you wouldn't be able to store character number 256, because the computer wouldn't be able to tell the difference between character 257 (0x01 0x01) and two chr(1)s. UTF-8 gets around this by reserving the top bit as an "am I part of a multibyte sequence" flag,

>> | >> UTF-8 and UTF-16 and UTF-32
>> | >> I though the number beside of UTF- was to declare how many bits the
>> | >> character set was using to store a character into the hdd, no?
>> | >
>> | > Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
>> | > UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
>> | > values to make a surrogate pair.
>> |
>> | A surrogate pair is like hitting for example Ctrl-A, which means it is a combination character that consists of 2 different characters?
>> | Is this what a surrogate is? a pair of 2 chars?
>>
>> Essentially. The combination represents a code point.
>>
>> | > UTF-8 uses 8-bit values, but sometimes
>> | > it combines two, three or four of them to represent a single code-point.
>> |
>> | 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)
>> | 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 )
>> | 'a chinese ideogramm' to be utf8 encoded needs 4 byte to be stored ? (since ordinal > 65000 )
>> |
>> | The amount of bytes needed to store a character solely depends on the character's ordinal value in the Unicode table?
>>
>> Essentially. You can read up on the exact process in Wikipedia or the Unicode Standard.
>
> When you say essentially means you agree with my statements?

In UTF-8 or UTF-16, the number of bytes required for the character is dependent on its code point, yes. That isn't the case for UTF-32, where every character uses exactly four bytes.
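The byte counts are easy to check rather than take on trust. The sketch below uses a made-up helper, `encoded_sizes`, that reports how many bytes one character needs in each UTF format; the big-endian codec names are used so that no byte-order mark is added to the count.

```python
def encoded_sizes(ch):
    """Map each UTF encoding name to the number of bytes needed for ch."""
    return {enc: len(ch.encode(enc))
            for enc in ('utf-8', 'utf-16-be', 'utf-32-be')}


samples = {
    'Latin a': 'a',
    'Greek alpha': '\u03b1',
    'CJK ideograph U+4E2D': '\u4e2d',
    'CJK ideograph U+20000': '\U00020000',
}
for label, ch in samples.items():
    print(label, encoded_sizes(ch))
```

Note how the two ideographs differ: one below U+10000 takes three bytes in UTF-8 and two in UTF-16, one above it takes four bytes in both, and UTF-32 is always four.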
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Saturday, 8 June 2013 5:52:22 AM UTC+3, Cameron Simpson wrote:
> On 07Jun2013 11:52, Νίκος Γκρ33κ nikos.gr...@gmail.com wrote:
> | ni...@superhost.gr [~/www/cgi-bin]#
> | [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173]   File "/home/nikos/public_html/cgi-bin/files.py", line 81
> | [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173]     if( flag == 'greek' )
> | [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173]                         ^
> | [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] SyntaxError: invalid syntax
> | [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] Premature end of script headers: files.py
> | ---
> | i dont know why that if statement errors.
>
> Python statements that continue (if, while, try etc) end in a colon, so:

Oh, I am very sorry. Oh my God, I can't believe I missed a colon *again*. I have corrected this:

#
# Collect filenames of the path dir as bytes
filename_bytes = os.listdir( b'/home/nikos/public_html/data/apps/' )

for filename in filename_bytes:
    # Compute 'path/to/filename' into bytes
    filepath_bytes = b'/home/nikos/public_html/data/apps/' + b'filename'
    flag = False

    try:
        # Assume current file is utf8 encoded
        filepath = filepath_bytes.decode('utf-8')
        flag = 'utf8'
    except UnicodeDecodeError:
        try:
            # Since current filename is not utf8 encoded then it has to be greek-iso encoded
            filepath = filepath_bytes.decode('iso-8859-7')
            flag = 'greek'
        except UnicodeDecodeError:
            print( '''I give up! File name is unreadable!''' )

    if flag == 'greek':
        # Rename filename from greek bytes --> utf-8 bytes
        os.rename( filepath_bytes, filepath.encode('utf-8') )
==

Now everything was supposed to work, but instead I am getting this surrogate error once more. What is this surrogate thing?

Since I make use of error catching and handling like 'except UnicodeDecodeError:', then if utf8's decode fails for some reason, it should leave that file alone and try the next file?

    try:
        # Assume current file is utf8 encoded
        filepath = filepath_bytes.decode('utf-8')
        flag = 'utf8'
    except UnicodeDecodeError:

This is what it is supposed to do, correct?

==
[Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173]   File "/home/nikos/public_html/cgi-bin/files.py", line 94, in <module>
[Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173]     cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
[Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173]   File "/usr/local/lib/python3.3/site-packages/PyMySQL3-0.5-py3.3.egg/pymysql/cursors.py", line 108, in execute
[Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173]     query = query.encode(charset)
[Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173] UnicodeEncodeError: 'utf-8' codec can't encode character '\\udcce' in position 35: surrogates not allowed
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Sat, Jun 8, 2013 at 4:49 PM, Νικόλαος Κούρας nikos.gr...@gmail.com wrote:
> Oh my God i cant beleive i missed a colon *again*:

For most Python programmers, this is a matter of moments to solve. Run the program, get a SyntaxError, fix it. Non-interesting event. (Maybe even sooner than that, if the editor highlights it for you.)

This is why you really need to start yourself a testbox. DO NOT PLAY ON YOUR LIVE SYSTEM. This is sysadminning 101.

And Python programming 101: The error traceback points to the error, or just after it. Get to know how error messages work. This is not even Python-specific. *Every* language behaves this way. You look at the highlighted line; if you can't see an error there, you look a little bit higher.

You should not need to beg for help for such trivial problems. This is the mark of a novice. You ought no longer to be a novice, based on how long you've been doing this stuff. You ought not to behave like one.

ChrisA
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Fri, 07 Jun 2013 23:49:17 -0700, Νικόλαος Κούρας wrote:

[...]
> Oh iam very sorry. Oh my God i cant beleive i missed a colon *again*: I have corrected this:

[snip code]

Stop posting your code after every trivial edit!!!

-- 
Steven
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Sat, Jun 8, 2013 at 5:26 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote:
> On Fri, 07 Jun 2013 23:49:17 -0700, Νικόλαος Κούρας wrote:
>
> [...]
>> Oh iam very sorry. Oh my God i cant beleive i missed a colon *again*: I have corrected this:
>
> [snip code]
>
> Stop posting your code after every trivial edit!!!

I think he uses the python-list archives as ersatz source control.

ChrisA
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Thu, 06 Jun 2013 23:35:33 -0700, nagia.retsina wrote:

>> Working with bytes is only for when the file names are turned to garbage. Your file names (some of them) are turned to garbage. Fix them, and then use file names as strings.
>
> Can't. '~/data/apps/' is filled every day with more and more files which are uploaded via the FileZilla client, which I think behaves pretty much like putty, uploading filenames as greek-iso bytes.

Well, that is certainly a nuisance. Try something like this:

# Untested.
dir = b'/home/nikos/public_html/data/apps/'  # This must be bytes.
files = os.listdir(dir)
for name in files:
    pathname_as_bytes = dir + name
    for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
        try:
            pathname = pathname_as_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue
        # Rename to something valid in UTF-8.
        if encoding != 'utf-8':
            os.rename(pathname_as_bytes, pathname.encode('utf-8'))
        assert os.path.exists(pathname)
        break
    else:
        # This only runs if we never reached the break.
        raise ValueError('unable to clean filename %r'%pathname_as_bytes)

-- 
Steven
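The decoding step of the loop above can be pulled out and tried without touching any real files. This is a minimal sketch under that assumption; the function name `decode_filename` and the sample file name are invented for the example. One thing it makes visible: Latin-1 accepts every possible byte, so with 'latin-1' last in the list the final error branch can never actually be reached.

```python
def decode_filename(name_bytes,
                    encodings=('utf-8', 'iso-8859-7', 'latin-1')):
    """Return (encoding, text) for the first encoding that decodes cleanly."""
    for encoding in encodings:
        try:
            return encoding, name_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError('unable to decode filename %r' % name_bytes)


name = '\u0395\u03c5\u03c7\u03ae.mp3'      # a Greek file name, "Ευχή.mp3"
print(decode_filename(name.encode('utf-8'))[0])        # detected as utf-8
print(decode_filename(name.encode('iso-8859-7'))[0])   # detected as iso-8859-7
```

This is guessing, not detection: a short run of Greek ISO bytes could, by coincidence, also be valid UTF-8, in which case the first guess wins and the name would be misread.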
Re: Changing filenames from Greeklish = Greek (subprocess complain)
Νικόλαος Κούρας wrote:
> Session settings afaik is for putty to remember hosts to connect to, not terminal options. I might be wrong though. No matter how many times i change its options next time i run it always defaults back.

PuTTY can most definitely remember its settings:

- Start PuTTY; you should get the PuTTY Configuration window
- Select a session in the list of sessions
- Click Load
- Change any setting you want to change
- Go back to Session in the Category treeview
- Click Save

HTH

-- 
People almost invariably arrive at their beliefs not on the basis of proof but on the basis of what they find attractive.
        -- Pascal Blaise
r...@roelschroeven.net
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On 08/06/2013 07:49, Νικόλαος Κούρας wrote:

[snip code]

> Now everythitng were supposed to work but instead iam getting this surrogate error once more. What is this surrogate thing?
>
> [Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173]   File "/home/nikos/public_html/cgi-bin/files.py", line 94, in <module>
> [Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173]     cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
> [Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173]   File "/usr/local/lib/python3.3/site-packages/PyMySQL3-0.5-py3.3.egg/pymysql/cursors.py", line 108, in execute
> [Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173]     query = query.encode(charset)
> [Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173] UnicodeEncodeError: 'utf-8' codec can't encode character '\\udcce' in position 35: surrogates not allowed

Look at the traceback. It says that the exception was raised by:

    query = query.encode(charset)

which was called by:

    cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )

But what is 'filename'? And what has it to do with the first code snippet? Does the traceback have _anything_ to do with the first code snippet?
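One possible source of the '\udcce' in that traceback, offered as an assumption because the thread does not show where `filename` is set: when Python 3 has to turn undecodable file-name bytes into str (for example when `os.listdir` is called with a str path), it uses the "surrogateescape" error handler, which smuggles each bad byte 0xNN through as the lone surrogate U+DCNN. Such a string cannot be encoded as strict UTF-8, which is the "surrogates not allowed" message. The sketch below reproduces that mechanism in isolation.

```python
raw = b'\xce'    # first byte of a 2-byte UTF-8 sequence, with nothing after it

# Undecodable bytes are mapped to lone surrogates: 0xCE -> U+DCCE.
escaped = raw.decode('utf-8', 'surrogateescape')

# A strict UTF-8 encode (what a database driver would do) refuses surrogates.
try:
    escaped.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Encoding with the same handler gives back the original byte unchanged.
restored = escaped.encode('utf-8', 'surrogateescape')

print(ascii(escaped), strict_ok, restored)
```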
Re: Changing filenames from Greeklish = Greek (subprocess complain)
Sorry for the delay guys, I was busy with other things today and I am still reading your responses. I still haven't read them all, just Cameron's.

Here is what I have now, following Cameron's advice:

#
# Collect filenames of the path directory as bytes
path = b'/home/nikos/public_html/data/apps/'
filenames_bytes = os.listdir( path )

for filename_bytes in filenames_bytes:
    try:
        filename = filename_bytes.decode('utf-8')
    except UnicodeDecodeError:
        # Since its not a utf8 bytestring then its for sure a greek bytestring

        # Prepare arguments for rename to happen
        utf8_filename = filename_bytes.encode('utf-8')
        greek_filename = filename_bytes.encode('iso-8859-7')

        utf8_path = path + utf8_filename
        greek_path = path + greek_filename

        # Rename current filename from greek bytes --> utf8 bytes
        os.rename( greek_path, utf8_path )
==

I know this is wrong though. Since filename_bytes is the current filename encoded as utf8 or greek-iso, I cannot just *encode* what is already encoded by doing this:

    utf8_filename = filename_bytes.encode('utf-8')
    greek_filename = filename_bytes.encode('iso-8859-7')
Re: Changing filenames from Greeklish = Greek (subprocess complain)
Okay, after also reading Steven's post, I was relieved from the stuck position I was in, so with an alteration of a few variable names here is the code now:

#
# Collect directory and its filenames as bytes
path = b'/home/nikos/public_html/data/apps/'
files = os.listdir( path )

for filename in files:
    # Compute 'path/to/filename'
    filepath_bytes = path + filename

    for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
        try:
            filepath = filepath_bytes.decode( encoding )
        except UnicodeDecodeError:
            continue

        # Rename to something valid in UTF-8
        if encoding != 'utf-8':
            os.rename( filepath_bytes, filepath.encode('utf-8') )

        assert os.path.exists( filepath )
        break
    else:
        # This only runs if we never reached the break
        raise ValueError( 'unable to clean filename %r' % filepath_bytes )
=

I don't know why it is still failing when it tries to decode stuff, since it tries 3 ways of decoding. Here is the exact error.

ni...@superhost.gr [~/www/cgi-bin]#
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173] Error in sys.excepthook:
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173] ValueError: underlying buffer has been detached
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173]
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173] Original exception was:
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173] Traceback (most recent call last):
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173]   File "/home/nikos/public_html/cgi-bin/files.py", line 78, in <module>
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173]     assert os.path.exists( filepath )
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173]   File "/usr/local/lib/python3.3/genericpath.py", line 18, in exists
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173]     os.stat(path)
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173] UnicodeEncodeError: 'ascii' codec can't encode characters in position 34-37: ordinal not in range(128)
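Note that the traceback above is an *encode* error raised inside `os.stat`, not a decode error from the loop. A plausible reading, stated here as an assumption since the server's environment is not shown in the thread, is that the CGI process runs with an ASCII filesystem encoding, so a str path containing Greek letters cannot be converted to bytes for the system call. The sketch below shows how to inspect that encoding and how to sidestep the conversion by handing `os.path.exists` a bytes path; it uses a temporary directory and a made-up file name rather than the real paths.

```python
import os
import sys
import tempfile

# The codec Python uses to turn str paths into bytes for the OS.
print(sys.getfilesystemencoding())

with tempfile.TemporaryDirectory() as tmp:
    # Build the path as bytes from the start, as os.listdir(b'...') would.
    path_bytes = os.path.join(os.fsencode(tmp), b'example.txt')
    with open(path_bytes, 'wb') as handle:
        handle.write(b'x')

    # A bytes path is passed to the OS as-is: no str -> bytes encoding
    # step, so no UnicodeEncodeError is possible here.
    exists_via_bytes = os.path.exists(path_bytes)

print(exists_via_bytes)
```

In the loop from the post, that would mean asserting on the renamed bytes path instead of on the decoded `filepath`.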
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On 08/06/2013 17:53, Νικόλαος Κούρας wrote:
> Here is what i have now following Cameron's advices:

[snip code]

> I know this is wrong though.

Yet you did it anyway!

> Since filename_bytes is the current filename encoded as utf8 or greek-iso then i cannot just *encode* what is already encoded by doing this:
>
> utf8_filename = filename_bytes.encode('utf-8')
> greek_filename = filename_bytes.encode('iso-8859-7')

Try reading and understanding the code I originally posted.
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On 8/6/2013 5:49 πμ, Cameron Simpson wrote: On 07Jun2013 04:53, =?utf-8?B?zp3Or866zr/PgiDOk866z4EzM866?= nikos.gr...@gmail.com wrote: | Τη Παρασκευή, 7 Ιουνίου 2013 11:53:04 π.μ. UTC+3, ο χρήστης Cameron Simpson έγραψε: | | | errors='replace' mean dont break in case or error? | | | Yes. The result will be correct for correct iso-8859-7 and slightly mangled | | for something that would not decode smoothly. | | | How can it be correct? We have encoded out string in utf-8 and then | | we tried to decode it as greek-iso? How can this possibly be | | correct? | | If it is a valid iso-8859-7 sequence (which might cover everything, | since I expect it is an 8-bit 1:1 mapping from bytes values to a | set of codepoints, just like iso-8859-1) then it may decode to the | wrong characters, but the reverse process (characters encoded as | bytes) should produce the original bytes. With a mapping like this, | errors='replace' may mean nothing; there will be no errors because | the only Unicode characters in play are all from iso-8859-7 to start | with. Of course another string may not be safe. | | Visually, the names will be garbage. And if you go: |mv '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ.mp3' '999-Eυχή-του-Ιησού.mp3' | while using the iso-8859-7 locale, the wrong thing will occur | (assuming it even works, though I think it should because all these | characters are represented in iso-8859-7, yes?) | | All the rest you i understood only the above quotes its still unclear to me. | I cant see to understand it. | | Do you mean that utf-8, latin-iso, greek-iso and ASCII have the 1st 0-127 codepoints similar? Yes. It is certainly true for utf-8 and latin-iso and ASCII. I expect it to be so for greek-iso, but have not checked. They're all essentially the ASCII set plus a range of other character codepoints for the upper values. 
> The 8-bit sets iso-8859-1 (which I take you to mean by latin-iso) and iso-8859-7 (which I take you to mean by greek-iso) are single byte mappings with the top half mapped to characters commonly used in a particular region.
>
> Unicode has a much greater range, but the UTF-8 encoding of Unicode deliberately has the bottom 0-127 identical to ASCII, and higher values represented by multibyte sequences commencing with at least the first byte in the 128-255 range. In this way pure ASCII files are already in UTF-8 (and, in fact, work just fine for the iso-8859-x encodings as well).

Hold on!

In the beginning there was ASCII with 0-127 values, and then there was Unicode with the 0-127 of ASCII plus I don't know how many more?

Now ASCII needs 1 byte to store a single character while Unicode needs 2 bytes to store a character, and that is because it has 256 characters to store, 2^8 bits? Is this correct?

Now UTF-8, latin-iso, greek-iso etc. are WAYS of storing characters on the hard drive? Because in some post I have read 'UTF-8 encoding of Unicode'. Can you please explain to me what the difference is between ASCII and Unicode themselves, and then between them and 'charsets'? I'm still confused about this.

Is it like we said in C++:
'int a', a variable with name 'a' of type integer.
'char a', a variable with name 'a' of type char.

So taken from the above example (the closest I could think of), the way I understand them is:

A 'string' can be of (unicode's or ascii's) type, and that type needs a way (that's a charset) to store this string on the hdd as a sequence of bytes?

-- 
Webhost http://superhost.gr
Weblog http://psariastonafro.wordpress.com
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Sat, 08 Jun 2013 21:01:23 +0300, Νικόλαος Κούρας wrote:

> In the beginning there was ASCII with 0-127 values

No, there were encoding systems that existed before ASCII, such as EBCDIC. But we can ignore those, and just start with ASCII.

> and then there was Unicode with 0-127 of ASCII's + i dont know how much many more?

No, you have missed the utter chaos of dozens and dozens of Windows codepages and charsets. We still have to live with the pain of that.

But now we have Unicode, with code points numbered from 0 up to 0x10FFFF (decimal 1114111). You can consider a code point to be the same as a character, at least for now.

> Now ASCIII needs 1 byte to store a single character

ASCII actually needs 7 bits to store a character. Since computers are optimized to work with bytes, not bits, normally ASCII characters are stored in a single byte, with one bit wasted.

> while Unicode needs 2 bytes to store a character

No. Since there are 0x110000 different Unicode characters (really code points, but ignore the difference) two bytes is not enough. Unicode needs 21 bits to store a character. Since that is more than 2 bytes, but less than 3, there are a few different ways for Unicode to be stored in memory, including:

"Wide" Unicode uses four bytes per character. Why four instead of three? Because computers are more efficient when working with chunks of memory that are a multiple of four.

"Narrow" Unicode uses two bytes per character. Since two bytes is only enough for about 65,000 characters, not 1,000,000+, the rest of the characters are stored as pairs of two-byte "surrogates".

> and that is because it has 256 characters to store 2^8bits ?

Correct.

> Now UTF-8, latin-iso, greek-iso e.t.c are WAYS of storing characters into the hard drive?

Your computer cannot carve a tiny little "A" into the hard drive when it stores that letter in a file. It has to write some bytes. So you need to know:

- what byte, or bytes, represents the letter "A"?
- what byte, or bytes, represents the letter "B"?
- what byte, or bytes, represents the letter "λ"?

and so on. This set of rules, "byte X means letter Y", is called an encoding. If you don't know what encoding to use, you cannot tell what the byte means.

> Because in some post i have read that 'UTF-8 encoding of Unicode'. Can you please explain to me whats the difference of ASCII-Unicode themselves aand then of them compared to 'Charsets' . I'm still confused about this.

A charset is an ordered set of characters. For example, ASCII has 128 characters, starting with NUL:

    NUL ... A B C D E ... Z [ \ ] ^ ... a b c ... z ...

where NUL is at position 0, 'A' is at position 65, 'B' at position 66, and so on. Latin-1 is similar, except there are 256 positions. Greek ISO-8859-7 is also similar, also 256 positions, but the characters are different. And so on, with dozens of charsets.

And then there is Unicode, which includes *every* character in all of those dozens of charsets. It has 1114112 positions (most are currently unfilled).

An encoding is simply a program that takes a character and returns a byte, or vice versa. For instance, the ASCII encoding will take character 'A'. That is found at position 65, which is 0x41 in hexadecimal, so the ASCII encoding turns character 'A' into byte 0x41, and vice versa.

> Is it like we said in C++: ' int a', a variable with name 'a' of type integer. 'char a', a variable with name 'a' of type char
>
> So taken form above example(the closest i could think of), the way i understand them is:
>
> A 'string' can be of (unicode's or ascii's) type and that type needs a way (thats a charset) to store this string into the hdd as a sequense of bytes?

Correct.

-- 
Steven
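The charset/encoding distinction can be seen directly in Python 3, where `ord` gives a character's position in the Unicode charset and `str.encode` applies one particular set of "byte X means letter Y" rules. This is only an illustrative sketch: the same position comes out as different bytes under different encodings, and a code point above 0xFFFF shows the two-unit surrogate pair that the "narrow" representation relies on.

```python
import sys

# One character, one position in the charset...
code_point = ord('λ')

# ...but different bytes depending on which encoding writes it out.
as_utf8 = 'λ'.encode('utf-8')
as_greek = 'λ'.encode('iso-8859-7')
as_utf16 = 'λ'.encode('utf-16-be')

# The highest position Unicode allows.
highest = sys.maxunicode

# A code point that does not fit in 16 bits becomes a surrogate pair in
# UTF-16: two 16-bit units, 0xD83D followed by 0xDE00 for U+1F600.
pair = '\U0001F600'.encode('utf-16-be')

print(code_point, as_utf8, as_greek, as_utf16, hex(highest), pair)
```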
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Sun, Jun 9, 2013 at 4:01 AM, Νικόλαος Κούρας nikos.gr...@gmail.com wrote: Hold on! In the beginning there was ASCII with 0-127 values and then there was Unicode with 0-127 of ASCII's + i dont know how much many more? Now ASCIII needs 1 byte to store a single character while Unicode needs 2 bytes to store a character and that is because it has 256 characters to store 2^8bits ? Is this correct? No. Let me start from the beginning. Computers don't work with characters, or strings, natively. They work with numbers. To be specific, they work with bits; and it's only by convention that we can work with anything larger. For instance, there's a VERY common convention around the PC world that a set of bits can be interpreted as a signed integer; if the highest bit is set, it's negative. There are also standards for floating-point (IEEE 754), and so on. ASCII is a character set. It defines a mapping of numbers to characters - for instance, @ is 64, SOH is 1, $ is 36, etcetera, etcetera. There are 128 such mappings. Since they all fit inside a 7-bit number, there's a trivial way to represent ASCII characters in a PC's 8-bit byte: you just leave the high bit clear and use the other seven. There have been various schemes for using the eighth bit - serial ports with parity, WordStar (I think) marking the ends of words, and most notably, Extended ASCII schemes that give you another whole set of 128 characters. And that was the beginning of Code Pages, because nobody could agree on what those extra 128 should be. Norwegians used Norwegian, the Greeks were taught their Greek, Arabians created themselves an Arabian codepage with the speed of summer lightning, and Hebrews allocated from 255 down to 128, which is absolutely frightening. But I digress. There were a variety of multi-byte schemes devised at various times, but we'll ignore all of them and jump straight to Unicode. 
With Unicode, there's (theoretically) no need to use any other system ever again, because whatever character you want, it'll exist in Unicode. In theory, of course; there are debates over that. Now, Unicode currently has defined an address space of roughly 20 bits, and in a throwback to the first programming I ever did, it's a segmented system: sixteen or seventeen planes of 65,536 characters each. (Fortunately the planes are identified by low numbers, not high numbers, and there's no stupidity of overlapping planes the way the 8086 did with memory!) The highest planes are special (plane 14 has a few special-purpose characters, planes 15 and 16 are for private use), and most of the middle ones have no characters assigned to them, so for the most part, you'll see characters from the first three planes. So what do we now have? A mapping of characters to code points, which are numbers. (I'm leaving aside the issues of combining characters and such for the moment.) But computers don't work with numbers, they work with bits. Somehow we have to store those bits in memory. There are a good few ways to do that; one is to note that every Unicode character can be represented inside 32 bits, so we can use the standard integer scheme safely. (Since they fit inside 31 bits, we don't even need to care if it's signed or unsigned.) That's called UTF-32 or UCS-4, and it's a great way to handle the full Unicode range in a manner that makes a Texan look agoraphobic. Wide builds of Python up to 3.2 did this. Or you can try to store them in 16-bit numbers, but then you have to worry about the ones that don't fit in 16 bits, because it's really hard to squeeze 20 bits of information into 16 bits of storage. UTF-16 is one way to do this; special numbers mean grab another number. It has its issues, but is (in my opinion, unfortunately) fairly prevalent. Narrow builds of Python up to 3.2 did this. 
Finally, you can use a more complicated scheme that uses anywhere from 1 to 4 bytes for each character, by carefully encoding information into the top bit - if it's set, you have a multi-byte character. That's how UTF-8 works, and is probably the most prevalent disk/network encoding. All of the UTF-X systems are called UCS Transformation Formats (UCS meaning Universal Character Set, roughly Unicode). They are mappings from Unicode numbers to bytes. Between Unicode and UTF-X, you have a mapping from character to byte sequence. Now UTF-8, latin-iso, greek-iso e.t.c are WAYS of storing characters into the hard drive? The ISO standard 8859 specifies a number of ASCII-compatible encodings, referred to as ISO-8859-1 through ISO-8859-16. You've been working with ISO-8859-1, also called Latin-1, and ISO-8859-7, which has your Greek characters in it. These are all ways of translating characters into numbers; and since they all fit within 8 bits, they're most commonly represented on PCs with single bytes. So taken form above example(the closest i could think of), the way i understand them is: A 'string' can be of (unicode's or ascii's) type and that type needs a
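The byte counts described above can be checked directly in Python 3. This is only a small sketch: the sample characters and the helper name `byte_lengths` are chosen here for illustration and are not from the original post.

```python
# Compare how many bytes one character occupies in each UTF form.
# The '-le' codec variants are used so that no byte-order mark is added.
samples = ['A', 'ν', '€', '\U0001F600']  # ASCII, Greek, BMP symbol, non-BMP

def byte_lengths(ch):
    """Return the code point of ch and its encoded size in three UTF forms."""
    return {
        'code_point': ord(ch),
        'utf-8': len(ch.encode('utf-8')),
        'utf-16': len(ch.encode('utf-16-le')),
        'utf-32': len(ch.encode('utf-32-le')),
    }

for ch in samples:
    print(repr(ch), byte_lengths(ch))
```

UTF-32 always spends four bytes, UTF-16 spends two or four, and UTF-8 spends one to four depending on the code point.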
Re: Changing filenames from Greeklish = Greek (subprocess complain)
Τη Σάββατο, 8 Ιουνίου 2013 10:01:57 μ.μ. UTC+3, ο χρήστης Steven D'Aprano έγραψε: ASCII actually needs 7 bits to store a character. Since computers are optimized to work with bytes, not bits, normally ASCII characters are stored in a single byte, with one bit wasted. So ASCII and Unicode are 2 Encoding Systems currently in use. How should i imagine them, visualize them? Like tables 'A' = 65, 'B' = 66 and so on? But if i do then that would be the visualization of a 'charset' not of an encoding system. What the diffrence of an encoding system and of a charset? ebcdic - ascii - unicode = al of them are encoding systems greek-iso - latin-iso - utf8 - utf16 = all of them are charsets. What are these differences? i cant imagine them all, i can only imagine charsets not encodign systems. Why python interprets by default all given strings as unicode and not ascii? because the former supports many positions while ascii only 127 positions , hence can interpet only 127 different characters? Narrow Unicode uses two bytes per character. Since two bytes is only enough for about 65,000 characters, not 1,000,000+, the rest of the characters are stored as pairs of two-byte surrogates. surrogates literal means a replacemnt? Latin-1 is similar, except there are 256 positions. Greek ISO-8859-7 is also similar, also 256 positions, but the characters are different. And so on, with dozens of charsets. Latin has to display english chars(capital, small) + numbers + symbols. that would be 127 why 256? greek = all of the above plus greek chars, no? And then there is Unicode, which includes *every* character is all of those dozens of charsets. It has 1114111 positions (most are currently unfilled). Shouldt the positions that Unicode has to use equal to the summary of all available characters of all the languages of the worlds plus numbers and special chars? why 1.000.000+ why the need for so many positions? Narrow Unicode format (2 byted) can cover all ofmthe worlds symbols. 
An encoding is simply a program that takes a character and returns a byte, or visa versa. For instance, the ASCII encoding will take character 'A'. That is found at position 65, which is 0x41 in hexadecimal, so the ASCII encoding turns character 'A' into byte 0x41, and visa versa. Why you say ASCII turn a character into HEX format and not as in binary format? Isnt the latter the way bytes are stored into hdd, like 01010010101 etc? Are they stored as hex instead or you just said so to avoid printing 0s and 1s? -- http://mail.python.org/mailman/listinfo/python-list
Re: Changing filenames from Greeklish = Greek (subprocess complain)
Sorry for displaying my code so many times. I know I have exhausted you, but this is the last thing I am going to ask of you in this thread. We are very close to having this working.

    #
    # Collect directory and its filenames as bytes
    path = b'/home/nikos/public_html/data/apps/'
    files = os.listdir( path )

    for filename in files:
        # Compute 'path/to/filename'
        filepath_bytes = path + filename
        for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
            try:
                filepath = filepath_bytes.decode( encoding )
            except UnicodeDecodeError:
                continue

            # Rename to something valid in UTF-8
            if encoding != 'utf-8':
                os.rename( filepath_bytes, filepath.encode('utf-8') )

            assert os.path.exists( filepath )
            break
        else:
            # This only runs if we never reached the break
            raise ValueError( 'unable to clean filename %r' % filepath_bytes )

    #
    # Collect filenames of the path dir as strings
    filenames = os.listdir( '/home/nikos/public_html/data/apps/' )

    # Load'em
    for filename in filenames:
        try:
            # Check the presence of a file against the database and insert if it doesn't exist
            cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
            data = cur.fetchone()

            if not data:
                # First time for file; primary key is automatic, hit is defaulted
                print( "iam here", filename + '\n' )
                cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
        except pymysql.ProgrammingError as e:
            print( repr(e) )

    #
    # Collect filenames of the path dir as strings
    filenames = os.listdir( '/home/nikos/public_html/data/apps/' )
    filepaths = ()

    # Build a set of 'path/to/filename' based on the objects of path dir
    for filename in filenames:
        filepaths.add( filename )

    # Delete spurious
    cur.execute('''SELECT url FROM files''')
    data = cur.fetchall()

    # Check database's filenames against path's filenames
    for rec in data:
        if rec not in filepaths:
            cur.execute('''DELETE FROM files WHERE url = %s''', rec )

    [Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173] Original exception was:
    [Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173] Traceback (most recent call last):
    [Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173]   File "/home/nikos/public_html/cgi-bin/files.py", line 78, in <module>
    [Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173]     assert os.path.exists( filepath )
    [Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173]   File "/usr/local/lib/python3.3/genericpath.py", line 18, in exists
    [Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173]     os.stat(path)
    [Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173] UnicodeEncodeError: 'ascii' codec can't encode characters in position 34-37: ordinal not in range(128)

The assert is there to make sure that the path/to/file exists after the rename, but why are we still getting those UnicodeEncodeErrors?
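The traceback shows os.stat() failing while encoding a str path with the 'ascii' codec, which suggests (an assumption, since the server environment is not shown) that the CGI process runs with an ASCII filesystem encoding. One way to sidestep that is to stay in bytes for every filesystem call. The sketch below is not the poster's code; `to_utf8_bytes`, `clean_directory` and `CANDIDATES` are illustrative names.

```python
import os

# Encodings to try, in order. latin-1 accepts every byte, so it goes last.
CANDIDATES = ('utf-8', 'iso-8859-7', 'latin-1')

def to_utf8_bytes(raw):
    """Return the bytes name re-encoded as UTF-8, guessing its encoding."""
    for encoding in CANDIDATES:
        try:
            return raw.decode(encoding).encode('utf-8')
        except UnicodeDecodeError:
            continue
    # Unreachable while latin-1 is in CANDIDATES; kept as a safety net.
    raise ValueError('unable to clean filename %r' % raw)

def clean_directory(path):
    """Rename every entry of path (a bytes path) to a UTF-8 encoded name."""
    for name in os.listdir(path):            # bytes in, bytes out
        old = os.path.join(path, name)
        new = os.path.join(path, to_utf8_bytes(name))
        if new != old:
            os.rename(old, new)
        # Checking the bytes path means nothing has to be encoded with
        # the process's (possibly ASCII) filesystem encoding.
        assert os.path.exists(new)
```

Calling `clean_directory(b'/home/nikos/public_html/data/apps/')` would mirror the loop above. Guessing encodings by trial is a heuristic, so a wrong guess still yields a valid UTF-8 name with the wrong characters.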
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Sun, Jun 9, 2013 at 7:21 AM, Νικόλαος Κούρας nikos.gr...@gmail.com wrote:
> Sorry for displaying my code so many times. I know I have exhausted you, but this is the last thing I am going to ask of you in this thread. We are very close to having this working.

You need to spend more time reading and less time frantically jumping around. Go read my post on Unicode; it answers several of the questions you posted in response to Steven's.

And please, don't use this list as your substitute for source control. Don't keep posting your code. Most of us are ignoring it already.

ChrisA
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On 08Jun2013 14:14, =?utf-8?B?zp3Or866zr/PgiDOk866z4EzM866?= nikos.gr...@gmail.com wrote: | Τη Σάββατο, 8 Ιουνίου 2013 10:01:57 μ.μ. UTC+3, ο χρήστης Steven D'Aprano έγραψε: | ASCII actually needs 7 bits to store a character. Since computers are | optimized to work with bytes, not bits, normally ASCII characters are | stored in a single byte, with one bit wasted. | | So ASCII and Unicode are 2 Encoding Systems currently in use. | How should i imagine them, visualize them? | Like tables 'A' = 65, 'B' = 66 and so on? Yes, that works. | But if i do then that would be the visualization of a 'charset' not of an encoding system. | What the diffrence of an encoding system and of a charset? And encoding system is the method or transcribing these values to bytes and back again. | ebcdic - ascii - unicode = al of them are encoding systems | greek-iso - latin-iso - utf8 - utf16 = all of them are charsets. No. EBCDIC and ASCII and Unicode and Greek-ISO (iso-8859-7) are all character sets. (1:1 mappings of characters to numbers/ordinals). And encoding is a way of writing these values to bytes. Decoding reads bytes and emits character values. Because all of EBCDIC, ASCII and the iso-8859-x characters sets fit in the range 0-255, they are usually transcribed (encoded) directly, one byte per ordinal. Unicode is much larger. It cannot be transcribed (encoded) as one bytes to one value. There are several ways of transcribing Unicode. UTF-8 is a popular and usually compact form, using one byte for values below 128 and and multiple bytes for higher values. | Why python interprets by default all given strings as unicode and | not ascii? because the former supports many positions while ascii | only 127 positions , hence can interpet only 127 different characters? Yes. [...] | Latin-1 is similar, except there are 256 positions. Greek ISO-8859-7 is | also similar, also 256 positions, but the characters are different. And | so on, with dozens of charsets. 
| | Latin has to display english chars(capital, small) + numbers + symbols. that would be 127 why 256? ASCII runs up to 127. Essentially English, numerals, control codes and various symbols. The iso-8859-x sets run to 255, and the upper 128 values map to characters popular in various regions. | greek = all of the above plus greek chars, no? So iso-8859-7 included the Greek characters. | And then there is Unicode, which includes *every* character is all of | those dozens of charsets. It has 1114111 positions (most are currently | unfilled). | | Shouldt the positions that Unicode has to use equal to the summary | of all available characters of all the languages of the worlds plus | numbers and special chars? why 1.000.000+ why the need for so many | positions? Narrow Unicode format (2 byted) can cover all ofmthe | worlds symbols. 2 bytes is not enough. Chinese alone has more glyphs than that. | An encoding is simply a program that takes a character and returns a | byte, or visa versa. For instance, the ASCII encoding will take character | 'A'. That is found at position 65, which is 0x41 in hexadecimal, so the | ASCII encoding turns character 'A' into byte 0x41, and visa versa. | | Why you say ASCII turn a character into HEX format and not as in binary format? Steven didn't say that. He said position 65. People often write bytes in hex (eg 0x41) because a byte always fits in a 2-character hex (16 x 16) and because often these values have binary-based subranges, and hex makes that more obvious. For example, 'A' is 0x41. 'a' is 0x61. So you can look at the hex code and almost visually know if you're dealing with upper or lower case, etc. | Isnt the latter the way bytes are stored into hdd, like 01010010101 etc? | Are they stored as hex instead or you just said so to avoid printing 0s and 1s? They're stored as bits at the gate level. But writing hex codes _in_ _text_ is more compact, and more readable for humans. 
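A quick sketch of the point about hex versus binary notation: both are just ways of writing the same number in text. The helper name `describe` is made up for this example.

```python
def describe(ch):
    """Return the ordinal of ch in decimal, hex text and 8-bit binary text."""
    n = ord(ch)
    return n, hex(n), format(n, '08b')

for ch in 'Aa':
    print(ch, describe(ch))

# Upper and lower case ASCII letters differ in a single bit, 0x20,
# which is easier to spot in hex (0x41 vs 0x61) than in decimal (65 vs 97).
print(hex(ord('A') ^ ord('a')))
```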
Cheers, -- Cameron Simpson c...@zip.com.au A lot of people don't know the difference between a violin and a viola, so I'll tell you. A viola burns longer. - Victor Borge -- http://mail.python.org/mailman/listinfo/python-list
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On 9/6/2013 1:32 πμ, Cameron Simpson wrote: On 08Jun2013 14:14, =?utf-8?B?zp3Or866zr/PgiDOk866z4EzM866?= nikos.gr...@gmail.com wrote: | Τη Σάββατο, 8 Ιουνίου 2013 10:01:57 μ.μ. UTC+3, ο χρήστης Steven D'Aprano έγραψε: | ASCII actually needs 7 bits to store a character. Since computers are | optimized to work with bytes, not bits, normally ASCII characters are | stored in a single byte, with one bit wasted. | | So ASCII and Unicode are 2 Encoding Systems currently in use. | How should i imagine them, visualize them? | Like tables 'A' = 65, 'B' = 66 and so on? Yes, that works. | But if i do then that would be the visualization of a 'charset' not of an encoding system. | What the diffrence of an encoding system and of a charset? And encoding system is the method or transcribing these values to bytes and back again. So we have: ( 'A' mapped to the value of '65' ) = encoding process(i.e. uf-8) = bytes bytes = decoding process(i.e. utf-8) = ( '65' mapped to character 'A' ) Why does every character in a character set needs to be associated with a numeric value? I mean couldn't we just have characters sets that wouldn't have numeric associations like: 'A' = encoding process(i.e. uf-8) = bytes bytes = decoding process(i.e. utf-8) = character 'A' EBCDIC and ASCII and Unicode and Greek-ISO (iso-8859-7) are all character sets. (1:1 mappings of characters to numbers/ordinals). And encoding is a way of writing these values to bytes. Decoding reads bytes and emits character values. Because all of EBCDIC, ASCII and the iso-8859-x characters sets fit in the range 0-255, they are usually transcribed (encoded) directly, one byte per ordinal. Unicode is much larger. It cannot be transcribed (encoded) as one bytes to one value. There are several ways of transcribing Unicode. UTF-8 is a popular and usually compact form, using one byte for values below 128 and and multiple bytes for higher values. An ordinal = ordered numbers like 7,8,910 and so on? 
Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for values up to 256? UTF-8 and UTF-16 and UTF-32 I though the number beside of UTF- was to declare how many bits the character set was using to store a character into the hdd, no? Narrow Unicode uses two bytes per character. Since two bytes is only enough for about 65,000 characters, not 1,000,000+, the rest of the characters are stored as pairs of two-byte surrogates. Can you please explain this line the rest of thecharacters are stored as pairs of two-byte surrogates more easily for me to understand it? I'm still having troubl understanding what a surrogate is. Again, thank you very much for explaining the encodings to me, they were giving me trouble for years in all of my scripts. And one last thing. When locale to linux system is set to utf-8 that would mean that the linux applications, should try to encode string into hdd by using system's default encoding to utf-8 nad read them back from bytes by also using utf-8. Is that correct? -- Webhost http://superhost.gr Weblog http://psariastonafro.wordpress.com -- http://mail.python.org/mailman/listinfo/python-list
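For the surrogate question, here is a rough sketch of how UTF-16 splits a code point that does not fit in 16 bits into two 16-bit units, checked against Python's own codec. The function name `surrogate_pair` is hypothetical. The last lines also show why UTF-8 cannot use a single byte above 127: the high bit is reserved for marking multi-byte sequences.

```python
def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 (high, low) pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('code point fits in 16 bits or is out of range')
    offset = code_point - 0x10000           # 20 bits of payload remain
    high = 0xD800 + (offset >> 10)          # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits
    return high, low

high, low = surrogate_pair(0x1F600)
print(hex(high), hex(low))

# The same two units appear in Python's big-endian UTF-16 output.
print('\U0001F600'.encode('utf-16-be'))

# 'é' is code point 233 (below 256), yet UTF-8 needs two bytes for it.
print(ord('é'), 'é'.encode('utf-8'))
```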
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Saturday, 8 June 2013 9:47:53 PM UTC+3, Chris Angelico wrote:
> Fortunately, Python lets us hide away pretty much all those details, just as it lets us hide away the details of what makes up a list, a dictionary, or an integer. You can safely assume that the string "foo" is a string of three characters, which you can work with as characters. The chr() and ord() functions let you switch between characters and numbers, and str.encode() and bytes.decode() let you switch between characters and byte sequences. Once you get your head around the differences between those three, it all works fairly neatly.

I'm trying to! So, chr('A') would give me the mapping of this char, the number 65, while ord(65) would output the char 'A' likewise.

> and str.encode() and bytes.decode() let you switch between characters and byte sequences.

What would happen if we try to re-encode bytes on the disk? Like trying:

    s = "νίκος"
    utf8_bytes = s.encode('utf-8')
    greek_bytes = utf8_bytes.encode('iso-8859-7')

Can we re-encode twice, or as many times as we want, and then decode back respectively, like this?

    utf8_bytes = greek_bytes.decode('iso-8859-7')
    s = utf8_bytes.decode('utf-8')

Is something like that totally crazy?

And also, is there a difference between encoding and compressing? Isn't the latter using some form of encoding to make a string or bytes take less space on disk?
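As a sketch of the directions involved: ord() goes from character to number and chr() from number to character, and in Python 3 only str has .encode() while only bytes has .decode(). Changing encodings therefore means decoding to str first and encoding again. The helper name `transcode` is illustrative only.

```python
def transcode(data, source, target):
    """Re-encode bytes from one encoding to another via a str."""
    return data.decode(source).encode(target)

print(ord('A'), chr(65))          # character -> number, number -> character

s = 'νίκος'
utf8_bytes = s.encode('utf-8')
greek_bytes = transcode(utf8_bytes, 'utf-8', 'iso-8859-7')
print(len(utf8_bytes), len(greek_bytes))

# bytes objects have no .encode() method in Python 3.
print(hasattr(utf8_bytes, 'encode'))

# Decoding with the matching encoding gets the original text back.
print(greek_bytes.decode('iso-8859-7') == s)
```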
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Friday, 7 June 2013 4:25:40 AM UTC+3, Steven D'Aprano wrote:
> MRAB tells you to work with the bytes, because the filenames' bytes are invalid decoded as UTF-8. If you fix the file names by renaming using a terminal set to UTF-8, then they will be valid and you can forget about working with bytes.

Yes, but 'putty' seems to always forget when I tell it to use utf8 for displaying and always picks up Win8's default charset, and it doesn't have a save-options dialog. I can't always remember to switch to the utf8 charset, or keep renaming so many Greek filenames from the terminal.

> Working with bytes is only for when the file names are turned to garbage. Your file names (some of them) are turned to garbage. Fix them, and then use file names as strings.

Can't. '~/data/apps/' is filled every day with more and more files which are uploaded via the FileZilla client, which I think behaves pretty much like putty, uploading filenames as greek-iso bytes. So that garbage will happen every day due to the 'Putty' and 'FileZilla' clients. So files.py, before doing its stuff, must do the automatic conversions from greek bytes to utf-8 bytes. Here is what I have up until now.

    # Collect filenames of the path dir as bytes
    filename_bytes = os.listdir( b'/home/nikos/public_html/data/apps/' )

    # Iterate over all filenames in the path dir
    for filename in filenames_bytes:
        # Compute 'path/to/filename' in bytes
        filepath_bytes = b'/home/nikos/public_html/data/apps/' + b'filename'

        try:
            filepath = filepath_bytes.decode('utf-8')
        except UnicodeDecodeError:
            try:
                filepath = filepath_bytes.decode('iso-8859-7')

                # Rename filename from greek bytes => utf-8 bytes
                os.rename( filepath_bytes filepath.encode('utf-8') )
            except UnicodeDecodeError:
                print "I give up! This filename is unreadable!"

This is the best I can come up with, but after:

    ni...@superhost.gr [~/www/cgi-bin]# python files.py
      File "files.py", line 75
        os.rename( filepath_bytes filepath.encode('utf-8') )
                                         ^
    SyntaxError: invalid syntax
    ni...@superhost.gr [~/www/cgi-bin]#

I am seeing the caret pointing at filepath but I can't follow what it tries to tell me. No parenthesis was missed or added this time due to speed and tiredness. This rename statement tries to convert the greek byted filepath to a utf-8 byted filepath. I can't see why this is wrong though.
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Fri, Jun 7, 2013 at 4:35 PM, nagia.rets...@gmail.com wrote:
> Yes, but 'putty' seems to always forget when I tell it to use utf8 for displaying and always picks up Win8's default charset, and it doesn't have a save-options dialog. I can't always remember to switch to the utf8 charset, or keep renaming so many Greek filenames from the terminal.

I use PuTTY too (though that'll change when I next upgrade Traal, as I'll no longer have any Windows clients), and it's set to UTF-8 in the Window|Translation page. Far as I know, those settings are all saved into the Saved Sessions settings, back on the Session page.

ChrisA
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On 7/6/2013 4:01 πμ, Cameron Simpson wrote: On 06Jun2013 11:46, =?utf-8?B?zp3Or866zr/PgiDOk866z4EzM866?= nikos.gr...@gmail.com wrote: | Τη Πέμπτη, 6 Ιουνίου 2013 3:44:52 μ.μ. UTC+3, ο χρήστης Steven D'Aprano έγραψε: | py s = '999-Eυχή-του-Ιησού' | py bytes_as_utf8 = s.encode('utf-8') | py t = bytes_as_utf8.decode('iso-8859-7', errors='replace') | py print(t) | 999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ | | errors='replace' mean dont break in case or error? Yes. The result will be correct for correct iso-8859-7 and slightly mangled for something that would not decode smoothly. How can it be correct? We have encoded out string in utf-8 and then we tried to decode it as greek-iso? How can this possibly be correct? | You took the unicode 's' string you utf-8 bytestringed it. | Then how its possible to ask for the utf8-bytestring to decode | back to unicode string with the use of a different charset that the | one used for encoding and thsi actually printed the filename in | greek-iso? It is easily possible, as shown above. Does it make sense? Normally not, but Steven is demonstrating how your mv exercises have behaved: a rename using utf-8, then a _display_ using iso-8859-7. Same as above, i don't understand it at all, since different charsets(encodings) used in the encode/decode process. | | a) WHAT does it mean when a linux system is set to use utf-8? It means the locale settings _for the current process_ are set for UTF-8. The locale command will show you the current state. That means that, when a linux application needs to saved a filename to the linux filesystem, the app checks the filesytem's 'locale', so to encode the filename using the utf-8 charset ? And likewise when a linux application wants to decode a filename is also checking the filesystem's 'locale' setting so to know what charset must use to decode the filename correctly back to the original string? 
So locale is used for filesystem itself and linux apps to know how to read(decode) and write(enode) filenames from/into the system's hdd? | c) WHAT happens when the two of them try to work together? If everything matches, it is all good. If the locales do not match, the mismatch will result in an undesired bytes-characters encode/decode step somewhere, and something will display incorrectly or be entered as input incorrectly. Cant quite grasp the idea: local end: Win8, locale = greek-iso remote end: CentOS 6.4, locale = utf-8 FileZilla by default uses do not know what charset to upload filenames Putty by default uses greek-iso to display filenames WHAT someone can expect to happen when all of the above work together? Mess of course, but i want to hear in detail each step of the mess as it emerges. -- Webhost http://superhost.gr Weblog http://psariastonafro.wordpress.com -- http://mail.python.org/mailman/listinfo/python-list
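A sketch of the mismatch being asked about, assuming the name was written as UTF-8 bytes and later displayed by something that believes the bytes are ISO-8859-7. The sample string is chosen here so that every byte happens to be defined in ISO-8859-7; other strings may hit undefined bytes and fail or lose data.

```python
s = 'νίκος'
stored = s.encode('utf-8')                 # what lands on disk: 10 bytes

# A viewer set to ISO-8859-7 maps each byte to one character,
# so the 10 bytes show up as 10 wrong characters (mojibake).
shown = stored.decode('iso-8859-7')
print(len(s), len(stored), len(shown))
print(shown == s)

# Nothing was lost here: undoing the wrong decode recovers the bytes,
# and decoding them with the right encoding recovers the text.
recovered = shown.encode('iso-8859-7').decode('utf-8')
print(recovered == s)
```

The bytes on disk never change in this scenario; only the interpretation applied when displaying them differs.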
Re: Changing filenames from Greeklish = Greek (subprocess complain)
nagia.rets...@gmail.com writes:
>   File "files.py", line 75
>     os.rename( filepath_bytes filepath.encode('utf-8') )
>                                      ^
> SyntaxError: invalid syntax
>
> I am seeing the caret pointing at filepath but I can't follow what it tries to tell me.

As already explained, often a SyntaxError is introduced by *preceding* text, so you must look at your code with a wider eye.

> This rename statement tries to convert the greek byted filepath to a utf-8 byted filepath.

Yes: and that usually implies that the *function* accepts (at least) *two* arguments, specifically the source and the target names, right? How many arguments are you actually giving to the os.rename() function above?

> I can't see why this is wrong though.

Try harder. I won't give you further hints about your SyntaxErrors; you *must* learn how to detect and fix those by yourself.

ciao, lele.
--
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
l...@metapensiero.it  |                 -- Fortunato Depero, 1929.
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Friday, 7 June 2013 9:46:53 AM UTC+3, Chris Angelico wrote:
> On Fri, Jun 7, 2013 at 4:35 PM, nagia.rets...@gmail.com wrote:
>> Yes, but 'putty' seems to always forget when I tell it to use utf8 for displaying and always picks up Win8's default charset, and it doesn't have a save-options dialog. I can't always remember to switch to the utf8 charset, or keep renaming so many Greek filenames from the terminal.
>
> I use PuTTY too (though that'll change when I next upgrade Traal, as I'll no longer have any Windows clients), and it's set to UTF-8 in the Window|Translation page. Far as I know, those settings are all saved into the Saved Sessions settings, back on the Session page.
>
> ChrisA

Session settings, afaik, are for putty to remember hosts to connect to, not terminal options. I might be wrong though. No matter how many times I change its options, the next time I run it, it always defaults back.

I'll google Traal right now.

You should also take a look at the 'Secure Shell' extension for Chrome that I just found. It seems a great plugin for Chrome. You'll definitely like it, I did!
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Fri, Jun 7, 2013 at 5:08 PM, Νικόλαος Κούρας nikos.gr...@gmail.com wrote:
> I'll google Traal right now.

The one thing you're actually willing to go research, and it's actually something that won't help you. Traal is the name of my personal laptop. Spend your Googletrons on something else. :)

ChrisA
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Friday, 7 June 2013 10:09:29 AM UTC+3, Lele Gaifax wrote:
> As already explained, often a SyntaxError is introduced by *preceding* text, so you must look at your code with a wider eye.

That's what I hate about error reporting. You have some syntax error someplace and the error report points you at another line, so you have to check the whole code again. Well, I just did, and I see no syntactical errors.

> Yes: and that usually implies that the *function* accepts (at least) *two* arguments, specifically the source and the target names, right? How many arguments are you actually giving to the os.rename() function above?

I'm giving it two.

    os.rename( filepath_bytes filepath.encode('utf-8') )

    1st = filepath_bytes
    2nd = filepath.encode('utf-8')

Source and Target respectively.
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Jun 7, 2013, at 8:32, Νικόλαος Κούρας nikos.gr...@gmail.com wrote:
> On Friday, 7 June 2013 10:09:29 AM UTC+3, Lele Gaifax wrote:
>> As already explained, often a SyntaxError is introduced by *preceding* text, so you must look at your code with a wider eye.
>
> That's what I hate about error reporting. You have some syntax error someplace and the error report points you at another line, so you have to check the whole code again. Well, I just did, and I see no syntactical errors.
>
>> Yes: and that usually implies that the *function* accepts (at least) *two* arguments, specifically the source and the target names, right? How many arguments are you actually giving to the os.rename() function above?
>
> I'm giving it two.
>
>     os.rename( filepath_bytes filepath.encode('utf-8')

Missing comma, which is, after all, just a matter of syntax so it can't matter, right?

>     1st = filepath_bytes
>     2nd = filepath.encode('utf-8')
>
> Source and Target respectively.
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On 7/6/2013 10:42 AM, Michael Weylandt wrote:
>     os.rename( filepath_bytes filepath.encode('utf-8')
>
> Missing comma, which is, after all, just a matter of syntax so it can't matter, right?

I doubted that os.rename arguments must be comma separated. But after reading the docs:

    os.rename(src, dst)    http://docs.python.org/2/library/os.html#os.rename

    Rename the file or directory src to dst. If dst is a directory, OSError (http://docs.python.org/2/library/exceptions.html#exceptions.OSError) will be raised. On Unix, if dst exists and is a file, it will be replaced silently if the user has permission. The operation may fail on some Unix flavors if src and dst are on different filesystems. If successful, the renaming will be an atomic operation (this is a POSIX requirement). On Windows, if dst already exists, OSError will be raised even if it is a file; there may be no way to implement an atomic rename when dst names an existing file.

    Availability: Unix, Windows.

Indeed it has to be:

    os.rename( filepath_bytes, filepath.encode('utf-8')

'mv source target' didn't require commas, so I thought it was safe to assume that os.rename did not either.

I am happy to announce that after correcting many idiotic errors like commas, missing colons and declaring of variables, this surrogate error is the last one I get. I still don't understand what surrogate means. In English it means replacement.
Here is the code:

    #
    # Collect filenames of the path dir as bytes
    filename_bytes = os.listdir( b'/home/nikos/public_html/data/apps/' )

    # Iterate over all filenames in the path dir
    for filename in filename_bytes:
        # Compute 'path/to/filename' in bytes
        filepath_bytes = b'/home/nikos/public_html/data/apps/' + b'filename'

        try:
            filepath = filepath_bytes.decode('utf-8')
        except UnicodeDecodeError:
            try:
                filepath = filepath_bytes.decode('iso-8859-7')

                # Rename current filename from greek bytes => utf-8 bytes
                os.rename( filepath_bytes, filepath.encode('utf-8') )
            except UnicodeDecodeError:
                print( '''I give up! This filename is unreadable! ''')

    #
    # Get filenames of the apps directory as unicode
    filenames = os.listdir( '/home/nikos/public_html/data/apps/' )

    # Load'em
    for filename in filenames:
        try:
            # Check the presence of a file against the database and insert if it doesn't exist
            cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
            data = cur.fetchone()    # filename is unique, so should only be one

            if not data:
                # First time for file; primary key is automatic, hit is defaulted
                cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
        except pymysql.ProgrammingError as e:
            print( repr(e) )

    #
    filenames = os.listdir( '/home/nikos/public_html/data/apps/' )
    filenames = ()

    # Build a set of 'path/to/filename' based on the objects of path dir
    for filename in filenames:
        filenames.add( filename )

    # Delete spurious
    cur.execute('''SELECT url FROM files''')
    data = cur.fetchall()

    # Check database's filenames against path's filenames
    for filename in data:
        if filename not in filenames:
            cur.execute('''DELETE FROM files WHERE url = %s''', (filename,) )

    [Fri Jun 07 11:08:17 2013] [error] [client 79.103.41.173]   File "/home/nikos/public_html/cgi-bin/files.py", line 88, in <module>
    [Fri Jun 07 11:08:17 2013] [error] [client 79.103.41.173]     cur.execute('''SELECT url FROM files WHERE url = %s''', filename )
    [Fri Jun 07 11:08:17 2013] [error] [client 79.103.41.173]   File "/usr/local/lib/python3.3/site-packages/PyMySQL3-0.5-py3.3.egg/pymysql/cursors.py", line 108, in execute
    [Fri Jun 07 11:08:17 2013] [error] [client 79.103.41.173]     query = query.encode(charset)
    [Fri Jun 07 11:08:17 2013] [error] [client 79.103.41.173] UnicodeEncodeError: 'utf-8' codec can't encode character '\udcce' in position 35: surrogates not allowed

--
Webhost http://superhost.gr
Weblog http://psariastonafro.wordpress.com
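The '\udcce' in that error is a lone surrogate. Python 3 uses the 'surrogateescape' error handler (PEP 383) to smuggle undecodable filename bytes into str objects, and a strict UTF-8 encode later refuses them. The sketch below reproduces the effect without touching the filesystem; it assumes a UTF-8 filesystem encoding, and the sample name is made up.

```python
raw = 'νίκος'.encode('iso-8859-7')          # a Greek-ISO file name, as bytes

# Roughly what os.listdir() with a str path does with such a name when
# the filesystem encoding is UTF-8: each bad byte 0xNN becomes U+DCNN.
name = raw.decode('utf-8', 'surrogateescape')
print(ascii(name))

# Strict UTF-8 encoding, as a database driver would do, rejects it.
try:
    name.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError as exc:
    strict_ok = False
    print(exc.reason)

# The same handler turns the str back into the original bytes.
print(name.encode('utf-8', 'surrogateescape') == raw)
```

So a "surrogate" here is a placeholder for a byte that could not be decoded, which is why such names should be repaired (or handled as bytes) before being passed to code that encodes strictly.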
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Fri, Jun 7, 2013 at 9:10 AM, Νικόλαος Κούρας nikos.gr...@gmail.com wrote:
> On 7/6/2013 10:42 AM, Michael Weylandt wrote:
>>     os.rename( filepath_bytes filepath.encode('utf-8')
>>
>> Missing comma, which is, after all, just a matter of syntax so it can't matter, right?
>
> I doubted that os.rename arguments must be comma separated.

All function calls in Python require commas if you are putting in more than one argument. [0]

> But after reading the docs:
>
>     os.rename(src, dst)
>
>     Rename the file or directory src to dst. If dst is a directory, OSError will be raised. On Unix, if dst exists and is a file, it will be replaced silently if the user has permission. The operation may fail on some Unix flavors if src and dst are on different filesystems. If successful, the renaming will be an atomic operation (this is a POSIX requirement). On Windows, if dst already exists, OSError will be raised even if it is a file; there may be no way to implement an atomic rename when dst names an existing file.
>
>     Availability: Unix, Windows.
>
> Indeed it has to be:
>
>     os.rename( filepath_bytes, filepath.encode('utf-8')

Parenthesis missing here as well.

> 'mv source target' didn't require commas, so I thought it was safe to assume that os.rename did not either.

That's for shell programming -- different language entirely.

The surrogate business is back to Unicode, which ain't my specialty so I'll leave that to more able programmers.

MW

[0] You could pass multiple arguments by way of a tuple or dictionary using */** but if you want arguments that aren't in the container being passed, you're back to needing commas.
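A short sketch of the comma rule and of footnote [0]. The function `rename_plan` is a made-up stand-in that only formats a message, so nothing is actually renamed; `is_valid_call` is likewise hypothetical and just asks the compiler whether a call expression parses.

```python
def rename_plan(src, dst):
    """Describe a rename without performing it."""
    return 'rename %r -> %r' % (src, dst)

def is_valid_call(source):
    """Return True if source compiles as a Python expression."""
    try:
        compile(source, '<example>', 'eval')
    except SyntaxError:
        return False
    return True

print(is_valid_call('rename_plan(a b)'))    # missing comma
print(is_valid_call('rename_plan(a, b)'))   # comma-separated arguments

# Unpacking a tuple or a dict, as described in footnote [0].
args = (b'old.txt', b'new.txt')
kwargs = {'src': b'old.txt', 'dst': b'new.txt'}
print(rename_plan(*args))
print(rename_plan(**kwargs))
```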
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On 07Jun2013 11:10, Νίκος Γκρ33κ nikos.gr...@gmail.com wrote:
| On 7/6/2013 10:42 AM, Michael Weylandt wrote:
| os.rename( filepath_bytes filepath.encode('utf-8')
| Missing comma, which is, after all, just a matter of syntax so it can't matter, right?
|
| I doubted that os.rename arguments must be comma separated.

Why? Every other Python function separates arguments with commas.

| 'mv source target' didn't require commas so I thought it was safe to assume that os.rename did not either.

mv is shell syntax. os.rename is Python syntax. Two totally separate languages.
--
Cameron Simpson c...@zip.com.au

Cynic, n. A blackguard whose faulty vision sees things as they are, not as they ought to be. Ambrose Bierce (1842-1914), U.S. author. The Devil's Dictionary (1881-1906).
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On 07Jun2013 09:56, =?utf-8?B?zp3Or866zr/PgiDOk866z4EzM866?= nikos.gr...@gmail.com wrote: | On 7/6/2013 4:01 πμ, Cameron Simpson wrote: | On 06Jun2013 11:46, =?utf-8?B?zp3Or866zr/PgiDOk866z4EzM866?= nikos.gr...@gmail.com wrote: | | Τη Πέμπτη, 6 Ιουνίου 2013 3:44:52 μ.μ. UTC+3, ο χρήστης Steven D'Aprano έγραψε: | | py s = '999-Eυχή-του-Ιησού' | | py bytes_as_utf8 = s.encode('utf-8') | | py t = bytes_as_utf8.decode('iso-8859-7', errors='replace') | | py print(t) | | 999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ | | | | errors='replace' mean dont break in case or error? | | Yes. The result will be correct for correct iso-8859-7 and slightly mangled | for something that would not decode smoothly. | | How can it be correct? We have encoded out string in utf-8 and then | we tried to decode it as greek-iso? How can this possibly be | correct? Ok, not correct. But consistent. Safe to call. If it is a valid iso-8859-7 sequence (which might cover everything, since I expect it is an 8-bit 1:1 mapping from bytes values to a set of codepoints, just like iso-8859-1) then it may decode to the wrong characters, but the reverse process (characters encoded as bytes) should produce the original bytes. With a mapping like this, errors='replace' may mean nothing; there will be no errors because the only Unicode characters in play are all from iso-8859-7 to start with. Of course another string may not be safe. | | You took the unicode 's' string you utf-8 bytestringed it. | | Then how its possible to ask for the utf8-bytestring to decode | | back to unicode string with the use of a different charset that the | | one used for encoding and thsi actually printed the filename in | | greek-iso? | | It is easily possible, as shown above. Does it make sense? Normally | not, but Steven is demonstrating how your mv exercises have | behaved: a rename using utf-8, then a _display_ using iso-8859-7. | | Same as above, i don't understand it at all, since different | charsets(encodings) used in the encode/decode process. 
Visually, the names will be garbage. And if you go:

    mv '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ.mp3' '999-Eυχή-του-Ιησού.mp3'

while using the iso-8859-7 locale, the wrong thing will occur (assuming it even works, though I think it should because all these characters are represented in iso-8859-7, yes?)

Why? In the iso-8859-7 locale, your file (currently named under a utf-8 regime) looks like '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ.mp3' (because the UTF-8 byte sequence maps to those characters in iso-8859-7). When you issue the above mv command, the new name will be the _iso-8859-7_ bytes encoding for '999-Eυχή-του-Ιησού.mp3'. Which, under a utf-8 regime, will decode to _other_ characters.

If you want to repair filenames, by which I mean cause them to be correctly encoded for utf-8, you are best off working in utf-8 (using mv or python). Of course, the badly named files will then look wrong in your listing.

If you _know_ the filenames were written using iso-8859-7 encoding, and that the names are right under that encoding, you can write python code to rename them to utf-8.

Totally untested example code:

    import os
    import sys
    from binascii import hexlify

    # List the current directory as bytes so no implicit decoding happens.
    for bytename in os.listdir(b'.'):
        # Interpret the stored bytes as iso-8859-7, then re-encode as utf-8.
        unicode_name = bytename.decode('iso-8859-7')
        new_bytename = unicode_name.encode('utf-8')
        print("%s: %s = %s" % (unicode_name, hexlify(bytename), hexlify(new_bytename)), file=sys.stderr)
        os.rename(bytename, new_bytename)

That code should not care what locale you are using because it uses bytes for the file calls and is explicit about the encoding/decoding steps.

| | a) WHAT does it mean when a linux system is set to use utf-8?
| |
| | It means the locale settings _for the current process_ are set for UTF-8. The locale command will show you the current state.
|
| That means that, when a linux application needs to save a filename to the linux filesystem, the app checks the filesystem's 'locale', so as to encode the filename using the utf-8 charset?

At the command line, many will not. They'll just read and write bytes.
Some will decode/encode. Those that do, should by default use the current locale. But broadly, it is GUI apps that care about this because they must translate byte sequences to glyphs: images of characters. So plenty of command line tools do not need to care; the terminal application is the one that presents the names to you; _it_ will decode them for display. And it is the terminal app that translates your keystrokes into bytes to feed to the command line. NOTE: it is NOT the filesystem's locale. It is the current process' locale, which is deduced from environment variables (which have defaults if they are not set). Under Windows I believe filesystems have locales; this can prevent you storing some files on some filesystems on Windows, because the filesystem doesn't cope. UNIX just takes bytes. | And likewise when a linux application wants to decode a filename is | also checking the filesystem's 'locale' setting so to know what |
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Jun 7, 6:53 pm, Cameron Simpson c...@zip.com.au wrote: Experiment: LC_ALL=C ls -b LC_ALL=utf-8 ls -b LC_ALL=iso-8859-7 ls -b And the Terminal itself is decoding the output for display, and encoding your input keystrokes to feed as input to the command line. This reminded me of something I saw on stackoverflow recently: http://stackoverflow.com/questions/11735363/python3-unicodeencodeerror-only-when-run-from-crontab Script would run from shell but not from crontab, as the crontab environment had different locale settings. Solution was to prepend the correct LC_CTYPE to the command in the crontab. Would it be similar for httpd processes? -- http://mail.python.org/mailman/listinfo/python-list
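One way to check this rather than guess is to print what a given process actually sees. The sketch below uses only standard-library calls; the function name encoding_report is made up for illustration. Running it once from an interactive shell and once from the CGI script (or a cron job) would show whether the two environments differ.

```python
import locale
import os
import sys


def encoding_report():
    """Collect the locale-related settings visible to the current process."""
    return {
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "LC_CTYPE": os.environ.get("LC_CTYPE"),
        "preferred": locale.getpreferredencoding(False),
        "filesystem": sys.getfilesystemencoding(),
        "stdout": getattr(sys.stdout, "encoding", None),
    }


report = encoding_report()
for key, value in report.items():
    print(f"{key:10} {value}")
```

The environment variables are often unset for daemons and cron jobs, in which case the process falls back to its defaults.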
Re: Changing filenames from Greeklish = Greek (subprocess complain)
Τη Παρασκευή, 7 Ιουνίου 2013 11:53:04 π.μ. UTC+3, ο χρήστης Cameron Simpson έγραψε: On 07Jun2013 09:56, =?utf-8?B?zp3Or866zr/PgiDOk866z4EzM866?= nikos.gr...@gmail.com wrote: | On 7/6/2013 4:01 πμ, Cameron Simpson wrote: | On 06Jun2013 11:46, =?utf-8?B?zp3Or866zr/PgiDOk866z4EzM866?= nikos.gr...@gmail.com wrote: | | Τη Πέμπτη, 6 Ιουνίου 2013 3:44:52 μ.μ. UTC+3, ο χρήστης Steven D'Aprano έγραψε: | | py s = '999-Eυχή-του-Ιησού' | | py bytes_as_utf8 = s.encode('utf-8') | | py t = bytes_as_utf8.decode('iso-8859-7', errors='replace') | | py print(t) | | 999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ | | | | errors='replace' mean dont break in case or error? | | Yes. The result will be correct for correct iso-8859-7 and slightly mangled | for something that would not decode smoothly. | | How can it be correct? We have encoded out string in utf-8 and then | we tried to decode it as greek-iso? How can this possibly be | correct? If it is a valid iso-8859-7 sequence (which might cover everything, since I expect it is an 8-bit 1:1 mapping from bytes values to a set of codepoints, just like iso-8859-1) then it may decode to the wrong characters, but the reverse process (characters encoded as bytes) should produce the original bytes. With a mapping like this, errors='replace' may mean nothing; there will be no errors because the only Unicode characters in play are all from iso-8859-7 to start with. Of course another string may not be safe. Visually, the names will be garbage. And if you go: mv '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ.mp3' '999-Eυχή-του-Ιησού.mp3' while using the iso-8859-7 locale, the wrong thing will occur (assuming it even works, though I think it should because all these characters are represented in iso-8859-7, yes?) All the rest you i understood only the above quotes its still unclear to me. I cant see to understand it. Do you mean that utf-8, latin-iso, greek-iso and ASCII have the 1st 0-127 codepoints similar? For example char 'a' has the value of '65' for all of those character sets? 
Is hat what you mean? s = 'a' (This is unicode right? Why when we assign a string to a variable that string's type is always unicode and does not automatically become utf-8 which includes all available world-wide characters? Unicode is something different that a character set? ) utf8_byte = s.encode('utf-8') Now if we are to decode this back to utf8 we will receive the char 'a'. I beleive same thing will happen with latin, greek, ascii isos. Correct? utf8_a = utf8_byte.decode('iso-8859-7') latin_a = utf8_byte.decode('iso-8859-1') ascii_a = utf8_byte.decode('ascii') utf8_a = utf8_byte.decode('iso-8859-7') Is this correct? All of those decodes will work even if the encoded bytestring was of utf8 type? The characters that will not decode correctly are those that their codepoints are greater that 127 ? for example if s = 'α' (greek character equivalent to english 'a') Is this what you mean? Now back to my almost ready files.py script please: # # Collect filenames of the path dir as bytes greek_filenames = os.listdir( b'/home/nikos/public_html/data/apps/' ) for filename in greek_filenames: # Compute 'path/to/filename' in bytes greek_path = b'/home/nikos/public_html/data/apps/' + b'filename' try: filepath = greek_path.decode('iso-8859-7') # Rename current filename from greek bytes -- utf-8 bytes os.rename( greek_path, filepath.encode('utf-8') ) except UnicodeDecodeError: # Since its not a greek bytestring then its a proper utf8 bytestring filepath = greek_path.decode('utf-8') # filenames = os.listdir( '/home/nikos/public_html/data/apps/' ) # Load'em for filename in filenames: try: # Check the presence of a file against the database and insert if it doesn't exist cur.execute('''SELECT url FROM files WHERE url = %s''', filename ) data = cur.fetchone() if not data: # First time for file; primary key is automatic, hit is defaulted cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) ) except pymysql.ProgrammingError as 
e: print( repr(e) ) # filenames = os.listdir( '/home/nikos/public_html/data/apps/' ) filepaths = () # Build a set of 'path/to/filename' based on the objects of path dir for filename in filenames: filepaths.add( filename ) # Delete spurious cur.execute('''SELECT url FROM files''') data = cur.fetchall() # Check database's filenames against path's filenames for rec in data: if
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On 07/06/2013 12:53, Νικόλαος Κούρας wrote: [snip] # # Collect filenames of the path dir as bytes greek_filenames = os.listdir( b'/home/nikos/public_html/data/apps/' ) for filename in greek_filenames: # Compute 'path/to/filename' in bytes greek_path = b'/home/nikos/public_html/data/apps/' + b'filename' try: This is a worse way of doing it because the ISO-8859-7 encoding has 1 byte per codepoint, meaning that it's more 'tolerant' (if that's the word) of errors. A sequence of bytes that is actually UTF-8 can be decoded as ISO-8859-7, giving gibberish. UTF-8 is less tolerant, and it's the encoding that ideally you should be using everywhere, so it's better to assume UTF-8 and, if it fails, try ISO-8859-7 and then rename so that any names that were ISO-8859-7 will be converted to UTF-8. That's the reason I did it that way in the code I posted, but, yet again, you've changed it without understanding why! filepath = greek_path.decode('iso-8859-7') # Rename current filename from greek bytes -- utf-8 bytes os.rename( greek_path, filepath.encode('utf-8') ) except UnicodeDecodeError: # Since its not a greek bytestring then its a proper utf8 bytestring filepath = greek_path.decode('utf-8') [snip] -- http://mail.python.org/mailman/listinfo/python-list
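The ordering argument above can be sketched as follows. This is an illustration, not the code MRAB posted earlier; the helper name decode_filename and the sample names are assumptions.

```python
def decode_filename(raw):
    """Try strict UTF-8 first, then ISO-8859-7; return (text, encoding)."""
    for encoding in ("utf-8", "iso-8859-7"):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return None, None


utf8_name = "Ευχή.mp3".encode("utf-8")
greek_name = "Ευχή.mp3".encode("iso-8859-7")

print(decode_filename(utf8_name))
print(decode_filename(greek_name))

# Trying ISO-8859-7 first can silently "succeed" on UTF-8 bytes.
gibberish = "του".encode("utf-8").decode("iso-8859-7")
print(repr(gibberish))
```

Only names reported as iso-8859-7 need renaming; names that already decode as UTF-8 can be left alone.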
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Fri, 07 Jun 2013 04:53:42 -0700, Νικόλαος Κούρας wrote: Do you mean that utf-8, latin-iso, greek-iso and ASCII have the 1st 0-127 codepoints similar? You can answer this yourself. Open a terminal window and start a Python interactive session. Then try it and see what happens: s = ''.join(chr(i) for i in range(128)) bytes_as_utf8 = s.encode('utf-8') bytes_as_latin1 = s.encode('latin-1') bytes_as_greek_iso = s.encode('ISO-8859-7') bytes_as_ascii = s.encode('ascii') bytes_as_utf8 == bytes_as_latin1 == bytes_as_greek_iso == bytes_as_ascii What result do you get? True or False? And now you know the answer, without having to ask. For example char 'a' has the value of '65' for all of those character sets? Is hat what you mean? You can answer that question yourself. c = 'a' for encoding in ('utf-8', 'latin-1', 'ISO-8859-7', 'ascii'): print(c.encode(encoding)) By the way, I believe that Python has made a strategic mistake in the way that bytes are printed. I think it leads to more confusion, not less. Better would be something like this: c = 'a' for encoding in ('utf-8', 'latin-1', 'ISO-8859-7', 'ascii'): print(hex(c.encode(encoding)[0])) For historical reasons, most (but not all) charsets are supersets of ASCII. That is, the first 128 characters in the charset are the same as the 128 characters in ASCII. s = 'a' (This is unicode right? Why when we assign a string to a variable that string's type is always unicode Strings in Python 3 are Unicode strings. That's just the way Python works. Unicode was chosen because Unicode includes over a million different characters (well, potentially over a million, most of them are currently unused), and is a strict superset of *all* common legacy codepages from the old DOS and Windows 95 days. and does not automatically become utf-8 which includes all available world-wide characters? Unicode is something different that a character set? ) Unicode is a character set. 
It is an enormous set of over one million characters (technically code points, but don't worry about the difference right now) which can be collected in strings.

UTF-8 is an encoding that goes from a string using the Unicode character set into bytes, and back again. Sometimes people are lazy and say UTF-8 when they mean Unicode, or vice versa. UTF-16 and UTF-32 are two different encodings for the same purpose, but for various technical reasons UTF-8 is better for files.

'λ' is a character which exists in some charsets but not others. It is not in the ASCII charset, nor is it in Latin-1, nor Big-5. It is in the ISO-8859-7 charset, and of course it is in Unicode.

In ISO-8859-7, the character 'λ' is stored as byte 0xEB (decimal 235), just as the character 'a' is stored as byte 0x61 (decimal 97).
In UTF-8, the character λ is stored as two bytes 0xCE 0xBB.
In UTF-16 (big-endian), the character λ is stored as two bytes 0x03 0xBB.
In UTF-32 (big-endian), the character λ is stored as four bytes 0x00 0x00 0x03 0xBB.

That's four different ways of spelling the same character as bytes, just as three, trois, drei, τρία, três are all different ways of spelling the same number 3.

> utf8_byte = s.encode('utf-8')
> Now if we are to decode this back to utf8 we will receive the char 'a'. I believe the same thing will happen with latin, greek, ascii isos. Correct?

Why don't you try it for yourself and see?

> The characters that will not decode correctly are those whose codepoints are greater than 127?

Maybe, maybe not. It depends on which codepoint, and which encodings. Some encodings use the same bytes for the same characters. Some encodings use different bytes. It all depends on the encoding, just like American and English both spell 3 three, while French spells it trois.
> for example if s = 'α' (greek character equivalent to english 'a')

In Latin-1, 'α' does not exist:

py> 'α'.encode('latin-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u03b1' in position 0: ordinal not in range(256)

In the legacy Greek charset, ISO-8859-7, 'α' is stored as byte 0xE1:

py> 'α'.encode('ISO-8859-7')
b'\xe1'

But in the legacy *Cyrillic* charset, ISO-8859-5, the byte 0xE1 means a completely different character, CYRILLIC SMALL LETTER ES:

py> b'\xE1'.decode('ISO-8859-5')
'с'

(don't be fooled that this looks like the English c, it is not the same).

In Unicode, 'α' is always codepoint 0x3B1 (decimal 945):

py> ord('α')
945

but before you can store that on a disk, or as a file name, it needs to be converted to bytes, and which bytes you get depends on which encoding you use:

py> 'α'.encode('utf-8')
b'\xce\xb1'
py> 'α'.encode('utf-16be')
b'\x03\xb1'
py> 'α'.encode('utf-32be')
b'\x00\x00\x03\xb1'

--
Steven
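The byte values above also explain the garbled names seen earlier in the thread. The sketch below (the helper name mojibake is made up) encodes a character as UTF-8 and then reads the bytes back as ISO-8859-7, which is what a terminal or browser set to the wrong charset effectively does.

```python
def mojibake(text, wrong="iso-8859-7"):
    """Encode as UTF-8, then decode with the wrong charset."""
    return text.encode("utf-8").decode(wrong, errors="replace")


# 'λ' is 0xCE 0xBB in UTF-8; both bytes are assigned in ISO-8859-7.
lam = mojibake("λ")

# 'ή' is 0xCE 0xAE in UTF-8; 0xAE is unassigned in ISO-8859-7,
# so errors='replace' substitutes U+FFFD for it.
eta = mojibake("ή")

# When every byte is assigned, re-encoding gives back the UTF-8 bytes.
lam_round_trip = lam.encode("iso-8859-7")

print(repr(lam), repr(eta), lam_round_trip)
```

So the wrong decode is reversible only when no byte falls on an unassigned position; once a replacement character appears, the original byte is lost.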
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Friday, 7 June 2013 5:29:25 PM UTC+3, MRAB wrote:
> This is a worse way of doing it because the ISO-8859-7 encoding has 1 byte per codepoint, meaning that it's more 'tolerant' (if that's the word) of errors. A sequence of bytes that is actually UTF-8 can be decoded as ISO-8859-7, giving gibberish. UTF-8 is less tolerant, and it's the encoding that ideally you should be using everywhere, so it's better to assume UTF-8 and, if it fails, try ISO-8859-7 and then rename so that any names that were ISO-8859-7 will be converted to UTF-8.

Indeed, I wasn't aware of that at the time; I was under the impression that a string encoded to bytes using some charset can only be switched back with that and only that charset. Since this is the case, here is my fix:

# # Collect filenames of the path dir as bytes filename_bytes = os.listdir( b'/home/nikos/public_html/data/apps/' ) for filename in filename_bytes: # Compute 'path/to/filename' into bytes filepath_bytes = b'/home/nikos/public_html/data/apps/' + b'filename' flag = False try: # Assume current file is utf8 encoded filepath = filepath_bytes.decode('utf-8') flag = 'utf8' except UnicodeDecodeError: try: # Since current filename is not utf8 encoded then it has to be greek-iso encoded filepath = filepath_bytes.decode('iso-8859-7') flag = 'greek' except UnicodeDecodeError: print( '''I give up!
File name is unreadable!''' ) if( flag = 'greek' ) # Rename filename from greek bytes -- utf-8 bytes os.rename( filepath_bytes, filepath.encode('utf-8') ) # filenames = os.listdir( '/home/nikos/public_html/data/apps/' ) # Load'em for filename in filenames: try: # Check the presence of a file against the database and insert if it doesn't exist cur.execute('''SELECT url FROM files WHERE url = %s''', filename ) data = cur.fetchone() if not data: # First time for file; primary key is automatic, hit is defaulted cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) ) except pymysql.ProgrammingError as e: print( repr(e) ) # filenames = os.listdir( '/home/nikos/public_html/data/apps/' ) filepaths = () # Build a set of 'path/to/filename' based on the objects of path dir for filename in filenames: filepaths.add( filename ) # Delete spurious cur.execute('''SELECT url FROM files''') data = cur.fetchall() # Check database's filenames against path's filenames for rec in data: if rec not in filepaths: cur.execute('''DELETE FROM files WHERE url = %s''', rec ) = ni...@superhost.gr [~/www/cgi-bin]# [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] File /home/nikos/public_html/cgi-bin/files.py, line 81 [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] if( flag == 'greek' ) [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] ^ [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] SyntaxError: invalid syntax [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] Premature end of script headers: files.py --- i dont know why that if statement errors. -- http://mail.python.org/mailman/listinfo/python-list
Re: Changing filenames from Greeklish = Greek (subprocess complain)
: On 7 June 2013 14:52, Νικόλαος Κούρας nikos.gr...@gmail.com wrote: File /home/nikos/public_html/cgi-bin/files.py, line 81 [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] if( flag == 'greek' ) [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] ^ [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] SyntaxError: invalid syntax [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] Premature end of script headers: files.py --- i dont know why that if statement errors. Oh for f... READ SOME DOCUMENTATION, FOR THE LOVE OF BOB!!! READ YOUR OWN EFFING CODE! Look at this: http://docs.python.org/2/tutorial/controlflow.html Read it now? Of course not. Go away and read it. Now have you read it? GO AND READ IT. What does an if statement end with? Hint: yep, that's it. -[]z. -- http://mail.python.org/mailman/listinfo/python-list
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On 07/06/2013 20:31, Zero Piraeus wrote: : On 7 June 2013 14:52, Νικόλαος Κούρας nikos.gr...@gmail.com wrote: File /home/nikos/public_html/cgi-bin/files.py, line 81 [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] if( flag == 'greek' ) [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] ^ [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] SyntaxError: invalid syntax [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] Premature end of script headers: files.py --- i dont know why that if statement errors. Oh for f... READ SOME DOCUMENTATION, FOR THE LOVE OF BOB!!! READ YOUR OWN EFFING CODE! Look at this: http://docs.python.org/2/tutorial/controlflow.html Read it now? Of course not. Go away and read it. Now have you read it? GO AND READ IT. What does an if statement end with? Hint: yep, that's it. Have you noticed how the line in the traceback doesn't match the line in the post? -- http://mail.python.org/mailman/listinfo/python-list
Re: Changing filenames from Greeklish = Greek (subprocess complain)
: On 7 June 2013 16:45, MRAB pyt...@mrabarnett.plus.com wrote: On 07/06/2013 20:31, Zero Piraeus wrote: [something exasperated, in capitals] Have you noticed how the line in the traceback doesn't match the line in the post? Actually, I hadn't. It's not exactly a surprise at this point, though ... I learnt a new word today, while searching for an apt ending to the sentence Reading Nikos' posts is the internet equivalent of ... ... and that word is Dermatillomania. -[]z. -- http://mail.python.org/mailman/listinfo/python-list
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On 07Jun2013 11:52, Νίκος Γκρ33κ nikos.gr...@gmail.com wrote:
| ni...@superhost.gr [~/www/cgi-bin]# [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] File /home/nikos/public_html/cgi-bin/files.py, line 81
| [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] if( flag == 'greek' )
| [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] ^
| [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] SyntaxError: invalid syntax
| [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] Premature end of script headers: files.py
| ---
| i dont know why that if statement errors.

Python statements that introduce a block (if, while, try etc) end in a colon, so:

    if flag == 'greek':

Cheers,
--
Cameron Simpson c...@zip.com.au

Hello, my name is Yog-Sothoth, and I'll be your eldritch horror today. - Heather Keith hkei...@andrew.cmu.edu
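Both forms of the line that appear in this thread (the one in the post and the one in the traceback) are syntax errors, for different reasons. A quick way to check a statement without running the whole script is the built-in compile(); the helper name compiles below is made up for illustration.

```python
def compiles(source):
    """Return True if the source text is syntactically valid Python."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


# As shown in the traceback: comparison is fine, but the colon is missing.
missing_colon = "flag = 'greek'\nif( flag == 'greek' )\n    pass\n"

# As shown in the posted script: '=' is assignment, not comparison.
assignment = "flag = 'greek'\nif( flag = 'greek' ):\n    pass\n"

# The corrected form.
corrected = "flag = 'greek'\nif flag == 'greek':\n    pass\n"

results = [compiles(missing_colon), compiles(assignment), compiles(corrected)]
print(results)
```

The parentheses around the condition are legal but unnecessary in Python.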
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On 07Jun2013 04:53, =?utf-8?B?zp3Or866zr/PgiDOk866z4EzM866?= nikos.gr...@gmail.com wrote: | Τη Παρασκευή, 7 Ιουνίου 2013 11:53:04 π.μ. UTC+3, ο χρήστης Cameron Simpson έγραψε: | | | errors='replace' mean dont break in case or error? | | | Yes. The result will be correct for correct iso-8859-7 and slightly mangled | | for something that would not decode smoothly. | | | How can it be correct? We have encoded out string in utf-8 and then | | we tried to decode it as greek-iso? How can this possibly be | | correct? | | If it is a valid iso-8859-7 sequence (which might cover everything, | since I expect it is an 8-bit 1:1 mapping from bytes values to a | set of codepoints, just like iso-8859-1) then it may decode to the | wrong characters, but the reverse process (characters encoded as | bytes) should produce the original bytes. With a mapping like this, | errors='replace' may mean nothing; there will be no errors because | the only Unicode characters in play are all from iso-8859-7 to start | with. Of course another string may not be safe. | | Visually, the names will be garbage. And if you go: |mv '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ.mp3' '999-Eυχή-του-Ιησού.mp3' | while using the iso-8859-7 locale, the wrong thing will occur | (assuming it even works, though I think it should because all these | characters are represented in iso-8859-7, yes?) | | All the rest you i understood only the above quotes its still unclear to me. | I cant see to understand it. | | Do you mean that utf-8, latin-iso, greek-iso and ASCII have the 1st 0-127 codepoints similar? Yes. It is certainly true for utf-8 and latin-iso and ASCII. I expect it to be so for greek-iso, but have not checked. They're all essentially the ASCII set plus a range of other character codepoints for the upper values. 
The 8-bit sets iso-8859-1 (which I take you to mean by latin-iso) and iso-8859-7 (which I take you to mean by greek-iso) are single-byte mappings with the top half mapped to characters commonly used in a particular region.

Unicode has a much greater range, but the UTF-8 encoding of Unicode deliberately has the bottom 0-127 identical to ASCII, and higher values represented by multibyte sequences in which every byte is in the 128-255 range. In this way pure ASCII files are already in UTF-8 (and, in fact, work just fine for the iso-8859-x encodings as well).

| For example char 'a' has the value of '65' for all of those character sets?
| Is that what you mean?

Yes, the value is the same in all of them (although 'a' is actually 97; 65 is 'A').

| s = 'a' (This is unicode right? Why when we assign a string to
| a variable that string's type is always unicode and does not
| automatically become utf-8 which includes all available world-wide
| characters? Unicode is something different than a character set? )

In Python 3, yes. Strings are unicode. Note that that means they are sequences of codepoints whose meaning is as for Unicode.

utf-8 is a byte encoding for Unicode strings. An external storage format, if you like. The first 0-127 codepoints are 1:1 with byte values, and the higher code points require multibyte sequences.

| utf8_byte = s.encode('utf-8')

Unicode string => utf-8 byte encoding.

| Now if we are to decode this back to utf8 we will receive the char 'a'.

Yes.

| I believe the same thing will happen with latin, greek, ascii isos. Correct?
|
| utf8_a = utf8_byte.decode('iso-8859-7')
| latin_a = utf8_byte.decode('iso-8859-1')
| ascii_a = utf8_byte.decode('ascii')
| utf8_a = utf8_byte.decode('iso-8859-7')
|
| Is this correct?

Yes, because of the design decision about the 0-127 codepoints.

| All of those decodes will work even if the encoded bytestring was of utf8 type?
|
| The characters that will not decode correctly are those whose codepoints are greater than 127 ?
| for example if s = 'α' (greek character equivalent to english 'a') | Is this what you mean? Yes, exactly so. | | | Now back to my almost ready files.py script please: | | | # | # Collect filenames of the path dir as bytes | greek_filenames = os.listdir( b'/home/nikos/public_html/data/apps/' ) | | for filename in greek_filenames: | # Compute 'path/to/filename' in bytes | greek_path = b'/home/nikos/public_html/data/apps/' + b'filename' You don't mean b'filename', which is the literal word filename. You mean: filename.encode('iso-8859-7') More probably, you mean: dirpath = b'/home/nikos/public_html/data/apps/' greek_filenames = os.listdir(dirpath) for greek_filename in greek_filenames: try: filename = greek_filename.decode('iso-8859-7') and then: greek_path = dirpath + greek_filename utf8_filename = filename.encode('utf-8') utf8_path = dirpath + utf8_filename | try: | filepath = greek_path.decode('iso-8859-7') | # Rename current filename from greek bytes -- utf-8 bytes | os.rename( greek_path, filepath.encode('utf-8')
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Thu, Jun 6, 2013 at 3:54 PM, jmfauth wxjmfa...@gmail.com wrote: (filesystems are just bytes, yeah, whatever...). Sure. You tell me what a proper Unicode rendition of an animated GIF is. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: Changing filenames from Greeklish = Greek (subprocess complain)
Yes, this is a Linux issue, although the locale is set to utf-8:

    root@nikos [~]# locale
    LANG=en_US.UTF-8
    LC_CTYPE=en_US.UTF-8
    LC_NUMERIC=en_US.UTF-8
    LC_TIME=en_US.UTF-8
    LC_COLLATE=en_US.UTF-8
    LC_MONETARY=en_US.UTF-8
    LC_MESSAGES=en_US.UTF-8
    LC_PAPER=en_US.UTF-8
    LC_NAME=en_US.UTF-8
    LC_ADDRESS=en_US.UTF-8
    LC_TELEPHONE=en_US.UTF-8
    LC_MEASUREMENT=en_US.UTF-8
    LC_IDENTIFICATION=en_US.UTF-8
    LC_ALL=
    root@nikos [~]#

Since the locale is set to utf-8, why does

    mv 'Euxi tou Ihsou.mp3' 'Ευχή του Ιησού.mp3'

lead to that unknown encoded bytestream

    '\305\365\367\336\\364\357\365\311\347\363\357\375.mp3'

which isn't a utf-8 bytestream, as the locale indicated and Python expected?

How is 'files.py' supposed to read this file now, using:

    # Compute a set of current fullpaths
    fullpaths = set()
    path = '/home/nikos/public_html/data/apps/'

    for root, dirs, files in os.walk(path):
        for fullpath in files:
            fullpaths.add( os.path.join(root, fullpath) )
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On 05.06.2013 18:44, MRAB wrote:
> From the previous posts I guessed that the filename might be encoded using ISO-8859-7:
>
> >>> s = b"\305\365\367\336\ \364\357\365\ \311\347\363\357\375.mp3"
> >>> s.decode("iso-8859-7")
> 'Ευχή\\ του\\ Ιησού.mp3'
>
> Yes, that looks the same.

Most probably, his terminal is set to ISO-8859-7, so that when he issues the rename command on the command line of his shell session, the mv command gets a stream of bytes as the new file name which happens to be the ISO-8859-7 encoding of the file name he'd like the file to have. This is what's stored on disk.

So, his biggest problem isn't that the operating system is encoding-agnostic wrt. filenames (i.e., treats them as a stream of bytes), but rather that he's using an ISO-8859-7 terminal window while having set up UTF-8 as his operating system locale, and expects filenames to be encoded in UTF-8 when he's not passing in UTF-8 byte streams from his client computer at all.

--
Heiko.
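The diagnosis can be checked directly on the bytes. This sketch uses the octal byte values quoted above, with plain spaces in place of the shell's backslash-escaped spaces (that substitution is mine); it shows that the stored name decodes cleanly as ISO-8859-7, is not valid UTF-8, and what the UTF-8 form would be.

```python
# Byte values from the thread, written as octal escapes.
raw = b"\305\365\367\336 \364\357\365 \311\347\363\357\375.mp3"

# Interpreted as ISO-8859-7, the bytes spell the intended Greek name.
greek = raw.decode("iso-8859-7")

# Interpreted as UTF-8, the same bytes are rejected.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# The bytes the file name should have under a UTF-8 regime.
utf8_bytes = greek.encode("utf-8")

print(greek, valid_utf8, utf8_bytes)
```

Renaming from raw to utf8_bytes (both as bytes paths) is the repair discussed elsewhere in the thread.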
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On 06/06/2013 07:11, Chris Angelico wrote: On Thu, Jun 6, 2013 at 3:54 PM, jmfauth wxjmfa...@gmail.com wrote: (filesystems are just bytes, yeah, whatever...). Sure. You tell me what a proper Unicode rendition of an animated GIF is. ChrisA It's obviously one that doesn't use the flawed Python Flexible String Representation :) -- Steve is going for the pink ball - and for those of you who are watching in black and white, the pink is next to the green. Snooker commentator 'Whispering' Ted Lowe. Mark Lawrence -- http://mail.python.org/mailman/listinfo/python-list
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On 05Jun2013 11:43, Νίκος Γκρ33κ nikos.gr...@gmail.com wrote:
| On Wednesday, 5 June 2013 9:32:15 PM UTC+3, MRAB wrote:
| Using Python, I think you could get the filenames using os.listdir,
| passing the directory name as a bytestring so that it'll return the
| names as bytestrings.
|
| Then, for each name, you could decode from its current encoding and
| encode to UTF-8 and rename the file, passing the old and new paths to
| os.rename as bytestrings.
|
| I am not sure I follow:
|
| Change this:
|
| # Compute a set of current fullpaths
| fullpaths = set()
| path = "/home/nikos/public_html/data/apps/"
|
| for root, dirs, files in os.walk(path):
[...]

Have a read of this: http://docs.python.org/3/library/os.html#os.listdir

The UNIX API accepts bytes for filenames and paths. Python 3 strs are sequences of Unicode code points. If you try to open a file or directory on a UNIX system using a Python str, that string must be converted to a sequence of bytes before being handed to the OS. This is done implicitly using your locale settings if you just use a str.

However, if you pass a bytes to open or listdir, this conversion does not take place. You put bytes in and, in the case of listdir, you get bytes out. You can work on pathnames in bytes and never concern yourself with encode/decode at all. In this way you can write code that does not care about the translation between Unicode and some arbitrary byte encoding.

Of course, the issue will still arise when accepting user input; your shell has done exactly this kind of thing when you renamed your MP3 file. But it is possible to write pure utility code that doesn't care about filenames as Unicode or str if you work purely in bytes.

Regarding user filenames, the common policy these days is to use utf-8 throughout. Of course you need to get everything into that regime to start with.

-- Cameron Simpson c...@zip.com.au
...but C++ gloggles the cheesewad, thus causing a type conflict.
- David Jevans, jev...@apple.com
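The bytes-in, bytes-out behaviour of os.listdir described above can be seen in a minimal sketch. The temporary directory and the file name example.txt are made up for illustration and do not come from the thread.

```python
import os
import tempfile

# Hypothetical scratch directory and file name, used only for illustration.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "example.txt"), "w").close()

    # str in, str out: names are decoded using the filesystem encoding.
    names_as_str = os.listdir(tmp)

    # bytes in, bytes out: no decoding takes place.
    names_as_bytes = os.listdir(os.fsencode(tmp))

print(names_as_str)
print(names_as_bytes)
```

Only the type of the argument changes between the two calls; the directory is the same.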
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Thursday, 6 June 2013 11:50:55 AM UTC+3, Heiko Wundram wrote:
On 05.06.2013 18:44, MRAB wrote:
From the previous posts I guessed that the filename might be encoded using ISO-8859-7:

    s = b"\305\365\367\336\ \364\357\365\ \311\347\363\357\375.mp3"
    s.decode("iso-8859-7")
    'Ευχή\\ του\\ Ιησού.mp3'

Yes, that looks the same.

Most probably, his terminal is set to ISO-8859-7, so that when he issues the rename command on the command-line of his shell session, the mv command gets a stream of bytes as the new file name which happens to be the ISO-8859-7 encoding of the file name he'd like the file to have. This is what's stored on disk.

So, his biggest problem isn't that the operating system is encoding agnostic wrt. filenames (i.e., treats them as a stream of bytes), but rather that he's using an ISO-7 terminal window when having set up UTF-8 as his operating system locale and expects filenames to be encoded in UTF-8 when he's not passing in UTF-8 byte streams from his client computer at all.

-- --- Heiko.

    ni...@superhost.gr [~/www/data/apps]# ls -l | file -
    /dev/stdin: ASCII text

    # Compute a set of current fullpaths
    fullpaths = set()
    path = "/home/nikos/public_html/data/apps/"
    for root, dirs, files in os.walk(path):
        for fullpath in files:
            fullpaths.add( os.path.join(root, fullpath) )

    [Thu Jun 06 13:34:19 2013] [error] [client 79.103.41.173] cur.execute('''SELECT url FROM files WHERE url = %s''', fullpath.encode('iso-8859-7') )
    [Thu Jun 06 13:34:19 2013] [error] [client 79.103.41.173] File "/usr/local/lib/python3.3/encodings/iso8859_7.py", line 12, in encode
    [Thu Jun 06 13:34:19 2013] [error] [client 79.103.41.173] return codecs.charmap_encode(input,errors,encoding_table)
    [Thu Jun 06 13:34:19 2013] [error] [client 79.103.41.173] UnicodeEncodeError: 'charmap' codec can't encode characters in position 34-37: character maps to <undefined>

    [Thu Jun 06 13:27:17 2013] [error] [client 79.103.41.173] Traceback (most recent call last):
    [Thu Jun 06 13:27:17 2013] [error] [client 79.103.41.173] File "files.py", line 73, in <module>
    [Thu Jun 06 13:27:17 2013] [error] [client 79.103.41.173] cur.execute('''SELECT url FROM files WHERE url = %s''', fullpath.decode('iso-8859-7') )
    [Thu Jun 06 13:27:17 2013] [error] [client 79.103.41.173] AttributeError: 'str' object has no attribute 'decode'

Same when I encode in Latin.
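Both tracebacks above plausibly share one cause. With a str path, os.walk yields str names, so there is no .decode() to call, and under a UTF-8 locale Python 3 carries undecodable filename bytes as lone surrogates (the surrogateescape error handler), which ISO-8859-7 cannot encode. The sketch below assumes a UTF-8 filesystem encoding and uses the first four bytes from MRAB's octal string as sample data.

```python
# First four bytes of the stored file name (octal \305\365\367\336).
raw = b"\xc5\xf5\xf7\xde"

# What a str-based os.walk() would hand back under a UTF-8 locale:
# each undecodable byte becomes a lone surrogate code point.
as_str = raw.decode("utf-8", "surrogateescape")

try:
    as_str.encode("iso-8859-7")
    encode_failed = False
except UnicodeEncodeError:
    # Lone surrogates have no mapping in ISO-8859-7.
    encode_failed = True

# Getting the original bytes back needs the same codec and error handler.
recovered = as_str.encode("utf-8", "surrogateescape")

# Only the raw bytes can be decoded as ISO-8859-7 to give the Greek text.
greek = recovered.decode("iso-8859-7")
print(encode_failed, recovered == raw, greek)
```

Working with bytes paths from the start, as suggested earlier in the thread, avoids the round trip altogether.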
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On 06.06.2013 12:35, Νικόλαος Κούρας wrote:

    ni...@superhost.gr [~/www/data/apps]# ls -l | file -
    /dev/stdin: ASCII text

Did you actually try to understand what I wrote?

-- --- Heiko.
Re: Changing filenames from Greeklish = Greek (subprocess complain)
Heiko, the ssh client I used to 'mv' the .mp3 was PuTTY. Do you mean that PuTTY is responsible for the encoding mess?

the rename command on the command-line of his shell session, the mv command gets a stream of bytes as the new file name which happens to be the ISO-8859-7 encoding of the file name he'd like the file to have. This is what's stored on disk. So, his biggest problem isn't that the operating system is encoding agnostic wrt. filenames (i.e., treats them as a stream of bytes), but rather that he's using an ISO-7 terminal window when having set up UTF-8 as his operating system locale and expects filenames to be encoded in UTF-8 when he's not passing in UTF-8 byte streams from his client computer at all.
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On 06.06.2013 13:00, Νικόλαος Κούρας wrote:
Heiko, the ssh client I used to 'mv' the .mp3 was PuTTY. Do you mean that PuTTY is responsible for the encoding mess?

Exactly. Check the encoding that PuTTY uses for the terminal session. If it doesn't use UTF-8, switch your terminal session to UTF-8 and try the rename again. If it does, try to use another terminal client (I recommend the Cygwin-Suite).

-- --- Heiko.
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Thursday, 6 June 2013 1:24:16 PM UTC+3, Cameron Simpson wrote:
On 05Jun2013 11:43, Νίκος Γκρ33κ nikos.gr...@gmail.com wrote:
| On Wednesday, 5 June 2013 9:32:15 PM UTC+3, MRAB wrote:
| Using Python, I think you could get the filenames using os.listdir,
| passing the directory name as a bytestring so that it'll return the
| names as bytestrings.
|
| Then, for each name, you could decode from its current encoding and
| encode to UTF-8 and rename the file, passing the old and new paths to
| os.rename as bytestrings.
|
| I am not sure I follow:
|
| Change this:
|
| # Compute a set of current fullpaths
| fullpaths = set()
| path = "/home/nikos/public_html/data/apps/"
|
| for root, dirs, files in os.walk(path):
[...]

Have a read of this: http://docs.python.org/3/library/os.html#os.listdir

The UNIX API accepts bytes for filenames and paths. Python 3 strs are sequences of Unicode code points. If you try to open a file or directory on a UNIX system using a Python str, that string must be converted to a sequence of bytes before being handed to the OS. This is done implicitly using your locale settings if you just use a str.

However, if you pass a bytes to open or listdir, this conversion does not take place. You put bytes in and, in the case of listdir, you get bytes out. You can work on pathnames in bytes and never concern yourself with encode/decode at all. In this way you can write code that does not care about the translation between Unicode and some arbitrary byte encoding.

Of course, the issue will still arise when accepting user input; your shell has done exactly this kind of thing when you renamed your MP3 file. But it is possible to write pure utility code that doesn't care about filenames as Unicode or str if you work purely in bytes.

Regarding user filenames, the common policy these days is to use utf-8 throughout. Of course you need to get everything into that regime to start with.

So I need to use os.listdir() to grab those filenames as bytes. Okay. So by changing this:

    fullpaths = set()
    path = "/home/nikos/public_html/data/apps/"
    for root, dirs, files in os.walk(path):
        for fullpath in files:
            fullpaths.add( os.path.join(root, fullpath) )

to this:

    # Compute a set of current fullpaths
    fullpaths = os.listdir( '/home/nikos/public_html/data/apps/' )

    # Load'em
    for fullpath in fullpaths:
        try:
            # Check the presence of a file against the database and insert if it doesn't exist
            cur.execute('''SELECT url FROM files WHERE url = %s''', (fullpath,) )
            data = cur.fetchone()#URL
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Thursday, 6 June 2013 2:09:22 PM UTC+3, Heiko Wundram wrote:
On 06.06.2013 13:00, Νικόλαος Κούρας wrote:
Heiko, the ssh client I used to 'mv' the .mp3 was PuTTY. Do you mean that PuTTY is responsible for the encoding mess?

Exactly. Check the encoding that PuTTY uses for the terminal session. If it doesn't use UTF-8, switch your terminal session to UTF-8 and try the rename again. If it does, try to use another terminal client (I recommend the Cygwin-Suite).

Okay, indeed it was using the Greek ISO encoding. I changed it to UTF-8 and reopened the terminal session.

    ni...@superhost.gr [~/www/data/apps]# mv *.mp3 'Ευχή του Ιησού.mp3'
    mv: `\305\365\367\336 \364\357\365 \311\347\363\357\375.mp3' and `\305\365\367\336 \364\357\365 \311\347\363\357\375.mp3' are the same file
    ni...@superhost.gr [~/www/data/apps]# mv *.mp3 'Ευχή του Ιησού.mp33'
    ni...@superhost.gr [~/www/data/apps]# mv *.mp33 'Ευχή του Ιησού.mp3'
    ni...@superhost.gr [~/www/data/apps]# ls -l
    total 368548
    drwxr-xr-x 2 nikos nikos     4096 Jun  6 14:22 ./
    drwxr-xr-x 6 nikos nikos     4096 May 26 21:13 ../
    -rwxr-xr-x 1 nikos nikos 13157283 Mar 17 12:57 100\ Mythoi\ tou\ Aiswpou.pdf*
    -rwxr-xr-x 1 nikos nikos 29524686 Mar 11 18:17 Anekdotologio.exe*
    -rw-r--r-- 1 nikos nikos 42413964 Jun  2 20:29 Battleship.exe
    -rw-r--r-- 1 nikos nikos   236032 Jun  4 14:10 \323\352\335\370\357\365\ \335\355\341\355\ \341\361\351\350\354\374.exe
    -rwxr-xr-x 1 nikos nikos 66896732 Mar 17 13:13 Kosmas\ o\ Aitwlos\ -\ Profiteies.pdf*
    -rw-r--r-- 1 nikos nikos 51819750 Jun  2 20:04 Luxor\ Evolved.exe
    -rw-r--r-- 1 nikos nikos 60571648 Jun  2 14:59 Monopoly.exe
    -rw-r--r-- 1 nikos nikos  3511233 Jun  4 14:11 \305\365\367\336\ \364\357\365\ \311\347\363\357\375.mp3
    -rwxr-xr-x 1 nikos nikos  1788164 Mar 14 11:31 Online\ Movie\ Player.zip*
    -rw-r--r-- 1 nikos nikos  5277287 Jun  1 18:35 O\ Nomos\ tou\ Merfy\ v1-2-3.zip
    -rwxr-xr-x 1 nikos nikos 16383001 Jun 22  2010 Orthodoxo\ Imerologio.exe*
    -rw-r--r-- 1 nikos nikos  6084806 Jun  1 18:22 Pac-Man.exe
    -rw-r--r-- 1 nikos nikos 25476584 Jun  2 19:50 Scrabble.exe
    -rwxr-xr-x 1 nikos nikos 49141166 Mar 17 12:48 To\ 1o\ mou\ vivlio\ gia\ to\ skaki.pdf*
    -rwxr-xr-x 1 nikos nikos  3298310 Mar 17 12:45 Vivlos\ gia\ Atheofovous.pdf*
    -rw-r--r-- 1 nikos nikos  1764864 May 29 21:50 V-Radio\ v2.4.msi
    ni...@superhost.gr [~/www/data/apps]# ls *.mp3 | file -
    /dev/stdin: ASCII text
    ni...@superhost.gr [~/www/data/apps]#

Still the same error.
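The backslash sequences in the listing above are the raw bytes of the file names printed as octal escapes. As a rough sketch, they can be turned back into bytes and tried against a candidate encoding. The helper name unescape_octal is hypothetical, not something from the thread, and it only understands three-digit octal escapes, backslash-escaped single characters and plain ASCII.

```python
import re

# \ooo (one byte in octal), backslash + any character, or a run of plain text.
_TOKEN = re.compile(r"\\([0-3][0-7]{2})|\\(.)|([^\\]+)", re.S)

def unescape_octal(shown: str) -> bytes:
    """Turn ls-style escaped text such as r'\\305\\365' back into raw bytes."""
    out = bytearray()
    for octal, escaped, plain in _TOKEN.findall(shown):
        if octal:
            out.append(int(octal, 8))        # \305 -> 0xC5
        elif escaped:
            out += escaped.encode("ascii")   # "\ " -> " "
        else:
            out += plain.encode("ascii")     # ".mp3" stays as it is
    return bytes(out)

shown = r"\305\365\367\336\ \364\357\365\ \311\347\363\357\375.mp3"
raw = unescape_octal(shown)
print(raw.decode("iso-8859-7"))
```

If the guess about the encoding is right, the decoded text reads as intended; a wrong guess gives either an exception or different letters.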
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On 06.06.2013 13:24, Νικόλαος Κούρας wrote:

    ni...@superhost.gr [~/www/data/apps]# ls *.mp3 | file -
    /dev/stdin: ASCII text

Again, did you actually read (and try to understand) what I wrote? I said to redo the rename after you change your terminal session to UTF-8.

-- --- Heiko.
Re: Changing filenames from Greeklish = Greek (subprocess complain)
    # Compute a set of current fullpaths
    fullpaths = os.listdir( '/home/nikos/public_html/data/apps/' )

    # Load'em
    for fullpath in fullpaths:
        try:
            # Check the presence of a file against the database and insert if it doesn't exist
            cur.execute('''SELECT url FROM files WHERE url = %s''', fullpath.encode('utf-8') )
            data = cur.fetchone()   # URL is unique, so should only be one
            print( fullpath.encode('utf-8') )

Now why does this not print out the filenames when iterated in the for loop? One step forward is that when I run it like this, no error is displayed in the error log.

Please help, I have tried os.listdir() as Cameron suggested.
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On 06/06/2013 04:43, Νικόλαος Κούρας wrote:
On Wednesday, 5 June 2013 9:43:18 PM UTC+3, Νικόλαος Κούρας wrote:
On Wednesday, 5 June 2013 9:32:15 PM UTC+3, MRAB wrote:
On 05/06/2013 18:43, Νικόλαος Κούρας wrote:
On Wednesday, 5 June 2013 8:56:36 PM UTC+3, Steven D'Aprano wrote:

Somehow, I don't know how because I didn't see it happen, you have one or more files in that directory where the file name as bytes is invalid when decoded as UTF-8, but your system is set to use UTF-8. So to fix this you need to rename the file using some tool that doesn't care quite so much about encodings. Use the bash command line to rename each file in turn until the problem goes away.

' led to that unknown encoding of this bytestream '\305\365\367\336\ \364\357\365\ \311\347\363\357\375.mp3'

But please tell me, Steven, which Linux tool do you think can encode the weird filename to proper 'Ευχή του Ιησού.mp3' UTF-8? Or we can write a script, as I suggested, to decode the bytestream using all sorts of available charsets until it boils down to the original Greek letters.

Actually you were correct, I was typing Greek and I saw the filename here in Google Groups as:

But renaming via shell access like 'mv 'Euxi tou Ihsou.mp3' 'Ευχή του Ιησού.mp3'

so maybe the filenames have to be decoded to Greek ISO, but then again they contain Greek letters while their extensions are in English characters like '.mp3'.

Using Python, I think you could get the filenames using os.listdir, passing the directory name as a bytestring so that it'll return the names as bytestrings.

Then, for each name, you could decode from its current encoding and encode to UTF-8 and rename the file, passing the old and new paths to os.rename as bytestrings.

I am not sure I follow:

Change this:

    # Compute a set of current fullpaths
    fullpaths = set()
    path = "/home/nikos/public_html/data/apps/"
    for root, dirs, files in os.walk(path):
        for fullpath in files:
            fullpaths.add( os.path.join(root, fullpath) )

to what, to make the full URL readable by files.py?

MRAB, can you please explain your idea of a solution more clearly?

I was suggesting a way to rename the files so that their names are encoded in UTF-8 (they appear to be encoded in ISO-8859-7).

You MUST TEST IT thoroughly first, of course, before trying it on the actual files.

It could go something like this:

    import os

    # Give the path as a bytestring so that we'll get the names as bytestrings.
    root_folder = b"/home/nikos/public_html/data/apps/"

    # Setting TESTING to True will make it print out what renamings it will do, but
    # not actually do them.
    TESTING = True

    # Walk through the files.
    for root, dirs, files in os.walk(root_folder):
        for name in files:
            try:
                # Is this name encoded in UTF-8?
                name.decode("utf-8")
            except UnicodeDecodeError:
                # Decoding from UTF-8 failed, which means that the name is not valid
                # UTF-8.

                # It appears (from elsewhere) that the names are encoded in
                # ISO-8859-7, so decode from that and re-encode to UTF-8.
                new_name = name.decode("iso-8859-7").encode("utf-8")

                old_path = os.path.join(root, name)
                new_path = os.path.join(root, new_name)
                if TESTING:
                    print("Will rename {!r} to {!r}".format(old_path, new_path))
                else:
                    print("Renaming {!r} to {!r}".format(old_path, new_path))
                    os.rename(old_path, new_path)
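The decision at the heart of the script above (leave valid UTF-8 names alone, re-encode everything else from ISO-8859-7) can be pulled out into a pure function and checked without touching any files. This is a sketch under the thread's assumption that the bad names are ISO-8859-7; the function name to_utf8_name is hypothetical.

```python
def to_utf8_name(name: bytes, source_encoding: str = "iso-8859-7") -> bytes:
    """Return name unchanged if it is valid UTF-8, else re-encode it to UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: assume the source encoding and convert.
        return name.decode(source_encoding).encode("utf-8")
    return name

# Sample bytes taken from the octal string quoted in the thread.
old = b"\xc5\xf5\xf7\xde \xf4\xef\xf5 \xc9\xe7\xf3\xef\xfd.mp3"
new = to_utf8_name(old)
print(new.decode("utf-8"))
```

Two caveats apply. A byte string in another encoding can occasionally happen to be valid UTF-8 as well, in which case this test would leave it alone. The second decode can also raise if a name contains byte values that the assumed source encoding does not define.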
Re: Changing filenames from Greeklish = Greek (subprocess complain)
First of all, thank you for helping me, MRAB.
After making some alterations to your code I have this:

    # Give the path as a bytestring so that we'll get the filenames as bytestrings
    path = b"/home/nikos/public_html/data/apps/"

    # Setting TESTING to True will make it print out what renamings it will do, but not actually do them
    TESTING = True

    # Walk through the files.
    for root, dirs, files in os.walk( path ):
        for filename in files:
            try:
                # Is this name encoded in UTF-8?
                filename.decode('utf-8')
            except UnicodeDecodeError:
                # Decoding from UTF-8 failed, which means that the name is not valid UTF-8
                # It appears that the filenames are encoded in ISO-8859-7, so decode from that and re-encode to UTF-8
                new_filename = filename.decode('iso-8859-7').encode('utf-8')

                old_path = os.path.join(root, filename)
                new_path = os.path.join(root, new_filename)
                if TESTING:
                    print( '''<br>Will rename {!r} --- {!r}<br><br>'''.format( old_path, new_path ) )
                else:
                    print( '''<br>Renaming {!r} --- {!r}<br><br>'''.format( old_path, new_path ) )
                    os.rename( old_path, new_path )
    sys.exit(0)

and the output can be seen here: http://superhost.gr/cgi-bin/files.py

We are in test mode, so I don't know what the encodings will be when the renaming actually takes place.

Shall I switch off test mode and try it for real?
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On Tue, 04 Jun 2013 02:00:43 -0700, Νικόλαος Κούρας wrote:
On Tuesday, 4 June 2013 11:47:01 AM UTC+3, Steven D'Aprano wrote:
Please run these commands, and show what result they give:
[...]

    ni...@superhost.gr [~/www/data/apps]# alias ls
    alias ls='/bin/ls $LS_OPTIONS'

And what does echo $LS_OPTIONS give?

[...]
Seems that the way the system used to actually rename the file matters.

Yes. This is where you get interactions between different systems that use different encodings, and they don't work well together. Some day, everything will use UTF-8, and these problems will go away.

If all else fails, you could just rename the troublesome file and hopefully the problem will go away:

    mv *Ο.mp3 1.mp3
    mv 1.mp3 "Eυχή του Ιησού.mp3"

Yes, but why are you doing it in 2 steps and not as:

    mv *Ο.mp3 'Eυχή του Ιησού.mp3'

I don't remember. I had a reason that made sense at the time, but I can't remember what it was.

I think I can reproduce your problem. If I open a terminal, set to use UTF-8, I can do this:

    [steve@ando ~]$ cd /tmp
    [steve@ando tmp]$ touch '999-Eυχή-του-Ιησού'
    [steve@ando tmp]$ ls 999*
    999-Eυχή-του-Ιησού

Now if I change the terminal to use Greek ISO-8859-7, and hit UP-ARROW to grab the previous command line from history, the *displayed* file name changes, but the actual file being touched remains the same:

    [steve@ando tmp]$ touch '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ'
    [steve@ando tmp]$ ls 999*
    999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ

In Python 3.3, I can demonstrate the same sort of thing:

    py> s = '999-Eυχή-του-Ιησού'
    py> bytes_as_utf8 = s.encode('utf-8')
    py> t = bytes_as_utf8.decode('iso-8859-7', errors='replace')
    py> print(t)
    999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ

So that demonstrates part of your problem: even though your Linux system is using UTF-8, your terminal is probably set to ISO-8859-7. The interaction between these will lead to strange and disturbing Unicode errors.

To continue, back in the terminal set to ISO-8859-7, if instead of using the history line I re-copy and paste the file name:

    [steve@ando tmp]$ touch '999-Eυχή-του-Ιησού'
    [steve@ando tmp]$ ls 999*
    999-E???-???-?  999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ

So now I end up with two files, one with a file name that is utter garbage bytes, and one that is only a little better, being mojibake. Resetting the terminal to use UTF-8 at least now restores the *display* of the earlier file's name:

    [steve@ando tmp]$ ls 999*
    999-E???-???-?  999-Eυχή-του-Ιησού
    [steve@ando tmp]$ ls -b 999*
    999-E\365\367\336-\364\357\365-\311\347\363\357\375  999-Eυχή-του-Ιησού

but the other file name is still made of garbage bytes.

So I believe I understand how your file name has become garbage. To fix it, make sure that your terminal is set to use UTF-8, and then rename it. Do the same with every file in the directory until the problem goes away. (If one file has garbage bytes in the file name, chances are that more than one do.)

-- Steven
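The second half of the demonstration, where the name typed in an ISO-8859-7 terminal ends up on disk as bytes that a UTF-8 terminal cannot display, can be sketched in Python as well. The octal rendering below only approximates what `ls -b` prints, and the shortened name is sample data based on the post.

```python
# Latin "E" followed by Greek letters, as in the file name from the post.
name = "999-Eυχή"

# The bytes an ISO-8859-7 terminal would send for that text.
as_greek_bytes = name.encode("iso-8859-7")

# Show non-ASCII bytes as octal escapes, roughly the way `ls -b` does.
shown = "".join(chr(b) if b < 0x80 else "\\%03o" % b for b in as_greek_bytes)
print(shown)

# Those bytes are not valid UTF-8, hence the "?" placeholders in plain ls.
try:
    as_greek_bytes.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)
```

The escapes produced for the Greek letters match the first three in the `ls -b` output quoted above.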
Re: Changing filenames from Greeklish = Greek (subprocess complain)
On 06/06/2013 13:04, Νικόλαος Κούρας wrote:
First of all, thank you for helping me, MRAB.
After making some alterations to your code I have this:

    # Give the path as a bytestring so that we'll get the filenames as bytestrings
    path = b"/home/nikos/public_html/data/apps/"

    # Setting TESTING to True will make it print out what renamings it will do, but not actually do them
    TESTING = True

    # Walk through the files.
    for root, dirs, files in os.walk( path ):
        for filename in files:
            try:
                # Is this name encoded in UTF-8?
                filename.decode('utf-8')
            except UnicodeDecodeError:
                # Decoding from UTF-8 failed, which means that the name is not valid UTF-8
                # It appears that the filenames are encoded in ISO-8859-7, so decode from that and re-encode to UTF-8
                new_filename = filename.decode('iso-8859-7').encode('utf-8')

                old_path = os.path.join(root, filename)
                new_path = os.path.join(root, new_filename)
                if TESTING:
                    print( '''<br>Will rename {!r} --- {!r}<br><br>'''.format( old_path, new_path ) )
                else:
                    print( '''<br>Renaming {!r} --- {!r}<br><br>'''.format( old_path, new_path ) )
                    os.rename( old_path, new_path )
    sys.exit(0)

and the output can be seen here: http://superhost.gr/cgi-bin/files.py

We are in test mode, so I don't know what the encodings will be when the renaming actually takes place.

Shall I switch off test mode and try it for real?

The first one is '/home/nikos/public_html/data/apps/Ευχή του Ιησού.mp3'.

The second one is '/home/nikos/public_html/data/apps/Σκέψου έναν αριθμό.exe'.

These names are currently encoded in ISO-8859-7, but will be encoded in UTF-8 if you turn off test mode.

If you're happy for that change to happen, then go ahead.