RE: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-12 Thread Carlos Nepomuceno

 To: python-list@python.org
 From: breamore...@yahoo.co.uk
 Subject: Re: Changing filenames from Greeklish = Greek (subprocess complain)
 Date: Sun, 2 Jun 2013 15:51:31 +0100
[...]
 Steve is going for the pink ball - and for those of you who are
 watching in black and white, the pink is next to the green. Snooker
 commentator 'Whispering' Ted Lowe.

 Mark Lawrence

 --
 http://mail.python.org/mailman/listinfo/python-list

le+666l   


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-12 Thread Gene Heskett
On Sunday 02 June 2013 13:10:30 Chris Angelico did opine:

 On Mon, Jun 3, 2013 at 2:21 AM, Νικόλαος Κούρας nikos.gr...@gmail.com wrote:
  Paying for someone to just remove a dash to get the script working is
  too much to ask for
 
 One dash: 1c
 Knowing where to remove it: $99.99
 Total bill: $100.00
 
 Knowing that it ought really to be utf8mb4 and giving hints that the
 docs should be read rather than just taking this simple example and
 plugging it in: Priceless.
 
 ChrisA

Chuckle.  Chris, I do believe you have topped yourself.  Love it.

Cheers, Gene
-- 
There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order.
-Ed Howdershelt (Author)
My web page: http://coyoteden.dyndns-free.com:85/gene is up!
My views 
http://www.armchairpatriot.com/What%20Has%20America%20Become.shtml
Unnamed Law:
If it happens, it must be possible.
A pen in the hand of this president is far more
dangerous than 200 million guns in the hands of
 law-abiding citizens.


RE: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-12 Thread Carlos Nepomuceno
' Server: ApacheBooster/1.6'  isn't a signature of httpd. I think you are 
really running something different.

 From: nob...@nowhere.com
 Subject: Re: Changing filenames from Greeklish = Greek (subprocess complain)
 Date: Tue, 4 Jun 2013 14:01:48 +0100
 To: python-list@python.org
 
 On Tue, 04 Jun 2013 00:58:42 -0700, Νικόλαος Κούρας wrote:
 
  On Tuesday, 4 June 2013 10:39:08 AM UTC+3, user Nobody wrote:
  
  Chrome didn't choose ISO-8859-1, the server did; the HTTP response says:
Content-Type: text/html;charset=ISO-8859-1
  
  From where do you see this
 
 $ wget -S -O - http://superhost.gr/data/apps/
 --2013-06-04 14:00:10--  http://superhost.gr/data/apps/
 Resolving superhost.gr... 82.211.30.133
 Connecting to superhost.gr|82.211.30.133|:80... connected.
 HTTP request sent, awaiting response... 
   HTTP/1.1 200 OK
   Server: ApacheBooster/1.6
   Date: Tue, 04 Jun 2013 13:00:19 GMT
   Content-Type: text/html;charset=ISO-8859-1
   Transfer-Encoding: chunked
   Connection: keep-alive
   Vary: Accept-Encoding
   X-Cacheable: YES
   X-Varnish: 2000177813
   Via: 1.1 varnish
   age: 0
   X-Cache: MISS
 


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-11 Thread Larry Hudson

On 06/10/2013 01:11 AM, Νικόλαος Κούρας wrote:

On Monday, 10 June 2013 10:51:34 AM UTC+3, user Larry Hudson wrote:


I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 
256, not above 256.



0 - 127, yes.
128 - 255 - one byte of a multibyte code.


you mean that in utf-8 for 1 character to be stored, we need 2 bytes?
I am still having trouble understanding this.


Utf-8 characters are encoded in different sizes, NOT a single fixed number of 
bytes.
The high _bits_ of the first byte define the number of bytes of the individual 
character code.

(I'm copying this from Wikipedia...)
0xxxxxxx - 1 byte
110xxxxx - 2 bytes
1110xxxx - 3 bytes
11110xxx - 4 bytes
111110xx - 5 bytes
1111110x - 6 bytes

Notice that in the 1-byte version, since the high bit is always 0, only 7 bits are available for 
the character code, and this is the standard 0-127 ASCII (and ASCII-compatible) code set.



Since 2^8 = 256, utf-8 would need 1 byte to store the 1st 256 characters but 
instead it is using 1 byte up to the value 127 and then 2 bytes for 
anything above.  Why?

As I indicated above, one bit is reserved as a flag to indicate whether the byte is a one-byte 
code or part of a multibyte code, so only 7 bits are available for the actual 1-byte (ASCII) code.
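To make the lead-byte rule above concrete, here is a minimal sketch that reads the high bits of the first byte to work out the sequence length. The helper name utf8_sequence_length is made up for this illustration. It only handles the 1- to 4-byte forms, since the current UTF-8 definition (RFC 3629) no longer allows the 5- and 6-byte forms from the original design.

```python
def utf8_sequence_length(first_byte):
    """Return how many bytes the UTF-8 sequence starting with first_byte has.

    Hypothetical helper for illustration only; handles the 1- to 4-byte
    forms allowed by the current UTF-8 definition.
    """
    if first_byte >> 7 == 0b0:        # 0xxxxxxx
        return 1
    if first_byte >> 5 == 0b110:      # 110xxxxx
        return 2
    if first_byte >> 4 == 0b1110:     # 1110xxxx
        return 3
    if first_byte >> 3 == 0b11110:    # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte; anything else is not a valid lead byte.
    raise ValueError('not a valid UTF-8 lead byte: %#04x' % first_byte)


for ch in ('a', 'α', '€'):
    encoded = ch.encode('utf-8')
    # The length predicted from the first byte matches the real length.
    print(ch, len(encoded), utf8_sequence_length(encoded[0]))
```

The second byte of 'α' (0xb1, bit pattern 10110001) is a continuation byte, so passing it to the helper raises ValueError.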




Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-10 Thread nagia . retsina
On Sunday, 9 June 2013 3:31:44 PM UTC+3, user Steven D'Aprano wrote:

 py c = 'α'
 py ord(c)
 945

The number 945 is the characters 'α' ordinal value in the unicode charset 
correct?

The command in the python interactive session to show me how many bytes
this character will take upon encoding to utf-8 is:

 s = 'α'
 s.encode('utf-8')
b'\xce\xb1'

I see that the encoding of this char takes 2 bytes. But why two exactly?
How do i calculate how many bits are needed to store this char into bytes?


Trying to to the same here but it gave me no bytes back.

 s = 'a'
 s.encode('utf-8')
b'a'


py c.encode('utf-8')
 b'\xce\xb1'

2 bytes here. why 2?

 py c.encode('utf-16be')
 b'\x03\xb1'

2 bytes here also, but why different bytes? The ordinal value of char 'a' is 
the same in unicode. The encoding system just takes the ordinal value and 
encodes it, but since it uses 2 bytes shouldn't these 2 bytes be the same?

 py c.encode('utf-32be')
 b'\x00\x00\x03\xb1'

every char here takes exactly 4 bytes to be stored. okey.

 py c.encode('iso-8859-7')
 b'\xe1'

And also does '\x' mean that the value is being represented in hex?
And when i bin(6) i see '0b110'

I should expect to see 8 bits of 1s and 0's. What is the 'b' trying to say?


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-10 Thread Larry Hudson

On 06/09/2013 03:37 AM, Νικόλαος Κούρας wrote:



I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 
256, not above 256.


NO!!

0 - 127, yes.
128 - 255 - one byte of a multibyte code.

That's why the decode fails, it sees it as incomplete data so it can't do 
anything with it.



A surrogate pair is like hitting for example Ctrl-A, which means it is a 
combination character that consists of 2 different characters?
Is this what a surrogate is? A pair of 2 chars?

You're confusing character encodings with the way NON-CHARACTER keys on the KEYBOARD are encoded 
(function keys, arrow keys and such).  These are NOT text characters but KEYBOARD key codes. 
These are NOT text codes and are entirely different and not related to any character encoding. 
How programs interpret and use these codes depends entirely on the individual programs.  There 
are common conventions on how many are used, but there are no standards.


Also the control-codes are the first 32 values of the ASCII (and ASCII-compatible) character set 
and are not multi-character key codes like the keyboard non-character keys.


However, there are a few keyboard keys that actually produce control-codes.  A 
few examples:

Return/Enter - Ctrl-M
Tab - Ctrl-I
Backspace - Ctrl-H



So character 'A' - 65 (in decimal uses in charset's table)  - 01011100 (as binary 
stored in disk) - 0xEF (as hex, when we open the file with a hex editor)

You are trying to put too much meaning to this.  The value stored on disk, in memory, or 
whatever is binary bits, nothing else.  How you describe the value, in decimal, in octal, in 
hex, in base-12, or... is totally irrelevant.  These are simply different ways of describing or 
naming these numeric values.


It's the same as saying 3 in English is three, in Spanish is tres, in German is drei...  (I 
don't know Greek, sorry.)  No matter what you call it, it is still the numeric integer value 
that is between 2 and 4.
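As a small illustration of the point above, the same integer can be written in several bases without changing its value. This sketch uses only Python built-ins. For the letter 'A' in particular, the value 65 is 01000001 in binary and 0x41 in hexadecimal.

```python
value = ord('A')          # the code point of 'A'

print(value)              # decimal
print(bin(value))         # binary notation
print(oct(value))         # octal notation
print(hex(value))         # hexadecimal notation

# Four spellings of one and the same number.
print(65 == 0b1000001 == 0o101 == 0x41)
```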




Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-10 Thread Νικόλαος Κούρας
On Monday, 10 June 2013 10:51:34 AM UTC+3, user Larry Hudson wrote:

  I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant 
  up to 256, not above 256.

 0 - 127, yes.
 128 - 255 - one byte of a multibyte code.

you mean that in utf-8 for 1 character to be stored, we need 2 bytes?
I am still having trouble understanding this.

Since 2^8 = 256, utf-8 would need 1 byte to store the 1st 256 characters but 
instead it is using 1 byte up to the value 127 and then 2 bytes for 
anything above.  Why?


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-10 Thread Andreas Perstinger

On 10.06.2013 09:10, nagia.rets...@gmail.com wrote:

On Sunday, 9 June 2013 3:31:44 PM UTC+3, user Steven D'Aprano wrote:


py c = 'α'
py ord(c)
945


The number 945 is the characters 'α' ordinal value in the unicode charset 
correct?


Yes, the unicode character set is just a big list of characters. The 
character at position 945 in that list (counting from 0) happens to be 'α'.



The command in the python interactive session to show me how many bytes
this character will take upon encoding to utf-8 is:


s = 'α'
s.encode('utf-8')

b'\xce\xb1'

I see that the encoding of this char takes 2 bytes. But why two exactly?


That's how the encoding is designed. Haven't you read the wikipedia 
article which was already mentioned several times?



How do i calculate how many bits are needed to store this char into bytes?


You need to understand how UTF-8 works. Read the wikipedia article.


Trying to to the same here but it gave me no bytes back.


s = 'a'
s.encode('utf-8')

b'a'


The encode method returns a bytes object. Its length will tell you how 
many bytes there are:


>>> len(b'a')
1
>>> len(b'\xce\xb1')
2

The Python interpreter displays byte values as ASCII characters when they 
are printable ASCII (32 to 126); most other values are shown as \x escapes:


>>> ord(b'a')
97
>>> hex(97)
'0x61'
>>> b'\x61' == b'a'
True

The Python designers have decided to use b'a' instead of b'\x61'.


py c.encode('utf-8')
b'\xce\xb1'


2 bytes here. why 2?


Same as your first question.


py c.encode('utf-16be')
b'\x03\xb1'


2 byets here also. but why 3 different bytes? the ordinal value of
char 'a' is the same in unicode. the encodign system just takes the
ordinal value end encode, but sinc eit uses 2 bytes should these 2 bytes
be the same?


'utf-16be' is a different encoding scheme, thus it uses other rules to 
determine how each character is translated into a byte sequence.



py c.encode('iso-8859-7')
b'\xe1'


And also does '\x' mean that the value is being represented in hex?
And when i bin(6) i see '0b110'

I should expect to see 8 bits of 1s and 0's. What is the 'b' trying to say?

'\x' is an escape sequence and means that the following two characters 
should be interpreted as a number in hexadecimal notation (see also the 
table of allowed escape sequences: 
http://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals 
).


'0b' tells you that the number is printed in binary notation.
Leading zeros are usually discarded when a number is printed:
>>> bin(70)
'0b1000110'
>>> 0b100110 == 0b00100110
True

It's the same with decimal notation. You wouldn't say 00123 is different 
from 123, would you?
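If you do want to see all 8 bits, one option (a sketch using the built-in format function) is to ask for zero padding explicitly. Note that bin(6) itself gives '0b110'.

```python
n = 6

print(bin(n))             # shortest form, with the '0b' prefix
print(format(n, '08b'))   # zero-padded to 8 binary digits, no prefix

# Parsing the padded form gives back the same number.
print(int('00000110', 2) == n)
```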


Bye, Andreas


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-10 Thread Νικόλαος Κούρας
On Monday, 10 June 2013 11:15:38 AM UTC+3, user Andreas Perstinger wrote:

What is the difference between len('nikos') and len(b'nikos')?
The first being the length of the string nikos in characters, while the second being 
the length of an ???


 The python interpreter will represent all values below 256 as ASCII 
 characters if they are printable:

   ord(b'a')
 97
   hex(97)
 '0x61'
   b'\x61' == b'a'
 True
 The Python designers have decided to use b'a' instead of b'\x61'.

b'a' and b'\x61' are the bytestrings of char 'a' after utf-8 encoding?

This ord(b'a' )should give an error in my opinion:

ord('a') should return the ordinal value of char 'a', not ord(b'a')


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-10 Thread Νικόλαος Κούρας
  s = 'α' 
  s.encode('utf-8') 
  b'\xce\xb1' 

'b' stands for binary right? 
 b'\xce\xb1' = we are looking at a byte in a hexadecimal format? 
if yes how could we see it in binary and decimal representation? 
  
  I see that the encoding of this char takes 2 bytes. But why two exactly? 
  How do i calculate how many bits are needed to store this char into bytes? 
  
 Because utf-8 takes 1 to 4 bytes to encode characters 

Since 2^8 = 256, utf-8 should store the first 256 chars of unicode charset 
using 1 byte. 

Also since 2^16 = 65536, utf-8 should store the first 65536 chars of unicode 
charset using 2 bytes and so on. 

But i know that this is not the case. 
But i don't understand why. 


  s = 'a' 
  s.encode('utf-8') 
  b'a' 
 utf-8 takes ASCII as it is, as 1 byte. They are the same 

EBCDIC and ASCII and Unicode are character sets, correct? 

iso-8859-1, iso-8859-7, utf-8, utf-16, utf-32 and so on are encoding methods, 
right?


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-10 Thread Andreas Perstinger

On 10.06.2013 11:59, Νικόλαος Κούρας wrote:

 s = 'α'
 s.encode('utf-8')
 b'\xce\xb1'


'b' stands for binary right?


No, here it stands for bytes:
http://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals


  b'\xce\xb1' = we are looking at a byte in a hexadecimal format?


No, b'\xce\xb1' represents a bytes object containing 2 bytes.
Yes, each byte is represented in hexadecimal format.


if yes how could we see it in binary and decimal represenation?


>>> s = b'\xce\xb1'
>>> s[0]
206
>>> bin(s[0])
'0b11001110'
>>> s[1]
177
>>> bin(s[1])
'0b10110001'

A bytes object is a sequence of bytes (= integer values) and supports 
indexing.

http://docs.python.org/3/library/stdtypes.html#bytes


Since 2^8 = 256, utf-8 should store the first 256 chars of unicode
charset using 1 byte.

Also since 2^16 = 65536, utf-8 should store the first 65536 chars of
unicode charset using 2 bytes and so on.

But i know that this is not the case. But i dont understand why.


Because your method doesn't work.
If you use all possible 256 bit-combinations to represent a valid 
character, how do you decide where to stop in a sequence of bytes?



 s = 'a'
 s.encode('utf-8')
 b'a'
utf-8 takes ASCII as it is, as 1 byte. They are the same


EBCDIC and ASCII and Unicode are charactet sets, correct?

iso-8859-1, iso-8859-7, utf-8, utf-16, utf-32 and so on are encoding methods, 
right?



Look at http://www.unicode.org/glossary/ for an explanation of all the 
terms.


Bye, Andreas


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-10 Thread Steven D'Aprano
On Mon, 10 Jun 2013 00:10:38 -0700, nagia.retsina wrote:

 On Sunday, 9 June 2013 3:31:44 PM UTC+3, user Steven D'Aprano wrote:
 
 py c = 'α'
 py ord(c)
 945
 
 The number 945 is the characters 'α' ordinal value in the unicode
 charset correct?

Correct.


 The command in the python interactive session to show me how many bytes
 this character will take upon encoding to utf-8 is:
 
 s = 'α'
 s.encode('utf-8')
 b'\xce\xb1'
 
 I see that the encoding of this char takes 2 bytes. But why two exactly?

Because that's how UTF-8 works. If it was a different encoding, it might 
be 4 bytes, or 2, or 1, or 101, or 7, or 3. But it is UTF-8, so it takes 
2 bytes. If you want to understand how UTF-8 works, look it up on 
Wikipedia. 


 How do i calculate how many bits are needed to store this char into
 bytes?

Every byte is made of 8 bits. There are two bytes. So multiply 8 by 2.


 Trying to to the same here but it gave me no bytes back.
 
 s = 'a'
 s.encode('utf-8')
 b'a'

There is a byte there. The byte is printed by Python as b'a', which in my 
opinion is a design mistake. That makes it look like a string, but it is 
not a string, and would be better printed as b'\x61'. But regardless of 
the display, it is still a single byte.

 
py c.encode('utf-8')
 b'\xce\xb1'
 
 2 bytes here. why 2?

Because that's how UTF-8 works.


 py c.encode('utf-16be')
 b'\x03\xb1'
 
 2 byets here also. but why 3 different bytes? 

Because it is a different encoding.


 the ordinal value of char 'a' is the same in unicode.

The same as what?


 the encodign system just takes the ordinal value end encode, but 
 sinc eit uses 2 bytes should these 2 bytes be the same?

No.

That's like saying that since a dog in Germany has four legs and one 
head, and a dog in France has four legs and one head, dog should be 
spelled Hund in both Germany and France.

Different encodings are like different languages. They spell the same 
word differently.
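To see the "same word, different spelling" idea in one place, this small sketch encodes the same character with the four codecs discussed in this thread and checks that each one decodes back to it.

```python
c = 'α'

for codec in ('utf-8', 'utf-16be', 'utf-32be', 'iso-8859-7'):
    encoded = c.encode(codec)
    # Same character, different byte sequences and lengths per encoding.
    print(codec, encoded, len(encoded), encoded.decode(codec) == c)
```

The byte sequences differ, but decoding each with the codec that produced it always yields the same code point, 945.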


 py c.encode('utf-32be')
 b'\x00\x00\x03\xb1'
 
 every char here takes exactly 4 bytes to be stored. okey.
 
 py c.encode('iso-8859-7')
 b'\xe1'
 
 And also does '\x' mean that the value is being represented in hex?
 And when i bin(6) i see '0b110'
 
 I should expect to see 8 bits of 1s and 0's. What is the 'b' trying to
 say?

b for Binary.

Just like 0o1234 uses octal, o for Octal.

And 0x123EF uses hexadecimal. x for heXadecimal.



-- 
Steven


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-10 Thread Νικόλαος Κούρας
On Monday, 10 June 2013 2:59:03 PM UTC+3, user Steven D'Aprano wrote:
 On Mon, 10 Jun 2013 00:10:38 -0700, nagia.retsina wrote:
 
  On Sunday, 9 June 2013 3:31:44 PM UTC+3, user Steven D'Aprano wrote:
  
  py c = 'α'
  py ord(c)
  945
  
  The number 945 is the characters 'α' ordinal value in the unicode
  charset correct?
 
 Correct.
 
  The command in the python interactive session to show me how many bytes
  this character will take upon encoding to utf-8 is:
  
  s = 'α'
  s.encode('utf-8')
  b'\xce\xb1'
  
  I see that the encoding of this char takes 2 bytes. But why two exactly?
 
 Because that's how UTF-8 works. If it was a different encoding, it might 
 be 4 bytes, or 2, or 1, or 101, or 7, or 3. But it is UTF-8, so it takes 
 2 bytes. If you want to understand how UTF-8 works, look it up on 
 Wikipedia. 
 
  How do i calculate how many bits are needed to store this char into
  bytes?
 
 Every byte is made of 8 bits. There are two bytes. So multiply 8 by 2.
 
  Trying to to the same here but it gave me no bytes back.
  
  s = 'a'
  s.encode('utf-8')
  b'a'
 
 There is a byte there. The byte is printed by Python as b'a', which in my 
 opinion is a design mistake. That makes it look like a string, but it is 
 not a string, and would be better printed as b'\x61'. But regardless of 
 the display, it is still a single byte.


Perhaps, for the ASCII chars up to 127, Python thinks it is better for a human to read the 
character representation of the stored byte instead of hex. Just a guess.

 Just like 0o1234 uses octal, o for Octal.
 And 0x123EF uses hexadecimal. x for heXadecimal.

Why the leading zero before octal's 'o', hex's 'x' and binary's 'b'?


I am not going to tire you any more, because I have exhausted myself for two days now 
trying to get my head around this.

Please confirm I have understood correctly:

I did, but the docs confuse me even more. Can you please put it simply?

Unicode, as I understand it, was created out of the need for a bigger character set 
than ASCII, which could hold only 128 chars (and extended versions of it 
up to 256), one that could hold all the world's symbols.

ASCII and Unicode are character sets.

Everything else seems to be an encoding system that works upon those character 
sets.

If what I said is true, the last thing that still confuses me is that

iso-8859-7 (256 chars) seems like a character set and an encoding method too.
Can it be both, or is iso-8859-7 an encoding method of the Unicode character set, 
similar to how UTF-8 is also one of Unicode's encoding methods?


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-10 Thread jmfauth

-

A coding scheme works with three sets. A *unique* set
of CHARACTERS, a *unique* set of CODE POINTS and a *unique*
set of ENCODED CODE POINTS, unicode or not.

The relation between the set of characters and the set of the
code points is a *human* table, created with a sheet of paper
and a pencil, a deliberate choice of characters with integers
as labels.

The relation between the set of the code points and the
set of encoded code points is a mathematical operation.

In the case of an 8-bit coding scheme, like iso-XXX,
this operation is a no-op, the relation is an identity.
In short: set of code points == set of encoded code points.

In the case of unicode, The Unicode consortium endorses
three such mathematical operations called UTF-8, UTF-16 and
UTF-32 where UTF means Unicode Transformation Format, a
confusing wording meaning at the same time, the process
and the result of the process. This Unicode Transformation does
not produce bytes, it produces words/chunks/tokens of *bits* with
lengths 8, 16, 32, called Unicode Transformation Units (from this
the names UTF-8, -16, -32). At this level, only a structure has
been defined (there is no computing). Very important, a healthy
coding scheme works conceptually only with this *unique* set
of encoded code points, not with bytes, characters or code points.

The last step, the machine implementation: it is up to the
processor, the compiler, the language to implement all these
Unicode Transformation Units with of course their related
specificities: char, w_char, int, long, endianness, rune (Go
language), ...

Not too over-simplified or not too over-complicated and enough
to understand one, if not THE, design mistake of the flexible
string representation.

jmf



Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-10 Thread Ned Batchelder
On Monday, June 10, 2013 3:48:08 PM UTC-4, jmfauth wrote:
 -
 
 
 
 A coding scheme works with three sets. A *unique* set
 of CHARACTERS, a *unique* set of CODE POINTS and a *unique*
 set of ENCODED CODE POINTS, unicode or not.
 
 The relation between the set of characters and the set of the
 code points is a *human* table, created with a sheet of paper
 and a pencil, a deliberate choice of characters with integers
 as labels.
 
 The relation between the set of the code points and the
 set of encoded code points is a mathematical operation.
 
 In the case of an 8bits coding scheme, like iso-XXX,
 this operation is a no-op, the relation is an identity.
 Shortly: set of code points == set of encoded code points.
 
 In the case of unicode, The Unicode consortium endorses
 three such mathematical operations called UTF-8, UTF-16 and
 UTF-32 where UTF means Unicode Transformation Format, a
 confusing wording meaning at the same time, the process
 and the result of the process. This Unicode Transformation does
 not produce bytes, it produces words/chunks/tokens of *bits* with
 lengths 8, 16, 32, called Unicode Transformation Units (from this
 the names UTF-8, -16, -32). At this level, only a structure has
 been defined (there is no computing). 

This is a really good description of the issues involved with character sets 
and encodings, thanks.

 Very important, an healthy
 coding scheme works conceptually only with this *unique set
 of encoded code points, not with bytes, characters or code points.
 

You don't explain why it is important to work with encoded code points.  What's 
wrong with working with code points?

 
 The last step, the machine implementation: it is up to the
 processor, the compiler, the language to implement all these
 Unicode Transformation Units with of course their related
 specifities: char, w_char, int, long, endianess, rune (Go
 language), ...
 
 Not too over-simplified or not too over-complicated and enough
 to understand one, if not THE, design mistake of the flexible
 string representation.
 
 jmf

Again you've made the claim that the flexible string representation is a 
mistake.  But you haven't said WHY.  I can't tell if you are trolling us, or 
are deluded, or genuinely don't understand what you are talking about.

Some day you might explain yourself. I look forward to it.

--Ned.


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-09 Thread Steven D'Aprano
On Sun, 09 Jun 2013 07:46:40 +0300, Νικόλαος Κούρας wrote:

 Why does every character in a character set needs to be associated with
 a numeric value?

Because computers are digital, not analog, and because bytes are numbers.

Here are a few of the 256 possible bytes, written in binary, decimal and 
hexadecimal:

0b00000000   0 0x00
0b00000001   1 0x01
0b00000010   2 0x02
[...]
0b01111111 127 0x7F
0b10000000 128 0x80
[...]
0b11111110 254 0xFE
0b11111111 255 0xFF


EVERYTHING in computers are numbers, because everything is stored as 
bytes. Text is stored as bytes. Sound files are stored as bytes. Images 
are stored as bytes. Programs are stored as bytes. So everything is being 
stored as numbers. But the *meaning* we give to those numbers depends on 
what we do with them, whether we treat them as characters, bitmapped 
images, floating point values, or something else.

Once we decide we want to store the character A as bytes, we need to 
decide which number it should be. That is the job of the charset.

ASCII:

65 -- 'A'
66 -- 'B'
67 -- 'C'
etc.


 I mean couldn't we just have characters sets that wouldn't have numeric
 associations like:
 
 'A' -> encoding process (i.e. utf-8) -> bytes
 bytes -> decoding process (i.e. utf-8) -> character 'A'

No. How would you store it in a computer's memory, or on a hard drive? By 
carving a tiny, microscopic A onto the hard drive? How would you read 
it back?

It is theoretically possible to build an analog computer, out of 
clockwork, or water flowing through pipes, or something, but nobody 
really bothers because it is much harder and not very useful.


 An ordinal = ordered numbers like 7, 8, 9, 10 and so on?

Yes.


 Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for 
 values up to 256?

Because then how do you tell when you need one byte, and when you need 
two? If you read two bytes, and see 0x4C 0xFA, does that mean two 
characters, with ordinal values 0x4C and 0xFA, or one character with 
ordinal value 0x4CFA?

UTF-8 solves this problem by reserving some values to mean "this byte, on 
its own", and others to mean "this byte, plus the next byte, together", 
and so forth, up to four bytes.

If you look up UTF-8 on Wikipedia, you will see more about this.
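For anyone who wants to see the reserved bit patterns at work, here is a rough hand-written encoder. The function name encode_utf8 is made up for this sketch, and it skips the validity checks a real codec performs (it does not reject surrogates or values above 0x10FFFF). It shows why 'α' (code point 945, which needs 11 bits) ends up as the two bytes 0xce 0xb1.

```python
def encode_utf8(code_point):
    """Encode one code point to UTF-8 by hand (illustrative sketch only)."""
    if code_point < 0x80:                       # up to 7 bits: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                      # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                    # up to 16 bits: 1110xxxx + 2 continuation bytes
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),    # up to 21 bits: 11110xxx + 3 continuation bytes
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for ch in ('a', 'α', '€'):
    # The hand-made result agrees with Python's own UTF-8 codec.
    print(ch, encode_utf8(ord(ch)), encode_utf8(ord(ch)) == ch.encode('utf-8'))
```

Because every continuation byte starts with the bits 10 and every lead byte does not, a decoder can always tell where one character ends and the next begins.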

 UTF-8 and UTF-16 and UTF-32
 I though the number beside of UTF- was to declare how many bits the 
 character set was using to store a character into the hdd, no?

Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values. 
UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit 
values to make a surrogate pair. UTF-8 uses 8-bit values, but sometimes 
it combines two, three or four of them to represent a single code-point.

  Narrow Unicode uses two bytes per character. Since two bytes is only
  enough for about 65,000 characters, not 1,000,000+, the rest of the
  characters are stored as pairs of two-byte surrogates.
 
 Can you please explain this line "the rest of the characters are stored 
 as pairs of two-byte surrogates" more easily for me to understand it?
 I'm still having trouble understanding what a surrogate is.

Look up "UTF-16" and "surrogate pair" on Wikipedia.

But basically, there are 65000+ different possible 16-bit values 
available for UTF-16 to use. Some of those values are reserved to mean 
"this value is not a character, it is half of a surrogate pair". Since 
they are *pairs*, they must always come in twos. A surrogate pair makes 
up a valid character. Half of a surrogate pair, on its own, is an error.
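As a sketch of how a code point above 65535 is split into two 16-bit halves, the helper below (to_surrogate_pair is a name invented for this example) applies the standard UTF-16 arithmetic and compares the result with Python's own utf-16be codec. The sample code point U+1F600 is just an arbitrary character outside the 16-bit range.

```python
def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('code point does not need a surrogate pair')
    offset = code_point - 0x10000          # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)         # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))

# The two halves, written big-endian, are exactly what the codec produces.
by_hand = high.to_bytes(2, 'big') + low.to_bytes(2, 'big')
print(by_hand == chr(0x1F600).encode('utf-16be'))
```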


A lot of this complexity is because of historical reasons. For example, 
when Unicode was first invented, there were only about 65 thousand possible 
characters, and a fixed 16 bits was all you needed. But it was soon learned 
that 65 thousand was not enough (there are more than 65,000 Asian characters 
alone!) and so UTF-16 developed the trick with surrogate pairs to cover 
the extras.


[...]
 When locale to linux system is set to utf-8 that would mean that the 
 linux applications, should try to encode string into hdd by using 
 system's default encoding to utf-8 and read them back from bytes by
 also using utf-8. Is that correct?

Yes.
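If you want to check what encoding Python picks up from the system, a quick sketch using the standard locale and sys modules follows. The values printed depend on the machine and its locale settings, so none are shown here.

```python
import locale
import sys

# Encoding Python assumes for text data, derived from the locale settings.
preferred = locale.getpreferredencoding(False)

# Encoding Python uses to convert filenames between str and bytes.
filesystem = sys.getfilesystemencoding()

print(preferred)
print(filesystem)
```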



-- 
Steven


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-09 Thread Steven D'Aprano
On Sat, 08 Jun 2013 22:09:57 -0700, nagia.retsina wrote:

 ord('A') would give me the mapping of this char, the number 65 while
 chr(65) would output the char 'A' likewise.

Correct. Python uses Unicode, where code-point 65 (ordinal value 65) 
means letter A.

There are older encodings. For example, a very old one, used on IBM 
mainframes, is EBCDIC, where ordinal value 65 means the letter â, and 
the letter A has ordinal value 193.
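A quick way to check the direction of the two built-ins: ord goes from character to number, and chr goes from number to character. The EBCDIC line assumes Python's cp037 codec as a representative EBCDIC code page; EBCDIC has many variants, so that choice is only for illustration.

```python
print(ord('A'))                  # character -> code point
print(chr(65))                   # code point -> character

# Assumption: cp037 stands in for "EBCDIC" here; other EBCDIC code pages exist.
print('A'.encode('cp037')[0])    # byte value of 'A' in that code page
```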

 
 What would happen if we we try to re-encode bytes on the disk? like
 trying:
 
 s = 'νίκος'
 utf8_bytes = s.encode('utf-8')
 greek_bytes = utf8_bytes.encode('iso-8859-7')
 
 Can we re-encode twice or as many times we want and then decode back
 respectively like?

Of course. Bytes have no memory of where they came from, or what they are 
used for. All you are doing is flipping bits on a memory chip, or on a 
hard drive. So long as *you* remember which encoding is the right one, 
there is no problem. If you forget, and start using the wrong one, you 
will get garbage characters, mojibake, or errors.
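One practical note, as a sketch: in Python 3 a bytes object has no encode method, so "re-encoding" means decoding with the encoding the bytes were written in and then encoding the resulting string with the new one. The last lines show what happens if you remember the wrong encoding (latin-1 is used for that because it accepts every byte value and so never raises).

```python
s = 'νίκος'

utf8_bytes = s.encode('utf-8')
# Decode with the encoding the bytes are really in, then encode again.
greek_bytes = utf8_bytes.decode('utf-8').encode('iso-8859-7')

print(len(utf8_bytes), len(greek_bytes))
print(greek_bytes.decode('iso-8859-7') == s)   # round trip with the right codec

# Using the wrong codec does not fail here, it just produces mojibake.
mojibake = utf8_bytes.decode('latin-1')
print(mojibake == s)
```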

[...]
 And also is there a difference between encoding and compressing?

Of course. They are totally unrelated.

 Isn't the latter using some form of encoding to take a string or bytes
 to make them hold less space on disk?

Correct, except forget about "encoding". It's not relevant (except, 
maybe, in a mathematical sense) and will just confuse you.
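To keep the two ideas apart, here is a small sketch using the standard zlib module. Encoding turns text into bytes, and compression then turns those bytes into (usually) fewer bytes. The repeated sample text is made up for the example so that there is something to compress.

```python
import zlib

text = 'νίκος ' * 100

raw = text.encode('utf-8')       # encoding: str -> bytes
packed = zlib.compress(raw)      # compression: bytes -> fewer bytes

print(len(text), len(raw), len(packed))

# Undo the steps in reverse order: decompress first, then decode.
restored = zlib.decompress(packed).decode('utf-8')
print(restored == text)
```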


-- 
Steven


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-09 Thread nagia . retsina
Thanks Steven, I'll read them in a bit. When I have read them, can you perhaps tell me 
what's wrong and why I am still getting decode issues?

[CODE]
# =================================================================
# If user downloaded a file, thank the user !!!
# =================================================================
if filename:
    # update file counter if cookie does not exist
    if not nikos:
        cur.execute('''UPDATE files SET hits = hits + 1, host = %s, 
lastvisit = %s WHERE url = %s''', (host, lastvisit, filename) )

    print('''h2font color=blueΤο αρχείο font color=red %s font 
color=blueκατεβαίνει!''' % filename )
    print('''brimg src=/data/images/thanks.gif''')
    print('''brbrbrh3font color=blueΚαι τώρα Tetris μέχρι να 
ολοκληρωθεί :-)''' )
    print('''brobject 
classid=clsid:d27cdb6e-ae6d-11cf-96b8-44455354 
codebase=http://fpdownload.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,0,0;
 width=450 height=300param name=menu value=false /param 
name=movie value=http://www.fugly.com/f/1e6d8cd7b905f4e1bf72; /param 
name=quality value=high /embed 
src=http://www.fugly.com/f/1e6d8cd7b905f4e1bf72; AllowScriptAccess=always 
menu=false quality=high width=450 height=300 name=FuglyGame 
align=middle type=application/x-shockwave-flash 
pluginspage=http://www.macromedia.com/go/getflashplayer; //object''')

    print( '''meta http-equiv=REFRESH content=2;/data/apps/%s''' % 
filename )
    sys.exit(0)


# 
=
# Display download button for each file and download it on click
# 
=
print('''body background='/data/images/star.jpg'
 centerimg src='/data/images/download.gif'brbr
 table border=5 cellpadding=5 bgcolor=green
''')


#
# Collect directory and its filenames as bytes
path = b'/home/nikos/public_html/data/apps/'
files = os.listdir( path )

for filename in files:
    # Compute 'path/to/filename'
    filepath_bytes = path + filename
    for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
        try:
            filepath = filepath_bytes.decode( encoding )
        except UnicodeDecodeError:
            continue

        # Rename to something valid in UTF-8
        if encoding != 'utf-8':
            os.rename( filepath_bytes, filepath.encode('utf-8') )

        assert os.path.exists( filepath )
        break
    else:
        # This only runs if we never reached the break
        raise ValueError( 'unable to clean filename %r' % filepath_bytes )


#
# Collect filenames of the path dir as strings
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for filename in filenames:
    try:
        # Check the presence of a file against the database and insert
        # if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
        data = cur.fetchone()

        if not data:
            # First time for file; primary key is automatic, hit is defaulted
            print( "iam here", filename + '\n' )
            cur.execute('''INSERT INTO files (url, host, lastvisit) 
VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
    except pymysql.ProgrammingError as e:
        print( repr(e) )


#
# Collect filenames of the path dir as strings
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )
filepaths = set()

# Build a set of 'path/to/filename' based on the objects of path dir
for filename in filenames:
    filepaths.add( filename )

# Delete spurious
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()

# Check database's filenames against path's filenames
for rec in data:
    if rec not in filepaths:
        cur.execute('''DELETE FROM files WHERE url = %s''', rec )
[/CODE] 

When I try to run it, it is still erroring out:

[CODE]
[Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173] Original exception 
was:, referer: http://superhost.gr/
[Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173] Traceback (most 
recent call last):, referer: http://superhost.gr/
[Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173]   File 
/home/nikos/public_html/cgi-bin/files.py, line 83, in module, referer: 
http://superhost.gr/
[Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173] assert 
os.path.exists( 

Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-09 Thread Cameron Simpson
On 09Jun2013 06:25, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info 
wrote:
| [... heaps of useful explanation ...]
|  When the locale of the Linux system is set to utf-8, that would mean that 
|  Linux applications should encode strings onto the hdd using the system's 
|  default encoding, utf-8, and read them back from bytes by also using 
|  utf-8. Is that correct?
| 
| Yes.

Although I'd point out that only applications that care about text
as _text_ need to consider Unicode and the encoding. A command like
"mv" does not care. You type the command and mv receives byte
strings as its arguments. So it is doing straightforward bytes
file renames. It does not care or even know about encodings.
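
A minimal sketch of the same idea from Python, assuming a Linux-style filesystem where names are plain byte strings (the filenames and the temporary directory are made up for illustration): when os functions are given bytes paths, no decoding happens anywhere.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    tmp_bytes = os.fsencode(tmp)

    # Hypothetical filenames, handled purely as bytes from start to finish.
    old = os.path.join(tmp_bytes, "\u03b1\u03b2\u03b3".encode("utf-8"))
    new = os.path.join(tmp_bytes, "\u03b4\u03b5\u03b6".encode("utf-8"))

    open(old, "wb").close()   # create an empty file under the old name
    os.rename(old, new)       # a bytes-to-bytes rename, much like mv

    # A bytes argument makes os.listdir() return bytes names as well.
    listing = os.listdir(tmp_bytes)

print(listing)
```

Only when the names have to be shown to a person does somebody need to decide which encoding they are in.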

In this scenario, really it is the Terminal program (eg Putty) which
cares about text (what you type, and what gets displayed). It is
because of mismatches between your Terminal locale settings and the
encoding that was chosen for the filenames that you get garbage
listings, one way or another.
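
A minimal sketch of the three possible outcomes, assuming a name that was stored as ISO-8859-7 bytes (the sample string is made up for illustration): decoding with the matching encoding gives the text back, decoding with a wrong but permissive encoding gives garbage, and decoding with a wrong and strict encoding raises an error.

```python
name = "\u03b1\u03b2\u03b3"          # the three letters alpha, beta, gamma
raw = name.encode("iso-8859-7")      # the bytes as stored: b'\xe1\xe2\xe3'

# 1. Matching encoding: the original text comes back.
good = raw.decode("iso-8859-7")

# 2. Wrong encoding that accepts any byte: no error, but mojibake.
garbage = raw.decode("latin-1")

# 3. Wrong encoding with strict rules: the bytes are not valid UTF-8.
try:
    raw.decode("utf-8")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print("error:", exc.reason)

print(good, garbage, failed)
```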

Cheers,
-- 
Cameron Simpson c...@zip.com.au

But then, I'm only 50. Things may well get a bit much for me when I
reach the gasping heights of senile decrepitude of which old Andy
Woodward speaks with such feeling.
- Chris Malcolm, c...@uk.ac.ed.aifh, DoD #205


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-09 Thread Νικόλαος Κούρας
I'm sorry, I posted unnecessary code by mistake. Here is the correct code that 
produced the above error:


#
# Collect directory and its filenames as bytes
path = b'/home/nikos/public_html/data/apps/'
files = os.listdir( path )

for filename in files:
    # Compute 'path/to/filename'
    filepath_bytes = path + filename
    for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
        try:
            filepath = filepath_bytes.decode( encoding )
        except UnicodeDecodeError:
            continue

        # Rename to something valid in UTF-8
        if encoding != 'utf-8':
            os.rename( filepath_bytes, filepath.encode('utf-8') )

        assert os.path.exists( filepath )
        break
    else:
        # This only runs if we never reached the break
        raise ValueError( 'unable to clean filename %r' % filepath_bytes )


#
# Collect filenames of the path dir as strings
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for filename in filenames:
    try:
        # Check the presence of a file against the database and insert
        # if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
        data = cur.fetchone()

        if not data:
            # First time for file; primary key is automatic, hit is defaulted
            print( "iam here", filename + '\n' )
            cur.execute('''INSERT INTO files (url, host, lastvisit) 
VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
    except pymysql.ProgrammingError as e:
        print( repr(e) )


#
# Collect filenames of the path dir as strings
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )
filepaths = set()

# Build a set of 'path/to/filename' based on the objects of path dir
for filename in filenames:
    filepaths.add( filename )

# Delete spurious
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()

# Check database's filenames against path's filenames
for rec in data:
    if rec not in filepaths:
        cur.execute('''DELETE FROM files WHERE url = %s''', rec )


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-09 Thread Steven D'Aprano
On Sun, 09 Jun 2013 00:00:53 -0700, nagia.retsina wrote:

 path = b'/home/nikos/public_html/data/apps/'
 files = os.listdir( path )
 
 for filename in files:
     # Compute 'path/to/filename'
     filepath_bytes = path + filename
     for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
         try:
             filepath = filepath_bytes.decode( encoding )
         except UnicodeDecodeError:
             continue
 
         # Rename to something valid in UTF-8
         if encoding != 'utf-8':
             os.rename( filepath_bytes, 
                        filepath.encode('utf-8') )
         assert os.path.exists( filepath )
         break
     else:
         # This only runs if we never reached the break 
         raise ValueError(
             'unable to clean filename %r' % filepath_bytes )

Editing the traceback to get rid of unnecessary noise from the logging:

Traceback (most recent call last):
  File "/home/nikos/public_html/cgi-bin/files.py", line 83, in <module>
    assert os.path.exists( filepath )
  File "/usr/local/lib/python3.3/genericpath.py", line 18, in exists
    os.stat(path)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 
34-37: ordinal not in range(128)


 Why am I still receiving Unicode decode errors? With the help of you guys
 we have written a procedure just to avoid this kind of decoding issue
 and rename all greek_byted_filenames to utf-8_byted.

That's a very good question. It works for me when I test it, so I cannot 
explain why it fails for you.

Please try this: log into the Linux server, and then start up a Python 
interactive session by entering:

python3.3

at the $ prompt. Then, at the >>> prompt, enter these lines of code. You 
can copy and paste them:


import os, sys
print(sys.version)
s = ('\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER BETA}'
 '\N{GREEK SMALL LETTER GAMMA}\N{GREEK SMALL LETTER DELTA}'
 '\N{GREEK SMALL LETTER EPSILON}')
print(s)
filename = '/tmp/' + s
open(filename, 'w')
os.path.exists(filename)


Copy and paste the results back here please.



 Is it the assert that fails? Do we have some logic error someplace I don't
 see?

Please read the error message. Does it say AssertionError?

If it says AssertionError, then the assert has failed. If it says 
something else, the code failed before the assert can run.
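
A minimal sketch of the difference, assuming a hypothetical failing_condition() helper that stands in for os.path.exists() raising before it can return: the first assert fails with AssertionError, while in the second case the exception escapes before the assert ever gets a value to test.

```python
def failing_condition():
    # Hypothetical stand-in for a call that blows up before returning.
    raise UnicodeEncodeError("ascii", "\u03b1\u03b2\u03b3", 0, 3,
                             "ordinal not in range(128)")

outcomes = []
for condition in (lambda: False, failing_condition):
    try:
        assert condition()
    except AssertionError:
        # The condition was evaluated and turned out to be false.
        outcomes.append("assert failed")
    except UnicodeEncodeError as exc:
        # The condition never produced a value; the assert did not run.
        outcomes.append("error before assert: " + type(exc).__name__)

print(outcomes)
```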


-- 
Steven


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-09 Thread Lele Gaifax
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes:

 On Sat, 08 Jun 2013 22:09:57 -0700, nagia.retsina wrote:

 chr('A') would give me the mapping of this char, the number 65 while
 ord(65) would output the char 'A' likewise.

 Correct. Python uses Unicode, where code-point 65 (ordinal value 65) 
 means letter A.

Actually, that's the other way around:

>>> chr(65)
'A'
>>> ord('A')
65

 What would happen if we try to re-encode bytes on the disk? Like
 trying:
 
 s = "νίκος"
 utf8_bytes = s.encode('utf-8')
 greek_bytes = utf_bytes.encode('iso-8869-7')
 
 Can we re-encode twice or as many times as we want and then decode back
 respectively, like that?

 Of course. Bytes have no memory of where they came from, or what they are 
 used for. All you are doing is flipping bits on a memory chip, or on a 
 hard drive. So long as *you* remember which encoding is the right one, 
 there is no problem. If you forget, and start using the wrong one, you 
 will get garbage characters, mojibake, or errors.

Uhm, no: encode transforms a Unicode string into an array of bytes,
decode does the opposite transformation. You cannot do the former on
an arbitrary array of bytes:

>>> s = "νίκος"
>>> utf8_bytes = s.encode('utf-8')
>>> greek_bytes = utf8_bytes.encode('iso-8869-7')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'encode'

ciao, lele.
-- 
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
l...@metapensiero.it  | -- Fortunato Depero, 1929.



Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-09 Thread Νικόλαος Κούρας
Steven wrote:
 Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for 
 values up to 256? 

Because then how do you tell when you need one byte, and when you need 
two? If you read two bytes, and see 0x4C 0xFA, does that mean two 
characters, with ordinal values 0x4C and 0xFA, or one character with 
ordinal value 0x4CFA? 

I mean utf-8 could use 1 byte for storing the first 256 characters. I meant up to 
256, not above 256.


 UTF-8 and UTF-16 and UTF-32 
 I thought the number beside UTF- was to declare how many bits the 
 character set was using to store a character on the hdd, no? 

Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values. 
UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit 
values to make a surrogate pair.

Is a surrogate pair like hitting, for example, Ctrl-A, meaning it is a 
combination character that consists of 2 different characters?
Is this what a surrogate is? A pair of 2 chars?


UTF-8 uses 8-bit values, but sometimes 
it combines two, three or four of them to represent a single code-point.

Does 'a' need 1 byte to be stored when utf8 encoded? (since ordinal = 65)
Does 'α΄' need 2 bytes to be stored when utf8 encoded? (since ordinal is > 127)
Does a Chinese ideogram need 4 bytes to be stored when utf8 encoded? (since 
ordinal > 65000)

Does the number of bytes needed to store a character depend solely on the 
character's ordinal value in the Unicode table?


UTF-8 solves this problem by reserving some values to mean "this byte, on 
its own", and others to mean "this byte, plus the next byte, together", 
and so forth, up to four bytes.

Are some of the utf-8 bits that are used to represent a character's ordinal value 
actually also used to separate or join the ordinal values themselves?
Can you give an example please? How are they being separated?


Computers are digital and work with numbers.


So character 'A' -> 65 (decimal, as used in the charset's table) -> 01011100 (as 
binary stored on disk) -> 0xEF (as hex, when we open the file with a hex 
editor)

Is this how the thing works? (above values are fictional)


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-09 Thread Νικόλαος Κούρας
On Sunday, 9 June 2013 11:02:48 AM UTC+3, Cameron Simpson wrote:

 In this scenario, really it is the Terminal program (eg Putty) which
 cares about text (what you type, and what gets displayed). It is
 because of mismatches between your Terminal local settings and the
 encoding that was chosen for the filenames that you get garbage
 listings, one way or another.

Can you please give an example that shows a string being greek-iso encoded and 
then being utf8 decoded and presented back:

1. properly
2. as garbage (I know it means trash, but I don't know what a garbage char is)
3. as an error


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-09 Thread Νικόλαος Κούρας
On Sunday, 9 June 2013 11:55:43 AM UTC+3, Lele Gaifax wrote:
 Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes:
 
  On Sat, 08 Jun 2013 22:09:57 -0700, nagia.retsina wrote:
 
  chr('A') would give me the mapping of this char, the number 65 while
  ord(65) would output the char 'A' likewise.
 
  Correct. Python uses Unicode, where code-point 65 (ordinal value 65) 
  means letter A.
 
 Actually, that's the other way around:
 
 >>> chr(65)
 'A'
 >>> ord('A')
 65
 
  What would happen if we we try to re-encode bytes on the disk? like
  trying:
  
  s = "νίκος"
  utf8_bytes = s.encode('utf-8')
  greek_bytes = utf_bytes.encode('iso-8869-7')
  
  Can we re-encode twice or as many times we want and then decode back
  respectively lke?
 
  Of course. Bytes have no memory of where they came from, or what they are 
  used for. All you are doing is flipping bits on a memory chip, or on a 
  hard drive. So long as *you* remember which encoding is the right one, 
  there is no problem. If you forget, and start using the wrong one, you 
  will get garbage characters, mojibake, or errors.
 
 Uhm, no: encode transforms a Unicode string into an array of bytes,
 decode does the opposite transformation. You cannot do the former on
 an arbitrary array of bytes:
 
 >>> s = "νίκος"
 >>> utf8_bytes = s.encode('utf-8')
 >>> greek_bytes = utf8_bytes.encode('iso-8869-7')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 AttributeError: 'bytes' object has no attribute 'encode'

So something encoded into bytes cannot be re-encoded to some other bytes.

How about a string, I wonder?
s = "νίκος"
what_are these_bytes = s.encode('iso-8869-7').encode(utf-8')


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-09 Thread Cameron Simpson
On 09Jun2013 02:00, Νίκος Γκρ33κ nikos.gr...@gmail.com wrote:
| Steven wrote:
|  Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for 
|  values up to 256? 
| 
| Because then how do you tell when you need one byte, and when you need 
| two? If you read two bytes, and see 0x4C 0xFA, does that mean two 
| characters, with ordinal values 0x4C and 0xFA, or one character with 
| ordinal value 0x4CFA? 
| 
| I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up 
| to 256, not above 256.

Then it would not be UTF-8. UTF-8 can encode any Unicode code point. Your 
suggestion cannot.

I'd point out that if you did this, you'd be back in the same
situation you just encountered with ASCII: the first above-255 value
would raise a UnicodeEncodeError (an error which does not even exist
at present:-)
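
A minimal sketch of that situation, using Python's latin-1 codec as a stand-in for "one byte for each of the first 256 code points" (the sample characters are chosen only for illustration): everything up to 255 fits, and the very next code point cannot be encoded at all.

```python
# Code point 255 still fits in a single byte under latin-1.
last_ok = "\u00ff".encode("latin-1")

# Code point 256 (LATIN CAPITAL LETTER A WITH MACRON) has nowhere to go.
try:
    "\u0100".encode("latin-1")
    error_name = None
except UnicodeEncodeError as exc:
    error_name = type(exc).__name__

# UTF-8 handles both; it simply spends two bytes on each of them.
utf8_lengths = [len(ch.encode("utf-8")) for ch in ("\u00ff", "\u0100")]

print(last_ok, error_name, utf8_lengths)
```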

|  UTF-8 and UTF-16 and UTF-32 
|  I though the number beside of UTF- was to declare how many bits the 
|  character set was using to store a character into the hdd, no? 
| 
| Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values. 
| UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit 
| values to make a surrogate pair.
| 
| A surrogate pair is like itting for example Ctrl-A, which means is a 
| combination character that consists of 2 different characters?
| Is this what a surrogate is? a pari of 2 chars?

Essentially. The combination represents a code point.

| UTF-8 uses 8-bit values, but sometimes 
| it combines two, three or four of them to represent a single code-point.
| 
| 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)
| 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 )
| 'a chinese ideogramm' to be utf8 encoded needs 4 byte to be stored ? (since 
| ordinal > 65000 )
| 
| The amount of bytes needed to store a character solely depends on the 
| character's ordinal value in the Unicode table?

Essentially. You can read up on the exact process in Wikipedia or the Unicode 
Standard.
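
A minimal sketch for checking this at the interpreter (the four sample characters are chosen only for illustration): it prints each character's code point, the number of UTF-8 bytes, and the bits of each byte. The leading bits of the first byte say how long the sequence is, and every following byte starts with 10.

```python
samples = ["a", "\u03b1", "\u4e2d", "\U0001F600"]  # Latin, Greek, CJK, emoji

for ch in samples:
    encoded = ch.encode("utf-8")
    bits = " ".join(format(byte, "08b") for byte in encoded)
    print("U+%04X  %d byte(s)  %s" % (ord(ch), len(encoded), bits))

lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)
```

Note that the common Chinese ideograph in the sample needs three bytes, not four; four-byte sequences are only needed for code points above U+FFFF.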

Cheers,
-- 
Cameron Simpson c...@zip.com.au

The most annoying thing about being without my files after our disc crash was
discovering once again how widespread BLINK was on the web.


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-09 Thread Νικόλαος Κούρας
On Sunday, 9 June 2013 11:15:07 AM UTC+3, Steven D'Aprano wrote:

 Please try this: log into the Linux server, and then start up a Python 

 import os, sys 
 print(sys.version)
 s = ('\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER BETA}' 
  '\N{GREEK SMALL LETTER GAMMA}\N{GREEK SMALL LETTER DELTA}' 
  '\N{GREEK SMALL LETTER EPSILON}')
 print(s)
 filename = '/tmp/' + s
 open(filename, 'w')
 os.path.exists(filename)

 Copy and paste the results back here please.

Of course: here it is:

root@nikos [/home/nikos/www/cgi-bin]# python
Python 3.3.2 (default, Jun  3 2013, 16:18:05)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, sys
>>> print(sys.version)
3.3.2 (default, Jun  3 2013, 16:18:05)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-3)]
>>> s = ('\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER BETA}'
...  '\N{GREEK SMALL LETTER GAMMA}\N{GREEK SMALL LETTER DELTA}'
...  '\N{GREEK SMALL LETTER EPSILON}')
>>> print(s)
αβγδε
>>> filename = '/tmp/' + s
>>> open(filename, 'w')
<_io.TextIOWrapper name='/tmp/αβγδε' mode='w' encoding='UTF-8'>
>>> os.path.exists(filename)
True




Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-09 Thread Cameron Simpson
On 09Jun2013 08:15, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info 
wrote:
| On Sun, 09 Jun 2013 00:00:53 -0700, nagia.retsina wrote:
|  path = b'/home/nikos/public_html/data/apps/'
|  files = os.listdir( path )
|  
|  for filename in files:
|      # Compute 'path/to/filename'
|      filepath_bytes = path + filename
|      for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
|          try:
|              filepath = filepath_bytes.decode( encoding )
|          except UnicodeDecodeError:
|              continue
|  
|          # Rename to something valid in UTF-8
|          if encoding != 'utf-8':
|              os.rename( filepath_bytes, 
|                         filepath.encode('utf-8') )
|          assert os.path.exists( filepath )
|          break
|      else:
|          # This only runs if we never reached the break 
|          raise ValueError(
|              'unable to clean filename %r' % filepath_bytes )
| 
| Editing the traceback to get rid of unnecessary noise from the logging:
| 
| Traceback (most recent call last):
|   File "/home/nikos/public_html/cgi-bin/files.py", line 83, in <module>
|     assert os.path.exists( filepath )
|   File "/usr/local/lib/python3.3/genericpath.py", line 18, in exists
|     os.stat(path)
| UnicodeEncodeError: 'ascii' codec can't encode characters in position 
| 34-37: ordinal not in range(128)
| 
|  Why am i still receing unicode decore errors? With the help of you guys
|  we have writen a prodecure just to avoid this kind of decoding issues
|  and rename all greek_byted_filenames to utf-8_byted.
| 
| That's a very good question. It works for me when I test it, so I cannot 
| explain why it fails for you.

If he's lucky the UnicodeEncodeError occurred while trying to print
an error message, printing a greek Unicode string in the error with
ASCII as the output encoding (default when not a tty IIRC).
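
A minimal sketch for checking which encodings a given process is actually using, and what an ASCII codec does with a Greek name (the path below is made up for illustration). It is only an assumption that the CGI environment reports ASCII here; running this under the web server rather than in a login shell is what would confirm it.

```python
import sys

# Encodings Python picked up from the environment it was started in.
print("filesystem:", sys.getfilesystemencoding())
print("stdout:    ", sys.stdout.encoding)

# What happens when a str with Greek letters must become ASCII bytes.
name = "/tmp/\u03b1\u03b2\u03b3\u03b4\u03b5"
try:
    name.encode("ascii")
    message = ""
except UnicodeEncodeError as exc:
    message = str(exc)

print(message)
```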

Cheers,
-- 
Cameron Simpson c...@zip.com.au

I generally avoid temptation unless I can't resist it.  - Mae West


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-09 Thread Lele Gaifax
Νικόλαος Κούρας nikos.gr...@gmail.com writes:

 On Sunday, 9 June 2013 11:55:43 AM UTC+3, Lele Gaifax wrote:
 Uhm, no: encode transforms a Unicode string into an array of bytes,
 decode does the opposite transformation. You cannot do the former on
 an arbitrary array of bytes:
 
 >>> s = "νίκος"
 >>> utf8_bytes = s.encode('utf-8')
 >>> greek_bytes = utf8_bytes.encode('iso-8869-7')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 AttributeError: 'bytes' object has no attribute 'encode'

 So something encoded into bytes cannot be re-encoded to some other bytes.

 How about a string i wonder?
 s = "νίκος"
 what_are these_bytes = s.encode('iso-8869-7').encode(utf-8')

Ignoring the usual syntax error, this is just a variant of the code I
posted: “s.encode('iso-8869-7')” produces a bytes instance which
*cannot* be re-encoded again in whatever encoding.

ciao, lele.
-- 
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
l...@metapensiero.it  | -- Fortunato Depero, 1929.



Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-09 Thread Νικόλαος Κούρας
On Sunday, 9 June 2013 12:12:36 PM UTC+3, Cameron Simpson wrote:
 On 09Jun2013 02:00, Νίκος Γκρ33κ nikos.gr...@gmail.com wrote:
 | Steven wrote:
 |  Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for 
 |  values up to 256? 
 | 
 | Because then how do you tell when you need one byte, and when you need 
 | two? If you read two bytes, and see 0x4C 0xFA, does that mean two 
 | characters, with ordinal values 0x4C and 0xFA, or one character with 
 | ordinal value 0x4CFA? 
 | 
 | I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant 
 | up to 256, not above 256.
 
 Then it would not be UTF-8. UTF-8 will encode an Unicode codepoint. Your 
 suggestion will not.

I don't follow.

 |  UTF-8 and UTF-16 and UTF-32 
 |  I though the number beside of UTF- was to declare how many bits the 
 |  character set was using to store a character into the hdd, no? 
 | 
 | Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values. 
 | UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit 
 | values to make a surrogate pair.
 | 
 | A surrogate pair is like itting for example Ctrl-A, which means is a 
 | combination character that consists of 2 different characters?
 | Is this what a surrogate is? a pari of 2 chars?
 
 Essentially. The combination represents a code point.
 
 | UTF-8 uses 8-bit values, but sometimes 
 | it combines two, three or four of them to represent a single code-point.
 | 
 | 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)
 | 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 )
 | 'a chinese ideogramm' to be utf8 encoded needs 4 byte to be stored ? (since 
 | ordinal > 65000 )
 | 
 | The amount of bytes needed to store a character solely depends on the 
 | character's ordinal value in the Unicode table?
 
 Essentially. You can read up on the exact process in Wikipedia or the Unicode 
 Standard.



When you say "essentially", does that mean you agree with my statements?


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-09 Thread Νικόλαος Κούρας
On Sunday, 9 June 2013 12:20:58 PM UTC+3, Lele Gaifax wrote:

  How about a string i wonder? 
  s = νίκος 
  what_are these_bytes = s.encode('iso-8869-7').encode(utf-8')

 Ignoring the usual syntax error, this is just a variant of the code I 
 posted: s.encode('iso-8869-7') produces a bytes instance which
 *cannot* be re-encoded again in whatever encoding.

s = 'a'
s = s.encode('iso-8859-7').decode('utf-8')
print( s )

a (we got the original character back)

s = 'α'
s = s.encode('iso-8859-7').decode('utf-8')
print( s )

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 0: 
unexpected end of data

Why this error? Because the ordinal value of 'α' is > 127?


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-09 Thread Νικόλαος Κούρας
On Sunday, 9 June 2013 12:14:12 PM UTC+3, Νικόλαος Κούρας wrote:
 On Sunday, 9 June 2013 11:15:07 AM UTC+3, Steven D'Aprano wrote:
 
  Please try this: log into the Linux server, and then start up a Python 
 
  import os, sys 
  print(sys.version)
  s = ('\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER BETA}' 
   '\N{GREEK SMALL LETTER GAMMA}\N{GREEK SMALL LETTER DELTA}' 
   '\N{GREEK SMALL LETTER EPSILON}')
  print(s)
  filename = '/tmp/' + s
  open(filename, 'w')
  os.path.exists(filename)
 
  Copy and paste the results back here please.
 
 Of course: here it is:
 
 root@nikos [/home/nikos/www/cgi-bin]# python
 Python 3.3.2 (default, Jun  3 2013, 16:18:05)
 [GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux
 Type "help", "copyright", "credits" or "license" for more information.
 >>> import os, sys
 >>> print(sys.version)
 3.3.2 (default, Jun  3 2013, 16:18:05)
 [GCC 4.4.7 20120313 (Red Hat 4.4.7-3)]
 >>> s = ('\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER BETA}'
 ...  '\N{GREEK SMALL LETTER GAMMA}\N{GREEK SMALL LETTER DELTA}'
 ...  '\N{GREEK SMALL LETTER EPSILON}')
 >>> print(s)
 αβγδε
 >>> filename = '/tmp/' + s
 >>> open(filename, 'w')
 <_io.TextIOWrapper name='/tmp/αβγδε' mode='w' encoding='UTF-8'>
 >>> os.path.exists(filename)
 True
 

I don't know much, but it looks correct to me. Then again, why this error?


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-09 Thread Νικόλαος Κούρας
I know I have been a pain in the ass these days, but this is the last 
explanation I want from you, just to understand it completely.

 Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for 
 values up to 256? 

Because then how do you tell when you need one byte, and when you need 
two? If you read two bytes, and see 0x4C 0xFA, does that mean two 
characters, with ordinal values 0x4C and 0xFA, or one character with 
ordinal value 0x4CFA? 

I mean utf-8 could use 1 byte for storing the first 256 characters. I meant up to 
256, not above 256.


 UTF-8 and UTF-16 and UTF-32 
 I thought the number beside UTF- was to declare how many bits the 
 character set was using to store a character on the hdd, no? 

Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values. 
UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit 
values to make a surrogate pair.

Is a surrogate pair like hitting, for example, Ctrl-A, meaning it is a 
combination character that consists of 2 different characters?
Is this what a surrogate is? A pair of 2 chars?


UTF-8 uses 8-bit values, but sometimes 
it combines two, three or four of them to represent a single code-point.

Does 'a' need 1 byte to be stored when utf8 encoded? (since ordinal = 65)
Does 'α΄' need 2 bytes to be stored when utf8 encoded? (since ordinal is > 127)
Does a Chinese ideogram need 4 bytes to be stored when utf8 encoded? (since 
ordinal > 65000)

Does the number of bytes needed to store a character depend solely on the 
character's ordinal value in the Unicode table?


UTF-8 solves this problem by reserving some values to mean "this byte, on 
its own", and others to mean "this byte, plus the next byte, together", 
and so forth, up to four bytes.

Are some of the utf-8 bits that are used to represent a character's ordinal value 
actually also used to separate or join the ordinal values themselves?
Can you give an example please? How are they being separated?


Computers are digital and work with numbers.


So character 'A' -> 65 (decimal, as used in the charset's table) -> 01011100 (as 
binary stored on disk) -> 0xEF (as hex, when we open the file with a hex 
editor)

Is this how the thing works? (above values are fictional)


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-09 Thread Steven D'Aprano
On Sun, 09 Jun 2013 10:55:43 +0200, Lele Gaifax wrote:

 Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes:
 
 On Sat, 08 Jun 2013 22:09:57 -0700, nagia.retsina wrote:

 chr('A') would give me the mapping of this char, the number 65 while
 ord(65) would output the char 'A' likewise.

 Correct. Python uses Unicode, where code-point 65 (ordinal value 65)
 means letter A.
 
 Actually, that's the other way around:
 
 >>> chr(65)
 'A'
 >>> ord('A')
 65

/facepalm 

Of course you are right.


 What would happen if we try to re-encode bytes on the disk? Like
 trying:
 
 s = "νίκος"
 utf8_bytes = s.encode('utf-8')
 greek_bytes = utf_bytes.encode('iso-8869-7')
 
 Can we re-encode twice or as many times as we want and then decode back
 respectively, like that?

 Of course. [...]

 Uhm, no: encode transforms a Unicode string into an array of bytes,
 decode does the opposite transformation. You cannot do the former on
 an arbitrary array of bytes:

And two for two. I misunderstood Nikos' question.

As you point out, no, Python 3 will not allow you to re-encode bytes. You 
must first decode them to a string, then encode them using a 
different encoding. (I thought that this was what Nikos actually meant, 
but on re-reading his question more closely, that's not actually what 
he asked.)
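
A minimal sketch of that decode-then-encode round trip, assuming the intended target was 'iso-8859-7' (the correct spelling of the Greek codec name) and using escapes for the same five letters as the example string:

```python
s = "\u03bd\u03af\u03ba\u03bf\u03c2"       # the string "νίκος"
utf8_bytes = s.encode("utf-8")

# bytes -> str -> bytes: decode with the encoding the bytes are in,
# then encode the resulting string with the encoding you want.
greek_bytes = utf8_bytes.decode("utf-8").encode("iso-8859-7")

print(len(utf8_bytes), len(greek_bytes))
print(greek_bytes.decode("iso-8859-7") == s)
```

The same text takes ten bytes in UTF-8 and five in ISO-8859-7, and each byte string only makes sense when decoded with the encoding that produced it.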

Sorry for any confusion.


-- 
Steven


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-09 Thread Νικόλαος Κούρας
Please tell me that this can actually be solved.
I am willing to try anything for 'files.py' to load properly.
Everything else works as expected on my website; I have managed to correct 
pelatologio.py and koukos.py.

This is the last thing the website needs: for files.py to load, so users can 
grab important files in Greek format.


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-09 Thread Andreas Perstinger

On 09.06.2013 11:38, Νικόλαος Κούρας wrote:

s = 'α'
s = s.encode('iso-8859-7').decode('utf-8')
print( s )

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 0: 
unexpected end of data

Why this error? because 'a' ordinal value > 127 ?


>>> s = 'α'
>>> s.encode('iso-8859-7')
b'\xe1'
>>> bin(0xe1)
'0b11100001'

Now look at the table on https://en.wikipedia.org/wiki/UTF-8#Description 
to find out how many bytes a UTF-8 decoder expects when it reads that value.
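
A minimal sketch of what that table says, using a hypothetical expected_length() helper written for this illustration (it is simplified and ignores the handful of lead-byte values that UTF-8 forbids outright): the top bits of the first byte tell the decoder how many bytes the sequence should have.

```python
def expected_length(lead_byte):
    """Return how many bytes a UTF-8 sequence starting with lead_byte needs."""
    if lead_byte >> 7 == 0b0:        # 0xxxxxxx: one byte, plain ASCII
        return 1
    if lead_byte >> 5 == 0b110:      # 110xxxxx: two-byte sequence
        return 2
    if lead_byte >> 4 == 0b1110:     # 1110xxxx: three-byte sequence
        return 3
    if lead_byte >> 3 == 0b11110:    # 11110xxx: four-byte sequence
        return 4
    return None                      # 10xxxxxx: continuation byte, not a lead

needed = expected_length(0xE1)       # 0xE1 is 11100001 in binary

try:
    b"\xe1".decode("utf-8")
    reason = None
except UnicodeDecodeError as exc:
    reason = exc.reason

print(needed, reason)
```

The decoder sees a lead byte that promises three bytes, finds only one, and reports exactly that.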


Bye, Andreas


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-09 Thread Steven D'Aprano
On Sun, 09 Jun 2013 02:00:46 -0700, Νικόλαος Κούρας wrote:

 Steven wrote:
 Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
 values up to 256?
 
Because then how do you tell when you need one byte, and when you need
two? If you read two bytes, and see 0x4C 0xFA, does that mean two
characters, with ordinal values 0x4C and 0xFA, or one character with
ordinal value 0x4CFA?
 
 I mean utf-8 could use 1 byte for storing the 1st 256 characters. I
 meant up to 256, not above 256.

Think about it. Draw up a big table of one million plus characters:

Ordinal   Character
=======   =========
0         NUL control code
1         SOH control code
...
84        LATIN CAPITAL LETTER T
85        LATIN CAPITAL LETTER U
...
255       LATIN SMALL LETTER Y WITH DIAERESIS
256       LATIN CAPITAL LETTER A WITH MACRON
...
8485      OUNCE SIGN


and so on, all the way to 1114111. Now, suppose you read a file, and see 
two bytes, shown in decimal: 84 followed by 85, or in hexadecimal, 0x54 
followed by 0x55.

How do you tell whether that means two characters, T followed by U, or a 
single character, ℥ (OUNCE SIGN)?

With UTF-32, you can, because every value takes exactly the same space. 
So a T followed by a U is:

0x00000054
0x00000055

while a single ℥ is:

0x00002125

and it is easy to tell them apart: each block of 4 bytes is exactly one 
character. But notice how many NUL bytes there are? In the three 
characters shown, there are eight NUL bytes. Most text will be filled 
with NUL bytes, which is very wasteful.

UTF-8 is designed to be compact, and also to be backwards-compatible with 
ASCII. Characters which are in ASCII will be a single byte, so there are 
no null bytes used for padding, (except for NUL itself, of course). So 
the three characters TU℥ will be:

0x54
0x55
0xE2
0x84
0xA5

Five bytes in total, instead of 12 for UTF-32. But the only tricky part 
is that character with ordinal value 0xE2 (decimal 226, â) cannot be 
encoded as the single byte 0xE2, otherwise you would mistake the three 
bytes 0xE284A5 as starting with 'â' followed by two more characters. And 
indeed, 'â' is encoded as two bytes:

0xC3
0xA2

Likewise, character with ordinal value 0xC3 (decimal 195, Ã) is also 
encoded as two bytes:

0xC3
0x83

And so on. This way, there is never any confusion as to whether (say) 
three bytes are three one-byte characters, or one three-byte character.
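
The byte counts above can be checked directly. This is only a sketch; the variable names are illustrative, and 'utf-32-be' is used so that no byte order mark is added to the count.

```python
text = 'TU\u2125'   # T, U and OUNCE SIGN

utf8 = text.encode('utf-8')
utf32 = text.encode('utf-32-be')   # big-endian, without a byte order mark

print(len(utf8), utf8)                    # size and content in UTF-8
print(len(utf32), utf32.count(b'\x00'))   # size in UTF-32 and number of NUL bytes

# The characters with ordinals 0xE2 and 0xC3 each need two bytes in UTF-8.
print('\u00e2'.encode('utf-8'), '\u00c3'.encode('utf-8'))
```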


 UTF-8 and UTF-16 and UTF-32
 I though the number beside of UTF- was to declare how many bits the
 character set was using to store a character into the hdd, no?
 
Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
values to make a surrogate pair.
 
 A surrogate pair is like hitting for example Ctrl-A, which means it is a
 combination character that consists of 2 different characters? Is this
 what a surrogate is? a pair of 2 chars?

Yes, a surrogate pair is a pair of two characters. But they're not 
*real* characters. They don't exist in any human language. They are just 
values that tells the program these go together, and count as a single 
character.

(This is why Unicode prefers to talk about *code points* rather than 
characters. Some code points are characters, and some are not.)
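
To make the idea concrete, here is a small sketch of how UTF-16 splits one code point above 0xFFFF into two 16-bit surrogate values. The helper name `surrogate_pair` and the sample code point U+1F600 are choices made for this example.

```python
def surrogate_pair(code_point):
    """Split a code point above 0xFFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('only code points above 0xFFFF need a surrogate pair')
    offset = code_point - 0x10000          # 20 bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

high, low = surrogate_pair(0x1F600)
print(hex(high), hex(low))

# Python's own UTF-16 encoder produces the same two 16-bit values.
print(chr(0x1F600).encode('utf-16-be').hex())
```

Neither half means anything on its own; only the pair identifies the code point.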

UTF-8 uses 8-bit values, but sometimes it combines two, three or four of
them to represent a single code-point.
 
 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)

Correct.


 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is 
 > 127 ) 

That looks like two characters to me, 'α' followed by '΄'. That will take 
4 bytes, two for 'α' and two for '΄'.


 'a chinese ideogramm' to be utf8 encoded needs 4 byte to be stored
 ? (since ordinal > 65000 )

Not necessarily four bytes. Could be three. Depends on the ideogram.

 The amount of bytes needed to store a character solely depends on the
 character's ordinal value in the Unicode table?

Yes.
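
A sketch of that dependency, using the code point ranges from the UTF-8 definition. The function name `utf8_length` and the sample characters are illustrative.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses for a single code point."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4

# Compare the prediction with what Python's encoder actually produces.
for ch in ('a', 'α', '中', chr(0x20000)):
    print(repr(ch), ord(ch), utf8_length(ord(ch)), len(ch.encode('utf-8')))
```

The third sample is an ideogram that needs three bytes and the fourth one needs four, which is why the answer for Chinese depends on the ideogram.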


UTF-8 solves this problem by reserving some values to mean "this byte,
on its own", and others to mean "this byte, plus the next byte,
together", and so forth, up to four bytes.
 
 Some of the utf-8 bits that are used to represent a character's ordinal
 value are actually also being used to separate or join the ordinal values
 themselves? Can you give an example please? How are they being
 separated?

Did you look up UTF-8 on Wikipedia like I suggested?


Computers are digital and work with numbers.
 
 So character 'A' -> 65 (in decimal, as used in the charset's table) ->
 01011100 (as binary stored on disk) -> 0xEF (as hex, when we open the
 file with a hex editor)
 
 Is this how the thing works? (above values are fictional)

You can check this in Python:


py> c = 'A'
py> ord(c)
65
py> bin(65)
'0b1000001'
py> hex(65)
'0x41'


py> c = 'α'
py> ord(c)
945
py> c.encode('utf-8')
b'\xce\xb1'
py> c.encode('utf-16be')
b'\x03\xb1'
py> c.encode('utf-32be')
b'\x00\x00\x03\xb1'
py> c.encode('iso-8859-7')
b'\xe1'


-- 
Steven

Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-09 Thread Steven D'Aprano
On Sun, 09 Jun 2013 19:16:06 +1000, Cameron Simpson wrote:


 If he's lucky the UnicodeEncodeError occurred while trying to print an
 error message, 

That's not what happens at the interactive console:

py> assert os.path.exists('Ж1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError


 printing a greek Unicode string in the error with ASCII
 as the output encoding (default when not a tty IIRC).


An interesting thought. How would we test that?
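
One possible way to try it, as a sketch only: wrap an in-memory buffer in a text stream whose encoding is forced to ASCII and print a Greek string to it. The stream stands in for a non-tty stdout or stderr; whether the real CGI environment behaves the same way is an assumption.

```python
import io

# Stand-in for an output stream whose encoding is ASCII (hypothetical setup).
ascii_stream = io.TextIOWrapper(io.BytesIO(), encoding='ascii')

try:
    print('Ευχή', file=ascii_stream)
    outcome = 'printed without error'
except UnicodeEncodeError as exc:
    outcome = 'UnicodeEncodeError: ' + exc.reason

print(outcome)
```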



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-09 Thread Steven D'Aprano
On Sun, 09 Jun 2013 02:38:13 -0700, Νικόλαος Κούρας wrote:

 s = 'α'
 s = s.encode('iso-8859-7').decode('utf-8')
 
 UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 0:
 unexpected end of data
 
 Why this error? because 'a' ordinal value > 127 ?


Look at it this way... consider encoding and decoding to be like 
translating from one language to another.

Suppose you start with the English word "street". You encode it to German 
by looking it up in an English-To-German dictionary:

street -> Straße

Then you decode the German by looking "Straße" up in a German-To-English 
dictionary:

Straße -> street

and everything is good. But suppose that after encoding the English to 
German, you get confused, and think that it is Italian, not German. So 
when it comes to decoding, you try to look up "Straße" in an Italian-To-
English dictionary, and discover that there is no such thing as letter ß 
in Italian. So you cannot look the word up, and you get frustrated and 
shout "this is rubbish, there's no such thing as ß, that's not a letter!"

Not in Italian, but it is a perfectly good letter in German. But you're 
looking it up in the wrong dictionary.

Same thing with UTF-8. You encoded the string 'α' by looking it up in the 
"Unicode To ISO-8859-7 bytes" dictionary. Then you try to decode it by 
looking for those bytes in the "UTF-8 bytes To Unicode" dictionary. But 
you can't find byte 0xe1 on its own in UTF-8 bytes, so Python shouts 
"this is rubbish, there's no such thing as 0xe1 on its own in UTF-8!" and 
raises UnicodeDecodeError.


Sometimes you don't get an exception. Suppose that you are encoding from 
French to German:

qui -> die  (both words mean "who" in English)


Now if you get confused, and decode the word 'die' by looking it up in an 
English-To-French dictionary, instead of German-To-French, you get:

die -> mourir

So instead of getting 'qui' back again, you get 'mourir'. This is like 
mojibake: the results are garbage, but there is no exception raised to 
warn you.
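
Both outcomes of the wrong-dictionary analogy can be reproduced with the same byte. This is a small sketch; the variable names are illustrative.

```python
greek_bytes = 'α'.encode('iso-8859-7')     # the single byte 0xE1

# Wrong dictionary, loud failure: 0xE1 alone is not valid UTF-8.
try:
    greek_bytes.decode('utf-8')
    loud = None
except UnicodeDecodeError as exc:
    loud = exc

# Wrong dictionary, silent failure (mojibake): Latin-1 accepts any byte.
silent = greek_bytes.decode('latin-1')

print(type(loud).__name__, repr(silent))
```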


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-09 Thread nagia . retsina
Τη Κυριακή, 9 Ιουνίου 2013 3:36:51 μ.μ. UTC+3, ο χρήστης Steven D'Aprano έγραψε:

  printing a greek Unicode string in the error with ASCII 
  as the output encoding (default when not a tty IIRC).

 An interesting thought. How would we test that?

Please elaborate on this for me. I didn't understand what you are trying to say, your 
assumption about why I am still getting decode issues.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-09 Thread Benjamin Kaplan
On Sun, Jun 9, 2013 at 2:38 AM, Νικόλαος Κούρας nikos.gr...@gmail.com wrote:
 Τη Κυριακή, 9 Ιουνίου 2013 12:20:58 μ.μ. UTC+3, ο χρήστης Lele Gaifax έγραψε:

  How about a string i wonder?
  s = νίκος
  what_are these_bytes = s.encode('iso-8869-7').encode(utf-8')

 Ignoring the usual syntax error, this is just a variant of the code I
 posted: s.encode('iso-8869-7') produces a bytes instance which
 *cannot* be re-encoded again in whatever encoding.

 s = 'a'
 s = s.encode('iso-8859-7').decode('utf-8')
 print( s )

 a (we got the original character back)
 
 s = 'α'
 s = s.encode('iso-8859-7').decode('utf-8')
 print( s )

 UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 0: 
 unexpected end of data

 Why this error? because 'a' ordinal value > 127 ?
 --

No. You get that error because the string is not encoded in UTF-8.
It's encoded in ISO-8859-7. For ASCII strings (ord(x) < 128),
ISO-8859-7 and UTF-8 look exactly the same. For anything else, they
are different. If you were to try to decode it as ISO-8859-1, it would
succeed, but you would get the character á back instead of α.

You're misunderstanding the decode function. Decode doesn't turn it
into a string with the specified encoding. It takes bytes that are
*in* the specified encoding and turns them into Python's internal
string representation. In Python 3.3, that internal representation doesn't
even have a name because it's not a standard encoding. So you want the decode
argument to match the encode argument.
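
A short sketch of that last point: whichever codec is used for encode(), the same codec passed to decode() gives the original string back. The sample text is illustrative.

```python
text = 'Ευχή'

for codec in ('utf-8', 'utf-16', 'iso-8859-7'):
    raw = text.encode(codec)          # str -> bytes
    restored = raw.decode(codec)      # bytes -> str, same codec
    print(codec, raw, restored == text)
```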
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-09 Thread Benjamin Kaplan
On Sun, Jun 9, 2013 at 2:20 AM, Νικόλαος Κούρας nikos.gr...@gmail.com wrote:
 Τη Κυριακή, 9 Ιουνίου 2013 12:12:36 μ.μ. UTC+3, ο χρήστης Cameron Simpson 
 έγραψε:
 On 09Jun2013 02:00, =?utf-8?B?zp3Or866zr/PgiDOk866z4EzM866?= 
 nikos.gr...@gmail.com wrote:

 | Steven wrote:

 |  Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for

 |  values up to 256?

 |

 | Because then how do you tell when you need one byte, and when you need

 | two? If you read two bytes, and see 0x4C 0xFA, does that mean two

 | characters, with ordinal values 0x4C and 0xFA, or one character with

 | ordinal value 0x4CFA?

 |

 | I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant 
 up to 256, not above 256.



 Then it would not be UTF-8. UTF-8 will encode an Unicode codepoint. Your 
 suggestion will not.

 I dont follow.


The point in the UTF formats is that they can encode any of the 1.1
million codepoints available in Unicode. Your suggestion can only
encode 256 code points. We have that encoding already- it's called
Latin-1 and it can't encode any of your Greek characters (hence why
ISO-8859-7 exists, which can encode the Greek characters but not the
Latin ones).

If you were to use the whole byte to store the first 256 characters,
you wouldn't be able to store character number 257 because the
computer wouldn't be able to tell the difference between character 257
(0x01 0x01) and two chr(1)s. UTF-8 gets around this by reserving the
top bit as an "am I part of a multibyte sequence" flag.
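
The top-bit flag can be seen by looking at the encoded bytes one at a time. A small sketch with an illustrative two-character string:

```python
encoded = 'Aα'.encode('utf-8')        # one ASCII letter, one Greek letter

for byte in encoded:
    top_bit_set = bool(byte & 0x80)
    print(format(byte, '08b'), 'multibyte' if top_bit_set else 'single byte (ASCII)')

# True for every byte that belongs to a multibyte sequence.
flags = [bool(byte & 0x80) for byte in encoded]
```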

 |  UTF-8 and UTF-16 and UTF-32

 |  I though the number beside of UTF- was to declare how many bits the

 |  character set was using to store a character into the hdd, no?

 |

 | Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.

 | UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit

 | values to make a surrogate pair.

 |

 | A surrogate pair is like hitting for example Ctrl-A, which means it is a 
 combination character that consists of 2 different characters?

 | Is this what a surrogate is? a pair of 2 chars?



 Essentially. The combination represents a code point.



 | UTF-8 uses 8-bit values, but sometimes

 | it combines two, three or four of them to represent a single code-point.

 |

 | 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)

 | 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 
 127 )

 | 'a chinese ideogramm' to be utf8 encoded needs 4 byte to be stored ? 
 (since ordinal > 65000 )

 |

 | The amount of bytes needed to store a character solely depends on the 
 character's ordinal value in the Unicode table?



 Essentially. You can read up on the exact process in Wikipedia or the 
 Unicode Standard.



 When you say essentially means you agree with my statements?
 --

In UTF-8 or UTF-16, the number of bytes required for the character is
dependent on its code point, yes. That isn't the case for UTF-32,
where every character uses exactly four bytes.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-08 Thread Νικόλαος Κούρας
Τη Σάββατο, 8 Ιουνίου 2013 5:52:22 π.μ. UTC+3, ο χρήστης Cameron Simpson έγραψε:
 On 07Jun2013 11:52, =?utf-8?B?zp3Or866zr/PgiDOk866z4EzM866?= 
 nikos.gr...@gmail.com wrote:
 
 | ni...@superhost.gr [~/www/cgi-bin]# [Fri Jun 07 21:49:33 2013] [error] 
 [client 79.103.41.173]   File /home/nikos/public_html/cgi-bin/files.py, 
 line 81
 
 | [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] if( flag == 
 'greek' )
 
 | [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173]   
   ^
 
 | [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] SyntaxError: 
 invalid syntax
 
 | [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] Premature end of 
 script headers: files.py
 
 | ---
 
 | i dont know why that if statement errors.
 
 
 
 Python statements that continue (if, while, try etc) end in a colon, so:

Oh iam very sorry.
Oh my God i cant beleive i missed a colon *again*:

I have corrected this:

#
# Collect filenames of the path dir as bytes
filename_bytes = os.listdir( b'/home/nikos/public_html/data/apps/' )

for filename in filename_bytes:
    # Compute 'path/to/filename' into bytes
    filepath_bytes = b'/home/nikos/public_html/data/apps/' + b'filename'
    flag = False

    try:
        # Assume current file is utf8 encoded
        filepath = filepath_bytes.decode('utf-8')
        flag = 'utf8'
    except UnicodeDecodeError:
        try:
            # Since current filename is not utf8 encoded then it has to be greek-iso encoded
            filepath = filepath_bytes.decode('iso-8859-7')
            flag = 'greek'
        except UnicodeDecodeError:
            print( '''I give up! File name is unreadable!''' )

    if flag == 'greek':
        # Rename filename from greek bytes -- utf-8 bytes
        os.rename( filepath_bytes, filepath.encode('utf-8') )
==

Now everything was supposed to work, but instead I am getting this surrogate 
error once more. 
What is this surrogate thing?

Since I make use of error catching and handling like 'except 
UnicodeDecodeError:'

then if utf8's decode fails for some reason, it should leave that file alone 
and try the next file?
try:
# Assume current file is utf8 encoded
filepath = filepath_bytes.decode('utf-8')
flag = 'utf8' 
except UnicodeDecodeError:

This is what it supposed to do, correct?

==
[Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173]   File 
"/home/nikos/public_html/cgi-bin/files.py", line 94, in <module>
[Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173] 
cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
[Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173]   File 
"/usr/local/lib/python3.3/site-packages/PyMySQL3-0.5-py3.3.egg/pymysql/cursors.py",
 line 108, in execute
[Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173] query = 
query.encode(charset)
[Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173] UnicodeEncodeError: 
'utf-8' codec can't encode character '\\udcce' in position 35: surrogates not 
allowed
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-08 Thread Chris Angelico
On Sat, Jun 8, 2013 at 4:49 PM, Νικόλαος Κούρας nikos.gr...@gmail.com wrote:
 Oh my God i cant beleive i missed a colon *again*:

For most Python programmers, this is a matter of moments to solve. Run
the program, get a SyntaxError, fix it. Non-interesting event. (Maybe
even sooner than that, if the editor highlights it for you.) This is
why you really need to start yourself a testbox. DO NOT PLAY ON YOUR
LIVE SYSTEM. This is sysadminning 101. And Python programming 101: The
error traceback points to the error, or just after it.

Get to know how error messages work. This is not even Python-specific.
*Every* language behaves this way. You look at the highlighted line,
if you can't see an error there you look a little bit higher.
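
As an illustration of reading the error location without touching the live site, the sketch below compiles a snippet with a deliberately missing colon and inspects the exception. The file name 'files_test.py' is made up; nothing is executed or written to disk.

```python
source = (
    "flag = 'greek'\n"
    "if flag == 'greek'\n"          # colon deliberately missing
    "    print('rename')\n"
)

try:
    compile(source, 'files_test.py', 'exec')
    error_line = None
except SyntaxError as exc:
    error_line = exc.lineno
    print('SyntaxError on line', exc.lineno, ':', exc.text)
```

From a shell, `python3 -m py_compile files.py` performs a similar syntax check on a whole file before it is uploaded.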

You should not need to beg for help for such trivial problems. This is
the mark of a novice. You ought no longer to be a novice, based on how
long you've been doing this stuff. You ought not to behave like one.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-08 Thread Steven D'Aprano
On Fri, 07 Jun 2013 23:49:17 -0700, Νικόλαος Κούρας wrote:

[...]
 Oh iam very sorry.
 Oh my God i cant beleive i missed a colon *again*:
 
 I have corrected this:

[snip code]

Stop posting your code after every trivial edit!!!


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-08 Thread Chris Angelico
On Sat, Jun 8, 2013 at 5:26 PM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 On Fri, 07 Jun 2013 23:49:17 -0700, Νικόλαος Κούρας wrote:

 [...]
 Oh iam very sorry.
 Oh my God i cant beleive i missed a colon *again*:

 I have corrected this:

 [snip code]

 Stop posting your code after every trivial edit!!!

I think he uses the python-list archives as ersatz source control.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-08 Thread Steven D'Aprano
On Thu, 06 Jun 2013 23:35:33 -0700, nagia.retsina wrote:

 Working with bytes is only for when the file names are turned to
 garbage. Your file names (some of them) are turned to garbage. Fix
 them, and then use file names as strings.
 
 Can't '~/data/apps/' is filled every day with more and more files which
 are uploaded via FileZilla client, which i think it behaves pretty much
 like putty, uploading filenames as greek-iso bytes.


Well, that is certainly a nuisance. Try something like this:

# Untested.
import os

dir = b'/home/nikos/public_html/data/apps/'  # This must be bytes.
files = os.listdir(dir)
for name in files:
    pathname_as_bytes = dir + name
    for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
        try:
            pathname = pathname_as_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue
        # Rename to something valid in UTF-8.
        if encoding != 'utf-8':
            os.rename(pathname_as_bytes, pathname.encode('utf-8'))
        assert os.path.exists(pathname)
        break
    else:
        # This only runs if we never reached the break.
        raise ValueError('unable to clean filename %r'%pathname_as_bytes)
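
The try-each-encoding loop with its for/else can also be exercised without touching the file system. The following is a sketch of just that part; the helper name `decode_first_match` and the sample file name are hypothetical.

```python
def decode_first_match(raw, encodings=('utf-8', 'iso-8859-7', 'latin-1')):
    """Return (text, encoding) for the first encoding that decodes raw."""
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue        # try the next encoding
        break               # success: skip the else clause
    else:
        # Only reached when every encoding failed.
        raise ValueError('unable to decode %r' % raw)
    return text, encoding

print(decode_first_match('Ευχή.mp3'.encode('utf-8')))
print(decode_first_match('Ευχή.mp3'.encode('iso-8859-7')))
```

Because Latin-1 accepts every byte value, the else branch cannot be reached with the default list; a wrong guess shows up as mojibake rather than as an exception.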


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-08 Thread Roel Schroeven

Νικόλαος Κούρας schreef:

Session settings afaik is for putty to remember hosts to connect to,
not terminal options. I might be worng though. No matter how many times
i change its options next time i run it always defaults back.


PuTTY can most definitely remember its settings:
- Start PuTTY; you should get the "PuTTY Configuration" window
- Select a session in the list of sessions
- Click "Load"
- Change any setting you want to change
- Go back to "Session" in the "Category" treeview
- Click "Save"

HTH

--
People almost invariably arrive at their beliefs not on the basis of
proof but on the basis of what they find attractive.
-- Pascal Blaise

r...@roelschroeven.net

--
http://mail.python.org/mailman/listinfo/python-list


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-08 Thread MRAB

On 08/06/2013 07:49, Νικόλαος Κούρας wrote:

Τη Σάββατο, 8 Ιουνίου 2013 5:52:22 π.μ. UTC+3, ο χρήστης Cameron Simpson έγραψε:

On 07Jun2013 11:52, =?utf-8?B?zp3Or866zr/PgiDOk866z4EzM866?= 
nikos.gr...@gmail.com wrote:

| ni...@superhost.gr [~/www/cgi-bin]# [Fri Jun 07 21:49:33 2013] [error] [client 
79.103.41.173]   File /home/nikos/public_html/cgi-bin/files.py, line 81

| [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] if( flag == 
'greek' )

| [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] 
^

| [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] SyntaxError: 
invalid syntax

| [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] Premature end of 
script headers: files.py

| ---

| i dont know why that if statement errors.



Python statements that continue (if, while, try etc) end in a colon, so:


Oh iam very sorry.
Oh my God i cant beleive i missed a colon *again*:

I have corrected this:

#
# Collect filenames of the path dir as bytes
filename_bytes = os.listdir( b'/home/nikos/public_html/data/apps/' )

for filename in filename_bytes:
    # Compute 'path/to/filename' into bytes
    filepath_bytes = b'/home/nikos/public_html/data/apps/' + b'filename'
    flag = False

    try:
        # Assume current file is utf8 encoded
        filepath = filepath_bytes.decode('utf-8')
        flag = 'utf8'
    except UnicodeDecodeError:
        try:
            # Since current filename is not utf8 encoded then it has to be greek-iso encoded
            filepath = filepath_bytes.decode('iso-8859-7')
            flag = 'greek'
        except UnicodeDecodeError:
            print( '''I give up! File name is unreadable!''' )

    if flag == 'greek':
        # Rename filename from greek bytes -- utf-8 bytes
        os.rename( filepath_bytes, filepath.encode('utf-8') )
==

Now everythitng were supposed to work but instead iam getting this surrogate 
error once more.
What is this surrogate thing?

Since i make use of error cathcing and handling like 'except 
UnicodeDecodeError:'

then it utf8's decode fails for some reason, it should leave that file alone 
and try the next file?
try:
# Assume current file is utf8 encoded
filepath = filepath_bytes.decode('utf-8')
flag = 'utf8'
except UnicodeDecodeError:

This is what it supposed to do, correct?

==
[Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173]   File 
"/home/nikos/public_html/cgi-bin/files.py", line 94, in <module>
[Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173] 
cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
[Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173]   File 
"/usr/local/lib/python3.3/site-packages/PyMySQL3-0.5-py3.3.egg/pymysql/cursors.py",
 line 108, in execute
[Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173] query = 
query.encode(charset)
[Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173] UnicodeEncodeError: 
'utf-8' codec can't encode character '\\udcce' in position 35: surrogates not 
allowed


Look at the traceback.

It says that the exception was raised by:

query = query.encode(charset)

which was called by:

cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )

But what is 'filename'? And what has it to do with the first code
snippet? Does the traceback have _anything_ to do with the first code
snippet?
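
On the "what is this surrogate thing" question, one plausible explanation (an assumption, not something confirmed in this thread) is that the name reached the query as a string produced with the 'surrogateescape' error handler, which Python 3 uses when it decodes file names that are not valid in the file system encoding. The sketch below uses a made-up file name to show the effect.

```python
raw_name = 'αβ.txt'.encode('iso-8859-7')          # Greek-ISO bytes, not valid UTF-8

# Undecodable bytes become lone surrogates in the range U+DC80..U+DCFF.
escaped = raw_name.decode('utf-8', 'surrogateescape')
print(ascii(escaped))

# A strict UTF-8 encode refuses lone surrogates.
try:
    escaped.encode('utf-8')
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = exc
    print(exc.reason)

# Encoding with the same handler restores the original bytes.
restored = escaped.encode('utf-8', 'surrogateescape')
```

A character such as '\udcce' in the traceback would then correspond to a raw byte 0xCE that could not be decoded.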

--
http://mail.python.org/mailman/listinfo/python-list


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-08 Thread Νικόλαος Κούρας
Sorry for the delay guys, I was busy with other things today and I am still 
reading your responses; I still haven't read them all, just Cameron's:

Here is what I have now, following Cameron's advice:


#
# Collect filenames of the path directory as bytes
path = b'/home/nikos/public_html/data/apps/'
filenames_bytes = os.listdir( path )

for filename_bytes in filenames_bytes:
    try:
        filename = filename_bytes.decode('utf-8')
    except UnicodeDecodeError:
        # Since its not a utf8 bytestring then its for sure a greek bytestring

        # Prepare arguments for rename to happen
        utf8_filename = filename_bytes.encode('utf-8')
        greek_filename = filename_bytes.encode('iso-8859-7')

        utf8_path = path + utf8_filename
        greek_path = path + greek_filename

        # Rename current filename from greek bytes -- utf8 bytes
        os.rename( greek_path, utf8_path )
==

I know this is wrong though.
Since filename_bytes is the current filename encoded as utf8 or greek-iso
then i cannot just *encode* what is already encoded by doing this:

utf8_filename = filename_bytes.encode('utf-8')
greek_filename = filename_bytes.encode('iso-8859-7')
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-08 Thread Νικόλαος Κούρας
Okay, after also reading Steven's post, I was relieved from the stuck 
position I was in, so with an alteration of a few variable names here is the code 
now:


#
# Collect directory and its filenames as bytes
path = b'/home/nikos/public_html/data/apps/'
files = os.listdir( path )

for filename in files:
    # Compute 'path/to/filename'
    filepath_bytes = path + filename
    for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
        try:
            filepath = filepath_bytes.decode( encoding )
        except UnicodeDecodeError:
            continue

        # Rename to something valid in UTF-8
        if encoding != 'utf-8':
            os.rename( filepath_bytes, filepath.encode('utf-8') )

        assert os.path.exists( filepath )
        break
    else:
        # This only runs if we never reached the break
        raise ValueError( 'unable to clean filename %r' % filepath_bytes )

=

I don't know why it is still failing when it tries to decode stuff, since it 
tries 3 ways of decoding. Here is the exact error.


ni...@superhost.gr [~/www/cgi-bin]# [Sat Jun 08 20:32:44 2013] [error] [client 
79.103.41.173] Error in sys.excepthook:
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173] ValueError: 
underlying buffer has been detached
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173]
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173] Original exception 
was:
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173] Traceback (most 
recent call last):
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173]   File 
"/home/nikos/public_html/cgi-bin/files.py", line 78, in <module>
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173] assert 
os.path.exists( filepath )
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173]   File 
"/usr/local/lib/python3.3/genericpath.py", line 18, in exists
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173] os.stat(path)
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173] UnicodeEncodeError: 
'ascii' codec can't encode characters in position 34-37: ordinal not in 
range(128)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-08 Thread MRAB

On 08/06/2013 17:53, Νικόλαος Κούρας wrote:

Sorry for the delay guys, I was busy with other things today and I am still 
reading your responses; I still haven't read them all, just Cameron's:

Here is what I have now, following Cameron's advice:


#
# Collect filenames of the path directory as bytes
path = b'/home/nikos/public_html/data/apps/'
filenames_bytes = os.listdir( path )

for filename_bytes in filenames_bytes:
    try:
        filename = filename_bytes.decode('utf-8')
    except UnicodeDecodeError:
        # Since its not a utf8 bytestring then its for sure a greek bytestring

        # Prepare arguments for rename to happen
        utf8_filename = filename_bytes.encode('utf-8')
        greek_filename = filename_bytes.encode('iso-8859-7')

        utf8_path = path + utf8_filename
        greek_path = path + greek_filename

        # Rename current filename from greek bytes -- utf8 bytes
        os.rename( greek_path, utf8_path )
==

I know this is wrong though.


Yet you did it anyway!


Since filename_bytes is the current filename encoded as utf8 or greek-iso
then i cannot just *encode* what is already encoded by doing this:

utf8_filename = filename_bytes.encode('utf-8')
greek_filename = filename_bytes.encode('iso-8859-7')


Try reading and understanding the code I originally posted.

--
http://mail.python.org/mailman/listinfo/python-list


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-08 Thread Νικόλαος Κούρας

On 8/6/2013 5:49 πμ, Cameron Simpson wrote:

On 07Jun2013 04:53, =?utf-8?B?zp3Or866zr/PgiDOk866z4EzM866?= 
nikos.gr...@gmail.com wrote:
| Τη Παρασκευή, 7 Ιουνίου 2013 11:53:04 π.μ. UTC+3, ο χρήστης Cameron Simpson 
έγραψε:
|  | | errors='replace' mean dont break in case or error?
| 
|  | Yes. The result will be correct for correct iso-8859-7 and slightly 
mangled
|  | for something that would not decode smoothly.
| 
|  | How can it be correct? We have encoded out string in utf-8 and then
|  | we tried to decode it as greek-iso? How can this possibly be
|  | correct?
|
|  If it is a valid iso-8859-7 sequence (which might cover everything,
|  since I expect it is an 8-bit 1:1 mapping from bytes values to a
|  set of codepoints, just like iso-8859-1) then it may decode to the
|  wrong characters, but the reverse process (characters encoded as
|  bytes) should produce the original bytes.  With a mapping like this,
|  errors='replace' may mean nothing; there will be no errors because
|  the only Unicode characters in play are all from iso-8859-7 to start
|  with. Of course another string may not be safe.
|
|  Visually, the names will be garbage. And if you go:
|mv '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ.mp3' '999-Eυχή-του-Ιησού.mp3'
|  while using the iso-8859-7 locale, the wrong thing will occur
|  (assuming it even works, though I think it should because all these
|  characters are represented in iso-8859-7, yes?)
|
| All the rest I understood; only the above quotes are still unclear to me.
| I can't seem to understand them.
|
| Do you mean that utf-8, latin-iso, greek-iso and ASCII have the 1st 0-127 
codepoints similar?

Yes. It is certainly true for utf-8 and latin-iso and ASCII.
I expect it to be so for greek-iso, but have not checked.

They're all essentially the ASCII set plus a range of other character
codepoints for the upper values.  The 8-bit sets iso-8859-1 (which
I take you to mean by latin-iso) and iso-8859-7 (which I take you
to mean by greek-iso) are single byte mapping with the top half
mapped to characters commonly used in a particular region.

Unicode has a much greater range, but the UTF-8 encoding of Unicode
deliberately has the bottom 0-127 identical to ASCII, and higher
values represented by multibyte sequences in which every byte
is in the 128-255 range. In this way pure ASCII files
are already in UTF-8 (and, in fact, work just fine for the iso-8859-x
encodings as well).
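
A quick sketch of that claim, with an illustrative byte string: pure ASCII bytes decode to the same text under all four encodings, while a Greek letter is stored differently.

```python
ascii_bytes = b'files.py'

# The same ASCII bytes decoded with four different codecs.
decoded = {codec: ascii_bytes.decode(codec)
           for codec in ('ascii', 'utf-8', 'iso-8859-1', 'iso-8859-7')}
print(decoded)

# Outside ASCII the encodings disagree about the bytes.
print('α'.encode('utf-8'), 'α'.encode('iso-8859-7'))
```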


Hold on!

In the beginning there was ASCII with 0-127 values and then there was 
Unicode with 0-127 of ASCII's + I don't know how many more?


Now ASCII needs 1 byte to store a single character while Unicode needs 
2 bytes to store a character and that is because it has > 256 characters 
to store > 2^8 bits ?


Is this correct?

Now UTF-8, latin-iso, greek-iso e.t.c are WAYS of storing characters 
into the hard drive?


Because in some post I have read of the 'UTF-8 encoding of Unicode'.
Can you please explain to me what the difference is between ASCII and Unicode 
themselves, and then between them and 'Charsets'? I'm still confused 
about this.


Is it like we said in C++:
' int a', a variable with name 'a' of type integer.
'char a',   a variable with name 'a' of type char

So taken from the above example (the closest I could think of), the way I 
understand them is:


A 'string' can be of (unicode's or ascii's) type and that type needs a 
way (that's a charset) to store this string into the hdd as a sequence of 
bytes?







--
Webhost http://superhost.gr Weblog http://psariastonafro.wordpress.com
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-08 Thread Steven D'Aprano
On Sat, 08 Jun 2013 21:01:23 +0300, Νικόλαος Κούρας wrote:

 In the beginning there was ASCII with 0-127 values 

No, there were encoding systems that existed before ASCII, such as 
EBCDIC. But we can ignore those, and just start with ASCII.


 and then there was
 Unicode with 0-127 of ASCII's + i dont know how much many more?

No, you have missed the utter chaos of dozens and dozens of Windows 
codepages and charsets. We still have to live with the pain of that.

But now we have Unicode, with code points numbered from 0 to 0x10FFFF 
(decimal 1114111). You can consider a code point to be the same as a 
character, at least for now.


 Now ASCIII needs 1 byte to store a single character 

ASCII actually needs 7 bits to store a character. Since computers are 
optimized to work with bytes, not bits, normally ASCII characters are 
stored in a single byte, with one bit wasted.


 while Unicode needs 2 bytes to store a character 

No. Since Unicode code points run up to 0x10FFFF (really code points, 
not characters, but ignore the difference) two bytes is not enough. 
Unicode needs 21 bits to store a character. Since that is more than 2 
bytes, but less than 3, there are a few different ways for Unicode to be 
stored in memory, including:

Wide Unicode uses four bytes per character. Why four instead of three? 
Because computers are more efficient when working with chunks of memory 
that are a multiple of four bytes.

Narrow Unicode uses two bytes per character. Since two bytes is only 
enough for about 65,000 characters, not 1,000,000+, the rest of the 
characters are stored as pairs of two-byte surrogates.
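
To make "pairs of two-byte surrogates" concrete, here is a small sketch of the standard UTF-16 arithmetic. The helper name to_surrogate_pair is made up for this illustration; the last lines check the result against Python's own UTF-16 codec.

```python
def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need surrogates")
    offset = code_point - 0x10000       # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)      # top 10 bits go in the first unit
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go in the second unit
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))

# Python's UTF-16 codec emits the same two 16-bit units (big-endian here).
encoded = chr(0x1F600).encode("utf-16-be")
print(encoded.hex())
```

So a "surrogate" is just a reserved 16-bit value that never stands for a character on its own; two of them together stand in for one character outside the first 65,536.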



 and that is because it has > 256 characters
 to store (> 2^8 bits)?

Correct.



 Now UTF-8, latin-iso, greek-iso e.t.c are WAYS of storing characters
 into the hard drive?

Your computer cannot carve a tiny little A into the hard drive when it 
stores that letter in a file. It has to write some bytes. So you need to 
know:

- what byte, or bytes, represents the letter A?

- what byte, or bytes, represents the letter B?

- what byte, or bytes, represents the letter λ?

and so on. This set of rules, "byte XX means letter YY", is called an 
encoding. If you don't know what encoding to use, you cannot tell what 
the bytes mean.
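
A short sketch of why the encoding has to be known: the same single byte means different things under different rules, and under UTF-8 it is not a complete character at all. The byte 0xEB is just a convenient example.

```python
raw = b"\xeb"

as_latin1 = raw.decode("iso-8859-1")   # e with diaeresis
as_greek = raw.decode("iso-8859-7")    # Greek small lambda
print(as_latin1, as_greek)

# Under UTF-8, 0xEB announces a three-byte sequence, so alone it is invalid.
utf8_error = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    utf8_error = exc
    print("not valid UTF-8:", exc.reason)
```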

 
 Because in some post i have read that 'UTF-8 encoding of Unicode'. Can
 you please explain to me whats the difference of ASCII-Unicode
 themselves aand then of them compared to 'Charsets' . I'm still confused
 about this.

A charset is an ordered set of characters. For example, ASCII has 128 
characters, starting with NUL:

NUL ... A B C D E ... Z [ \ ] ^ ... a b c ... z ... 


where NUL is at position 0, 'A' is at position 65, 'B' at position 66, 
and so on.

Latin-1 is similar, except there are 256 positions. Greek ISO-8859-7 is 
also similar, also 256 positions, but the characters are different. And 
so on, with dozens of charsets.

And then there is Unicode, which includes *every* character in all of 
those dozens of charsets. It has 1114112 positions, numbered 0 to 
1114111 (most are currently unfilled).


An encoding is simply a program that takes a character and returns a 
byte (or several bytes), or vice versa. For instance, the ASCII encoding 
will take character 'A'. That is found at position 65, which is 0x41 in 
hexadecimal, so the ASCII encoding turns character 'A' into byte 0x41, 
and vice versa.


 Is it like we said in C++:
 ' int a', a variable with name 'a' of type integer. 'char a',   a
 variable with name 'a' of type char
 
 So taken form above example(the closest i could think of), the way i
 understand them is:
 
 A 'string' can be of (unicode's or ascii's) type and that type needs a
 way (thats a charset) to store this string into the hdd as a sequense of
 bytes?


Correct.



-- 
Steven


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-08 Thread Chris Angelico
On Sun, Jun 9, 2013 at 4:01 AM, Νικόλαος Κούρας nikos.gr...@gmail.com wrote:
 Hold on!

 In the beginning there was ASCII with 0-127 values and then there was
 Unicode with 0-127 of ASCII's + i dont know how much many more?

 Now ASCIII needs 1 byte to store a single character while Unicode needs 2
 bytes to store a character and that is because it has  256 characters to
 store  2^8bits ?

 Is this correct?

No. Let me start from the beginning.

Computers don't work with characters, or strings, natively. They work
with numbers. To be specific, they work with bits; and it's only by
convention that we can work with anything larger. For instance,
there's a VERY common convention around the PC world that a set of
bits can be interpreted as a signed integer; if the highest bit is
set, it's negative. There are also standards for floating-point (IEEE
754), and so on.

ASCII is a character set. It defines a mapping of numbers to
characters - for instance, @ is 64, SOH is 1, $ is 36, etcetera,
etcetera. There are 128 such mappings. Since they all fit inside a
7-bit number, there's a trivial way to represent ASCII characters in a
PC's 8-bit byte: you just leave the high bit clear and use the other
seven. There have been various schemes for using the eighth bit -
serial ports with parity, WordStar (I think) marking the ends of
words, and most notably, Extended ASCII schemes that give you another
whole set of 128 characters. And that was the beginning of Code Pages,
because nobody could agree on what those extra 128 should be.
Norwegians used Norwegian, the Greeks were taught their Greek,
Arabians created themselves an Arabian codepage with the speed of
summer lightning, and Hebrews allocated from 255 down to 128, which is
absolutely frightening. But I digress.

There were a variety of multi-byte schemes devised at various times,
but we'll ignore all of them and jump straight to Unicode. With
Unicode, there's (theoretically) no need to use any other system ever
again, because whatever character you want, it'll exist in Unicode. In
theory, of course; there are debates over that. Now, Unicode currently
has defined an address space of roughly 20 bits, and in a throwback
to the first programming I ever did, it's a segmented system: sixteen
or seventeen planes of 65,536 characters each. (Fortunately the planes
are identified by low numbers, not high numbers, and there's no
stupidity of overlapping planes the way the 8086 did with memory!) The
highest planes are  special (plane 14 has a few special-purpose
characters, planes 15 and 16 are for private use), and most of the
middle ones have no characters assigned to them, so for the most part,
you'll see characters from the first three planes.

So what do we now have? A mapping of characters to code points,
which are numbers. (I'm leaving aside the issues of combining
characters and such for the moment.) But computers don't work with
numbers, they work with bits. Somehow we have to store those bits in
memory.

There are a good few ways to do that; one is to note that every
Unicode character can be represented inside 32 bits, so we can use the
standard integer scheme safely. (Since they fit inside 31 bits, we
don't even need to care if it's signed or unsigned.) That's called
UTF-32 or UCS-4, and it's a great way to handle the full Unicode range
in a manner that makes a Texan look agoraphobic. Wide builds of Python
up to 3.2 did this. Or you can try to store them in 16-bit numbers,
but then you have to worry about the ones that don't fit in 16 bits,
because it's really hard to squeeze 20 bits of information into 16
bits of storage. UTF-16 is one way to do this; special numbers mean
grab another number. It has its issues, but is (in my opinion,
unfortunately) fairly prevalent. Narrow builds of Python up to 3.2 did
this. Finally, you can use a more complicated scheme that uses
anywhere from 1 to 4 bytes for each character, by carefully encoding
information into the top bit - if it's set, you have a multi-byte
character. That's how UTF-8 works, and is probably the most prevalent
disk/network encoding.
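
The three storage schemes can be compared directly by counting bytes. This is only a sketch; the helper byte_sizes and the four sample characters are chosen for illustration, and the "-le" codec variants are used so that no byte-order mark is added to the count.

```python
def byte_sizes(ch):
    """Return how many bytes one character takes in each UTF encoding."""
    return {enc: len(ch.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# 'A', Greek lambda, the euro sign, and a character outside the first plane.
for ch in ["A", "\u03bb", "\u20ac", "\U0001F600"]:
    print("U+%04X" % ord(ch), byte_sizes(ch))
```

UTF-32 is always four bytes, UTF-16 is two or four, and UTF-8 grows from one to four as the code point gets larger.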

All of the UTF-X systems are called UCS Transformation Formats (UCS
meaning Universal Character Set, roughly Unicode). They are mappings
from Unicode numbers to bytes. Between Unicode and UTF-X, you have a
mapping from character to byte sequence.

 Now UTF-8, latin-iso, greek-iso e.t.c are WAYS of storing characters into
 the hard drive?

The ISO standard 8859 specifies a number of ASCII-compatible
encodings, referred to as ISO-8859-1 through ISO-8859-16. You've been
working with ISO-8859-1, also called Latin-1, and ISO-8859-7, which
has your Greek characters in it. These are all ways of translating
characters into numbers; and since they all fit within 8 bits, they're
most commonly represented on PCs with single bytes.
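
As a rough illustration of "ASCII-compatible", the sketch below builds the full 256-entry table for two of the ISO-8859 parts and compares them. The helper name table is made up; errors="replace" is used because a few byte values are unassigned in ISO-8859-7.

```python
def table(encoding):
    """Decode each of the 256 byte values on its own under one encoding."""
    return [bytes([b]).decode(encoding, errors="replace") for b in range(256)]

latin1 = table("iso-8859-1")
greek = table("iso-8859-7")

same_lower_half = latin1[:128] == greek[:128]
differing_upper = sum(1 for a, b in zip(latin1[128:], greek[128:]) if a != b)
print(same_lower_half, differing_upper)
print(repr(latin1[0xEB]), repr(greek[0xEB]))
```

The lower half (0-127) is the shared ASCII part; the differences are all in the upper half.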

 So taken form above example(the closest i could think of), the way i
 understand them is:

 A 'string' can be of (unicode's or ascii's) type and that type needs a 

Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-08 Thread Νικόλαος Κούρας
On Saturday, 8 June 2013 10:01:57 PM UTC+3, Steven D'Aprano wrote:

 ASCII actually needs 7 bits to store a character. Since computers are  
 optimized to work with bytes, not bits, normally ASCII characters are
 stored in a single byte, with one bit wasted.

So ASCII and Unicode are 2 encoding systems currently in use.
How should I imagine them, visualize them?
Like tables 'A' = 65, 'B' = 66 and so on?

But if I do, then that would be the visualization of a 'charset', not of an 
encoding system.
What is the difference between an encoding system and a charset?

ebcdic - ascii - unicode = all of them are encoding systems

greek-iso - latin-iso - utf8 - utf16 = all of them are charsets.

What are these differences? I can't imagine them all; I can only imagine 
charsets, not encoding systems.

Why does Python interpret all given strings as Unicode by default and not ASCII? 
Because the former supports many positions while ASCII has only 127 positions, 
hence can interpret only 127 different characters? 


 Narrow Unicode uses two bytes per character. Since two bytes is only 
 enough for about 65,000 characters, not 1,000,000+, the rest of the 
 characters are stored as pairs of two-byte surrogates.

Does 'surrogate' literally mean a replacement?


 Latin-1 is similar, except there are 256 positions. Greek ISO-8859-7 is 
 also similar, also 256 positions, but the characters are different. And 
 so on, with dozens of charsets. 

Latin has to display English chars (capital, small) + numbers + symbols. That 
would be 127, so why 256?

Greek = all of the above plus Greek chars, no?

 And then there is Unicode, which includes *every* character is all of 
 those dozens of charsets. It has 1114111 positions (most are currently  
 unfilled).

Shouldn't the number of positions Unicode has to use equal the sum of all 
available characters of all the languages of the world plus numbers and 
special chars? Why 1,000,000+? Why the need for so many positions? The narrow 
Unicode format (2 bytes) can cover all of the world's symbols.

 An encoding is simply a program that takes a character and returns a 
 byte, or visa versa. For instance, the ASCII encoding will take character 
 'A'. That is found at position 65, which is 0x41 in hexadecimal, so the 
 ASCII encoding turns character 'A' into byte 0x41, and visa versa.

Why do you say ASCII turns a character into hex format and not into binary format?
Isn't the latter the way bytes are stored on the HDD, like 01010010101 etc.?
Are they stored as hex instead, or did you just say so to avoid printing 0s and 1s?



Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-08 Thread Νικόλαος Κούρας
Sorry for displaying my code so many times. I know I have exhausted you, but this 
is the last thing I am going to ask of you in this thread. We are very close to 
having this working.


#
# Collect directory and its filenames as bytes
path = b'/home/nikos/public_html/data/apps/'
files = os.listdir( path )

for filename in files:
    # Compute 'path/to/filename'
    filepath_bytes = path + filename
    for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
        try:
            filepath = filepath_bytes.decode( encoding )
        except UnicodeDecodeError:
            continue

        # Rename to something valid in UTF-8
        if encoding != 'utf-8':
            os.rename( filepath_bytes, filepath.encode('utf-8') )

        assert os.path.exists( filepath )
        break
    else:
        # This only runs if we never reached the break
        raise ValueError( 'unable to clean filename %r' % filepath_bytes )


#
# Collect filenames of the path dir as strings
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for filename in filenames:
    try:
        # Check the presence of a file against the database and insert if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
        data = cur.fetchone()

        if not data:
            # First time for file; primary key is automatic, hit is defaulted
            print( "iam here", filename + '\n' )
            cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
    except pymysql.ProgrammingError as e:
        print( repr(e) )


#
# Collect filenames of the path dir as strings
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )
filepaths = set()   # must be a set; a tuple has no .add()

# Build a set of filenames based on the objects of path dir
for filename in filenames:
    filepaths.add( filename )

# Delete spurious
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()

# Check database's filenames against path's filenames
for rec in data:
    # assuming the default cursor, each rec is a row tuple whose first column is url
    if rec[0] not in filepaths:
        cur.execute('''DELETE FROM files WHERE url = %s''', rec )





=
[Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173] Original exception was:
[Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173] Traceback (most recent call last):
[Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173]   File "/home/nikos/public_html/cgi-bin/files.py", line 78, in <module>
[Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173]     assert os.path.exists( filepath )
[Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173]   File "/usr/local/lib/python3.3/genericpath.py", line 18, in exists
[Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173]     os.stat(path)
[Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173] UnicodeEncodeError: 'ascii' codec can't encode characters in position 34-37: ordinal not in range(128)
==

The assert is there to make sure that the path/to/file exists after the rename, 
but why are we still getting those UnicodeEncodeErrors?
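
One possible reading of the traceback, offered as an assumption rather than a diagnosis: the assert passes a str path to os.path.exists(), Python has to encode that str to bytes before calling the OS, and the 'ascii' codec in the error suggests the interpreter's filesystem encoding was ASCII in that CGI environment. Passing bytes paths avoids that step. The sketch below separates the decoding fallback into a hypothetical helper, decode_filename, and never converts the path back to str for filesystem calls.

```python
import os

def decode_filename(raw, encodings=("utf-8", "iso-8859-7", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes raw."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Unreachable while 'latin-1' is in the list: it accepts every byte value.
    raise ValueError("unable to decode filename %r" % raw)

# A Greek name as an ISO-8859-7 client might have uploaded it (illustrative).
greek_name = "\u03bd\u03af\u03ba\u03bf\u03c2.txt".encode("iso-8859-7")
text, used = decode_filename(greek_name)
utf8_name = text.encode("utf-8")
print(used, utf8_name)

# bytes in, bytes out: no locale-dependent encoding of the path is needed.
exists = os.path.exists(utf8_name)
```

Note that because latin-1 can decode any byte sequence, it acts as a catch-all; a name that is neither UTF-8 nor Greek ISO will be "cleaned" into whatever latin-1 makes of it.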


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-08 Thread Chris Angelico
On Sun, Jun 9, 2013 at 7:21 AM, Νικόλαος Κούρας nikos.gr...@gmail.com wrote:
 Sorry for displaying my code so many times, i know i ahve exhaust you but hti 
 is the last thinkg i am gonna ask from you in this thread. We are very close 
 to have this working.

You need to spend more time reading and less time frantically jumping
around. Go read my post on Unicode; it answers several of the
questions you posted in response to Steven's. And please, don't use
this list as your substitute for source control. Don't keep posting
your code. Most of us are ignoring it already.

ChrisA


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-08 Thread Cameron Simpson
On 08Jun2013 14:14, Νίκος Γκρ33κ nikos.gr...@gmail.com wrote:
| On Saturday, 8 June 2013 10:01:57 PM UTC+3, Steven D'Aprano wrote:
|  ASCII actually needs 7 bits to store a character. Since computers are  
|  optimized to work with bytes, not bits, normally ASCII characters are
|  stored in a single byte, with one bit wasted.
| 
| So ASCII and Unicode are 2 Encoding Systems currently in use.
| How should i imagine them, visualize them?
| Like tables 'A' = 65, 'B' = 66 and so on?

Yes, that works.

| But if i do then that would be the visualization of a 'charset' not of an 
encoding system.
| What the diffrence of an encoding system and of a charset?

An encoding system is the method of transcribing these values to bytes and 
back again.

| ebcdic - ascii - unicode = al of them are encoding systems
| greek-iso - latin-iso - utf8 - utf16 = all of them are charsets.

No.

EBCDIC and ASCII and Unicode and Greek-ISO (iso-8859-7) are all character sets
(1:1 mappings of characters to numbers/ordinals).

An encoding is a way of writing these values to bytes.
Decoding reads bytes and emits character values.

Because all of the EBCDIC, ASCII and iso-8859-x character sets fit in the 
range 0-255, they are usually transcribed (encoded) directly, one byte per 
ordinal.

Unicode is much larger. It cannot be transcribed (encoded) as one byte to one 
value. There are several ways of transcribing Unicode. UTF-8 is a popular and 
usually compact form, using one byte for values below 128 and multiple bytes 
for higher values.
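
The reason UTF-8 stops using single bytes at 128 rather than 256 shows up in the bit patterns: the top bit of each byte is reserved to mark multi-byte sequences. A small sketch, with utf8_bits as a made-up helper name:

```python
def utf8_bits(ch):
    """Show each byte of a character's UTF-8 form as eight binary digits."""
    return [format(b, "08b") for b in ch.encode("utf-8")]

print(utf8_bits("A"))        # one byte, top bit clear
print(utf8_bits("\u00e9"))   # e-acute, value 233: a lead byte plus a continuation byte

# The same character is a single byte in latin-1, where all 8 bits carry data.
latin1_form = "\u00e9".encode("latin-1")
print(latin1_form)
```

A lead byte starts with 110, 1110 or 11110 and each continuation byte starts with 10, so values 128-255 cannot be spent on single characters.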

| Why python interprets by default all given strings as unicode and
| not ascii? because the former supports many positions while ascii
| only 127 positions , hence can interpet only 127 different characters?

Yes.

[...]
|  Latin-1 is similar, except there are 256 positions. Greek ISO-8859-7 is 
|  also similar, also 256 positions, but the characters are different. And 
|  so on, with dozens of charsets. 
| 
| Latin has to display english chars(capital, small) + numbers + symbols. that 
would be 127 why 256?

ASCII runs up to 127. Essentially English, numerals, control codes and various 
symbols.

The iso-8859-x sets run to 255, and the upper 128 values map to
characters popular in various regions.

| greek = all of the above plus greek chars, no?

Yes: iso-8859-7 is ASCII plus, in its upper 128 values, the Greek characters.

|  And then there is Unicode, which includes *every* character is all of 
|  those dozens of charsets. It has 1114111 positions (most are currently  
|  unfilled).
| 
| Shouldt the positions that Unicode has to use equal to the summary
| of all available characters of all the languages of the worlds plus
| numbers and special chars? why 1.000.000+ why the need for so many
| positions? Narrow Unicode format (2 byted) can cover all ofmthe
| worlds symbols.

2 bytes is not enough. Chinese alone has more glyphs than that.

|  An encoding is simply a program that takes a character and returns a 
|  byte, or visa versa. For instance, the ASCII encoding will take character 
|  'A'. That is found at position 65, which is 0x41 in hexadecimal, so the 
|  ASCII encoding turns character 'A' into byte 0x41, and visa versa.
| 
| Why you say ASCII turn a character into HEX format and not as in binary 
format?

Steven didn't say that. He said position 65. People often write
bytes in hex (eg 0x41) because a byte always fits in two hex
digits (16 x 16 = 256 values) and because often these values have
binary-based subranges, and hex makes that more obvious.

For example, 'A' is 0x41. 'a' is 0x61. So you can look at the hex
code and almost visually know if you're dealing with upper or lower
case, etc.
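
A quick sketch of the same value written three ways, and of the single bit that separates upper from lower case in ASCII:

```python
for ch in "Aa":
    # decimal, hexadecimal and binary are three spellings of one number
    print(ch, ord(ch), hex(ord(ch)), format(ord(ch), "08b"))

case_bit = ord("A") ^ ord("a")       # the one bit that differs
lowered = chr(ord("Z") | case_bit)   # setting it turns 'Z' into 'z'
print(hex(case_bit), lowered)
```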

| Isnt the latter the way bytes are stored into hdd, like 01010010101 etc?
| Are they stored as hex instead or you just said so to avoid printing 0s and 
1s?

They're stored as bits at the gate level. But writing hex codes
_in_ _text_ is more compact, and more readable for humans.

Cheers,
-- 
Cameron Simpson c...@zip.com.au

A lot of people don't know the difference between a violin and a viola, so
I'll tell you.  A viola burns longer.   - Victor Borge


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-08 Thread Νικόλαος Κούρας

On 9/6/2013 1:32 am, Cameron Simpson wrote:

On 08Jun2013 14:14, Νίκος Γκρ33κ nikos.gr...@gmail.com wrote:
| On Saturday, 8 June 2013 10:01:57 PM UTC+3, Steven D'Aprano wrote:
|  ASCII actually needs 7 bits to store a character. Since computers are
|  optimized to work with bytes, not bits, normally ASCII characters are
|  stored in a single byte, with one bit wasted.
|
| So ASCII and Unicode are 2 Encoding Systems currently in use.
| How should i imagine them, visualize them?
| Like tables 'A' = 65, 'B' = 66 and so on?

Yes, that works.

| But if i do then that would be the visualization of a 'charset' not of an 
encoding system.
| What the diffrence of an encoding system and of a charset?

An encoding system is the method of transcribing these values to bytes and 
back again.

So we have:

( 'A' mapped to the value of '65' ) = encoding process (e.g. utf-8) = bytes
bytes = decoding process (e.g. utf-8) =  ( '65' mapped to character 'A' )

Why does every character in a character set need to be associated with 
a numeric value?
I mean, couldn't we just have character sets that wouldn't have numeric 
associations, like:


'A'  = encoding process (e.g. utf-8) = bytes
bytes = decoding process (e.g. utf-8) =  character 'A'




EBCDIC and ASCII and Unicode and Greek-ISO (iso-8859-7) are all character sets.
(1:1 mappings of characters to numbers/ordinals).

An encoding is a way of writing these values to bytes.
Decoding reads bytes and emits character values.

Because all of the EBCDIC, ASCII and iso-8859-x character sets fit in the 
range 0-255, they are usually transcribed (encoded) directly, one byte per 
ordinal.

Unicode is much larger. It cannot be transcribed (encoded) as one byte to one 
value. There are several ways of transcribing Unicode. UTF-8 is a popular and 
usually compact form, using one byte for values below 128 and multiple bytes 
for higher values.

An ordinal = ordered numbers like 7, 8, 9, 10 and so on?

Since 1 byte can hold up to 256 values, why doesn't utf-8 use 1 byte for 
values up to 255?


UTF-8 and UTF-16 and UTF-32:
I thought the number beside 'UTF-' declared how many bits the 
character set was using to store a character on the HDD, no?


Narrow Unicode uses two bytes per character. Since two bytes is only
enough for about 65,000 characters, not 1,000,000+, the rest of the
characters are stored as pairs of two-byte surrogates.

Can you please explain this line, "the rest of the characters are stored 
as pairs of two-byte surrogates", in a way that is easier for me to understand?

I'm still having trouble understanding what a surrogate is.

Again, thank you very much for explaining the encodings to me; they have been 
giving me trouble for years in all of my scripts.



And one last thing.
When the locale of a linux system is set to utf-8, that would mean that 
linux applications should encode strings to the HDD using utf-8 as the 
system's default encoding, and read them back from bytes also 
using utf-8. Is that correct?



Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-08 Thread nagia . retsina
On Saturday, 8 June 2013 9:47:53 PM UTC+3, Chris Angelico wrote:

 Fortunately, Python lets us hide away pretty much all those details, 
 just as it lets us hide away the details of what makes up a list, a
 dictionary, or an integer. You can safely assume that the string foo
 is a string of three characters, which you can work with as
 characters. The chr() and ord() functions let you switch between
 characters and numbers, and str.encode() and bytes.decode() let you
 switch between characters and byte sequences. Once you get your head
 around the differences between those three, it all works fairly
 neatly.
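
A minimal sketch of the three levels mentioned above (characters, numbers, bytes) and the built-ins that move between them. Note the direction: ord() goes from character to number, chr() from number to character.

```python
number = ord("A")            # character -> number
char = chr(65)               # number -> character

raw = "foo".encode("utf-8")  # characters -> byte sequence
text = raw.decode("utf-8")   # byte sequence -> characters

print(number, char, raw, text)
```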

I'm trying too!

So,

ord('A') would give me the mapping of this char, the number 65, while
chr(65) would output the char 'A' likewise.

and str.encode() and bytes.decode() let you switch between characters and byte 
sequences. Once

What would happen if we try to re-encode bytes on the disk?
Like trying:

s = 'νίκος'
utf8_bytes = s.encode('utf-8')
greek_bytes = utf8_bytes.encode('iso-8859-7')

Can we re-encode twice, or as many times as we want, and then decode back 
respectively, like:

utf8_bytes = greek_bytes.decode('iso-8859-7')
s = utf8_bytes.decode('utf-8')

Is something like that totally crazy?
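
For what it is worth, in Python 3 the snippet above cannot run as written, because bytes objects have no encode() method: encoding always starts from a str. Converting between two byte encodings therefore means decoding to str first and then encoding again. A sketch, with illustrative variable names:

```python
s = "\u03bd\u03af\u03ba\u03bf\u03c2"      # the five-letter Greek name from the question
utf8_bytes = s.encode("utf-8")

has_encode = hasattr(utf8_bytes, "encode")  # bytes has decode() only

# Transcode: bytes -> str (decode) -> bytes in the other encoding (encode).
greek_bytes = utf8_bytes.decode("utf-8").encode("iso-8859-7")
back = greek_bytes.decode("iso-8859-7")

print(has_encode, len(utf8_bytes), len(greek_bytes), back == s)
```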

And also, is there a difference between encoding and compressing?

Isn't the latter using some form of encoding to take a string or bytes and make 
them take less space on disk?


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-07 Thread nagia . retsina
On Friday, 7 June 2013 4:25:40 AM UTC+3, Steven D'Aprano wrote:

 MRAB tells you to work with the bytes, because the filenames' bytes are 
 invalid decoded as UTF-8. If you fix the file names by renaming using a 
 terminal set to UTF-8, then they will be valid and you can forget about  
 working with bytes.

Yes, but 'putty' seems to always forget when I tell it to use utf8 for 
displaying and always picks up Win8's default charset, and it doesn't have a 
save-options dialog. I can't always remember to switch to the utf8 charset, or 
keep renaming so many Greek filenames from the terminal all the time.

 Working with bytes is only for when the file names are turned to garbage.  
 Your file names (some of them) are turned to garbage. Fix them, and then 
 use file names as strings.

Can't. '~/data/apps/' is filled every day with more and more files which are 
uploaded via the FileZilla client, which I think behaves pretty much like putty, 
uploading filenames as greek-iso bytes.

So that garbage will happen every day due to the 'Putty' and 'FileZilla' clients.

So files.py, before doing its stuff, must do the automatic conversions from 
greek bytes to utf-8 bytes.
Here is what i have up until now.

=
# Collect filenames of the path dir as bytes
filename_bytes = os.listdir( b'/home/nikos/public_html/data/apps/' )

# Iterate over all filenames in the path dir
for filename in filenames_bytes:
    # Compute 'path/to/filename' in bytes
    filepath_bytes = b'/home/nikos/public_html/data/apps/' + b'filename'
    try:
        filepath = filepath_bytes.decode('utf-8')
    except UnicodeDecodeError:
        try:
            filepath = filepath_bytes.decode('iso-8859-7')

            # Rename filename from greek bytes = utf-8 bytes
            os.rename( filepath_bytes filepath.encode('utf-8') )
        except UnicodeDecodeError:
            print "I give up! This filename is unreadable!"
=

This is the best i can come up with, but after:

ni...@superhost.gr [~/www/cgi-bin]# python files.py
  File files.py, line 75
os.rename( filepath_bytes filepath.encode('utf-8') )
 ^
SyntaxError: invalid syntax
ni...@superhost.gr [~/www/cgi-bin]#



I am seeing the caret pointing at filepath but I can't follow what it tries to 
tell me. No parenthesis missed or added this time due to speed and tiredness.

This rename statement tries to convert the greek-byted filepath to a utf-8-byted 
filepath.

I can't see why this is wrong though.


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-07 Thread Chris Angelico
On Fri, Jun 7, 2013 at 4:35 PM,  nagia.rets...@gmail.com wrote:
 Yes, but but 'putty' seems to always forget when i tell it to use utf8 for 
 displaying and always picks up the Win8's default charset and it doesnt have 
 a save options dialog. I cant always remember to switch to utf8 charset or 
 renaming all the time from termnal so many greek filenames.


I use PuTTY too (though that'll change when I next upgrade Traal, as
I'll no longer have any Windows clients), and it's set to UTF-8 in the
Window|Translation page. Far as I know, those settings are all saved
into the Saved Sessions settings, back on the Session page.

ChrisA


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-07 Thread Νικόλαος Κούρας

On 7/6/2013 4:01 am, Cameron Simpson wrote:

On 06Jun2013 11:46, Νίκος Γκρ33κ nikos.gr...@gmail.com wrote:
| On Thursday, 6 June 2013 3:44:52 PM UTC+3, Steven D'Aprano wrote:
|  py> s = '999-Eυχή-του-Ιησού'
|  py> bytes_as_utf8 = s.encode('utf-8')
|  py> t = bytes_as_utf8.decode('iso-8859-7', errors='replace')
|  py> print(t)
|  999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ
|
| errors='replace' mean dont break in case or error?

Yes. The result will be correct for correct iso-8859-7 and slightly mangled
for something that would not decode smoothly.
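
A reduced version of the demonstration above, as a sketch: decoding UTF-8 bytes with the Greek ISO table never recovers the original text, it only produces *some* characters for each byte. When every byte happens to be assigned in ISO-8859-7 the damage is reversible; when one is not (0xAE here), errors='replace' substitutes U+FFFD and that byte is lost.

```python
s = "\u03bb"                              # Greek small lambda
raw = s.encode("utf-8")                   # two bytes: 0xCE 0xBB
garbled = raw.decode("iso-8859-7")        # two unrelated characters
repaired = garbled.encode("iso-8859-7").decode("utf-8")
print(repr(garbled), repaired == s)

# Eta with tonos encodes to 0xCE 0xAE; 0xAE is unassigned in ISO-8859-7.
eta_bytes = "\u03ae".encode("utf-8")
replaced = eta_bytes.decode("iso-8859-7", errors="replace")
print(repr(replaced))
```

So "correct" in the reply above can only mean "does not raise"; the text that comes out is still the wrong text.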
How can it be correct? We have encoded our string in utf-8 and then we 
tried to decode it as greek-iso? How can this possibly be correct?


| You took the unicode 's' string you utf-8 bytestringed it.
| Then how its possible to ask for the utf8-bytestring to decode
| back to unicode string with the use of a different charset that the
| one used for encoding and thsi actually printed the filename in
| greek-iso?

It is easily possible, as shown above. Does it make sense? Normally
not, but Steven is demonstrating how your mv exercises have
behaved: a rename using utf-8, then a _display_ using iso-8859-7.
Same as above, I don't understand it at all, since different 
charsets (encodings) are used in the encode/decode process.

|
| a) WHAT does it mean when a linux system is set to use utf-8?

It means the locale settings _for the current process_ are set for
UTF-8. The locale command will show you the current state.
That means that, when a linux application needs to save a filename to 
the linux filesystem, the app checks the filesystem's 'locale', so as to 
encode the filename using the utf-8 charset?
And likewise, when a linux application wants to decode a filename, it also 
checks the filesystem's 'locale' setting so as to know what charset it must 
use to decode the filename correctly back to the original string?


So the locale is used by the filesystem itself and by linux apps to know how to 
read (decode) and write (encode) filenames from/into the system's HDD?



| c) WHAT happens when the two of them try to work together?

If everything matches, it is all good. If the locales do not match,
the mismatch will result in an undesired bytes-characters
encode/decode step somewhere, and something will display incorrectly
or be entered as input incorrectly.
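
A sketch of the mismatch described above, under the assumption that one side writes a filename as ISO-8859-7 bytes while the other side reads names as UTF-8 (and the reverse for display). The Greek word used is taken from the filename quoted earlier in the thread; everything else is illustrative.

```python
import locale
import sys

# What this particular process believes; the values depend on the environment.
print(locale.getpreferredencoding(False), sys.getfilesystemencoding())

name = "\u0395\u03c5\u03c7\u03ae"          # four Greek letters

# Writer uses Greek ISO, reader expects UTF-8: the bytes are not valid UTF-8.
sent_bytes = name.encode("iso-8859-7")
try:
    sent_bytes.decode("utf-8")
    readable = True
except UnicodeDecodeError:
    readable = False

# Writer uses UTF-8, terminal displays as Greek ISO: readable garbage.
shown = name.encode("utf-8").decode("iso-8859-7", errors="replace")
print(readable, repr(shown))
```

The filesystem itself stores only the bytes; it is each program's locale that decides how those bytes are turned into characters, which is why the same file can look right in one tool and wrong in another.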


Can't quite grasp the idea:

local end: Win8,  locale = greek-iso
remote end: CentOS 6.4,  locale = utf-8

FileZilla: I do not know what charset it uses by default to upload filenames
Putty by default uses greek-iso to display filenames


WHAT can someone expect to happen when all of the above work together?
A mess of course, but I want to hear in detail each step of the mess as it 
emerges.




Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-07 Thread Lele Gaifax
nagia.rets...@gmail.com writes:

   File files.py, line 75
 os.rename( filepath_bytes filepath.encode('utf-8') )
  ^
 SyntaxError: invalid syntax

 I am seeign the caret pointing at filepath but i cant follow what it
 tries to tell me.

As already explained, often a SyntaxError is introduced by *preceding*
text, so you must look at your code with a wider eye.

 This rename statement tries to convert the greek byted filepath to
 utf-8 byted filepath.

Yes: and that usually implies that the *function* accepts (at least) *two*
arguments, specifically the source and the target names, right? How many
arguments are you actually giving to the os.rename() function above?

 I can't see whay this is wrong though.

Try harder. I won't give you further hints about your
SyntaxErrors; you *must* learn how to detect and fix those by yourself.

ciao, lele.
-- 
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
l...@metapensiero.it  | -- Fortunato Depero, 1929.



Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-07 Thread Νικόλαος Κούρας
On Friday, 7 June 2013 9:46:53 AM UTC+3, Chris Angelico wrote:
 On Fri, Jun 7, 2013 at 4:35 PM,  nagia.rets...@gmail.com wrote:
 
  Yes, but 'putty' seems to always forget when I tell it to use utf8 for
  displaying and always picks up Win8's default charset, and it doesn't
  have a save options dialog. I can't always remember to switch to the utf8
  charset, or keep renaming so many Greek filenames from the terminal.
 
 I use PuTTY too (though that'll change when I next upgrade Traal, as
 I'll no longer have any Windows clients), and it's set to UTF-8 in the
 Window|Translation page. Far as I know, those settings are all saved
 into the Saved Sessions settings, back on the Session page.
 
 ChrisA


Session settings, afaik, are for putty to remember hosts to connect to, not
terminal options. I might be wrong though. No matter how many times I change
its options, the next time I run it, it always defaults back.

I'll google Traal right now.
You should also take a look at the 'Secure Shell' extension for Chrome, which
I just found.

Seems a great plugin for Chrome. You'll definitely like it, I did!


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-07 Thread Chris Angelico
On Fri, Jun 7, 2013 at 5:08 PM, Νικόλαος Κούρας nikos.gr...@gmail.com wrote:
 I'll google Traal right now.

The one thing you're actually willing to go research, and it's
actually something that won't help you. Traal is the name of my
personal laptop. Spend your Googletrons on something else. :)

ChrisA


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-07 Thread Νικόλαος Κούρας
On Friday, 7 June 2013 10:09:29 AM UTC+3, Lele Gaifax wrote:

 As already explained, often a SyntaxError is introduced by *preceding*
 text, so you must look at your code with a wider eye.

That's what I hate about error reporting. You have a syntax error someplace
and the error report points you at another line, so you have to check the
whole code again.
Well, I just did; I see no syntactical errors.

 Yes: and that usually implies that the *function* accepts (at least) *two*
 arguments, specifically the source and the target names, right? How many
 arguments are you actually giving to the os.rename() function above?

I'm giving it two.
os.rename( filepath_bytes filepath.encode('utf-8') )

1st = filepath_bytes
2nd = filepath.encode('utf-8')

Source and Target respectively.


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-07 Thread Michael Weylandt


On Jun 7, 2013, at 8:32, Νικόλαος Κούρας nikos.gr...@gmail.com wrote:

 On Friday, 7 June 2013 10:09:29 AM UTC+3, Lele Gaifax wrote:
 
 As already explained, often a SyntaxError is introduced by *preceding*
 text, so you must look at your code with a wider eye.
 
 That's what I hate about error reporting. You have a syntax error someplace
 and the error report points you at another line, so you have to check the
 whole code again.
 Well, I just did; I see no syntactical errors.
 
 Yes: and that usually implies that the *function* accepts (at least) *two*
 arguments, specifically the source and the target names, right? How many
 arguments are you actually giving to the os.rename() function above?
 
 I'm giving it two.
 os.rename( filepath_bytes filepath.encode('utf-8') 

Missing comma, which is, after all, just a matter of syntax so it can't matter, 
right?


 
 1st = filepath_bytes
 2nd = filepath.encode('utf-8')
 
 Source and Target respectively.


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-07 Thread Νικόλαος Κούρας

On 7/6/2013 10:42 am, Michael Weylandt wrote:


os.rename( filepath_bytes filepath.encode('utf-8')
Missing comma, which is, after all, just a matter of syntax so it can't matter, 
right?


I doubted that os.rename arguments must be comma separated.
But after reading the docs:

os.rename(src, dst) http://docs.python.org/2/library/os.html#os.rename

   Rename the file or directory src to dst. If dst is a directory,
   OSError will be raised. On Unix, if dst exists and is a file, it will
   be replaced silently if the user has permission. The operation may
   fail on some Unix flavors if src and dst are on different filesystems.
   If successful, the renaming will be an atomic operation (this is a
   POSIX requirement). On Windows, if dst already exists, OSError will be
   raised even if it is a file; there may be no way to implement an
   atomic rename when dst names an existing file.

   Availability: Unix, Windows.

Indeed it has to be:

os.rename( filepath_bytes, filepath.encode('utf-8')

'mv source target' didn't require commas, so I thought it was safe to assume
that os.rename did not either.


I am happy to announce that after correcting many idiotic errors like commas,
missing colons and variable declarations, this surrogate error is the last one
I get.
I still don't understand what surrogate means. In English it means replacement.
Here is the code:


#
# Collect filenames of the path dir as bytes
filename_bytes = os.listdir( b'/home/nikos/public_html/data/apps/' )

# Iterate over all filenames in the path dir
for filename in filename_bytes:
    # Compute 'path/to/filename' in bytes
    filepath_bytes = b'/home/nikos/public_html/data/apps/' + b'filename'
    try:
        filepath = filepath_bytes.decode('utf-8')
    except UnicodeDecodeError:
        try:
            filepath = filepath_bytes.decode('iso-8859-7')

            # Rename current filename from greek bytes = utf-8 bytes
            os.rename( filepath_bytes, filepath.encode('utf-8') )
        except UnicodeDecodeError:
            print( '''I give up! This filename is unreadable! ''')


#
# Get filenames of the apps directory as unicode
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for filename in filenames:
    try:
        # Check the presence of a file against the database and insert if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
        data = cur.fetchone()        #filename is unique, so should only be one

        if not data:
            # First time for file; primary key is automatic, hit is defaulted
            cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
    except pymysql.ProgrammingError as e:
        print( repr(e) )


#
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )
filenames = ()

# Build a set of 'path/to/filename' based on the objects of path dir
for filename in filenames:
    filenames.add( filename )

# Delete spurious
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()

# Check database's filenames against path's filenames
for filename in data:
    if filename not in filenames:
        cur.execute('''DELETE FROM files WHERE url = %s''', (filename,) )



=

[Fri Jun 07 11:08:17 2013] [error] [client 79.103.41.173]   File 
/home/nikos/public_html/cgi-bin/files.py, line 88, in module
[Fri Jun 07 11:08:17 2013] [error] [client 79.103.41.173] 
cur.execute('''SELECT url FROM files WHERE url = %s''', filename )
[Fri Jun 07 11:08:17 2013] [error] [client 79.103.41.173]   File 
/usr/local/lib/python3.3/site-packages/PyMySQL3-0.5-py3.3.egg/pymysql/cursors.py,
 line 108, in execute
[Fri Jun 07 11:08:17 2013] [error] [client 79.103.41.173] query = 
query.encode(charset)
[Fri Jun 07 11:08:17 2013] [error] [client 79.103.41.173] UnicodeEncodeError: 
'utf-8' codec can't encode character '\\udcce' in position 35: surrogates not 
allowed





Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-07 Thread R. Michael Weylandt
On Fri, Jun 7, 2013 at 9:10 AM, Νικόλαος Κούρας nikos.gr...@gmail.com wrote:
 On 7/6/2013 10:42 am, Michael Weylandt wrote:

 os.rename( filepath_bytes filepath.encode('utf-8')

 Missing comma, which is, after all, just a matter of syntax so it can't
 matter, right?

 I doubted that os.rename arguments must be comma separated.

All function calls in Python require commas if you are putting in more
than one argument. [0]

 But after reading the docs:

 os.rename(src, dst)

 Rename the file or directory src to dst. If dst is a directory, OSError will
 be raised. On Unix, if dst exists and is a file, it will be replaced
 silently if the user has permission. The operation may fail on some Unix
 flavors if src and dst are on different filesystems. If successful, the
 renaming will be an atomic operation (this is a POSIX requirement). On
 Windows, if dst already exists, OSError will be raised even if it is a file;
 there may be no way to implement an atomic rename when dst names an existing
 file.

 Availability: Unix, Windows.

 Indeed it has to be:

 os.rename( filepath_bytes, filepath.encode('utf-8')

Parenthesis missing here as well.


 'mv source target' didn't require commas, so I thought it was safe to assume
 that os.rename did not either.


That's for shell programming -- different language entirely.

The surrogate business is back to Unicode, which ain't my specialty so
I'll leave that to more able programmers.

MW

[0] You could pass multiple arguments by way of a tuple or dictionary
using */** but if you want arguments that aren't in the container
being passed, you're back to needing commas.
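
To illustrate footnote [0], here is a minimal sketch. The function
rename_pair is a made-up stand-in for any two-argument call such as
os.rename(); it is not part of the original script.

```python
def rename_pair(src, dst):
    """Hypothetical stand-in for a two-argument call such as os.rename()."""
    return (src, dst)

# Plain call: the two arguments are separated by a comma.
explicit = rename_pair("old.mp3", "new.mp3")

# The same call with the arguments unpacked from a tuple ...
names = ("old.mp3", "new.mp3")
from_tuple = rename_pair(*names)

# ... or from a dictionary keyed by parameter name.
from_dict = rename_pair(**{"src": "old.mp3", "dst": "new.mp3"})
```

All three calls pass the same two arguments; only the first spells them
out with a comma.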


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-07 Thread Cameron Simpson
On 07Jun2013 11:10, Νίκος Γκρ33κ nikos.gr...@gmail.com wrote:
| On 7/6/2013 10:42 am, Michael Weylandt wrote:
| os.rename( filepath_bytes filepath.encode('utf-8')
| Missing comma, which is, after all, just a matter of syntax so it can't matter, right?
|
| I doubted that os.rename arguments must be comma separated.

Why?

Every other python function separates arguments with commas.

| 'mv source target' didn't require commas, so I thought it was safe to assume
| that os.rename did not either.

mv is shell syntax.
os.rename is Python syntax.

Two totally separate languages.
-- 
Cameron Simpson c...@zip.com.au

Cynic, n. A blackguard whose faulty vision sees things as they are, not as
they ought to be.
Ambrose Bierce (1842-1914), U.S. author. The Devil's Dictionary (1881-1906).


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-07 Thread Cameron Simpson
On 07Jun2013 09:56, Νίκος Γκρ33κ nikos.gr...@gmail.com wrote:
| On 7/6/2013 4:01 am, Cameron Simpson wrote:
| On 06Jun2013 11:46, Νίκος Γκρ33κ nikos.gr...@gmail.com wrote:
| | On Thursday, 6 June 2013 3:44:52 PM UTC+3, Steven D'Aprano wrote:
| |  py> s = '999-Eυχή-του-Ιησού'
| |  py> bytes_as_utf8 = s.encode('utf-8')
| |  py> t = bytes_as_utf8.decode('iso-8859-7', errors='replace')
| |  py> print(t)
| |  999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ
| |
| | errors='replace' means don't break in case of error?
| 
| Yes. The result will be correct for correct iso-8859-7 and slightly mangled
| for something that would not decode smoothly.
|
| How can it be correct? We have encoded our string in utf-8 and then
| we tried to decode it as greek-iso? How can this possibly be
| correct?

Ok, not correct. But consistent. Safe to call.

If it is a valid iso-8859-7 sequence (which might cover everything,
since I expect it is an 8-bit 1:1 mapping from bytes values to a
set of codepoints, just like iso-8859-1) then it may decode to the
wrong characters, but the reverse process (characters encoded as
bytes) should produce the original bytes.  With a mapping like this,
errors='replace' may mean nothing; there will be no errors because
the only Unicode characters in play are all from iso-8859-7 to start
with. Of course another string may not be safe.
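
A small sketch of that round trip, using only the standard codecs. The
helper name roundtrip_ok is made up for illustration. It also shows the
"may not be safe" case: ISO-8859-7 leaves a few byte values (0xAE among
them) unassigned, so the mapping does not quite cover every byte.

```python
def roundtrip_ok(name, errors):
    """Encode as UTF-8, misread as ISO-8859-7, then try to get the bytes back."""
    raw = name.encode('utf-8')
    try:
        misread = raw.decode('iso-8859-7', errors=errors)
        return misread.encode('iso-8859-7', errors='replace') == raw
    except UnicodeError:
        return False

# 'α' is the UTF-8 bytes CE B1; both bytes are assigned in ISO-8859-7.
alpha_ok = roundtrip_ok('α', 'strict')

# 'ή' is the UTF-8 bytes CE AE; 0xAE is unassigned in ISO-8859-7.
eta_strict = roundtrip_ok('ή', 'strict')     # strict decoding raises
eta_replace = roundtrip_ok('ή', 'replace')   # byte replaced, original lost
```

So errors='replace' keeps the call from raising, but for such a string
the original bytes can no longer be recovered from the decoded text.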

| | You took the unicode 's' string and encoded it to a utf-8 bytestring.
| | Then how is it possible to ask for the utf-8 bytestring to decode
| | back to a unicode string using a different charset than the
| | one used for encoding, and this actually printed the filename in
| | greek-iso?
| 
| It is easily possible, as shown above. Does it make sense? Normally
| not, but Steven is demonstrating how your mv exercises have
| behaved: a rename using utf-8, then a _display_ using iso-8859-7.
|
| Same as above, i don't understand it at all, since different
| charsets(encodings) used in the encode/decode process.

Visually, the names will be garbage. And if you go:

  mv '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ.mp3' '999-Eυχή-του-Ιησού.mp3'

while using the iso-8859-7 locale, the wrong thing will occur
(assuming it even works, though I think it should because all these
characters are represented in iso-8859-7, yes?)

Why?

In the iso-8859-7 locale, your (currently named under an utf-8
regime) file looks like '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ.mp3' (because the
UTF-8 byte sequence maps to those characters in iso-8859-7). When
you issue the above mv command, the new name will be the _iso-8859-7_
bytes encoding for '999-Eυχή-του-Ιησού.mp3'.  Which, under an utf-8
regime will decode to _other_ characters.

If you want to repair filenames, by which I mean, cause them to be correctly
encoded for utf-8, you are best to work in utf-8 (using mv or python).

Of course, the badly named files will then look wrong in your listing.

If you _know_ the filenames were written using iso-8859-7 encoding, and that 
the names are right under that encoding, you can write python code to rename 
them to utf-8.

Totally untested example code:

  import os
  import sys
  from binascii import hexlify

  for bytename in os.listdir( b'.' ):
    unicode_name = bytename.decode('iso-8859-7')
    new_bytename = unicode_name.encode('utf-8')
    print("%s: %s => %s" % (unicode_name, hexlify(bytename),
                            hexlify(new_bytename)), file=sys.stderr)
    os.rename(bytename, new_bytename)

That code should not care what locale you are using because it uses
bytes for the file calls and is explicit about the encoding/decoding
steps.

| | a) WHAT does it mean when a linux system is set to use utf-8?
| 
| It means the locale settings _for the current process_ are set for
| UTF-8. The locale command will show you the current state.
|
| That means that, when a linux application needs to save a filename
| to the linux filesystem, the app checks the filesystem's 'locale', so
| as to encode the filename using the utf-8 charset?

At the command line, many will not. They'll just read and write bytes.

Some will decode/encode. Those that do, should by default use the
current locale.

But broadly, it is GUI apps that care about this because they must
translate byte sequences to glyphs: images of characters. So plenty
of command line tools do not need to care; the terminal application
is the one that presents the names to you; _it_ will decode them
for display. And it is the terminal app that translates your
keystrokes into bytes to feed to the command line.

NOTE: it is NOT the filesystem's locale. It is the current process'
locale, which is deduced from environment variables (which have
defaults if they are not set).

Under Windows I believe filesystems have locales; this can prevent
you storing some files on some filesystems on Windows, because the
filesystem doesn't cope. UNIX just takes bytes.
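
This is also where the "surrogates not allowed" error reported earlier
in the thread comes from. As far as I understand it, when Python 3 on
POSIX is asked for filenames as str, it decodes the raw bytes with the
'surrogateescape' error handler, so bytes that are not valid in the
filesystem encoding turn into lone surrogates such as '\udcce'. A
sketch using only the codec machinery, with a made-up example filename
and assuming the filesystem encoding is UTF-8:

```python
# Bytes as they might sit on disk after an upload in the Greek ISO charset.
greek_iso_name = 'Ιησού.mp3'.encode('iso-8859-7')

# Roughly what a str-based directory listing does under a UTF-8 locale:
as_text = greek_iso_name.decode('utf-8', errors='surrogateescape')

# A strict encode (what a database driver would typically attempt) fails ...
try:
    as_text.encode('utf-8')
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

# ... while the same error handler gives the original bytes back.
recovered = as_text.encode('utf-8', errors='surrogateescape')
```

Renaming the file to a proper UTF-8 name first, as discussed above,
avoids the surrogates altogether.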

| And likewise when a linux application wants to decode a filename is
| also checking the filesystem's 'locale' setting so to know what
| 

Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-07 Thread alex23
On Jun 7, 6:53 pm, Cameron Simpson c...@zip.com.au wrote:
   Experiment:

     LC_ALL=C ls -b
     LC_ALL=utf-8 ls -b
     LC_ALL=iso-8859-7 ls -b

   And the Terminal itself is decoding the output for display, and
   encoding your input keystrokes to feed as input to the command
   line.

This reminded me of something I saw on stackoverflow recently:
http://stackoverflow.com/questions/11735363/python3-unicodeencodeerror-only-when-run-from-crontab

Script would run from shell but not from crontab, as the crontab
environment had different locale settings. Solution was to prepend the
correct LC_CTYPE to the command in the crontab. Would it be similar
for httpd processes?
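
One way to check is to have the script report what it actually sees
when run from the shell, from cron and from the web server. A minimal
sketch; the helper name describe_encodings is made up, and the values
will differ from one environment to the next:

```python
import locale
import os
import sys

def describe_encodings():
    """Collect the encoding-related settings of the current process."""
    return {
        'LC_ALL': os.environ.get('LC_ALL'),
        'LC_CTYPE': os.environ.get('LC_CTYPE'),
        'LANG': os.environ.get('LANG'),
        'preferred': locale.getpreferredencoding(False),
        'filesystem': sys.getfilesystemencoding(),
        'stdout': getattr(sys.stdout, 'encoding', None),
    }

report = describe_encodings()
for key, value in sorted(report.items()):
    print('%s = %r' % (key, value))
```

Comparing the output from the three environments should show whether
the locale variables are missing or different in any of them.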


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-07 Thread Νικόλαος Κούρας
On Friday, 7 June 2013 11:53:04 AM UTC+3, Cameron Simpson wrote:
 On 07Jun2013 09:56, Νίκος Γκρ33κ nikos.gr...@gmail.com wrote:
 | On 7/6/2013 4:01 am, Cameron Simpson wrote:
 | On 06Jun2013 11:46, Νίκος Γκρ33κ nikos.gr...@gmail.com wrote:
 | | On Thursday, 6 June 2013 3:44:52 PM UTC+3, Steven D'Aprano wrote:
 | |  py> s = '999-Eυχή-του-Ιησού'
 | |  py> bytes_as_utf8 = s.encode('utf-8')
 | |  py> t = bytes_as_utf8.decode('iso-8859-7', errors='replace')
 | |  py> print(t)
 | |  999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ
 | |
 | | errors='replace' means don't break in case of error?
 | 
 | Yes. The result will be correct for correct iso-8859-7 and slightly mangled
 | for something that would not decode smoothly.
 |
 | How can it be correct? We have encoded our string in utf-8 and then
 | we tried to decode it as greek-iso? How can this possibly be
 | correct?

 If it is a valid iso-8859-7 sequence (which might cover everything, 
 since I expect it is an 8-bit 1:1 mapping from bytes values to a 
 set of codepoints, just like iso-8859-1) then it may decode to the 
 wrong characters, but the reverse process (characters encoded as
 bytes) should produce the original bytes.  With a mapping like this, 
 errors='replace' may mean nothing; there will be no errors because
 the only Unicode characters in play are all from iso-8859-7 to start
 with. Of course another string may not be safe. 

 Visually, the names will be garbage. And if you go:
   mv '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ.mp3' '999-Eυχή-του-Ιησού.mp3'
 while using the iso-8859-7 locale, the wrong thing will occur
 (assuming it even works, though I think it should because all these
 characters are represented in iso-8859-7, yes?)

All the rest I understood; only the above quotes are still unclear to me.
I can't seem to understand them.

Do you mean that utf-8, latin-iso, greek-iso and ASCII have the same first
0-127 codepoints?

For example, char 'a' has the value of '65' for all of those character sets?
Is that what you mean?

s = 'a'  (This is unicode, right?  Why, when we assign a string to a variable,
is that string's type always unicode, and why does it not automatically become
utf-8, which includes all available world-wide characters? Is Unicode something
different from a character set? )

utf8_byte = s.encode('utf-8')

Now if we decode this back as utf8 we will receive the char 'a'.
I believe the same thing will happen with the latin, greek and ascii
charsets. Correct?

greek_a = utf8_byte.decode('iso-8859-7')
latin_a = utf8_byte.decode('iso-8859-1')
ascii_a = utf8_byte.decode('ascii')
utf8_a = utf8_byte.decode('utf-8')

Is this correct?
All of those decodes will work even if the encoded bytestring was of utf8 type?

The characters that will not decode correctly are those whose codepoints
are greater than 127?

for example if s = 'α' (greek character equivalent to english 'a')

Is this what you mean?


Now back to my almost ready files.py script please:


#
# Collect filenames of the path dir as bytes
greek_filenames = os.listdir( b'/home/nikos/public_html/data/apps/' )

for filename in greek_filenames:
    # Compute 'path/to/filename' in bytes
    greek_path = b'/home/nikos/public_html/data/apps/' + b'filename'
    try:
        filepath = greek_path.decode('iso-8859-7')

        # Rename current filename from greek bytes -- utf-8 bytes
        os.rename( greek_path, filepath.encode('utf-8') )
    except UnicodeDecodeError:
        # Since its not a greek bytestring then its a proper utf8 bytestring
        filepath = greek_path.decode('utf-8')


#
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for filename in filenames:
    try:
        # Check the presence of a file against the database and insert if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', filename )
        data = cur.fetchone()

        if not data:
            # First time for file; primary key is automatic, hit is defaulted
            cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
    except pymysql.ProgrammingError as e:
        print( repr(e) )


#
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )
filepaths = ()

# Build a set of 'path/to/filename' based on the objects of path dir
for filename in filenames:
    filepaths.add( filename )

# Delete spurious
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()

# Check database's filenames against path's filenames
for rec in data:
    if 


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-07 Thread MRAB

On 07/06/2013 12:53, Νικόλαος Κούρας wrote:
[snip]


#
# Collect filenames of the path dir as bytes
greek_filenames = os.listdir( b'/home/nikos/public_html/data/apps/' )

for filename in greek_filenames:
    # Compute 'path/to/filename' in bytes
    greek_path = b'/home/nikos/public_html/data/apps/' + b'filename'
    try:


This is a worse way of doing it because the ISO-8859-7 encoding has 1
byte per codepoint, meaning that it's more 'tolerant' (if that's the
word) of errors. A sequence of bytes that is actually UTF-8 can be
decoded as ISO-8859-7, giving gibberish.

UTF-8 is less tolerant, and it's the encoding that ideally you should
be using everywhere, so it's better to assume UTF-8 and, if it fails,
try ISO-8859-7 and then rename so that any names that were ISO-8859-7
will be converted to UTF-8.
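
A minimal sketch of that order of attempts, working on bytes only. The
helper name normalise_to_utf8 is made up for illustration and does not
touch the filesystem; the caller would pass the result to os.rename()
only when the second value is True.

```python
def normalise_to_utf8(raw_name):
    """Return (utf8_bytes, was_converted) for one raw filename.

    Hypothetical helper: tries UTF-8 first and falls back to ISO-8859-7.
    """
    try:
        raw_name.decode('utf-8')
        return raw_name, False
    except UnicodeDecodeError:
        # May still raise for the few byte values ISO-8859-7 leaves unassigned.
        text = raw_name.decode('iso-8859-7')
        return text.encode('utf-8'), True

already_utf8 = normalise_to_utf8('αβγ.txt'.encode('utf-8'))
from_greek = normalise_to_utf8('αβγ.txt'.encode('iso-8859-7'))
```

Trying the stricter codec first is what keeps names that are already
UTF-8 from being decoded as ISO-8859-7 gibberish and renamed.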

That's the reason I did it that way in the code I posted, but, yet
again, you've changed it without understanding why!


        filepath = greek_path.decode('iso-8859-7')

        # Rename current filename from greek bytes -- utf-8 bytes
        os.rename( greek_path, filepath.encode('utf-8') )
    except UnicodeDecodeError:
        # Since its not a greek bytestring then its a proper utf8 bytestring
        filepath = greek_path.decode('utf-8')


[snip]



Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-07 Thread Steven D'Aprano
On Fri, 07 Jun 2013 04:53:42 -0700, Νικόλαος Κούρας wrote:

 Do you mean that utf-8, latin-iso, greek-iso and ASCII have the 1st
 0-127 codepoints similar?

You can answer this yourself. Open a terminal window and start a Python 
interactive session. Then try it and see what happens:


s = ''.join(chr(i) for i in range(128))
bytes_as_utf8 = s.encode('utf-8')
bytes_as_latin1 = s.encode('latin-1')
bytes_as_greek_iso = s.encode('ISO-8859-7')
bytes_as_ascii = s.encode('ascii')

bytes_as_utf8 == bytes_as_latin1 == bytes_as_greek_iso == bytes_as_ascii


What result do you get? True or False?

And now you know the answer, without having to ask.


 For example char 'a' has the value of '65' for all of those character
 sets? Is hat what you mean?

You can answer that question yourself.

c = 'a'
for encoding in ('utf-8', 'latin-1', 'ISO-8859-7', 'ascii'):
print(c.encode(encoding))


By the way, I believe that Python has made a strategic mistake in the way 
that bytes are printed. I think it leads to more confusion, not less. 
Better would be something like this:

c = 'a'
for encoding in ('utf-8', 'latin-1', 'ISO-8859-7', 'ascii'):
print(hex(c.encode(encoding)[0]))


For historical reasons, most (but not all) charsets are supersets of 
ASCII. That is, the first 128 characters in the charset are the same as 
the 128 characters in ASCII.


 s = 'a'  (This is unicode right?  Why when we assign a string to a
 variable that string's type is always unicode 

Strings in Python 3 are Unicode strings. That's just the way Python 
works. Unicode was chosen because Unicode includes over a million 
different characters (well, potentially over a million, most of them are 
currently unused), and is a strict superset of *all* common legacy 
codepages from the old DOS and Windows 95 days.


 and does not automatically
 become utf-8 which includes all available world-wide characters? Unicode
 is something different that a character set? )

Unicode is a character set. It is an enormous set of over one million
characters (technically "code points", but don't worry about the
difference right now) which can be collected in strings.

UTF-8 is an encoding that goes from a string using the Unicode character
set into bytes, and back again. Sometimes, people are lazy and say
"UTF-8" when they mean "Unicode", or vice versa.

UTF-16 and UTF-32 are two different encodings for the same purpose, but 
for various technical reasons UTF-8 is better for files.

'λ' is a character which exists in some charsets but not others. It is 
not in the ASCII charset, nor is it in Latin-1, nor Big-5. It is in the 
ISO-8859-7 charset, and of course it is in Unicode.

In ISO-8859-7, the character 'λ' is stored as byte 0xEB (decimal 235), 
just as the character 'a' is stored as byte 0x61 (decimal 97).

In UTF-8, the character λ is stored as two bytes 0xCE 0xBB.

In UTF-16 (big-endian), the character λ is stored as two bytes 0x03 0xBB.

In UTF-32 (big-endian), the character λ is stored as four bytes 0x00 0x00 
0x03 0xBB.

That's four different ways of "spelling" the same character as bytes,
just as "three", "trois", "drei", "τρία", "três" are all different ways
of spelling the same number 3.
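
The four spellings of 'λ' listed above can be reproduced with a short
sketch that uses only the standard codecs (the variable names are
arbitrary):

```python
import binascii

encodings = ('iso-8859-7', 'utf-8', 'utf-16-be', 'utf-32-be')

# Encode the same one-character string with each encoding.
spellings = {enc: 'λ'.encode(enc) for enc in encodings}

for enc in encodings:
    print(enc, binascii.hexlify(spellings[enc]).decode('ascii'))
```

Decoding each byte string with the encoding that produced it gives 'λ'
back in every case.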


 utf8_byte = s.encode('utf-8')
 
 Now if we are to decode this back to utf8 we will receive the char 'a'.
 I beleive same thing will happen with latin, greek, ascii isos. Correct?

Why don't you try it for yourself and see?



 The characters that will not decode correctly are those that their
 codepoints are greater that  127 ?

Maybe, maybe not. It depends on which codepoint, and which encodings. 
Some encodings use the same bytes for the same characters. Some encodings 
use different bytes. It all depends on the encoding, just like American 
and English both spell 3 "three", while French spells it "trois".


 for example if s = 'α' (greek character equivalent to english 'a')

In Latin-1, 'α' does not exist:

py> 'α'.encode('latin-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u03b1' in
position 0: ordinal not in range(256)


In the legacy Greek charset, ISO-8859-7, 'α' is stored as byte 0xE1:

py> 'α'.encode('ISO-8859-7')
b'\xe1'


But in the legacy *Russian* charset, ISO-8859-5, the byte 0xE1 means
a completely different character, CYRILLIC SMALL LETTER ES:

py> b'\xE1'.decode('ISO-8859-5')
'с'

(don't be fooled that this looks like the English c, it is not the same).


In Unicode, 'α' is always codepoint 0x3B1 (decimal 945):

py> ord('α')
945

but before you can store that on a disk, or as a file name, it needs to
be converted to bytes, and which bytes you get depends on which encoding
you use:

py> 'α'.encode('utf-8')
b'\xce\xb1'

py> 'α'.encode('utf-16be')
b'\x03\xb1'

py> 'α'.encode('utf-32be')
b'\x00\x00\x03\xb1'


-- 
Steven


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-07 Thread Νικόλαος Κούρας
On Friday, 7 June 2013 5:29:25 PM UTC+3, MRAB wrote:

 This is a worse way of doing it because the ISO-8859-7 encoding has 1
 byte per codepoint, meaning that it's more 'tolerant' (if that's the 
 word) of errors. A sequence of bytes that is actually UTF-8 can be
 decoded as ISO-8859-7, giving gibberish.

 UTF-8 is less tolerant, and it's the encoding that ideally you should 
 be using everywhere, so it's better to assume UTF-8 and, if it fails,  
 try ISO-8859-7 and then rename so that any names that were ISO-8859-7
 will be converted to UTF-8.

Indeed, I wasn't aware of that at the time; I was under the impression that if
a string was encoded to bytes using some charset, it could only be switched back
with the use of that and only that charset. Since this is the case, here is my
fix:


#
# Collect filenames of the path dir as bytes
filename_bytes = os.listdir( b'/home/nikos/public_html/data/apps/' )

for filename in filename_bytes:
    # Compute 'path/to/filename' into bytes
    filepath_bytes = b'/home/nikos/public_html/data/apps/' + b'filename'
    flag = False

    try:
        # Assume current file is utf8 encoded
        filepath = filepath_bytes.decode('utf-8')
        flag = 'utf8'
    except UnicodeDecodeError:
        try:
            # Since current filename is not utf8 encoded then it has to be greek-iso encoded
            filepath = filepath_bytes.decode('iso-8859-7')
            flag = 'greek'
        except UnicodeDecodeError:
            print( '''I give up! File name is unreadable!''' )

    if( flag = 'greek' )
        # Rename filename from greek bytes -- utf-8 bytes
        os.rename( filepath_bytes, filepath.encode('utf-8') )


#
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for filename in filenames:
    try:
        # Check the presence of a file against the database and insert if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', filename )
        data = cur.fetchone()

        if not data:
            # First time for file; primary key is automatic, hit is defaulted
            cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
    except pymysql.ProgrammingError as e:
        print( repr(e) )


#
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )
filepaths = ()

# Build a set of 'path/to/filename' based on the objects of path dir
for filename in filenames:
    filepaths.add( filename )

# Delete spurious
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()

# Check database's filenames against path's filenames
for rec in data:
    if rec not in filepaths:
        cur.execute('''DELETE FROM files WHERE url = %s''', rec )

=
ni...@superhost.gr [~/www/cgi-bin]# [Fri Jun 07 21:49:33 2013] [error] [client 
79.103.41.173]   File /home/nikos/public_html/cgi-bin/files.py, line 81
[Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] if( flag == 
'greek' )
[Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173]   
  ^
[Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] SyntaxError: invalid 
syntax
[Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] Premature end of 
script headers: files.py
---
i dont know why that if statement errors.


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-07 Thread Zero Piraeus
:

On 7 June 2013 14:52, Νικόλαος Κούρας nikos.gr...@gmail.com wrote:
File /home/nikos/public_html/cgi-bin/files.py, line 81
 [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] if( flag == 
 'greek' )
 [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] 
 ^
 [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] SyntaxError: 
 invalid syntax
 [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] Premature end of 
 script headers: files.py
 ---
 i dont know why that if statement errors.

Oh for f... READ SOME DOCUMENTATION, FOR THE LOVE OF BOB!!! READ YOUR
OWN EFFING CODE!

Look at this:

  http://docs.python.org/2/tutorial/controlflow.html

Read it now? Of course not. Go away and read it.

Now have you read it? GO AND READ IT.

What does an if statement end with? Hint: yep, that's it.
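
For anyone following along, the difference can be checked without
running the whole script by asking Python to compile the two variants.
This is only a sketch; the helper name compiles is made up.

```python
def compiles(source):
    """Return True if the source text is syntactically valid Python."""
    try:
        compile(source, '<example>', 'exec')
        return True
    except SyntaxError:
        return False

# The header line as it appears in the traceback, with no trailing colon.
without_colon = compiles("flag = 'greek'\nif( flag == 'greek' )\n    pass\n")

# The same test written the usual way, ending in a colon.
with_colon = compiles("flag = 'greek'\nif flag == 'greek':\n    pass\n")
```

Only the second variant compiles.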

 -[]z.


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-07 Thread MRAB

On 07/06/2013 20:31, Zero Piraeus wrote:

:

On 7 June 2013 14:52, Νικόλαος Κούρας nikos.gr...@gmail.com wrote:
File /home/nikos/public_html/cgi-bin/files.py, line 81

[Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] if( flag == 
'greek' )
[Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173]   
  ^
[Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] SyntaxError: invalid 
syntax
[Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] Premature end of 
script headers: files.py
---
i dont know why that if statement errors.


Oh for f... READ SOME DOCUMENTATION, FOR THE LOVE OF BOB!!! READ YOUR
OWN EFFING CODE!

Look at this:

   http://docs.python.org/2/tutorial/controlflow.html

Read it now? Of course not. Go away and read it.

Now have you read it? GO AND READ IT.

What does an if statement end with? Hint: yep, that's it.


Have you noticed how the line in the traceback doesn't match the line
in the post?
--
http://mail.python.org/mailman/listinfo/python-list


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-07 Thread Zero Piraeus
:

On 7 June 2013 16:45, MRAB pyt...@mrabarnett.plus.com wrote:
 On 07/06/2013 20:31, Zero Piraeus wrote:
 [something exasperated, in capitals]

 Have you noticed how the line in the traceback doesn't match the line
 in the post?

Actually, I hadn't. It's not exactly a surprise at this point, though ...

I learnt a new word today, while searching for an apt ending to the
sentence Reading Nikos' posts is the internet equivalent of ...

... and that word is Dermatillomania.

 -[]z.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-07 Thread Cameron Simpson
On 07Jun2013 11:52, Νίκος Γκρ33κ nikos.gr...@gmail.com wrote:
| ni...@superhost.gr [~/www/cgi-bin]# [Fri Jun 07 21:49:33 2013] [error] 
[client 79.103.41.173]   File /home/nikos/public_html/cgi-bin/files.py, line 
81
| [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] if( flag == 
'greek' )
| [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] 
^
| [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] SyntaxError: 
invalid syntax
| [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] Premature end of 
script headers: files.py
| ---
| i dont know why that if statement errors.

Python compound statements (if, while, try, etc.) end their header line with a colon, so:

  if flag == 'greek':
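
A short sketch of the point above, assuming a variable named flag as in the error log. The parentheses around the condition are harmless; it is the missing colon that triggers the SyntaxError. The compile() call is only there to reproduce the error without stopping the script.

```python
flag = 'greek'

# Parentheses around the condition are optional; the trailing colon is not.
if flag == 'greek':
    message = 'flag is greek'
else:
    message = 'flag is something else'

# Compiling the line from the error log (no colon) raises SyntaxError.
try:
    compile("if( flag == 'greek' )\n    pass\n", '<example>', 'exec')
    colon_required = False
except SyntaxError:
    colon_required = True
```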

Cheers,
-- 
Cameron Simpson c...@zip.com.au

Hello, my name is Yog-Sothoth, and I'll be your eldritch horror today.
- Heather Keith hkei...@andrew.cmu.edu
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-07 Thread Cameron Simpson
On 07Jun2013 04:53, Νίκος Γκρ33κ nikos.gr...@gmail.com wrote:
| On Friday, 7 June 2013 11:53:04 AM UTC+3, Cameron Simpson wrote:
|  | | errors='replace' mean dont break in case or error?
|  
|  | Yes. The result will be correct for correct iso-8859-7 and slightly 
mangled
|  | for something that would not decode smoothly.
|  
|  | How can it be correct? We have encoded out string in utf-8 and then
|  | we tried to decode it as greek-iso? How can this possibly be
|  | correct?
| 
|  If it is a valid iso-8859-7 sequence (which might cover everything, 
|  since I expect it is an 8-bit 1:1 mapping from bytes values to a 
|  set of codepoints, just like iso-8859-1) then it may decode to the 
|  wrong characters, but the reverse process (characters encoded as
|  bytes) should produce the original bytes.  With a mapping like this, 
|  errors='replace' may mean nothing; there will be no errors because
|  the only Unicode characters in play are all from iso-8859-7 to start
|  with. Of course another string may not be safe. 
| 
|  Visually, the names will be garbage. And if you go:
|mv '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ.mp3' '999-Eυχή-του-Ιησού.mp3'
|  while using the iso-8859-7 locale, the wrong thing will occur
|  (assuming it even works, though I think it should because all these
|  characters are represented in iso-8859-7, yes?)
| 
| I understood all the rest; only the quotes above are still unclear to me.
| I can't seem to understand them.
| 
| Do you mean that utf-8, latin-iso, greek-iso and ASCII share the same first
| 0-127 codepoints?

Yes. It is certainly true for utf-8 and latin-iso and ASCII.
I expect it to be so for greek-iso, but have not checked.

They're all essentially the ASCII set plus a range of other character
codepoints for the upper values.  The 8-bit sets iso-8859-1 (which
I take you to mean by latin-iso) and iso-8859-7 (which I take you
to mean by greek-iso) are single byte mapping with the top half
mapped to characters commonly used in a particular region.

Unicode has a much greater range, but the UTF-8 encoding of Unicode
deliberately has the bottom 0-127 identical to ASCII, and higher
values represented by multibyte sequences in which every byte,
including the first, is in the 128-255 range. In this way pure ASCII files
are already in UTF-8 (and, in fact, work just fine for the iso-8859-x
encodings as well).
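
A small sketch of the paragraph above: an ASCII-only string encodes to the same bytes under all four encodings, while a Greek letter needs one high byte in iso-8859-7 and two high bytes in UTF-8. The sample strings are arbitrary.

```python
s = 'abc.mp3'
encodings = ['ascii', 'utf-8', 'iso-8859-1', 'iso-8859-7']

# Pure ASCII text produces identical bytes in every one of these encodings.
encoded = {enc: s.encode(enc) for enc in encodings}
all_same = len(set(encoded.values())) == 1

# A character outside ASCII is where the encodings diverge.
greek = 'α'
utf8_bytes = greek.encode('utf-8')        # multibyte, every byte >= 128
iso7_bytes = greek.encode('iso-8859-7')   # a single byte >= 128

# The codepoint of 'a' is 97 in all of them (65 is 'A').
code_of_a = ord('a')
```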

| For example char 'a' has the value of '65' for all of those character sets?
| Is that what you mean?

Yes, the value is the same in all of them (strictly, 'a' is 97; 65 is 'A').

| s = 'a'  (This is unicode right?  Why when we assign a string to
| a variable that string's type is always unicode and does not
| automatically become utf-8 which includes all available world-wide
| characters? Unicode is something different that a character set? )

In Python 3, yes. Strings are unicode. Note that that means they are
sequences of codepoints whose meaning is as for Unicode.

utf-8 is a byte encoding for Unicode strings. An external storage
format, if you like. The first 0-127 codepoints are 1:1 with byte
values, and the higher code points require multibyte sequences.

| utf8_byte = s.encode('utf-8')

Unicode string => utf-8 byte encoding.

| Now if we are to decode this back to utf8 we will receive the char 'a'.

Yes.

| I beleive same thing will happen with latin, greek, ascii isos. Correct?
| 
| utf8_a = utf8_byte.decode('iso-8859-7')
| latin_a = utf8_byte.decode('iso-8859-1')
| ascii_a = utf8_byte.decode('ascii')
| utf8_a = utf8_byte.decode('iso-8859-7')
| 
| Is this correct? 

Yes, because of the design decision about the 0-127 codepoints.

| All of those decodes will work even if the encoded bytestring was of utf8 
type?
| 
| The characters that will not decode correctly are those that their codepoints 
are greater that  127 ?
| for example if s = 'α' (greek character equivalent to english 'a')
| Is this what you mean?

Yes, exactly so.
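
To make that concrete, a sketch using the 'α' from the question. Decoding its UTF-8 bytes with a single-byte codec does not raise an error; it just yields two unrelated characters, which is the kind of garbage name shown earlier in the thread. Because iso-8859-7 happens to define both byte values involved, re-encoding gives the original bytes back.

```python
utf8_byte = 'α'.encode('utf-8')

# Single-byte codecs map each byte to some character, so these "work",
# but the result is two wrong characters instead of one alpha.
as_greek = utf8_byte.decode('iso-8859-7')
as_latin = utf8_byte.decode('iso-8859-1')

# ASCII has no meaning for bytes above 127, so this one fails outright.
try:
    utf8_byte.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# The wrong decode is reversible here: encoding again restores the bytes.
round_trip = as_greek.encode('iso-8859-7') == utf8_byte
```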

| 
| 
| Now back to my almost ready files.py script please:
| 
| 
| #
| # Collect filenames of the path dir as bytes
| greek_filenames = os.listdir( b'/home/nikos/public_html/data/apps/' )
| 
| for filename in greek_filenames:
|   # Compute 'path/to/filename' in bytes
|   greek_path = b'/home/nikos/public_html/data/apps/' + b'filename'

You don't mean b'filename', which is the literal word filename.
You mean: filename.encode('iso-8859-7')

More probably, you mean:

  dirpath = b'/home/nikos/public_html/data/apps/'
  greek_filenames = os.listdir(dirpath)
  for greek_filename in greek_filenames:
    try:
      filename = greek_filename.decode('iso-8859-7')

and then:

      greek_path = dirpath + greek_filename
      utf8_filename = filename.encode('utf-8')
      utf8_path = dirpath + utf8_filename

|   try:
|   filepath = greek_path.decode('iso-8859-7')
|   # Rename current filename from greek bytes -- utf-8 bytes
|   os.rename( greek_path, filepath.encode('utf-8') 

Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-06 Thread Chris Angelico
On Thu, Jun 6, 2013 at 3:54 PM, jmfauth wxjmfa...@gmail.com wrote:
 (filesystems are just bytes,
 yeah, whatever...).

Sure. You tell me what a proper Unicode rendition of an animated GIF is.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-06 Thread Νικόλαος Κούρας
Yes, this is a Linux issue, although the locale is set to utf-8:

root@nikos [~]# locale
LANG=en_US.UTF-8
LC_CTYPE=en_US.UTF-8
LC_NUMERIC=en_US.UTF-8
LC_TIME=en_US.UTF-8
LC_COLLATE=en_US.UTF-8
LC_MONETARY=en_US.UTF-8
LC_MESSAGES=en_US.UTF-8
LC_PAPER=en_US.UTF-8
LC_NAME=en_US.UTF-8
LC_ADDRESS=en_US.UTF-8
LC_TELEPHONE=en_US.UTF-8
LC_MEASUREMENT=en_US.UTF-8
LC_IDENTIFICATION=en_US.UTF-8
LC_ALL=
root@nikos [~]#


Since 'locale' is set to 'utf-8', why does:

mv 'Euxi tou Ihsou.mp3' 'Ευχή του Ιησού.mp3'

lead to this unknown encoded bytestream:
'\305\365\367\336\ \364\357\365\ \311\347\363\357\375.mp3'

which isn't a utf-8 bytestream, as the locale indicated and python
expected?

How is 'files.py' supposed to read this file now, using:

# Compute a set of current fullpaths
fullpaths = set()
path = "/home/nikos/public_html/data/apps/"

for root, dirs, files in os.walk(path):
    for fullpath in files:
        fullpaths.add( os.path.join(root, fullpath) )


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-06 Thread Heiko Wundram

On 05.06.2013 18:44, MRAB wrote:

 From the previous posts I guessed that the filename might be encoded
using ISO-8859-7:

  >>> s = b"\305\365\367\336\ \364\357\365\ \311\347\363\357\375.mp3"
  >>> s.decode("iso-8859-7")
'Ευχή\\ του\\ Ιησού.mp3'

Yes, that looks the same.


Most probably, his terminal is set to ISO-8859-7, so that when he issues 
the rename command on the command-line of his shell session, the mv 
command gets a stream of bytes as the new file name which happens to be 
the ISO-8859-7 encoding of the file name he'd like the file to have. 
This is what's stored on disk.


So, his biggest problem isn't that the operating system is encoding 
agnostic wrt. filenames (i.e., treats them as a stream of bytes), but 
rather that he's using an ISO-7 terminal window when having set up UTF-8 
as his operating system locale and expects filenames to be encoded in 
UTF-8 when he's not passing in UTF-8 byte streams from his client 
computer at all.
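
The explanation above can be checked with a few lines. This is only a sketch of what the two terminal settings would send for the same typed name; the variable names are made up for the illustration. The ISO-8859-7 bytes match the octal escapes that ls printed for the file, and they are not valid UTF-8.

```python
name = 'Ευχή του Ιησού.mp3'

# What an ISO-8859-7 terminal hands to mv, versus what a UTF-8 terminal would.
sent_by_iso7_terminal = name.encode('iso-8859-7')
sent_by_utf8_terminal = name.encode('utf-8')

# The bytes stored on disk, written with the octal escapes from the ls output.
on_disk = b'\305\365\367\336 \364\357\365 \311\347\363\357\375.mp3'
matches_disk = sent_by_iso7_terminal == on_disk

# Those same bytes cannot be read back as UTF-8.
try:
    on_disk.decode('utf-8')
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
```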


--
--- Heiko.
--
http://mail.python.org/mailman/listinfo/python-list


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-06 Thread Mark Lawrence

On 06/06/2013 07:11, Chris Angelico wrote:

On Thu, Jun 6, 2013 at 3:54 PM, jmfauth wxjmfa...@gmail.com wrote:

(filesystems are just bytes,
yeah, whatever...).


Sure. You tell me what a proper Unicode rendition of an animated GIF is.

ChrisA



It's obviously one that doesn't use the flawed Python Flexible String 
Representation :)


--
Steve is going for the pink ball - and for those of you who are 
watching in black and white, the pink is next to the green. Snooker 
commentator 'Whispering' Ted Lowe.


Mark Lawrence

--
http://mail.python.org/mailman/listinfo/python-list


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-06 Thread Cameron Simpson
On 05Jun2013 11:43, Νίκος Γκρ33κ nikos.gr...@gmail.com wrote:
| On Wednesday, 5 June 2013 9:32:15 PM UTC+3, MRAB wrote:
|  Using Python, I think you could get the filenames using os.listdir, 
|  passing the directory name as a bytestring so that it'll return the
|  names as bytestrings.
| 
|  Then, for each name, you could decode from its current encoding and 
|  encode to UTF-8 and rename the file, passing the old and new paths to
|  os.rename as bytestrings.
| 
| Iam not sure i follow:
| 
| Change this:
| 
| # Compute a set of current fullpaths
| fullpaths = set()
| path = "/home/nikos/public_html/data/apps/"
| 
| for root, dirs, files in os.walk(path):
[...]

Have a read of this:

  http://docs.python.org/3/library/os.html#os.listdir

The UNIX API accepts bytes for filenames and paths.

Python 3 strs are sequences of Unicode code points. If you try to
open a file or directory on a UNIX system using a Python str, that
string must be converted to a sequence of bytes before being handed
to the OS.

This is done implicitly using your locale settings if you just use a str.

However, if you pass a bytes to open or listdir, this conversion
does not take place. You put bytes in and in the case of listdir
you get bytes out.

You can work on pathnames in bytes and never concern yourself with
encode/decode at all.

In this way you can write code that does not care about the translation
between Unicode and some arbitrary byte encoding.

Of course, the issue will still arise when accepting user input;
your shell has done exactly this kind of thing when you renamed
your MP3 file. But it is possible to write pure utility code that
doesn't care about filenames as Unicode or str if you work purely
in bytes.
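
A small sketch of the bytes-in, bytes-out behaviour described above, using a throwaway temporary directory and an arbitrary file name so that nothing real is touched.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one empty file to list.
    open(os.path.join(tmp, 'plain.txt'), 'w').close()

    # A str path gives str names (decoded for you by Python).
    as_str = os.listdir(tmp)

    # A bytes path gives the raw bytes names, with no decoding step at all.
    as_bytes = os.listdir(os.fsencode(tmp))
```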

Regarding user filenames, the common policy these days is to use
utf-8 throughout. Of course you need to get everything into that
regime to start with.
-- 
Cameron Simpson c...@zip.com.au

...but C++ gloggles the cheesewad, thus causing a type conflict.
- David Jevans, jev...@apple.com
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-06 Thread Νικόλαος Κούρας
On Thursday, 6 June 2013 11:50:55 AM UTC+3, Heiko Wundram wrote:
 On 05.06.2013 18:44, MRAB wrote:
   From the previous posts I guessed that the filename might be encoded
   using ISO-8859-7:

    >>> s = b"\305\365\367\336\ \364\357\365\ \311\347\363\357\375.mp3"
    >>> s.decode("iso-8859-7")
   'Ευχή\\ του\\ Ιησού.mp3'

   Yes, that looks the same.

 Most probably, his terminal is set to ISO-8859-7, so that when he issues
 the rename command on the command-line of his shell session, the mv
 command gets a stream of bytes as the new file name which happens to be
 the ISO-8859-7 encoding of the file name he'd like the file to have.
 This is what's stored on disk.

 So, his biggest problem isn't that the operating system is encoding
 agnostic wrt. filenames (i.e., treats them as a stream of bytes), but
 rather that he's using an ISO-7 terminal window when having set up UTF-8
 as his operating system locale and expects filenames to be encoded in
 UTF-8 when he's not passing in UTF-8 byte streams from his client
 computer at all.

 --
 --- Heiko.

ni...@superhost.gr [~/www/data/apps]# ls -l | file -
/dev/stdin: ASCII text


# Compute a set of current fullpaths
fullpaths = set()
path = "/home/nikos/public_html/data/apps/"

for root, dirs, files in os.walk(path):
    for fullpath in files:
        fullpaths.add( os.path.join(root, fullpath) )


[Thu Jun 06 13:34:19 2013] [error] [client 79.103.41.173]     cur.execute('''SELECT url FROM files WHERE url = %s''', fullpath.encode('iso-8859-7') )
[Thu Jun 06 13:34:19 2013] [error] [client 79.103.41.173]   File "/usr/local/lib/python3.3/encodings/iso8859_7.py", line 12, in encode
[Thu Jun 06 13:34:19 2013] [error] [client 79.103.41.173]     return codecs.charmap_encode(input,errors,encoding_table)
[Thu Jun 06 13:34:19 2013] [error] [client 79.103.41.173] UnicodeEncodeError: 'charmap' codec can't encode characters in position 34-37: character maps to <undefined>


[Thu Jun 06 13:27:17 2013] [error] [client 79.103.41.173] Traceback (most recent call last):
[Thu Jun 06 13:27:17 2013] [error] [client 79.103.41.173]   File "files.py", line 73, in <module>
[Thu Jun 06 13:27:17 2013] [error] [client 79.103.41.173]     cur.execute('''SELECT url FROM files WHERE url = %s''', fullpath.decode('iso-8859-7') )
[Thu Jun 06 13:27:17 2013] [error] [client 79.103.41.173] AttributeError: 'str' object has no attribute 'decode'

Same when i encode in latin
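
A hedged sketch of what is probably behind the two tracebacks above. In Python 3 a str has no decode() method at all, hence the AttributeError. And when file names are read as str on a POSIX system, bytes that are not valid in the filesystem encoding are smuggled through as lone surrogate codepoints (the 'surrogateescape' error handler), which no charmap codec can encode; that would explain the UnicodeEncodeError at positions 34-37, right after the 34-character directory prefix. The four bytes below are the start of the file name from the thread.

```python
raw = b'\xc5\xf5\xf7\xde.mp3'   # ISO-8859-7 bytes for 'Ευχή.mp3', not valid UTF-8

# Mimic how a str file name is produced under a UTF-8 locale.
as_str = raw.decode('utf-8', 'surrogateescape')

# str objects have encode() but no decode() in Python 3.
has_decode = hasattr(as_str, 'decode')

# The escaped surrogates cannot be encoded by a charmap codec.
try:
    as_str.encode('iso-8859-7')
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# Reversing the escape recovers the raw bytes, which then decode correctly.
recovered = as_str.encode('utf-8', 'surrogateescape')
greek = recovered.decode('iso-8859-7')
```

Working with bytes paths from the start avoids the detour through escaped surrogates entirely.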
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-06 Thread Heiko Wundram

On 06.06.2013 12:35, Νικόλαος Κούρας wrote:

ni...@superhost.gr [~/www/data/apps]# ls -l | file -
/dev/stdin: ASCII text


Did you actually try to understand what I wrote?

--
--- Heiko.
--
http://mail.python.org/mailman/listinfo/python-list


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-06 Thread Νικόλαος Κούρας
Heiko, the ssh client I used to 'mv' the .mp3 was putty. Do you mean that putty
is responsible for the encoding mess?


the rename command on the command-line of his shell session, the mv 
command gets a stream of bytes as the new file name which happens to be 
the ISO-8859-7 encoding of the file name he'd like the file to have. 
This is what's stored on disk. 




So, his biggest problem isn't that the operating system is encoding 
agnostic wrt. filenames (i.e., treats them as a stream of bytes), but 
rather that he's using an ISO-7 terminal window when having set up UTF-8 
as his operating system locale and expects filenames to be encoded in 
UTF-8 when he's not passing in UTF-8 byte streams from his client 
computer at all. 

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-06 Thread Heiko Wundram

On 06.06.2013 13:00, Νικόλαος Κούρας wrote:

Heiko, the ssh client i used to 'mv' the .mp3 was putty.Do you mean that putty 
is responsible for the encoding mess?


Exactly. Check the encoding that putty uses for the terminal session. If 
it doesn't use UTF-8, switch your terminal session to UTF-8 and try the 
rename again. If it does, try to use another terminal client (I 
recommend the Cygwin-Suite).


--
--- Heiko.
--
http://mail.python.org/mailman/listinfo/python-list


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-06 Thread Νικόλαος Κούρας
On Thursday, 6 June 2013 1:24:16 PM UTC+3, Cameron Simpson wrote:
 On 05Jun2013 11:43, Νίκος Γκρ33κ nikos.gr...@gmail.com wrote:
 | On Wednesday, 5 June 2013 9:32:15 PM UTC+3, MRAB wrote:
 |  Using Python, I think you could get the filenames using os.listdir,
 |  passing the directory name as a bytestring so that it'll return the
 |  names as bytestrings.
 |
 |  Then, for each name, you could decode from its current encoding and
 |  encode to UTF-8 and rename the file, passing the old and new paths to
 |  os.rename as bytestrings.
 |
 | Iam not sure i follow:
 |
 | Change this:
 |
 | # Compute a set of current fullpaths
 | fullpaths = set()
 | path = "/home/nikos/public_html/data/apps/"
 |
 | for root, dirs, files in os.walk(path):
 [...]

 Have a read of this:

   http://docs.python.org/3/library/os.html#os.listdir

 The UNIX API accepts bytes for filenames and paths.

 Python 3 strs are sequences of Unicode code points. If you try to
 open a file or directory on a UNIX system using a Python str, that
 string must be converted to a sequence of bytes before being handed
 to the OS.

 This is done implicitly using your locale settings if you just use a str.

 However, if you pass a bytes to open or listdir, this conversion
 does not take place. You put bytes in and in the case of listdir
 you get bytes out.

 You can work on pathnames in bytes and never concern yourself with
 encode/decode at all.

 In this way you can write code that does not care about the translation
 between Unicode and some arbitrary byte encoding.

 Of course, the issue will still arise when accepting user input;
 your shell has done exactly this kind of thing when you renamed
 your MP3 file. But it is possible to write pure utility code that
 doesn't care about filenames as Unicode or str if you work purely
 in bytes.

 Regarding user filenames, the common policy these days is to use
 utf-8 throughout. Of course you need to get everything into that
 regime to start with.






So I need to use os.listdir() to grab those filenames as bytes. Okay.

So by changing this:

fullpaths = set()
path = "/home/nikos/public_html/data/apps/"

for root, dirs, files in os.walk(path):
    for fullpath in files:
        fullpaths.add( os.path.join(root, fullpath) )

to this:

# Compute a set of current fullpaths
fullpaths = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for fullpath in fullpaths:
    try:
        # Check the presence of a file against the database and insert if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', (fullpath,) )
        data = cur.fetchone()        #URL 

Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-06 Thread Νικόλαος Κούρας
On Thursday, 6 June 2013 2:09:22 PM UTC+3, Heiko Wundram wrote:
 On 06.06.2013 13:00, Νικόλαος Κούρας wrote:
   Heiko, the ssh client i used to 'mv' the .mp3 was putty. Do you mean that
   putty is responsible for the encoding mess?

 Exactly. Check the encoding that putty uses for the terminal session. If
 it doesn't use UTF-8, switch your terminal session to UTF-8 and try the
 rename again. If it does, try to use another terminal client (I
 recommend the Cygwin-Suite).

Okay, indeed it was using the greek-iso encoding; I changed it to utf-8 and
reopened the terminal session.

ni...@superhost.gr [~/www/data/apps]# mv *.mp3 'Ευχή του Ιησού.mp3'
mv: `\305\365\367\336 \364\357\365 \311\347\363\357\375.mp3' and `\305\365\367\336 \364\357\365 \311\347\363\357\375.mp3' are the same file
ni...@superhost.gr [~/www/data/apps]# mv *.mp3 'Ευχή του Ιησού.mp33'
ni...@superhost.gr [~/www/data/apps]# mv *.mp33 'Ευχή του Ιησού.mp3'
ni...@superhost.gr [~/www/data/apps]# ls -l
total 368548
drwxr-xr-x 2 nikos nikos 4096 Jun  6 14:22 ./
drwxr-xr-x 6 nikos nikos 4096 May 26 21:13 ../
-rwxr-xr-x 1 nikos nikos 13157283 Mar 17 12:57 100\ Mythoi\ tou\ Aiswpou.pdf*
-rwxr-xr-x 1 nikos nikos 29524686 Mar 11 18:17 Anekdotologio.exe*
-rw-r--r-- 1 nikos nikos 42413964 Jun  2 20:29 Battleship.exe
-rw-r--r-- 1 nikos nikos   236032 Jun  4 14:10 \323\352\335\370\357\365\ \335\355\341\355\ \341\361\351\350\354\374.exe
-rwxr-xr-x 1 nikos nikos 66896732 Mar 17 13:13 Kosmas\ o\ Aitwlos\ -\ Profiteies.pdf*
-rw-r--r-- 1 nikos nikos 51819750 Jun  2 20:04 Luxor\ Evolved.exe
-rw-r--r-- 1 nikos nikos 60571648 Jun  2 14:59 Monopoly.exe
-rw-r--r-- 1 nikos nikos  3511233 Jun  4 14:11 \305\365\367\336\ \364\357\365\ \311\347\363\357\375.mp3
-rwxr-xr-x 1 nikos nikos  1788164 Mar 14 11:31 Online\ Movie\ Player.zip*
-rw-r--r-- 1 nikos nikos  5277287 Jun  1 18:35 O\ Nomos\ tou\ Merfy\ v1-2-3.zip
-rwxr-xr-x 1 nikos nikos 16383001 Jun 22  2010 Orthodoxo\ Imerologio.exe*
-rw-r--r-- 1 nikos nikos  6084806 Jun  1 18:22 Pac-Man.exe
-rw-r--r-- 1 nikos nikos 25476584 Jun  2 19:50 Scrabble.exe
-rwxr-xr-x 1 nikos nikos 49141166 Mar 17 12:48 To\ 1o\ mou\ vivlio\ gia\ to\ skaki.pdf*
-rwxr-xr-x 1 nikos nikos  3298310 Mar 17 12:45 Vivlos\ gia\ Atheofovous.pdf*
-rw-r--r-- 1 nikos nikos  1764864 May 29 21:50 V-Radio\ v2.4.msi
ni...@superhost.gr [~/www/data/apps]# ls *.mp3 | file -
/dev/stdin: ASCII text
ni...@superhost.gr [~/www/data/apps]#

still same error.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-06 Thread Heiko Wundram

On 06.06.2013 13:24, Νικόλαος Κούρας wrote:

ni...@superhost.gr [~/www/data/apps]# ls *.mp3 | file -
/dev/stdin: ASCII text


Again, did you actually read (and try to understand) what I wrote? I 
said to redo the rename after you change your terminal session to UTF-8.


--
--- Heiko.
--
http://mail.python.org/mailman/listinfo/python-list


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-06 Thread Νικόλαος Κούρας
# Compute a set of current fullpaths
fullpaths = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for fullpath in fullpaths:
    try:
        # Check the presence of a file against the database and insert if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', fullpath.encode('utf-8') )
        data = cur.fetchone()        #URL is unique, so should only be one

        print( fullpath.encode('utf-8') )


Now why does this not print out the filenames when iterated in the for loop?
One step forward is that when I run it like this, no error is displayed in
the error log.

Please help, I have tried os.listdir() as Cameron suggested.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-06 Thread MRAB

On 06/06/2013 04:43, Νικόλαος Κούρας wrote:

On Wednesday, 5 June 2013 9:43:18 PM UTC+3, Νικόλαος Κούρας wrote:
 On Wednesday, 5 June 2013 9:32:15 PM UTC+3, MRAB wrote:
  On 05/06/2013 18:43, Νικόλαος Κούρας wrote:
   On Wednesday, 5 June 2013 8:56:36 UTC+3, Steven D'Aprano wrote:
   Somehow, I don't know how because I didn't see it happen, you have one or
   more files in that directory where the file name as bytes is invalid when
   decoded as UTF-8, but your system is set to use UTF-8. So to fix this you
   need to rename the file using some tool that doesn't care quite so much
   about encodings. Use the bash command line to rename each file in turn
   until the problem goes away.

 ' led to that unknown encoding of this bytestream
 '\305\365\367\336\ \364\357\365\ \311\347\363\357\375.mp3'

   But please tell me Steven what linux tool you think can encode the
   weird filename to proper 'Ευχή του Ιησού.mp3' utf-8?

   or we can write a script as i suggested to decode back the bytestream
   using all sorts of available decode charsets boiling down to the
   original greek letters.

 Actually you were correct, i was typing greek and i saw the filename here in
 google groups as:

   But renaming via shell access like 'mv 'Euxi tou Ihsou.mp3' 'Ευχή του Ιησού.mp3

 so maybe the filenames have to be decoded to greek-iso but then again they
 contain both greek letters but their extensions are in english chars like '.mp3'

  Using Python, I think you could get the filenames using os.listdir,
  passing the directory name as a bytestring so that it'll return the
  names as bytestrings.

  Then, for each name, you could decode from its current encoding and
  encode to UTF-8 and rename the file, passing the old and new paths to
  os.rename as bytestrings.

 Iam not sure i follow:

 Change this:

 # Compute a set of current fullpaths
 fullpaths = set()
 path = "/home/nikos/public_html/data/apps/"

 for root, dirs, files in os.walk(path):
     for fullpath in files:
         fullpaths.add( os.path.join(root, fullpath) )

 to what to make the full url readable by files.py?

MRAB can you please explain in more clarity your idea of solution?
I was suggesting a way to rename the files so that their names are 
encoded in UTF-8 (they appear to be encoded in ISO-8859-7).


You MUST TEST IT thoroughly first, of course, before trying it on the 
actual files.


It could go something like this:


import os

# Give the path as a bytestring so that we'll get the names as bytestrings.
root_folder = b"/home/nikos/public_html/data/apps/"

# Setting TESTING to True will make it print out what renamings it will do,
# but not actually do them.
TESTING = True

# Walk through the files.
for root, dirs, files in os.walk(root_folder):
    for name in files:
        try:
            # Is this name encoded in UTF-8?
            name.decode("utf-8")
        except UnicodeDecodeError:
            # Decoding from UTF-8 failed, which means that the name is not
            # valid UTF-8.

            # It appears (from elsewhere) that the names are encoded in
            # ISO-8859-7, so decode from that and re-encode to UTF-8.
            new_name = name.decode("iso-8859-7").encode("utf-8")

            old_path = os.path.join(root, name)
            new_path = os.path.join(root, new_name)
            if TESTING:
                print("Will rename {!r} to {!r}".format(old_path, new_path))
            else:
                print("Renaming {!r} to {!r}".format(old_path, new_path))
                os.rename(old_path, new_path)
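
The decision the script makes for each name can also be checked in isolation, without touching the file system. The following is only a sketch of that idea: the helper names reencode_name and plan_renames are hypothetical (they are not part of the script above), and ISO-8859-7 as the source encoding is the same assumption the script makes.

```python
def reencode_name(name: bytes, source_encoding: str = "iso-8859-7") -> bytes:
    """Return name as UTF-8 bytes; leave it unchanged if it already is valid UTF-8.

    source_encoding is an assumption about how the bad names were written.
    """
    try:
        name.decode("utf-8")
        return name
    except UnicodeDecodeError:
        return name.decode(source_encoding).encode("utf-8")


def plan_renames(names):
    """Return (old, new) pairs for the names that would actually change."""
    pairs = []
    for name in names:
        new_name = reencode_name(name)
        if new_name != name:
            pairs.append((name, new_name))
    return pairs


# One name written in ISO-8859-7 and one plain ASCII name (already valid UTF-8).
sample = ["Ευχή του Ιησού.mp3".encode("iso-8859-7"), b"readme.txt"]
for old, new in plan_renames(sample):
    print("Would rename {!r} to {!r}".format(old, new))
```

Names that are already valid UTF-8 pass through untouched, so running the conversion twice should be harmless. A name containing a byte that ISO-8859-7 does not define would still raise UnicodeDecodeError in the second decode, which is one more reason to run in test mode first.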



Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-06 Thread Νικόλαος Κούρας
First of all, thank you for helping me, MRAB.
After making some alterations to your code I have this:


# Give the path as a bytestring so that we'll get the filenames as bytestrings
path = b"/home/nikos/public_html/data/apps/"

# Setting TESTING to True will make it print out what renamings it will do,
# but not actually do them
TESTING = True

# Walk through the files.
for root, dirs, files in os.walk( path ):
    for filename in files:
        try:
            # Is this name encoded in UTF-8?
            filename.decode('utf-8')
        except UnicodeDecodeError:
            # Decoding from UTF-8 failed, which means that the name is not
            # valid UTF-8.
            # It appears that the filenames are encoded in ISO-8859-7, so
            # decode from that and re-encode to UTF-8
            new_filename = filename.decode('iso-8859-7').encode('utf-8')

            old_path = os.path.join(root, filename)
            new_path = os.path.join(root, new_filename)
            if TESTING:
                print( '''<br>Will rename {!r} ---> {!r}<br><br>'''.format( old_path, new_path ) )
            else:
                print( '''<br>Renaming {!r} ---> {!r}<br><br>'''.format( old_path, new_path ) )
                os.rename( old_path, new_path )
sys.exit(0)

and the output can be seen here: http://superhost.gr/cgi-bin/files.py

We are in test mode, so I don't know what the encodings will be when the 
renaming actually takes place.

Shall I switch off test mode and try it for real?


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-06 Thread Steven D'Aprano
On Tue, 04 Jun 2013 02:00:43 -0700, Νικόλαος Κούρας wrote:

 On Tuesday, 4 June 2013 11:47:01 AM UTC+3, Steven D'Aprano wrote:
 
 Please run these commands, and show what result they give:
[...]
 ni...@superhost.gr [~/www/data/apps]# alias ls 
 alias ls='/bin/ls $LS_OPTIONS'

And what does 

echo $LS_OPTIONS


give?

[...]
 Seems that the way the system used to actually rename the file matters.

Yes. This is where you get interactions between different systems that 
use different encodings, and they don't work well together.

Some day, everything will use UTF-8, and these problems will go away.


 If all else fails, you could just rename the troublesome file and
 hopefully the problem will go away:
 mv *Ο.mp3 1.mp3
 mv 1.mp3 'Eυχή του Ιησού.mp3'
 
 Yes, but why are you doing it in 2 steps and not as:
 
 mv *Ο.mp3 'Eυχή του Ιησού.mp3'

I don't remember. I had a reason that made sense at the time, but I can't 
remember what it was.


I think I can reproduce your problem. If I open a terminal, set to use 
UTF-8, I can do this:

[steve@ando ~]$ cd /tmp
[steve@ando tmp]$ touch '999-Eυχή-του-Ιησού'
[steve@ando tmp]$ ls 999*
999-Eυχή-του-Ιησού


Now if I change the terminal to use Greek ISO-8859-7, and hit UP-ARROW to 
grab the previous command line from history, the *displayed* file name 
changes, but the actual file being touched remains the same:

[steve@ando tmp]$ touch '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ'
[steve@ando tmp]$ ls 999*
999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ


In Python 3.3, I can demonstrate the same sort of thing:

py s = '999-Eυχή-του-Ιησού'
py bytes_as_utf8 = s.encode('utf-8')
py t = bytes_as_utf8.decode('iso-8859-7', errors='replace')
py print(t)
999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ


So that demonstrates part of your problem: even though your Linux system 
is using UTF-8, your terminal is probably set to ISO-8859-7. The 
interaction between these will lead to strange and disturbing Unicode 
errors.
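
The mismatch can be looked at from both directions in Python 3. This is only an illustrative sketch: the sample name is the one from this thread, and the helper is_valid_utf8 is a hypothetical name introduced here. The same text produces different bytes under each encoding, and reading either set of bytes with the other codec gives garbage.

```python
name = "Ευχή του Ιησού.mp3"

# The same text, stored two different ways.
utf8_bytes = name.encode("utf-8")
greek_bytes = name.encode("iso-8859-7")


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# UTF-8 bytes read as ISO-8859-7: mojibake, as in the terminal session above.
mojibake = utf8_bytes.decode("iso-8859-7", errors="replace")
print(mojibake)

# ISO-8859-7 bytes read as UTF-8: undecodable bytes become replacement characters.
print(greek_bytes.decode("utf-8", errors="replace"))

print(is_valid_utf8(utf8_bytes), is_valid_utf8(greek_bytes))
```

The bytes on disk never change in either case; only the codec used to interpret them does.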


To continue, back in the terminal set to ISO-8859-7, if instead of using 
the history line I re-copy and paste the file name:

[steve@ando tmp]$ touch '999-Eυχή-του-Ιησού'
[steve@ando tmp]$ ls 999*
999-E???-???-?  999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ


So now I end up with two files, one with a file name that is utter 
garbage bytes, and one that is only a little better, being mojibake.

Resetting the terminal to use UTF-8 at least now restores the *display* 
of the earlier file's name:

[steve@ando tmp]$ ls 999*
999-E???-???-?  999-Eυχή-του-Ιησού
[steve@ando tmp]$ ls -b 999*
999-E\365\367\336-\364\357\365-\311\347\363\357\375  999-Eυχή-του-Ιησού

but the other file name is still made of garbage bytes.
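
Those "garbage" bytes are still meaningful if read with the right codec. As a small sketch, assuming the name really was written in ISO-8859-7 (which is what the terminal setting above implies), the octal escapes printed by ls -b can be pasted into a Python bytes literal and decoded:

```python
# The escaped name exactly as printed by `ls -b` above; \365 etc. are octal bytes.
raw = b"999-E\365\367\336-\364\357\365-\311\347\363\357\375"

# Assumption: the bytes were produced by a terminal set to ISO-8859-7.
recovered = raw.decode("iso-8859-7")
print(recovered)

# The same bytes are not valid UTF-8, which is why a UTF-8 terminal shows "?".
try:
    raw.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
print(decodes_as_utf8)

# The UTF-8 form of the recovered text is what the file should be renamed to.
target = recovered.encode("utf-8")
```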


So I believe I understand how your file name has become garbage. To fix 
it, make sure that your terminal is set to use UTF-8, and then rename it. 
Do the same with every file in the directory until the problem goes away.

(If one file has garbage bytes in the file name, chances are that more 
than one do.)


-- 
Steven


Re: Changing filenames from Greeklish = Greek (subprocess complain)

2013-06-06 Thread MRAB

On 06/06/2013 13:04, Νικόλαος Κούρας wrote:

First of all, thank you for helping me, MRAB.
After making some alterations to your code I have this:


# Give the path as a bytestring so that we'll get the filenames as bytestrings
path = b"/home/nikos/public_html/data/apps/"

# Setting TESTING to True will make it print out what renamings it will do,
# but not actually do them
TESTING = True

# Walk through the files.
for root, dirs, files in os.walk( path ):
    for filename in files:
        try:
            # Is this name encoded in UTF-8?
            filename.decode('utf-8')
        except UnicodeDecodeError:
            # Decoding from UTF-8 failed, which means that the name is not
            # valid UTF-8.
            # It appears that the filenames are encoded in ISO-8859-7, so
            # decode from that and re-encode to UTF-8
            new_filename = filename.decode('iso-8859-7').encode('utf-8')

            old_path = os.path.join(root, filename)
            new_path = os.path.join(root, new_filename)
            if TESTING:
                print( '''<br>Will rename {!r} ---> {!r}<br><br>'''.format( old_path, new_path ) )
            else:
                print( '''<br>Renaming {!r} ---> {!r}<br><br>'''.format( old_path, new_path ) )
                os.rename( old_path, new_path )
sys.exit(0)

and the output can be seen here: http://superhost.gr/cgi-bin/files.py

We are in test mode, so I don't know what the encodings will be when the 
renaming actually takes place.

Shall I switch off test mode and try it for real?


The first one is '/home/nikos/public_html/data/apps/Ευχή του Ιησού.mp3'.

The second one is '/home/nikos/public_html/data/apps/Σκέψου έναν 
αριθμό.exe'.


These names are currently encoded in ISO-8859-7, but will be encoded in
UTF-8 if you turn off test mode.

If you're happy for that change to happen, then go ahead.

