subject:"Changing filenames from Greeklish = Greek \(subprocess complain\)"

On Monday, 10 June 2013 2:59:03 PM UTC+3, Steven D'Aprano wrote:
> On Mon, 10 Jun 2013 00:10:38 -0700, nagia.retsina wrote:
>
> > On Sunday, 9 June 2013 3:31:44 PM UTC+3, Steven D'Aprano wrote:
> >
> >> py> c = 'α'
> >> py> ord(c)
> >> 945
> >
> > The number 945 is the characters 'α' ordinal value in the unicode
> > charset correct?
>
> Correct.
>
> > The command in the python interactive session to show me how many bytes
> > this character will take upon encoding to utf-8 is:
> >
> > >>> s = 'α'
> > >>> s.encode('utf-8')
> > b'\xce\xb1'
> >
> > I see that the encoding of this char takes 2 bytes. But why two exactly?
>
> Because that's how UTF-8 works. If it was a different encoding, it might
> be 4 bytes, or 2, or 1, or 101, or 7, or 3. But it is UTF-8, so it takes
> 2 bytes. If you want to understand how UTF-8 works, look it up on
> Wikipedia.
>
> > How do i calculate how many bits are needed to store this char into
> > bytes?
>
> Every byte is made of 8 bits. There are two bytes. So multiply 8 by 2.
>
> > Trying to do the same here but it gave me no bytes back.
> >
> > >>> s = 'a'
> > >>> s.encode('utf-8')
> > b'a'
>
> There is a byte there. The byte is printed by Python as b'a', which in my
> opinion is a design mistake. That makes it look like a string, but it is
> not a string, and would be better printed as b'\x61'. But regardless of
> the display, it is still a single byte.


Perhaps, for values up to 127 (ASCII), Python thinks it is better for a human to read the
character representation of the stored byte instead of its hex value. Just a guess.

> Just like 0o1234 uses octal, "o" for Octal.
> And 0x123EF uses hexadecimal. "x" for heXadecimal.

Why the leading zero before octal's 'o', hex's 'x' and binary's 'b'?


I am not going to tire you any more, because I have exhausted myself for two days now
trying to get my head around this.

Please confirm I have understood correctly:

I did, but the docs confuse me even more. Can you please put it simply?

Unicode, as I understand it, was created out of the need for a bigger character set
than ASCII, which holds 128 chars (and extended versions of it
up to 256), one able to hold all the world's symbols.

ASCII and Unicode are character sets.

Everything else seems to be an encoding system that works upon those character
sets.

If what I said is true, the last thing that still confuses me is that

iso-8859-7 (256 chars) seems like a character set and an encoding method too.
Can it be both, or is iso-8859-7 an encoding method of the Unicode character set,
similar to how UTF-8 is also an encoding method of Unicode?
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-10 Thread Steven D'Aprano

On Mon, 10 Jun 2013 00:10:38 -0700, nagia.retsina wrote:

> On Sunday, 9 June 2013 3:31:44 PM UTC+3, Steven D'Aprano wrote:
> 
>> py> c = 'α'
>> py> ord(c)
>> 945
> 
> The number 945 is the characters 'α' ordinal value in the unicode
> charset correct?

Correct.

> The command in the python interactive session to show me how many bytes
> this character will take upon encoding to utf-8 is:
> 
> >>> s = 'α'
> >>> s.encode('utf-8')
> b'\xce\xb1'
> 
> I see that the encoding of this char takes 2 bytes. But why two exactly?

Because that's how UTF-8 works. If it was a different encoding, it might 
be 4 bytes, or 2, or 1, or 101, or 7, or 3. But it is UTF-8, so it takes 
2 bytes. If you want to understand how UTF-8 works, look it up on 
Wikipedia. 

> How do i calculate how many bits are needed to store this char into
> bytes?

Every byte is made of 8 bits. There are two bytes. So multiply 8 by 2.

> Trying to to the same here but it gave me no bytes back.
> 
> >>> s = 'a'
> >>> s.encode('utf-8')
> b'a'

There is a byte there. The byte is printed by Python as b'a', which in my 
opinion is a design mistake. That makes it look like a string, but it is 
not a string, and would be better printed as b'\x61'. But regardless of 
the display, it is still a single byte.
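
A minimal sketch (Python 3 assumed; the variable names are only illustrative) of how one might check that b'a' really holds exactly one byte, whatever the repr looks like:

```python
# Encode an ASCII letter and a Greek letter to UTF-8.
ascii_bytes = 'a'.encode('utf-8')
greek_bytes = 'α'.encode('utf-8')

# len() on a bytes object counts bytes, not characters.
print(len(ascii_bytes), len(greek_bytes))   # 1 2

# list() exposes the integer value of every byte.
print(list(ascii_bytes))    # [97]
print(list(greek_bytes))    # [206, 177]

# b'a' and b'\x61' are two spellings of the same one-byte object.
same_object = (ascii_bytes == b'\x61')
print(same_object)          # True

# Bits needed for storage: 8 per byte.
bits_needed = len(greek_bytes) * 8
print(bits_needed)          # 16
```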

>> py> c.encode('utf-8')
>> b'\xce\xb1'
> 
> 2 bytes here. why 2?

Because that's how UTF-8 works.

>> py> c.encode('utf-16be')
>> b'\x03\xb1'
> 
> 2 bytes here also. but why 3 different bytes? 

Because it is a different encoding.

> the ordinal value of char 'a' is the same in unicode.

The same as what?

> the encoding system just takes the ordinal value and encodes it, but 
> since it uses 2 bytes should these 2 bytes be the same?

No.

That's like saying that since a dog in Germany has four legs and one 
head, and a dog in France has four legs and one head, dog should be 
spelled "Hund" in both Germany and France.

Different encodings are like different languages. They spell the same 
word differently.
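
To make the analogy concrete, here is a small sketch (Python 3.8+ assumed for bytes.hex with a separator; the `encoded` dictionary is just an illustrative name) that "spells" the same code point in four encodings:

```python
c = 'α'  # GREEK SMALL LETTER ALPHA, code point 945 (0x3B1)

# One code point, four different byte "spellings".
encoded = {name: c.encode(name)
           for name in ('utf-8', 'utf-16be', 'utf-32be', 'iso-8859-7')}

for name, raw in encoded.items():
    print(f'{name:<11} {raw.hex(" ")}  ({len(raw)} bytes)')

# Decoding with the matching codec always recovers the same character.
round_trips = all(raw.decode(name) == c for name, raw in encoded.items())
print(round_trips)
```

The bytes differ because each codec has its own rules, but every one of them decodes back to the same code point as long as the matching codec is used.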

>> py> c.encode('utf-32be')
>> b'\x00\x00\x03\xb1'
> 
> every char here takes exactly 4 bytes to be stored. okey.
> 
>> py> c.encode('iso-8859-7')
>> b'\xe1'
> 
> And also does '\x' mean that the value is being represented in hex?
> and when i bin(65) i see '0b1000001'
> 
> I should expect to see 8 bits of 1s and 0's. what is the 'b' trying to
> say?

"b" for Binary.

Just like 0o1234 uses octal, "o" for Octal.

And 0x123EF uses hexadecimal. "x" for heXadecimal.
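
A short sketch of the three prefixes side by side (Python 3 assumed; `n` is just an example value). The leading "0" is what lets the parser treat the token as a number rather than a name, and the letter after it names the base:

```python
n = 65

# The same integer rendered with each prefix.
binary_text, octal_text, hex_text = bin(n), oct(n), hex(n)
print(binary_text, octal_text, hex_text)    # 0b1000001 0o101 0x41

# All of these literals denote the same value.
literals_agree = (0b1000001 == 0o101 == 0x41 == 65)

# int() with base 0 reads the prefix to pick the base.
parsed = [int(text, 0) for text in (binary_text, octal_text, hex_text)]
print(literals_agree, parsed)               # True [65, 65, 65]
```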

-- 
Steven

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-10 Thread Andreas Perstinger

On 10.06.2013 11:59, Νικόλαος Κούρας wrote:

>> >>> s = 'α'
>> >>> s.encode('utf-8')
>> b'\xce\xb1'
>
> 'b' stands for binary right?

No, here it stands for bytes:
http://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals

> b'\xce\xb1' = we are looking at a byte in a hexadecimal format?

No, b'\xce\xb1' represents a bytes object containing 2 bytes.
Yes, each byte is represented in hexadecimal format.

> if yes how could we see it in binary and decimal representation?

>>> s = b'\xce\xb1'
>>> s[0]
206
>>> bin(s[0])
'0b11001110'
>>> s[1]
177
>>> bin(s[1])
'0b10110001'

A bytes object is a sequence of bytes (= integer values) and supports 
indexing.

http://docs.python.org/3/library/stdtypes.html#bytes

> Since 2^8 = 256, utf-8 should store the first 256 chars of unicode
> charset using 1 byte.
>
> Also Since 2^16 = 65536, utf-8 should store the first 65536 chars of
> unicode charset using 2 bytes and so on.
>
> But i know that this is not the case. But i dont understand why.

Because your method doesn't work.
If you use all possible 256 bit-combinations to represent a valid 
character, how do you decide where to stop in a sequence of bytes?

>> >>> s = 'a'
>> >>> s.encode('utf-8')
>> b'a'
>> utf-8 takes ASCII as it is, as 1 byte. They are the same
>
> EBCDIC and ASCII and Unicode are character sets, correct?
>
> iso-8859-1, iso-8859-7, utf-8, utf-16, utf-32 and so on are encoding methods, 
> right?

Look at http://www.unicode.org/glossary/ for an explanation of all the 
terms.

Bye, Andreas

Re: Changing filenames from Greeklish => Greek (subprocess complain)

>  s = 'α' 
>  s.encode('utf-8') 
> > b'\xce\xb1' 

'b' stands for binary right? 
 b'\xce\xb1' = we are looking at a byte in a hexadecimal format? 
if yes how could we see it in binary and decimal representation? 
  
> > I see that the encoding of this char takes 2 bytes. But why two exactly? 
> > How do i calculate how many bits are needed to store this char into bytes? 
  
> Because utf-8 takes 1 to 4 bytes to encode characters 

Since 2^8 = 256, utf-8 should store the first 256 chars of unicode charset 
using 1 byte. 

Also Since 2^16 = 65536, utf-8 should store the first 65536 chars of unicode 
charset using 2 bytes and so on. 

But i know that this is not the case. 
But i dont understand why. 


>  s = 'a' 
>  s.encode('utf-8') 
> > b'a' 
> utf-8 takes ASCII as it is, as 1 byte. They are the same 

EBCDIC and ASCII and Unicode are character sets, correct? 

iso-8859-1, iso-8859-7, utf-8, utf-16, utf-32 and so on are encoding methods, 
right?

Re: Changing filenames from Greeklish => Greek (subprocess complain)

On Monday, 10 June 2013 11:15:38 AM UTC+3, Andreas Perstinger wrote:

What is the difference between len('nikos') and len(b'nikos')?
The first being the length of the string nikos in characters, while the second is 
the length of an ???


> The python interpreter will represent all byte values below 128 as ASCII 
> characters if they are printable:

>  >>> ord(b'a')
> 97
>  >>> hex(97)
> '0x61'
>  >>> b'\x61' == b'a'
> True
> The Python designers have decided to use b'a' instead of b'\x61'.

b'a' and b'\x61' are the bytestrings of char 'a' after utf-8 encoding?

ord(b'a') should give an error in my opinion:

ord('a') should return the ordinal value of char 'a', not ord(b'a')

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-10 Thread Andreas Perstinger

On 10.06.2013 09:10, nagia.rets...@gmail.com wrote:

On Sunday, 9 June 2013 3:31:44 PM UTC+3, Steven D'Aprano wrote:

py> c = 'α'
py> ord(c)
945

The number 945 is the characters 'α' ordinal value in the unicode charset 
correct?

Yes, the Unicode character set is just a big list of characters. The 
entry at position 945 in that list (counting from 0) happens to be 'α'.

The command in the python interactive session to show me how many bytes
this character will take upon encoding to utf-8 is:

s = 'α'
s.encode('utf-8')

b'\xce\xb1'

I see that the encoding of this char takes 2 bytes. But why two exactly?

That's how the encoding is designed. Haven't you read the wikipedia 
article which was already mentioned several times?

How do i calculate how many bits are needed to store this char into bytes?

You need to understand how UTF-8 works. Read the wikipedia article.

Trying to to the same here but it gave me no bytes back.

s = 'a'
s.encode('utf-8')

b'a'

The encode method returns a bytes object. Its length will tell you how 
many bytes there are:

>>> len(b'a')
1
>>> len(b'\xce\xb1')
2

The Python interpreter will represent all byte values below 128 as ASCII 
characters if they are printable:

>>> ord(b'a')
97
>>> hex(97)
'0x61'
>>> b'\x61' == b'a'
True

The Python designers have decided to use b'a' instead of b'\x61'.

py> c.encode('utf-8')
b'\xce\xb1'

2 bytes here. why 2?

Same as your first question.

py> c.encode('utf-16be')
b'\x03\xb1'

2 bytes here also. but why 3 different bytes? the ordinal value of
char 'a' is the same in unicode. the encoding system just takes the
ordinal value and encodes it, but since it uses 2 bytes should these 2 bytes
be the same?

'utf-16be' is a different encoding scheme, thus it uses other rules to 
determine how each character is translated into a byte sequence.

py> c.encode('iso-8859-7')
b'\xe1'

And also does '\x' mean that the value is being represented in hex?
and when i bin(65) i see '0b1000001'

I should expect to see 8 bits of 1s and 0's. what is the 'b' trying to say?

'\x' is an escape sequence and means that the following two characters 
should be interpreted as a number in hexadecimal notation (see also the 
table of allowed escape sequences: 
http://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals 
).

'0b' tells you that the number is printed in binary notation.
Leading zeros are usually discarded when a number is printed:
>>> bin(70)
'0b1000110'
>>> 0b100110 == 0b00100110
True

It's the same with decimal notation. You wouldn't say 00123 is different 
from 123, would you?
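
If you do want to see all eight bits of a byte, the leading zeros can be padded back in when formatting. A small sketch, assuming Python 3 (the names `value`, `plain`, `padded` and `prefixed` are only for illustration):

```python
value = 0xE1  # 225, the ISO-8859-7 byte for 'α'

# bin() drops leading zeros, exactly like decimal notation does.
plain = bin(6)
print(plain)                    # 0b110

# The '08b' format spec pads to one full byte (8 binary digits).
padded = format(6, '08b')
print(padded)                   # 00000110
print(format(value, '08b'))     # 11100001

# With '#', the '0b' prefix is included and counts toward the width.
prefixed = f'{value:#010b}'
print(prefixed)                 # 0b11100001
```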

Bye, Andreas

Re: Changing filenames from Greeklish => Greek (subprocess complain)

On Monday, 10 June 2013 10:51:34 AM UTC+3, Larry Hudson wrote:

> > I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant 
> > up to 256, not above 256.

> 0 - 127, yes.
> 128 - 255 -> one byte of a multibyte code.

you mean that in utf-8 for 1 character to be stored, we need 2 bytes?
I am still having trouble understanding this.

Since 2^8 = 256, utf-8 would need 1 byte to store the 1st 256 characters but 
instead it's using 1 byte up to the first 127 value and then 2 bytes for 
anything above.  Why?

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-10 Thread Larry Hudson

On 06/09/2013 03:37 AM, Νικόλαος Κούρας wrote:

I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to
256, not above 256.

NO!!

0 - 127, yes.
128 - 255 -> one byte of a multibyte code.

That's why the decode fails, it sees it as incomplete data so it can't do
anything with it.

A surrogate pair is like hitting for example Ctrl-A, which means it is a
combination character that consists of 2 different characters?
Is this what a surrogate is? a pair of 2 chars?

You're confusing character encodings with the way NON-CHARACTER keys on the KEYBOARD are encoded
(function keys, arrow keys and such). These are NOT text characters but KEYBOARD key codes.
These are NOT text codes and are entirely different and not related to any character encoding.
How programs interpret and use these codes depends entirely on the individual programs. There
are common conventions on how many are used, but there are no standards.

Also the control-codes are the first 32 values of the ASCII (and ASCII-compatible) character set
and are not multi-character key codes like the keyboard non-character keys.

However, there are a few keyboard keys that actually produce control-codes. A
few examples:

Return/Enter -> Ctrl-M
Tab -> Ctrl-I
Backspace -> Ctrl-H

So character 'A' <-> 65 (in decimal uses in charset's table) <-> 01011100 (as binary
stored in disk) <-> 0xEF (as hex, when we open the file with a hex editor)

You are trying to put too much meaning to this. The value stored on disk, in memory, or
whatever is binary bits, nothing else. How you describe the value, in decimal, in octal, in
hex, in base-12, or... is totally irrelevant. These are simply different ways of describing or
naming these numeric values.

It's the same as saying 3 in English is three, in Spanish is tres, in German is drei... (I
don't know Greek, sorry.) No matter what you call it, it is still the numeric integer value
that is between 2 and 4.


Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-10 Thread nagia . retsina

On Sunday, 9 June 2013 3:31:44 PM UTC+3, Steven D'Aprano wrote:

> py> c = 'α'
> py> ord(c)
> 945

The number 945 is the characters 'α' ordinal value in the unicode charset 
correct?

The command in the python interactive session to show me how many bytes
this character will take upon encoding to utf-8 is:

>>> s = 'α'
>>> s.encode('utf-8')
b'\xce\xb1'

I see that the encoding of this char takes 2 bytes. But why two exactly?
How do i calculate how many bits are needed to store this char into bytes?


Trying to to the same here but it gave me no bytes back.

>>> s = 'a'
>>> s.encode('utf-8')
b'a'


>py> c.encode('utf-8')
> b'\xce\xb1'

2 bytes here. why 2?

> py> c.encode('utf-16be')
> b'\x03\xb1'

2 bytes here also. but why 3 different bytes? the ordinal value of char 'a' is 
the same in unicode. the encoding system just takes the ordinal value and 
encodes it, but since it uses 2 bytes should these 2 bytes be the same?

> py> c.encode('utf-32be')
> b'\x00\x00\x03\xb1'

every char here takes exactly 4 bytes to be stored. okay.

> py> c.encode('iso-8859-7')
> b'\xe1'

And also does '\x' mean that the value is being represented in hex?
and when i bin(65) i see '0b1000001'

I should expect to see 8 bits of 1s and 0's. what is the 'b' trying to say?

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-09 Thread Benjamin Kaplan

On Sun, Jun 9, 2013 at 2:20 AM, Νικόλαος Κούρας  wrote:
> On Sunday, 9 June 2013 12:12:36 PM UTC+3, Cameron Simpson wrote:
>> On 09Jun2013 02:00, Νίκος Γκρ33κ wrote:
>> | Steven wrote:
>> | >> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
>> | >> values up to 256?
>> |
>> | >Because then how do you tell when you need one byte, and when you need
>> | >two? If you read two bytes, and see 0x4C 0xFA, does that mean two
>> | >characters, with ordinal values 0x4C and 0xFA, or one character with
>> | >ordinal value 0x4CFA?
>> |
>> | I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant
>> | up to 256, not above 256.
>>
>> Then it would not be UTF-8. UTF-8 will encode any Unicode code point. Your
>> suggestion will not.
>
> I dont follow.
>

The point of the UTF formats is that they can encode any of the 1.1
million code points available in Unicode. Your suggestion can only
encode 256 code points. We have that encoding already: it's called
Latin-1, and it can't encode any of your Greek characters (which is why
ISO-8859-7 exists; it can encode the Greek characters but not the
accented Latin ones).

If you were to use the whole byte to store the first 256 characters,
you wouldn't be able to store anything beyond them, because the
computer wouldn't be able to tell the difference between character 257
(0x01 0x01) and two chr(1)s. UTF-8 gets around this by reserving the
top bit as an "am I part of a multibyte sequence" flag.

>> | >> UTF-8 and UTF-16 and UTF-32
>> | >> I though the number beside of UTF- was to declare how many bits the
>> | >> character set was using to store a character into the hdd, no?
>> |
>> | >Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
>> | >UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
>> | >values to make a surrogate pair.
>> |
>> | A surrogate pair is like hitting for example Ctrl-A, which means it is a
>> | combination character that consists of 2 different characters?
>> | Is this what a surrogate is? a pair of 2 chars?
>>
>> Essentially. The combination represents a code point.
>>
>> | >UTF-8 uses 8-bit values, but sometimes
>> | >it combines two, three or four of them to represent a single code-point.
>> |
>> | 'A' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)
>> | 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 )
>> | 'a chinese ideogram' to be utf8 encoded needs 4 bytes to be stored ?
>> | (since ordinal >  65000 )
>> |
>> | The amount of bytes needed to store a character solely depends on the
>> | character's ordinal value in the Unicode table?
>>
>> Essentially. You can read up on the exact process in Wikipedia or the
>> Unicode Standard.
>
> When you say essentially means you agree with my statements?

In UTF-8 or UTF-16, the number of bytes required for the character is
dependent on its code point, yes. That isn't the case for UTF-32,
where every character uses exactly four bytes.
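
A quick sketch of that difference, assuming Python 3 (the sample characters and the `byte_lengths` helper are only illustrative choices, not anything from the thread):

```python
def byte_lengths(ch):
    """Return how many bytes one character needs in each UTF format."""
    return {enc: len(ch.encode(enc))
            for enc in ('utf-8', 'utf-16be', 'utf-32be')}

# Code points 65, 945, 8485 and 128512 respectively.
samples = ['A', 'α', '℥', '😀']

for ch in samples:
    print(f'U+{ord(ch):04X}', byte_lengths(ch))
```

UTF-8 grows from 1 to 4 bytes as the code point gets larger, UTF-16 jumps from 2 to 4 bytes once the code point passes U+FFFF, and UTF-32 stays at 4 throughout.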

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-09 Thread Benjamin Kaplan

On Sun, Jun 9, 2013 at 2:38 AM, Νικόλαος Κούρας  wrote:
> On Sunday, 9 June 2013 12:20:58 PM UTC+3, Lele Gaifax wrote:
>
>> > How about a string i wonder?
>> > s = "νίκος"
>> > what_are these_bytes = s.encode('iso-8869-7').encode(utf-8')
>
>> Ignoring the usual syntax error, this is just a variant of the code I
>> posted: "s.encode('iso-8869-7')" produces a bytes instance which
>> *cannot* be "re-encoded" again in whatever encoding.
>
> s = 'a'
> s = s.encode('iso-8859-7').decode('utf-8')
> print( s )
>
> a (we got the original character back)
> 
> s = 'α'
> s = s.encode('iso-8859-7').decode('utf-8')
> print( s )
>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 0: 
> unexpected end of data
>
> Why this error? because 'a' ordinal value > 127 ?
> --

No. You get that error because the bytes are not encoded in UTF-8.
They're encoded in ISO-8859-7. For ASCII characters (ord(x) < 128),
ISO-8859-7 and UTF-8 look exactly the same. For anything else, they
are different. If you were to try to decode it as ISO-8859-1, it would
succeed, but you would get the character "á" back instead of α.

You're misunderstanding the decode function. Decode doesn't turn it
into a string with the specified encoding. It takes bytes *in* the
specified encoding and turns them into Python's internal
string representation. In Python 3.3, that representation doesn't even have
a name because it's not a standard encoding. So you want the decode
argument to match the encode argument.
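
The three possible outcomes can be sketched like this (Python 3 assumed; `raw` and `refused` are just illustrative names):

```python
raw = 'α'.encode('iso-8859-7')      # b'\xe1'

# Matching codec: the round trip is clean.
matching = raw.decode('iso-8859-7')
print(matching)                      # α

# Wrong codec that happens to accept the byte: a silent wrong answer.
wrong_but_accepted = raw.decode('iso-8859-1')
print(wrong_but_accepted)            # á

# Wrong codec that rejects the byte: an exception.
refused = False
try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    refused = True
    print('utf-8 refused it:', exc.reason)
```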

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-09 Thread nagia . retsina

On Sunday, 9 June 2013 3:36:51 PM UTC+3, Steven D'Aprano wrote:

> > printing a greek Unicode string in the error with ASCII 
> > as the output encoding (default when not a tty IIRC).

> An interesting thought. How would we test that?

Please elaborate on this for me. I didn't understand what you are trying to say, that is, your 
assumption about why I am still getting decode issues.

Re: Changing filenames from Greeklish => Greek (subprocess complain)

On Sun, 09 Jun 2013 02:38:13 -0700, Νικόλαος Κούρας wrote:

> s = 'α'
> s = s.encode('iso-8859-7').decode('utf-8')
> 
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 0:
> unexpected end of data
> 
> Why this error? because 'a' ordinal value > 127 ?

Look at it this way... consider encoding and decoding to be like 
translating from one language to another.

Suppose you start with the English word "street". You encode it to German 
by looking it up in an English-To-German dictionary:

street -> Straße

Then you decode the German by looking "Straße" up in a German-To-English 
dictionary:

Straße -> street

and everything is good. But suppose that after encoding the English to 
German, you get confused, and think that it is Italian, not German. So 
when it comes to decoding, you try to look up 'Straße' in an 
Italian-To-English dictionary, and discover that there is no such thing as letter ß 
in Italian. So you cannot look the word up, and you get frustrated and 
shout "this is rubbish, there's no such thing as ß, that's not a letter!"

Not in Italian, but it is a perfectly good letter in German. But you're 
looking it up in the wrong dictionary.

Same thing with UTF-8. You encoded the string 'α' by looking it up in the 
"Unicode To ISO-8859-7 bytes" dictionary. Then you try to decode it by 
looking for those bytes in the "UTF-8 bytes To Unicode" dictionary. But 
you can't find byte 0xe1 on its own in UTF-8 bytes, so Python shouts 
"this is rubbish, there's no such thing as 0xe1 on its own in UTF-8!" and 
raises UnicodeDecodeError.

Sometimes you don't get an exception. Suppose that you are encoding from 
French to German:

qui -> die  (both words mean "who" in English)

Now if you get confused, and decode the word 'die' by looking it up in an 
English-To-French dictionary, instead of German-To-French, you get:

die -> mourir

So instead of getting 'qui' back again, you get 'mourir'. This is like 
mojibake: the results are garbage, but there is no exception raised to 
warn you.
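
Here is what that silent failure looks like in practice. This is only a sketch, assuming Python 3, and it uses ISO-8859-1 as the "wrong dictionary" because that codec accepts every possible byte value:

```python
original = 'α'
raw = original.encode('utf-8')          # b'\xce\xb1'

# Decode with the wrong "dictionary": no exception, just garbage.
garbled = raw.decode('iso-8859-1')
print(garbled)                           # Î±

# The damage is reversible only if you know which wrong codec was used.
repaired = garbled.encode('iso-8859-1').decode('utf-8')
print(repaired == original)              # True
```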

-- 
Steven

Re: Changing filenames from Greeklish => Greek (subprocess complain)

On Sun, 09 Jun 2013 19:16:06 +1000, Cameron Simpson wrote:


> If he's lucky the UnicodeEncodeError occurred while trying to print an
> error message, 

That's not what happens at the interactive console:

py> assert os.path.exists('Ж1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError


> printing a greek Unicode string in the error with ASCII
> as the output encoding (default when not a tty IIRC).


An interesting thought. How would we test that?



-- 
Steven

Re: Changing filenames from Greeklish => Greek (subprocess complain)

On Sun, 09 Jun 2013 02:00:46 -0700, Νικόλαος Κούρας wrote:

> Steven wrote:
>>> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
>>> values up to 256?
> 
>>Because then how do you tell when you need one byte, and when you need
>>two? If you read two bytes, and see 0x4C 0xFA, does that mean two
>>characters, with ordinal values 0x4C and 0xFA, or one character with
>>ordinal value 0x4CFA?
> 
> I mean utf-8 could use 1 byte for storing the 1st 256 characters. I
> meant up to 256, not above 256.

Think about it. Draw up a big table of one million plus characters:

Ordinal   Character

0         NUL control code
1         SOH control code
...
84        LATIN CAPITAL LETTER T
85        LATIN CAPITAL LETTER U
...
255       LATIN SMALL LETTER Y WITH DIAERESIS
256       LATIN CAPITAL LETTER A WITH MACRON
...
8485      OUNCE SIGN

and so on, all the way to 1114111. Now, suppose you read a file, and see 
two bytes, shown in decimal: 84 followed by 85, or in hexadecimal, 0x54 
followed by 0x55.

How do you tell whether that means two characters, T followed by U, or a 
single character, ℥ (OUNCE SIGN)?

With UTF-32, you can, because every value takes exactly the same space. 
So a T followed by a U is:

0x00000054
0x00000055

while a single ℥ is:

0x00002125

and it is easy to tell them apart: each block of 4 bytes is exactly one 
character. But notice how many NUL bytes there are? In the three 
characters shown, there are eight NUL bytes. Most text will be filled 
with NUL bytes, which is very wasteful.

UTF-8 is designed to be compact, and also to be backwards-compatible with 
ASCII. Characters which are in ASCII will be a single byte, so there are 
no null bytes used for padding, (except for NUL itself, of course). So 
the three characters TU℥ will be:

0x54
0x55
0xE2
0x84
0xA5

Five bytes in total, instead of 12 for UTF-32. But the only tricky part 
is that character with ordinal value 0xE2 (decimal 226, â) cannot be 
encoded as the single byte 0xE2, otherwise you would mistake the three 
bytes 0xE284A5 as starting with 'â' followed by two more characters. And 
indeed, 'â' is encoded as two bytes:

0xC3
0xA2

Likewise, character with ordinal value 0xC3 (decimal 195, Ã) is also 
encoded as two bytes:

0xC3
0x83

And so on. This way, there is never any confusion as to whether (say) 
three bytes are three one-byte characters, or one three-byte character.
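
For the curious, the bit layout can be sketched by hand. This is an illustrative toy, not how Python implements its codec: `utf8_encode` is a made-up helper, and it does no validation (it does not reject surrogates or values above 0x10FFFF):

```python
def utf8_encode(code_point):
    """Hand-rolled UTF-8 encoding of one code point (illustrative only)."""
    if code_point < 0x80:            # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:           # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:         # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# Compare against the built-in codec for the characters discussed above.
for ch in 'TU℥âÃ':
    assert utf8_encode(ord(ch)) == ch.encode('utf-8')
    print(ch, utf8_encode(ord(ch)).hex(' '))
```

Every first byte announces how many bytes follow, and every following byte starts with the bits 10, so a decoder can never mistake the middle of one character for the start of another.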

>>> UTF-8 and UTF-16 and UTF-32
>>> I though the number beside of UTF- was to declare how many bits the
>>> character set was using to store a character into the hdd, no?
> 
>>Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
>>UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
>>values to make a surrogate pair.
> 
> A surrogate pair is like hitting for example Ctrl-A, which means it is a
> combination character that consists of 2 different characters? Is this
> what a surrogate is? a pair of 2 chars?

Yes, a surrogate pair is a pair of two "characters". But they're not 
*real* characters. They don't exist in any human language. They are just 
values that tells the program "these go together, and count as a single 
character".

(This is why Unicode prefers to talk about *code points* rather than 
characters. Some code points are characters, and some are not.)
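
A sketch of how such a pair is computed for a code point above U+FFFF (Python 3 assumed; `surrogate_pair` is a hypothetical helper written for this illustration, and the emoji is an arbitrary example):

```python
def surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high/low surrogates."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low

ch = '😀'   # U+1F600
high, low = surrogate_pair(ord(ch))
print(hex(high), hex(low))              # 0xd83d 0xde00

# The built-in UTF-16 codec produces exactly these two 16-bit values.
pair_bytes = high.to_bytes(2, 'big') + low.to_bytes(2, 'big')
print(ch.encode('utf-16be') == pair_bytes)
```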

>>UTF-8 uses 8-bit values, but sometimes it combines two, three or four of
>>them to represent a single code-point.
> 
> 'A' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)

Correct.

> 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is >
> 127 ) 

That looks like two characters to me, 'α' followed by '΄'. That will take 
4 bytes, two for 'α' and two for '΄'.

> 'a chinese ideogramm' to be utf8 encoded needs 4 byte to be stored
> ? (since ordinal >  65000 )

Not necessarily four bytes. Could be three. Depends on the ideogram.

> The amount of bytes needed to store a character solely depends on the
> character's ordinal value in the Unicode table?

Yes.

>>UTF-8 solves this problem by reserving some values to mean "this byte,
>>on its own", and others to mean "this byte, plus the next byte,
>>together", and so forth, up to four bytes.
> 
> Some of the utf-8 bits that are used to represent a character's ordinal
> value are actually been also used to seperate or join the ordinal values
> themselves? Can you give an example please? How there are beign
> seperated?

Did you look up UTF-8 on Wikipedia like I suggested?

>>Computers are digital and work with numbers.
> 
> So character 'A' <-> 65 (in decimal uses in charset's table)  <->
> 01011100 (as binary stored in disk) <-> 0xEF (as hex, when we open the
> file with a hex editor)
> 
> Is this how the thing works? (above values are fictional)

You can check this in Python:

py> c = 'A'
py> ord(c)
65
py> bin(65)
'0b1000001'
py> hex(65)
'0x41'

py> c = 'α'
py> ord(c)
945
py> c.encode('utf-8')
b'\xce\xb1'
py> c.encode('utf-16be')
b'\x03\xb1'

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-09 Thread Andreas Perstinger


On 09.06.2013 11:38, Νικόλαος Κούρας wrote:

s = 'α'
s = s.encode('iso-8859-7').decode('utf-8')
print( s )

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 0: 
unexpected end of data

Why this error? because 'a' ordinal value > 127 ?


>>> s = 'α'
>>> s.encode('iso-8859-7')
b'\xe1'
>>> bin(0xe1)
'0b11100001'

Now look at the table on https://en.wikipedia.org/wiki/UTF-8#Description 
to find out how many bytes a UTF-8 decoder expects when it reads that value.
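
The lookup in that table can be sketched as a small function. This is a simplified illustration (`expected_length` is a made-up name, and a strict decoder also rejects a few lead bytes such as 0xC0, 0xC1 and anything above 0xF4, which this sketch does not check):

```python
def expected_length(lead_byte):
    """How many bytes a UTF-8 decoder expects, judged from the first byte."""
    if lead_byte < 0x80:                # 0xxxxxxx: ASCII, complete on its own
        return 1
    if lead_byte >> 5 == 0b110:         # 110xxxxx
        return 2
    if lead_byte >> 4 == 0b1110:        # 1110xxxx
        return 3
    if lead_byte >> 3 == 0b11110:       # 11110xxx
        return 4
    raise ValueError('continuation or invalid byte cannot start a sequence')

print(expected_length(0xE1))    # 3: two more bytes should follow
print(expected_length(0xCE))    # 2: the lead byte of 'α' in UTF-8
```

Since b'\xe1' is only one byte long, the decoder runs out of input while waiting for the remaining two, hence "unexpected end of data".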


Bye, Andreas

Re: Changing filenames from Greeklish => Greek (subprocess complain)

Please tell me that this can actually be solved.
I am willing to try anything for 'files.py' to load properly.
Everything works as expected on my website; I have managed to correct 
pelatologio.py and koukos.py.

This is the last thing the website needs, that is for files.py to load so users can 
grab important files in Greek format.

Re: Changing filenames from Greeklish => Greek (subprocess complain)

On Sun, 09 Jun 2013 10:55:43 +0200, Lele Gaifax wrote:

> Steven D'Aprano  writes:
> 
>> On Sat, 08 Jun 2013 22:09:57 -0700, nagia.retsina wrote:
>>
>>> chr('A') would give me the mapping of this char, the number 65 while
>>> ord(65) would output the char 'A' likewise.
>>
>> Correct. Python uses Unicode, where code-point 65 ("ordinal value 65")
>> means letter "A".
> 
> Actually, that's the other way around:
> 
> >>> chr(65)
> 'A'
> >>> ord('A')
> 65

/facepalm 

Of course you are right.


>>> What would happen if we we try to re-encode bytes on the disk? like
>>> trying:
>>> 
>>> s = "νίκος"
>>> utf8_bytes = s.encode('utf-8')
>>> greek_bytes = utf_bytes.encode('iso-8869-7')
>>> 
>>> Can we re-encode twice or as many times we want and then decode back
>>> respectively lke?
>>
>> Of course. [...]

> Uhm, no: "encode" transforms a Unicode string into an array of bytes,
> "decode" does the opposite transformation. You cannot do the former on
> an "arbitrary" array of bytes:

And two for two. I misunderstood Nikos' question.

As you point out, no, Python 3 will not allow you to re-encode bytes. You 
must first decode them to a string, then encode them using a 
different encoding. (I thought that this was what Nikos actually meant, 
but on re-reading his question more closely, that's not actually what 
he asked.)

Sorry for any confusion.
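
For the record, a minimal sketch of the decode-then-encode route (Python 3 assumed; the variable names follow Nikos' example, with the encoding name spelled iso-8859-7):

```python
s = 'νίκος'
utf8_bytes = s.encode('utf-8')

# bytes objects have no .encode() method in Python 3.
can_reencode = hasattr(utf8_bytes, 'encode')
print(can_reencode)                         # False

# Transcoding: decode with the source codec, encode with the target one.
greek_bytes = utf8_bytes.decode('utf-8').encode('iso-8859-7')
print(len(utf8_bytes), len(greek_bytes))    # 10 5

# Both byte strings decode back to the same text with their own codec.
print(greek_bytes.decode('iso-8859-7') == s)
```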


-- 
Steven

Re: Changing filenames from Greeklish => Greek (subprocess complain)

I know I have been a pain in the ass these days, but this is the last
explanation I want from you, just to understand it completely.

>> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for 
>> values up to 256? 

>Because then how do you tell when you need one byte, and when you need 
>two? If you read two bytes, and see 0x4C 0xFA, does that mean two 
>characters, with ordinal values 0x4C and 0xFA, or one character with 
>ordinal value 0x4CFA? 

I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 
256, not above 256.


>> UTF-8 and UTF-16 and UTF-32 
>> I though the number beside of UTF- was to declare how many bits the 
>> character set was using to store a character into the hdd, no? 

>Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values. 
>UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit 
>values to make a surrogate pair.

A surrogate pair is like hitting, for example, Ctrl-A, which means it is a
combination character that consists of 2 different characters?
Is this what a surrogate is? A pair of 2 chars?


>UTF-8 uses 8-bit values, but sometimes 
>it combines two, three or four of them to represent a single code-point.

'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)
'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 )
'a Chinese ideogram' to be utf8 encoded needs 4 bytes to be stored ? (since
ordinal >  65000 )

The amount of bytes needed to store a character solely depends on the 
character's ordinal value in the Unicode table?
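
The guesses above can be checked directly in the interpreter. This is a small sketch with sample characters chosen only for illustration. Two details differ from the question: ord('a') is 97 (65 is 'A'), and a common Chinese ideogram such as '中' needs 3 bytes; 4 bytes are only needed for code points above U+FFFF.

```python
def utf8_length(ch):
    """Return how many bytes one character occupies in UTF-8."""
    return len(ch.encode('utf-8'))

samples = ['A', 'a', 'α', '中', '\N{MUSICAL SYMBOL G CLEF}']
for ch in samples:
    # Code point in the usual U+XXXX notation, then the UTF-8 byte count.
    print('U+%04X' % ord(ch), utf8_length(ch), ch.encode('utf-8'))
```

So yes, the byte count depends only on the code point: up to U+007F takes 1 byte, up to U+07FF takes 2, up to U+FFFF takes 3, and anything above takes 4.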


>UTF-8 solves this problem by reserving some values to mean "this byte, on 
>its own", and others to mean "this byte, plus the next byte, together", 
>and so forth, up to four bytes.

Are some of the UTF-8 bits that are used to represent a character's ordinal value
actually also being used to separate or join the ordinal values themselves?
Can you give an example please? How are they being separated?
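
One way to see the marker bits is to print the bytes in binary. The helper names below are made up for this sketch. A byte starting with 0 stands alone, a byte starting with 110 opens a two-byte sequence, and a byte starting with 10 is a continuation byte; the remaining bits carry the ordinal value.

```python
def utf8_bits(ch):
    """Show each UTF-8 byte of a character as 8 binary digits."""
    return ' '.join(format(b, '08b') for b in ch.encode('utf-8'))

def join_two_bytes(raw):
    """Rebuild the code point from a two-byte UTF-8 sequence."""
    lead, cont = raw
    # Strip the '110' marker (keep 5 bits) and the '10' marker (keep 6 bits).
    return ((lead & 0b00011111) << 6) | (cont & 0b00111111)

print(utf8_bits('A'))   # one byte, leading bit 0
print(utf8_bits('α'))   # lead byte 110xxxxx, continuation byte 10xxxxxx
print(join_two_bytes('α'.encode('utf-8')), ord('α'))
```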


>Computers are digital and work with numbers.


So character 'A' <-> 65 (the decimal value used in the charset's table)  <-> 01011100 (as
binary stored on disk) <-> 0xEF (as hex, when we open the file with a hex
editor)

Is this how the thing works? (above values are fictional)
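
The chain is right in principle; only the fictional numbers differ from the real ones. A short sketch that prints the actual values for 'A':

```python
ch = 'A'
code_point = ord(ch)            # position in the Unicode table
stored = ch.encode('utf-8')     # the byte(s) written to disk

as_binary = format(stored[0], '08b')
as_hex = hex(stored[0])
print(code_point, as_binary, as_hex)
```

Decimal 65, binary 01000001 and hex 0x41 are three spellings of the same number, so the hex editor shows 41 for that byte.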

Re: Changing filenames from Greeklish => Greek (subprocess complain)

On Sunday, 9 June 2013 12:14:12 PM UTC+3, Νικόλαος Κούρας wrote:
> On Sunday, 9 June 2013 11:15:07 AM UTC+3, Steven D'Aprano wrote:
>
> > Please try this: log into the Linux server, and then start up a Python
>
> > import os, sys
> > print(sys.version)
> > s = ('\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER BETA}'
> >  '\N{GREEK SMALL LETTER GAMMA}\N{GREEK SMALL LETTER DELTA}'
> >  '\N{GREEK SMALL LETTER EPSILON}')
> > print(s)
> > filename = '/tmp/' + s
> > open(filename, 'w')
> > os.path.exists(filename)
>
> > Copy and paste the results back here please.
>
> Of course: here it is:
>
> root@nikos [/home/nikos/www/cgi-bin]# python
> Python 3.3.2 (default, Jun  3 2013, 16:18:05)
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import os, sys
> >>> print(sys.version)
> 3.3.2 (default, Jun  3 2013, 16:18:05)
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-3)]
> >>> s = ('\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER BETA}'
> ...  '\N{GREEK SMALL LETTER GAMMA}\N{GREEK SMALL LETTER DELTA}'
> ...  '\N{GREEK SMALL LETTER EPSILON}')
> >>> print(s)
> αβγδε
> >>> filename = '/tmp/' + s
> >>> open(filename, 'w')
> <_io.TextIOWrapper name='/tmp/αβγδε' mode='w' encoding='UTF-8'>
> >>> os.path.exists(filename)
> True
> >>>

I don't know much, but it looks correct to me. Then again, why this error?

Re: Changing filenames from Greeklish => Greek (subprocess complain)

On Sunday, 9 June 2013 12:20:58 PM UTC+3, Lele Gaifax wrote:

> > How about a string i wonder? 
> > s = "νίκος" 
> > what_are these_bytes = s.encode('iso-8869-7').encode(utf-8')

> Ignoring the usual syntax error, this is just a variant of the code I 
> posted: "s.encode('iso-8869-7')" produces a bytes instance which
> *cannot* be "re-encoded" again in whatever encoding.

s = 'a'
s = s.encode('iso-8859-7').decode('utf-8')
print( s )

a (we got the original character back)

s = 'α'
s = s.encode('iso-8859-7').decode('utf-8')
print( s )

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 0: 
unexpected end of data

Why this error? Because the ordinal value of 'α' is > 127?
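
A sketch of what happens at the byte level may help here; the function name is invented for the example. In ISO-8859-7, 'a' is the single byte 0x61, which is also valid UTF-8, while 'α' is the single byte 0xE1. In UTF-8 a byte starting with the bits 1110 announces a three-byte sequence, so a lone 0xE1 is an incomplete sequence.

```python
def greek_iso_byte_as_utf8(ch):
    """Encode one char as ISO-8859-7, then try to read the byte as UTF-8."""
    raw = ch.encode('iso-8859-7')
    bits = format(raw[0], '08b')
    try:
        return raw, bits, raw.decode('utf-8')
    except UnicodeDecodeError as exc:
        return raw, bits, 'UnicodeDecodeError: ' + exc.reason

for ch in ('a', 'α'):
    print(greek_iso_byte_as_utf8(ch))
```

So the failure is not caused by the ordinal value as such, but by decoding with a different codec than the one used for encoding.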

Re: Changing filenames from Greeklish => Greek (subprocess complain)

On Sunday, 9 June 2013 12:12:36 PM UTC+3, Cameron Simpson wrote:
> On 09Jun2013 02:00, Νίκος Γκρ33κ wrote:
> | Steven wrote:
> | >> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
> | >> values up to 256?
> |
> | >Because then how do you tell when you need one byte, and when you need
> | >two? If you read two bytes, and see 0x4C 0xFA, does that mean two
> | >characters, with ordinal values 0x4C and 0xFA, or one character with
> | >ordinal value 0x4CFA?
> |
> | I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant
> | up to 256, not above 256.
>
> Then it would not be UTF-8. UTF-8 will encode an Unicode codepoint. Your
> suggestion will not.

I don't follow.

> | >> UTF-8 and UTF-16 and UTF-32
> | >> I though the number beside of UTF- was to declare how many bits the
> | >> character set was using to store a character into the hdd, no?
> |
> | >Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
> | >UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
> | >values to make a surrogate pair.
> |
> | A surrogate pair is like hitting, for example, Ctrl-A, which means it is a
> | combination character that consists of 2 different characters?
> | Is this what a surrogate is? A pair of 2 chars?
>
> Essentially. The combination represents a code point.
>
> | >UTF-8 uses 8-bit values, but sometimes
> | >it combines two, three or four of them to represent a single code-point.
> |
> | 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)
> | 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 )
> | 'a Chinese ideogram' to be utf8 encoded needs 4 bytes to be stored ? (since
> | ordinal >  65000 )
> |
> | The amount of bytes needed to store a character solely depends on the
> | character's ordinal value in the Unicode table?
>
> Essentially. You can read up on the exact process in Wikipedia or the Unicode
> Standard.



When you say "essentially", does that mean you agree with my statements?

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-09 Thread Lele Gaifax

Νικόλαος Κούρας  writes:

> On Sunday, 9 June 2013 11:55:43 AM UTC+3, Lele Gaifax wrote:
>> Uhm, no: "encode" transforms a Unicode string into an array of bytes,
>> "decode" does the opposite transformation. You cannot do the former on
>> an "arbitrary" array of bytes:
>> 
>> >>> s = "νίκος"
>> >>> utf8_bytes = s.encode('utf-8')
>> >>> greek_bytes = utf8_bytes.encode('iso-8869-7')
>> Traceback (most recent call last):
>>   File "", line 1, in 
>> AttributeError: 'bytes' object has no attribute 'encode'
>
> So something encoded into bytes cannot be re-encoded to some other bytes.
>
> How about a string i wonder?
> s = "νίκος"
> what_are these_bytes = s.encode('iso-8869-7').encode(utf-8')

Ignoring the usual syntax error, this is just a variant of the code I
posted: “s.encode('iso-8869-7')” produces a bytes instance which
*cannot* be "re-encoded" again in whatever encoding.

ciao, lele.
-- 
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
l...@metapensiero.it  | -- Fortunato Depero, 1929.


Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-09 Thread Cameron Simpson

On 09Jun2013 08:15, Steven D'Aprano  
wrote:
| On Sun, 09 Jun 2013 00:00:53 -0700, nagia.retsina wrote:
| > path = b'/home/nikos/public_html/data/apps/'
| > files = os.listdir( path )
| >
| > for filename in files:
| >     # Compute 'path/to/filename'
| >     filepath_bytes = path + filename
| >     for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
| >         try:
| >             filepath = filepath_bytes.decode( encoding )
| >         except UnicodeDecodeError:
| >             continue
| >
| >         # Rename to something valid in UTF-8
| >         if encoding != 'utf-8':
| >             os.rename( filepath_bytes,
| >                        filepath.encode('utf-8') )
| >         assert os.path.exists( filepath )
| >         break
| >     else:
| >         # This only runs if we never reached the break
| >         raise ValueError(
| >             'unable to clean filename %r' % filepath_bytes )
| 
| Editing the traceback to get rid of unnecessary noise from the logging:
| 
| Traceback (most recent call last):
|   File "/home/nikos/public_html/cgi-bin/files.py", line 83, in 
|   assert os.path.exists( filepath )
|   File "/usr/local/lib/python3.3/genericpath.py", line 18, in exists
|   os.stat(path)
| UnicodeEncodeError: 'ascii' codec can't encode characters in position 
| 34-37: ordinal not in range(128)
| 
| > Why am I still receiving Unicode decode errors? With the help of you guys
| > we have written a procedure just to avoid this kind of decoding issue
| > and rename all greek_byted_filenames to utf-8_byted.
| 
| That's a very good question. It works for me when I test it, so I cannot 
| explain why it fails for you.

If he's lucky the UnicodeEncodeError occurred while trying to print
an error message, printing a greek Unicode string in the error with
ASCII as the output encoding (default when not a tty IIRC).
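
The shape of that error can be reproduced without touching the disk. The sketch below is an assumption-laden illustration, not a diagnosis: it encodes a made-up Greek path with the 'ascii' codec by hand, and prints the encodings Python has picked for filenames and for standard output, which may differ between an interactive login shell and a web server process.

```python
import sys

# Hypothetical path, used only to trigger the error message.
name = '/tmp/\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER BETA}'

# These depend on the environment (locale settings) of the process.
print('filesystem encoding:', sys.getfilesystemencoding())
print('stdout encoding:', sys.stdout.encoding)

try:
    name.encode('ascii')
except UnicodeEncodeError as exc:
    message = str(exc)
    print(message)
```

If either printed encoding is ASCII inside the CGI process, a Greek str passed to os.stat() or print() would be expected to fail in the same way.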

Cheers,
-- 
Cameron Simpson 

I generally avoid temptation unless I can't resist it.  - Mae West

Re: Changing filenames from Greeklish => Greek (subprocess complain)

On Sunday, 9 June 2013 11:15:07 AM UTC+3, Steven D'Aprano wrote:

> Please try this: log into the Linux server, and then start up a Python 

> import os, sys 
> print(sys.version)
> s = ('\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER BETA}' 
>  '\N{GREEK SMALL LETTER GAMMA}\N{GREEK SMALL LETTER DELTA}' 
>  '\N{GREEK SMALL LETTER EPSILON}')
> print(s)
> filename = '/tmp/' + s
> open(filename, 'w')
> os.path.exists(filename)

> Copy and paste the results back here please.

Of course: here it is:

root@nikos [/home/nikos/www/cgi-bin]# python
Python 3.3.2 (default, Jun  3 2013, 16:18:05)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, sys
>>> print(sys.version)
3.3.2 (default, Jun  3 2013, 16:18:05)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-3)]
>>> s = ('\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER BETA}'
...  '\N{GREEK SMALL LETTER GAMMA}\N{GREEK SMALL LETTER DELTA}'
...  '\N{GREEK SMALL LETTER EPSILON}')
>>> print(s)
αβγδε
>>> filename = '/tmp/' + s
>>> open(filename, 'w')
<_io.TextIOWrapper name='/tmp/αβγδε' mode='w' encoding='UTF-8'>
>>> os.path.exists(filename)
True
>>>


Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-09 Thread Cameron Simpson

On 09Jun2013 02:00, Νίκος Γκρ33κ wrote:
| Steven wrote:
| >> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for 
| >> values up to 256? 
| 
| >Because then how do you tell when you need one byte, and when you need 
| >two? If you read two bytes, and see 0x4C 0xFA, does that mean two 
| >characters, with ordinal values 0x4C and 0xFA, or one character with 
| >ordinal value 0x4CFA? 
| 
| I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up
| to 256, not above 256.

Then it would not be UTF-8. UTF-8 will encode an Unicode codepoint. Your 
suggestion will not.

I'd point out that if you did this, you'd be back in the same
situation you just encountered with ASCII: the first above-255 value
would raise a UnicodeEncodeError (an error which does not even exist
at present:-)
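
An encoding that does store each of the first 256 code points in one byte already exists: Latin-1. A short sketch of the trade-off, with sample characters picked only for illustration:

```python
# U+00FA fits in one byte in Latin-1 but takes two bytes in UTF-8.
latin1_bytes = 'ú'.encode('latin-1')
utf8_bytes = 'ú'.encode('utf-8')
print(latin1_bytes, utf8_bytes)

# Anything above U+00FF simply cannot be written in Latin-1.
try:
    'α'.encode('latin-1')
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print(exc)
```

UTF-8 gives up the one-byte form for 128..255 so that every code point, including Greek, can be encoded without ambiguity.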

| >> UTF-8 and UTF-16 and UTF-32 
| >> I though the number beside of UTF- was to declare how many bits the 
| >> character set was using to store a character into the hdd, no? 
| 
| >Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values. 
| >UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit 
| >values to make a surrogate pair.
| 
| A surrogate pair is like hitting, for example, Ctrl-A, which means it is a
| combination character that consists of 2 different characters?
| Is this what a surrogate is? A pair of 2 chars?

Essentially. The combination represents a code point.
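
As a rough illustration (the sample character is chosen arbitrarily), a surrogate pair is two 16-bit code units that together stand for one code point above U+FFFF. It is not a key combination, and neither half is a character on its own.

```python
import struct

ch = '\N{MUSICAL SYMBOL G CLEF}'      # U+1D11E, above U+FFFF
raw = ch.encode('utf-16-be')          # 4 bytes = two 16-bit units
high, low = struct.unpack('>2H', raw)
print(hex(high), hex(low))

# Rebuild the code point from the two halves.
rebuilt = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
print(hex(rebuilt), hex(ord(ch)))

# A character inside the 16-bit range needs just one unit.
print(len('α'.encode('utf-16-be')))
```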

| >UTF-8 uses 8-bit values, but sometimes 
| >it combines two, three or four of them to represent a single code-point.
| 
| 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)
| 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 )
| 'a Chinese ideogram' to be utf8 encoded needs 4 bytes to be stored ? (since
| ordinal >  65000 )
|
| The amount of bytes needed to store a character solely depends on the
| character's ordinal value in the Unicode table?

Essentially. You can read up on the exact process in Wikipedia or the Unicode 
Standard.

Cheers,
-- 
Cameron Simpson 

The most annoying thing about being without my files after our disc crash was
discovering once again how widespread BLINK was on the web.

Re: Changing filenames from Greeklish => Greek (subprocess complain)

On Sunday, 9 June 2013 11:55:43 AM UTC+3, Lele Gaifax wrote:
> Steven D'Aprano  writes:
>
> > On Sat, 08 Jun 2013 22:09:57 -0700, nagia.retsina wrote:
> >
> >> chr('A') would give me the mapping of this char, the number 65 while
> >> ord(65) would output the char 'A' likewise.
> >
> > Correct. Python uses Unicode, where code-point 65 ("ordinal value 65")
> > means letter "A".
>
> Actually, that's the other way around:
>
> >>> chr(65)
> 'A'
> >>> ord('A')
> 65
>
> >> What would happen if we we try to re-encode bytes on the disk? like
> >> trying:
> >>
> >> s = "νίκος"
> >> utf8_bytes = s.encode('utf-8')
> >> greek_bytes = utf_bytes.encode('iso-8869-7')
> >>
> >> Can we re-encode twice or as many times we want and then decode back
> >> respectively lke?
> >
> > Of course. Bytes have no memory of where they came from, or what they are
> > used for. All you are doing is flipping bits on a memory chip, or on a
> > hard drive. So long as *you* remember which encoding is the right one,
> > there is no problem. If you forget, and start using the wrong one, you
> > will get garbage characters, mojibake, or errors.
>
> Uhm, no: "encode" transforms a Unicode string into an array of bytes,
> "decode" does the opposite transformation. You cannot do the former on
> an "arbitrary" array of bytes:
>
> >>> s = "νίκος"
> >>> utf8_bytes = s.encode('utf-8')
> >>> greek_bytes = utf8_bytes.encode('iso-8869-7')
> Traceback (most recent call last):
>   File "", line 1, in
> AttributeError: 'bytes' object has no attribute 'encode'

So something encoded into bytes cannot be re-encoded to some other bytes.

How about a string i wonder?
s = "νίκος"
what_are these_bytes = s.encode('iso-8869-7').encode(utf-8')

Re: Changing filenames from Greeklish => Greek (subprocess complain)

On Sunday, 9 June 2013 11:02:48 AM UTC+3, Cameron Simpson wrote:

> In this scenario, really it is the Terminal program (eg Putty) which
> cares about text (what you type, and what gets displayed). It is
> because of mismatches between your Terminal local settings and the
> encoding that was chosen for the filenames that you get garbage
> listings, one way or another.

Can you please give an example that shows a string being Greek-ISO encoded and
then UTF-8 decoded, with the result presented back as:

1. proper text
2. garbage (I know it means trash, but I don't know what a garbage char is)
3. an error
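
All three outcomes can be produced with short strings. This is only a sketch with hand-picked inputs, and the function name is invented for it: plain ASCII survives, the pair 'Γ«' happens to form a valid UTF-8 sequence that means a different character, and a lone 'α' is not valid UTF-8 at all.

```python
def greek_iso_then_utf8(text):
    """Encode as ISO-8859-7, then (wrongly) decode the bytes as UTF-8."""
    raw = text.encode('iso-8859-7')
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError as exc:
        return 'error: ' + exc.reason

print(greek_iso_then_utf8('abc'))   # 1. proper text
print(greek_iso_then_utf8('Γ«'))    # 2. garbage: one unrelated character
print(greek_iso_then_utf8('α'))     # 3. an error
```

"Garbage" in this sense means the decode succeeded but produced characters nobody typed.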

Re: Changing filenames from Greeklish => Greek (subprocess complain)

Steven wrote:
>> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for 
>> values up to 256? 

>Because then how do you tell when you need one byte, and when you need 
>two? If you read two bytes, and see 0x4C 0xFA, does that mean two 
>characters, with ordinal values 0x4C and 0xFA, or one character with 
>ordinal value 0x4CFA? 

I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 
256, not above 256.


>> UTF-8 and UTF-16 and UTF-32 
>> I though the number beside of UTF- was to declare how many bits the 
>> character set was using to store a character into the hdd, no? 

>Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values. 
>UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit 
>values to make a surrogate pair.

A surrogate pair is like hitting, for example, Ctrl-A, which means it is a
combination character that consists of 2 different characters?
Is this what a surrogate is? A pair of 2 chars?


>UTF-8 uses 8-bit values, but sometimes 
>it combines two, three or four of them to represent a single code-point.

'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)
'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 )
'a Chinese ideogram' to be utf8 encoded needs 4 bytes to be stored ? (since
ordinal >  65000 )

The amount of bytes needed to store a character solely depends on the 
character's ordinal value in the Unicode table?


>UTF-8 solves this problem by reserving some values to mean "this byte, on 
>its own", and others to mean "this byte, plus the next byte, together", 
>and so forth, up to four bytes.

Are some of the UTF-8 bits that are used to represent a character's ordinal value
actually also being used to separate or join the ordinal values themselves?
Can you give an example please? How are they being separated?


>Computers are digital and work with numbers.


So character 'A' <-> 65 (the decimal value used in the charset's table)  <-> 01011100 (as
binary stored on disk) <-> 0xEF (as hex, when we open the file with a hex
editor)

Is this how the thing works? (above values are fictional)

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-09 Thread Lele Gaifax

Steven D'Aprano  writes:

> On Sat, 08 Jun 2013 22:09:57 -0700, nagia.retsina wrote:
>
>> chr('A') would give me the mapping of this char, the number 65 while
>> ord(65) would output the char 'A' likewise.
>
> Correct. Python uses Unicode, where code-point 65 ("ordinal value 65") 
> means letter "A".

Actually, that's the other way around:

>>> chr(65)
'A'
>>> ord('A')
65

>> What would happen if we we try to re-encode bytes on the disk? like
>> trying:
>> 
>> s = "νίκος"
>> utf8_bytes = s.encode('utf-8')
>> greek_bytes = utf_bytes.encode('iso-8869-7')
>> 
>> Can we re-encode twice or as many times we want and then decode back
>> respectively lke?
>
> Of course. Bytes have no memory of where they came from, or what they are 
> used for. All you are doing is flipping bits on a memory chip, or on a 
> hard drive. So long as *you* remember which encoding is the right one, 
> there is no problem. If you forget, and start using the wrong one, you 
> will get garbage characters, mojibake, or errors.

Uhm, no: "encode" transforms a Unicode string into an array of bytes,
"decode" does the opposite transformation. You cannot do the former on
an "arbitrary" array of bytes:

>>> s = "νίκος"
>>> utf8_bytes = s.encode('utf-8')
>>> greek_bytes = utf8_bytes.encode('iso-8869-7')
Traceback (most recent call last):
  File "", line 1, in 
AttributeError: 'bytes' object has no attribute 'encode'

ciao, lele.
-- 
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
l...@metapensiero.it  | -- Fortunato Depero, 1929.


Re: Changing filenames from Greeklish => Greek (subprocess complain)

On Sun, 09 Jun 2013 00:00:53 -0700, nagia.retsina wrote:

> path = b'/home/nikos/public_html/data/apps/'
> files = os.listdir( path )
>
> for filename in files:
>     # Compute 'path/to/filename'
>     filepath_bytes = path + filename
>     for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
>         try:
>             filepath = filepath_bytes.decode( encoding )
>         except UnicodeDecodeError:
>             continue
>
>         # Rename to something valid in UTF-8
>         if encoding != 'utf-8':
>             os.rename( filepath_bytes,
>                        filepath.encode('utf-8') )
>         assert os.path.exists( filepath )
>         break
>     else:
>         # This only runs if we never reached the break
>         raise ValueError(
>             'unable to clean filename %r' % filepath_bytes )

Editing the traceback to get rid of unnecessary noise from the logging:

Traceback (most recent call last):
  File "/home/nikos/public_html/cgi-bin/files.py", line 83, in 
  assert os.path.exists( filepath )
  File "/usr/local/lib/python3.3/genericpath.py", line 18, in exists
  os.stat(path)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 
34-37: ordinal not in range(128)


> Why am I still receiving Unicode decode errors? With the help of you guys
> we have written a procedure just to avoid this kind of decoding issue
> and rename all greek_byted_filenames to utf-8_byted.

That's a very good question. It works for me when I test it, so I cannot 
explain why it fails for you.

Please try this: log into the Linux server, and then start up a Python 
interactive session by entering:

python3.3

at the $ prompt. Then, at the >>> prompt, enter these lines of code. You 
can copy and paste them:


import os, sys
print(sys.version)
s = ('\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER BETA}'
 '\N{GREEK SMALL LETTER GAMMA}\N{GREEK SMALL LETTER DELTA}'
 '\N{GREEK SMALL LETTER EPSILON}')
print(s)
filename = '/tmp/' + s
open(filename, 'w')
os.path.exists(filename)


Copy and paste the results back here please.



> Is it the assert that fail? Do we have some logic error someplace i dont
> see?

Please read the error message. Does it say AssertionError?

If it says AssertionError, then the assert has failed. If it says 
something else, the code failed before the assert can run.


-- 
Steven

Re: Changing filenames from Greeklish => Greek (subprocess complain)

I'm sorry, I posted unnecessary code by mistake. Here is the correct code that
produced the above error:


#
# Collect directory and its filenames as bytes
path = b'/home/nikos/public_html/data/apps/'
files = os.listdir( path )

for filename in files:
    # Compute 'path/to/filename'
    filepath_bytes = path + filename
    for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
        try:
            filepath = filepath_bytes.decode( encoding )
        except UnicodeDecodeError:
            continue

        # Rename to something valid in UTF-8
        if encoding != 'utf-8':
            os.rename( filepath_bytes, filepath.encode('utf-8') )

        assert os.path.exists( filepath )
        break
    else:
        # This only runs if we never reached the break
        raise ValueError( 'unable to clean filename %r' % filepath_bytes )


#
# Collect filenames of the path dir as strings
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for filename in filenames:
    try:
        # Check the presence of a file against the database and insert if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
        data = cur.fetchone()

        if not data:
            # First time for file; primary key is automatic, hit is defaulted
            print( "iam here", filename + '\n' )
            cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
    except pymysql.ProgrammingError as e:
        print( repr(e) )


#
# Collect filenames of the path dir as strings
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )
filepaths = set()

# Build a set of 'path/to/filename' based on the objects of path dir
for filename in filenames:
    filepaths.add( filename )

# Delete spurious
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()

# Check database's filenames against path's filenames
for rec in data:
    if rec not in filepaths:
        cur.execute('''DELETE FROM files WHERE url = %s''', rec )

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-09 Thread Cameron Simpson

On 09Jun2013 06:25, Steven D'Aprano  
wrote:
| [... heaps of useful explanation ...]
| > When locale to linux system is set to utf-8 that would mean that the 
| > linux applications, should try to encode string into hdd by using 
| > system's default encoding to utf-8 nad read them back from bytes by
| > also using utf-8. Is that correct?
| 
| Yes.

Although I'd point out that only applications that care about text
as _text_ need to consider Unicode and the encoding. A command like
"mv" does not care. You type the command and "mv" receives byte
strings as its arguments. So it is doing straightforward "bytes"
file renames. It does not care or even know about encodings.

In this scenario, really it is the Terminal program (eg Putty) which
cares about text (what you type, and what gets displayed). It is
because of mismatches between your Terminal local settings and the
encoding that was chosen for the filenames that you get garbage
listings, one way or another.
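
The same point can be seen from Python, as a sketch that assumes a POSIX-style filesystem storing names as raw bytes and uses a throwaway temporary directory. The directory hands back exactly the bytes it was given, and only the choice of decoder decides whether a listing looks right or like garbage.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create a file whose name is the UTF-8 bytes of three Greek letters.
    name_bytes = 'αβγ'.encode('utf-8')
    path_bytes = os.path.join(os.fsencode(tmp), name_bytes)
    open(path_bytes, 'wb').close()
    # Passing a bytes path makes listdir return raw bytes names.
    listed = os.listdir(os.fsencode(tmp))

as_utf8 = listed[0].decode('utf-8')           # matches how it was written
as_greek_iso = listed[0].decode('iso-8859-7') # wrong decoder: garbage
print(listed, as_utf8, as_greek_iso)
```

A terminal configured for ISO-8859-7 would show the second form for the very same file.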

Cheers,
-- 
Cameron Simpson 

But then, I'm only 50. Things may well get a bit much for me when I
reach the gasping heights of senile decrepitude of which old Andy
Woodward speaks with such feeling.
- Chris Malcolm, c...@uk.ac.ed.aifh, DoD #205

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-09 Thread nagia . retsina

Thanks Steven, I'll read them in a bit. Once I have read them, can you perhaps tell me
what's wrong and why I am still getting decode issues?

[CODE]
# 
=
# If user downloaded a file, thank the user !!!
# 
=
if filename:
#update file counter if cookie does not exist
if not nikos:
cur.execute('''UPDATE files SET hits = hits + 1, host = %s, 
lastvisit = %s WHERE url = %s''', (host, lastvisit, filename) )

print('''Το αρχείο  %s κατεβαίνει!''' % filename )
print('')
print('''Και τώρα Tetris μέχρι να 
ολοκληρωθεί :-)''' )
print('''http://fpdownload.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,0,0";
 width="450" height="300"">http://www.fugly.com/f/1e6d8cd7b905f4e1bf72"; />http://www.fugly.com/f/1e6d8cd7b905f4e1bf72"; AllowScriptAccess="always" 
menu="false" quality="high" width="450" height="300" name="FuglyGame" 
align="middle" type="application/x-shockwave-flash" 
pluginspage="http://www.macromedia.com/go/getflashplayer"; />''')

print( '' % 
filename )
sys.exit(0)


# 
=
# Display download button for each file and download it on click
# 
=
print('''
 
 
''')


#
# Collect directory and its filenames as bytes
path = b'/home/nikos/public_html/data/apps/'
files = os.listdir( path )

for filename in files:
    # Compute 'path/to/filename'
    filepath_bytes = path + filename
    for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
        try:
            filepath = filepath_bytes.decode( encoding )
        except UnicodeDecodeError:
            continue

        # Rename to something valid in UTF-8
        if encoding != 'utf-8':
            os.rename( filepath_bytes, filepath.encode('utf-8') )

        assert os.path.exists( filepath )
        break
    else:
        # This only runs if we never reached the break
        raise ValueError( 'unable to clean filename %r' % filepath_bytes )


#
# Collect filenames of the path dir as strings
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for filename in filenames:
    try:
        # Check the presence of a file against the database and insert if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
        data = cur.fetchone()

        if not data:
            # First time for file; primary key is automatic, hit is defaulted
            print( "iam here", filename + '\n' )
            cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
    except pymysql.ProgrammingError as e:
        print( repr(e) )


#
# Collect filenames of the path dir as strings
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )
filepaths = set()

# Build a set of 'path/to/filename' based on the objects of path dir
for filename in filenames:
    filepaths.add( filename )

# Delete spurious
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()

# Check database's filenames against path's filenames
for rec in data:
    if rec not in filepaths:
        cur.execute('''DELETE FROM files WHERE url = %s''', rec )
[/CODE] 

When I try to run it, it still errors out:

[CODE]
[Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173] Original exception 
was:, referer: http://superhost.gr/
[Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173] Traceback (most 
recent call last):, referer: http://superhost.gr/
[Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173]   File 
"/home/nikos/public_html/cgi-bin/files.py", line 83, in , referer: 
http://superhost.gr/
[Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173] assert 
os.path.exists( filepath ), referer: http://superhost.gr/
[Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173]   File 
"/usr/local/lib/python3.3/genericpath.py", line 18, in exists, referer: 
http://superhost.gr/
[Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173] os.stat(path), 
referer: http://superhost.gr/
[Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173] UnicodeEncodeError: 
'ascii' codec can't encode character

Re: Changing filenames from Greeklish => Greek (subprocess complain)

On Sat, 08 Jun 2013 22:09:57 -0700, nagia.retsina wrote:

> chr('A') would give me the mapping of this char, the number 65 while
> ord(65) would output the char 'A' likewise.

Correct. Python uses Unicode, where code-point 65 ("ordinal value 65") 
means letter "A".

There are older encodings. For example, a very old one, used on IBM 
mainframes, is EBCDIC, where ordinal value 65 means the letter "â", and 
the letter "A" has ordinal value 193.
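
Python ships codecs for some EBCDIC code pages, so the different numbering can be observed directly. The choice of 'cp037' below is an assumption made for this sketch; EBCDIC has many variants, and other byte values differ between them.

```python
ascii_value = 'A'.encode('ascii')[0]
ebcdic_value = 'A'.encode('cp037')[0]
print(ascii_value, ebcdic_value)

# The same byte means different letters under different charsets.
byte = bytes([193])
print(byte.decode('cp037'), byte.decode('latin-1'))
```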

> What would happen if we we try to re-encode bytes on the disk? like
> trying:
> 
> s = "νίκος"
> utf8_bytes = s.encode('utf-8')
> greek_bytes = utf_bytes.encode('iso-8869-7')
> 
> Can we re-encode twice or as many times we want and then decode back
> respectively lke?

Of course. Bytes have no memory of where they came from, or what they are 
used for. All you are doing is flipping bits on a memory chip, or on a 
hard drive. So long as *you* remember which encoding is the right one, 
there is no problem. If you forget, and start using the wrong one, you 
will get garbage characters, mojibake, or errors.

[...]
> And also is there a deiffrence between "encoding" and "compressing" ?

Of course. They are totally unrelated.

> Isnt the latter useing some form of encoding to take a string or bytes
> to make hold less space on disk?

Correct, except forget about "encoding". It's not relevant (except, 
maybe, in a mathematical sense) and will just confuse you.

-- 
Steven

Re: Changing filenames from Greeklish => Greek (subprocess complain)

On Sun, 09 Jun 2013 07:46:40 +0300, Νικόλαος Κούρας wrote:

> Why does every character in a character set needs to be associated with
> a numeric value?

Because computers are digital, not analog, and because bytes are numbers.

Here are a few of the 256 possible bytes, written in binary, decimal and 
hexadecimal:

0b00000000 0 0x00
0b00000001 1 0x01
0b00000010 2 0x02
[...]
0b01111111 127 0x7F
0b10000000 128 0x80
[...]
0b11111110 254 0xFE
0b11111111 255 0xFF

EVERYTHING in a computer is a number, because everything is stored as 
bytes. Text is stored as bytes. Sound files are stored as bytes. Images 
are stored as bytes. Programs are stored as bytes. So everything is being 
stored as numbers. But the *meaning* we give to those numbers depends on 
what we do with them, whether we treat them as characters, bitmapped 
images, floating point values, or something else.
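To illustrate, here is a sketch that reads the same four bytes in several ways. The sample bytes are arbitrary and chosen only for the example:

```python
import struct

raw = b'ABCD'  # four bytes: 0x41 0x42 0x43 0x44

as_text = raw.decode('ascii')             # each byte read as an ASCII character
as_integer = int.from_bytes(raw, 'big')   # all four read as one big-endian integer
as_float = struct.unpack('>f', raw)[0]    # the same bits read as a 32-bit float
as_numbers = list(raw)                    # or simply as four small numbers

print(as_text, as_integer, as_float, as_numbers)
```

The bytes never change; only the interpretation applied to them does.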

Once we decide we want to store the character "A" as bytes, we need to 
decide which number it should be. That is the job of the charset.

ASCII:

65 <--> 'A'
66 <--> 'B'
67 <--> 'C'
etc.

> I mean, couldn't we just have character sets that wouldn't have numeric
> associations, like:
> 
> 'A' => encoding process (i.e. utf-8) => bytes
> bytes => decoding process (i.e. utf-8) => character 'A'

No. How would you store it in a computer's memory, or on a hard drive? By 
carving a tiny, microscopic "A" onto the hard drive? How would you read 
it back?

It is theoretically possible to build an analog computer, out of 
clockwork, or water flowing through pipes, or something, but nobody 
really bothers because it is much harder and not very useful.

> An ordinal = ordered numbers like 7, 8, 9, 10 and so on?

Yes.

> Since 1 byte can hold up to 256 values, why doesn't UTF-8 use 1 byte for
> values up to 256?

Because then how do you tell when you need one byte, and when you need 
two? If you read two bytes, and see 0x4C 0xFA, does that mean two 
characters, with ordinal values 0x4C and 0xFA, or one character with 
ordinal value 0x4CFA?

UTF-8 solves this problem by reserving some values to mean "this byte, on 
its own", and others to mean "this byte, plus the next byte, together", 
and so forth, up to four bytes.
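The reserved values can be seen in the leading bits of the first byte. A minimal sketch; the four sample characters are picked only because they need 1, 2, 3 and 4 bytes respectively:

```python
samples = ['A', 'α', '€', '\U0001F600']

for ch in samples:
    encoded = ch.encode('utf-8')
    bits = ' '.join(format(byte, '08b') for byte in encoded)
    print('U+%04X' % ord(ch), len(encoded), bits)

lengths = [len(ch.encode('utf-8')) for ch in samples]
lead_bytes = [format(ch.encode('utf-8')[0], '08b') for ch in samples]
```

A first byte starting with 0 stands alone, 110 announces a two-byte sequence, 1110 a three-byte one and 11110 a four-byte one; continuation bytes start with 10.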

If you look up UTF-8 on Wikipedia, you will see more about this.

> UTF-8 and UTF-16 and UTF-32:
> I thought the number beside UTF- declared how many bits the
> character set uses to store a character on the hdd, no?

Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values. 
UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit 
values to make a surrogate pair. UTF-8 uses 8-bit values, but sometimes 
it combines two, three or four of them to represent a single code-point.
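A quick way to compare the three is to count the bytes each one produces. This is only a sketch; the '-le' codec names are used so that no byte-order mark is added to the counts:

```python
encodings = ('utf-8', 'utf-16-le', 'utf-32-le')

sizes = {}
for ch in ('A', 'α', '\U0001F600'):
    # Number of bytes this one character needs in each encoding.
    sizes[ch] = [len(ch.encode(enc)) for enc in encodings]
    print(repr(ch), sizes[ch])
```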

> > "Narrow" Unicode uses two bytes per character. Since two bytes is only
> > enough for about 65,000 characters, not 1,000,000+, the rest of the
> > characters are stored as pairs of two-byte "surrogates".
> 
> Can you please explain this line, "the rest of the characters are stored
> as pairs of two-byte "surrogates"", more simply so I can understand it?
> I'm still having trouble understanding what a surrogate is.

Look up UTF-16 and "surrogate pair" on Wikipedia.

But basically, there are 65000+ different possible 16-bit values 
available for UTF-16 to use. Some of those values are reserved to mean 
"this value is not a character, it is half of a surrogate pair". Since 
they are *pairs*, they must always come in twos. A surrogate pair makes 
up a valid character. Half of a surrogate pair, on its own, is an error.
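As a sketch of how a pair is formed, the following takes one code point above 0xFFFF (chosen arbitrarily), lets Python encode it as UTF-16, and recomputes the two halves by hand using the UTF-16 rules:

```python
ch = '\U0001F600'                 # a code point above 0xFFFF
units = ch.encode('utf-16-be')    # four bytes: two 16-bit values

high = int.from_bytes(units[:2], 'big')
low = int.from_bytes(units[2:], 'big')

# The same pair computed by hand: subtract 0x10000, then split the
# remaining 20 bits into a top half and a bottom half of 10 bits each.
offset = ord(ch) - 0x10000
expected_high = 0xD800 + (offset >> 10)
expected_low = 0xDC00 + (offset & 0x3FF)

print(hex(high), hex(low))

# Half a pair on its own is not a character and cannot be encoded as UTF-8.
try:
    '\ud83d'.encode('utf-8')
    lone_surrogate_rejected = False
except UnicodeEncodeError:
    lone_surrogate_rejected = True
```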

A lot of this complexity is because of historical reasons. For example, 
when Unicode was first invented, it had fewer than 65 thousand characters, 
and a fixed 16 bits was all you needed. But it was soon learned that 65 
thousand was not enough (there are more than 65,000 Asian characters 
alone!) and so UTF-16 developed the trick with surrogate pairs to cover 
the extras.

[...]
> When the locale of a Linux system is set to UTF-8, that would mean that
> Linux applications should encode strings to the hdd using the system's
> default encoding, UTF-8, and read them back from bytes also using
> UTF-8. Is that correct?

Yes.
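What Python actually picks up from the locale can be checked from inside the process. A minimal sketch; the output depends entirely on the environment, and a script started by a web server may see a different locale from an interactive shell (that is an assumption about setups in general, not something confirmed for this server):

```python
import locale
import sys

# Encoding Python uses to turn str filenames into bytes for the OS.
filesystem_encoding = sys.getfilesystemencoding()

# Encoding the locale suggests for text when none is given explicitly.
preferred_encoding = locale.getpreferredencoding(False)

print(filesystem_encoding, preferred_encoding)
```

If either of these reports an ASCII-only encoding, passing a str filename containing Greek letters to os functions can raise UnicodeEncodeError.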

-- 
Steven

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-08 Thread nagia . retsina

On Saturday, 8 June 2013 at 9:47:53 PM UTC+3, Chris Angelico wrote:

> Fortunately, Python lets us hide away pretty much all those details, 
> just as it lets us hide away the details of what makes up a list, a
> dictionary, or an integer. You can safely assume that the string "foo"
> is a string of three characters, which you can work with as
> characters. The chr() and ord() functions let you switch between
> characters and numbers, and str.encode() and bytes.decode() let you
> switch between characters and byte sequences. Once you get your head
> around the differences between those three, it all works fairly
> neatly.

I'm trying too!

So,

chr('A') would give me the mapping of this char, the number 65 while
ord(65) would output the char 'A' likewise.

> and str.encode() and bytes.decode() let you switch between characters and
> byte sequences. Once

What would happen if we try to re-encode bytes on the disk?
Like trying:

s = "νίκος"
utf8_bytes = s.encode('utf-8')
greek_bytes = utf8_bytes.encode('iso-8859-7')

Can we re-encode twice, or as many times as we want, and then decode back 
respectively, like this?

utf8_bytes = greek_bytes.decode('iso-8859-7')
s = utf8_bytes.decode('utf-8')

Is something like that totally crazy?

And also, is there a difference between "encoding" and "compressing"?

Isn't the latter using some form of encoding to take a string or bytes and make 
it take less space on disk?

Re: Changing filenames from Greeklish => Greek (subprocess complain)


On 9/6/2013 1:32 AM, Cameron Simpson wrote:

On 08Jun2013 14:14, Νίκος Γκρ33κ wrote:
| On Saturday, 8 June 2013 at 10:01:57 PM UTC+3, Steven D'Aprano wrote:
| > ASCII actually needs 7 bits to store a character. Since computers are
| > optimized to work with bytes, not bits, normally ASCII characters are
| > stored in a single byte, with one bit wasted.
|
| So ASCII and Unicode are 2 Encoding Systems currently in use.
| How should i imagine them, visualize them?
| Like tables 'A' = 65, 'B' = 66 and so on?

Yes, that works.

| But if I do, then that would be the visualization of a 'charset', not of an 
| encoding system.
| What is the difference between an encoding system and a charset?

An encoding system is the method of transcribing these values to bytes and 
back again.

So we have:

( 'A' mapped to the value of '65' ) => encoding process (i.e. utf-8) => bytes
bytes => decoding process (i.e. utf-8) => ( '65' mapped to character 'A' )

Why does every character in a character set need to be associated with 
a numeric value?
I mean, couldn't we just have character sets that wouldn't have numeric 
associations, like:


'A' => encoding process (i.e. utf-8) => bytes
bytes => decoding process (i.e. utf-8) => character 'A'




EBCDIC and ASCII and Unicode and Greek-ISO (iso-8859-7) are all character sets.
(1:1 mappings of characters to numbers/ordinals).

An encoding is a way of writing these values to bytes.
Decoding reads bytes and emits character values.

Because all of the EBCDIC, ASCII and iso-8859-x character sets fit in the 
range 0-255,
they are usually transcribed (encoded) directly, one byte per ordinal.

Unicode is much larger. It cannot be transcribed (encoded) as one byte to one 
value.
There are several ways of transcribing Unicode. UTF-8 is a popular and usually 
compact form,
using one byte for values below 128 and multiple bytes for higher values.

An ordinal = ordered numbers like 7, 8, 9, 10 and so on?

Since 1 byte can hold up to 256 values, why doesn't UTF-8 use 1 byte for 
values up to 256?


UTF-8 and UTF-16 and UTF-32:
I thought the number beside UTF- declared how many bits the 
character set uses to store a character on the hdd, no?


"Narrow" Unicode uses two bytes per character. Since two bytes is only
enough for about 65,000 characters, not 1,000,000+, the rest of the
characters are stored as pairs of two-byte "surrogates".

Can you please explain this line, "the rest of the characters are stored 
as pairs of two-byte "surrogates"", more simply so I can understand it?

I'm still having trouble understanding what a surrogate is.

Again, thank you very much for explaining the encodings to me; they have been 
giving me trouble for years in all of my scripts.



And one last thing.
When the locale of a Linux system is set to UTF-8, that would mean that 
Linux applications should encode strings to the hdd using the system's 
default encoding, UTF-8, and read them back from bytes also using 
UTF-8. Is that correct?

--
Webhost && Weblog 

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-08 Thread Cameron Simpson

On 08Jun2013 14:14, Νίκος Γκρ33κ wrote:
| On Saturday, 8 June 2013 at 10:01:57 PM UTC+3, Steven D'Aprano wrote:
| > ASCII actually needs 7 bits to store a character. Since computers are  
| > optimized to work with bytes, not bits, normally ASCII characters are
| > stored in a single byte, with one bit wasted.
| 
| So ASCII and Unicode are 2 Encoding Systems currently in use.
| How should i imagine them, visualize them?
| Like tables 'A' = 65, 'B' = 66 and so on?

Yes, that works.

| But if I do, then that would be the visualization of a 'charset', not of an 
| encoding system.
| What is the difference between an encoding system and a charset?

An encoding system is the method of transcribing these values to bytes and 
back again.

| ebcdic - ascii - unicode = all of them are encoding systems
| greek-iso - latin-iso - utf8 - utf16 = all of them are charsets.

No.

EBCDIC and ASCII and Unicode and Greek-ISO (iso-8859-7) are all character sets.
(1:1 mappings of characters to numbers/ordinals).

An encoding is a way of writing these values to bytes.
Decoding reads bytes and emits character values.

Because all of the EBCDIC, ASCII and iso-8859-x character sets fit in the 
range 0-255,
they are usually transcribed (encoded) directly, one byte per ordinal.

Unicode is much larger. It cannot be transcribed (encoded) as one byte to one 
value.
There are several ways of transcribing Unicode. UTF-8 is a popular and usually 
compact form,
using one byte for values below 128 and multiple bytes for higher values.

| Why does Python interpret all given strings as Unicode by default and
| not ASCII? Because the former supports many positions while ASCII has
| only 127 positions, hence can interpret only 127 different characters?

Yes.

[...]
| > Latin-1 is similar, except there are 256 positions. Greek ISO-8859-7 is 
| > also similar, also 256 positions, but the characters are different. And 
| > so on, with dozens of charsets. 
| 
| Latin has to display English chars (capital, small) + numbers + symbols. That 
| would be 127, so why 256?

ASCII runs up to 127. Essentially English, numerals, control codes and various 
symbols.

The iso-8859-x sets run to 255, and the upper 128 values map to
characters popular in various regions.

| greek = all of the above plus greek chars, no?

So iso-8859-7 included the Greek characters.
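A short sketch of what "the upper 128 values map to different characters" means in practice. The two sample bytes are arbitrary: one from the ASCII half and one from the upper half:

```python
raw = bytes([0x41, 0xE1])

# The lower half (0x41) is 'A' in both sets; the upper half (0xE1) differs.
as_latin1 = raw.decode('iso-8859-1')
as_greek = raw.decode('iso-8859-7')

print(as_latin1, as_greek)
```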

| > And then there is Unicode, which includes *every* character in all of 
| > those dozens of charsets. It has 1114112 positions (most are currently  
| > unfilled).
| 
| Shouldn't the number of positions that Unicode uses equal the sum
| of all available characters of all the languages of the world, plus
| numbers and special chars? Why 1,000,000+? Why the need for so many
| positions? The narrow Unicode format (2 bytes) can cover all of the
| world's symbols.

2 bytes is not enough. Chinese alone has more glyphs than that.

| > An encoding is simply a program that takes a character and returns a 
| > byte, or vice versa. For instance, the ASCII encoding will take character 
| > 'A'. That is found at position 65, which is 0x41 in hexadecimal, so the 
| > ASCII encoding turns character 'A' into byte 0x41, and vice versa.
| 
| Why do you say ASCII turns a character into hex format and not binary 
| format?

Steven didn't say that. He said "position 65". People often write
bytes in hex (eg 0x41) because a byte always fits in a 2-character
hex (16 x 16) and because often these values have binary-based
subranges, and hex makes that more obvious.

For example, 'A' is 0x41. 'a' is 0x61. So you can look at the hex
code and almost visually know if you're dealing with upper or lower
case, etc.
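The pattern is easy to check. In this sketch (sample letters chosen arbitrarily), upper and lower case ASCII letters differ in exactly one bit, which is the 0x20 visible in the hex codes:

```python
upper = ord('A')
lower = ord('a')

print(hex(upper), hex(lower))
print(format(upper, '08b'), format(lower, '08b'))

# XOR shows which bits differ between the two codes.
case_bit = upper ^ lower

# The same single-bit difference holds for the other ASCII letters tried here.
same_for_others = all(ord(c) ^ ord(c.lower()) == 0x20 for c in 'BCXYZ')
```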

| Isn't the latter the way bytes are stored on the hdd, like 01010010101 etc.?
| Are they stored as hex instead, or did you just say so to avoid printing 0s and 
| 1s?

They're stored as bits at the gate level. But writing hex codes
_in_ _text_ is more compact, and more readable for humans.

Cheers,
-- 
Cameron Simpson 

A lot of people don't know the difference between a violin and a viola, so
I'll tell you.  A viola burns longer.   - Victor Borge

Re: Changing filenames from Greeklish => Greek (subprocess complain)

On Sun, Jun 9, 2013 at 7:21 AM, Νικόλαος Κούρας  wrote:
> Sorry for displaying my code so many times, I know I have exhausted you, but this 
> is the last thing I am going to ask from you in this thread. We are very close 
> to having this working.

You need to spend more time reading and less time frantically jumping
around. Go read my post on Unicode; it answers several of the
questions you posted in response to Steven's. And please, don't use
this list as your substitute for source control. Don't keep posting
your code. Most of us are ignoring it already.

ChrisA

Re: Changing filenames from Greeklish => Greek (subprocess complain)

Sorry for displaying my code so many times, I know I have exhausted you, but this 
is the last thing I am going to ask from you in this thread. We are very close to 
having this working.


#
# Collect directory and its filenames as bytes
path = b'/home/nikos/public_html/data/apps/'
files = os.listdir( path )

for filename in files:
    # Compute 'path/to/filename'
    filepath_bytes = path + filename
    for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
        try:
            filepath = filepath_bytes.decode( encoding )
        except UnicodeDecodeError:
            continue

        # Rename to something valid in UTF-8
        if encoding != 'utf-8':
            os.rename( filepath_bytes, filepath.encode('utf-8') )

        assert os.path.exists( filepath )
        break
    else:
        # This only runs if we never reached the break
        raise ValueError( 'unable to clean filename %r' % filepath_bytes )


#
# Collect filenames of the path dir as strings
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for filename in filenames:
    try:
        # Check the presence of a file against the database and insert if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
        data = cur.fetchone()

        if not data:
            # First time for file; primary key is automatic, hit is defaulted
            print( "iam here", filename + '\n' )
            cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
    except pymysql.ProgrammingError as e:
        print( repr(e) )


#
# Collect filenames of the path dir as strings
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )
filepaths = set()

# Build a set of filenames based on the objects of path dir
for filename in filenames:
    filepaths.add( filename )

# Delete spurious
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()

# Check database's filenames against path's filenames
for rec in data:
    # Each row is a 1-tuple (assuming the default tuple cursor), so compare its url column
    if rec[0] not in filepaths:
        cur.execute('''DELETE FROM files WHERE url = %s''', (rec[0],) )





=
[Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173] Original exception 
was:
[Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173] Traceback (most 
recent call last):
[Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173]   File 
"/home/nikos/public_html/cgi-bin/files.py", line 78, in 
[Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173] assert 
os.path.exists( filepath )
[Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173]   File 
"/usr/local/lib/python3.3/genericpath.py", line 18, in exists
[Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173] os.stat(path)
[Sun Jun 09 00:16:14 2013] [error] [client 79.103.41.173] UnicodeEncodeError: 
'ascii' codec can't encode characters in position 34-37: ordinal not in 
range(128)
==

The assert is there to make sure that path/to/file exists after the rename, but why 
are we still getting those UnicodeEncodeErrors?

Re: Changing filenames from Greeklish => Greek (subprocess complain)

On Saturday, 8 June 2013 at 10:01:57 PM UTC+3, Steven D'Aprano wrote:

> ASCII actually needs 7 bits to store a character. Since computers are  
> optimized to work with bytes, not bits, normally ASCII characters are
> stored in a single byte, with one bit wasted.

So ASCII and Unicode are 2 encoding systems currently in use.
How should I imagine them, visualize them?
Like tables 'A' = 65, 'B' = 66 and so on?

But if I do, then that would be the visualization of a 'charset', not of an 
encoding system.
What is the difference between an encoding system and a charset?

ebcdic - ascii - unicode = all of them are encoding systems

greek-iso - latin-iso - utf8 - utf16 = all of them are charsets.

What are these differences? I can't imagine them all; I can only imagine 
charsets, not encoding systems.

Why does Python interpret all given strings as Unicode by default and not ASCII? 
Because the former supports many positions while ASCII has only 127 positions, 
hence can interpret only 127 different characters?


> "Narrow" Unicode uses two bytes per character. Since two bytes is only 
> enough for about 65,000 characters, not 1,000,000+, the rest of the 
> characters are stored as pairs of two-byte "surrogates".

Does "surrogate" literally mean a replacement?


> Latin-1 is similar, except there are 256 positions. Greek ISO-8859-7 is 
> also similar, also 256 positions, but the characters are different. And 
> so on, with dozens of charsets. 

Latin has to display English chars (capital, small) + numbers + symbols. That 
would be 127, so why 256?

greek = all of the above plus Greek chars, no?

> And then there is Unicode, which includes *every* character in all of 
> those dozens of charsets. It has 1114112 positions (most are currently  
> unfilled).

Shouldn't the number of positions that Unicode uses equal the sum of all 
available characters of all the languages of the world, plus numbers and 
special chars? Why 1,000,000+? Why the need for so many positions? The narrow 
Unicode format (2 bytes) can cover all of the world's symbols.

> An encoding is simply a program that takes a character and returns a 
> byte, or vice versa. For instance, the ASCII encoding will take character 
> 'A'. That is found at position 65, which is 0x41 in hexadecimal, so the 
> ASCII encoding turns character 'A' into byte 0x41, and vice versa.

Why do you say ASCII turns a character into hex format and not binary format?
Isn't the latter the way bytes are stored on the hdd, like 01010010101 etc.?
Are they stored as hex instead, or did you just say so to avoid printing 0s and 1s?


Re: Changing filenames from Greeklish => Greek (subprocess complain)

On Sun, Jun 9, 2013 at 4:01 AM, Νικόλαος Κούρας  wrote:
> Hold on!
>
> In the beginning there was ASCII with 0-127 values and then there was
> Unicode with the 0-127 of ASCII + I don't know how many more?
>
> Now ASCII needs 1 byte to store a single character while Unicode needs 2
> bytes to store a character, and that is because it has > 256 characters to
> store > 2^8 bits?
>
> Is this correct?

No. Let me start from the beginning.

Computers don't work with characters, or strings, natively. They work
with numbers. To be specific, they work with bits; and it's only by
convention that we can work with anything larger. For instance,
there's a VERY common convention around the PC world that a set of
bits can be interpreted as a signed integer; if the highest bit is
set, it's negative. There are also standards for floating-point (IEEE
754), and so on.

ASCII is a character set. It defines a mapping of numbers to
characters - for instance, @ is 64, SOH is 1, $ is 36, etcetera,
etcetera. There are 128 such mappings. Since they all fit inside a
7-bit number, there's a trivial way to represent ASCII characters in a
PC's 8-bit byte: you just leave the high bit clear and use the other
seven. There have been various schemes for using the eighth bit -
serial ports with parity, WordStar (I think) marking the ends of
words, and most notably, Extended ASCII schemes that give you another
whole set of 128 characters. And that was the beginning of Code Pages,
because nobody could agree on what those extra 128 should be.
Norwegians used Norwegian, the Greeks were taught their Greek,
Arabians created themselves an Arabian codepage with the speed of
summer lightning, and Hebrews allocated from 255 down to 128, which is
absolutely frightening. But I digress.

There were a variety of multi-byte schemes devised at various times,
but we'll ignore all of them and jump straight to Unicode. With
Unicode, there's (theoretically) no need to use any other system ever
again, because whatever character you want, it'll exist in Unicode. In
theory, of course; there are debates over that. Now, Unicode currently
has defined an "address space" of roughly 20 bits, and in a throwback
to the first programming I ever did, it's a segmented system: sixteen
or seventeen planes of 65,536 characters each. (Fortunately the planes
are identified by low numbers, not high numbers, and there's no
stupidity of overlapping planes the way the 8086 did with memory!) The
highest planes are  special (plane 14 has a few special-purpose
characters, planes 15 and 16 are for private use), and most of the
middle ones have no characters assigned to them, so for the most part,
you'll see characters from the first three planes.

So what do we now have? A mapping of characters to "code points",
which are numbers. (I'm leaving aside the issues of combining
characters and such for the moment.) But computers don't work with
numbers, they work with bits. Somehow we have to store those bits in
memory.

There are a good few ways to do that; one is to note that every
Unicode character can be represented inside 32 bits, so we can use the
standard integer scheme safely. (Since they fit inside 31 bits, we
don't even need to care if it's signed or unsigned.) That's called
UTF-32 or UCS-4, and it's a great way to handle the full Unicode range
in a manner that makes a Texan look agoraphobic. Wide builds of Python
up to 3.2 did this. Or you can try to store them in 16-bit numbers,
but then you have to worry about the ones that don't fit in 16 bits,
because it's really hard to squeeze 20 bits of information into 16
bits of storage. UTF-16 is one way to do this; special numbers mean
"grab another number". It has its issues, but is (in my opinion,
unfortunately) fairly prevalent. Narrow builds of Python up to 3.2 did
this. Finally, you can use a more complicated scheme that uses
anywhere from 1 to 4 bytes for each character, by carefully encoding
information into the top bit - if it's set, you have a multi-byte
character. That's how UTF-8 works, and is probably the most prevalent
disk/network encoding.

All of the UTF-X systems are called "UCS Transformation Formats" (UCS
meaning Universal Character Set, roughly "Unicode"). They are mappings
from Unicode numbers to bytes. Between Unicode and UTF-X, you have a
mapping from character to byte sequence.

> Now UTF-8, latin-iso, greek-iso e.t.c are WAYS of storing characters into
> the hard drive?

The ISO standard 8859 specifies a number of ASCII-compatible
encodings, referred to as ISO-8859-1 through ISO-8859-16. You've been
working with ISO-8859-1, also called Latin-1, and ISO-8859-7, which
has your Greek characters in it. These are all ways of translating
characters into numbers; and since they all fit within 8 bits, they're
most commonly represented on PCs with single bytes.

> So taken form above example(the closest i could think of), the way i
> understand them is:
>
> A 'string' can be of (unicode's or ascii's) type and that type n

Re: Changing filenames from Greeklish => Greek (subprocess complain)

On Sat, 08 Jun 2013 21:01:23 +0300, Νικόλαος Κούρας wrote:

> In the beginning there was ASCII with 0-127 values 

No, there were encoding systems that existed before ASCII, such as 
EBCDIC. But we can ignore those, and just start with ASCII.

> and then there was
> Unicode with 0-127 of ASCII's + i dont know how much many more?

No, you have missed the utter chaos of dozens and dozens of Windows 
codepages and charsets. We still have to live with the pain of that.

But now we have Unicode, with code points running from 0 to 0x10FFFF 
(decimal 1114111). You can consider a code point to be the same as a 
character, at least for now.

> Now ASCII needs 1 byte to store a single character 

ASCII actually needs 7 bits to store a character. Since computers are 
optimized to work with bytes, not bits, normally ASCII characters are 
stored in a single byte, with one bit wasted.

> while Unicode needs 2 bytes to store a character 

No. Since Unicode "characters" (really code points, but ignore the 
difference) run up to 0x10FFFF, two bytes is not enough. Unicode needs 
21 bits to store a character. Since that is more than 2 bytes, but less 
than 3, there are a few different ways for Unicode to be stored in 
memory, including:

"Wide" Unicode uses four bytes per character. Why four instead of three? 
Because computers are more efficient when working with chunks of memory 
whose size is a multiple of four bytes.

"Narrow" Unicode uses two bytes per character. Since two bytes is only 
enough for about 65,000 characters, not 1,000,000+, the rest of the 
characters are stored as pairs of two-byte "surrogates".
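The 21-bit figure can be checked directly. A minimal sketch; the comparison with sys.maxunicode assumes Python 3.3 or later, where the narrow/wide build distinction no longer exists:

```python
import sys

highest_code_point = 0x10FFFF

# Number of bits needed to write the largest code point in binary.
bits_needed = highest_code_point.bit_length()

# Too big for two bytes (16 bits), but small enough for three (24 bits).
fits_in_two_bytes = highest_code_point < 2 ** 16
fits_in_three_bytes = highest_code_point < 2 ** 24

print(bits_needed, fits_in_two_bytes, fits_in_three_bytes)
print(sys.maxunicode == highest_code_point)
```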

> and that is because it has > 256 characters
> to store > 2^8bits ?

Correct.

> Now UTF-8, latin-iso, greek-iso e.t.c are WAYS of storing characters
> into the hard drive?

Your computer cannot carve a tiny little "A" into the hard drive when it 
stores that letter in a file. It has to write some bytes. So you need to 
know:

- what byte, or bytes, represents the letter "A"?

- what byte, or bytes, represents the letter "B"?

- what byte, or bytes, represents the letter "λ"?

and so on. This set of rules, "byte X means letter Y", is called an 
encoding. If you don't know what encoding to use, you cannot tell what 
the byte means.

> Because in some post I have read 'UTF-8 encoding of Unicode'. Can
> you please explain to me what the difference is between ASCII and Unicode
> themselves, and then between them and 'charsets'? I'm still confused
> about this.

A charset is an ordered set of characters. For example, ASCII has 128 
characters, starting with NUL:

NUL ... A B C D E ... Z [ \ ] ^ ... a b c ... z ... 

where NUL is at position 0, 'A' is at position 65, 'B' at position 66, 
and so on.

Latin-1 is similar, except there are 256 positions. Greek ISO-8859-7 is 
also similar, also 256 positions, but the characters are different. And 
so on, with dozens of charsets.

And then there is Unicode, which includes *every* character in all of 
those dozens of charsets. It has 1114112 positions (most are currently 
unfilled).

An encoding is simply a program that takes a character and returns a 
byte, or vice versa. For instance, the ASCII encoding will take character 
'A'. That is found at position 65, which is 0x41 in hexadecimal, so the 
ASCII encoding turns character 'A' into byte 0x41, and vice versa.
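In Python terms, a sketch of that round trip looks like this. It also shows what happens when the ASCII encoding is asked for a character it has no position for, which is the same kind of UnicodeEncodeError seen in the logs in this thread:

```python
# Character -> byte, using the ASCII encoding.
encoded = 'A'.encode('ascii')

# Byte -> character, using the same encoding.
decoded = b'\x41'.decode('ascii')

# ASCII has no position for a Greek letter, so encoding it fails.
try:
    'λ'.encode('ascii')
    greek_fits_in_ascii = True
except UnicodeEncodeError:
    greek_fits_in_ascii = False

print(encoded, decoded, greek_fits_in_ascii)
```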

> Is it like we said in C++:
> ' int a', a variable with name 'a' of type integer. 'char a',   a
> variable with name 'a' of type char
> 
> So taken from the above example (the closest I could think of), the way I
> understand them is:
> 
> A 'string' can be of (Unicode's or ASCII's) type and that type needs a
> way (that's a charset) to store this string on the hdd as a sequence of
> bytes?

Correct.

-- 
Steven

Re: Changing filenames from Greeklish => Greek (subprocess complain)


On 8/6/2013 5:49 AM, Cameron Simpson wrote:

On 07Jun2013 04:53, Νίκος Γκρ33κ wrote:
| On Friday, 7 June 2013 at 11:53:04 AM UTC+3, Cameron Simpson wrote:
| > | >| Does errors='replace' mean don't break in case of error?
| >
| > | >Yes. The result will be correct for correct iso-8859-7 and slightly mangled
| > | >for something that would not decode smoothly.
| >
| > | How can it be correct? We have encoded our string in utf-8 and then
| > | we tried to decode it as greek-iso? How can this possibly be
| > | correct?
|
| > If it is a valid iso-8859-7 sequence (which might cover everything,
| > since I expect it is an 8-bit 1:1 mapping from bytes values to a
| > set of codepoints, just like iso-8859-1) then it may decode to the
| > "wrong" characters, but the reverse process (characters encoded as
| > bytes) should produce the original bytes.  With a mapping like this,
| > errors='replace' may mean nothing; there will be no errors because
| > the only Unicode characters in play are all from iso-8859-7 to start
| > with. Of course another string may not be safe.
|
| > Visually, the names will be garbage. And if you go:
| >   mv '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ.mp3' '999-Eυχή-του-Ιησού.mp3'
| > while using the iso-8859-7 locale, the wrong thing will occur
| > (assuming it even works, though I think it should because all these
| > characters are represented in iso-8859-7, yes?)
|
| I understood all the rest; only the quotes above are still unclear to me.
| I can't seem to understand them.
|
| Do you mean that utf-8, latin-iso, greek-iso and ASCII have the same first 0-127 
| codepoints?

Yes. It is certainly true for utf-8 and latin-iso and ASCII.
I expect it to be so for greek-iso, but have not checked.

They're all essentially the ASCII set plus a range of other character
codepoints for the upper values.  The 8-bit sets iso-8859-1 (which
I take you to mean by "latin-iso") and iso-8859-7 (which I take you
to mean by "greek-iso") are single byte mapping with the top half
mapped to characters commonly used in a particular region.

Unicode has a much greater range, but the UTF-8 encoding of Unicode
deliberately has the bottom 0-127 identical to ASCII, and higher
values represented by multibyte sequences whose bytes all lie
in the 128-255 range. In this way pure ASCII files
are already in UTF-8 (and, in fact, work just fine for the iso-8859-x
encodings as well).
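This compatibility is easy to confirm. A minimal sketch; the sample text is arbitrary and only needs to be pure ASCII:

```python
text = 'hello.txt'
encodings = ('ascii', 'utf-8', 'iso-8859-1', 'iso-8859-7')

# Pure ASCII text produces exactly the same bytes in all four encodings.
encoded = {enc: text.encode(enc) for enc in encodings}
all_identical = len(set(encoded.values())) == 1

# Likewise, the bytes 0-127 decode to the same characters in each of them.
low_bytes = bytes(range(128))
decoded = {enc: low_bytes.decode(enc) for enc in encodings}
same_low_half = len(set(decoded.values())) == 1

print(all_identical, same_low_half)
```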


Hold on!

In the beginning there was ASCII with 0-127 values and then there was 
Unicode with the 0-127 of ASCII + I don't know how many more?


Now ASCII needs 1 byte to store a single character while Unicode needs 
2 bytes to store a character, and that is because it has > 256 characters 
to store > 2^8 bits?


Is this correct?

Now UTF-8, latin-iso, greek-iso etc. are WAYS of storing characters 
on the hard drive?


Because in some post I have read 'UTF-8 encoding of Unicode'.
Can you please explain to me what the difference is between ASCII and Unicode 
themselves, and then between them and 'charsets'? I'm still confused 
about this.


Is it like we said in C++:
' int a', a variable with name 'a' of type integer.
'char a',   a variable with name 'a' of type char

So taken from the above example (the closest I could think of), the way I 
understand them is:


A 'string' can be of (Unicode's or ASCII's) type and that type needs a 
way (that's a charset) to store this string on the hdd as a sequence of 
bytes?







--
Webhost && Weblog 
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-08 Thread MRAB


On 08/06/2013 17:53, Νικόλαος Κούρας wrote:

Sorry for the delay guys, I was busy with other things today and I am still
reading your responses; I haven't read them all yet, just Cameron's.

Here is what I have now, following Cameron's advice:


#
# Collect filenames of the path directory as bytes
path = b'/home/nikos/public_html/data/apps/'
filenames_bytes = os.listdir( path )

for filename_bytes in filenames_bytes:
    try:
        filename = filename_bytes.decode('utf-8')
    except UnicodeDecodeError:
        # Since it's not a utf8 bytestring then it's for sure a greek bytestring

        # Prepare arguments for rename to happen
        utf8_filename = filename_bytes.encode('utf-8')
        greek_filename = filename_bytes.encode('iso-8859-7')

        utf8_path = path + utf8_filename
        greek_path = path + greek_filename

        # Rename current filename from greek bytes --> utf8 bytes
        os.rename( greek_path, utf8_path )
==

I know this is wrong though.


Yet you did it anyway!


Since filename_bytes is the current filename encoded as utf8 or greek-iso
then i cannot just *encode* what is already encoded by doing this:

utf8_filename = filename_bytes.encode('utf-8')
greek_filename = filename_bytes.encode('iso-8859-7')


Try reading and understanding the code I originally posted.
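
The point made above, that bytes which are already encoded cannot be encoded again, can be sketched as follows. This is only an illustration, not the code originally posted in the thread: the helper name transcode and the sample name are hypothetical. In Python 3 a bytes object has no .encode() method, so going from ISO-8859-7 bytes to UTF-8 bytes takes a decode step followed by an encode step.

```python
def transcode(name_bytes, source='iso-8859-7', target='utf-8'):
    """Decode bytes to text with one encoding, then encode the text with another."""
    text = name_bytes.decode(source)      # bytes -> str
    return text.encode(target)            # str -> bytes


# Hypothetical stand-in for one name returned by os.listdir() on a bytes path.
greek_bytes = 'Ευχή'.encode('iso-8859-7')
utf8_bytes = transcode(greek_bytes)

# Each of these Greek letters is 1 byte in ISO-8859-7 and 2 bytes in UTF-8.
print(len(greek_bytes), len(utf8_bytes))
```

The two byte strings would then be joined to the directory path (also bytes) to form the old and new arguments of os.rename().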


Re: Changing filenames from Greeklish => Greek (subprocess complain)

Okay, after also reading Steven's post, I was relieved from the stuck position
I was in, so with the alteration of a few variable names here is the code
now:


#
# Collect directory and its filenames as bytes
path = b'/home/nikos/public_html/data/apps/'
files = os.listdir( path )

for filename in files:
    # Compute 'path/to/filename'
    filepath_bytes = path + filename
    for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
        try:
            filepath = filepath_bytes.decode( encoding )
        except UnicodeDecodeError:
            continue

        # Rename to something valid in UTF-8
        if encoding != 'utf-8':
            os.rename( filepath_bytes, filepath.encode('utf-8') )

        assert os.path.exists( filepath )
        break
    else:
        # This only runs if we never reached the break
        raise ValueError( 'unable to clean filename %r' % filepath_bytes )

=

I don't know why it is still failing when it tries to decode, since it
tries 3 ways of decoding. Here is the exact error:


ni...@superhost.gr [~/www/cgi-bin]# [Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173] Error in sys.excepthook:
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173] ValueError: underlying buffer has been detached
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173]
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173] Original exception was:
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173] Traceback (most recent call last):
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173]   File "/home/nikos/public_html/cgi-bin/files.py", line 78, in <module>
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173]     assert os.path.exists( filepath )
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173]   File "/usr/local/lib/python3.3/genericpath.py", line 18, in exists
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173]     os.stat(path)
[Sat Jun 08 20:32:44 2013] [error] [client 79.103.41.173] UnicodeEncodeError: 'ascii' codec can't encode characters in position 34-37: ordinal not in range(128)
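
A note on this traceback: the decoding loop itself did not fail. The UnicodeEncodeError is raised inside os.stat(), because os.path.exists() was handed a str path and Python had to encode it with the filesystem encoding, which here is reported as 'ascii'. A likely cause is that the CGI process runs without a UTF-8 locale, but that is an assumption. A minimal sketch of one way around it is to keep every path as bytes from start to finish; the helper names utf8_name and clean_directory are hypothetical and this has not been run against the server in question.

```python
import os


def utf8_name(name_bytes, encodings=('utf-8', 'iso-8859-7', 'latin-1')):
    """Return name_bytes re-encoded as UTF-8, trying each encoding in turn."""
    for encoding in encodings:
        try:
            text = name_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue
        return text.encode('utf-8')
    raise ValueError('unable to clean filename %r' % name_bytes)


def clean_directory(dir_bytes):
    """Rename every non-UTF-8 name inside dir_bytes; return the renamed pairs."""
    renamed = []
    for name in os.listdir(dir_bytes):           # bytes in, bytes out
        new_name = utf8_name(name)
        if new_name != name:
            old_path = os.path.join(dir_bytes, name)
            new_path = os.path.join(dir_bytes, new_name)
            os.rename(old_path, new_path)
            assert os.path.exists(new_path)      # bytes path: no implicit encode step
            renamed.append((name, new_name))
    return renamed
```

Because the assert receives a bytes path, Python passes it to the operating system unchanged and the 'ascii' codec is never involved.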

Re: Changing filenames from Greeklish => Greek (subprocess complain)

Sorry for the delay guys, I was busy with other things today and I am still
reading your responses; I haven't read them all yet, just Cameron's.

Here is what I have now, following Cameron's advice:


#
# Collect filenames of the path directory as bytes
path = b'/home/nikos/public_html/data/apps/'
filenames_bytes = os.listdir( path )

for filename_bytes in filenames_bytes:
    try:
        filename = filename_bytes.decode('utf-8')
    except UnicodeDecodeError:
        # Since it's not a utf8 bytestring then it's for sure a greek bytestring

        # Prepare arguments for rename to happen
        utf8_filename = filename_bytes.encode('utf-8')
        greek_filename = filename_bytes.encode('iso-8859-7')

        utf8_path = path + utf8_filename
        greek_path = path + greek_filename

        # Rename current filename from greek bytes --> utf8 bytes
        os.rename( greek_path, utf8_path )
==

I know this is wrong though.
Since filename_bytes is the current filename encoded as utf8 or greek-iso
then i cannot just *encode* what is already encoded by doing this:

utf8_filename = filename_bytes.encode('utf-8')
greek_filename = filename_bytes.encode('iso-8859-7')

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-08 Thread MRAB


On 08/06/2013 07:49, Νικόλαος Κούρας wrote:

On Saturday, 8 June 2013 5:52:22 AM UTC+3, Cameron Simpson wrote:

On 07Jun2013 11:52, Νίκος Γκρ33κ wrote:

| ni...@superhost.gr [~/www/cgi-bin]# [Fri Jun 07 21:49:33 2013] [error] [client 
79.103.41.173]   File "/home/nikos/public_html/cgi-bin/files.py", line 81

| [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] if( flag == 
'greek' )

| [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] 
^

| [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] SyntaxError: 
invalid syntax

| [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] Premature end of 
script headers: files.py

| ---

| i dont know why that if statement errors.



Python statements that continue (if, while, try etc) end in a colon, so:


Oh, I am very sorry.
Oh my God, I can't believe I missed a colon *again*:

I have corrected this:

#
# Collect filenames of the path dir as bytes
filename_bytes = os.listdir( b'/home/nikos/public_html/data/apps/' )

for filename in filename_bytes:
    # Compute 'path/to/filename' into bytes
    filepath_bytes = b'/home/nikos/public_html/data/apps/' + b'filename'
    flag = False

    try:
        # Assume current file is utf8 encoded
        filepath = filepath_bytes.decode('utf-8')
        flag = 'utf8'
    except UnicodeDecodeError:
        try:
            # Since current filename is not utf8 encoded then it has to be greek-iso encoded
            filepath = filepath_bytes.decode('iso-8859-7')
            flag = 'greek'
        except UnicodeDecodeError:
            print( '''I give up! File name is unreadable!''' )

    if flag == 'greek':
        # Rename filename from greek bytes --> utf-8 bytes
        os.rename( filepath_bytes, filepath.encode('utf-8') )
==

Now everything was supposed to work, but instead I am getting this surrogate
error once more.
What is this surrogate thing?

Since I make use of error catching and handling like 'except
UnicodeDecodeError:', then if utf8's decode fails for some reason, it should
leave that file alone and try the next file?

    try:
        # Assume current file is utf8 encoded
        filepath = filepath_bytes.decode('utf-8')
        flag = 'utf8'
    except UnicodeDecodeError:

This is what it is supposed to do, correct?

==
[Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173]   File "/home/nikos/public_html/cgi-bin/files.py", line 94, in <module>
[Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173]     cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
[Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173]   File "/usr/local/lib/python3.3/site-packages/PyMySQL3-0.5-py3.3.egg/pymysql/cursors.py", line 108, in execute
[Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173]     query = query.encode(charset)
[Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173] UnicodeEncodeError: 'utf-8' codec can't encode character '\udcce' in position 35: surrogates not allowed


Look at the traceback.

It says that the exception was raised by:

query = query.encode(charset)

which was called by:

cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )

But what is 'filename'? And what has it to do with the first code
snippet? Does the traceback have _anything_ to do with the first code
snippet?
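
On the "what is this surrogate thing" question, here is a minimal sketch, assuming the filename in the failing query came from os.listdir() called with a str path (the quoted snippet does not show where it came from). When Python 3 has to turn undecodable filename bytes into a str, it uses the 'surrogateescape' error handler (PEP 383): each such byte becomes a lone surrogate code point in the range U+DC80 to U+DCFF. A string like that cannot be encoded as strict UTF-8, which matches the 'surrogates not allowed' message in the traceback.

```python
raw = b'\xce\xb1'                       # the UTF-8 bytes of the Greek letter alpha

# Roughly what happens to a filename when the filesystem encoding is ASCII:
# bytes >= 0x80 become lone surrogates instead of raising an error.
escaped = raw.decode('ascii', 'surrogateescape')
print(ascii(escaped))

# A strict UTF-8 encode (as a database driver would do) refuses surrogates.
try:
    escaped.encode('utf-8')
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

# Encoding with the same handler recovers the original bytes untouched.
restored = escaped.encode('ascii', 'surrogateescape')
```

Note that byte 0xCE maps to U+DCCE, the same character named in the error message.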


Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-08 Thread Roel Schroeven


Νικόλαος Κούρας wrote:

Session settings, afaik, are for PuTTY to remember hosts to connect to,
not terminal options. I might be wrong though. No matter how many times
I change its options, the next time I run it, it always defaults back.


PuTTY can most definitely remember its settings:
- Start PuTTY; you should get the "PuTTY Configuration" window
- Select a session in the list of sessions
- Click Load
- Change any setting you want to change
- Go back to Session in the Category treeview
- Click Save

HTH

--
"People almost invariably arrive at their beliefs not on the basis of
proof but on the basis of what they find attractive."
-- Pascal Blaise

r...@roelschroeven.net


Re: Changing filenames from Greeklish => Greek (subprocess complain)

On Thu, 06 Jun 2013 23:35:33 -0700, nagia.retsina wrote:

>> Working with bytes is only for when the file names are turned to
>> garbage. Your file names (some of them) are turned to garbage. Fix
>> them, and then use file names as strings.
> 
> Can't. '~/data/apps/' is filled every day with more and more files which
> are uploaded via the FileZilla client, which I think behaves pretty much
> like PuTTY, uploading filenames as greek-iso bytes.

Well, that is certainly a nuisance. Try something like this:

# Untested.

dir = b'/home/nikos/public_html/data/apps/'  # This must be bytes.
files = os.listdir(dir)
for name in files:
    pathname_as_bytes = dir + name
    for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
        try:
            pathname = pathname_as_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue
        # Rename to something valid in UTF-8.
        if encoding != 'utf-8':
            os.rename(pathname_as_bytes, pathname.encode('utf-8'))
        assert os.path.exists(pathname)
        break
    else:
        # This only runs if we never reached the break.
        raise ValueError('unable to clean filename %r'%pathname_as_bytes)

-- 
Steven

Re: Changing filenames from Greeklish => Greek (subprocess complain)

On Sat, Jun 8, 2013 at 5:26 PM, Steven D'Aprano wrote:
> On Fri, 07 Jun 2013 23:49:17 -0700, Νικόλαος Κούρας wrote:
>
> [...]
>> Oh, I am very sorry.
>> Oh my God, I can't believe I missed a colon *again*:
>>
>> I have corrected this:
>
> [snip code]
>
> Stop posting your code after every trivial edit!!!

I think he uses the python-list archives as ersatz source control.

ChrisA

Re: Changing filenames from Greeklish => Greek (subprocess complain)

On Fri, 07 Jun 2013 23:49:17 -0700, Νικόλαος Κούρας wrote:

[...]
> Oh, I am very sorry.
> Oh my God, I can't believe I missed a colon *again*:
> 
> I have corrected this:

[snip code]

Stop posting your code after every trivial edit!!!


-- 
Steven

Re: Changing filenames from Greeklish => Greek (subprocess complain)

On Sat, Jun 8, 2013 at 4:49 PM, Νικόλαος Κούρας wrote:
> Oh my God, I can't believe I missed a colon *again*:

For most Python programmers, this is a matter of moments to solve. Run
the program, get a SyntaxError, fix it. Non-interesting event. (Maybe
even sooner than that, if the editor highlights it for you.) This is
why you really need to start yourself a testbox. DO NOT PLAY ON YOUR
LIVE SYSTEM. This is sysadminning 101. And Python programming 101: The
error traceback points to the error, or just after it.

Get to know how error messages work. This is not even Python-specific.
*Every* language behaves this way. You look at the highlighted line,
if you can't see an error there you look a little bit higher.
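
As a small illustration of reading the error location, the sketch below compiles a made-up three-line snippet whose if statement lacks its colon and asks the resulting SyntaxError where it points. The snippet is hypothetical and the wording of the message varies between Python versions, so only the line number is inspected.

```python
source = (
    "flag = 'greek'\n"
    "if flag == 'greek'\n"          # missing colon, as in the thread
    "    print('rename')\n"
)

try:
    compile(source, '<demo>', 'exec')
    error_line = None
except SyntaxError as err:
    error_line = err.lineno
    print('SyntaxError reported on line', err.lineno, ':', (err.text or '').rstrip())
```

Running the same check on a test machine before uploading would catch this class of error without touching the live site.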

You should not need to beg for help for such trivial problems. This is
the mark of a novice. You ought no longer to be a novice, based on how
long you've been doing this stuff. You ought not to behave like one.

ChrisA

Re: Changing filenames from Greeklish => Greek (subprocess complain)

On Saturday, 8 June 2013 5:52:22 AM UTC+3, Cameron Simpson wrote:
> On 07Jun2013 11:52, Νίκος Γκρ33κ wrote:
> 
> | ni...@superhost.gr [~/www/cgi-bin]# [Fri Jun 07 21:49:33 2013] [error] 
> [client 79.103.41.173]   File "/home/nikos/public_html/cgi-bin/files.py", 
> line 81
> 
> | [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] if( flag == 
> 'greek' )
> 
> | [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173]   
>   ^
> 
> | [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] SyntaxError: 
> invalid syntax
> 
> | [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] Premature end of 
> script headers: files.py
> 
> | ---
> 
> | i dont know why that if statement errors.
> 
> 
> 
> Python statements that continue (if, while, try etc) end in a colon, so:

Oh, I am very sorry.
Oh my God, I can't believe I missed a colon *again*:

I have corrected this:

#
# Collect filenames of the path dir as bytes
filename_bytes = os.listdir( b'/home/nikos/public_html/data/apps/' )

for filename in filename_bytes:
    # Compute 'path/to/filename' into bytes
    filepath_bytes = b'/home/nikos/public_html/data/apps/' + b'filename'
    flag = False

    try:
        # Assume current file is utf8 encoded
        filepath = filepath_bytes.decode('utf-8')
        flag = 'utf8'
    except UnicodeDecodeError:
        try:
            # Since current filename is not utf8 encoded then it has to be greek-iso encoded
            filepath = filepath_bytes.decode('iso-8859-7')
            flag = 'greek'
        except UnicodeDecodeError:
            print( '''I give up! File name is unreadable!''' )

    if flag == 'greek':
        # Rename filename from greek bytes --> utf-8 bytes
        os.rename( filepath_bytes, filepath.encode('utf-8') )
==

Now everything was supposed to work, but instead I am getting this surrogate
error once more.
What is this surrogate thing?

Since I make use of error catching and handling like 'except
UnicodeDecodeError:', then if utf8's decode fails for some reason, it should
leave that file alone and try the next file?

    try:
        # Assume current file is utf8 encoded
        filepath = filepath_bytes.decode('utf-8')
        flag = 'utf8'
    except UnicodeDecodeError:

This is what it is supposed to do, correct?

==
[Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173]   File "/home/nikos/public_html/cgi-bin/files.py", line 94, in <module>
[Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173]     cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
[Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173]   File "/usr/local/lib/python3.3/site-packages/PyMySQL3-0.5-py3.3.egg/pymysql/cursors.py", line 108, in execute
[Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173]     query = query.encode(charset)
[Sat Jun 08 09:39:34 2013] [error] [client 79.103.41.173] UnicodeEncodeError: 'utf-8' codec can't encode character '\udcce' in position 35: surrogates not allowed

Re: Changing filenames from Greeklish => Greek (subprocess complain)

On 07Jun2013 04:53, Νίκος Γκρ33κ wrote:
| On Friday, 7 June 2013 11:53:04 AM UTC+3, Cameron Simpson wrote:
| > | >| errors='replace' means don't break in case of error?
| > 
| > | >Yes. The result will be correct for correct iso-8859-7 and slightly mangled
| > | >for something that would not decode smoothly.
| > 
| > | How can it be correct? We have encoded our string in utf-8 and then
| > | we tried to decode it as greek-iso? How can this possibly be
| > | correct?
| 
| > If it is a valid iso-8859-7 sequence (which might cover everything, 
| > since I expect it is an 8-bit 1:1 mapping from bytes values to a 
| > set of codepoints, just like iso-8859-1) then it may decode to the 
| > "wrong" characters, but the reverse process (characters encoded as
| > bytes) should produce the original bytes.  With a mapping like this, 
| > errors='replace' may mean nothing; there will be no errors because
| > the only Unicode characters in play are all from iso-8859-7 to start
| > with. Of course another string may not be safe. 
| 
| > Visually, the names will be garbage. And if you go:
| >   mv '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ.mp3' '999-Eυχή-του-Ιησού.mp3'
| > while using the iso-8859-7 locale, the wrong thing will occur
| > (assuming it even works, though I think it should because all these
| > characters are represented in iso-8859-7, yes?)
| 
| I understood all the rest; only the quotes above are still unclear to me.
| I can't seem to understand them.
| 
| Do you mean that utf-8, latin-iso, greek-iso and ASCII have the same first
| 0-127 codepoints?

Yes. It is certainly true for utf-8 and latin-iso and ASCII.
I expect it to be so for greek-iso, but have not checked.

They're all essentially the ASCII set plus a range of other character
codepoints for the upper values.  The 8-bit sets iso-8859-1 (which
I take you to mean by "latin-iso") and iso-8859-7 (which I take you
to mean by "greek-iso") are single-byte mappings with the top half
mapped to characters commonly used in a particular region.

Unicode has a much greater range, but the UTF-8 encoding of Unicode
deliberately has the bottom 0-127 identical to ASCII, and higher
values are represented by multibyte sequences in which every byte
is in the 128-255 range. In this way pure ASCII files
are already in UTF-8 (and, in fact, work just fine for the iso-8859-x
encodings as well).

| For example char 'a' has the value of '97' for all of those character sets?
| Is that what you mean?

Yes.

| s = 'a'  (This is unicode, right?  Why, when we assign a string to
| a variable, is that string's type always unicode, and why does it not
| automatically become utf-8, which includes all available world-wide
| characters? Is Unicode something different than a character set?)

In Python 3, yes. Strings are unicode. Note that that means they are
sequences of codepoints whose meaning is as for Unicode.

"utf-8" is a byte encoding for Unicode strings. An external storage
format, if you like. The first 0-127 codepoints are 1:1 with byte
values, and the higher code points require multibyte sequences.
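
The relationship between code points and their UTF-8 byte sequences can be checked directly. This is a small sketch with four arbitrarily chosen sample characters (the choice of samples is mine, not from the thread): it prints each code point, the number of bytes UTF-8 needs for it, and the bytes in hex.

```python
samples = ('a', 'α', '€', '😀')

lengths = []
for ch in samples:
    encoded = ch.encode('utf-8')
    lengths.append(len(encoded))
    # Print only ASCII text so the output is safe on any terminal.
    print('U+%04X' % ord(ch), len(encoded), 'byte(s):', encoded.hex())

# For the non-ASCII samples, every byte of the sequence is >= 0x80.
high_bytes_only = all(
    byte >= 0x80 for ch in samples[1:] for byte in ch.encode('utf-8')
)
```

The ASCII letter stays a single byte identical to its ASCII value, while the others grow to two, three and four bytes.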

| utf8_byte = s.encode('utf-8')

Unicode string => utf-8 byte encoding.

| Now if we are to decode this back to utf8 we will receive the char 'a'.

Yes.

| I believe the same thing will happen with the latin, greek and ascii isos. Correct?
| 
| greek_a = utf8_byte.decode('iso-8859-7')
| latin_a = utf8_byte.decode('iso-8859-1')
| ascii_a = utf8_byte.decode('ascii')
| utf8_a = utf8_byte.decode('utf-8')
| 
| Is this correct? 

Yes, because of the design decision about the 0-127 codepoints.

| All of those decodes will work even if the encoded bytestring was of utf8 type?
| 
| The characters that will not decode correctly are those whose codepoints are greater than 127?
| For example if s = 'α' (greek character equivalent to english 'a')
| Is this what you mean?

Yes, exactly so.

| 
| 
| Now back to my almost ready files.py script please:
| 
| 
| #
| # Collect filenames of the path dir as bytes
| greek_filenames = os.listdir( b'/home/nikos/public_html/data/apps/' )
| 
| for filename in greek_filenames:
|   # Compute 'path/to/filename' in bytes
|   greek_path = b'/home/nikos/public_html/data/apps/' + b'filename'

You don't mean b'filename', which is the literal word "filename".
You mean the variable filename, which is already bytes because os.listdir()
was given a bytes path.

More probably, you mean:

  dirpath = b'/home/nikos/public_html/data/apps/'
  greek_filenames = os.listdir(dirpath)
  for greek_filename in greek_filenames:
    try:
      filename = greek_filename.decode('iso-8859-7')

and then:

  greek_path = dirpath + greek_filename
  utf8_filename = filename.encode('utf-8')
  utf8_path = dirpath + utf8_filename

|   try:
|   filepath = greek_path.decode('iso-8859-7')
|   # Rename current filename from greek bytes --> utf-8 bytes
|   os.rename( greek_path, filepath.e

Re: Changing filenames from Greeklish => Greek (subprocess complain)

On 07Jun2013 11:52, Νίκος Γκρ33κ wrote:
| ni...@superhost.gr [~/www/cgi-bin]# [Fri Jun 07 21:49:33 2013] [error] 
[client 79.103.41.173]   File "/home/nikos/public_html/cgi-bin/files.py", line 
81
| [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] if( flag == 
'greek' )
| [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] 
^
| [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] SyntaxError: 
invalid syntax
| [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] Premature end of 
script headers: files.py
| ---
| i dont know why that if statement errors.

Python statements that continue (if, while, try etc) end in a colon, so:

  if flag == 'greek':

Cheers,
-- 
Cameron Simpson 

Hello, my name is Yog-Sothoth, and I'll be your eldritch horror today.
- Heather Keith 

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-07 Thread Zero Piraeus

:

On 7 June 2013 16:45, MRAB  wrote:
> On 07/06/2013 20:31, Zero Piraeus wrote:
>> [something exasperated, in capitals]
>
> Have you noticed how the line in the traceback doesn't match the line
> in the post?

Actually, I hadn't. It's not exactly a surprise at this point, though ...

I learnt a new word today, while searching for an apt ending to the
sentence "Reading Nikos' posts is the internet equivalent of ..."

... and that word is Dermatillomania.

 -[]z.

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-07 Thread MRAB


On 07/06/2013 20:31, Zero Piraeus wrote:

:

On 7 June 2013 14:52, Νικόλαος Κούρας  wrote:
File "/home/nikos/public_html/cgi-bin/files.py", line 81

[Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] if( flag == 
'greek' )
[Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173]   
  ^
[Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] SyntaxError: invalid 
syntax
[Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] Premature end of 
script headers: files.py
---
i dont know why that if statement errors.


Oh for f... READ SOME DOCUMENTATION, FOR THE LOVE OF BOB!!! READ YOUR
OWN EFFING CODE!

Look at this:

   http://docs.python.org/2/tutorial/controlflow.html

Read it now? Of course not. Go away and read it.

Now have you read it? GO AND READ IT.

What does an if statement end with? Hint: yep, that's it.


Have you noticed how the line in the traceback doesn't match the line
in the post?

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-07 Thread Zero Piraeus

:

On 7 June 2013 14:52, Νικόλαος Κούρας  wrote:
File "/home/nikos/public_html/cgi-bin/files.py", line 81
> [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] if( flag == 
> 'greek' )
> [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] 
> ^
> [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] SyntaxError: 
> invalid syntax
> [Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] Premature end of 
> script headers: files.py
> ---
> i dont know why that if statement errors.

Oh for f... READ SOME DOCUMENTATION, FOR THE LOVE OF BOB!!! READ YOUR
OWN EFFING CODE!

Look at this:

  http://docs.python.org/2/tutorial/controlflow.html

Read it now? Of course not. Go away and read it.

Now have you read it? GO AND READ IT.

What does an if statement end with? Hint: yep, that's it.

 -[]z.

Re: Changing filenames from Greeklish => Greek (subprocess complain)

On Friday, 7 June 2013 5:29:25 PM UTC+3, MRAB wrote:

> This is a worse way of doing it because the ISO-8859-7 encoding has 1
> byte per codepoint, meaning that it's more 'tolerant' (if that's the 
> word) of errors. A sequence of bytes that is actually UTF-8 can be
> decoded as ISO-8859-7, giving gibberish.

> UTF-8 is less tolerant, and it's the encoding that ideally you should 
> be using everywhere, so it's better to assume UTF-8 and, if it fails,  
> try ISO-8859-7 and then rename so that any names that were ISO-8859-7
> will be converted to UTF-8.

Indeed, I wasn't aware of that. At that time I was under the impression that
if a string was encoded to bytes using some charset, it could only be switched
back with the use of that and only that charset. Since this is the case, here
is my fix:


#
# Collect filenames of the path dir as bytes
filename_bytes = os.listdir( b'/home/nikos/public_html/data/apps/' )

for filename in filename_bytes:
    # Compute 'path/to/filename' into bytes
    filepath_bytes = b'/home/nikos/public_html/data/apps/' + b'filename'
    flag = False

    try:
        # Assume current file is utf8 encoded
        filepath = filepath_bytes.decode('utf-8')
        flag = 'utf8'
    except UnicodeDecodeError:
        try:
            # Since current filename is not utf8 encoded then it has to be greek-iso encoded
            filepath = filepath_bytes.decode('iso-8859-7')
            flag = 'greek'
        except UnicodeDecodeError:
            print( '''I give up! File name is unreadable!''' )

    if( flag = 'greek' )
        # Rename filename from greek bytes --> utf-8 bytes
        os.rename( filepath_bytes, filepath.encode('utf-8') )


#
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for filename in filenames:
    try:
        # Check the presence of a file against the database and insert if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', filename )
        data = cur.fetchone()

        if not data:
            # First time for file; primary key is automatic, hit is defaulted
            cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
    except pymysql.ProgrammingError as e:
        print( repr(e) )


#
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )
filepaths = set()

# Build a set of 'path/to/filename' based on the objects of path dir
for filename in filenames:
    filepaths.add( filename )

# Delete spurious
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()

# Check database's filenames against path's filenames
for rec in data:
    if rec not in filepaths:
        cur.execute('''DELETE FROM files WHERE url = %s''', rec )

=
ni...@superhost.gr [~/www/cgi-bin]# [Fri Jun 07 21:49:33 2013] [error] [client 
79.103.41.173]   File "/home/nikos/public_html/cgi-bin/files.py", line 81
[Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] if( flag == 
'greek' )
[Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173]   
  ^
[Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] SyntaxError: invalid 
syntax
[Fri Jun 07 21:49:33 2013] [error] [client 79.103.41.173] Premature end of 
script headers: files.py
---
i dont know why that if statement errors.

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-07 Thread Steven D'Aprano

On Fri, 07 Jun 2013 04:53:42 -0700, Νικόλαος Κούρας wrote:

> Do you mean that utf-8, latin-iso, greek-iso and ASCII have the 1st
> 0-127 codepoints similar?

You can answer this yourself. Open a terminal window and start a Python 
interactive session. Then try it and see what happens:

s = ''.join(chr(i) for i in range(128))
bytes_as_utf8 = s.encode('utf-8')
bytes_as_latin1 = s.encode('latin-1')
bytes_as_greek_iso = s.encode('ISO-8859-7')
bytes_as_ascii = s.encode('ascii')

bytes_as_utf8 == bytes_as_latin1 == bytes_as_greek_iso == bytes_as_ascii

What result do you get? True or False?

And now you know the answer, without having to ask.

> For example char 'a' has the value of '97' for all of those character
> sets? Is that what you mean?

You can answer that question yourself.

c = 'a'
for encoding in ('utf-8', 'latin-1', 'ISO-8859-7', 'ascii'):
    print(c.encode(encoding))

By the way, I believe that Python has made a strategic mistake in the way 
that bytes are printed. I think it leads to more confusion, not less. 
Better would be something like this:

c = 'a'
for encoding in ('utf-8', 'latin-1', 'ISO-8859-7', 'ascii'):
    print(hex(c.encode(encoding)[0]))

For historical reasons, most (but not all) charsets are supersets of 
ASCII. That is, the first 128 characters in the charset are the same as 
the 128 characters in ASCII.

> s = 'a'  (This is unicode right?  Why when we assign a string to a
> variable that string's type is always unicode 

Strings in Python 3 are Unicode strings. That's just the way Python 
works. Unicode was chosen because Unicode includes over a million 
different characters (well, potentially over a million, most of them are 
currently unused), and is a strict superset of *all* common legacy 
codepages from the old DOS and Windows 95 days.

> and does not automatically
> become utf-8 which includes all available world-wide characters? Unicode
> is something different that a character set? )

Unicode is a character set. It is an enormous set of over one million 
characters (technically "code points", but don't worry about the 
difference right now) which can be collected in strings.

UTF-8 is an encoding that goes from a string using the Unicode character 
set into bytes, and back again. Sometimes, people are lazy and say 
"UTF-8" when they mean "Unicode", or vice versa. 

UTF-16 and UTF-32 are two different encodings for the same purpose, but 
for various technical reasons UTF-8 is better for files.

'λ' is a character which exists in some charsets but not others. It is 
not in the ASCII charset, nor is it in Latin-1, nor Big-5. It is in the 
ISO-8859-7 charset, and of course it is in Unicode.

In ISO-8859-7, the character 'λ' is stored as byte 0xEB (decimal 235), 
just as the character 'a' is stored as byte 0x61 (decimal 97).

In UTF-8, the character λ is stored as two bytes 0xCE 0xBB.

In UTF-16 (big-endian), the character λ is stored as two bytes 0x03 0xBB.

In UTF-32 (big-endian), the character λ is stored as four bytes 0x00 0x00 
0x03 0xBB.

That's four different ways of "spelling" the same character as bytes, 
just as "three", "trois", "drei", "τρία", "três" are all different ways 
of spelling the same number 3.

> utf8_byte = s.encode('utf-8')
> 
> Now if we are to decode this back to utf8 we will receive the char 'a'.
> I beleive same thing will happen with latin, greek, ascii isos. Correct?

Why don't you try it for yourself and see?

> The characters that will not decode correctly are those that their
> codepoints are greater that > 127 ?

Maybe, maybe not. It depends on which codepoint, and which encodings. 
Some encodings use the same bytes for the same characters. Some encodings 
use different bytes. It all depends on the encoding, just like American 
and English both spell 3 "three", while French spells it "trois".

> for example if s = 'α' (greek character equivalent to english 'a')

In Latin-1, 'α' does not exist:

py> 'α'.encode('latin-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u03b1' in 
position 0: ordinal not in range(256)

In the legacy Greek charset, ISO-8859-7, 'α' is stored as byte 0xE1:

py> 'α'.encode('ISO-8859-7')
b'\xe1'

But in the legacy *Cyrillic* charset, ISO-8859-5, the byte 0xE1 means 
a completely different character, CYRILLIC SMALL LETTER ES:

py> b'\xE1'.decode('ISO-8859-5')
'с'

(don't be fooled that this looks like the English c, it is not the same).

In Unicode, 'α' is always codepoint 0x3B1 (decimal 945):

py> ord('α')
945

but before you can store that on a disk, or as a file name, it needs to 
be converted to bytes, and which bytes you get depends on which encoding 
you use:

py> 'α'.encode('utf-8')
b'\xce\xb1'

py> 'α'.encode('utf-16be')
b'\x03\xb1'

py> 'α'.encode('utf-32be')
b'\x00\x00\x03\xb1'

-- 
Steven

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-07 Thread MRAB


On 07/06/2013 12:53, Νικόλαος Κούρας wrote:
[snip]


#
# Collect filenames of the path dir as bytes
greek_filenames = os.listdir( b'/home/nikos/public_html/data/apps/' )

for filename in greek_filenames:
    # Compute 'path/to/filename' in bytes
    greek_path = b'/home/nikos/public_html/data/apps/' + b'filename'
    try:


This is a worse way of doing it because the ISO-8859-7 encoding has 1
byte per codepoint, meaning that it's more 'tolerant' (if that's the
word) of errors. A sequence of bytes that is actually UTF-8 can be
decoded as ISO-8859-7, giving gibberish.

UTF-8 is less tolerant, and it's the encoding that ideally you should
be using everywhere, so it's better to assume UTF-8 and, if it fails,
try ISO-8859-7 and then rename so that any names that were ISO-8859-7
will be converted to UTF-8.

That's the reason I did it that way in the code I posted, but, yet
again, you've changed it without understanding why!
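
The order described above (assume UTF-8 first, fall back to ISO-8859-7, then rename) could be sketched roughly as follows. This is not the code originally posted in the thread; the function name `normalise_names` is hypothetical, and it assumes a POSIX system where file names are plain bytes.

```python
import os

def normalise_names(dirpath):
    """Rename ISO-8859-7 file names in dirpath (a bytes path) to UTF-8.

    Returns a list of (old_name, new_name) byte strings that were renamed.
    """
    renamed = []
    for name in os.listdir(dirpath):      # bytes in, bytes out
        try:
            name.decode('utf-8')          # already valid UTF-8: leave alone
            continue
        except UnicodeDecodeError:
            pass
        try:
            text = name.decode('iso-8859-7')
        except UnicodeDecodeError:
            continue                      # neither encoding fits; skip it
        new_name = text.encode('utf-8')
        os.rename(os.path.join(dirpath, name),
                  os.path.join(dirpath, new_name))
        renamed.append((name, new_name))
    return renamed
```

Note that the path is built with `os.path.join(dirpath, name)`, using the loop variable rather than a literal such as `b'filename'`.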


        filepath = greek_path.decode('iso-8859-7')

        # Rename current filename from greek bytes --> utf-8 bytes
        os.rename( greek_path, filepath.encode('utf-8') )
    except UnicodeDecodeError:
        # Since its not a greek bytestring then its a proper utf8 bytestring
        filepath = greek_path.decode('utf-8')


[snip]


Re: Changing filenames from Greeklish => Greek (subprocess complain)

On Friday, 7 June 2013 11:53:04 a.m. UTC+3, Cameron Simpson wrote:
> On 07Jun2013 09:56, Νίκος Γκρ33κ wrote:
> | On 7/6/2013 4:01 am, Cameron Simpson wrote:
> | >On 06Jun2013 11:46, Νίκος Γκρ33κ wrote:
> | >| On Thursday, 6 June 2013 3:44:52 p.m. UTC+3, Steven D'Aprano wrote:
> | >| > py> s = '999-Eυχή-του-Ιησού'
> | >| > py> bytes_as_utf8 = s.encode('utf-8')
> | >| > py> t = bytes_as_utf8.decode('iso-8859-7', errors='replace')
> | >| > py> print(t)
> | >| > 999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ
> | >|
> | >| errors='replace' means don't break in case of error?
> | >
> | >Yes. The result will be correct for correct iso-8859-7 and slightly mangled
> | >for something that would not decode smoothly.
> |
> | How can it be correct? We have encoded our string in utf-8 and then
> | we tried to decode it as greek-iso? How can this possibly be
> | correct?

> If it is a valid iso-8859-7 sequence (which might cover everything, 
> since I expect it is an 8-bit 1:1 mapping from bytes values to a 
> set of codepoints, just like iso-8859-1) then it may decode to the 
> "wrong" characters, but the reverse process (characters encoded as
> bytes) should produce the original bytes.  With a mapping like this, 
> errors='replace' may mean nothing; there will be no errors because
> the only Unicode characters in play are all from iso-8859-7 to start
> with. Of course another string may not be safe. 

> Visually, the names will be garbage. And if you go:
>   mv '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ.mp3' '999-Eυχή-του-Ιησού.mp3'
> while using the iso-8859-7 locale, the wrong thing will occur
> (assuming it even works, though I think it should because all these
> characters are represented in iso-8859-7, yes?)

I understood all the rest; only the quotes above are still unclear to me.
I can't seem to understand them.

Do you mean that utf-8, latin-iso, greek-iso and ASCII share the same first
128 codepoints (0-127)?

For example, the char 'a' has the value 97 in all of those character sets?
Is that what you mean?

s = 'a'  (This is Unicode, right? Why is a string assigned to a variable
always of type Unicode, rather than automatically becoming utf-8, which
includes all the characters available world-wide? Is Unicode something
different from a character set?)

utf8_byte = s.encode('utf-8')

Now if we decode this back from utf-8 we will receive the char 'a'.
I believe the same thing will happen with the latin, greek and ascii charsets. Correct?

greek_a = utf8_byte.decode('iso-8859-7')
latin_a = utf8_byte.decode('iso-8859-1')
ascii_a = utf8_byte.decode('ascii')
utf8_a = utf8_byte.decode('utf-8')

Is this correct?
Will all of those decodes work even though the bytestring was encoded as utf-8?

Are the characters that will not decode correctly those whose codepoints
are greater than 127?

For example, if s = 'α' (the Greek character equivalent to the English 'a')

Is this what you mean?


Now back to my almost ready files.py script please:


#
# Collect filenames of the path dir as bytes
greek_filenames = os.listdir( b'/home/nikos/public_html/data/apps/' )

for filename in greek_filenames:
    # Compute 'path/to/filename' in bytes
    greek_path = b'/home/nikos/public_html/data/apps/' + b'filename'
    try:
        filepath = greek_path.decode('iso-8859-7')

        # Rename current filename from greek bytes --> utf-8 bytes
        os.rename( greek_path, filepath.encode('utf-8') )
    except UnicodeDecodeError:
        # Since its not a greek bytestring then its a proper utf8 bytestring
        filepath = greek_path.decode('utf-8')


#
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for filename in filenames:
    try:
        # Check the presence of a file against the database and insert if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', filename )
        data = cur.fetchone()

        if not data:
            # First time for file; primary key is automatic, hit is defaulted
            cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
    except pymysql.ProgrammingError as e:
        print( repr(e) )


#
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )
filepaths = ()

# Build a set of 'path/to/filename' based on the objects of path dir
for filename in filenames:
    filepaths.add( filename )

# Delete spurious
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()

# Check database's filenames against path's filen

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-07 Thread alex23

On Jun 7, 6:53 pm, Cameron Simpson  wrote:
>   Experiment:
>
>     LC_ALL=C ls -b
>     LC_ALL=utf-8 ls -b
>     LC_ALL=iso-8859-7 ls -b
>
>   And the Terminal itself is decoding the output for display, and
>   encoding your input keystrokes to feed as input to the command
>   line.

This reminded me of something I saw on stackoverflow recently:
http://stackoverflow.com/questions/11735363/python3-unicodeencodeerror-only-when-run-from-crontab

Script would run from shell but not from crontab, as the crontab
environment had different locale settings. Solution was to prepend the
correct LC_CTYPE to the command in the crontab. Would it be similar
for httpd processes?
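
One way to find out is to have the script report what it sees. The sketch below prints the encodings Python derives from the process environment; running it once from a login shell and once from cron or as a CGI script would show whether they differ. Whether httpd really behaves like cron here is not confirmed by this thread.

```python
import locale
import sys

# Encoding Python derives from the process's locale settings
# (LANG / LC_ALL / LC_CTYPE in the environment).
locale_encoding = locale.getpreferredencoding(False)

# Encoding Python uses to turn str file names into bytes and back.
fs_encoding = sys.getfilesystemencoding()

print('locale encoding:    ', locale_encoding)
print('filesystem encoding:', fs_encoding)
```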

Re: Changing filenames from Greeklish => Greek (subprocess complain)

On 07Jun2013 09:56, Νίκος Γκρ33κ wrote:
| On 7/6/2013 4:01 am, Cameron Simpson wrote:
| >On 06Jun2013 11:46, Νίκος Γκρ33κ wrote:
| >| On Thursday, 6 June 2013 3:44:52 p.m. UTC+3, Steven D'Aprano wrote:
| >| > py> s = '999-Eυχή-του-Ιησού'
| >| > py> bytes_as_utf8 = s.encode('utf-8')
| >| > py> t = bytes_as_utf8.decode('iso-8859-7', errors='replace')
| >| > py> print(t)
| >| > 999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ
| >|
| >| errors='replace' means don't break in case of error?
| >
| >Yes. The result will be correct for correct iso-8859-7 and slightly mangled
| >for something that would not decode smoothly.
|
| How can it be correct? We have encoded our string in utf-8 and then
| we tried to decode it as greek-iso? How can this possibly be
| correct?

Ok, not correct. But consistent. Safe to call.

If it is a valid iso-8859-7 sequence (which might cover everything,
since I expect it is an 8-bit 1:1 mapping from bytes values to a
set of codepoints, just like iso-8859-1) then it may decode to the
"wrong" characters, but the reverse process (characters encoded as
bytes) should produce the original bytes.  With a mapping like this,
errors='replace' may mean nothing; there will be no errors because
the only Unicode characters in play are all from iso-8859-7 to start
with. Of course another string may not be safe.
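
The difference between a "safe" and an "unsafe" string can be sketched as follows. The sample strings are chosen for illustration, and the sketch relies on Python's iso-8859-7 codec leaving byte 0xAE undefined, which is consistent with the replacement character in the output quoted above.

```python
# UTF-8 bytes for a name whose bytes are all defined in ISO-8859-7.
original = 'του'.encode('utf-8')
wrong_text = original.decode('iso-8859-7')     # gibberish, but no error
round_trip = wrong_text.encode('iso-8859-7')   # back to the same bytes

# UTF-8 bytes containing 0xAE, which the iso-8859-7 codec does not define.
risky = 'ή'.encode('utf-8')                    # b'\xce\xae'
try:
    risky.decode('iso-8859-7')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With errors='replace' the decode succeeds, but information is lost.
lossy_text = risky.decode('iso-8859-7', errors='replace')
try:
    lossy_text.encode('iso-8859-7')
    reversible = True
except UnicodeEncodeError:
    reversible = False

print(round_trip == original, strict_ok, reversible)
```

So the wrong-charset decode is reversible only while every byte maps to some character; once a replacement character appears, the original bytes cannot be recovered from the text.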

| >| You took the unicode 's' string you utf-8 bytestringed it.
| >| Then how is it possible to ask for the utf8-bytestring to decode
| >| back to a unicode string using a different charset than the
| >| one used for encoding, and this actually printed the filename in
| >| greek-iso?
| >
| >It is easily possible, as shown above. Does it make sense? Normally
| >not, but Steven is demonstrating how your "mv" exercises have
| >behaved: a rename using utf-8, then a _display_ using iso-8859-7.
|
| Same as above, I don't understand it at all, since different
| charsets (encodings) are used in the encode/decode process.

Visually, the names will be garbage. And if you go:

  mv '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ.mp3' '999-Eυχή-του-Ιησού.mp3'

while using the iso-8859-7 locale, the wrong thing will occur
(assuming it even works, though I think it should because all these
characters are represented in iso-8859-7, yes?)

Why?

In the iso-8859-7 locale, your file (currently named under a utf-8
regime) looks like '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ.mp3' (because the
UTF-8 byte sequence maps to those characters in iso-8859-7). When
you issue the above "mv" command, the new name will be the _iso-8859-7_
bytes encoding for '999-Eυχή-του-Ιησού.mp3', which under a utf-8
regime will decode to _other_ characters.

If you want to repair filenames, by which I mean, cause them to be correctly
encoded for utf-8, you are best to work in utf-8 (using "mv" or python).

Of course, the badly named files will then look wrong in your listing.

If you _know_ the filenames were written using iso-8859-7 encoding, and that 
the names are "right" under that encoding, you can write python code to rename 
them to utf-8.

Totally untested example code:

  import os
  import sys
  from binascii import hexlify

  for bytename in os.listdir( b'.' ):
    # Interpret the on-disk bytes as iso-8859-7, then re-encode as utf-8
    unicode_name = bytename.decode('iso-8859-7')
    new_bytename = unicode_name.encode('utf-8')
    print("%s: %s => %s" % (unicode_name, hexlify(bytename),
                            hexlify(new_bytename)), file=sys.stderr)
    os.rename(bytename, new_bytename)

That code should not care what locale you are using because it uses
bytes for the file calls and is explicit about the encoding/decoding
steps.

| >| a) WHAT does it mean when a linux system is set to use utf-8?
| >
| >It means the locale settings _for the current process_ are set for
| >UTF-8. The "locale" command will show you the current state.
|
| That means that, when a linux application needs to save a filename
| to the linux filesystem, the app checks the filesystem's 'locale', so as
| to encode the filename using the utf-8 charset?

At the command line, many will not. They'll just read and write bytes.

Some will decode/encode. Those that do, should by default use the
current locale.

But broadly, it is GUI apps that care about this because they must
translate byte sequences to glyphs: images of characters. So plenty
of command line tools do not need to care; the terminal application
is the one that presents the names to you; _it_ will decode them
for display. And it is the terminal app that translates your
keystrokes into bytes to feed to the command line.

NOTE: it is NOT the filesystem's locale. It is the current process'
locale, which is deduced from environment variables (which have
defaults if they are not set).

Under Windows I believe filesystems have locales; this can prevent
you storing some files on some filesystems on Windows, because the
filesystem doesn't cope. UNIX just takes bytes.

| And likewise when a linux application wants to decode a filename is
| also checking the filesystem's 'locale' setting so to know what
|

Re: Changing filenames from Greeklish => Greek (subprocess complain)

On 07Jun2013 11:10, Νίκος Γκρ33κ wrote:
| On 7/6/2013 10:42 am, Michael Weylandt wrote:
| >os.rename( filepath_bytes filepath.encode('utf-8')
| >Missing comma, which is, after all, just a matter of syntax so it can't
| >matter, right?
|
| I doubted that os.rename arguments must be comma separated.

Why?

Every other python function separates arguments with commas.

| 'mv source target' didn't require commas so I thought it was safe to assume
| that os.rename did not either.

"mv" is shell syntax.
os.rename is Python syntax.

Two totally separate languages.
-- 
Cameron Simpson 

Cynic, n. A blackguard whose faulty vision sees things as they are, not as
they ought to be.
Ambrose Bierce (1842-1914), U.S. author. The Devil's Dictionary (1881-1906).

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-07 Thread R. Michael Weylandt

On Fri, Jun 7, 2013 at 9:10 AM, Νικόλαος Κούρας  wrote:
> On 7/6/2013 10:42 am, Michael Weylandt wrote:
>
>>> os.rename( filepath_bytes filepath.encode('utf-8')
>
>> Missing comma, which is, after all, just a matter of syntax so it can't
>> matter, right?
>
> I doubted that os.rename arguments must be comma separated.

All function calls in Python require commas if you are putting in more
than one argument. [0]

> But after reading the docs:
>
> os.rename(src, dst)
>
> Rename the file or directory src to dst. If dst is a directory, OSError will
> be raised. On Unix, if dst exists and is a file, it will be replaced
> silently if the user has permission. The operation may fail on some Unix
> flavors if src and dst are on different filesystems. If successful, the
> renaming will be an atomic operation (this is a POSIX requirement). On
> Windows, if dst already exists, OSError will be raised even if it is a file;
> there may be no way to implement an atomic rename when dst names an existing
> file.
>
> Availability: Unix, Windows.
>
> Indeed it has to be:
>
> os.rename( filepath_bytes, filepath.encode('utf-8')

Parenthesis missing here as well.

>
> 'mv source target' didn't require commas so I thought it was safe to assume
> that os.rename did not either.
>

That's for shell programming -- different language entirely.

The surrogate business is back to Unicode, which ain't my specialty so
I'll leave that to more able programmers.

MW

[0] You could pass multiple arguments by way of a tuple or dictionary
using */** but if you want arguments that aren't in the container
being passed, you're back to needing commas.
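
A small sketch of the point about commas and */** unpacking. The function `rename_plan` is a hypothetical stand-in for any two-argument function such as os.rename, so that nothing on disk is touched.

```python
def rename_plan(src, dst):
    """Hypothetical stand-in for a two-argument function such as os.rename."""
    return (src, dst)

# Two arguments, separated by a comma.
direct = rename_plan(b'old-name', b'new-name')

# The same two arguments supplied from a tuple or from a dict.
from_tuple = rename_plan(*(b'old-name', b'new-name'))
from_dict = rename_plan(**{'src': b'old-name', 'dst': b'new-name'})

# Leaving the comma out is rejected before the code ever runs.
try:
    compile("rename_plan(old_name new_name)", '<example>', 'exec')
    missing_comma_ok = True
except SyntaxError:
    missing_comma_ok = False

print(direct == from_tuple == from_dict, missing_comma_ok)
```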

Re: Changing filenames from Greeklish => Greek (subprocess complain)


On 7/6/2013 10:42 am, Michael Weylandt wrote:


os.rename( filepath_bytes filepath.encode('utf-8')
Missing comma, which is, after all, just a matter of syntax so it can't matter, 
right?


I doubted that os.rename arguments must be comma separated.
But after reading the docs:

os.rename(src, dst)

   Rename the file or directory src to dst. If dst is a directory, OSError
   will be raised. On Unix, if dst exists and is a file, it will be replaced
   silently if the user has permission. The operation may fail on some
   Unix flavors if src and dst are on different filesystems. If
   successful, the renaming will be an atomic operation (this is a
   POSIX requirement). On Windows, if dst already exists, OSError will
   be raised even if it is a file; there may be no way to implement an
   atomic rename when dst names an existing file.

   Availability: Unix, Windows.

Indeed it has to be:

os.rename( filepath_bytes, filepath.encode('utf-8')

'mv source target' didn't require commas so I thought it was safe to assume that 
os.rename did not either.


I'm happy to announce that after correcting many idiotic errors like commas, 
missing colons and declaration of variables, this surrogate error is the last one I 
get.
I still don't understand what surrogate means. In English it means replacement.
Here is the code:


#
# Collect filenames of the path dir as bytes
filename_bytes = os.listdir( b'/home/nikos/public_html/data/apps/' )

# Iterate over all filenames in the path dir
for filename in filename_bytes:
    # Compute 'path/to/filename' in bytes
    filepath_bytes = b'/home/nikos/public_html/data/apps/' + b'filename'
    try:
        filepath = filepath_bytes.decode('utf-8')
    except UnicodeDecodeError:
        try:
            filepath = filepath_bytes.decode('iso-8859-7')

            # Rename current filename from greek bytes => utf-8 bytes
            os.rename( filepath_bytes, filepath.encode('utf-8') )
        except UnicodeDecodeError:
            print( '''I give up! This filename is unreadable! ''')


#
# Get filenames of the apps directory as unicode
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for filename in filenames:
    try:
        # Check the presence of a file against the database and insert if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
        data = cur.fetchone()        # filename is unique, so should only be one

        if not data:
            # First time for file; primary key is automatic, hit is defaulted
            cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
    except pymysql.ProgrammingError as e:
        print( repr(e) )


#
filenames = os.listdir( '/home/nikos/public_html/data/apps/' )
filenames = ()

# Build a set of 'path/to/filename' based on the objects of path dir
for filename in filenames:
    filenames.add( filename )

# Delete spurious
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()

# Check database's filenames against path's filenames
for filename in data:
    if filename not in filenames:
        cur.execute('''DELETE FROM files WHERE url = %s''', (filename,) )



=

[Fri Jun 07 11:08:17 2013] [error] [client 79.103.41.173]   File 
"/home/nikos/public_html/cgi-bin/files.py", line 88, in 
[Fri Jun 07 11:08:17 2013] [error] [client 79.103.41.173] 
cur.execute('''SELECT url FROM files WHERE url = %s''', filename )
[Fri Jun 07 11:08:17 2013] [error] [client 79.103.41.173]   File 
"/usr/local/lib/python3.3/site-packages/PyMySQL3-0.5-py3.3.egg/pymysql/cursors.py",
 line 108, in execute
[Fri Jun 07 11:08:17 2013] [error] [client 79.103.41.173] query = 
query.encode(charset)
[Fri Jun 07 11:08:17 2013] [error] [client 79.103.41.173] UnicodeEncodeError: 
'utf-8' codec can't encode character '\\udcce' in position 35: surrogates not 
allowed
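
A note on the "surrogates not allowed" message: when Python 3 decodes a file name that is not valid in the expected encoding, it uses the 'surrogateescape' error handler, which turns each undecodable byte into a lone surrogate code point. The sketch below reproduces that with an explicit decode; the sample name is made up, and this is only a likely explanation of the log above, not a confirmed diagnosis.

```python
# A file name stored on disk as ISO-8859-7 bytes ('α.mp3').
raw_name = 'α.mp3'.encode('iso-8859-7')           # b'\xe1.mp3'

# Decoding it as UTF-8 with 'surrogateescape' does not fail; the
# undecodable byte 0xE1 becomes the lone surrogate U+DCE1.
as_text = raw_name.decode('utf-8', errors='surrogateescape')

# A strict UTF-8 encode of that string (what a database driver would
# typically attempt) is refused.
reason = None
try:
    as_text.encode('utf-8')
    strict_encode_ok = True
except UnicodeEncodeError as exc:
    strict_encode_ok = False
    reason = exc.reason

# Encoding with the same handler recovers the original bytes.
recovered = as_text.encode('utf-8', errors='surrogateescape')

print(repr(as_text), strict_encode_ok, reason, recovered)
```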



--
Webhost && Weblog 

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-07 Thread Michael Weylandt



On Jun 7, 2013, at 8:32, Νικόλαος Κούρας  wrote:

> On Friday, 7 June 2013 10:09:29 a.m. UTC+3, Lele Gaifax wrote:
> 
>> As already explained, often a SyntaxError is introduced by *preceding*
>> "text", so you must look at your code with a "wider eye".
> 
> That's what I hate about error reporting. You have a syntax error someplace 
> and the error report points you at another line, so you have to check the whole code again.
> Well, I just did; I see no syntactical errors.
> 
>> Yes: and that usually implies that the *function* accepts (at least) *two*
>> arguments, specifically the source and the target names, right? How many
>> arguments are you actually giving to the os.rename() function above?
> 
> I'm giving it two.
> os.rename( filepath_bytes filepath.encode('utf-8') 

Missing comma, which is, after all, just a matter of syntax so it can't matter, 
right?


> 
> 1st = filepath_bytes
> 2nd = filepath.encode('utf-8')
> 
> Source and Target respectively.

Re: Changing filenames from Greeklish => Greek (subprocess complain)

On Friday, 7 June 2013 10:09:29 a.m. UTC+3, Lele Gaifax wrote:

> As already explained, often a SyntaxError is introduced by *preceding*
> "text", so you must look at your code with a "wider eye".

That's what I hate about error reporting. You have a syntax error someplace 
and the error report points you at another line, so you have to check the whole code again.
Well, I just did; I see no syntactical errors.

> Yes: and that usually implies that the *function* accepts (at least) *two*
> arguments, specifically the source and the target names, right? How many
> arguments are you actually giving to the os.rename() function above?

I'm giving it two.
os.rename( filepath_bytes filepath.encode('utf-8') )

1st = filepath_bytes
2nd = filepath.encode('utf-8')

Source and Target respectively.

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-07 Thread Chris Angelico

On Fri, Jun 7, 2013 at 5:08 PM, Νικόλαος Κούρας  wrote:
> I'll google Traal right now.

The one thing you're actually willing to go research, and it's
actually something that won't help you. Traal is the name of my
personal laptop. Spend your Googletrons on something else. :)

ChrisA

Re: Changing filenames from Greeklish => Greek (subprocess complain)

On Friday, 7 June 2013 9:46:53 a.m. UTC+3, Chris Angelico wrote:
> On Fri, Jun 7, 2013 at 4:35 PM,   wrote:
> > Yes, but 'putty' always seems to forget when I tell it to use utf8 for
> > displaying, and always picks up Win8's default charset, and it doesn't
> > have a save options dialog. I can't always remember to switch to the utf8
> > charset, or keep renaming so many Greek filenames from the terminal.
>
> I use PuTTY too (though that'll change when I next upgrade Traal, as
> I'll no longer have any Windows clients), and it's set to UTF-8 in the
> Window|Translation page. Far as I know, those settings are all saved
> into the Saved Sessions settings, back on the Session page.
>
> ChrisA


Session settings, afaik, are for putty to remember hosts to connect to, not 
terminal options. I might be wrong though. No matter how many times I change 
its options, the next time I run it, it always defaults back.

I'll google Traal right now.
You should also take a look at the 'Secure Shell' extension for Chrome that I 
just found.

It seems a great plugin for Chrome. You'll definitely like it, I did!

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-07 Thread Lele Gaifax

nagia.rets...@gmail.com writes:

>   File "files.py", line 75
> os.rename( filepath_bytes filepath.encode('utf-8') )
>  ^
> SyntaxError: invalid syntax
>
> I am seeing the caret pointing at filepath but I can't follow what it
> tries to tell me.

As already explained, often a SyntaxError is introduced by *preceding*
"text", so you must look at your code with a "wider eye".

> This rename statement tries to convert the greek byted filepath to
> utf-8 byted filepath.

Yes: and that usually implies that the *function* accepts (at least) *two*
arguments, specifically the source and the target names, right? How many
arguments are you actually giving to the os.rename() function above?

> I can't see why this is wrong though.

Try harder. I won't be giving you further hints about your
SyntaxErrors; you *must* learn how to detect and fix those by yourself.

ciao, lele.
-- 
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
l...@metapensiero.it  | -- Fortunato Depero, 1929.


Re: Changing filenames from Greeklish => Greek (subprocess complain)


On 7/6/2013 4:01 am, Cameron Simpson wrote:

On 06Jun2013 11:46, Νίκος Γκρ33κ wrote:
| On Thursday, 6 June 2013 3:44:52 p.m. UTC+3, Steven D'Aprano wrote:
| > py> s = '999-Eυχή-του-Ιησού'
| > py> bytes_as_utf8 = s.encode('utf-8')
| > py> t = bytes_as_utf8.decode('iso-8859-7', errors='replace')
| > py> print(t)
| > 999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ
|
| errors='replace' means don't break in case of error?

Yes. The result will be correct for correct iso-8859-7 and slightly mangled
for something that would not decode smoothly.

How can it be correct? We have encoded our string in utf-8 and then we 
tried to decode it as greek-iso? How can this possibly be correct?


| You took the unicode 's' string you utf-8 bytestringed it.
| Then how is it possible to ask for the utf8-bytestring to decode
| back to a unicode string using a different charset than the
| one used for encoding, and this actually printed the filename in
| greek-iso?

It is easily possible, as shown above. Does it make sense? Normally
not, but Steven is demonstrating how your "mv" exercises have
behaved: a rename using utf-8, then a _display_ using iso-8859-7.

Same as above, I don't understand it at all, since different 
charsets (encodings) are used in the encode/decode process.

|
| a) WHAT does it mean when a linux system is set to use utf-8?

It means the locale settings _for the current process_ are set for
UTF-8. The "locale" command will show you the current state.

That means that, when a linux application needs to save a filename to 
the linux filesystem, the app checks the filesystem's 'locale', so as to 
encode the filename using the utf-8 charset?
And likewise, when a linux application wants to decode a filename, it 
also checks the filesystem's 'locale' setting to know what charset it must 
use to decode the filename correctly back to the original string?


So the locale is used by the filesystem itself and by linux apps to know how to 
read (decode) and write (encode) filenames from/into the system's hdd?



| c) WHAT happens when the two of them try to work together?

If everything matches, it is all good. If the locales do not match,
the mismatch will result in an undesired bytes<->characters
encode/decode step somewhere, and something will display incorrectly
or be entered as input incorrectly.


I can't quite grasp the idea:

local end: Win8,  locale = greek-iso
remote end: CentOS 6.4,  locale = utf-8

FileZilla by default uses "do not know what charset" to upload filenames
Putty by default uses greek-iso to display filenames


WHAT can someone expect to happen when all of the above work together?
A mess of course, but I want to hear in detail each step of the mess as it 
emerges.
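
The steps of such a mismatch can be simulated without any network or terminal involved. The sketch below assumes, as the thread does, that the client side works in ISO-8859-7 and the server side in UTF-8; what FileZilla and PuTTY really send depends on their settings, so the sample names and the choice of codecs are illustrative only.

```python
name = 'Eυχή.mp3'

# Step 1: a client working in ISO-8859-7 sends the name as these bytes.
uploaded = name.encode('iso-8859-7')

# Step 2: the server stores the bytes untouched. A program that assumes
# UTF-8 then fails to decode them.
try:
    seen_by_utf8_program = uploaded.decode('utf-8')
except UnicodeDecodeError:
    seen_by_utf8_program = None

# Step 3: the opposite mismatch. A correctly UTF-8 encoded name shown by
# a terminal that assumes ISO-8859-7 comes out as different characters.
stored_utf8 = 'του'.encode('utf-8')
seen_in_greek_terminal = stored_utf8.decode('iso-8859-7')

print(uploaded, seen_by_utf8_program, repr(seen_in_greek_terminal))
```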


--
Webhost && Weblog 

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-06 Thread Chris Angelico

On Fri, Jun 7, 2013 at 4:35 PM,   wrote:
> Yes, but 'putty' always seems to forget when I tell it to use utf8 for 
> displaying, and always picks up Win8's default charset, and it doesn't have 
> a save options dialog. I can't always remember to switch to the utf8 charset, 
> or keep renaming so many Greek filenames from the terminal.

I use PuTTY too (though that'll change when I next upgrade Traal, as
I'll no longer have any Windows clients), and it's set to UTF-8 in the
Window|Translation page. Far as I know, those settings are all saved
into the Saved Sessions settings, back on the Session page.

ChrisA

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-06 Thread nagia . retsina

On Friday, 7 June 2013 4:25:40 a.m. UTC+3, Steven D'Aprano wrote:

> MRAB tells you to work with the bytes, because the filenames' bytes are 
> invalid decoded as UTF-8. If you fix the file names by renaming using a 
> terminal set to UTF-8, then they will be valid and you can forget about  
> working with bytes.

Yes, but 'putty' always seems to forget when I tell it to use utf8 for 
displaying, and always picks up Win8's default charset, and it doesn't have a 
save options dialog. I can't always remember to switch to the utf8 charset, or 
keep renaming so many Greek filenames from the terminal.

> Working with bytes is only for when the file names are turned to garbage.  
> Your file names (some of them) are turned to garbage. Fix them, and then 
> use file names as strings.

I can't. '~/data/apps/' is filled every day with more and more files which are 
uploaded via the FileZilla client, which I think behaves pretty much like putty, 
uploading filenames as greek-iso bytes.

So that garbage will happen every day due to the 'Putty' & 'FileZilla' clients.

So files.py, before doing its stuff, must do the automatic conversion from 
greek bytes to utf-8 bytes.

Here is what I have up until now.

=
# Collect filenames of the path dir as bytes
filename_bytes = os.listdir( b'/home/nikos/public_html/data/apps/' )

# Iterate over all filenames in the path dir
for filename in filenames_bytes:
    # Compute 'path/to/filename' in bytes
    filepath_bytes = b'/home/nikos/public_html/data/apps/' + b'filename'
    try:
        filepath = filepath_bytes.decode('utf-8')
    except UnicodeDecodeError:
        try:
            filepath = filepath_bytes.decode('iso-8859-7')

            # Rename filename from greek bytes => utf-8 bytes
            os.rename( filepath_bytes filepath.encode('utf-8') )
        except UnicodeDecodeError:
            print "I give up! This filename is unreadable!"
=

This is the best i can come up with, but after:

ni...@superhost.gr [~/www/cgi-bin]# python files.py
  File "files.py", line 75
os.rename( filepath_bytes filepath.encode('utf-8') )
 ^
SyntaxError: invalid syntax
ni...@superhost.gr [~/www/cgi-bin]#



I am seeing the caret pointing at filepath but I can't follow what it tries to 
tell me. No parenthesis was missed or added this time due to speed and tiredness.

This rename statement tries to convert the greek byted filepath to utf-8 byted 
filepath.

I can't see why this is wrong though.

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-06 Thread rusi

On Jun 7, 12:03 am, Lele Gaifax  wrote:

> You should *read* and *understand* the error message!

When you *shout* at the deaf, the non-deaf get deaf .

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-06 Thread Steven D'Aprano

On Thu, 06 Jun 2013 11:46:20 -0700, Νικόλαος Κούρας wrote:

> On Thursday, 6 June 2013 3:44:52 p.m. UTC+3, Steven D'Aprano wrote:
> 
>> py> s = '999-Eυχή-του-Ιησού' 
>> py> bytes_as_utf8 = s.encode('utf-8') 
>> py> t = bytes_as_utf8.decode('iso-8859-7', errors='replace') 
>> py> print(t)
>> 999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ
> 
> errors='replace' means don't break in case of error?

Please try reading the documentation for yourself before asking for help.

http://docs.python.org/3/library/stdtypes.html#bytes.decode

Yes, errors='replace' will mean that any time there is a decoding error, 
the official Unicode "U+FFFD REPLACEMENT CHARACTER" will be used instead 
of raising an error. Read the docs above, and follow the link, for more 
information.

> You took the unicode
> 's' string you utf-8 bytestringed it. 

The word is "encoded".

Encoding: Unicode string => bytes
Decoding: bytes => Unicode string

> Then how is it possible to ask for
> the utf8-bytestring to decode back to a unicode string using a
> different charset than the one used for encoding, and this actually
> printed the filename in greek-iso?

Bytes are bytes, no matter where they come from. Bytes don't remember 
whether they were from a Unicode string, or a float, or an integer, or a 
list of pointers. All they know is that they are a sequence of values, 
each value is 8 bits.

So bytes don't remember what charset (encoding) made them. If I have a 
set of bytes, I can *try* to do anything I like with them:

* decode those bytes as ASCII
* decode those bytes as UTF-8
* decode those bytes as ISO-8859-7
* decode those bytes as a list of floats
* decode those bytes as a binary tree of pointers

If the bytes are not actually ASCII, or UTF-8, etc., then I will get 
garbage, or an error.
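
As a small illustration, here are the same four bytes interpreted in several ways. The byte values are picked arbitrarily for this sketch, and the integer and float interpretations stand in for the "list of floats" idea above.

```python
import struct

data = b'\xce\xb1\xce\xb1'              # four bytes of unknown origin

as_utf8 = data.decode('utf-8')            # 'αα'
as_greek = data.decode('iso-8859-7')      # 'Ξ±Ξ±'
as_int = int.from_bytes(data, 'big')      # one 32-bit unsigned integer
(as_float,) = struct.unpack('>f', data)   # one 32-bit big-endian float

try:
    as_ascii = data.decode('ascii')
except UnicodeDecodeError:
    as_ascii = None                       # bytes >= 0x80 are not ASCII

print(as_utf8, as_greek, as_int, as_float, as_ascii)
```

Nothing in `data` says which of these readings is the intended one; that knowledge has to come from outside the bytes.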

>> So that demonstrates part of your problem: even though your Linux
>> system is using UTF-8, your terminal is probably set to ISO-8859-7. The
>> interaction between these will lead to strange and disturbing Unicode
>> errors.
> 
> Yes, I feel this is the problem too.
> It's a wonder to me why putty used greek-iso by default instead of utf-8!!

Putty is probably getting the default charset from the Windows 8 system 
you are using, and Windows is probably using Greek ISO-8859-7 for 
compatibility with legacy data going back to Windows 95 or even DOS.

Someday everyone will use UTF-8, and this nonsense will be over.

> Please explain this to me because now that I begin to understand these
> encode/decode things I begin to like them!

Start here:

http://www.joelonsoftware.com/articles/Unicode.html

http://nedbatchelder.com/text/unipain.html

> a) WHAT does it mean when a linux system is set to use utf-8?

The Linux file system just treats file names as bytes. Every byte except 
0x00 and 0x2f (ASCII '\0' and '/') is legal in file names, so the Linux 
file system will store any other bytes.

But the applications on a Linux system don't work with bytes, they work 
with text strings. You want to see a file name like "My Music.mp3", not 
bytes like 0x4d79204d757369632e6d7033. So the applications need to know 
how to encode their text strings (file names) into bytes, and how to 
decode the file system bytes back into strings.

On Linux, there is a standard setting for doing this, the locale, which 
by default is set to use UTF-8 as the standard encoding. So well-behaved 
Linux applications will, directly or indirectly, interpret the bytes-on-
disk in file names as UTF-8, because that's what the locale tells them to 
do.
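
A short sketch of what this looks like from Python 3 (the file name is just the example from above, and the output of the last line depends on your system). os.fsencode() and os.fsdecode() convert between the text an application sees and the bytes the file system stores, using the encoding reported by sys.getfilesystemencoding().

```python
import os
import sys

name = 'My Music.mp3'

# Text -> bytes, the way Python does it before handing a name to the OS.
raw = os.fsencode(name)
print(raw.hex())

# Bytes -> text, the way Python does it for names coming back from the OS.
print(os.fsdecode(raw))

# The encoding used for that conversion; on Linux it normally follows the locale.
print(sys.getfilesystemencoding())
```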

On Windows, there is a completely different setting for doing this, 
probably in the Registry.

> b) WHAT does it mean when a terminal client is set to use utf-8? 

Terminals need to accept bytes from the keyboard, and display them as 
text to the user. So they need to know what encoding to use to change 
bytes like 0x4d79204d757369632e6d7033 into something that is readable to 
a human being, "My Music.mp3". That is the encoding.

> c) WHAT happens when the two of them try to work together?

If they are set to the same encoding, everything just works.

If they are set to different encodings, you will probably have problems, 
just as you are having problems.

> ni...@superhost.gr [~/www/cgi-bin]# echo $LS_OPTIONS 
> --color=tty -F -a -b -T 0
> 
> Is this okay? Is the '-b' option there to display a filename in binary
> mode?

That's fine.

> Indeed I have changed putty to use 'utf-8' and 'ls -l' now displays the
> file in correct Greek letters. Switching putty's encoding back to
> 'greek-iso', the *displayed* filenames show up as mojibake.
> 
> WHAT is being displayed and what is actually stored as bytes are two
> different things, right?

Correct.

The bytes 0x20 0x40 mean " @" (space at-sign) in ASCII or UTF-8 (and 
also many other encodings), but they mean CJK UNIFIED IDEOGRAPH-4020 in 
little-endian UTF-16, they are invalid in UTF-32, and they mean the number 
8256 (big-endian) or 16416 (little-endian) as a 16-bit integer. Bytes are 
just sequences of 8-bit values. The *meaning* of those 8-bit values de

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-06 Thread Steven D'Aprano

On Thu, 06 Jun 2013 13:56:36 -0700, Νικόλαος Κούρας wrote:

> SyntaxError: invalid syntax
> 
> 
> Don't know how to add a bytestreamed path to a bytestream filename

Nikos, READ THE ERROR MESSAGE!!!

The error doesn't say anything about *adding*. It is a SyntaxError.

Please stop flooding us with dozens and dozens of trivial posts asking 
the same questions over and over again. There are well over 120 posts in 
this thread, it is impossible for anyone to keep track of it.

* Do not send a new post every time you make a small change to the code.

* Do not send a new post every time you make a typo and get a SyntaxError.

* READ THE ERROR MESSAGES and try to understand them first.

* SyntaxError means YOU HAVE MADE A TYPING MISTAKE.

* SyntaxError means that your code is not executed at all. Not a
  single line of code is run. If no code is running, the problem
  cannot possibly be with "add" or some other operation.

  If your car will not start, the problem cannot be with the brakes.

  If your program will not start, the problem cannot be with adding
  two byte strings.

* You can fix syntax errors yourself. READ THE CODE that has the 
  syntax error and LOOK FOR WHAT IS WRONG. Then fix it.

* Don't tell us when you have fixed it. Nobody cares. Just fix it.

Here is the line of code again:

old_path = b'/home/nikos/public_html/data/apps/' + b'filename') 

There is a syntax error in this line of code. Hint: here are some simple 
examples of the same syntax error:

a = b + c)
x = y * z)
alist.sort())
assert 1+1 == 2)

Can you see the common factor? Each of those lines will give the same 
syntax error as your line.
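
For anyone reading this in the archive later, here is a small sketch of the point being made, assuming Python 3. The file name b'example.mp3' is made up for illustration, and compile() is used only so that the broken line can be shown failing without stopping the rest of the example.

```python
import os

# The line as posted. The unmatched ')' means it cannot even be compiled,
# so nothing in the script (including the "+") ever runs.
broken = "old_path = b'/home/nikos/public_html/data/apps/' + b'filename')"
try:
    compile(broken, '<example>', 'exec')
    compiled = True
except SyntaxError as exc:
    compiled = False
    print('SyntaxError:', exc.msg)

# One possible repair: drop the stray ')' and join the *variable* holding the
# name, not the literal b'filename'. bytes + bytes works fine once it parses.
apps_dir = b'/home/nikos/public_html/data/apps/'
filename = b'example.mp3'          # hypothetical name, for illustration only
old_path = os.path.join(apps_dir, filename)
print(old_path)
```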

-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-06 Thread Steven D'Aprano

On Thu, 06 Jun 2013 12:17:16 -0700, Νικόλαος Κούρας wrote:

> I can remove the binary opening from os.listdir but then this will not
> work. MRAB has told me that I need to open those paths and filenames as
> bytestreams and not as unicode strings.

Do you understand why?

If you do not understand *why* we tell you to do a thing, then you have 
no understanding and are doing Cargo Cult programming:

http://en.wikipedia.org/wiki/Cargo_cult_programming
http://en.wikipedia.org/wiki/Cargo_cult

MRAB tells you to work with the bytes, because the file names' bytes are 
invalid when used as UTF-8. If you fix the file names by renaming using a 
terminal set to UTF-8, then they will be valid and you can forget about 
working with bytes.

Working with bytes is only for when the file names are turned to garbage. 
Your file names (some of them) are turned to garbage. Fix them, and then 
use file names as strings.
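
As a rough sketch of such a one-time fix in Python 3: the function names, the dry_run flag and the default of ISO-8859-7 as the legacy encoding are assumptions made for this example, not code from the thread. Names are handled as bytes only long enough to repair them.

```python
import os


def utf8_name(raw_name, legacy_encoding='iso-8859-7'):
    """Return the UTF-8 form of a legacy-encoded name.

    Returns None when raw_name is already valid UTF-8. A name that is not
    valid in the legacy encoding either raises UnicodeDecodeError.
    """
    try:
        raw_name.decode('utf-8')
    except UnicodeDecodeError:
        return raw_name.decode(legacy_encoding).encode('utf-8')
    return None


def fix_names(directory, dry_run=True):
    """Rename entries of `directory` (a bytes path) that are not valid UTF-8.

    With dry_run=True nothing is renamed; the planned renames are returned.
    """
    planned = []
    for raw_name in os.listdir(directory):      # bytes in, bytes out
        new_name = utf8_name(raw_name)
        if new_name is None:
            continue
        old_path = os.path.join(directory, raw_name)
        new_path = os.path.join(directory, new_name)
        if not dry_run:
            os.rename(old_path, new_path)
        planned.append((old_path, new_path))
    return planned
```

Running it first with dry_run=True on a copy of the directory shows what would change before anything is touched. Once every name is valid UTF-8, the rest of the program can use ordinary strings.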

-- 
Steven

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-06 Thread Cameron Simpson

On 06Jun2013 11:46, Νίκος Γκρ33κ wrote:
| On Thursday, 6 June 2013 3:44:52 PM UTC+3, Steven D'Aprano wrote:
| > py> s = '999-Eυχή-του-Ιησού'
| > py> bytes_as_utf8 = s.encode('utf-8')
| > py> t = bytes_as_utf8.decode('iso-8859-7', errors='replace')
| > py> print(t) 
| > 999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ
| 
| Does errors='replace' mean don't break in case of an error?

Yes. The result will be correct for correct iso-8859-7 and slightly mangled
for something that would not decode smoothly.

| You took the unicode string 's' and utf-8 bytestringed it.
| Then how is it possible to ask for the utf8-bytestring to decode
| back to a unicode string with the use of a different charset than the
| one used for encoding, and this actually printed the filename in
| greek-iso?

It is easily possible, as shown above. Does it make sense? Normally
not, but Steven is demonstrating how your "mv" exercises have
behaved: a rename using utf-8, then a _display_ using iso-8859-7.

| > So that demonstrates part of your problem: even though your Linux system  
| > is using UTF-8, your terminal is probably set to ISO-8859-7. The  
| > interaction between these will lead to strange and disturbing Unicode 
| > errors.
| 
| Yes, I feel this is the problem too.
| It's a wonder to me why putty used greek-iso by default instead of utf-8!!

Putty will get its terminal setting from the system you came from.
I suppose Windows of some kind. If you look at Putty's settings you
may be able to specify UTF-8 explicitly; not sure. If you can, do
that. At least there will be one less layer of confusion to debug.

| Please explain this to me because now that I begin to understand
| these encode/decode things I begin to like them!
| 
| a) WHAT does it mean when a linux system is set to use utf-8?

It means the locale settings _for the current process_ are set for
UTF-8. The "locale" command will show you the current state. There
will also be some system settings with defaults for stuff started
up by the system. On CentOS and RedHat that is probably the file:

  /etc/sysconfig/i18n

_However_, when you ssh in to the system using Putty or another ssh
client, the settings at your local end are passed to the remote ssh
session. In this way different people using different locales can
ssh in and get the locales they expect to use.

Of course, if the locale settings differ and these people are working
on the same files and text, madness will ensue.
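
From inside Python 3, a rough way to see what the current process inherited is sketched below. It only reads settings and changes nothing; which of the variables are set depends on how you logged in.

```python
import locale
import os
import sys

# Environment variables that select the character encoding part of the
# locale, highest precedence first.
for var in ('LC_ALL', 'LC_CTYPE', 'LANG'):
    print(var, '=', os.environ.get(var))

# Encoding Python derives from those settings for text I/O.
preferred = locale.getpreferredencoding(False)
print('preferred encoding:', preferred)

# Encoding Python will use when writing text to this terminal or pipe.
print('stdout encoding:', sys.stdout.encoding)
```

If two people see different values here while working on the same directory, that is the mismatch described above.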

| b) WHAT does it mean when a terminal client is set to use utf-8?

It means the _display_ end of the terminal will render characters
using UTF-8. Data comes from the remote system as a sequence of
bytes. The terminal receives these bytes and _decodes_ them using
utf-8 (or whatever) in order to decide what characters to display.

| c) WHAT happens when the two of them try to work together?

If everything matches, it is all good. If the locales do not match,
the mismatch will result in an undesired bytes<->characters
encode/decode step somewhere, and something will display incorrectly
or be entered as input incorrectly.

| > So I believe I understand how your file name has become garbage. To fix 
| > it, make sure that your terminal is set to use UTF-8, and then rename it. 
| > Do the same with every file in the directory until the problem goes away.
| 
| ni...@superhost.gr [~/www/cgi-bin]# echo $LS_OPTIONS
| --color=tty -F -a -b -T 0
| 
| Is this okay? Is the '-b' option there to display a filename in binary mode?

Probably. "man ls" will tell you.

Personally, I "unalias ls" on RedHat systems (and any other system
where an alias has been set up). I want ls to do what I say, not
what someone else thought was a good idea.

| Indeed I have changed putty to use 'utf-8' and 'ls -l' now displays
| the file in correct Greek letters. Switching putty's encoding back
| to 'greek-iso', the *displayed* filenames show up as mojibake.

Exactly so.

| WHAT is being displayed and what is actually stored as bytes are two
| different things, right?

Yes. Display requires the byte stream to be decoded. Wrong decoding
displays the wrong characters/glyphs.

| Ευχη του Ιησου.mp3
| EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ
| 
| is the way the filename is displayed in the terminal depending
| on the encoding the terminal uses, correct? But no matter *how* it's
| being displayed those two are the same file?

In principle, yes. Nothing has changed on the filesystem itself.

Cheers,
-- 
Cameron Simpson 

in rec.moto, jsh wrote:
> Dan Nitschke wrote:
> > Ged Martin wrote:
> > > On Sat, 17 May 1997 16:53:33 +, Dan Nitschke scribbled:
> > > >(And you stay *out* of my dreams, you deviant little
> > > >weirdo.)
> > > Yeah, yeah, that's what you're saying in _public_
> > Feh. You know nothing of my dreams. I dream entirely in text (New Century
> > Schoolbook bold oblique 14 point), and never in color. I once dreamed I
> > was walking down a flowchart of my own code, and a waterfall of semicolons
> > was chasing me. (I hid behind a global variable un

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-06 Thread Cameron Simpson

On 06Jun2013 05:04, Νίκος Γκρ33κ wrote:
| We are in test mode so I don't know what the encodings will be when the
| renaming actually takes place.
| Shall I switch off test mode and try it for real?

I would make a copy. Since you're renaming stuff, hard links would do:

  cp -rpl original-dir test-dir

Then test stuff in test-dir.
-- 
Cameron Simpson 

Too much of a good thing is never enough.   - Luba

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-06 Thread MRAB


On 06/06/2013 22:07, Lele Gaifax wrote:
> Νικόλαος Κούρας  writes:
>
>> Thanks, here is what I have up until now with many corrections.
>
> I'm afraid many more are needed :-)
>
>> ...
>> # rename filename form greek bytestreams --> utf-8 bytestreams
>> old_path = b'/home/nikos/public_html/data/apps/' + b'filename')
>> new_path = b'/home/nikos/public_html/data/apps/' + b'new_filename')
>> os.rename( old_path, new_path )
>
> a) there are two syntax errors, you have spurious close brackets there
> b) you are basically assigning *constant* expressions to both variables,
>    most probably not what you meant

Yet again, he's changed things unnecessarily, and the code was meant
only as a one-time fix to correct the encoding of some filenames. :-(

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-06 Thread Lele Gaifax

Νικόλαος Κούρας  writes:

> The only problem now is the bytestrings:

*One*, not the *only*.

>
> ni...@superhost.gr [~/www/cgi-bin]# [Thu Jun 06 23:50:42 2013] [error] 
> [client 79.103.41.173]   File "files.py", line 78
> [Thu Jun 06 23:50:42 2013] [error] [client 79.103.41.173] old_path = 
> b'/home/nikos/public_html/data/apps/' + b'filename')
> [Thu Jun 06 23:50:42 2013] [error] [client 79.103.41.173] 
>   ^
> [Thu Jun 06 23:50:42 2013] [error] [client 79.103.41.173] SyntaxError: 
> invalid syntax
>
>
> Don't know how to add a bytestreamed path to a bytestream filename

Come on Nikos, either you learn from what I (and others) try to teach
you, or I'm afraid you won't get more hints... this list cannot become
your remote editor tool!

*Read* the error message, *look* at the arrow (i.e. the caret character
 "^"), *understand* what that is trying to tell you...

ciao, lele.
-- 
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
l...@metapensiero.it  | -- Fortunato Depero, 1929.


Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-06 Thread Lele Gaifax

Νικόλαος Κούρας  writes:

> Thanks, here is what I have up until now with many corrections.

I'm afraid many more are needed :-)

> ...
>   # rename filename form greek bytestreams --> utf-8 bytestreams
>   old_path = b'/home/nikos/public_html/data/apps/' + b'filename')
>   new_path = b'/home/nikos/public_html/data/apps/' + b'new_filename')
>   os.rename( old_path, new_path )

a) there are two syntax errors, you have spurious close brackets there
b) you are basically assigning *constant* expressions to both variables,
   most probably not what you meant

ciao, lele.
-- 
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
l...@metapensiero.it  | -- Fortunato Depero, 1929.


Re: Changing filenames from Greeklish => Greek (subprocess complain)

I'm very sorry for the continuous pastes.
I didn't include the whole thing before.
Here it is:


#
# Get filenames of the path dir as bytestrings
path = os.listdir( b'/home/nikos/public_html/data/apps/' )

# iterate over all filenames in the apps directory
for filename in path:
    # Grabbing just the filename from path
    try:
        # Is this name encoded in utf-8?
        filename.decode('utf-8')
    except UnicodeDecodeError:
        # Decoding from UTF-8 failed, which means that the name is not valid utf-8

        # It appears that this filename is encoded in greek-iso, so decode from that and re-encode to utf-8
        new_filename = filename.decode('iso-8859-7').encode('utf-8')

        # rename filename form greek bytestreams --> utf-8 bytestreams
        old_path = b'/home/nikos/public_html/data/apps/' + b'filename')
        new_path = b'/home/nikos/public_html/data/apps/' + b'new_filename')
        os.rename( old_path, new_path )


#
# Get filenames of the apps directory as unicode
path = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for filename in path:
    try:
        # Check the presence of a file against the database and insert if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
        data = cur.fetchone()  #filename is unique, so should only be one

        if not data:
            # First time for file; primary key is automatic, hit is defaulted
            cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
    except pymysql.ProgrammingError as e:
        print( repr(e) )


#
path = os.listdir( '/home/nikos/public_html/data/apps/' )
filenames = ()

# Build a set of 'path/to/filename' based on the objects of path dir
for filename in path
    filenames.add( filename )

# Delete spurious
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()

# Check database's filenames against path's filenames
for filename in data:
    if filename not in filenames
        cur.execute('''DELETE FROM files WHERE url = %s''', (filename,) )
=

Just the bytestream error and then I believe it's ready this time.

Re: Changing filenames from Greeklish => Greek (subprocess complain)

Has some errors:

#
# Get filenames of the apps directory as bytestrings
path = os.listdir( b'/home/nikos/public_html/data/apps/' )

# iterate over all filenames in the apps directory
for filename in path:
    # Grabbing just the filename from path
    try:
        # Is this name encoded in utf-8?
        filename.decode('utf-8')
    except UnicodeDecodeError:
        # Decoding from UTF-8 failed, which means that the name is not valid utf-8

        # It appears that this filename is encoded in greek-iso, so decode from that and re-encode to utf-8
        new_filename = filename.decode('iso-8859-7').encode('utf-8')

        # rename filename form greek bytestreams --> utf-8 bytestreams
        old_path = b'/home/nikos/public_html/data/apps/' + b'filename')
        new_path = b'/home/nikos/public_html/data/apps/' + b'new_filename')
        os.rename( old_path, new_path )


#
# Get filenames of the apps directory as unicode
path = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for filename in path:
    try:
        # Check the presence of a file against the database and insert if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
        data = cur.fetchone()  #filename is unique, so should only be one

        if not data:
            # First time for file; primary key is automatic, hit is defaulted
            cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
    except pymysql.ProgrammingError as e:
        print( repr(e) )


#
path = os.listdir( '/home/nikos/public_html/data/apps/' )
filenames = ()

# Build a set of 'path/to/filename' based on the objects of path dir
for filename in path
    filenames.add( filename )

# Delete spurious
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()

# Check database's filenames against path's filenames
for filename in data:
    if filename not in filenames
        cur.execute('''DELETE FROM files WHERE url = %s''', (filename,) )
---

The only problem now is the bytestrings:

ni...@superhost.gr [~/www/cgi-bin]# [Thu Jun 06 23:50:42 2013] [error] [client 79.103.41.173]   File "files.py", line 78
[Thu Jun 06 23:50:42 2013] [error] [client 79.103.41.173] old_path = b'/home/nikos/public_html/data/apps/' + b'filename')
[Thu Jun 06 23:50:42 2013] [error] [client 79.103.41.173]   ^
[Thu Jun 06 23:50:42 2013] [error] [client 79.103.41.173] SyntaxError: invalid syntax


Don't know how to add a bytestreamed path to a bytestream filename

Re: Changing filenames from Greeklish => Greek (subprocess complain)

On Thursday, 6 June 2013 11:25:15 PM UTC+3, Lele Gaifax wrote:
> Νικόλαος Κούρας  writes:
> 
> > Now the error after fixing that transformed to:
> >
> > [Thu Jun 06 22:13:49 2013] [error] [client 79.103.41.173] filename = 
> > fullpath.replace( '/home/nikos/public_html/data/apps/', '' )
> > [Thu Jun 06 22:13:49 2013] [error] [client 79.103.41.173] TypeError: 
> > expected bytes, bytearray or buffer compatible object
> >
> > MRAB has told me that i need to open those paths and filenames as 
> > bytestreams and not as unicode strings.
> 
> Yes, that way the function will return a list of bytes
> instances. Knowing that, consider the following example, that should
> ring a bell:
> 
> $ python3
> Python 3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 09:59:04) 
> [GCC 4.7.2] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> path = b"some/path"
> >>> path.replace('some', '')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: expected bytes, bytearray or buffer compatible object
> >>> path.replace(b'some', b'')
> b'/path'

Ah yes, very logical, I should have thought of that.
Thanks, here is what I have up until now with many corrections.


#
# Get filenames of the apps directory as bytestrings
path = os.listdir( b'/home/nikos/public_html/data/apps/' )

# iterate over all filenames in the apps directory
for filename in path:
    # Grabbing just the filename from path
    try:
        # Is this name encoded in utf-8?
        filename.decode('utf-8')
    except UnicodeDecodeError:
        # Decoding from UTF-8 failed, which means that the name is not valid utf-8

        # It appears that this filename is encoded in greek-iso, so decode from that and re-encode to utf-8
        new_filename = filename.decode('iso-8859-7').encode('utf-8')

        # rename filename form greek bytestreams --> utf-8 bytestreams
        old_path = b'/home/nikos/public_html/data/apps/' + b'filename')
        new_path = b'/home/nikos/public_html/data/apps/' + b'new_filename')
        os.rename( old_path, new_path )


#
# Get filenames of the apps directory as unicode
path = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for filename in path:
    try:
        # Check the presence of a file against the database and insert if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
        data = cur.fetchone()  #URL is unique, so should only be one

        if not data:
            # First time for file; primary key is automatic, hit is defaulted
            cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
    except pymysql.ProgrammingError as e:
        print( repr(e) )


#
# Empty set that will be filled in with 'path/to/filename' of path dir
urls = ()

# Build a set of 'path/to/filename' based on the objects of path dir
for filename in path
    url = '/home/nikos/public_html/data/apps/' + filename
    urls.add( url )

# Delete spurious
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()

# Check database's urls against path's urls
for url in data:
    if url not in urls
        cur.execute('''DELETE FROM files WHERE url = %s''', (url,) )
==

I think it's ready! But I want to hear from you before I try it! :)

Re: Changing filenames from Greeklish => Greek (subprocess complain)

2013-06-06 Thread Lele Gaifax

Νικόλαος Κούρας  writes:

> Now the error after fixing that transformed to:
>
> [Thu Jun 06 22:13:49 2013] [error] [client 79.103.41.173] filename = 
> fullpath.replace( '/home/nikos/public_html/data/apps/', '' )
> [Thu Jun 06 22:13:49 2013] [error] [client 79.103.41.173] TypeError: expected 
> bytes, bytearray or buffer compatible object
>
> MRAB has told me that i need to open those paths and filenames as bytestreams 
> and not as unicode strings.

Yes, that way the function will return a list of bytes
instances. Knowing that, consider the following example, that should
ring a bell:

$ python3
Python 3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 09:59:04) 
[GCC 4.7.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> path = b"some/path"
>>> path.replace('some', '')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: expected bytes, bytearray or buffer compatible object
>>> path.replace(b'some', b'')
b'/path'
>>> 

ciao, lele.
-- 
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
l...@metapensiero.it  | -- Fortunato Depero, 1929.


Re: Changing filenames from Greeklish => Greek (subprocess complain)

Actually, about the spurious procedure, I am happy with myself that I came up with 
this:

# Delete spurious 
cur.execute('''SELECT url FROM files''')
data = cur.fetchall()

for filename in path
    url = '/home/nikos/public_html/data/apps/' + filename
    urls.add( url )

for url in data:
    if url not in urls
        cur.execute('''DELETE FROM files WHERE url = %s''', (url,) )


Didn't try it yet though, need to answer the previous posts.

a) Is it correct that the first time I call os.listdir() as binary to grab the 
filenames as bytestrings and the 2nd time normally to grab the filenames as unicode 
strings? 
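
On question (a), a minimal sketch, assuming Python 3, of how os.listdir() answers in the type it was asked in, plus one way the "delete spurious" comparison could be written. The temporary directory and the rows list are stand-ins invented for the example (no database is touched); rows imitates the one-element tuples that a DB-API cursor's fetchall() typically returns.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as apps_dir:
    # One sample file so that the listings are not empty.
    open(os.path.join(apps_dir, 'a.txt'), 'w').close()

    names_as_bytes = os.listdir(os.fsencode(apps_dir))   # bytes path -> bytes names
    names_as_text = os.listdir(apps_dir)                 # str path   -> str names

print(names_as_bytes, names_as_text)

# A real set is needed here: "()" creates an empty tuple, which has no .add().
on_disk = set(names_as_text)

# Hypothetical query result; each row is a tuple, so unpack it before comparing.
rows = [('a.txt',), ('gone.txt',)]
spurious = [url for (url,) in rows if url not in on_disk]
print(spurious)
```

Each name in spurious would then be passed to the DELETE statement. Comparing the row tuples themselves against a set of strings would never match, so every row would look spurious.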

Re: Changing filenames from Greeklish => Greek (subprocess complain)