Encoding

2006-02-13 Thread Lad
I have a script that uses utf-8 encoding. Now I would like to send some
data to an application ( on another server) that uses 1250 encoding.
How can I tell the server that the coming data will be in utf-8 code?
I tried this
params = 'some data'
h = httplib.HTTP(self.host)
h.putheader("Content-type", "multipart/form-data; boundary=---7d329422844")
h.putheader("Content-length", "%d" % len(params))
h.putheader("Accept-Language", "us-en")
h.putheader('Accept', 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*')
h.putheader('User-Agent', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)')
h.putheader('Host', 'server')
h.endheaders()
.


but it doesn't work; the data are not encoded properly.
Thanks for help
L.B.

-- 
http://mail.python.org/mailman/listinfo/python-list


Encoding

2006-03-18 Thread Mohamad Babaei

Hi,
I'm working on a program that fetches some translated texts from the Altavista online translator. It works fine with languages like German, French, etc., but it cannot get translated text in Japanese, Russian or Chinese.

my code is something like this:
 
##
data1 = urllib.urlopen('http://www.babelfish.altavista.com/babelfish/trurl_pagecontent?lp=en_ru&url=""')
data1 = data1.decode('utf-8')
f = open('/usr/local/new/tt.html', 'w')
f.write(data1.encode('utf-8'))
f.close()
###
 
Any help appreciated.
 
Best Wishes,
M.Babaei
-- 
http://mail.python.org/mailman/listinfo/python-list

Encoding Questions

2005-04-19 Thread jalil
1. I download a page in python using urllib and now want to convert and
keep it as utf-8? I already know the original encoding of the page.
What calls should I make to convert the encoding of the page to utf8?
For example, let's say the page is encoded in gb2312 (simple chinese)
and I want to keep it in utf-8?

2. Is this a good approach? Can I keep any pages in any languages in
this way and return them when requested using utf-8 encoding?

3. Does python 2.4 support all encodings?

By the way, I have set my default encoding in Python to utf8.

I appreciate any help.

-JF

-- 
http://mail.python.org/mailman/listinfo/python-list


Trouble Encoding

2005-06-07 Thread fingermark
I'm using feedparser to parse the following:

Adv: Termite Inspections! Jenny Moyer welcomes
you to her HomeFinderResource.com TM A "MUST See …

I'm receiving the following error when I try to print the feedparser
parsing of the above text:

UnicodeEncodeError: 'latin-1' codec can't encode character u'\u201c' in
position 86: ordinal not in range(256)

Why is this happening and where does the problem lie?

thanks

-- 
http://mail.python.org/mailman/listinfo/python-list


Encoding sniffer?

2006-01-05 Thread Andreas Jung
Does anyone know of a Python module that is able to sniff the encoding of 
text? Please: I know that there is no reliable way to do this but I need 
something that works for most of the case...so please no discussion about 
the sense of such a module and approach.


Andreas

-- 
http://mail.python.org/mailman/listinfo/python-list

Encoding - unicode

2006-01-10 Thread Robert Deskoski
Hi there,

Currently I have a file with Germanic names which are, unfortunately, in this format:
B\xf6genschutz
as well as being mixed with names that actually have the correct characters in them.
What I am trying to do is convert the characters in the above format to the correct
format in a text file. The five lines of code below work fine, so it changes the static
line of text correctly, but when it reads lines in from the file and I strip the endline off,
it doesn't convert the characters properly. It just keeps them as they are when printed
and output to the screen.

outfile = open("Output.py", 'w')
ingermanfile = open("GermanNames.txt", 'r')

line = "B\xf6genschutz"
print line.decode("iso-8859-1")
raw_input("Yeah")

while 1:
    line = ingermanfile.readline()
    if not(line):
        break

    try:
        print line
        templine = line[:-1]
        temp = templine.decode("iso-8859-1")
        print "'" + templine + "'"
        outfile.write(templine + "\n")
    except:
        raw_input("Here we are!")
        outfile.write(line)
        pass
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Encoding

2006-02-13 Thread Szabolcs Nagy
what about params='some data'.decode('utf8').encode('1250') ?

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Encoding

2006-02-13 Thread Martin v. Löwis
Lad wrote:
> I have a script that uses utf-8 encoding. Now I would like to send some
> data to an application ( on another server) that uses 1250 encoding.
> How can I tell the server that the coming data will be in utf-8 code?

You can't, normally. In theory, you should put

Content-type: text/plain;charset=utf-8

into the individual *part*. However, many implementations will ignore
the charset= argument on a part.

Instead, the convention is that the data must be transmitted in the
encoding of the *HTML page* which contained the form you are submitting.

So most likely, you need to recode your data.
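For example, a minimal sketch of that recoding step (assuming the form page
was served as cp1250 and that params currently holds utf-8 bytes, as in the
original script):

params = params.decode('utf-8').encode('cp1250')
h.putheader("Content-length", "%d" % len(params))  # length of the recoded bytes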

Regards,
Martin
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Encoding

2006-02-13 Thread Lad
Yes, 'some data'.decode('utf8').encode('windows-1250')
works great.
Thanks
L.B.

-- 
http://mail.python.org/mailman/listinfo/python-list


encoding problem

2006-03-03 Thread Yves Glodt
Hi list,


Playing with the great pysvn I get this problem:


Traceback (most recent call last):
   File "D:\avn\mail.py", line 80, in ?
 mailbody += diff
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 
10710: ordinal not in range(128)



It seems the pysvn.client.diff function returns "bytes" (as I read in 
the changelog of pysvn: http://svn.haxx.se/dev/archive-2005-10/0466.shtml)

How can I convert this string so that I can concatenate it to my 
"regular" string?


Best regards,
Yves
-- 
http://mail.python.org/mailman/listinfo/python-list


Character encoding

2006-11-07 Thread mp
I have html document titles with characters like &gt;, &nbsp;, and
&Dagger;. How do I decode a string with these values in Python?

Thanks

-- 
http://mail.python.org/mailman/listinfo/python-list


encoding characters

2007-03-21 Thread Valentina Marotto

First, excuse me for my very limited English!

I have a problem in my Python scripts with encoding characters of the type
àâäéèêëïîôöûùüç

I use the twisted.web package.

The problem is about the insertion data into database.

One of the errors is:

exceptions.UnicodeEncodeError: 'ascii' codec can't encode characters in
position 25-26: ordinal not in range(128)
<http://baghera.crs4.it:8080/sources#tbend>

Help me!
Thanks
-- 
http://mail.python.org/mailman/listinfo/python-list

encoding confusions

2007-03-29 Thread Tim Arnold
I have the contents of a file that contains French documentation.
I've iterated over it and now I want to write it out to a file.

I'm running into problems and I don't understand why--I don't get how the 
encoding works.
My first attempt was just this:
< snipped code for classes, etc; fname is string, codecs module loaded.>
< self.contents is the French file's contents as a single string >

tFile = codecs.open(fname,'w',encoding='latin-1', errors='ignore')
tFile.write(self.contents)
tFile.close()

ok, so that didn't work and I read some more and did this:
tFile.write(self.contents.encode('latin-1'))

but that gives me the same error
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 48: 
ordinal not in range(128)

this is python2.4.1 (hpux)
sys.getdefaultencoding()
'ascii'

thanks,
--Tim Arnold


-- 
http://mail.python.org/mailman/listinfo/python-list


character encoding conversion

2004-12-11 Thread Dylan

Here's what I'm trying to do:

- scrape some html content from various sources

The issue I'm running to:

- some of the sources have incorrectly encoded characters... for
example, cp1252 curly quotes that were likely the result of the author
copying and pasting content from Word

I've searched and read for many hours, but have not found a solution
for handling the case where the page author does not use the character
encoding that they have specified.

Things I have tried include encode()/decode(), and replacement lookup
tables (i.e. something like
http://groups-beta.google.com/group/comp.lang.python/browse_thread/thread/116158ad706dc7c1/11991de6ced3406b?q=python+html+parser+cp1252&_done=%2Fgroups%3Fq%3Dpython+html+parser+cp1252%26qt_s%3DSearch+Groups%26&_doneTitle=Back+to+Search&&d#11991de6ced3406b
) .  However, I am still unable to convert the characters to something
meaningful.  In the case of the lookup table, this failed as all of
the improperly encoded characters were returned as ? rather than
their original encoding.

I'm using urllib and htmllib to open, read, and parse the html
fragments, Python 2.3 on OS X 10.3 

Any ideas or pointers would be greatly appreciated.

-Dylan Schiemann
http://www.dylanschiemann.com/



-- 
http://mail.python.org/mailman/listinfo/python-list


unknown encoding problem

2005-04-08 Thread Uwe Mayer
Hi,

I need to read in a text file which seems to be stored in some unknown
encoding. Opening and reading the files content returns:

>>> f.read()
'\x00 \x00 \x00<\x00l\x00o\x00g\x00E\x00n\x00t\x00r\x00y\x00...

Each character has a \x00 prepended to it. I suspect it's some kind of
Unicode - how do I get rid of it?

str.replace('\x00', '') "works" but is not really nice. I don't quite get
the hang of str.encode/str.decode.

Any Ideas?
Thanks
Ciao
Uwe 
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Encoding Questions

2005-04-19 Thread Kent Johnson
[EMAIL PROTECTED] wrote:
> 1. I download a page in python using urllib and now want to convert and
> keep it as utf-8? I already know the original encoding of the page.
> What calls should I make to convert the encoding of the page to utf8?
> For example, let's say the page is encoded in gb2312 (simple chinese)
> and I want to keep it in utf-8?

Something like

data = urllib.urlopen(...).read()
unicodeData = data.decode('gb2312')
utf8Data = unicodeData.encode('utf-8')

You may want to supply the errors parameter to decode() or encode(); see the 
docs for details.
http://docs.python.org/lib/string-methods.html

> 2. Is this a good approach? Can I keep any pages in any languages in
> this way and return them when requested using utf-8 encoding?

Yes, as long as you know reliably what the encoding is for the source pages.

> 3. Does python 2.4 support all encodings?

I doubt it :-) but it supports many encodings. The list is at
http://docs.python.org/lib/standard-encodings.html
Kent
--
http://mail.python.org/mailman/listinfo/python-list


Re: Encoding Questions

2005-04-19 Thread vincent wehren

<[EMAIL PROTECTED]> schrieb im Newsbeitrag 
news:[EMAIL PROTECTED]
| 1. I download a page in python using urllib and now want to convert and
| keep it as utf-8? I already know the original encoding of the page.
| What calls should I make to convert the encoding of the page to utf8?
| For example, let's say the page is encoded in gb2312 (simple chinese)
| and I want to keep it in utf-8?

Something like:

utf8_s = s.decode('gb2312').encode('utf-8')

- with s being the simplified chinese string - should work.

|
| 2. Is this a good approach? Can I keep any pages in any languages in
| this way and return them when requested using utf-8 encoding?
|
| 3. Does python 2.4 support all encodings?

See http://docs.python.org/lib/standard-encodings.html for an overview.

|
| By the way, I have set my default encoding in Python to utf8.
|

Why would you want to do that?

--

Vincent Wehren

|
| I appreciate any help.
|
| -JF
| 


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Encoding Questions

2005-04-19 Thread "Martin v. Löwis"
Kent Johnson wrote:
> Something like
> data = urllib.url_open(...).read()
> unicodeData = data.decode('gb2312')
> utf8Data = unicodeData.encode('utf-8')
> 
> You may want to supply the errors parameter to decode() or encode(); see
> the docs for details.
> http://docs.python.org/lib/string-methods.html

In addition, for an HTML page, you might need to update the META element
for the content-type HTTP header. For an XHTML page, you might need to
update/remove the XML declaration.
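For instance, a rough sketch of that META update (regex-based, so only a
sketch; "page" stands for the decoded unicode page and gb2312 for the
original charset):

import re
page = re.sub(r'(?i)(charset\s*=\s*)gb2312', r'\1utf-8', page, 1)
utf8Data = page.encode('utf-8')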

Regards,
Martin
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Encoding Questions

2005-04-19 Thread jalil
thanks for the replies. As for why I set my default encoding to utf-8
in python, I did it a while ago and I think I did it because when I was
reading some strings from the database in utf-8 it raised errors because
there were some chars it couldn't recognize in the standard encoding. When
I made the change, the error didn't happen anymore.

Does it make sense? 

-JF

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Encoding Questions

2005-04-19 Thread "Martin v. Löwis"
[EMAIL PROTECTED] wrote:
> thanks for the replies. As for why I set my default encoding to utf-8
> in python, I did it a while ago and I think I did it because when I was
> reading some strings from database in utf-8 it raised errors b/c there
> were some chars it could recongnize in standard encoding. When I made
> the change, the error didn't happen anymore.
> 
> Does it make sense? 

No. If reading the strings from the database already gives an exception
(i.e. without any processing of these strings), that is a bug in the
database. It is also unlikely that this is what actually happened.

More likely, you are reading the strings from the database, and then
combining them explicitly with Unicode strings. Instead of changing
the default encoding, you should tell your database adapter to return
the strings as Unicode objects; if this is not supported, you should
convert them to Unicode objects in the process of reading them.
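For illustration, a minimal sketch of that conversion step (assuming a
DB-API cursor and that the database hands back utf-8 byte strings, as
described above):

unicode_rows = []
for row in cursor.fetchall():
    converted = []
    for col in row:
        if isinstance(col, str):
            col = col.decode('utf-8')   # byte string -> unicode object
        converted.append(col)
    unicode_rows.append(converted)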

Regards,
Martin
-- 
http://mail.python.org/mailman/listinfo/python-list


unicode encoding problem

2005-04-28 Thread garykpdx

Every time I think I understand unicode, I prove I don't.

I created a variable in interactive mode like this:
s = u'ä'
where this character is the a-umlaut
that worked alright. Then I encoded it like this:
s.encode( 'latin1')

and it printed out a sigma (totally wrong)

then I typed this:
s.encode( 'utf-8')

Then it gave me two weird characters +ñ

So how do I tell what encoding my unicode string is in, and how do I
retrieve that when I read it from a file?
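(For what it is worth: a unicode object carries no encoding of its own; an
encoding only comes into play when bytes are read or written. A minimal
sketch, with a made-up file name and a latin1 assumption:)

import codecs
f = codecs.open('names.txt', 'r', encoding='latin1')
u = f.read()       # a unicode object, decoded from the latin1 bytes
f.close()
print repr(u)      # repr() shows the code points without re-encoding them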

--
http://mail.python.org/mailman/listinfo/python-list


utf encoding error

2005-05-04 Thread Timothy Smith
hi there, this one is in relation to my py2exe saga.

When I compile a package using py2exe I get the error message below; if I 
just run the py files it doesn't error, so I assume pysvn is trying to 
use something that's not being included in the build. Only I have no idea 
where to start looking.

Traceback (most recent call last):
  File "Main.pyc", line 819, in ValidateLogin
  File "Main.pyc", line 861, in ShowMainFrameItems
LookupError: unknown encoding: utf-8


and here is my setup.py for py2exe for good measure

from distutils.core import setup
import py2exe
   
package_dir = ['c:\python23\reportlab', 'c:\python23\lib',
   'Z:\Projects\PubWare\trunk\python']
   
setup(windows=[ "Z:\\Projects\\PubWare\\trunk\\python\\PubWare.py"])
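(A guess, not confirmed in this thread: "unknown encoding" from a py2exe
build usually means the codec modules were left out, so one common
workaround is to make py2exe bundle the whole encodings package.)

setup(
    windows=["Z:\\Projects\\PubWare\\trunk\\python\\PubWare.py"],
    options={"py2exe": {"packages": ["encodings"]}},
)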
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Trouble Encoding

2005-06-07 Thread deelan
[EMAIL PROTECTED] wrote:
> I'm using feedparser to parse the following:
> 
> Adv: Termite Inspections! Jenny Moyer welcomes
> you to her HomeFinderResource.com TM A "MUST See …
> 
> I'm receiveing the following error when i try to print the feedparser
> parsing of the above text:
> 
> UnicodeEncodeError: 'latin-1' codec can't encode character u'\u201c' in
> position 86: ordinal not in range(256)
> 
> Why is this happening and where does the problem lie?

it seems that the unicode character 0x201c isn't part
of the latin-1 charset, see:

"LEFT DOUBLE QUOTATION MARK"


try to encode the feedparser output to UTF-8 instead, or
use the "replace" option for the encode() method.

 >>> c = u'\u201c'
 >>> c
u'\u201c'
 >>> c.encode('utf-8')
'\xe2\x80\x9c'
 >>> print c.encode('utf-8')

ok, let's try replace

 >>> c.encode('latin-1', 'replace')
'?'

using "replace" will not throw an error, but it will replace
the offending characther with a question mark.

HTH.

-- 
deelan 




-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Trouble Encoding

2005-06-07 Thread fingermark
why is it even trying latin-1 at all?  I don't see it anywhere in
feedparser.py or my code.

deelan wrote:
> [EMAIL PROTECTED] wrote:
> > I'm using feedparser to parse the following:
> >
> > Adv: Termite Inspections! Jenny Moyer welcomes
> > you to her HomeFinderResource.com TM A "MUST See …
> >
> > I'm receiveing the following error when i try to print the feedparser
> > parsing of the above text:
> >
> > UnicodeEncodeError: 'latin-1' codec can't encode character u'\u201c' in
> > position 86: ordinal not in range(256)
> >
> > Why is this happening and where does the problem lie?
>
> it seems that the unicode character 0x201c isn't part
> of the latin-1 charset, see:
>
> "LEFT DOUBLE QUOTATION MARK"
> 
>
> try to encode the feedparser output to UTF-8 instead, or
> use the "replace" option for the encode() method.
>
>  >>> c = u'\u201c'
>  >>> c
> u'\u201c'
>  >>> c.encode('utf-8')
> '\xe2\x80\x9c'
>  >>> print c.encode('utf-8')
>
> ok, let's try replace
>
>  >>> c.encode('latin-1', 'replace')
> '?'
>
> using "replace" will not throw an error, but it will replace
> the offending characther with a question mark.
> 
> HTH.
> 
> -- 
> deelan 

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Trouble Encoding

2005-06-07 Thread Jarek Zgoda
[EMAIL PROTECTED] napisał(a):

> why is it even trying latin-1 at all?  I don't see it anywhere in
> feedparser.py or my code.

Check your site.py or sitecustomize.py module, you can have non-standard 
default encoding set there.

-- 
Jarek Zgoda
http://jpa.berlios.de/ | http://www.zgodowie.org/
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Trouble Encoding

2005-06-07 Thread John Roth
<[EMAIL PROTECTED]> wrote in message 
news:[EMAIL PROTECTED]
> I'm using feedparser to parse the following:
>
> Adv: Termite Inspections! Jenny Moyer welcomes
> you to her HomeFinderResource.com TM A "MUST See &hellip;
>
> I'm receiveing the following error when i try to print the feedparser
> parsing of the above text:
>
> UnicodeEncodeError: 'latin-1' codec can't encode character u'\u201c' in
> position 86: ordinal not in range(256)
>
> Why is this happening and where does the problem lie?

Several different things are going on here. First, when you try to
print a unicode string using str() or a similar function, Python is going to
use the default encoding to render it. The default encoding is usually
ASCII-7. Why it's trying to use Latin-1 in this case is somewhat
of a mystery.

The quote in front of the word MUST is a "smart quote", that is a
curly quote, and it is not a valid character in either ASCII or
Latin-1. Use Windows-1252 explicitly, and it should render
properly. Alternatively use UTF-8, as one of the other posters
suggested. Then it's up to whatever software you use to actually
put the ink on the paper to render it properly, but that's a different
issue.
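To make the encoding explicit, a small sketch ('cp1252' is the codec name
Python uses for Windows-1252):

title = u'A \u201cMUST See\u2026'
print title.encode('cp1252')    # or title.encode('utf-8')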

John Roth
>
> thanks
> 

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Trouble Encoding

2005-06-07 Thread Kent Johnson
John Roth wrote:
> <[EMAIL PROTECTED]> wrote in message 
> news:[EMAIL PROTECTED]
> 
>> I'm using feedparser to parse the following:
>>
>> Adv: Termite Inspections! Jenny Moyer welcomes
>> you to her HomeFinderResource.com TM A "MUST See &hellip;
>>
>> I'm receiveing the following error when i try to print the feedparser
>> parsing of the above text:
>>
>> UnicodeEncodeError: 'latin-1' codec can't encode character u'\u201c' in
>> position 86: ordinal not in range(256)
>>
>> Why is this happening and where does the problem lie?
> 
> 
> Several different things are going on here. First, when you try to
> print a unicode string using str() or a similar function, Python is 
> going to
> use the default encoding to render it. The default encoding is usually
> ASCII-7. Why it's trying to use Latin-1 in this case is somewhat
> of a mystery.

Actually I believe it will use sys.stdout.encoding for this, which is 
presumably latin-1 on fingermark's machine.
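For illustration, a small sketch (assuming a terminal that can display
UTF-8):

import sys, codecs
print sys.stdout.encoding                    # what print uses for unicode objects
out = codecs.getwriter('utf-8')(sys.stdout)
out.write(u'\u201cMUST See\u2026\n')         # explicit encoding, no guessing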

Kent
-- 
http://mail.python.org/mailman/listinfo/python-list


Detect character encoding

2005-12-04 Thread Michal
Hello,
is there any way how to detect string encoding in Python?

I need to proccess several files. Each of them could be encoded in 
different charset (iso-8859-2, cp1250, etc). I want to detect it, and 
encode it to utf-8 (with string function encode).

Thank you for any answer
Regards
Michal
-- 
http://mail.python.org/mailman/listinfo/python-list


python encoding bug?

2005-12-30 Thread garabik-news-2005-05

I was playing with python encodings and noticed this:

[EMAIL PROTECTED]:~$ python2.4
Python 2.4 (#2, Dec  3 2004, 17:59:05)
[GCC 3.3.5 (Debian 1:3.3.5-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> unicode('\x9d', 'iso8859_1')
u'\x9d'
>>>

U+009D is NOT a valid unicode character (it is not even a iso8859_1
valid character)

The same happens if I use 'latin-1' instead of 'iso8859_1'.

This caught me by surprise, since I was doing some heuristics guessing 
string encodings, and 'iso8859_1' gave no errors even if the input
encoding was different.

Is this a known behaviour, or have I discovered a terrible unknown bug in Python's
encoding implementation that should be immediately reported and fixed? :-)


happy new year,

-- 
 ---
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__garabik @ kassiopeia.juls.savba.sk |
 ---
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Encoding sniffer?

2006-01-05 Thread garabik-news-2005-05
Andreas Jung <[EMAIL PROTECTED]> wrote:
> 
> Does anyone know of a Python module that is able to sniff the encoding of 
> text? Please: I know that there is no reliable way to do this but I need 
> something that works for most of the case...so please no discussion about 
> the sense of such a module and approach.
> 

depends on what exactly you need
one approach is pyenca

the other is:

def try_encodings(s, encodings):
    "try to guess the encoding of string s, testing encodings given in the second parameter"
    for enc in encodings:
        try:
            test = unicode(s, enc)
            return enc
        except UnicodeDecodeError:
            pass
    return None

print try_encodings(text, ['ascii', 'utf-8', 'iso8859_1', 'cp1252', 'macroman'])


depending on what language and encodings you expect the text to be in,
the first or the second approach is better


-- 
 ---
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__garabik @ kassiopeia.juls.savba.sk |
 ---
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Encoding sniffer?

2006-01-05 Thread Diez B. Roggisch
> print try_encodings(text, ['ascii', 'utf-8', 'iso8859_1', 'cp1252', 
> 'macroman']

I've fallen into that trap before - it won't work beyond the iso8859_1 entry.
The reason is that an eight-bit encoding has all 256 code points
assigned (usually; there are exceptions, but you have to be lucky to have
a string that contains a value not assigned in one of them - which is
highly unlikely).

AFAIK iso-8859-1 has all codepoints taken - so you won't go beyond that 
in your example.


Regards,

Diez
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Encoding sniffer?

2006-01-05 Thread garabik-news-2005-05
Diez B. Roggisch <[EMAIL PROTECTED]> wrote:
>> print try_encodings(text, ['ascii', 'utf-8', 'iso8859_1', 'cp1252', 
>> 'macroman']
> 
> I've fallen into that trap before - it won't work after the iso8859_1. 
> The reason is that an eight-bit encoding have all 256 code-points 
> assigned (usually, there are exceptions but you have to be lucky to have 
> a string that contains a value not assigned in one of them - which is 
> highly unlikely)
> 
> AFAIK iso-8859-1 has all codepoints taken - so you won't go beyond that 
> in your example.

I pasted from a wrong file :-)
See my previous posting (a few days ago) - what I did was to implement
iso8859_1_ncc encoding (iso8859_1 without control codes) and
the line should have been
try_encodings(text, ['ascii', 'utf-8', 'iso8859_1_ncc', 'cp1252', 'macroman'])

where iso8859_1_ncc.py is the same as iso8859_1.py from python
distribution, with this line different:

decoding_map = codecs.make_identity_dict(range(32, 128)+range(128+32,256))
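The same check can also be done inline, without a separate codec module - a
rough sketch (an illustration only, not the actual iso8859_1_ncc.py):

def decode_latin1_ncc(s):
    "Decode as latin-1, but treat C1 control codes as a failure."
    u = s.decode('iso8859_1')
    for i, ch in enumerate(u):
        if 0x80 <= ord(ch) < 0xa0:
            raise UnicodeDecodeError('iso8859_1_ncc', s, i, i + 1,
                                     'C1 control code')
    return u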


-- 
 ---
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__garabik @ kassiopeia.juls.savba.sk |
 ---
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Encoding sniffer?

2006-01-05 Thread Diez B. Roggisch
[EMAIL PROTECTED] schrieb:
> Diez B. Roggisch <[EMAIL PROTECTED]> wrote:
> 
>>>print try_encodings(text, ['ascii', 'utf-8', 'iso8859_1', 'cp1252', 
>>>'macroman']
>>
>>I've fallen into that trap before - it won't work after the iso8859_1. 
>>The reason is that an eight-bit encoding have all 256 code-points 
>>assigned (usually, there are exceptions but you have to be lucky to have 
>>a string that contains a value not assigned in one of them - which is 
>>highly unlikely)
>>
>>AFAIK iso-8859-1 has all codepoints taken - so you won't go beyond that 
>>in your example.
> 
> 
> I pasted from a wrong file :-)
> See my previous posting (a few days ago) - what I did was to implement
> iso8859_1_ncc encoding (iso8859_1 without control codes) and
> the line should have been 
> try_encodings(text, ['ascii', 'utf-8', 'iso8859_1_ncc', 'cp1252', 'macroman']
> 
> where iso8859_1_ncc.py is the same as iso8859_1.py from python
> distribution, with this line different:
> 
> decoding_map = codecs.make_identity_dict(range(32, 128)+range(128+32,256))

Ok, I can see that. But still, there would be quite a few overlapping 
codepoints.

I think what the OP (and many more people) want is something that
tries to guess encodings based on probabilities, for certain trigrams
containing an umlaut for example.
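For what it's worth, a toy sketch of that idea (my own illustration, far
cruder than a real statistical guesser): score each candidate decoding by
how often a few common German fragments appear and keep the best one.

def guess_by_frequency(data, candidates=('utf-8', 'cp1252', 'iso8859_2')):
    common = [u'\xe4u', u'\xfcr', u'\xf6n', u'sch', u'ein', u'und']
    best_enc, best_score = None, -1
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue
        score = sum([text.count(frag) for frag in common])
        if score > best_score:
            best_enc, best_score = enc, score
    return best_enc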

There seems to be a tool called "konwert" out there that does such 
things, and recode has some guessing stuff too, AFAIK - but I  haven't 
seen any special python modules so far.

Diez
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Encoding sniffer?

2006-01-05 Thread skip

Andreas> Does anyone know of a Python module that is able to sniff the
Andreas> encoding of text?

I have such a beast.  Search here:

http://orca.mojam.com/~skip/python/

for "decode".

Skip
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Encoding sniffer?

2006-01-06 Thread Andreas Jung

Thanks!

--On 5. Januar 2006 18:21:39 -0600 [EMAIL PROTECTED] wrote:


http://orca.mojam.com/~skip/python/






pgpyF17uM2CTT.pgp
Description: PGP signature
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Encoding sniffer?

2006-01-06 Thread Ralf Muschall
Diez B. Roggisch wrote:

> AFAIK iso-8859-1 has all codepoints taken - so you won't go beyond that
> in your example.

IIRC the range 128-159 (i.e. control codes with the high bit set)
are unused.

Ralf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Encoding sniffer?

2006-01-06 Thread Neil Hodgson
Ralf Muschall:

> Diez B. Roggisch wrote:
>>AFAIK iso-8859-1 has all codepoints taken - so you won't go beyond that
>>in your example.
> IIRC the range 128-159 (i.e. control codes with the high bit set)
> are unused.

ISO 8859-1 and ISO-8859-1 (extra hyphen) differ in that ISO-8859-1 
includes the control codes in 128-159 (as well as the low control codes) 
as defined by ISO 6429. ISO 6429 is not freely available online but the 
equivalent ECMA standard ECMA 48 is:
http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-048.pdf

Neil
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Encoding sniffer?

2006-01-11 Thread Stuart Bishop
[EMAIL PROTECTED] wrote:
> Andreas> Does anyone know of a Python module that is able to sniff the
> Andreas> encoding of text?
> 
> I have such a beast.  Search here:
> 
> http://orca.mojam.com/~skip/python/
> 
> for "decode".
> 
> Skip

We have similar code. It looks functionally the same except that we also:

Check if the string starts with a BOM.
Detect probable ISO-8859-15 using a set of characters common
in ISO-8859-15 but uncommon in ISO-8859-1.
Doctests :-)

# Detect BOM
_boms = [
    (codecs.BOM_UTF16_BE, 'utf_16_be'),
    (codecs.BOM_UTF16_LE, 'utf_16_le'),
    (codecs.BOM_UTF32_BE, 'utf_32_be'),
    (codecs.BOM_UTF32_LE, 'utf_32_le'),
    ]

try:
    for bom, encoding in _boms:
        if s.startswith(bom):
            return unicode(s[len(bom):], encoding)
except UnicodeDecodeError:
    pass

[...]

# If we have characters in this range, it is probably ISO-8859-15
if re.search(r"[\xa4\xa6\xa8\xb4\xb8\xbc-\xbe]", s) is not None:
    try:
        return unicode(s, 'ISO-8859-15')
    except UnicodeDecodeError:
        pass

Feel free to update your available code. Otherwise, I can probably post ours
somewhere if necessary.

-- 
Stuart Bishop <[EMAIL PROTECTED]>
http://www.stuartbishop.net/


-- 
http://mail.python.org/mailman/listinfo/python-list

urllib2 chunked encoding

2006-01-26 Thread Sean Harper
I am trying to download some html using mechanize:

br = Browser()
r = br.open("url")
b = r.read()
print b

I have used this code successfully many times and now I have run across an instance where it fails. It opens the url without error but then prints nothing.

Looking at the http headers: when it works fine it has a "Content-Length: X", and when it does not work it has no Content-Length and it has "Transfer-Encoding: chunked". It could be that I am doing something else wrong (like maybe that site needs some different headers or something), but I had never seen this before so it made me suspicious.

Does urllib2 and mechanize deal with chunked encoding properly? Is there something else that I need to do to make it work?

Thanks,
Sean
-- 
http://mail.python.org/mailman/listinfo/python-list

cElementTree encoding woes

2006-02-20 Thread Diez B. Roggisch
Hi,

I've got to deal with a pretty huge XML-document, and to do so I use the
cElementTree.iterparse functionality. Working great.

Only trouble: the guys creating that chunk of XML - well, let's just say they
are "encodingly challenged", so they don't produce utf-8, but only cp1252
instead, together with some weird name (Windows-1252) for that. That is not
part of the standard codecs module. cp1252 is, of course.

But that won't work for iterparse. So currently, I manually change the
encoding given to utf-8, and use a stream-recoder. 
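For illustration, a non-streaming sketch of that workaround (it slurps the
whole document, so on a really huge file the same recoding would have to
happen on the fly, e.g. via codecs.EncodedFile):

import re
from cStringIO import StringIO

def recoded(f):
    data = f.read()
    # fix the declared encoding, then recode the bytes themselves
    data = re.sub(r'encoding="[Ww]indows-1252"', 'encoding="utf-8"', data, 1)
    return StringIO(data.decode('cp1252').encode('utf-8'))

# cElementTree.iterparse(recoded(open('big.xml', 'rb')))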

However, I was wondering if I could teach cElementTree about that encoding
name. I tried to register cp1252 under the name Windows-1252, but had no
luck - cET won't buy it.

Any suggestions?

Diez
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem

2006-03-03 Thread Sebastjan Trepca
I think you are trying to concatenate a unicode string with a regular
one, so when it tries to convert the regular string to unicode with the
ASCII (default) encoding it fails. First find out which of these
strings is regular and how it was encoded, then you can decode it like
this (if the regular string is diff):
mailbody +=diff.decode('')

Sebastjan

On 3/3/06, Yves Glodt <[EMAIL PROTECTED]> wrote:
> Hi list,
>
>
> Playing with the great pysvn I get this problem:
>
>
> Traceback (most recent call last):
>File "D:\avn\mail.py", line 80, in ?
>  mailbody += diff
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position
> 10710: ordinal not in range(128)
>
>
>
> It seems the pysvn.client.diff function returns "bytes" (as I read in
> the changelog of pysvn: http://svn.haxx.se/dev/archive-2005-10/0466.shtml)
>
> How can I convert this string so that I can contatenate it to my
> "regular" string?
>
>
> Best regards,
> Yves
> --
> http://mail.python.org/mailman/listinfo/python-list
>
-- 
http://mail.python.org/mailman/listinfo/python-list


id3 encoding testing

2006-03-03 Thread ianaré
hey all,

anybody know where i can download some mp3's and/or other file formats
that use id3 tags in different encodings? working on an app that
retrieves id3 info, and all the mp3's i have come across so far use
iso-8859-1 encoding. i would like to test:

'Eastern Europe  (iso-8859-2)'
'Cyrillic  (iso-8859-5)'
'Arabic  (iso-8859-6)'
'Greek  (iso-8859-7)'
'Hebrew  (iso-8859-8)'
'Turkish  (iso-8859-9)'
'Nordic  (iso-8859-10)'
'Japanese  (iso2022_jp)'
'Korean  (iso2022_kr)'
'utf_8'
'utf_16'

thanks!

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem

2006-03-08 Thread Yves Glodt
Sebastjan Trepca wrote:
> I think you are trying to concatenate a unicode string with regular
> one so when it tries to convert the regular string to unicode with
> ASCII(default one) encoding it fails. First find out which of these
> strings is regular and how it was encoded, then you can decode it like
> this(if regular string is diff):
> 
> mailbody +=diff.decode('')

Thanks I'll look into that...

It seems in general I have trouble with special characters...
What is the python way to deal with éàè öäü etc...

print 'é' fails here,
print u'é' as well :-(

How am I supposed to print non-ascii characters the correct way?


best regards,
Yves

> Sebastjan
> 
> On 3/3/06, Yves Glodt <[EMAIL PROTECTED]> wrote:
>> Hi list,
>>
>>
>> Playing with the great pysvn I get this problem:
>>
>>
>> Traceback (most recent call last):
>>File "D:\avn\mail.py", line 80, in ?
>>  mailbody += diff
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position
>> 10710: ordinal not in range(128)
>>
>>
>>
>> It seems the pysvn.client.diff function returns "bytes" (as I read in
>> the changelog of pysvn: http://svn.haxx.se/dev/archive-2005-10/0466.shtml)
>>
>> How can I convert this string so that I can contatenate it to my
>> "regular" string?
>>
>>
>> Best regards,
>> Yves
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem

2006-03-09 Thread Martin v. Löwis
Yves Glodt wrote:
> It seems in general I have trouble with special characters...
> What is the python way to deal with éàè öäü etc...
> 
> print 'é' fails here,
> print u'é' as well :-(
> 
> How am I supposed to print non-ascii characters the correct way?

The second form should be used, but not in interactive mode.
In a Python script, make sure you properly declare the encoding
of your script, e.g.

# -*- coding: iso-8859-1 -*-
print u'é'

That should work. If not, give us your Python version, operating
system name, and mode of operation.

Regards,
Martin
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem

2006-03-09 Thread Scott David Daniels
Yves Glodt wrote:
> It seems in general I have trouble with special characters...
> What is the python way to deal with éàè öäü etc...
> 
> print 'é' fails here,
This should probably stay true.

> print u'é' as well :-(
This is an issue with how your output is connected.
What OS, what code page, what application?
I'm using Win2K, Python 2.4.2
Using Idle, I can do:
print u'élève'
And get what I expect
I can also do:
print repr(u'élève')
which gives me:
 u'\xe9l\xe8ve'
and:
 print u'\xe9l\xe8ve'
Also shows me élève.


With cmd.exe (the command line):
 c:\ python
 >>> print u'\xe9l\xe8ve'
shows me élève, but I can't type in:
 >>> print u'lve'
is what I get when I paste in the print u'élève' (beeps during paste).
What do you get if you put in:
 >>> print repr('élève')

--Scott David Daniels
[EMAIL PROTECTED]
-- 
http://mail.python.org/mailman/listinfo/python-list


encoding of sys.argv ?

2006-10-23 Thread Jiba
Hi all,

I am desperately searching for the encoding of sys.argv.

I use a Linux box, with French UTF-8 locales and an UTF-8 filesystem. 
sys.getdefaultencoding() is "ascii" and sys.getfilesystemencoding() is "utf-8". 
However, sys.argv is neither in ASCII (since I can pass French accented 
characters), nor in UTF-8. It seems to be encoded in "latin-1", but why?

Jiba
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Character encoding

2006-11-07 Thread i80and
I would suggest using string.replace.  Simply replace '&nbsp;' with ' '
for each time it occurs.  It doesn't take too much code.

On Nov 7, 1:34 pm, "mp" <[EMAIL PROTECTED]> wrote:
> I have html document titles with characters like >,  , and
> ‡. How do I decode a string with these values in Python?
> 
> Thanks

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Character encoding

2006-11-07 Thread mp
I'd prefer a more generalized solution which takes care of all possible
ampersand characters. I assume that there is code already written which
does this.

Thanks

i80and wrote:
> I would suggest using string.replace.  Simply replace ' ' with ' '
> for each time it occurs.  It doesn't take too much code.
>
> On Nov 7, 1:34 pm, "mp" <[EMAIL PROTECTED]> wrote:
> > I have html document titles with characters like >,  , and
> > ‡. How do I decode a string with these values in Python?
> > 
> > Thanks

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Character encoding

2006-11-07 Thread Gabriel Genellina

At Tuesday 7/11/2006 17:10, mp wrote:


I'd prefer a more generalized solution which takes care of all possible
ampersand characters. I assume that there is code already written which
does this.


Try the htmlentitydefs module
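For example, a minimal sketch (named and decimal references only; hex
references are not handled here):

import re, htmlentitydefs

def unescape(s):
    def repl(m):
        ref = m.group(1)
        if ref.startswith('#'):
            return unichr(int(ref[1:]))                              # numeric reference
        return unichr(htmlentitydefs.name2codepoint.get(ref, 0x3f))  # '?' if unknown
    return re.sub(r'&(#?\w+);', repl, s)

print repr(unescape(u'titles with &gt;, &nbsp;, and &Dagger;'))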


--
Gabriel Genellina
Softlab SRL 


-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Character encoding

2006-11-08 Thread [EMAIL PROTECTED]

Dennis Lee Bieber wrote:
> On 7 Nov 2006 11:34:32 -0800, "mp" <[EMAIL PROTECTED]> declaimed the
> following in comp.lang.python:
>
> > I have html document titles with characters like >,  , and
> > ‡. How do I sddecode a string with these values in Python?
> >
>
>   Wouldn't HTMLParser be suited for such activity?
> --
>   WulfraedDennis Lee Bieber   KD6MOG
>   [EMAIL PROTECTED]   [EMAIL PROTECTED]
>   HTTP://wlfraed.home.netcom.com/
>   (Bestiaria Support Staff:   [EMAIL PROTECTED])
>   HTTP://www.bestiaria.com/

Use htmlentitydefs and SGMLParser to re-generate it .

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Character encoding

2006-11-08 Thread Frederic Rentsch
mp wrote:
> I have html document titles with characters like >,  , and
> ‡. How do I decode a string with these values in Python?
>
> Thanks
>
>   
This is definitely the most frequently asked question. It comes up about once a week.

The stream-editing way is like this:

>>> import SE
>>> HTM_Decoder = SE.SE ('htm2iso.se')  # Include path

>>> test_string = '''I have html document titles with characters like &gt;,
&nbsp;, and &Dagger;. How do I decode a string with these values in Python?'''
>>> print HTM_Decoder (test_string)
I have html document titles with characters like >,  , and
‡. How do I decode a string with these values in Python?

An SE object does files too.

>>> HTM_Decoder ('with_codes.txt', 'translated_codes.txt')  # Include path

You could download SE from -> http://cheeseshop.python.org/pypi/SE/2.3. The 
translation definitions file "htm2iso.se" is included. If you open it in your 
editor, you can see how to write your own definition files for other 
translation tasks you may have some other time.

Regards

Frederic



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: encoding characters

2007-03-21 Thread Thinker

Valentina Marotto wrote:
>
>
> The problem is about the insertion data into database.
>
> one of errors is:
>
> exceptions.UnicodeEncodeError: 'ascii' codec can't encode
> characters in position 25-26: ordinal not in range(128)
>
>  help me! Thanks
More information, please.
For example, what DBMS is used, and the snippet that runs into the error!

--
Thinker Li - [EMAIL PROTECTED] [EMAIL PROTECTED]
http://heaven.branda.to/~thinker/GinGin_CGI.py

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: encoding confusions

2007-03-29 Thread Marc 'BlackJack' Rintsch
In <[EMAIL PROTECTED]>, Tim Arnold wrote:

> I have the contents of a file that contains French documentation.
> I've iterated over it and now I want to write it out to a file.
> 
> I'm running into problems and I don't understand why--I don't get how the 
> encoding works.
> My first attempt was just this:
> < snipped code for classes, etc; fname is string, codecs module loaded.>
> < self.contents is the French file's contents as a single string >

What is the type of `self.contents`, `str` or `unicode`?  You *decode*
strings to unicode objects and you *encode* unicode objects to strings. 
It doesn't make sense to encode a string in 'latin-1' because it must be
decoded first and the "automatic" decoding assumes ASCII and barfs if
there's something non-ascii in the string.
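For illustration, a minimal sketch of that decode-then-write order (the
file names are made up; the input is assumed to be latin-1 bytes, as the
traceback above suggests):

import codecs

raw = open('doc_fr.tex', 'rb').read()            # byte string
text = raw.decode('latin-1')                     # str -> unicode first
out = codecs.open('out_fr.tex', 'w', encoding='latin-1')
out.write(text)                                  # encoded to latin-1 on the way out
out.close()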

Ciao,
Marc 'BlackJack' Rintsch

-- 
http://mail.python.org/mailman/listinfo/python-list


Encoding / decoding strings

2007-01-05 Thread [EMAIL PROTECTED]
Hey Everyone,

Was just wondering if anyone here could help me. I want to encode (and
subsequently decode) email addresses to use in URLs. I believe that
this can be done using MD5.

I can find documentation for encoding the strings, but not decoding
them. What should I do to encode =and= decode strings with MD5?

Many Thanks in Advance,
Oliver Beattie

-- 
http://mail.python.org/mailman/listinfo/python-list


de/-encoding problem?

2006-09-07 Thread seppl43
Hello,

I'm writing a SOAP-client using SOAPpy for a JAVA-Application (JIRA).
When I try to send attachments, the incoming files are not decoded
correctly. They have about 1.5% more bytes.

What I'm doing is this:

file_obj = file(,'rb')
cont = file_obj.read()
cont64 = base64.encodestring(cont)
chararray1 = array.array('c')
chararray1.fromstring(cont64)
file_obj.close()

# using the method of jira's SOAP-interface
add =
jira_soap.addAttachmentsToIssue(auth,issue_id,[],[chararray1])


This is the WSDL-description of that method:


  
  
  
  
   

Does anybody know, why the files are not de-/encoded correctly?
Thanks for help.

Seppl

-- 
http://mail.python.org/mailman/listinfo/python-list


subprocess stdin encoding

2007-02-04 Thread yc
I have an encoding problem when using subprocess. The input is a
string with UTF-8 encoding.

the code is:

tokenize =
subprocess.Popen(tok_command,stdin=subprocess.PIPE,stdout=subprocess.PIPE,close_fds=True,shell=True)

(tokenized_text,errs) = tokenize.communicate(t)

the error is:
  File "/usr/local/python/lib/python2.5/subprocess.py", line 651, in
communicate
return self._communicate(input)
  File "/usr/local/python/lib/python2.5/subprocess.py", line 1115, in
_communicate
bytes_written = os.write(self.stdin.fileno(), input[:512])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in
position 204: ordinal not in range(128)


How do I change the default encoding from "ascii" to "utf-8"?
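(One way around it - a sketch, assuming t is the unicode string mentioned
above - is to encode the text yourself before it goes down the pipe,
rather than touching the default encoding:)

(tokenized_text, errs) = tokenize.communicate(t.encode('utf-8'))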

Ying Chen

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Python Huffman encoding

2004-11-30 Thread Guyon Morée
Wow Paul!

thanks a lot for your comments! I learned a lot already only by reading
them, I will implement them and see what the speed gains are.

Keep an eye on my blog, I will post an update on it soon!


thanks,
guyon

ps. love this group


"Paul McGuire" <[EMAIL PROTECTED]> wrote in message

>
http://gumuz.looze.net/wordpress/index.php/archives/2004/11/25/huffman-encoding/

> Great first step at some self-learning with a non-trivial test program.
> Here are some performance speedups that will help you out.


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: character encoding conversion

2004-12-12 Thread "Martin v. Löwis"
Dylan wrote:
Things I have tried include encode()/decode()
This should work. If you somehow manage to guess the encoding,
e.g. guess it as cp1252, then
  htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace")
will give you a file that contains only ASCII characters, and
character references for everything else.
Now, how should you guess the encoding? Here is a strategy:
1. use the encoding that was sent through the HTTP header. Be
   absolutely certain to not ignore this encoding.
2. use the encoding in the XML declaration (if any).
3. use the encoding in the http-equiv meta element (if any)
4. use UTF-8
5. use Latin-1, and check that there are no characters in the
   range(128,160)
6. use cp1252
7. use Latin-1
In the order from 1 to 6, check whether you manage to decode
the input. Notice that in step 5, you will definitely get successful
decoding; consider this a failure if you get any control
characters (from range(128, 160)); then, in step 7, try latin-1
again.
When you find the first encoding that decodes correctly, encode
it with ascii and xmlcharrefreplace, and you won't need to worry
about the encoding, anymore.
Regards,
Martin
--
http://mail.python.org/mailman/listinfo/python-list


Re: character encoding conversion

2004-12-12 Thread Christian Ergh
Martin v. Löwis wrote:
Dylan wrote:
Things I have tried include encode()/decode()

This should work. If you somehow manage to guess the encoding,
e.g. guess it as cp1252, then
  htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace")
will give you a file that contains only ASCII characters, and
character references for everything else.
Now, how should you guess the encoding? Here is a strategy:
1. use the encoding that was sent through the HTTP header. Be
   absolutely certain to not ignore this encoding.
2. use the encoding in the XML declaration (if any).
3. use the encoding in the http-equiv meta element (if any)
4. use UTF-8
5. use Latin-1, and check that there are no characters in the
   range(128,160)
6. use cp1252
7. use Latin-1
In the order from 1 to 6, check whether you manage to decode
the input. Notice that in step 5, you will definitely get successful
decoding; consider this a failure if you have get any control
characters (from range(128, 160)); then try in step 7 latin-1
again.
When you find the first encoding that decodes correctly, encode
it with ascii and xmlcharrefreplace, and you won't need to worry
about the encoding, anymore.
Regards,
Martin
I have a similar problem, with characters like äöüÄÖÜß and so on. I am 
extracting some content out of webpages, and they deliver whatever, 
sometimes not even giving any encoding information in the header. But 
your solution sounds quite good, I just do not know
- whether it works with the characters I mentioned
- what encoding you have in the end
- and how exactly you do all this? All with somestring.decode() 
or... Can you please give an example for these 7 steps?
Thanx in advance for the help
Chris
--
http://mail.python.org/mailman/listinfo/python-list


Re: character encoding conversion

2004-12-12 Thread "Martin v. Löwis"
Christian Ergh wrote:
- it works with the characters i mentioned
It does.
- what encoding do you have in the end
US-ASCII
- and how exactly are you doing all this? All with somestring.decode() 
or... Can you please give an example for these 7 steps?
I could, but I don't have the time - just try to come up with some
code, and I try to comment on it.
Regards,
Martin
--
http://mail.python.org/mailman/listinfo/python-list


Re: character encoding conversion

2004-12-13 Thread Steven Bethard
Christian Ergh wrote:
flag = true
for char in data:
    if 127 < ord(char) < 128:
        flag = false
if flag:
    try:
        data = data.encode('latin-1')
    except:
        pass
A little OT, but (assuming I got your indentation right[1]) this kind of 
loop is exactly what the else clause of a for-loop is for:

for char in data:
    if 127 < ord(char) < 128:
        break
else:
    try:
        data = data.encode('latin-1')
    except:
        pass
Only saves you one line of code, but you don't have to keep track of a 
'flag' variable.  Generally, I find that when I want to set a 'flag' 
variable, I can usually do it with a for/else instead.

Steve
[1] Messed up indentation happens in a lot of clients if you have tabs 
in your code.  If you can replace tabs with spaces before posting, this 
usually solves the problem.
--
http://mail.python.org/mailman/listinfo/python-list



Re: character encoding conversion

2004-12-13 Thread Peter Otten
Steven Bethard wrote:

> Christian Ergh wrote:
>> flag = true
>> for char in data:
>> if 127 < ord(char) < 128:
>> flag = false
>> if flag:
>> try:
>> data = data.encode('latin-1')
>> except:
>> pass
> 
> A little OT, but (assuming I got your indentation right[1]) this kind of
> loop is exactly what the else clause of a for-loop is for:
> 
> for char in data:
>  if 127 < ord(char) < 128:
>  break
> else:
>  try:
>  data = data.encode('latin-1')
>  except:
>  pass
> 
> Only saves you one line of code, but you don't have to keep track of a
> 'flag' variable.  Generally, I find that when I want to set a 'flag'
> variable, I can usually do it with a for/else instead.
> 
> Steve
> 
> [1] Messed up indentation happens in a lot of clients if you have tabs
> in your code.  If you can replace tabs with spaces before posting, this
> usually solves the problem.

Even more off-topic:

>>> for char in data:
...     if 127 < ord(char) < 128:
...         break
...
>>> print char
127.5

:-)

Peter

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: character encoding conversion

2004-12-13 Thread Christian Ergh
Martin v. Löwis wrote:
Dylan wrote:
Things I have tried include encode()/decode()

This should work. If you somehow manage to guess the encoding,
e.g. guess it as cp1252, then
  htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace")
will give you a file that contains only ASCII characters, and
character references for everything else.
Now, how should you guess the encoding? Here is a strategy:
1. use the encoding that was sent through the HTTP header. Be
   absolutely certain to not ignore this encoding.
2. use the encoding in the XML declaration (if any).
3. use the encoding in the http-equiv meta element (if any)
4. use UTF-8
5. use Latin-1, and check that there are no characters in the
   range(128,160)
6. use cp1252
7. use Latin-1
In the order from 1 to 6, check whether you manage to decode
the input. Notice that in step 5, you will definitely get successful
decoding; consider this a failure if you have get any control
characters (from range(128, 160)); then try in step 7 latin-1
again.
When you find the first encoding that decodes correctly, encode
it with ascii and xmlcharrefreplace, and you won't need to worry
about the encoding, anymore.
Regards,
Martin
Something like this?
Chris
import urllib2
url = 'www.someurl.com'
f = urllib2.urlopen(url)
data = f.read()
# if it is not in the pagecode, how do i get the encoding of the page?
pageencoding = ???
xmlencoding  = 'whatever i parsed out of the file'
htmlmetaencoding = 'whatever i parsed out of the metatag'
f.close()
try:
    data = data.decode(pageencoding)
except:
    try:
        data = data.decode(xmlencoding)
    except:
        try:
            data = data.decode(htmlmetaencoding)
        except:
            try:
                data = data.encode('UTF-8')
            except:
                flag = true
                for char in data:
                    if 127 < ord(char) < 128:
                        flag = false
                if flag:
                    try:
                        data = data.encode('latin-1')
                    except:
                        pass
                try:
                    data = data.encode('cp1252')
                except:
                    pass
                try:
                    data = data.encode('latin-1')
                except:
                    pass
data = data.encode("ascii", "xmlcharrefreplace")
--
http://mail.python.org/mailman/listinfo/python-list


Re: character encoding conversion

2004-12-13 Thread Christian Ergh
Peter Otten wrote:
Steven Bethard wrote:

Christian Ergh wrote:
flag = true
for char in data:
   if 127 < ord(char) < 128:
   flag = false
if flag:
   try:
   data = data.encode('latin-1')
   except:
   pass
A little OT, but (assuming I got your indentation right[1]) this kind of
loop is exactly what the else clause of a for-loop is for:
for char in data:
if 127 < ord(char) < 128:
break
else:
try:
data = data.encode('latin-1')
except:
pass
Only saves you one line of code, but you don't have to keep track of a
'flag' variable.  Generally, I find that when I want to set a 'flag'
variable, I can usually do it with a for/else instead.
Steve
[1] Messed up indentation happens in a lot of clients if you have tabs
in your code.  If you can replace tabs with spaces before posting, this
usually solves the problem.

Even more off-topic:

for char in data:
... if 127 < ord(char) < 128:
... break
...
print char
127.5
:-)
Peter
Well yes, that happens when doing a quick hack and not reviewing it, 128 
has to be 160 of course...
--
http://mail.python.org/mailman/listinfo/python-list


Re: character encoding conversion

2004-12-13 Thread Christian Ergh
Once more, indentation should be correct now, and the 128 is gone too. So, 
something like this?
Chris

import urllib2
url = 'www.someurl.com'
f = urllib2.urlopen(url)
data = f.read()
# if it is not in the pagecode, how do i get the encoding of the page?
pageencoding = '???'
xmlencoding  = 'whatever i parsed out of the file'
htmlmetaencoding = 'whatever i parsed out of the metatag'
f.close()
try:
    data = data.decode(pageencoding)
except:
    try:
        data = data.decode(xmlencoding)
    except:
        try:
            data = data.decode(htmlmetaencoding)
        except:
            try:
                data = data.encode('UTF-8')
            except:
                flag = true
                for char in data:
                    if 127 < ord(char) < 160:
                        flag = false
                if flag:
                    try:
                        data = data.encode('latin-1')
                    except:
                        pass
                try:
                    data = data.encode('cp1252')
                except:
                    pass
                try:
                    data = data.encode('latin-1')
                except:
                    pass
data = data.encode("ascii", "xmlcharrefreplace")
--
http://mail.python.org/mailman/listinfo/python-list


Re: character encoding conversion

2004-12-13 Thread Christian Ergh
- snip -
def get_encoded(st, encodings):
    "Returns an encoding that doesn't fail"
    for encoding in encodings:
        try:
            st_encoded = st.decode(encoding)
            return st_encoded, encoding
        except UnicodeError:
            pass
-snip-
This works fine, but after this you have three possible encodings (or 
even more, looking at the data in the net you'll see a lot of 
encodings...)- what we need is just one for all.
Chris
--
http://mail.python.org/mailman/listinfo/python-list


Re: character encoding conversion

2004-12-13 Thread Christian Ergh
Dylan wrote:
Here's what I'm trying to do:
- scrape some html content from various sources
The issue I'm running to:
- some of the sources have incorrectly encoded characters... for
example, cp1252 curly quotes that were likely the result of the author
copying and pasting content from Word
Finally: for me this works, all inside my own class, and the module has 
a logger; for reuse you would need to fix this stuff... I am updating a 
PostgreSQL database, in case someone wonders about the __setattr__, and 
my class inherits from SQLObject.

def doDecode(self, st):
    "Returns an encoding that doesn't fail"
    for encoding in encodings:
        try:
            stEncoded = st.decode(encoding)
            return stEncoded
        except UnicodeError:
            pass

def setAttribute(self, name, data):
    import HTMLFilter
    data = self.doDecode(data)
    try:
        data = data.encode('ascii', "xmlcharrefreplace")
    except:
        log.warn('new method did not fit')
    try:
        if '&#' in data:
            data = HTMLFilter.HTMLDecode(data)
    except UnicodeDecodeError:
        log.debug('HTML decoding failed!!!')
    try:
        data = data.encode('utf-8')
    except:
        log.warn('new utf 8 method did not fit')
    try:
        self.__setattr__(name, data)
    except:
        log.debug('1. try failed: ')
        log.warning(type(data))
        log.debug(data)
        log.warning('Some unicode error while updating')
--
http://mail.python.org/mailman/listinfo/python-list


Re: character encoding conversion

2004-12-13 Thread Christian Ergh
Forgot a part... You need the encoding list:
encodings = [
'utf-8',
'latin-1',
'ascii',
'cp1252',
]
Christian Ergh wrote:
- snip -
--
http://mail.python.org/mailman/listinfo/python-list


Re: character encoding conversion

2004-12-13 Thread "Martin v. Löwis"
Christian Ergh wrote:
Once more, the indentation should be correct now, and the 128 is gone too. So, 
something like this?
Yes, something like this. The tricky part is, of course, the
fragments which you didn't implement.
Also, it might be possible to do this in a for loop, e.g.
for encoding in (pageencoding, xmlencoding, htmlmetaencoding,
                 "UTF-8", "Latin-1-no-controls", "cp1252", "Latin-1"):
    try:
        data = data.encode(encoding)
        break
    except UnicodeError:
        pass
You then just need to add the Latin-1-no-controls codec, or you need
to special-case this in the loop.
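A sketch of what that special case could look like (the helper name below
is made up; "Latin-1-no-controls" is not an existing codec):

def decode_latin1_no_controls(data):
    # Behave like Latin-1, but refuse the C1 control range 0x80-0x9F,
    # which real text practically never uses.
    text = data.decode('latin-1')
    for ch in text:
        if 0x7F < ord(ch) < 0xA0:
            raise UnicodeError("C1 control character %r found" % ch)
    return text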
# if it is not in the pagecode, how do i get the encoding of the page?
pageencoding = '???'
You need to remember the HTTP connection that you got the HTML file
from. The webserver may have sent a Content-Type header.
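With urllib2, for instance, the response object keeps those headers around.
A sketch (getparam() is the Python 2 mimetools.Message API, and the URL is
just a placeholder):

import urllib2

f = urllib2.urlopen('http://www.someurl.com/')
data = f.read()
# e.g. "Content-Type: text/html; charset=iso-8859-1"
pageencoding = f.info().getparam('charset')   # None if the server didn't say
f.close()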
xmlencoding  = 'whatever i parsed out of the file'
htmlmetaencoding = 'whatever i parsed out of the metatag'
Depending on the library you use, these aren't that trivial, either.
Regards,
Martin
--
http://mail.python.org/mailman/listinfo/python-list


Re: character encoding conversion

2004-12-13 Thread "Martin v. Löwis"
Max M wrote:
A simple way to try out different encodings in a given order:
The loop is fine - although ('UTF-8', 'Latin-1', 'ASCII') is
somewhat redundant. The 'ASCII' case is never considered, since
Latin-1 effectively works as a catch-all encoding (as all byte
sequences can be considered Latin-1 - whether they are meaningful
data is a different question).
Regards,
Martin
--
http://mail.python.org/mailman/listinfo/python-list


Re: character encoding conversion

2004-12-13 Thread Max M
Christian Ergh wrote:
A simple way to try out different encodings in a given order:
# -*- coding: latin-1 -*-
def get_encoded(st, encodings):
    "Returns an encoding that doesn't fail"
    for encoding in encodings:
        try:
            st_encoded = st.decode(encoding)
            return st_encoded, encoding
        except UnicodeError:
            pass
st = 'Test characters æøå ÆØÅ'
encodings = ['utf-8', 'latin-1', 'ascii', ]
print get_encoded(st, encodings)
(u'Test characters \xe6\xf8\xe5 \xc6\xd8\xc5', 'latin-1')
--
hilsen/regards Max M, Denmark
http://www.mxm.dk/
IT's Mad Science
--
http://mail.python.org/mailman/listinfo/python-list


unicode encoding usability problem

2005-02-18 Thread aurora
I have long found the Python default encoding of strict ASCII frustrating.
For one thing I would prefer to get garbage characters rather than an
exception. But the biggest issue is that Unicode exceptions often pop up in
unexpected places, and only when a non-ASCII or unicode character first
finds its way into the system.

Below is an example. The program may run fine at the beginning, but as
soon as a unicode character u'b' is introduced, it blows up unexpectedly.

>>> sys.getdefaultencoding()
'ascii'
>>> a='\xe5'
>>> # can print, you think you're ok
... print a
å
>>> b=u'b'
>>> a==b
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0:
ordinal not in range(128)


One may suggest the correct way to do it is to use decode, such as
  a.decode('latin-1') == b
This brings up another issue. Most references and books focus exclusively on
entering unicode literals and using the encode/decode methods. The fallacy
is that string is such a basic data type, used throughout the program, that
you really don't want to make an individual decision every time you use a
string (and take a penalty for any negligence). Java has a much more usable
model, with unicode used internally and encoding/decoding decisions needed
only twice, when dealing with input and output.
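For completeness, here is a rough sketch of that boundary model in Python
(file names and encodings below are just placeholders):

import codecs

f = codecs.open('input.txt', 'r', 'latin-1')    # decode once, on the way in
text = f.read()                                 # unicode from here on
f.close()

# ... all internal processing works on unicode objects ...

out = codecs.open('output.txt', 'w', 'utf-8')   # encode once, on the way out
out.write(text)
out.close()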

I am sure these errors are a nuisance to those who are only half conscious of
unicode. Even for those who choose to use unicode, it is almost impossible
to ensure their programs work correctly.
--
http://mail.python.org/mailman/listinfo/python-list


Source Encoding GBK/GB2312

2005-02-23 Thread steven
When I specify an source encoding such as:

# -*- coding: GBK -*-
or
# -*- coding: GB2312 -*-

as the first line of source, I got the following error:

SyntaxError: 'unknown encoding: GBK'


Does this mean Python does not support GBK/GB2312?  What do I do?
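Here is a quick check I could run to see whether the codec exists at all
(just a sketch; it assumes the CJK codecs that ship with Python 2.4):

import codecs

for name in ('gbk', 'gb2312'):
    try:
        codecs.lookup(name)
        print name, 'is available'
    except LookupError:
        print name, 'is missing - older Pythons need a separate codec package'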

-
narke

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unknown encoding problem

2005-04-08 Thread Peter Otten
Uwe Mayer wrote:

> I need to read in a text file which seems to be stored in some unknown
> encoding. Opening and reading the files content returns:
> 
>>>> f.read()
> '\x00 \x00 \x00<\x00l\x00o\x00g\x00E\x00n\x00t\x00r\x00y\x00...
> 
> Each character has a \x00 prepended to it. I suspect its some kind of
> unicode - how do I get rid of it?

Intermittent '\x00' bytes are indeed strong evidence for unicode. Use
codecs.open() to access the data in such a file:

>>> import codecs
>>> f = codecs.open(filename, "r", "UTF-16-BE")
>>> f.read()
u'  <logEntry ...'
>>> _.encode("latin1")
'  <logEntry ...'
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unknown encoding problem

2005-04-08 Thread Leif K-Brooks
Uwe Mayer wrote:
Hi,
I need to read in a text file which seems to be stored in some unknown
encoding. Opening and reading the files content returns:

f.read()
'\x00 \x00 \x00<\x00l\x00o\x00g\x00E\x00n\x00t\x00r\x00y\x00...
Each character has a \x00 prepended to it. I suspect its some kind of
unicode - how do I get rid of it? 
f.read().decode('utf16')
--
http://mail.python.org/mailman/listinfo/python-list


Re: unknown encoding problem

2005-04-08 Thread John Machin
On Fri, 08 Apr 2005 15:45:35 +0200, Uwe Mayer <[EMAIL PROTECTED]>
wrote:

>Hi,
>
>I need to read in a text file which seems to be stored in some unknown
>encoding. Opening and reading the files content returns:
>
>>>> f.read()
>'\x00 \x00 \x00<\x00l\x00o\x00g\x00E\x00n\x00t\x00r\x00y\x00...
>
>Each character has a \x00 prepended to it. I suspect its some kind of
>unicode - how do I get rid of it? 
>

Interesting attitude. Why do you want to "get rid of it"? Have you
considered investigating the source of this suspicious text? You never
know, there could be something really interesting in there, like
'\x00v\x00o\x00n\x00 \x04\x1c\x04>\x04A\x04:\x042\x040\x00
\x00m\x00i\x00t\x00 \x00L\x00i\x00e\x00b' :-)

> str.replace('\x00', '')

Why not go the whole hog:

''.join([c for c in foreign_text if 32 <= ord(c) <= 126 or c in
'\t\r\n'])

Alternatively, try embracing Unicode -- it's the way forward, and it's
not that difficult.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode encoding problem

2005-04-28 Thread "Martin v. Löwis"
[EMAIL PROTECTED] wrote:
> So how do I tell what encoding my unicode string is in, and how do I
> retrieve that when I read it from a file?

In interactive mode, you best avoid non-ASCII characters in a Unicode
literal.

In theory, Python should look at sys.stdin.encoding when processing
the interactive source. In practice, various Python releases ignore
sys.stdin.encoding, and just assume it is Latin-1. What is
sys.stdin.encoding on your system?

Regards,
Martin
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: utf encoding error

2005-05-04 Thread Mike Thompson
Timothy Smith wrote:
> hi there, this one is in relation to my py2exe saga.
> 
> when i compile a package using py2exe i get the error msg below, if i 
> just run the py files it doesn't error, so i assume pysvn is trying to 
> use something thats not being included in the build. only i have no idea 
> where to start looking.
> 
> Traceback (most recent call last):
>  File "Main.pyc", line 819, in ValidateLogin
>  File "Main.pyc", line 861, in ShowMainFrameItems
> LookupError: unknown encoding: utf-8
> 
> 
> and here is my setup.py for py2exe for good measure
> 
> from distutils.core import setup
> import py2exe
>   package_dir = ['c:\python23\reportlab', 'c:\python23\lib',
>   'Z:\Projects\PubWare\trunk\python']
>   setup(windows=[ 
> "Z:\\Projects\\PubWare\\trunk\\python\\PubWare.py"])


Py2exe has a Wiki which answers a lot of these questions. In general, 
you might find it useful to look there.

For your problem: you need to explicitly include the encodings module 
for utf8. Using static analysis Py2exe can't know to include it in the 
built exe. Something like this should work for you ...

# setup.py

from distutils.core import setup
import py2exe

explicitIncludes = [
    "encodings.utf_8",
]

opts = {
    "py2exe": {
        "includes": explicitIncludes,
        # "optimize" : 2,
    }
}

setup(
    windows = ["XX.py"],
    options = opts,
)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode encoding problem

2005-05-11 Thread TZOTZIOY
On Thu, 28 Apr 2005 23:53:02 +0200, rumours say that "Martin v. Löwis"
<[EMAIL PROTECTED]> might have written:

>In theory, Python should look at sys.stdin.encoding when processing
>the interactive source. In practice, various Python releases ignore
>sys.stdin.encoding, and just assume it is Latin-1. What is
>sys.stdin.encoding on your system?

The difference between theory and practice is that in theory there is no
difference.
-- 
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
-- 
http://mail.python.org/mailman/listinfo/python-list


sqlite utf8 encoding error

2005-11-17 Thread Greg Miller
I have an application that uses sqlite3 to store job/error data.  When
I log in as a German user the error codes generated are translated into
German.  The error code text is then stored in the db.  When I use the
fetchall() to retrieve the data to generate a report I get the
following error:

Traceback (most recent call last):
  File "c:\Pest3\Glosser\baseApp\reportGen.py", line 199, in
OnGenerateButtonNow
self.OnGenerateButton(event)
  File "c:\Pest3\Glosser\baseApp\reportGen.py", line 243, in
OnGenerateButton
warningresult = messagecursor1.fetchall()
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 13-18:
unsupported Unicode code range

Does anyone have any idea what could be going wrong?  The string
that I store in the database table is:

'Keinen Text für Übereinstimmungsfehler gefunden'

I thought that all strings were stored in unicode in sqlite.
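If it helps, here is a sketch of what I now suspect I should be doing
instead - decoding the text to unicode before it goes into the database.
The module choice, table and database are made up for the example:

import sqlite3   # assuming the sqlite3/pysqlite DB-API module

conn = sqlite3.connect(':memory:')   # in-memory db just for the sketch
cur = conn.cursor()
cur.execute("CREATE TABLE messages (text TEXT)")
text = 'Keinen Text f\xfcr \xdcbereinstimmungsfehler gefunden'  # latin-1 bytes
cur.execute("INSERT INTO messages (text) VALUES (?)",
            (text.decode('latin-1'),))   # store a unicode object
conn.commit()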

Greg Miller

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Detect character encoding

2005-12-04 Thread Scott David Daniels
Michal wrote:
> Hello,
> is there any way how to detect string encoding in Python?
> 
> I need to proccess several files. Each of them could be encoded in 
> different charset (iso-8859-2, cp1250, etc). I want to detect it, and 
> encode it to utf-8 (with string function encode).
> 
> Thank you for any answer
> Regards
> Michal
The two ways to detect a string's encoding are:
   (1) know the encoding ahead of time
   (2) guess correctly

This is the whole point of Unicode -- an encoding that works for _lots_
of languages.

--Scott David Daniels
[EMAIL PROTECTED]
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Detect character encoding

2005-12-04 Thread Diez B. Roggisch
Michal wrote:
> Hello,
> is there any way how to detect string encoding in Python?
> 
> I need to proccess several files. Each of them could be encoded in 
> different charset (iso-8859-2, cp1250, etc). I want to detect it, and 
> encode it to utf-8 (with string function encode).

You can only guess, e.g. by looking for words that contain umlauts. 
Recode might be of help here; it has such heuristics built in, AFAIK.

But there is _no_ way to be absolutely sure. 8bit are 8bit, so each file 
is "legal" in all encodings.


Diez
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Detect character encoding

2005-12-04 Thread Mike Meyer
"Diez B. Roggisch" <[EMAIL PROTECTED]> writes:
> Michal wrote:
>> is there any way how to detect string encoding in Python?
>> I need to proccess several files. Each of them could be encoded in
>> different charset (iso-8859-2, cp1250, etc). I want to detect it,
>> and encode it to utf-8 (with string function encode).
> But there is _no_ way to be absolutely sure. 8bit are 8bit, so each
> file is "legal" in all encodings.

Not quite. Some encodings don't use all the valid 8-bit characters, so
if you encounter a character not in an encoding, you can eliminate it
from the list of possible encodings. This doesn't really help much by
itself, though.

  http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Detect character encoding

2005-12-04 Thread Nemesis
While I was thinking up a nice intro, "Michal" wrote:

> Hello,
> is there any way how to detect string encoding in Python?
> I need to proccess several files. Each of them could be encoded in 
> different charset (iso-8859-2, cp1250, etc). I want to detect it, and 
> encode it to utf-8 (with string function encode).
> Thank you for any answer

Hi,
As you already heard you can't be sure but you can guess.

I use a method like this:

def guess_encoding(text):
    for best_enc in guess_list:
        try:
            unicode(text, best_enc, "strict")
        except:
            pass
        else:
            break
    return best_enc

'guess_list' is an ordered charset name list like this:

['us-ascii','iso-8859-1','iso-8859-2',...,'windows-1250','windows-1252'...]

of course you can remove charsets you are sure you'll never find.
-- 
This could really be the spark that makes the drop overflow.
 
 |\ |   |HomePage   : http://nem01.altervista.org
 | \|emesis |XPN (my nr): http://xpn.altervista.org

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Detect character encoding

2005-12-04 Thread B Mahoney
You may want to look at some Python Cookbook recipes, such as
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52257
"Auto-detect XML encoding"  by Paul Prescod

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Detect character encoding

2005-12-04 Thread Martin P. Hellwig
Mike Meyer wrote:
> "Diez B. Roggisch" <[EMAIL PROTECTED]> writes:
>> Michal wrote:
>>> is there any way how to detect string encoding in Python?
>>> I need to proccess several files. Each of them could be encoded in
>>> different charset (iso-8859-2, cp1250, etc). I want to detect it,
>>> and encode it to utf-8 (with string function encode).
>> But there is _no_ way to be absolutely sure. 8bit are 8bit, so each
>> file is "legal" in all encodings.
> 
> Not quite. Some encodings don't use all the valid 8-bit characters, so
> if you encounter a character not in an encoding, you can eliminate it
> from the list of possible encodings. This doesn't really help much by
> itself, though.
> 
> http://mail.python.org/mailman/listinfo/python-list


Re: Detect character encoding

2005-12-04 Thread skip
Martin> I read or heard (can't remember the origin) that MS IE has a
Martin> quite good implementation of guessing the language en character
    Martin> encoding of web pages when there not or falsely specified.

Gee, that's nice.  Too bad the source isn't available... <0.5 wink>

Skip
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Detect character encoding

2005-12-04 Thread Diez B. Roggisch
Mike Meyer wrote:
> "Diez B. Roggisch" <[EMAIL PROTECTED]> writes:
> 
>>Michal wrote:
>>
>>>is there any way how to detect string encoding in Python?
>>>I need to proccess several files. Each of them could be encoded in
>>>different charset (iso-8859-2, cp1250, etc). I want to detect it,
>>>and encode it to utf-8 (with string function encode).
>>
>>But there is _no_ way to be absolutely sure. 8bit are 8bit, so each
>>file is "legal" in all encodings.
> 
> 
> Not quite. Some encodings don't use all the valid 8-bit characters, so
> if you encounter a character not in an encoding, you can eliminate it
> from the list of possible encodings. This doesn't really help much by
> itself, though.


- test.py
for enc in ["cp1250", "latin1", "iso-8859-2"]:
    print enc
    try:
        str.decode("".join([chr(i) for i in xrange(256)]), enc)
    except UnicodeDecodeError, e:
        print e
-

192:~ deets$ python2.4 /tmp/test.py
cp1250
'charmap' codec can't decode byte 0x81 in position 129: character maps
to <undefined>
latin1
iso-8859-2

So cp1250 doesn't have all codepoints defined - but the others do. 
Sure, this helps you to eliminate one of the three choices the OP wanted 
to choose between - but how many texts do you have that contain a byte 129?

Regards,

Diez
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Detect character encoding

2005-12-04 Thread François Pinard
[Diez B. Roggisch]
>Michal wrote:

>> is there any way how to detect string encoding in Python?

>Recode might be of help here, it has such heuristics built in AFAIK.

If we are speaking about the same Recode ☺, there are some built in 
tools that could help a human to discover a charset, but this requires 
work and time, and is far from fully automated as one might dream.  
While some charsets could be guessed almost correctly by automatic 
means, most are difficult to recognise.  The whole problem is not easy.

-- 
François Pinard   http://pinard.progiciels-bpi.ca
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Detect character encoding

2005-12-04 Thread Martin v. Löwis
Martin P. Hellwig wrote:
>  From what I can remember is that they used an algorithm to create some 
> statistics of the specific page and compared that with statistic about 
> all kinds of languages and encodings and just mapped the most likely.

More hearsay: I believe language-based heuristics are common. You first
guess an encoding based on the bytes you see, then guess a language of 
the page. If you then get a lot of characters that should not appear
in texts of the language (e.g. a lot of umlaut characters in a French
page), you know your guess was wrong, and you try a different language
for that encoding. If you run out of languages, you guess a different
encoding.

Mozilla can guess the encoding if you tell it what the language is,
which sounds like a similar approach.

Regards,
Martin
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Detect character encoding

2005-12-04 Thread Martin v. Löwis
Diez B. Roggisch wrote:
> So cp1250 doesn't have all codepoints defined - but the others have. 
> Sure, this helps you to eliminate 1 of the three choices the OP wanted 
> to choose between - but how many texts you have that have a 129 in them?

For the iso8859 ones, you should assume that the characters in
range(128, 160) really aren't used. If you get one of these, and it is
not utf-8, it is a Windows code page.

UTF-8 can be recognized pretty reliably: even though it allows all byte values
to appear, it is very constrained in which sequences of bytes it allows.
E.g. you can't have a single byte >127 on its own in UTF-8; you need at least
two of them in sequence, and they need to meet further constraints.
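A small illustration of that point - a strict UTF-8 decode doubles as a
cheap validity check (sketch only):

def looks_like_utf8(data):
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

print looks_like_utf8('\xc3\xa4')   # True  - a valid two-byte sequence
print looks_like_utf8('\xe5')       # False - a lone byte > 127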

Regards,
Martin
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Detect character encoding

2005-12-05 Thread Michal
Thanks everybody for helpfull advices.

Michal
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Detect character encoding

2005-12-05 Thread Kent Johnson
Martin P. Hellwig wrote:
> I read or heard (can't remember the origin) that MS IE has a quite good 
> implementation of guessing the language en character encoding of web 
> pages when there not or falsely specified.

Yes, I think that's right. In my experience MS Word does a very good job 
of guessing the encoding of text files.

Kent
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Detect character encoding

2005-12-05 Thread jepler
Perhaps this project's code or ideas could be of service:
http://freshmeat.net/projects/enca/

Jeff


-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Detect character encoding

2005-12-05 Thread The new guy
Michal wrote:

> Hello,
> is there any way how to detect string encoding in Python?
> 
> I need to proccess several files. Each of them could be encoded in
> different charset (iso-8859-2, cp1250, etc). I want to detect it, and
> encode it to utf-8 (with string function encode).

Well, about how to detect it in Python, I can't help. My first guess,
though, would be to have a look at the source code of the "file" utility.
This is an example of what it does:

# ls
de.i18n  en.i18n
# file *
de.i18n: ISO-8859 text, with very long lines
en.i18n: ISO-8859 English text, with very long lines

cheers
-- 
http://mail.python.org/mailman/listinfo/python-list


Encoding of file names

2005-12-08 Thread utabintarbo
Here is my situation:

I am trying to programatically access files created on an IBM AIX
system, stored on a Sun OS 5.8 fileserver, through a samba-mapped drive
on a Win32 system. Not confused? OK, let's move on... ;-)

When I ask for an os.listdir() of a relevant directory, I get filenames
with embedded escaped characters (ex.
'F07JS41C.04389525AA.UPR\xa6INR.E\xa6C-P.D11.081305.P2.KPF.model')
for which os.path.isfile() returns False. I wish to apply some
operations to these files, but am unable to, since Python (on Win32, at
least) does not recognize these as valid filenames.

Help me, before my thin veneer of genius is torn from my boss's eyes!
;-)
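One thing I have not tried yet (just a sketch; the drive and path below are
made up): passing a unicode path to os.listdir(), so that Win32 returns
unicode file names instead of byte strings squeezed through the ANSI code
page.

import os

path = u'S:\\mapped\\dir'
for name in os.listdir(path):   # unicode path in, unicode names out
    print repr(name), os.path.isfile(os.path.join(path, name))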

-- 
http://mail.python.org/mailman/listinfo/python-list


valide html - Encoding/Decoding

2005-12-19 Thread rabby
Hello world!
How do I get the string "/ + ( ) + \ + [ ] + : + äöü" converted into
valid HTML?
"...".decode("html") does not work :(
Thank you for any help.
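Something like this sketch might be what I am after (whether cgi plus
xmlcharrefreplace is the right approach is only my guess):

# -*- coding: utf-8 -*-
import cgi

s = u'/ + ( ) + \\ + [ ] + : + äöü'
# escape HTML-special characters, then turn non-ASCII into character references
html = cgi.escape(s).encode('ascii', 'xmlcharrefreplace')
print html   # / + ( ) + \ + [ ] + : + &#228;&#246;&#252;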

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: python encoding bug?

2005-12-31 Thread Vincent Wehren
<[EMAIL PROTECTED]> wrote in message 
news:[EMAIL PROTECTED]
|
| I was playing with python encodings and noticed this:
|
| [EMAIL PROTECTED]:~$ python2.4
| Python 2.4 (#2, Dec  3 2004, 17:59:05)
| [GCC 3.3.5 (Debian 1:3.3.5-2)] on linux2
| Type "help", "copyright", "credits" or "license" for more information.
| >>> unicode('\x9d', 'iso8859_1')
| u'\x9d'
| >>>
|
| U+009D is NOT a valid unicode character (it is not even a iso8859_1
| valid character)

That statement is not entirely true. If you check the current 
UnicodeData.txt (on http://www.unicode.org/Public/UNIDATA/)  you'll find:

009D;;Cc;0;BN;N;OPERATING SYSTEM COMMAND

Regards,

Vincent Wehren

|
| The same happens if I use 'latin-1' instead of 'iso8859_1'.
|
| This caught me by surprise, since I was doing some heuristics guessing
| string encodings, and 'iso8859_1' gave no errors even if the input
| encoding was different.
|
| Is this a known behaviour, or I discovered a terrible unknown bug in 
python encoding
| implementation that should be immediately reported and fixed? :-)
|
|
| happy new year,
|
| -- 
| ---
|| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
|| __..--^^^--..__garabik @ kassiopeia.juls.savba.sk |
| ---
| Antivirus alert: file .signature infected by signature virus.
| Hi! I'm a signature virus! Copy me into your signature file to help me 
spread! 


-- 
http://mail.python.org/mailman/listinfo/python-list

Re: python encoding bug?

2005-12-31 Thread Benjamin Niemann
[EMAIL PROTECTED] wrote:

> 
> I was playing with python encodings and noticed this:
> 
> [EMAIL PROTECTED]:~$ python2.4
> Python 2.4 (#2, Dec  3 2004, 17:59:05)
> [GCC 3.3.5 (Debian 1:3.3.5-2)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>>>> unicode('\x9d', 'iso8859_1')
> u'\x9d'
>>>>
> 
> U+009D is NOT a valid unicode character (it is not even a iso8859_1
> valid character)

It *IS* a valid unicode and iso8859-1 character, so the behaviour of the
python decoder is correct. The range U+0080 - U+009F is used for various
control characters. There's rarely a valid use for these characters in
documents, so you can be pretty sure that a document using these characters
is windows-1252 - it is valid iso-8859-1, but for a heuristic guess it's
probably safer to assume windows-1252.

If you want an exception to be thrown, you'll need to implement your own
codec, something like 'iso8859_1_nocc' - mmm.. I could try this myself,
because I do such a test in one of my projects, too ;)
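Until then, a plain function version of that heuristic (just a sketch, not
the codec itself):

def guess_8bit(data):
    # Bytes 0x80-0x9F are legal iso-8859-1 control characters, but almost
    # never intentional, so prefer windows-1252 when they show up.
    for byte in data:
        if 0x7F < ord(byte) < 0xA0:
            return data.decode('cp1252', 'replace'), 'windows-1252'
    return data.decode('iso8859_1'), 'iso8859_1'

print guess_8bit('na\x93ve')    # (u'na\u201cve', 'windows-1252')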

> The same happens if I use 'latin-1' instead of 'iso8859_1'.
> 
> This caught me by surprise, since I was doing some heuristics guessing
> string encodings, and 'iso8859_1' gave no errors even if the input
> encoding was different.
> 
> Is this a known behaviour, or I discovered a terrible unknown bug in
> python encoding implementation that should be immediately reported and
> fixed? :-)
> 
> 
> happy new year,
> 

-- 
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://www.odahoda.de/
-- 
http://mail.python.org/mailman/listinfo/python-list


encoding during elementtree serialization

2006-02-08 Thread Chris McDonough
ElementTree's XML serialization routine implied by tree._write(file, 
node, encoding, namespaces) looks like this (elided):

    def _write(self, file, node, encoding, namespaces):
        # write XML to file
        tag = node.tag
        if tag is Comment:
            file.write("<!-- %s -->" % _escape_cdata(node.text, encoding))
        elif tag is ProcessingInstruction:
            file.write("<?%s?>" % _escape_cdata(node.text, encoding))
        else:
            ...
            file.write("<" + _encode(tag, encoding))
            if items or xmlns_items:
                items.sort() # lexical order
Note that "_escape_cdata" (which also performs encoding) and "_encode" 
are called for pcdata (and attribute values) only, but not for the tag 
literals like "<" and "</tag>".

In some profiling I've done, I believe encoding during recursion makes 
serialization slightly slower than it could be if we could get away with 
not encoding any pcdata or attribute values during recursion.

Instead, we might be able to get away with encoding everything just once 
at the end.  But I don't know if this is kosher.  Is there any reason to 
not also encode tag literals and quotation marks that are attribute 
containers, just once, at the end of serialization?

Even if that's not acceptable in general because tag literals cannot be 
encoded, would it be acceptable for "ascii-compatible" encodings like 
utf-8, latin-1, and friends?

Something like:

def _escape_cdata(text, encoding=None, replace=string.replace):
    # doesn't do any encoding
    text = replace(text, "&", "&amp;")
    text = replace(text, "<", "&lt;")
    text = replace(text, ">", "&gt;")
    return text

class _ElementInterface:

    ...

    def write(self, file, encoding="us-ascii"):
        assert self._root is not None
        if not hasattr(file, "write"):
            file = open(file, "wb")
        if not encoding:
            encoding = "us-ascii"
        elif encoding != "utf-8" and encoding != "us-ascii":
            file.write("<?xml version='1.0' encoding='%s'?>\n" % encoding)
        tmp = StringIO()
        self._write(tmp, self._root, encoding, {})
        file.write(tmp.getvalue().encode(encoding))


    def _write(self, file, node, encoding, namespaces):
        # write XML to file
        tag = node.tag
        if tag is Comment:
            file.write("<!-- %s -->" % _escape_cdata(node.text, encoding))
        elif tag is ProcessingInstruction:
            file.write("<?%s?>" % _escape_cdata(node.text, encoding))
        else:
            items = node.items()
            xmlns_items = [] # new namespaces in this scope
            try:
                if isinstance(tag, QName) or tag[:1] == "{":
                    tag, xmlns = fixtag(tag, namespaces)
                    if xmlns: xmlns_items.append(xmlns)
            except TypeError:
                _raise_serialization_error(tag)
            file.write("<" + tag)


I smell the mention of a Byte Order Mark coming on. ;-)
-- 
http://mail.python.org/mailman/listinfo/python-list


  1   2   3   4   5   >