date:20130518

Re: [Tutor] Retrieving data from a web site

2013-05-18 Thread Phil

My appetite having been whetted, I'm now stymied by an Ubuntu 
dependency problem during the installation of urllib3. This is listed as 
a bug. Has anyone overcome this problem?


Perhaps there's another library that I can use to download data from a 
web page?


--
Regards,
Phil
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor

Re: [Tutor] model methods in Django

2013-05-18 Thread eryksun

On Sat, May 18, 2013 at 10:22 PM, Dave Angel  wrote:
> The pub_date is probably an instance attribute of either the Poll class or
> the models.Model class.  It should probably be defined in the appropriate
> __init__ method.  In any case it's not a method attribute.

Django uses function attributes as metadata. The names "boolean" and
"short_description" are self-documenting. "admin_order_field" is
explained here:

https://docs.djangoproject.com/en/dev/ref/contrib/admin/#django.contrib.admin.ModelAdmin.list_display

Usually, elements of list_display that aren’t actual database
fields can’t be used in sorting (because Django does all the
sorting at the database level). However, if an element of
list_display represents a certain database field, you can
indicate this fact by setting the admin_order_field attribute
of the item.

The Poll model is part of the tutorial, "Writing your first Django app":

https://docs.djangoproject.com/en/1.5/intro

The function attributes are added in "Customize the admin change
list", in part 2.

> Perhaps you didn't realize that a function can have attributes, and that
> they can be added to the function at any time after the function is created.
> Being a method doesn't change that.

In a class definition, from a conceptual point of view, you're adding
a 'method'. But technically it's a function object. When accessed as
an attribute, the function's __get__ descriptor is used to create a
method on the fly.

The instancemethod type has a custom __getattribute__ that first
checks the method object's attributes such as __self__. If the lookup
on the method object fails, it proxies the __getattribute__ of the
wrapped __func__. For example:

class Spam(object):
    def __repr__(self):
        return 'eggs'

>>> meth = Spam().__repr__
>>> type(meth)
<type 'instancemethod'>
>>> meth.__self__
eggs

A method doesn't have a __dict__ for setting dynamic attributes:

>>> meth.boolean = False
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'instancemethod' object has no attribute 'boolean'

But you can set attributes on the underlying function object:

>>> type(meth.__func__)
<type 'function'>
>>> meth.__func__.boolean = False

The method will proxy them:

>>> meth.boolean
False

But not for assignment:

>>> meth.boolean = True
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'instancemethod' object has no attribute 'boolean'
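
For anyone following along on Python 3, here is a minimal sketch of the same experiment. It assumes Python 3, where the bound method type is reported as 'method' rather than 'instancemethod'. The class name Spam is the one from the session above, and "boolean" is simply the attribute name Django happens to look for.

```python
class Spam(object):
    def __repr__(self):
        return 'eggs'


meth = Spam().__repr__

# Setting an attribute directly on the bound method fails ...
try:
    meth.boolean = False
    direct_assignment_worked = True
except AttributeError:
    direct_assignment_worked = False

# ... but setting it on the underlying function object works,
meth.__func__.boolean = False

# and the method proxies the attribute lookup to that function.
proxied_value = meth.boolean

print(direct_assignment_worked, proxied_value)
```

The attribute lives on the function, so every bound method created from it sees the same value.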

Re: [Tutor] creating import directories

2013-05-18 Thread Steven D'Aprano


On 19/05/13 14:08, Jim Mooney wrote:

I'm a bit unclear about creating import directories.


The terminology you want here is "import packages". You cannot import 
directories. You can only import modules and packages.

"Module" in Python has two meanings:

1) a module is a file containing Python code, such as:

- source code in a .py file

- pre-compiled byte code in a .pyc file

- C libraries in a .dll or .so file

- and various others.

2) a module is an object that exists in memory, created by Python when you 
import a module as defined above, *or* a package as defined below.


A package, on the other hand, is a way of collecting multiple modules 
(definition 1 above) into a single directory, so that it becomes a 
self-contained group of files. The way to do this is by setting up the 
directory in a way that Python understands as a package:


1) The directory must be somewhere that Python will see it, no different from a 
single file module. That means, in the current directory, or your PYTHONPATH.

2) The directory name must be legal as a module name. That means, it must 
follow the same rules as names for modules, except without the .py extension.

"mymodule" is good
"my module" (note space) is bad

3) Inside the directory, you MUST put in a special file called "__init__.py" for 
Python to recognise it as a package. This is critical. This file can be empty, or it can 
contain Python code, but it must exist.

4) Any other modules inside the package follow the usual naming rules.

5) Last, and optional, if you want to run the package as if it were a script, you give it 
a file called "__main__.py" containing the code to run.

So, if I create the following directory structure where Python can see it:


mypackage/
  +-- __init__.py
  +-- spam.py
  +-- eggs.py
  +-- math.py


then I can do this:

import mypackage

which will read the file mypackage/__init__.py and create a module object (definition 2 
above) called "mypackage".


import mypackage.spam

will *first* read mypackage/__init__.py (if it hasn't already read it), and 
*then* read mypackage/spam.py. Once this is done, you can use:

result = mypackage.spam.function()

to call the function inside the spam module.

Similarly for mypackage.eggs and mypackage.math, which is especially 
interesting because it means you don't have to worry about your package's 
math.py module shadowing (hiding) the standard math module that Python uses.

You can learn more about packages from this historical document:

http://www.python.org/doc/essays/packages.html

Even though this goes back to the Dark Ages of Python 1.3 (!) most of the 
details of packages have not changed.


One more trick: starting from Python 2.5, there is a special form of the import 
statement that tells Python to only look inside the *current* package. In 
Python 3 (or in Python 2.5 to 2.7 after "from __future__ import absolute_import"), 
if I write

import math

inside (say) mypackage.spam, I will get the standard math module. But if I 
write:

from . import math

I will get the custom math module inside the package. I can even get them both 
at the same time, with a bit of renaming:

import math
from . import math as mymaths

Google on "absolute and relative imports" for more information on that.
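
To make the package layout and the two kinds of import concrete, here is a small sketch that builds a throwaway package in a temporary directory and imports from it. It assumes Python 3. The package name "mypackage_demo", the NAME constant and the body of function() are invented for the illustration; only the spam.py and math.py file names echo the layout above.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package on disk (all names here are illustrative).
root = tempfile.mkdtemp()
pkg = os.path.join(root, "mypackage_demo")
os.mkdir(pkg)

files = {
    "__init__.py": "",
    "math.py": "NAME = 'custom math inside the package'\n",
    "spam.py": (
        "import math                    # absolute: the standard library module\n"
        "from . import math as mymaths  # relative: the sibling math.py\n"
        "\n"
        "def function():\n"
        "    return math.sqrt(16), mymaths.NAME\n"
    ),
}
for name, source in files.items():
    with open(os.path.join(pkg, name), "w") as f:
        f.write(source)

# Put the parent directory where Python will see it, then import.
sys.path.insert(0, root)
importlib.invalidate_caches()
spam = importlib.import_module("mypackage_demo.spam")

result = spam.function()
print(result)
```

Importing mypackage_demo.spam also runs the (empty) __init__.py first, so both "mypackage_demo" and "mypackage_demo.spam" end up in sys.modules.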




Also, I noticed Wing 101 is sometimes creating a same-named pyc
program alongside my py program, but I don't see an option to do that.



This is normal behaviour when you import a module.

When you import a module, Python performs two kinds of caching:

import spam


will first look in sys.modules for an entry called "spam", and if found, it will use 
that. If there is no such entry, Python looks for a .py file called "spam". If it finds 
one, it compiles it to byte code. To speed up that process for the next time, it caches that byte 
code in a file called spam.pyc.

Of course Python also checks the timestamps on the source file, and won't use 
old byte code when the source code has changed.

This only occurs when you import a module. Just running the module will not 
cache the byte code.
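
Both caches can be poked at from the interpreter. The following is a rough sketch assuming Python 3, where the byte code goes into a __pycache__ directory instead of sitting next to the source file; "spam.py" is only a placeholder name and does not need to exist.

```python
import importlib.util
import sys

# First cache: sys.modules. A second import reuses the same module object.
import json
first = sys.modules["json"]
import json as again
same_object = again is first

# Second cache: the byte code file. This only computes where the cached
# byte code for a source file would be written; nothing is created.
cached_path = importlib.util.cache_from_source("spam.py")

print(same_object, cached_path)
```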



--
Steven

[Tutor] creating import directories

2013-05-18 Thread Jim Mooney

I'm a bit unclear about creating import directories.

I can import something like the json directory from Lib with a simple
 import json

So I tried putting my own directory in Lib, called bagofries, as a
test, and put a simple printstuff.py program in it. But when I try to
import bagofries, I get "no module named bagofries"

Is there some reason I can put a program right under Lib and it
imports, but a directory does not?

Also, I noticed Wing 101 is sometimes creating a same-named pyc
program alongside my py program, but I don't see an option to do that.

-- 
Jim Mooney

Today is the day that would have been tomorrow if yesterday was today.

[Tutor] WAMP stack

2013-05-18 Thread Jim Mooney

I noticed someone mentioned WAMP stacks but didn't see the answer. I
use a WAMP stack a lot. It has PHP. But I used it mainly with PHP CMS.
How do I install Python into my WAMP stack, and does that make any
sense for just raw Python, or only if I want to load Django or
something web-oriented and test it? (Assume I know nothing about
Django other than it's a Py CMS, since I don't ;')

Jim Mooney

"Since True * True * True == True, what I tell you three times is
true." --The Hunting of the Snark

Re: [Tutor] model methods in Django

2013-05-18 Thread Dave Angel


On 05/18/2013 03:16 PM, Matthew Ngaha wrote:

  im following the official docs and after learning Python im sure of
how methods work, but the model example on the beginners guide


which official docs?  URLs please?
which beginners guide?  URL please?


has me
really confused.



I don't know Django, so if this is really Django specific, I can't help.


The model definition is omitted but can anyone explain how this method
(was_published_recently) is given these attributes:


Which attributes are you confused about?
The admin_order_field, boolean, and short_description attributes of the 
method are bound in class code, immediately after the method is defined.


The pub_date is probably an instance attribute of either the Poll class 
or the models.Model class.  It should probably be defined in the 
appropriate __init__ method.  In any case it's not a method attribute.




class Poll(models.Model):
    # ...
    def was_published_recently(self):
        return self.pub_date >= timezone.now() - datetime.timedelta(days=1)
    was_published_recently.admin_order_field = 'pub_date'
    was_published_recently.boolean = True
    was_published_recently.short_description = 'Published recently?'

are the names of the attributes already attached to these
functions/methods, or are they being created on the fly with whatever
name you want? As i am unable to comprehend what is going on, i dont
really have a clue as to what each definition is doing and how it
affects the model, even after reading this section of the docs over
and over again im still lost.



This fragment isn't big enough for much else to be told.  But I don't 
really understand what aspect is confusing you.


Perhaps you didn't realize that a function can have attributes, and that 
they can be added to the function at any time after the function is 
created. Being a method doesn't change that.
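
Here is a Django-free sketch of the idea, in case it helps. It assumes nothing about how Django really reads these attributes: the Poll class below is a plain stand-in, and column_header() is a hypothetical helper that only shows how a framework could use such metadata.

```python
import datetime


class Poll(object):
    # Stand-in for the Django model; pub_date is a plain instance attribute.
    def __init__(self, pub_date):
        self.pub_date = pub_date

    def was_published_recently(self):
        return self.pub_date >= datetime.datetime.now() - datetime.timedelta(days=1)
    # Still inside the class body, the name refers to a plain function,
    # so attributes with any name can be attached to it.
    was_published_recently.boolean = True
    was_published_recently.short_description = 'Published recently?'


def column_header(func):
    # Hypothetical helper: use the metadata if present, else the function name.
    return getattr(func, 'short_description', func.__name__)


header = column_header(Poll.was_published_recently)
fallback = column_header(Poll.__init__)
recent = Poll(datetime.datetime.now()).was_published_recently()

print(header, fallback, recent)
```

The attribute names are not special to Python. They only mean something because some other code goes looking for them.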



--
DaveA

Re: [Tutor] why is unichr(sys.maxunicode) blank?

2013-05-18 Thread Steven D'Aprano


On 19/05/13 02:45, Albert-Jan Roskam wrote about locales:


It is pretty sick that all these things can be adjusted separately (what is the 
use of having: danish collation, russian case conversion, english decimal sign, 
japanese codepage ;-)


Well obviously there is no point to such a mess, but the ability to make a mess 
comes from having the flexibility to have less silly combinations.

By the way, I'm not sure what you mean by "pretty sick", since in Australian slang "sick" can mean 
"fantastic, excellent", as in "Mate, that's a pretty sick sub-woofer!".

See http://www.youtube.com/watch?v=iRv7IE6T4gQ

(warning: ethnic stereotypes, low-brow humour)



[...]

  Isn't UCS-2 the internal unicode encoding for CPython (narrow builds)?


Narrow builds create UTF-16 surrogate pairs from \U literals, but
these aren't treated as an atomic unit for slicing, iteration, or
string length.


That is a nice way of putting it. So if you slice a multibyte char "mb", mb[0] 
will return the first byte? That is annoying.


Correct. You can easily break apart surrogate pairs in Python narrow builds, 
which leads to invalid strings. The solution is to either use a wide build, or 
upgrade to Python 3.3 which no longer has this problem:


# Python 3.2, narrow build:
py> len(chr(0x101001))
2

# Python 3.3
py> len(chr(0x101001))
1


--
Steven

Re: [Tutor] Retrieving data from a web site

2013-05-18 Thread Phil


On 18/05/13 22:44, Peter Otten wrote:

You can use a tool like lxml that "understands" html (though in this case
you'd need a javascript parser on top of that) -- or hack something together
with string methods or regular expressions. For example:

import urllib2
import json

s = urllib2.urlopen("http://*/goldencasket").read()
s = s.partition("latestResults_productResults")[2].lstrip(" =")
s = s.partition(";")[0]
data = json.loads(s)
lotto = data["GoldLottoSaturday"]

print lotto["drawDayDateNumber"]
print map(int, lotto["primaryNumbers"])
print map(int, lotto["secondaryNumbers"])

While this is brittle I've found that doing it "right" is usually not
worthwhile as it won't survive the next website redesign either.

PS: 
has links to zipped csv files with the results. Downloading, inflating and
reading these should be the simplest and best way to get your data.


Thanks again Peter and Walter,

The results download link points to a historical file of past results 
although the latest results are included at the bottom of the file. The 
file is quite large and it's zipped, so I imagine unzipping would be another 
problem. I've come across Beautiful Soup and it may also offer a simple 
solution.


Thanks for your response Walter, I'd like to download the Australian 
Lotto results and there isn't a simple way, as far as I can see, to do 
this. I'll read up on curl, maybe I can use it.


I'll experiment with Peter's code and Beautiful Soup and see what I 
can come up with. Maybe unzipping the file could be the best solution, 
I'll experiment with that option as well.


--
Regards,
Phil

Re: [Tutor] Python web script to run a command line expression

2013-05-18 Thread William Ranvaud



I'm not sure if this is what you are looking for or if this will work on 
WAMP, but Python has a virtual terminal emulator called Vte or 
python-vte. I use it to display the terminal and run commands.

I'm using it on Linux by adding "from gi.repository import Vte".
Hope it helps.
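
If a terminal widget is more than is needed, the standard library's subprocess module can run a command and capture its output. This is a different technique from Vte, offered only as a sketch: it assumes Python 3.7 or later, run_command() is a made-up helper name, and the web side (how the page hands the command to the script) is left out entirely.

```python
import subprocess
import sys


def run_command(args):
    # args is a list of strings, so no shell is involved in running it.
    completed = subprocess.run(args, capture_output=True, text=True, timeout=10)
    return completed.returncode, completed.stdout


# Use the running interpreter as a command that exists on every platform.
code, output = run_command([sys.executable, "-c", "print('hello from a child process')"])
print(code, output.strip())
```

Anything that reaches run_command() from a web page should be checked against a fixed list of allowed commands rather than passed through as typed.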



On 18-05-2013 04:20, Ahmet Anil Dindar wrote:


Hi,
I have a WAMP running in my office computer. I wonder how I can 
implement a python script that runs within WAMP and execute a command 
line expression. By this way, I will be able to run my command line 
expressions through web page in intranet.


I appreciate your suggestions.

++Ahmet




[Tutor] model methods in Django

2013-05-18 Thread Matthew Ngaha

 im following the official docs and after learning Python im sure of
how methods work, but the model example on the beginners guide has me
really confused.

The model definition is omitted but can anyone explain how this method
(was_published_recently) is given these attributes:

class Poll(models.Model):
    # ...
    def was_published_recently(self):
        return self.pub_date >= timezone.now() - datetime.timedelta(days=1)
    was_published_recently.admin_order_field = 'pub_date'
    was_published_recently.boolean = True
    was_published_recently.short_description = 'Published recently?'

are the names of the attributes already attached to these
functions/methods, or are they being created on the fly with whatever
name you want? As i am unable to comprehend what is going on, i dont
really have a clue as to what each definition is doing and how it
affects the model, even after reading this section of the docs over
and over again im still lost.

Re: [Tutor] why is unichr(sys.maxunicode) blank?

2013-05-18 Thread eryksun

On Sat, May 18, 2013 at 12:45 PM, Albert-Jan Roskam  wrote:
>
> It seems that the result of str.isalpha() and str.isdigit() *might* be 
> different depending
> on the setting of locale.LC_CTYPE.

Yes, str() in 2.x uses the locale predicates from <ctype.h>:

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/ctype.h.html

However, 2.x bytearray uses the bytes_methods from 3.x, which use pyctype:

2.7.5 source:
http://hg.python.org/cpython/file/ab05e7dd2788/Include/pyctype.h
http://hg.python.org/cpython/file/ab05e7dd2788/Python/pyctype.c
http://hg.python.org/cpython/file/ab05e7dd2788/Include/bytes_methods.h
http://hg.python.org/cpython/file/ab05e7dd2788/Objects/stringlib/ctype.h

Note that the table in pyctype.c is only defined for ASCII.

> It is pretty sick that all these things can be adjusted separately (what is 
> the use of having:
> danish collation, russian case conversion, english decimal sign, japanese 
> codepage ;-)

Here's a non-sick example. A system in the US might customize
LC_MEASUREMENT to use SI units and LC_TIME to have Monday as the first
day of the week.

> That one is the clearest IMHO. Oh no, now I see the possible impact on 
> regexes. The
> meaning of e.g. "\s+" might change depending on the locale.LC_CTYPE setting!!

The re module has the re.L flag to enable limited locale support. It
only affects the alphanumeric category and word boundaries. You're
probably better off using re.U and the Unicode database.
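
A quick sketch of the difference the Unicode database makes, assuming Python 3. There, str patterns already use Unicode matching by default (re.U is implied), and re.ASCII switches back to ASCII-only character classes. The sample text is made up.

```python
import re

text = "na\u00efve caf\u00e9"   # "naive cafe" with the accented letters

# Default for str patterns in Python 3: character classes use the Unicode database.
unicode_words = re.findall(r"\w+", text)

# ASCII-only classes: the accented letters no longer count as word characters.
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(unicode_words)
print(ascii_words)
```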

>> Narrow builds create UTF-16 surrogate pairs from \U literals, but
>> these aren't treated as an atomic unit for slicing, iteration, or
>> string length.
>
> That is a nice way of putting it. So if you slice a multibyte char "mb", 
> mb[0] will return the
> first byte? That is annoying.

It's 2 bytes, not one. If you use a non-BMP \U escape on a narrow
build it creates a surrogate pair.  Each surrogate has a 10-bit range
in a 2-byte code. The lead surrogate is in the range 0xD800-0xDBFF,
and the trail is in the range 0xDC00-0xDFFF.
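
The arithmetic behind a surrogate pair can be sketched in a few lines. This assumes Python 3.3 or later, so that chr() accepts non-BMP code points on every build; to_surrogates() is a made-up helper name, and the code point is the one used elsewhere in this thread.

```python
def to_surrogates(code_point):
    # Only code points above the BMP need a surrogate pair.
    offset = code_point - 0x10000            # a 20-bit number
    lead = 0xD800 + (offset >> 10)           # top 10 bits
    trail = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return lead, trail


lead, trail = to_surrogates(0x101001)

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
encoded = chr(0x101001).encode("utf-16-be")
from_codec = (int.from_bytes(encoded[:2], "big"), int.from_bytes(encoded[2:], "big"))

print(hex(lead), hex(trail), from_codec == (lead, trail))
```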

Re: [Tutor] Retrieving data from a web site

2013-05-18 Thread Walter Prins

Hi

Just a minor observation:

On 18 May 2013 13:44, Peter Otten <__pete...@web.de> wrote:

> Phil wrote:
>
> > On 18/05/13 19:25, Peter Otten wrote:
> >>
> >> Are there alternatives that give the number as plain text?
> >
> > Further investigation shows that the numbers are available if I view the
> > source of the page. So, all I have to do is parse the page and extract
> > the drawn numbers. I'm not sure, at the moment, how I might do that but
> > I have something to work with.
>
> You can use a tool like lxml that "understands" html (though in this case
> you'd need a javascript parser on top of that) -- or hack something
> together
> with string methods or regular expressions. For example:

You don't need javascript, in this case, assuming the reference is to the
UK lotto --  A simple curl test confirms that (for the UK lottery at least)
the numbers can be retrieved simply without the involvedment of javascript,
so Python will be able to do the same. (URL:
https://www.national-lottery.co.uk/player/p/results.ftl  Apologies if this
is about some other lottery and I've missed it...)

Best,

Walter

Re: [Tutor] why is unichr(sys.maxunicode) blank?

2013-05-18 Thread Albert-Jan Roskam



> 
>>  East Asian languages. But later on Joel Spolsky's "standard" 
> page about unicode
>>  I read that it goes to 6 bytes. That's what I implied when I mentioned 
> "utf8".
> 
> Each surrogate in a UTF-16 surrogate pair is 10 bits, for a total of
> 20-bits. Thus UTF-16 sets the upper bound on the number of code points
> at 2**20 + 2**16 (BMP). UTF-8 only needs 4 bytes for this number of
> codes.
> 
>>  A certain locale implies a certain codepage (on Windows), but where does 
> the locale
>>  category LC_CTYPE fit in this story?
> 
> LC_CTYPE is the locale category that classifies characters. In Debian
> Linux, the English-language locales copy LC_CTYPE from the i18n
> (internationalization) locale:
 
Thanks for the links. Without examples it remains pretty abstract, but I think 
I know what is meant by this locale category now: "The LC_CTYPE category shall 
define character classification, case conversion, and other character 
attributes." So if you switch from one locale to another, certain attributes of 
a character set might change. A switch from locale A to locale B might affect 
an attribute "casing", therefore, the mapping from lower- to uppercase *might* 
differ by locale. In stupid country X  "a".upper() may return "B".

It seems that the result of str.isalpha() and str.isdigit() *might* be 
different depending on the setting of locale.LC_CTYPE. 

It is pretty sick that all these things can be adjusted separately (what is the 
use of having: danish collation, russian case conversion, english decimal sign, 
japanese codepage ;-)

 
> The i18n locale is defined by the ISO/IEC technical report 14652, as
> an instance of an upward compatible extension to the POSIX locale
> specification called the FDCC-set (i.e. Set of Formal Definitions of
> Cultural Conventions). Here it is in all its glory, if you like
> reading technical reports:
> 
> http://www.open-std.org/jtc1/sc22/wg20/docs/n972-14652ft.pdf

> If that's not enough, here's the POSIX 1003.1 locale spec:
> 
> short: http://goo.gl/aOJUx
> http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html


That one is the clearest IMHO. Oh no, now I see the possible impact on regexes. 
The meaning of e.g. "\s+"
might change depending on the locale.LC_CTYPE setting!!


>>  Isn't UCS-2 the internal unicode encoding for CPython (narrow builds)?
> 
> Narrow builds create UTF-16 surrogate pairs from \U literals, but
> these aren't treated as an atomic unit for slicing, iteration, or
> string length.

That is a nice way of putting it. So if you slice a multibyte char "mb", mb[0] 
will return the first byte? That is annoying.


Re: [Tutor] why is unichr(sys.maxunicode) blank?

2013-05-18 Thread eryksun

On Sat, May 18, 2013 at 6:01 AM, Albert-Jan Roskam  wrote:
>
> East Asian languages. But later on Joel Spolsky's "standard" page about 
> unicode
> I read that it goes to 6 bytes. That's what I implied when I mentioned "utf8".

Each surrogate in a UTF-16 surrogate pair is 10 bits, for a total of
20-bits. Thus UTF-16 sets the upper bound on the number of code points
at 2**20 + 2**16 (BMP). UTF-8 only needs 4 bytes for this number of
codes.
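
The numbers are easy to check in the interpreter. A small sketch, assuming Python 3.3 or later so that chr() reaches the top of the range on any build:

```python
bmp = 2**16                  # code points reachable with one 16-bit unit
supplementary = 2**20        # code points reachable with a surrogate pair
total = bmp + supplementary

highest = total - 1          # the last code point
utf8_length = len(chr(highest).encode("utf-8"))

print(total, hex(highest), utf8_length)
```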

> A certain locale implies a certain codepage (on Windows), but where does the 
> locale
> category LC_CTYPE fit in this story?

LC_CTYPE is the locale category that classifies characters. In Debian
Linux, the English-language locales copy LC_CTYPE from the i18n
(internationalization) locale:

short: http://goo.gl/Hs8RD
http://www.eglibc.org/cgi-bin/viewvc.cgi/trunk/libc/localedata/locales/i18n?view=markup

Here's the mapping between the symbolic Unicode names in the latter
(e.g. ) and UTF-8:

short: http://goo.gl/cZ3dS
http://www.eglibc.org/cgi-bin/viewvc.cgi/trunk/libc/localedata/charmaps/UTF-8?view=markup

The i18n locale is defined by the ISO/IEC technical report 14652, as
an instance of an upward compatible extension to the POSIX locale
specification called the FDCC-set (i.e. Set of Formal Definitions of
Cultural Conventions). Here it is in all its glory, if you like
reading technical reports:

http://www.open-std.org/jtc1/sc22/wg20/docs/n972-14652ft.pdf

If that's not enough, here's the POSIX 1003.1 locale spec:

short: http://goo.gl/aOJUx
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html

> Isn't UCS-2 the internal unicode encoding for CPython (narrow builds)?

Narrow builds create UTF-16 surrogate pairs from \U literals, but
these aren't treated as an atomic unit for slicing, iteration, or
string length.

Re: [Tutor] Retrieving data from a web site

2013-05-18 Thread Peter Otten

Phil wrote:

> On 18/05/13 19:25, Peter Otten wrote:
>>
>> Are there alternatives that give the number as plain text?
> 
> Further investigation shows that the numbers are available if I view the
> source of the page. So, all I have to do is parse the page and extract
> the drawn numbers. I'm not sure, at the moment, how I might do that but
> I have something to work with.

You can use a tool like lxml that "understands" html (though in this case 
you'd need a javascript parser on top of that) -- or hack something together 
with string methods or regular expressions. For example:

import urllib2
import json

s = urllib2.urlopen("http://*/goldencasket").read()
s = s.partition("latestResults_productResults")[2].lstrip(" =")
s = s.partition(";")[0]
data = json.loads(s)
lotto = data["GoldLottoSaturday"]

print lotto["drawDayDateNumber"]
print map(int, lotto["primaryNumbers"])
print map(int, lotto["secondaryNumbers"])

While this is brittle I've found that doing it "right" is usually not 
worthwhile as it won't survive the next website redesign either.

PS: 
has links to zipped csv files with the results. Downloading, inflating and 
reading these should be the simplest and best way to get your data.
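
For readers on Python 3, the string-slicing part of the snippet above can be tried without touching the network. This is only a sketch: the page text below is invented to mimic the shape the code expects (a JavaScript variable holding a JSON object) and the numbers in it are not real results. The key names are the ones used in the snippet.

```python
import json

# Invented stand-in for the page source; not real data.
page = (
    'var latestResults_productResults = '
    '{"GoldLottoSaturday": {"drawDayDateNumber": "Sat 18/05/13 - Draw 1234", '
    '"primaryNumbers": ["3", "11", "19", "24", "30", "41"], '
    '"secondaryNumbers": ["7", "28"]}}; var somethingElse = 1;'
)

# Keep what follows the variable name, drop the " = ", stop at the first ";".
s = page.partition("latestResults_productResults")[2].lstrip(" =")
s = s.partition(";")[0]
data = json.loads(s)
lotto = data["GoldLottoSaturday"]

primary = list(map(int, lotto["primaryNumbers"]))
secondary = list(map(int, lotto["secondaryNumbers"]))
print(lotto["drawDayDateNumber"], primary, secondary)
```

In Python 3 the download itself would go through urllib.request.urlopen(), and the bytes it returns would need decoding before the partition() calls. Cutting at the first ";" also breaks as soon as a semicolon appears inside the JSON.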


Re: [Tutor] why is unichr(sys.maxunicode) blank?

2013-05-18 Thread Steven D'Aprano

On 18/05/13 20:01, Albert-Jan Roskam wrote:

Thanks for all your replies. I knew about code points, but to represent the
unicode string (code point) as a utf-8 byte string (bytes), characters 0-127
are 1 byte (of 8 bits), then 128-255 (accented chars)
are 2 bytes, and so on up to 4 bytes for East Asian languages. But later on Joel Spolsky's
"standard" page about unicode I read that it goes to 6 bytes. That's what I implied when
I mentioned "utf8".

The UTF-8 data structure was originally designed to go up to 6 bytes, but since
Unicode itself is limited to 1114112 code points (U+0000 to U+10FFFF), no more
than 4 bytes are needed for UTF-8.
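
The points where UTF-8 moves from one length to the next can be checked directly. A small sketch, assuming Python 3.3 or later; the list holds the code points on either side of each boundary:

```python
# Code points just below and just above each change in encoded length.
boundaries = [0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]

lengths = {hex(cp): len(chr(cp).encode("utf-8")) for cp in boundaries}

for cp, n in lengths.items():
    print(cp, n)
```

So only code points 0-127 take one byte, everything else up to U+07FF takes two, the rest of the BMP takes three, and the supplementary planes take four.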

Also, it is wrong to say that the 4-byte UTF-8 values are "East Asian languages". The full Unicode
range contains 17 "planes" of 65,536 code points. The first such plane is called the "Basic
Multilingual Plane", and it includes all the code points that can be represented in 1 to 3 UTF-8 bytes.
The BMP includes in excess of 13,000 East Asian code points, e.g.:

py> import unicodedata as ud
py> c = '\u3050'
py> print(c, ud.name(c), c.encode('utf-8'))
ぐ HIRAGANA LETTER GU b'\xe3\x81\x90'

The 4-byte UTF-8 values are in the second and subsequent planes, called
"Supplementary Multilingual Planes". They include historical character sets
such as Egyptian hieroglyphs, cuneiform, musical and mathematical symbols, Emoji, gaming
symbols, Ancient Arabic and Persian, and many others.

http://en.wikipedia.org/wiki/Plane_(Unicode)

I always viewed the codepage as "the bunch of chars on top of ascii", e.g.
cp1252 (latin-1) is ascii (0-127) + another 128 characters that are used in Europe (euro
sign, Scandinavian and Mediterranean (Spanish), but not Slavian chars).

Well, that's certainly common, but not all legacy encodings are supersets of
ASCII. For example:

http://en.wikipedia.org/wiki/Big5

although I see that Python's implementation of Big5 is *technically* incorrect,
although *practically* useful, as it does include ASCII.

A certain locale implies a certain codepage (on Windows), but where does the
locale category LC_CTYPE fit in this story?

No idea :-)

UTF-8
UTF-16
UTF-32 (also sometimes known as UCS-4)

plus at least one older, obsolete encoding, UCS-2.

Isn't UCS-2 the internal unicode encoding for CPython (narrow builds)? Or maybe
this is a different abbreviation. I read about basic multilingual plane (BMP) and
surrogate pairs and all. The author suggested that messing with surrogate pairs
is a topic to dive into in case one's nail bed is being derusted. I
wholeheartedly agree.

UCS-2 is a fixed-width encoding that is identical to UTF-16 for code points up
to U+FFFF. It differs from UTF-16 in that it *cannot* encode code points
U+10000 and higher, in other words, it does not support surrogate pairs. So
UCS-2 is obsolete in the sense it doesn't include the whole set of Unicode
characters.

In Python 3.2 and older, Python has a choice between a *narrow build* that uses
UTF-16 (including surrogates) for strings in memory, or a *wide build* that
uses UTF-32. The choice is made when you compile the Python interpreter. Other
programming languages may use other systems.

Python 3.3 uses a different, more flexible scheme for keeping strings in
memory. Depending on the largest code point in a string, the string will be
stored in either Latin-1 (one byte per character), UCS-2 (two bytes per
character, and no surrogates) or UTF-32 (four bytes per character). This means
that there is no longer a need for surrogate pairs, but only strings that
*need* four bytes per character will use four bytes.
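
The effect shows up in how much memory a string takes. The sketch below assumes CPython 3.3 or later. The exact byte counts from sys.getsizeof() are an implementation detail and vary between versions, so only their ordering is meaningful.

```python
import sys

# Three strings of 1000 characters that differ only in their largest code point.
latin = "a" * 1000                    # fits in one byte per character
bmp = "a" * 999 + "\u3050"            # needs two bytes per character
astral = "a" * 999 + "\U00010000"     # needs four bytes per character

sizes = [sys.getsizeof(s) for s in (latin, bmp, astral)]
lengths = [len(s) for s in (latin, bmp, astral)]

print(lengths, sizes)
```

One wide character is enough to widen the storage for the whole string, while len() stays the same.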

- "big endian", where the most-significant (largest) byte is on the left
(lowest address);
- "little endian", where the most-significant (largest) byte is on the right.

Why is endianness relevant only for utf-32, but not for utf-8 and utf16? Is "utf-8" a
shorthand for saying "utf-8 le"?

Endianness is relevant for UTF-16 too.

It is not relevant for UTF-8 because UTF-8 defines the order that multiple
bytes must appear. UTF-8 is defined in terms of *bytes*, not multi-byte words.
So the code point U+3050 is encoded into three bytes, *in this order*:

0xE3 0x81 0x90

There's no question about which byte comes first, because the order is set. But
UTF-16 defines the encoding in terms of double-byte words, so the question of
how words are stored becomes relevant. A 16-bit word can be laid out in memory
in at least two ways:

[most significant byte] [least significant byte]

[least significant byte] [most significant byte]

so U+3050 could legitimately appear as bytes 0x3050 or 0x5030 depending on the
machine you are using.
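
The two layouts are easy to see by asking Python for an explicit byte order. A minimal sketch, assuming Python 3; the "utf-16-be" and "utf-16-le" codecs fix the byte order and write no byte order mark:

```python
c = '\u3050'

big = c.encode('utf-16-be')       # most significant byte first
little = c.encode('utf-16-le')    # least significant byte first
utf8 = c.encode('utf-8')          # byte order fixed by the encoding itself

# Read back with the matching byte order, both give the same number.
value_from_big = int.from_bytes(big, 'big')
value_from_little = int.from_bytes(little, 'little')

print(big.hex(), little.hex(), utf8.hex(), value_from_big, value_from_little)
```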

It's hard to talk about endianness without getting confused, or at least for me
it is :-) Even though I've written down 0x3050 and 0x5030, it is important to
understand that they both have the same numeric value of 12368 in decimal. The
difference is just in how the bytes are laid out in memory. By analogy, Arabic
numerals used in En

Re: [Tutor] Unsubscribe

2013-05-18 Thread Dave Angel


On 05/16/2013 12:58 PM, Stafford Baines wrote:

I only intend this to be temporary. I'm going away for a couple of weeks and 
don't want my mailbox overflowing when I return.😃 Thanks for all the help

Sent from my iPhone



At the bottom of every message is a link to a web page to "change 
subscription options."  At the bottom of that page is a button that can 
unsubscribe you.


--
DaveA

Re: [Tutor] Retrieving data from a web site

2013-05-18 Thread Phil


On 18/05/13 19:25, Peter Otten wrote:


Are there alternatives that give the number as plain text?


Further investigation shows that the numbers are available if I view the 
source of the page. So, all I have to do is parse the page and extract 
the drawn numbers. I'm not sure, at the moment, how I might do that but 
I have something to work with.
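
A rough sketch of the parsing step, using only the standard library (Python 3 syntax). The HTML snippet, the "ball" class name and the BallParser class are all made up for illustration; the real page's markup will differ and the tag test would have to be adapted to it:

```python
from html.parser import HTMLParser

# Hypothetical page source; the real markup is an assumption.
SAMPLE_HTML = """
<ul class="results">
  <li class="ball">7</li>
  <li class="ball">13</li>
  <li class="ball">28</li>
</ul>
"""


class BallParser(HTMLParser):
    """Collect the integer text of every <li class="ball"> element."""

    def __init__(self):
        super().__init__()
        self.in_ball = False
        self.numbers = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "li" and ("class", "ball") in attrs:
            self.in_ball = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_ball = False

    def handle_data(self, data):
        if self.in_ball and data.strip().isdigit():
            self.numbers.append(int(data))


parser = BallParser()
parser.feed(SAMPLE_HTML)
drawn_numbers = parser.numbers
print(drawn_numbers)
```

The page text itself could be fetched with urllib and passed to feed() in place of SAMPLE_HTML.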


--
Regards,
Phil

Re: [Tutor] Retrieving data from a web site

2013-05-18 Thread Phil


On 18/05/13 19:25, Peter Otten wrote:


What's the url of the page?


http://tatts.com/goldencasket


Are there alternatives that give the number as plain text?


Not that I can find. A Google search hasn't turned up anything.


If not, do the images have names like whatever0.jpg, whatever1.jpg,
whatever2.jpg, ...? Then you could infer the value from the name.

If not, is a digit always represented by the same image? Then you could map
the image urls to the digits.


Good point Peter, I'll investigate.

--
Regards,
Phil

Re: [Tutor] use python to change the webpage content?

2013-05-18 Thread Albert-Jan Roskam



>There is an online simulator for a physics project I'm doing, and I want to use 
>the data the simulator generates on that website. I can get data using 
>urllib.request and regular expressions, but I also want to change some of the 
>input values and then get different sets of data. However, if I change the 
>inputs, the address of the webpage doesn't change, so I can't get data 
>with different initial conditions. I'm wondering how I can implement this.

Maybe this? 
http://www.pythonforbeginners.com/cheatsheet/python-mechanize-cheat-sheet/
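
If the inputs are sent as an ordinary HTML form (an assumption; a page driven by JavaScript would need a different approach), the standard library may already be enough. The sketch below only builds the POST request; the URL and the field names "mass" and "velocity" are hypothetical placeholders:

```python
from urllib.parse import urlencode
from urllib.request import Request

SIMULATOR_URL = "http://example.com/simulator"    # hypothetical address
form_fields = {"mass": "2.5", "velocity": "10"}   # hypothetical input names


def build_request(url, fields):
    """Return a POST request carrying the form fields in its body."""
    body = urlencode(fields).encode("ascii")
    # Supplying data makes urllib issue a POST instead of a GET.
    return Request(url, data=body)


request = build_request(SIMULATOR_URL, form_fields)
print(request.get_method(), request.data)

# Sending it would be: urllib.request.urlopen(request).read()
```

The address stays the same for every set of inputs because the values travel in the request body rather than in the URL.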


Re: [Tutor] why is unichr(sys.maxunicode) blank?

2013-05-18 Thread Albert-Jan Roskam



- Original Message -

> From: eryksun 
> To: tutor@python.org
> Cc: 
> Sent: Saturday, May 18, 2013 5:28 AM
> Subject: Re: [Tutor] why is unichr(sys.maxunicode) blank?
> 
> On Fri, May 17, 2013 at 11:06 PM, Dave Angel  wrote:
>>  One tool that can help is the name function in module unicodedata
>> 
>>   >>> import unicodedata
>>   >>> unicodedata.name(u'\xb0')
>>  'DEGREE SIGN'
>> 
>>  If you try that on the values near sys.maxunicode you get an exception:
>>  ValueError: no such name
> 
> There's no name since the code point isn't assigned, but the category
> is defined:
> 
>     >>> unicodedata.category(u'\U0010FFFD')
>     'Co'
>     >>> unicodedata.category(u'\U0010FFFE')
>     'Cn'
>     >>> unicodedata.category(u'\U0010FFFF')
>     'Cn'
> 
> 'Co' is the private use category, and 'Cn' is for codes that 
> aren't assigned.

Thank you. That unicodedata module is very handy sometimes (and crucial for 
regexes, sometimes). I rarely use it but I should have remembered it.


Re: [Tutor] why is unichr(sys.maxunicode) blank?

2013-05-18 Thread Albert-Jan Roskam



>> I was curious what the "high" four-byte utf8 unicode characters look like.
>

>By the way, your sentence above reflects a misunderstanding. Unicode 
>characters (strictly speaking, code points) are not "bytes", four or 
>otherwise. They are abstract entities represented by a number between 0 and 
>1114111, or in hex, 0x10FFFF. Code points can represent characters, or parts 
>of characters (e.g. accents, diacritics, combining characters and similar), or 
>non-characters.



Thanks for all your replies. I knew about code points, but to represent the 
unicode string (code points) as a utf-8 byte string (bytes), characters 0-127 
are 1 byte (of 8 bits), then 128-255 (accented chars) are 2 bytes, and so on 
up to 4 bytes for East Asian languages. But later, on Joel Spolsky's "standard" 
page about unicode, I read that it goes up to 6 bytes. That's what I implied 
when I mentioned "utf8".
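
The byte counts can be checked directly. This is a small sketch in Python 3 syntax; the sample code points are an arbitrary choice, one from each length class:

```python
# One sample code point per UTF-8 length class (arbitrary picks).
samples = [0x41, 0xE9, 0x3050, 0x1F603, 0x10FFFF]

for cp in samples:
    encoded = chr(cp).encode("utf-8")
    print("U+%04X -> %d byte(s): %s" % (cp, len(encoded), encoded.hex()))

utf8_lengths = [len(chr(cp).encode("utf-8")) for cp in samples]
```

Even the highest code point, U+10FFFF, comes out as four bytes with Python's utf-8 codec.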



>Much confusion comes from conflating bytes and code points, or bytes and 
>characters. The first step to being a Unicode wizard is to always keep them 
>distinct in your mind. By analogy, the floating point number 23.42 is stored 
>in memory or on disk as a bunch of bytes, but there is nothing to be gained 
>from confusing the number 23.42 from the bytes 0xEC51B81E856B3740, which is 
>how it is stored as a C double.
>
>Unicode code points are abstract entities, but in the real world, they have to 
>be stored in a computer's memory, or written to disk, or transmitted over a 
>wire, and that requires *bytes*. So there are three Unicode schemes for 
>storing code points as bytes. These are called *encodings*. Only encodings 
>involve bytes, so it is nonsense to talk about "four-byte" unicode characters, 
>since it conflates the abstract Unicode character set with one of various 
>concrete encodings.
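
The float analogy in the quoted text can be reproduced with the struct module. A minimal sketch, assuming the bytes shown are the little-endian layout of an IEEE 754 double:

```python
import struct

number = 23.42

# "<d" = little-endian 8-byte double, ">d" = big-endian 8-byte double.
raw_le = struct.pack("<d", number)
raw_be = struct.pack(">d", number)

print(raw_le.hex().upper())
print(raw_be.hex().upper())

# The bytes are only a storage format; unpacking returns the same number.
assert struct.unpack("<d", raw_le)[0] == number
assert struct.unpack(">d", raw_be)[0] == number
```

The number is one thing and its stored bytes another, just as a code point is distinct from its encoded bytes.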


I would admit it if otherwise, but that's what I meant ;-)



>There are three standard Unicode encodings. (These are not to be confused with 
>the dozens of "legacy encodings", a.k.a. code pages, used prior to the Unicode 
>standard. They do not cover the entire range of Unicode, and are not part of 
>the Unicode standard.) These encodings are:



I always viewed the codepage as "the bunch of chars on top of ascii", e.g. 
cp1252 (latin-1) is ascii (0-127) + another 128 characters that are used in 
Europe (euro sign, Scandinavian and Mediterranean (Spanish) chars, but not 
Slavic chars). A certain locale implies a certain codepage (on Windows), but 
where does the locale category LC_CTYPE fit in this story?
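
A quick way to look at the "chars on top of ascii" is to encode a few characters with the codecs Python ships. This sketch (Python 3 syntax) only illustrates the byte values; it says nothing about LC_CTYPE:

```python
ascii_a = "A"
e_acute = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
euro = "\u20ac"      # EURO SIGN

# Each of these characters is a single byte in cp1252.
for ch in (ascii_a, e_acute, euro):
    print(ascii(ch), ch.encode("cp1252"))

# Python's latin-1 codec has no euro sign, so the two codecs
# are close relatives rather than identical.
try:
    euro.encode("latin-1")
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False
print("euro sign in latin-1:", euro_in_latin1)
```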



>
>UTF-8
>UTF-16
>UTF-32 (also sometimes known as UCS-4)
>
>plus at least one older, obsolete encoding, UCS-2.

Isn't UCS-2 the internal unicode encoding for CPython (narrow builds)? Or maybe 
this is a different abbreviation. I read about the Basic Multilingual Plane 
(BMP) and surrogate pairs and all. The author suggested that messing with 
surrogate pairs is a topic to dive into in case one's nail bed is being 
derusted. I wholeheartedly agree.
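
For the record, the arithmetic behind a surrogate pair is short. The sketch below (Python 3 syntax; to_surrogate_pair is a name made up for this example) splits a code point above the BMP into the two 16-bit units that UTF-16 uses, and checks the result against Python's own codec:

```python
import struct


def to_surrogate_pair(code_point):
    """Split a code point above the BMP into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F603)
print("U+1F603 -> 0x%04X 0x%04X" % (high, low))

# Cross-check against the built-in UTF-16 codec (big-endian, no BOM).
assert chr(0x1F603).encode("utf-16-be") == struct.pack(">HH", high, low)
```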



>UTF-32 is the least common, but simplest. It simply maps every code point to 
>four bytes. In the following, I will follow this convention:
>
>- code points are written using the standard Unicode notation, U+xxxx where 
>the x's are hexadecimal digits;
>
>- bytes are written in hexadecimal, using a leading 0x.
>
>Code point U+0000 -> bytes 0x00000000
>Code point U+0001 -> bytes 0x00000001
>Code point U+0002 -> bytes 0x00000002
>...
>Code point U+10FFFF -> bytes 0x0010FFFF
>
>
>It is simple because the mapping is trivially simple, and uncommon because for 
>typical English-language text, it wastes a lot of memory.
>
>The only complication is that UTF-32 depends on the endianness of your system. 
>In the above examples I glossed over this factor. In fact, there are two 
>common ways that bytes can be stored:
>
>- "big endian", where the most-significant (largest) byte is on the left 
>(lowest address);
>- "little endian", where the most-significant (largest) byte is on the right.
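
The two layouts can be printed side by side. A small sketch in Python 3 syntax, using the explicit "utf-32-be" and "utf-32-le" codecs so that no byte order mark is added; the utf32_table dict is only there to collect the results:

```python
utf32_table = {}

for cp in (0x0000, 0x0001, 0x0002, 0x10FFFF):
    ch = chr(cp)
    big = ch.encode("utf-32-be")      # most significant byte first
    little = ch.encode("utf-32-le")   # least significant byte first
    utf32_table[cp] = (big.hex(), little.hex())
    print("U+%04X  BE: %s  LE: %s" % (cp, big.hex(), little.hex()))
    # For a single code point the two forms are byte-wise reversals.
    assert little == big[::-1]
```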


Why is endianness relevant only for utf-32, but not for utf-8 and utf16? Is 
"utf-8" a shorthand for saying "utf-8 le"?



>So in a little-endian system, we have this instead:
>
>Code point U+0000 -> bytes 0x00000000
>Code point U+0001 -> bytes 0x01000000
>Code point U+0002 -> bytes 0x02000000
>...
>Code point U+10FFFF -> bytes 0xFFFF1000
>
>(Note that little-endian is not merely the reverse of big-endian. It is the 
>order of bytes that is reversed, not the order of digits, or the order of bits 
>within each byte.)
>
>So when you receive a bunch of bytes that you know represents text encoded 
>using UTF-32, you can bunch the bytes in groups of four and convert them to 
>Unicode code points. But you need to know the endianness. One way to do that is 
>to add a Byte Order Mark at the beginning of the bytes. If you look at the 
>first four bytes, and it looks like 0x0000FEFF, then you have big-endian 
>UTF-32. But if it looks like 0xFFFE0000, then you have little-endian.

So each byte starts with a BOM? Or each file?
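
A small experiment suggests the answer: the mark appears once, at the very start of the encoded stream (for example a file), not per byte or per character. The sketch below (Python 3 syntax) builds both variants by hand from the BOM constants in the codecs module and lets the generic "utf-32" codec work out the byte order:

```python
import codecs

text = "ab"

# One BOM in front, followed by the encoded characters.
big_endian_stream = codecs.BOM_UTF32_BE + text.encode("utf-32-be")
little_endian_stream = codecs.BOM_UTF32_LE + text.encode("utf-32-le")

print(big_endian_stream.hex())
print(little_endian_stream.hex())

# The generic codec reads the BOM, picks the byte order and drops the mark.
decoded_be = big_endian_stream.decode("utf-32")
decoded_le = little_endian_stream.decode("utf-32")
print(decoded_be, decoded_le)
```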

Re: [Tutor] Retrieving data from a web site

2013-05-18 Thread Peter Otten

Phil wrote:

> On 18/05/13 16:33, Alan Gauld wrote:
>> On 18/05/13 00:57, Phil wrote:
>>> I'd like to "download" eight digits from a web site where the digits are
>>> stored as individual graphics. Is this possible, using perhaps, one of
>>> the countless number of Python modules? Is this the function of a web
>>> scraper?
>>
>> In addition to Dave's points there is also the legality to consider.
>> Images are often copyrighted (although images of digits are less
>> likely!) and sites often have conditions of use that prohibit web
>> scraping. Such sites often include scripts that analyze user activity
>> and if they suspect you of being a robot may ban your computer from
>> accessing the site - including by browser.
>>
>> So be sure that you  are allowed to access the site robotically and that
>> you are allowed to download the content or you could find yourself
>> blacklisted and unable to access the site even with your browser.
>>
> 
> Thanks for the replies,
> 
> The site in question is the Lotto results page and the drawn numbers are
> not obscured. So I don't expect that there would be any legal or
> copyright problems.
> 
> I have written a simple program that checks the results, for an unlikely
> win, but I have to manually enter the drawn numbers. I thought the next
> step might be to automatically download the results.
> 
> I can see that this would be a relatively easy task if the digits were
> not displayed as graphics.

What's the url of the page? 

Are there alternatives that give the number as plain text? 

If not, do the images have names like whatever0.jpg, whatever1.jpg, 
whatever2.jpg, ...? Then you could infer the value from the name. 

If not, is a digit always represented by the same image? Then you could map 
the image urls to the digits.
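
Both ideas might look roughly like this (Python 3 syntax). The URLs, the file-name pattern and the lookup table are invented for illustration, since the real image names are not known here:

```python
import re

# Hypothetical image URLs as they might appear in the page source.
image_urls = [
    "http://example.com/images/ball4.jpg",
    "http://example.com/images/ball0.jpg",
    "http://example.com/images/ball7.jpg",
]

# Idea 1: the digit is part of the file name.
DIGIT_BEFORE_EXTENSION = re.compile(r"(\d)\.jpg$")


def digit_from_name(url):
    """Return the digit just before '.jpg', or None if there is none."""
    match = DIGIT_BEFORE_EXTENSION.search(url)
    return int(match.group(1)) if match else None


# Idea 2: the names are opaque but stable, so keep a hand-made table.
URL_TO_DIGIT = {
    "http://example.com/images/a9f3.jpg": 4,
    "http://example.com/images/c27b.jpg": 0,
}


def digit_from_table(url):
    """Look the URL up in the hand-made table; None if it is unknown."""
    return URL_TO_DIGIT.get(url)


digits = [digit_from_name(url) for url in image_urls]
print(digits)
```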



Re: [Tutor] Retrieving data from a web site

2013-05-18 Thread Phil


On 18/05/13 16:33, Alan Gauld wrote:

On 18/05/13 00:57, Phil wrote:

I'd like to "download" eight digits from a web site where the digits are
stored as individual graphics. Is this possible, using perhaps, one of
the countless number of Python modules? Is this the function of a web
scraper?


In addition to Dave's points there is also the legality to consider.
Images are often copyrighted (although images of digits are less
likely!) and sites often have conditions of use that prohibit web
scraping. Such sites often include scripts that analyze user activity
and if they suspect you of being a robot may ban your computer from
accessing the site - including by browser.

So be sure that you  are allowed to access the site robotically and that
you are allowed to download the content or you could find yourself
blacklisted and unable to access the site even with your browser.



Thanks for the replies,

The site in question is the Lotto results page and the drawn numbers are 
not obscured. So I don't expect that there would be any legal or 
copyright problems.


I have written a simple program that checks the results, for an unlikely 
win, but I have to manually enter the drawn numbers. I thought the next 
step might be to automatically download the results.


I can see that this would be a relatively easy task if the digits were 
not displayed as graphics.


--
Regards,
Phil

[Tutor] Python web script to run a command line expression

2013-05-18 Thread Ahmet Anil Dindar

Hi,
I have WAMP running on my office computer. I wonder how I can implement a
Python script that runs within WAMP and executes a command line expression.
This way, I will be able to run my command line expressions through a web
page on the intranet.

I appreciate your suggestions.

++Ahmet
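
For illustration, one possible shape for such a script is a small WSGI application that runs a command from a fixed whitelist (Python 3 syntax). This is only a sketch under assumptions: the command names, the "cmd" query parameter and the ALLOWED_COMMANDS table are made up, and how the application is hooked into Apache (for example via mod_wsgi or CGI) is left open:

```python
import subprocess
import sys
from urllib.parse import parse_qs

# Hypothetical whitelist: only these fixed commands can be run from the web.
ALLOWED_COMMANDS = {
    "pyversion": [sys.executable, "--version"],
    "hello": [sys.executable, "-c", "print('hello from the command line')"],
}


def run_allowed(name):
    """Run a whitelisted command and return its combined output as text."""
    argv = ALLOWED_COMMANDS.get(name)
    if argv is None:
        return "unknown command: %s" % name
    result = subprocess.run(argv, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            universal_newlines=True, timeout=30)
    return result.stdout


def application(environ, start_response):
    """WSGI entry point: /?cmd=hello runs the 'hello' command."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    name = query.get("cmd", [""])[0]
    body = run_allowed(name).encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8"),
                              ("Content-Length", str(len(body)))])
    return [body]
```

Passing arbitrary text from the browser to a shell would let anyone on the intranet run anything, which is why the sketch only accepts names from a fixed table. For a quick local try-out, wsgiref.simple_server.make_server from the standard library can serve the application.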
