Re: [Python-Dev] Unicode charmap decoders slow

2005-10-04 Thread Walter Dörwald
Am 04.10.2005 um 04:25 schrieb [EMAIL PROTECTED]:

> As the OP suggests, decoding with a codec like mac-roman or  
> iso8859-1 is very
> slow compared to encoding or decoding with utf-8.  Here I'm working  
> with 53k of
> data instead of 53 megs.  (Note: this is a laptop, so it's possible  
> that
> thermal or battery management features affected these numbers a  
> bit, but by a
> factor of 3 at most)
>
> $ timeit.py -s "s='a'*53*1024; u=unicode(s)" "u.encode('utf-8')"
> 1000 loops, best of 3: 591 usec per loop
> $ timeit.py -s "s='a'*53*1024; u=unicode(s)" "s.decode('utf-8')"
> 1000 loops, best of 3: 1.25 msec per loop
> $ timeit.py -s "s='a'*53*1024; u=unicode(s)" "s.decode('mac-roman')"
> 100 loops, best of 3: 13.5 msec per loop
> $ timeit.py -s "s='a'*53*1024; u=unicode(s)" "s.decode('iso8859-1')"
> 100 loops, best of 3: 13.6 msec per loop
>
> With utf-8 encoding as the baseline, we have
> decode('utf-8')  2.1x as long
> decode('mac-roman') 22.8x as long
> decode('iso8859-1') 23.0x as long
>
> Perhaps this is an area that is ripe for optimization.

For charmap decoding we might be able to use an array (e.g. a tuple or an
array.array?) of codepoints instead of a dictionary.

Or we could implement this array as a C array (i.e. gencodec.py would  
generate C code).
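
A rough pure-Python sketch of the table idea, just to illustrate (the names
and table entries are made up; the real work currently happens in the C
charmap_decode function):

    # illustration: replace the byte -> code point dictionary that
    # gencodec.py emits with a flat 256-entry sequence indexed by byte value
    decoding_map = {0x41: 0x0041, 0xF5: 0x0131}    # hypothetical entries

    decoding_table = tuple([decoding_map.get(i, 0xFFFD) for i in range(256)])

    def decode_with_table(s, table=decoding_table):
        # one sequence index per input byte instead of a dictionary lookup
        return u"".join([unichr(table[ord(c)]) for c in s])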

Bye,
Walter Dörwald



Re: [Python-Dev] unifying str and unicode

2005-10-04 Thread Fredrik Lundh
James Y Knight wrote:

> Your point would be much easier to stomach if the "str" type could
> *only* hold 7-bit ASCII.

why?  strings are not mutable, so it's not like an ASCII string will suddenly
sprout non-ASCII characters.  what ends up in a string is defined by the string
source.  if you cannot trust the source, your programs will never work.  after
all, there's nothing in Python that keeps things like:

s = file.readline().decode("iso-8859-1")
s = elem.findtext("node")
s = device.read_encoded_data()

from returning integers instead of strings, or returning socket objects on odd
fridays.  but if the interface spec says that they always return strings that
adhere to python's text model (=unicode or things that can be mixed with
unicode), you can trust them as much as you can trust anything else in Python.

(this is of course also why we talk about file-like objects in Python, and
sequences, and iterators and iterables, and stuff like that.  it's not
type(obj) that's important, it's what you can do with obj and how it behaves
when you do it)

 





Re: [Python-Dev] PEP 343 and __with__

2005-10-04 Thread Michael Hudson
"Phillip J. Eby" <[EMAIL PROTECTED]> writes:

> At 07:02 PM 10/3/2005 +0100, Michael Hudson wrote:
>>"Phillip J. Eby" <[EMAIL PROTECTED]> writes:
>>
>> > Since the PEP is accepted and has patches for both its implementation 
>> and a
>> > good part of its documentation, a major change like this would certainly
>> > need a better rationale.
>>
>>Though given the amount of interest said patch has attracted (none at
>>all)
>
> Actually, I have been reading the patch and meant to comment on it.  

Oh, good.

> I was perplexed by the odd stack behavior of the new opcode until I
> realized that it's try/finally that's weird.  :)

:)

> I was planning to look into whether that could be cleaned up as
> well, when I got distracted and didn't go back to it.

I see.

I don't know whether trying to clean up the stack protocol around
exceptions is worth the amount of pain it causes in the head (anyone
still thinking about removing the block stack?).

>>  perhaps no one cares very much and the proposal should be dropped.
>
> I care an awful lot, as 'with' is another framework-dissolving tool that 
> makes it possible to do more things in library form, without needing to 
> resort to template methods.  It also enables more context-sensitive 
> programming, in that "global" states can be set and restored in a 
> structured fashion.  It may take a while to feel the effects, but it's 
> going to be a big improvement to Python, maybe as big as new-style classes, 
> and certainly bigger than decorators.

I think 'as big as new-style classes' is probably an exaggeration, but
I'm glad my troll caught a few people :)

Cheers,
mwh

-- 
  Those who have deviant punctuation desires should take care of their
  own perverted needs.  -- Erik Naggum, comp.lang.lisp


Re: [Python-Dev] PEP 343 and __with__

2005-10-04 Thread Nick Coghlan
Michael Hudson wrote:
> I think 'as big as new-style classes' is probably an exaggeration, but
> I'm glad my troll caught a few people :)

I was planning on looking at your patch too, but I was waiting for an answer 
from Guido about the fate of the ast-branch for Python 2.5. Given that we have 
patches for PEP 342 and PEP 343 against the trunk, but ast-branch still isn't 
even passing the Python 2.4 test suite, I'm wondering if it should be bumped 
from the feature list again.

Cheers,
Nick.

-- 
Nick Coghlan   |   [EMAIL PROTECTED]   |   Brisbane, Australia
---
 http://boredomandlaziness.blogspot.com


Re: [Python-Dev] PEP 343 and __with__

2005-10-04 Thread Nick Coghlan
Jason Orendorff wrote:
> Phillip J. Eby writes:
> 
>>You didn't offer any reasons why this would be useful and/or good.
> 
> 
> It makes it dramatically easier to write Python classes that correctly
> support 'with'.  I don't see any simple way to do this under PEP 343;
> the only sane thing to do is write a separate @contextmanager
> generator, as all of the examples do.

Hmm, it's kind of like the iterable/iterator distinction. Being able to do:

    class Whatever(object):
        def __iter__(self):
            for item in self.stuff:
                yield item

is a very handy way of defining "this is how you iterate over this class". The 
only cost is that actual iterators then need to define an __iter__ method that 
returns 'self' (which isn't much of a cost, and is trivial to do even for 
iterators written in C).
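
(For comparison, the iterator side of that bargain is tiny -- an illustrative
sketch, not taken from any real codebase:)

    class StuffIterator(object):
        # an actual iterator: __iter__ simply returns self
        def __init__(self, stuff):
            self._inner = iter(stuff)
        def __iter__(self):
            return self
        def next(self):
            return self._inner.next()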

If there were a __with__ slot, then we could consider that as identifying a
"manageable context", with three methods identifying an actual context manager:
    __with__ (which returns self)
    __enter__
    __exit__


Then the explanation of what a with statement does would simply look like:

    abc = EXPR.__with__()    # This is the only change
    exc = (None, None, None)
    VAR = abc.__enter__()
    try:
        try:
            BLOCK
        except:
            exc = sys.exc_info()
            raise
    finally:
        abc.__exit__(*exc)


And the context management for decimal.Context would look like:
    class Context:
        ...
        @contextmanager
        def __with__(self):
            old = decimal.getcontext()
            new = self.copy()    # Make this nesting and thread safe
            decimal.setcontext(new)
            try:
                yield new
            finally:
                decimal.setcontext(old)

And for threading.Lock would look like:
    class Lock:
        ...
        def __with__(self):
            return self
        def __enter__(self):
            self.acquire()
            return self
        def __exit__(self, *exc_info):
            self.release()

Also, any class could make an existing independent context manager (such as 
'closing') its native context manager as follows:

    class SomethingCloseable:
        ...
        def __with__(self):
            return closing(self)

> As for the second proposal, I was thinking we'd have one mental model
> for context managers (block template generators), rather than two
> (generators vs. enter/exit methods).  Enter/exit seemed superfluous,
> given the examples in the PEP.

Try to explain the semantics of the with statement without referring to the 
__enter__ and __exit__ methods, and then see if you still think they're 
superfluous ;)

The @contextmanager generator decorator is just syntactic sugar for writing 
duck-typed context managers - the semantics of the with statement itself can 
only be explained in terms of the __enter__ and __exit__ methods. Indeed, 
explaining how the @contextmanager decorator itself works requires recourse to 
the __enter__ and __exit__ methods of the actual context manager object the 
decorator produces.
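
A stripped-down sketch of such a wrapper -- not the PEP 343 reference
implementation, and it assumes the PEP 342 throw() method while ignoring
several corner cases -- just to show why __enter__ and __exit__ are the
primitives:

    class _GeneratorContext(object):
        # sketch: adapt a generator to the __enter__/__exit__ protocol
        def __init__(self, gen):
            self.gen = gen
        def __enter__(self):
            return self.gen.next()          # run up to the first yield
        def __exit__(self, type, value, tb):
            if type is None:
                try:
                    self.gen.next()         # run the code after the yield
                except StopIteration:
                    pass
            else:
                try:
                    self.gen.throw(type, value, tb)
                except StopIteration:
                    pass                    # the generator handled the exception

    def contextmanager(func):
        def helper(*args, **kwds):
            return _GeneratorContext(func(*args, **kwds))
        return helper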

However, I think the idea of having a distinction between manageable contexts
and context managers, similar to the distinction between iterables and
iterators, is one well worth considering.

Cheers,
Nick.

-- 
Nick Coghlan   |   [EMAIL PROTECTED]   |   Brisbane, Australia
---
 http://boredomandlaziness.blogspot.com


Re: [Python-Dev] PEP 343 and __with__

2005-10-04 Thread Guido van Rossum
On 10/4/05, Nick Coghlan <[EMAIL PROTECTED]> wrote:
> I was planning on looking at your patch too, but I was waiting for an answer
> from Guido about the fate of the ast-branch for Python 2.5. Given that we have
> patches for PEP 342 and PEP 343 against the trunk, but ast-branch still isn't
> even passing the Python 2.4 test suite, I'm wondering if it should be bumped
> from the feature list again.

What do you want me to say about the AST branch? It's not my branch, I
haven't even checked it out, I'm just patiently waiting for the folks
who started it to finally finish it.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] PEP 343 and __with__

2005-10-04 Thread Jason Orendorff
The argument I am going to try to make is that Python coroutines need
a more usable API.

> Try to explain the semantics of the with statement without referring to the
> __enter__ and __exit__ methods, and then see if you still think they're
> superfluous ;)
>
> The @contextmanager generator decorator is just syntactic sugar [...]
> [T]he semantics of the with statement itself can
> only be explained in terms of the __enter__ and __exit__ methods.

That's not true.  It can certainly use the coroutine API instead.

Now... as specified in PEP 342, the coroutine API can be used to
implement 'with', but it's ugly.  I think this is a problem with the
coroutine API, not the idea of using coroutines per se.  Actually I
think 'with' is a pretty tame use case for coroutines.  Other Python
objects (dicts, lists, strings) have convenience methods that are
strictly redundant but make them much easier to use.  Coroutines
should, too.

This:

    with EXPR as VAR:
        BLOCK

expands to this under PEP 342:

    _cm = contextmanager(EXPR)
    VAR = _cm.next()
    try:
        BLOCK
    except:
        try:
            _cm.throw(*sys.exc_info())
        except:
            pass
        raise
    finally:
        try:
            _cm.next()
        except StopIteration:
            pass
        except:
            raise
        else:
            raise RuntimeError

Blah.  But it could look like this:

    _cm = (EXPR).__with__()
    VAR = _cm.start()
    try:
        BLOCK
    except:
        _cm.throw(*excinfo)
    else:
        _cm.finish()
I think that looks quite nice.

Here is the proposed specification for start() and finish():

    class coroutine:    # pseudocode
        ...
        def start(self):
            """ Convenience method -- exactly like next(), but
            assert that this coroutine hasn't already been started.
            """
            if self.__started:
                raise ValueError    # or whatever
            return self.next()

        def finish(self):
            """ Convenience method -- like next(), but expect the
            coroutine to complete without yielding again.
            """
            try:
                self.next()
            except (StopIteration, GeneratorExit):
                pass
            else:
                raise RuntimeError("coroutine didn't finish")

Why is this good?

  - Makes coroutines more usable for everyone, not just for
    implementing 'with'.
  - For example, if you want to feed values to a coroutine, call
    start() first and then send() repeatedly.  Quite sensible.
    (A sketch follows below.)
  - Single mental model for 'with' (always uses a coroutine or
    lookalike object).
  - No need for "contextmanager" wrapper.
  - Harder to implement a context manager object incorrectly
    (it's quite easy to screw up with __enter__ and __exit__).
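
A hypothetical usage sketch for the start()/send() point above, assuming
generators grow the proposed start() method (with plain PEP 342 you would
call next() once instead):

    def running_total():
        # PEP 342 style coroutine: receives values via send()
        total = 0.0
        while True:
            value = (yield total)
            total += value

    coro = running_total()
    coro.start()         # proposed convenience method: prime the coroutine
    coro.send(10.0)      # -> 10.0
    coro.send(32.5)      # -> 42.5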

-j


Re: [Python-Dev] PEP 343 and __with__

2005-10-04 Thread Jason Orendorff
Right after I sent the preceding message I got a funny feeling I'm
wasting everybody's time here.  I apologize.  Guido's original concern
about speedy C implementation for locks stands.  I don't see a good
way around it.

By the way, my expansion of 'with' using coroutines (in previous
message) was incorrect.  The corrected version is shorter; see below.

-j


This:

    with EXPR as VAR:
        BLOCK

would expand to this under PEP 342 and my proposal:

    _cm = (EXPR).__with__()
    VAR = _cm.next()
    try:
        BLOCK
    except:
        _cm.throw(*sys.exc_info())
    finally:
        try:
            _cm.next()
        except (StopIteration, GeneratorExit):
            pass
        else:
            raise RuntimeError("coroutine didn't finish")


Re: [Python-Dev] PEP 343 and __with__

2005-10-04 Thread Guido van Rossum
On 10/4/05, Jason Orendorff <[EMAIL PROTECTED]> wrote:
> This:
>
>     with EXPR as VAR:
>         BLOCK
>
> expands to this under PEP 342:
>
>     _cm = contextmanager(EXPR)
>     VAR = _cm.next()
>     try:
>         BLOCK
>     except:
>         try:
>             _cm.throw(*sys.exc_info())
>         except:
>             pass
>         raise
>     finally:
>         try:
>             _cm.next()
>         except StopIteration:
>             pass
>         except:
>             raise
>         else:
>             raise RuntimeError

Where in the world do you get this idea? The translation is as
follows, according to PEP 343:

    abc = EXPR
    exc = (None, None, None)
    VAR = abc.__enter__()
    try:
        try:
            BLOCK
        except:
            exc = sys.exc_info()
            raise
    finally:
        abc.__exit__(*exc)

PEP 342 doesn't touch on the expansion of with-statements at all.

I think I know where you're coming from, but please do us a favor and
don't misrepresent the PEPs.  If anything, your proposal is more
complicated; it requires four new APIs instead of two, and requires an
extra call to set up (__with__() followed by start()).

Proposals like yours (and every other permutation) were brought up
during the initial discussion. We picked one. Don't create more churn
by arguing for a different variant. Spend your efforts on implementing
it so you can actually use it and see how bad it is (I predict it
won't be bad at all).

--
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] PEP 343 and __with__

2005-10-04 Thread Guido van Rossum
On 10/4/05, Jason Orendorff <[EMAIL PROTECTED]> wrote:
> Right after I sent the preceding message I got a funny feeling I'm
> wasting everybody's time here.  I apologize.  Guido's original concern
> about speedy C implementation for locks stands.  I don't see a good
> way around it.

OK. Our messages crossed, so you can ignore my response. Let's spend
our time implementing the PEPs as they stand, then see what else we
can do with the new APIs.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] Unicode charmap decoders slow

2005-10-04 Thread Martin v. Löwis
Walter Dörwald wrote:
> For charmap decoding we might be able to use an array (e.g. a tuple  
> (or an array.array?) of codepoints instead of dictionary.

This array would have to be sparse, of course. Using an array.array
would be more efficient, I guess - but we would need a C API for arrays
(to validate the type code, and to get ob_item).

> Or we could implement this array as a C array (i.e. gencodec.py would  
> generate C code).

For decoding, we would not get any better than array.array, except for
startup cost.

For encoding, having a C trie might give considerable speedup. _codecs
could offer an API to convert the current dictionaries into
lookup-efficient structures, and the conversion would be done when
importing the codec.

For the trie, two levels (higher and lower byte) would probably be
sufficient: I believe most encodings only use 2 "rows" (256 code
point blocks), very few more than three.
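
A pure-Python sketch of such a two-level table, just to show the layout (the
names are made up; the real structure would be built in C when the codec is
imported):

    def build_trie(encoding_map):
        # encoding_map: code point -> byte value, as in the generated codecs
        top = [None] * 256                  # first level, indexed by the high byte
        for codepoint, byte in encoding_map.items():
            hi, lo = codepoint >> 8, codepoint & 0xFF
            if top[hi] is None:
                top[hi] = [None] * 256      # second level, indexed by the low byte
            top[hi][lo] = byte
        return top

    def encode_char(trie, codepoint):
        if codepoint > 0xFFFF:              # charmap codecs stay within the BMP
            raise UnicodeError("character maps to undefined")
        row = trie[codepoint >> 8]
        if row is None or row[codepoint & 0xFF] is None:
            raise UnicodeError("character maps to undefined")
        return chr(row[codepoint & 0xFF])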

Regards,
Martin


Re: [Python-Dev] Unicode charmap decoders slow

2005-10-04 Thread M.-A. Lemburg
Walter Dörwald wrote:
> Am 04.10.2005 um 04:25 schrieb [EMAIL PROTECTED]:
> 
> 
>>As the OP suggests, decoding with a codec like mac-roman or  
>>iso8859-1 is very
>>slow compared to encoding or decoding with utf-8.  Here I'm working  
>>with 53k of
>>data instead of 53 megs.  (Note: this is a laptop, so it's possible  
>>that
>>thermal or battery management features affected these numbers a  
>>bit, but by a
>>factor of 3 at most)
>>
>>$ timeit.py -s "s='a'*53*1024; u=unicode(s)" "u.encode('utf-8')"
>>1000 loops, best of 3: 591 usec per loop
>>$ timeit.py -s "s='a'*53*1024; u=unicode(s)" "s.decode('utf-8')"
>>1000 loops, best of 3: 1.25 msec per loop
>>$ timeit.py -s "s='a'*53*1024; u=unicode(s)" "s.decode('mac-roman')"
>>100 loops, best of 3: 13.5 msec per loop
>>$ timeit.py -s "s='a'*53*1024; u=unicode(s)" "s.decode('iso8859-1')"
>>100 loops, best of 3: 13.6 msec per loop
>>
>>With utf-8 encoding as the baseline, we have
>>decode('utf-8')  2.1x as long
>>decode('mac-roman') 22.8x as long
>>decode('iso8859-1') 23.0x as long
>>
>>Perhaps this is an area that is ripe for optimization.
> 
> 
> For charmap decoding we might be able to use an array (e.g. a tuple  
> (or an array.array?) of codepoints instead of dictionary.
> 
> Or we could implement this array as a C array (i.e. gencodec.py would  
> generate C code).

That would be a possibility, yes.

Note that the charmap codec was meant as faster replacement
for the old string transpose function. Dictionaries are used
for the mapping to avoid having to store huge (largely empty)
mapping tables - it's a memory-speed tradeoff.

Of course, a C version could use the same approach as
the unicodedatabase module: that of compressed lookup
tables...

http://aggregate.org/TechPub/lcpc2002.pdf

genccodec.py anyone ?

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Oct 04 2005)
>>> Python/Zope Consulting and Support ...http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 


Re: [Python-Dev] Unicode charmap decoders slow

2005-10-04 Thread Walter Dörwald
Am 04.10.2005 um 21:50 schrieb Martin v. Löwis:

> Walter Dörwald wrote:
>
>> For charmap decoding we might be able to use an array (e.g. a  
>> tuple  (or an array.array?) of codepoints instead of dictionary.
>>
>
> This array would have to be sparse, of course.

For encoding yes, for decoding no.

> Using an array.array
> would be more efficient, I guess - but we would need a C API for  
> arrays
> (to validate the type code, and to get ob_item).

For decoding it should be sufficient to use a unicode string of  
length 256. u"\ufffd" could be used for "maps to undefined". Or the  
string might be shorter and byte values greater than the length of  
the string are treated as "maps to undefined" too.
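
Roughly like this (a sketch, assuming the decoding_map dictionaries that
gencodec.py-generated codecs already carry, with None marking undefined
bytes):

    def make_decoding_table(decoding_map):
        table = [u"\ufffd"] * 256       # u"\ufffd" means "maps to undefined"
        for byte, codepoint in decoding_map.items():
            if codepoint is not None:
                table[byte] = unichr(codepoint)
        return u"".join(table)

    def decode(s, table):
        return u"".join([table[ord(c)] for c in s])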

>> Or we could implement this array as a C array (i.e. gencodec.py  
>> would  generate C code).
>>
>
> For decoding, we would not get any better than array.array, except for
> startup cost.

Yes.

> For encoding, having a C trie might give considerable speedup. _codecs
> could offer an API to convert the current dictionaries into
> lookup-efficient structures, and the conversion would be done when
> importing the codec.
>
> For the trie, two levels (higher and lower byte) would probably be
> sufficient: I believe most encodings only use 2 "rows" (256 code
> point blocks), very few more than three.

This might work, although nobody has complained about charmap  
encoding yet. Another option would be to generate a big switch  
statement in C and let the compiler decide about the best data  
structure.

Bye,
Walter Dörwald



[Python-Dev] Static builds on Windows (continued)

2005-10-04 Thread Marvin
Earlier references:
http://mail.python.org/pipermail/python-dev/2004-July/046499.html

I want to be able to create a version of python24.lib that is a static library,
suitable for creating a python.exe or other .exe using python's api.

So I did as the earlier poster suggested, using 2.4.1 sources.  I modified the
PCBuild/pythoncore and python .vcproj files as follows:

  General/ ConfigurationType/ Static library (was dynamic in pythoncore)
  c/C++ Code Generation RT Library /MT (was /MTD for mt DLL)
  c/c++/Precompiled/ Not Using Precompiled headers (based on some MSDN hints)
  librarian OutputFile .//python24.lib
  Preprocessor: added Py_NO_ENABLE_SHARED. Removed USE_DL_IMPORT

I built pythoncore and python. The resulting python.exe worked fine, but did
indeed fail when I tried to dynamically load anything (Dialog said: the
application terminated abnormally)

Now I am not very clueful about the dllimport/dllexport business.  But it seems
that I should be able to link MY program against a .lib somehow (a real lib),
and let the .EXE export the symbols somehow.

My first guess is to try to use /MTD, use Py_NO_ENABLE_SHARED when building
python24.lib, but then use PY_ENABLE_SHARED when compiling the python.c.  I'll
try that later, but anyone have more insight into the right way to do this?

marvin


Re: [Python-Dev] Unicode charmap decoders slow

2005-10-04 Thread Martin v. Löwis
Walter Dörwald wrote:
>> This array would have to be sparse, of course.
> 
> 
> For encoding yes, for decoding no.
[...]
> For decoding it should be sufficient to use a unicode string of  length 
> 256. u"\ufffd" could be used for "maps to undefined". Or the  string 
> might be shorter and byte values greater than the length of  the string 
> are treated as "maps to undefined" too.

Right. That's what I meant by "sparse": you somehow need to represent
"no value".

> This might work, although nobody has complained about charmap  encoding 
> yet. Another option would be to generate a big switch  statement in C 
> and let the compiler decide about the best data  structure.

I would try to avoid generating C code at all costs. Maintaining the 
build processes will just be a nightmare.

Regards,
Martin


Re: [Python-Dev] Static builds on Windows (continued)

2005-10-04 Thread Martin v. Löwis
Marvin wrote:
> I built pythoncore and python. The resulting python.exe worked fine, but did
> indeed fail when I tried to dynamically load anything (Dialog said: the
> application terminated abnormally)

Not sure what you are trying to do here. In your case, dynamic loading 
simply cannot work. The extension modules all link with python24.dll, 
which you don't have. It may find some python24.dll, which then gives 
conflicts with the Python interpreter that is already running.

So what you really should do is disable dynamic loading entirely. To do
so, remove dynload_win from your project, and #undef 
HAVE_DYNAMIC_LOADING in PC/pyconfig.h.

Not sure if anybody has recently tested whether this configuration
actually works - if you find that it doesn't, please post your patches
to sf.net/projects/python.

If you really want to provide dynamic loading of some kind, you should
arrange the extension modules to import the symbols from your .exe.
Linking the exe should generate an import library, and you should link
the extensions against that.

HTH,
Martin


Re: [Python-Dev] Unicode charmap decoders slow

2005-10-04 Thread Tony Nelson
At 9:37 AM +0200 10/4/05, Walter Dörwald wrote:
>Am 04.10.2005 um 04:25 schrieb [EMAIL PROTECTED]:
>
>>As the OP suggests, decoding with a codec like mac-roman or iso8859-1 is
>>very slow compared to encoding or decoding with utf-8. Here I'm working
>>with 53k of data instead of 53 megs. (Note: this is a laptop, so it's
>>possible that thermal or battery management features affected these
>>numbers a bit, but by a factor of 3 at most)
>>
>> $ timeit.py -s "s='a'*53*1024; u=unicode(s)" "u.encode('utf-8')"
>> 1000 loops, best of 3: 591 usec per loop
>> $ timeit.py -s "s='a'*53*1024; u=unicode(s)" "s.decode('utf-8')"
>> 1000 loops, best of 3: 1.25 msec per loop
>> $ timeit.py -s "s='a'*53*1024; u=unicode(s)" "s.decode('mac-roman')"
>> 100 loops, best of 3: 13.5 msec per loop
>> $ timeit.py -s "s='a'*53*1024; u=unicode(s)" "s.decode('iso8859-1')"
>> 100 loops, best of 3: 13.6 msec per loop
>>
>> With utf-8 encoding as the baseline, we have
>> decode('utf-8')  2.1x as long
>> decode('mac-roman') 22.8x as long
>> decode('iso8859-1') 23.0x as long
>>
>> Perhaps this is an area that is ripe for optimization.
>
>For charmap decoding we might be able to use an array (e.g. a tuple
>(or an array.array?) of codepoints instead of dictionary.
>
>Or we could implement this array as a C array (i.e. gencodec.py would
>generate C code).

Fine -- as long as it still allows changing code points.  I add the missing
"Apple logo" code point to mac-roman in order to permit round-tripping
(0xF0 <=> 0xF8FF, per Apple docs).  (New bug #1313051.)

If an all-C implementation wouldn't permit changing codepoints, I suggest
instead just /caching/ the translation in C arrays stored with the codec
object.  The cache would be invalidated on any write to the codec's mapping
dictionary, and rebuilt the next time anything was translated.  This would
maintain the present semantics, work with current codecs, and still provide
the desired speed improvement.

But is there really no way to say this fast in pure Python?  The way a
one-to-one byte mapping can be done with "".translate()?
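
(For comparison, the byte-to-byte case referred to here -- an arbitrary table,
just to show the shape of the API:)

    import string

    # build a 256-byte table once, then one C-level call per string
    table = string.maketrans(string.ascii_lowercase, string.ascii_uppercase)
    print "53k of data".translate(table)    # -> "53K OF DATA"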

TonyN.


Re: [Python-Dev] Unicode charmap decoders slow

2005-10-04 Thread Tony Nelson
[Recipient list not trimmed, as my replies must be vetted by a moderator,
which seems to delay them. :]

At 11:48 PM +0200 10/4/05, Walter Dörwald wrote:
>Am 04.10.2005 um 21:50 schrieb Martin v. Löwis:
>
>> Walter Dörwald wrote:
>>
>>> For charmap decoding we might be able to use an array (e.g. a
>>> tuple  (or an array.array?) of codepoints instead of dictionary.
>>>
>>
>> This array would have to be sparse, of course.
>
>For encoding yes, for decoding no.
>
>> Using an array.array would be more efficient, I guess - but we would
>> need a C API for arrays (to validate the type code, and to get ob_item).
>
>For decoding it should be sufficient to use a unicode string of
>length 256. u"\ufffd" could be used for "maps to undefined". Or the
>string might be shorter and byte values greater than the length of
>the string are treated as "maps to undefined" too.

With Unicode using more than 64K codepoints now, it might be more forward
looking to use a table of 256 32-bit values, with no need for tricky
values.  There is no need to add any C code to the codecs; just add some
more code to the existing C function (which, if I have it right, is
PyUnicode_DecodeCharmap() in unicodeobject.c).

 ...
>> For encoding, having a C trie might give considerable speedup. _codecs
>> could offer an API to convert the current dictionaries into
>> lookup-efficient structures, and the conversion would be done when
>> importing the codec.
>>
>> For the trie, two levels (higher and lower byte) would probably be
>> sufficient: I believe most encodings only use 2 "rows" (256 code
>> point blocks), very few more than three.
>
>This might work, although nobody has complained about charmap
>encoding yet. Another option would be to generate a big switch
>statement in C and let the compiler decide about the best data
>structure.

I'm willing to complain. :)  I might allow saving of my (53 MB) MBox file.
(Not that editing received mail makes as much sense as searching it.)

Encoding can be made fast using a simple hash table with external chaining.
There are max 256 codepoints to encode, and they will normally be well
distributed in their lower 8 bits.  Hash on the low 8 bits (just mask), and
chain to an area with 256 entries.  Modest storage, normally short chains,
therefore fast encoding.
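
Modelled in pure Python just to show the proposed layout (the real structure
would live in C; the names are made up):

    def build_buckets(encoding_map):
        # hash on the low 8 bits of the code point, chain within the bucket
        buckets = [[] for i in range(256)]
        for codepoint, byte in encoding_map.items():
            buckets[codepoint & 0xFF].append((codepoint, byte))
        return buckets

    def encode_char(buckets, codepoint):
        for cp, byte in buckets[codepoint & 0xFF]:   # normally a chain of length 1
            if cp == codepoint:
                return chr(byte)
        raise UnicodeError("character maps to undefined")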


At 12:08 AM +0200 10/5/05, Martin v. Löwis wrote:

>I would try to avoid generating C code at all costs. Maintaining the
>build processes will just be a nightmare.

I agree; also I don't think the generated codecs need to be changed at all.
All the changes can be made to the existing C functions, by adding caching
per a reply of mine that hasn't made it to the list yet.  Well, OK,
something needs to hook writes to the codec's dictionary, but I /think/
that only needs Python code.  I say:

>...I suggest instead just /caching/ the translation in C arrays stored
>with the codec object.  The cache would be invalidated on any write to the
>codec's mapping dictionary, and rebuilt the next time anything was
>translated.  This would maintain the present semantics, work with current
>codecs, and still provide the desired speed improvement.

Note that this caching is done by new code added to the existing C
functions (which, if I have it right, are in unicodeobject.c).  No
architectural changes are made; no existing codecs need to be changed;
everything will just work, and usually work faster, with very modest memory
requirements of one 256 entry array of 32-bit Unicode values and a hash
table with 256 1-byte slots and 256 chain entries, each having a 4 byte
Unicode value, a byte output value, a byte chain index, and probably 2
bytes of filler, for a hash table size of 2304 bytes per codec.

TonyN.


Re: [Python-Dev] Unicode charmap decoders slow

2005-10-04 Thread Martin v. Löwis
Tony Nelson wrote:
>>For decoding it should be sufficient to use a unicode string of
>>length 256. u"\ufffd" could be used for "maps to undefined". Or the
>>string might be shorter and byte values greater than the length of
>>the string are treated as "maps to undefined" too.
> 
> 
> With Unicode using more than 64K codepoints now, it might be more forward
> looking to use a table of 256 32-bit values, with no need for tricky
> values.

You might be missing the point. \ufffd is REPLACEMENT CHARACTER,
which would indicate that the byte with that index is really unused
in that encoding.

> Encoding can be made fast using a simple hash table with external chaining.
> There are max 256 codepoints to encode, and they will normally be well
> distributed in their lower 8 bits.  Hash on the low 8 bits (just mask), and
> chain to an area with 256 entries.  Modest storage, normally short chains,
> therefore fast encoding.

This is what is currently done: a hash map with 256 keys. You are 
complaining about the performance of that algorithm. The issue of
external chaining is likely irrelevant: there likely are no collisions,
even though Python uses open addressing.

>>...I suggest instead just /caching/ the translation in C arrays stored
>>with the codec object.  The cache would be invalidated on any write to the
>>codec's mapping dictionary, and rebuilt the next time anything was
>>translated.  This would maintain the present semantics, work with current
>>codecs, and still provide the desired speed improvement.

That is not implementable. You cannot catch writes to the dictionary.

> Note that this caching is done by new code added to the existing C
> functions (which, if I have it right, are in unicodeobject.c).  No
> architectural changes are made; no existing codecs need to be changed;
> everything will just work

Please try to implement it. You will find that you cannot. I don't
see how regenerating/editing the codecs could be avoided.

Regards,
Martin


Re: [Python-Dev] Unicode charmap decoders slow

2005-10-04 Thread Martin v. Löwis
Tony Nelson wrote:
 > But is there really no way to say this fast in pure Python?  The way a
 > one-to-one byte mapping can be done with "".translate()?

Well, .translate isn't exactly pure Python. One-to-one between bytes
and Unicode code points simply can't work. Just try all alternatives
yourself and see if you can get any better than charmap_decode.

Some would argue that charmap_decode *is* fast.
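
One way to run that experiment (a sketch; the identity mapping and the test
data here are arbitrary):

    import timeit

    setup = (
        "import codecs\n"
        "decoding_map = dict([(i, i) for i in range(256)])\n"
        "table = u''.join([unichr(i) for i in range(256)])\n"
        "data = 'a' * 53 * 1024\n")

    print timeit.Timer("codecs.charmap_decode(data, 'strict', decoding_map)",
                       setup).repeat(3, 100)
    print timeit.Timer("u''.join([table[ord(c)] for c in data])",
                       setup).repeat(3, 100)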

Regards,
Martin