On Jun 25, 2010, at 5:02 PM, Guido van Rossum wrote:

> But you'd still have to validate it, right? You wouldn't want to go on
> using what you thought was wrapped UTF-8 if it wasn't actually valid
> UTF-8 (or you'd be worse off than in Python 2). So you're really just
> worried about space consumption.

So, yes, I am mainly worried about memory consumption, but don't underestimate 
the pure CPU cost of doing all the copying.  It's quite a bit faster to simply 
scan through a string than to scan it while simultaneously faulting the L2 cache 
by writing the copy out to some other area of memory.

Plus, if I am decoding with the surrogateescape error handler (or its effective 
equivalent), then no, I don't need to validate it in advance; interpretation 
can be done lazily as necessary.  I realize that this is just GIGO, but I 
wouldn't be doing this on data that didn't have an explicitly declared or 
required encoding in the first place.
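
Just to make that concrete (a toy illustration of the existing stdlib behavior, 
not anything new):

    # Undecodable bytes are smuggled through as lone surrogates instead of
    # raising UnicodeDecodeError, so validation/interpretation can be lazy.
    raw = b"caf\xe9 and \xff\xfe junk"            # not valid UTF-8
    text = raw.decode("utf-8", "surrogateescape")
    # text == 'caf\udce9 and \udcff\udcfe junk'
    assert text.encode("utf-8", "surrogateescape") == raw   # round-trips exactly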

> I'd like to see a lot of hard memory profiling data before I got overly 
> worried about that.


I know of several Python applications that are already constrained by memory.  
I don't have a lot of hard memory profiling data, but in an environment where 
you're spawning as many processes as you can in order to consume _all_ the 
physically available RAM for string processing, it stands to reason that 
properly decoding everything and thereby exploding everything out into 4x as 
much data (or 2x, if you're lucky) would result in a commensurate decrease in 
throughput.
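
As a back-of-the-envelope illustration of where the "2x or 4x" comes from (the 
sizes here are rough and build-dependent, not profiling data):

    import sys

    raw = b"x" * 1000                     # 1000 bytes of ASCII text
    text = raw.decode("ascii")

    # On the unicode builds I'm talking about, the bytes object costs about
    # one byte per character plus a fixed header, while the str costs two or
    # four bytes per character (narrow vs. wide build).
    print(sys.getsizeof(raw), sys.getsizeof(text))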

I don't think I could even reasonably _propose_ that such a project stop 
treating textual data as bytes, because there's no optimization strategy once 
that sort of architecture has been put into place. If your function says "this 
takes unicode", then you just have to bite the bullet and decode it, or rewrite 
it again to have a different requirement.

So, right now, I don't know where I'd even get the data to make the argument in 
the first place :).  If there were some abstraction in the core's treatment of 
strings, though, I could decode things and note their encoding without 
immediately paying this cost (or, alternately, pay the cost to see whether it's 
really so bad, but with the option of managing or optimizing it separately).  This is 
why I'm asking for a way for me to implement my own string type, and not for a 
change of behavior or an optimization in the stdlib itself: I could be wrong, I 
don't have a particularly high level of certainty in my performance estimates, 
but I think that my concerns are realistic enough that I don't want to embark 
on a big re-architecture of text-handling only to have it become a performance 
nightmare that needs to be reverted.
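
For concreteness, the sort of thing I have in mind is roughly the following 
sketch; the class name and methods are made up, and a pure-Python wrapper like 
this obviously can't be passed where the interpreter insists on a real str, 
which is precisely the abstraction I'm asking about:

    class EncodedString:
        """Hypothetical sketch: keep the raw bytes and their declared
        encoding, and only pay for decoding when text is actually needed."""

        def __init__(self, raw, encoding="utf-8"):
            self._raw = raw
            self._encoding = encoding
            self._text = None                 # decoded lazily, at most once

        def __str__(self):
            if self._text is None:
                self._text = self._raw.decode(self._encoding, "surrogateescape")
            return self._text

        def __bytes__(self):
            return self._raw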

As Robert Collins pointed out, Bazaar already has performance issues related to 
encoding.  I know they've done a lot of profiling in that area, so I 
hope eventually someone from that project will show up with some data to 
demonstrate it :).  And I've definitely heard many, many anecdotes (some of 
them in this thread) about people distorting their data structures in various 
ways to avoid paying decoding cost in the ASCII/latin1 case, whether it's 
*actually* a significant performance issue or not.  I would very much like to 
tell those people "Just call .decode(), and if it turns out to actually be a 
performance issue, you can always deal with it later, with a custom string 
type."  I'm confident that in *most* cases, it would not be.

Anyway, this may be a serious issue, but I increasingly feel like I'm veering 
into python-ideas territory, so perhaps I'll just have to burn this bridge when 
I come to it.  Hopefully after the moratorium.
