On Sun, 19 Aug 2012 01:04:25 -0700, Paul Rubin wrote:

> Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> writes:
>> This standard data structure is called UCS-2 ... There's an extension
>> to UCS-2 called UTF-16
>
> My own understanding is UCS-2 simply shouldn't be used any more.

Pretty much. But UTF-16 with lax support for surrogates (that is,
surrogates are included but treated as two characters) is essentially
UCS-2 with the restriction against surrogates lifted. That's what Python
currently does, and so does Javascript:

http://mathiasbynens.be/notes/javascript-encoding

The reality is that support for the Unicode supplementary planes is
pretty poor. Even when applications support them, most fonts don't have
glyphs for the characters. Anything which makes handling of Unicode
supplementary characters better is a step forward.

>> * Variable-byte formats like UTF-8 and UTF-16 mean that basic string
>> operations are not O(1) but are O(N). That means they are slow, or
>> buggy, pick one.
>
> This I don't see. What are the basic string operations?

The ones I'm specifically referring to are indexing and copying
substrings. There may be others.

> * Examine the first character, or first few characters ("few" = "usually
> bounded by a small constant") such as to parse a token from an input
> stream. This is O(1) with either encoding.

That's actually O(K), for K = "a few", whatever "a few" means. But we
know that anything is fast for small enough N (or K in this case).

> * Slice off the first N characters. This is O(N) with either encoding
> if it involves copying the chars. I guess you could share references
> into the same string, but if the slice reference persists while the
> big reference is released, you end up not freeing the memory until
> later than you really should.

As a first approximation, memory copying is assumed to be free, or at
least constant time. That's not strictly true, but Big Oh analysis looks
at algorithmic complexity; it's not a substitute for actual benchmarks.
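To make the "surrogates treated as two characters" point concrete, here's a
small sketch in Python 3.3+ (where PEP 393 is in effect). The surrogate-pair
arithmetic is the standard UTF-16 encoding rule, and U+1D11E is just a
convenient example of a supplementary plane character:

```python
# Sketch (Python 3.3+, post-PEP 393): a supplementary plane character
# is one code point to Python, but two 16-bit code units (a surrogate
# pair) in UTF-16 -- which is why a lax UCS-2-ish implementation, or
# Javascript, reports its length as 2.
s = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, U+1D11E (supplementary plane)

# A PEP 393 build sees a single character:
assert len(s) == 1

# Encoded as UTF-16, it becomes two 16-bit code units (4 bytes):
units = len(s.encode("utf-16-be")) // 2
assert units == 2

# The surrogate pair can be computed by hand from the code point:
cp = ord(s) - 0x10000
high = 0xD800 + (cp >> 10)   # high (lead) surrogate
low = 0xDC00 + (cp & 0x3FF)  # low (trail) surrogate
print(hex(high), hex(low))   # -> 0xd834 0xdd1e
```

On a narrow build (or in Javascript), indexing into a string containing
such characters lands you on one half of the pair, which is exactly the
lax behaviour described above.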
> Meanwhile, an example of the 393 approach failing: I was involved in a
> project that dealt with terabytes of OCR data of mostly English text.

I assume that this wasn't one giant multi-terabyte string.

> So the chars were mostly ascii, but there would be occasional non-ascii
> chars including supplementary plane characters, either because of
> special symbols that were really in the text, or the typical OCR
> confusion emitting those symbols due to printing imprecision. That's a
> natural for UTF-8 but the PEP-393 approach would bloat up the memory
> requirements by a factor of 4.

Not necessarily. Presumably you're scanning each page into a single
string. Then only the pages containing a supplementary plane char will
be bloated, which is likely to be rare. Especially since I don't expect
your OCR application would recognise many non-BMP characters -- what
does U+110F3, "SORA SOMPENG DIGIT THREE", look like? If the OCR software
doesn't recognise it, you can't get it in your output. (If you do, the
OCR software has a nasty bug.)

Anyway, in my ignorant opinion the proper fix here is to tell the OCR
software not to bother trying to recognise Imperial Aramaic, Domino
Tiles, Phaistos Disc symbols, or Egyptian Hieroglyphs if you aren't
expecting them in your source material. Not only will the scanning go
faster, but you'll get fewer wrong characters.

[...]

> I realize the folks who designed and implemented PEP 393 are very smart
> cookies and considered stuff carefully, while I'm just an internet user
> posting an immediate impression of something I hadn't seen before (I
> still use Python 2.6), but I still have to ask: if the 393 approach
> makes sense, why don't other languages do it?

There has to be a first time for everything.

> Ropes of UTF-8 segments seems like the most obvious approach and I
> wonder if it was considered.
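You can see the per-string (per-page) nature of the bloat directly in
CPython 3.3+. The exact byte counts below are implementation details of
CPython, so this sketch only compares relative sizes:

```python
# Sketch (CPython 3.3+): PEP 393 stores each string in the narrowest
# of 1-, 2- or 4-bytes-per-char layouts that fits its widest character.
# Exact sizes are CPython implementation details; we only compare them.
import sys

page_ascii = "a" * 1000                  # pure ASCII page: 1 byte/char
page_bloated = "a" * 999 + "\U0001D11E"  # one non-BMP char: 4 bytes/char

ascii_size = sys.getsizeof(page_ascii)
bloated_size = sys.getsizeof(page_bloated)

# Only the page containing the supplementary character pays the
# 4-bytes-per-char cost; for large pages that's roughly a 4x difference.
assert bloated_size > 3 * ascii_size
print(ascii_size, bloated_size)
```

So in the OCR scenario, the factor-of-4 penalty applies only to the pages
that actually contain a supplementary plane character, not to the whole
corpus.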
Ropes have been considered and rejected because, while they are
asymptotically fast, in common cases the added complexity actually makes
them slower -- especially for immutable strings, where you aren't
inserting into the middle of a string.

http://mail.python.org/pipermail/python-dev/2000-February/002321.html

PyPy has revisited ropes and uses, or at least used, ropes as its native
string data structure. But those are ropes of *bytes*, not UTF-8.

http://morepypy.blogspot.com.au/2007/11/ropes-branch-merged.html

-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list