[issue41622] Add support for emoji-data.txt and emoji-variation-sequences.txt to unicodedata

2020-09-01 Thread STINNER Victor


Change by STINNER Victor :


--
nosy:  -vstinner

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue41622] Add support for emoji-data.txt and emoji-variation-sequences.txt to unicodedata

2020-08-29 Thread Terry J. Reedy

Terry J. Reedy  added the comment:

Base facts: The Unicode Character Database, UCD, is defined in Unicode 
Standard Annex #44 (tr44), https://www.unicode.org/reports/tr44/.  The latest 
files (now for 13.0) are at https://www.unicode.org/Public/UCD/latest/ and in 
particular, in the ucd subdirectory. ucd/UnicodeData.txt has a sequential list 
of current codepoints, including emoji codepoints.

Version 13 added the subdirectory ucd/emoji with the 2 files listed above.  
emoji-variation-sequences.txt comprises 177 highly redundant pairs of lines 
like this:
0023 FE0E  ; text style;  # (1.1) NUMBER SIGN
0023 FE0F  ; emoji style; # (1.1) NUMBER SIGN
The only difference between the lines is 'FE0E; text' versus 'FE0F; emoji', 
'TEXT PRESENTATION SELECTOR' versus 'EMOJI PRESENTATION SELECTOR'.
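
A minimal sketch of how such lines could be parsed, using only the two sample 
lines quoted above. The helper name parse_variation_line is hypothetical and 
is not part of unicodedata; the field layout is assumed from those samples.

```python
def parse_variation_line(line):
    """Parse one line in the layout shown above.

    Returns (sequence, style), or None for blank and comment-only lines.
    """
    # Everything after the first '#' is a comment.
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    # First field: space-separated hex code points (base, then selector).
    sequence = "".join(chr(int(cp, 16)) for cp in fields[0].split())
    return sequence, fields[1]


sample = """\
0023 FE0E  ; text style;  # (1.1) NUMBER SIGN
0023 FE0F  ; emoji style; # (1.1) NUMBER SIGN
"""

parsed = [p for p in map(parse_variation_line, sample.splitlines()) if p]
# parsed == [('#\ufe0e', 'text style'), ('#\ufe0f', 'emoji style')]
```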

tr51 does not explicitly say that every line is paired, but perusal suggests 
that this is true, making the file highly redundant.  The 177 characters 
include some non-emoji symbols, like #, and omit most emoji, including SNAKE, 
'\U0001f40d', '🐍' (colored coiled snake). And yet, here, at least in Firefox, 
is the supposedly invalid text snake, '\U0001f40d\ufe0e': '🐍︎' (a flat 
black-only, uncoiled wiggling snake head).  I don't know how '#\ufe0f' might be 
different from plain '#'.
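
For anyone who wants to repeat the experiment, here is a small sketch that 
builds the same sequences in Python and inspects them with the existing 
unicodedata.name function. The constant names are my own; how the resulting 
strings render depends on the font and application, so nothing is claimed 
about that here.

```python
import unicodedata

TEXT_SELECTOR = "\ufe0e"   # requests text presentation
EMOJI_SELECTOR = "\ufe0f"  # requests emoji presentation

snake = "\U0001f40d"
text_snake = snake + TEXT_SELECTOR     # not listed in the file
emoji_hash = "#" + EMOJI_SELECTOR      # listed in the file

# Each sequence is two code points: the base character plus a selector.
names = [unicodedata.name(ch) for ch in text_snake]
# names == ['SNAKE', 'VARIATION SELECTOR-15']
```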

Our UCD copy is accessed via 13 functions in the unicodedata module.  Support 
for the file could consist of a new function, such as 'emoji_text'.  The 
implementation could be 'chr in emoji_text_set', where the latter is the set of 
177 characters.  But given the accidental experiment above with an unauthorized 
sequence, I don't know how useful it would be.
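
A rough sketch of that idea follows. Both emoji_text and 
build_emoji_text_set are hypothetical names, not existing unicodedata 
functions, and the set here is built from the two sample lines only, not the 
full 177 characters.

```python
def build_emoji_text_set(lines):
    """Collect base characters that have a 'text style' sequence."""
    chars = set()
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        base = fields[0].split()[0]           # first code point is the base
        if fields[1] == "text style":
            chars.add(chr(int(base, 16)))
    return frozenset(chars)


SAMPLE_LINES = [
    "0023 FE0E  ; text style;  # (1.1) NUMBER SIGN",
    "0023 FE0F  ; emoji style; # (1.1) NUMBER SIGN",
]
emoji_text_set = build_emoji_text_set(SAMPLE_LINES)


def emoji_text(char):
    """Return True if char has a listed text presentation sequence."""
    return char in emoji_text_set
```

With this sample data, emoji_text('#') is true and emoji_text('\U0001f40d') 
is false, which matches the file even though the snake sequence rendered 
anyway in the experiment above.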

--
nosy: +terry.reedy




[issue41622] Add support for emoji-data.txt and emoji-variation-sequences.txt to unicodedata

2020-08-23 Thread jack1142


New submission from jack1142 :

The `emoji-data.txt` and `emoji-variation-sequences.txt` files were formally 
pulled into the UCD as of Version 13.0 [1], so I think that unicodedata, as a 
package providing access to the UCD, could support them as well.
In particular:
- `emoji-data.txt` lists character properties for emoji characters [2] 
- `emoji-variation-sequences.txt` lists valid text and emoji presentation 
sequences [3]

Data from `emoji-variation-sequences.txt` can be used to ensure consistent 
rendering of emoji characters across devices [4] (`StandardizedVariants.txt` 
has a similar purpose for non-emoji characters).
I'm not entirely sure of the use cases for `emoji-data.txt`, but because it's 
also newly added in UCD 13.0.0, I figured I at least shouldn't omit it when 
making this issue.
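
To illustrate what support for `emoji-data.txt` might involve, here is a 
minimal sketch that parses property lines in the usual UCD 
"code point or range ; property" layout. The function name 
`parse_property_line` is hypothetical, and the two sample lines are written to 
follow that layout for illustration, with simplified comments; they are not 
copied verbatim from the file.

```python
def parse_property_line(line):
    """Return (start, end, property), or None for blank/comment lines."""
    data = line.split("#", 1)[0].strip()  # drop trailing comment
    if not data:
        return None
    range_field, prop = [field.strip() for field in data.split(";")]
    # A range is written 'XXXX..YYYY'; a single code point has no '..'.
    first, _, last = range_field.partition("..")
    start = int(first, 16)
    end = int(last, 16) if last else start
    return start, end, prop


sample = """\
0023          ; Emoji                # number sign
0030..0039    ; Emoji                # digit zero..digit nine
"""

emoji_chars = set()
for entry in map(parse_property_line, sample.splitlines()):
    if entry and entry[2] == "Emoji":
        start, end, _ = entry
        emoji_chars.update(chr(cp) for cp in range(start, end + 1))
```

A lookup such as `"5" in emoji_chars` would then answer whether a character 
carries the `Emoji` property, at least for the lines that were loaded.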


[1] https://www.unicode.org/reports/tr44/#Change_History - Changes in Unicode 
13.0.0, "Emoji Data" section
[2] https://www.unicode.org/reports/tr51/#Emoji_Properties_and_Data_Files
[3] https://www.unicode.org/reports/tr51/#Emoji_Variation_Sequences
[4] https://unicode.org/faq/vs.html#1

--
components: Unicode
messages: 375826
nosy: ezio.melotti, jack1142, vstinner
priority: normal
severity: normal
status: open
title: Add support for emoji-data.txt and emoji-variation-sequences.txt to 
unicodedata
type: enhancement
versions: Python 3.10, Python 3.9
