Re: metric for block coverage

2018-02-18 Thread Leonardo Boiko via Unicode
The most useful feature for me (Debian user, linguist) would be a search
system where I can provide a string and filter fonts down to those that include
glyphs for all of its characters; ideally I could also combine it with other
search criteria, like OTF features (true small caps, etc.).  I often write
academic texts where I use specialized characters not really classifiable
by language, script or block (say, 'ǎ/ǚ' for pīnyīn, plus IPA tone marks,
plus multiple combining diacritics like 'ā́', all in the same running
text).  I then need visual inspection to choose a font that actually looks
halfway decent, typographically speaking, and to check for bugs in IPA
kerning, etc.  For a long time now, I've been using a simple Python script
to filter fonts in this manner (it just straightforwardly renders the
provided characters, then uses `pango.Layout.get_unknown_glyphs_count()` to
remove fonts lacking them, and displays all the rest for inspection).
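
For what it's worth, the filtering step can be approximated without rendering
anything, by comparing the code points of the sample string against each
font's character map.  The sketch below is not the Pango-based script described
above; it assumes the coverage sets have already been extracted by some other
tool, and the font names and sets in it are made up for illustration:

```python
from typing import Dict, List, Set


def required_codepoints(sample: str) -> Set[int]:
    """Code points the sample needs, ignoring whitespace."""
    return {ord(ch) for ch in sample if not ch.isspace()}


def fonts_covering(sample: str, coverage: Dict[str, Set[int]]) -> List[str]:
    """Names of the fonts whose coverage set contains every needed code point."""
    needed = required_codepoints(sample)
    return sorted(name for name, cps in coverage.items() if needed <= cps)


def missing_characters(sample: str, cps: Set[int]) -> List[str]:
    """Needed code points that a single font lacks, formatted as U+XXXX."""
    return [f"U+{cp:04X}" for cp in sorted(required_codepoints(sample) - cps)]


if __name__ == "__main__":
    # Hypothetical coverage data; real sets would come from the fonts themselves.
    coverage = {
        "Example Serif": set(range(0x20, 0x250)) | set(range(0x300, 0x370)),
        "Example Sans": set(range(0x20, 0x180)),
    }
    sample = "\u01ce \u01da a\u0304\u0301"  # ǎ, ǚ, and a + macron + acute
    print(fonts_covering(sample, coverage))
    for name, cps in coverage.items():
        print(name, missing_characters(sample, cps))
```

Code point coverage is only a necessary condition: stacked diacritics and IPA
kerning still need the visual inspection mentioned above.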

2018-02-18 22:39 GMT+01:00 David Starner via Unicode:

> On Sun, Feb 18, 2018 at 3:42 AM Adam Borowski  wrote:
>
>> I probably used a bad example: scripts like Cyrillic (not even Supplement)
>> include both essential letters and those which are historic only or used
>> by
>> old folks in a language spoken by 1000, who use Russian (or English...)
>> for
>> all computer use anyway -- all within one block.
>>
>> What I'm thinking, is that a beautiful font that covers Russian,
>> Ukrainian,
>> Serbian, Kazakh, Mongolian cyr, etc., should be recommended to users
>> before
>> one whose only grace is including every single codepoint.
>>
>
> I'm not sure what your goal is. Opening up gucharmap shows me that
> FreeSerif and Noto Serif both have complete coverage of Cyrillic and
> Cyrillic Supplemental. We have reasonable fonts to offer users that cover
> everything Cyrillic, or pretty much any script in use. I'm not sure where
> and how you're trying to cut a line between a beautiful multilingual font
> and a workable full font.
>
> Ultimately, when I look at fonts, I look for Esperanto support. I'd be a
> little surprised if it didn't come with Polish support, but it's unlikely
> to be my problem. A useful feature for a font selector for me would be able
> to select English, German, and Esperanto and get just the fonts that
> support those languages (in an extended sense, including the extra-ASCII
> punctuation and accents English needs, for example.) It does me absolutely
> no good to know that it has "good, but not complete" Latin-A support.
> Likewise, if you're a Persian speaker, knowing that the Arabic block has
> "good, but not complete" support is worthless.
>
> For single language ancient scripts, like Ancient Greek, then virtually
> any font with decent coverage should cover the generally useful stuff. For
> more complex ancient scripts, it pretty much has to be on a per language
> matter. For some ancient scripts, like Runic and Old Italic, I understand
> that after unifying the various writings, most people feel a
> language-specific font is necessary for any serious work.
>
> The ultimate problem is that the question is will it support my needs.
> Language can often be used as a proxy, but names can often foil that. And
> symbols are worse; € is the only character from Currency Symbols that's
> used in an extended work in many, many instances, but so is ₪. Percentage
> of block support is minimally helpful. Miscellaneous symbols lives up to
> its name; ⛤, ⚇, ♷, ♕, and ☵ are all useful symbols, but not likely to be
> found in the same work. Again, recommend 100% coverage or do the manual
> work of separating them into groups and offering a specific font (game,
> occult, etc.) that covers it, but messing around with a beautiful font with
> less than 100% coverage versus a decent font with 100% coverage seems
> counterproductive.
>
>> Not sure if I understand your advice right: you're recommending to ignore
>> all the complexity and going with just raw count of in-block coverage?
>> This could work: a released font probably has codepoints its author
>> considers important.
>>
>
> I guess separating out by language when you need to is going to be the way
> that helps people the most. Where that's most complex, I'm not sure why
> you're not just offering a decent 100% coverage font (which Debian has a
> decent selection of) and stepping back.
>


Text rendering of emojis (was: Re: First bonafide use (≠ mention) of emoji by an academic publisher?)

2017-07-24 Thread Leonardo Boiko via Unicode
Speaking of which (sorry if this is going off-topic, but I don't know where
else I could ask): I don't think there's a way to configure Linux or Android
systems to always prefer text rendering for emojis, is there? (I love text
emojis.)

2017-07-24 16:24 GMT+02:00 Christoph Päper via Unicode <unicode@unicode.org>:

> Leonardo Boiko:
> >
> > It would just be more
> > satisfying for me if the blue books were encoded in the font as U+1F4D8s,
> > rather than U+F02Ds.  Or, if the colors are done at a CSS level, as  
> > U+1F4D5 CLOSED BOOKs or the like.  Same goes for the other icons in FA
> > which *do* have an emoji counterpart (which would be, I suspect, the
> > majority).
>
> This issue has been raised long ago with the developers of such symbol
> fonts:
>
> <https://github.com/FortAwesome/Font-Awesome/issues/222>
>
> The reason why this is not being done is the special treatment of emoji
> characters by vendors who always replace them by their custom images. Since
> such fonts are mostly used on the Web platform, the solution would be a CSS
> property to force `text` rendering of emojis:
>
> <https://github.com/w3c/csswg-drafts/issues/1144>
>
>


Re: First bonafide use (≠ mention) of emoji by an academic publisher?

2017-07-24 Thread Leonardo Boiko via Unicode
I don't have anything against that, in principle.  It would just be more
satisfying for me if the blue books were encoded in the font as U+1F4D8s,
rather than U+F02Ds.  Or, if the colors are done at a CSS level, as  
U+1F4D5 CLOSED BOOKs or the like.  Same goes for the other icons in FA
which *do* have an emoji counterpart (which would be, I suspect, the
majority).

The reasons I'd prefer such an encoding are, to be honest, purely æsthetic;
but they could also be argued on functional terms.  Consider Instagram's
fascinating results when applying word-vector models to emoji, for example (
https://engineering.instagram.com/emojineering-part-1-machine-learning-for-emoji-trendsmachine-learning-for-emoji-trends-7f5f9cb979ad
).  One never knows just *when* someone will want to interchange, convert,
or index characters; even emoji symbols can find valid, unexpected
applications.  Suppose a researcher in the future wants to investigate
early usage of academic emoji in the 21st century.  Or suppose something as
simple as trying to find out which emoji are used most frequently in a
field, country, or time period. Having the icon encoded as U+1F4D5 rather
than U+F02D would help this sort of interoperability, while causing no
problems for anyone (it's, after all, just a matter of choosing which
numbers you give to which icons; calling it #128213 is as easy as calling
it #61485).
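
To make the "just a matter of numbers" point concrete, here is a small Python
sketch using the standard unicodedata module.  It shows what generic software
can learn from each of the two code points; the helper name `describe` is mine,
and the only inputs are the two code points discussed above:

```python
import unicodedata


def describe(cp: int) -> str:
    """Summarise what the Unicode Character Database says about a code point."""
    ch = chr(cp)
    # Private-use characters have no name, so supply a default instead of raising.
    name = unicodedata.name(ch, "<no name>")
    return f"U+{cp:04X} (#{cp}) category={unicodedata.category(ch)} name={name}"


def is_private_use(cp: int) -> bool:
    """True when the code point is in a private-use area (category Co)."""
    return unicodedata.category(chr(cp)) == "Co"


if __name__ == "__main__":
    for cp in (0x1F4D5, 0xF02D):
        print(describe(cp), "private-use" if is_private_use(cp) else "standard")
```

An indexer or a researcher counting emoji can recognise the first as a book
symbol; the second carries no meaning outside the one font that assigns it a
glyph.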



2017-07-24 1:45 GMT+02:00 Doug Ewell via Unicode <unicode@unicode.org>:

> Leonardo Boiko wrote:
>
>> To my boundless, heartbreaking disappointment, these emojis are not
>> U+1F4D8 BLUE BOOKs  from a custom @css font, but rather private-use
>> U+F02Ds, which index a book glyph in some icon pack called Font
>> Awesome <https://en.wikipedia.org/wiki/Font_Awesome>. At least they're
>> inserted via CSS :before-selectors, which means they'll be
>> automatically treated as decorations and seamlessly excluded from
>> copy-paste operations.
>>
>
> We use Font Awesome for my project at work, for symbols embedded in text
> which have no reason and no need to be interchanged, converted to other
> character sets, or indexed in search engines.
>
> Font Awesome also includes some symbols that, we think, won't ever be
> Unicode emoji, such as the Android, Apple, Bluetooth, and Windows logos.
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>


Re: Curly Lips Code Point Proposal

2017-01-24 Thread Leonardo Boiko
Undoubtedly so.  That's why U+1F481 INFORMATION DESK PERSON is listed
with the keyword "sassy" in the Unicode emoji table (besides "tipping
hand").  Which helps a lot, because the keywords are used by input methods
to search characters; if no one bothered to keep track of how people are
using emoji, then people would try looking for the "sassy" gesture and find
nothing, and they'd have to learn that it's called "information desk
person", even though no one uses it with this meaning.
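
A rough sketch of why the keywords matter to an input method: the lookup below
is a plain inverted index, which is only my assumption about how such a search
might work, not a description of any real input method.  The keyword lists are
the ones I quoted from the Unicode chart in the earlier "Emoji semantic drift"
message:

```python
from collections import defaultdict
from typing import Dict, List, Set

# Keyword lists as quoted from the Unicode emoji chart elsewhere in this archive.
KEYWORDS: Dict[int, List[str]] = {
    0x1F617: ["face", "kiss"],                                     # KISSING FACE
    0x1F481: ["hand", "help", "information", "sassy", "tipping"],  # INFORMATION DESK PERSON
    0x1F601: ["face", "grin"],                                     # GRINNING FACE WITH SMILING EYES
    0x1F624: ["face", "triumph", "won"],                           # FACE WITH LOOK OF TRIUMPH
}


def build_index(keywords: Dict[int, List[str]]) -> Dict[str, Set[int]]:
    """Invert code point -> keywords into keyword -> code points."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for cp, words in keywords.items():
        for word in words:
            index[word.lower()].add(cp)
    return index


def search(index: Dict[str, Set[int]], query: str) -> List[int]:
    """Code points whose keyword list contains the query, in code point order."""
    return sorted(index.get(query.strip().lower(), set()))


if __name__ == "__main__":
    index = build_index(KEYWORDS)
    for query in ("sassy", "angry"):
        print(query, [f"U+{cp:04X}" for cp in search(index, query)])
```

With these lists, "sassy" finds U+1F481, while "angry" finds nothing even
though that is how U+1F624 is commonly read; that gap is exactly what
documenting real usage would close.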

Precisely because languages (and symbolic systems like emoji) are in flux,
it's a good idea to try to document how they're used.


2017-01-25 2:35 GMT-02:00 Fritz Gheen <fgh...@gmail.com>:

> "There are indeed already many emoji misused here and there..."
>
> I'd venture to say most emoji are divorced from their original intent.
> Help Desk Lady is one of the most popular emoji...and I can't recall ever
> seeing someone use it for that reason.  I personally use Rocket emoji
> mostly to mean, "I'm taking-off from home."  And then there's aubergine =)
>
> I'd like to think no emoji is "misused."  People employ emoji outside of
> their original or intended meaning, and that's beautiful: language is
> fluid; it evolves.
>
>
>
>
>
> On Wed, Jan 25, 2017 at 12:39 AM, Andrea Giammarchi <
> andrea.giammar...@gmail.com> wrote:
>
>> I wouldn't stereotype "this community" already, as it's a single person
>> request and maybe a single person common use case.
>>
>> However, I have seen mostly on Twitter the usage of :3 to indicate
>> "engagement" in the sense of "interest", or "I'm digging it" but if there's
>> a meaning widely recognised already internationally, I guess there's no
>> point in using the proposed name, yet there's no code point to represent :3
>>
>> isn't it?
>>
>> Whatever it means, do we have a code point for it already?
>>
>> If we do, maybe that'd be already enough.
>>
>> There are indeed already many emoji misused here and there due different
>> visual meaning in different cultures (the triumph face, as example, the one
>> with steam from nose which is used as "furious face" in some culture)
>>
>> If there's no code point, being apparently this popular, should Unicode
>> consider one?
>>
>> Regards
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Jan 24, 2017 at 5:12 PM, Leonardo Boiko <leobo...@namakajiri.net>
>> wrote:
>>
>>> I find it curious that this community defines the ":3" emoji as ""
>>> or "om nom nom".  In my circles it's quite the frequent emoticon/emoji, but
>>> I've never seen it used this way.  Instead, they usually employ it as "cat
>>> mouth" or "cat face", implying the mood of cuteness, perkiness or
>>> mischievousness. (This is distinct from U+1F431 CAT FACE in that it
>>> represents a human making a cat-like mouth, not an actual cat.) Here are a
>>> few images found through a web search for "cat face":
>>>
>>> [inline images not preserved in the archive]
>>>
>>> Here's the relevant TVTropes article:
>>> http://tvtropes.org/pmwiki/pmwiki.php/Main/CatSmile
>>>
>>> (TVTropes, incidentally, is one of the many web forums which has a :3
>>> textual emoji.)
>>>
>>> And the KnowYourMeme page:
>>> http://knowyourmeme.com/memes/3-cat-face
>>>
>>>
>>>
>>>
>>>
>>>
>>> 2017-01-24 14:39 GMT-02:00 Andrea Giammarchi <
>>> andrea.giammar...@gmail.com>:
>>>
>>>> I'd like to bring to your attention a request, about a common emoticon,
>>>> that has apparently no equivalent yet in the Emoji standard.
>>>>
>>>> This was a PR to the Twemoji project:
>>>> https://github.com/twitter/twemoji/issues/199
>>>>
>>>> The author also created a proper PDF explaining all the reasons:
>>>> Proposal for CURLY LIPS Emoji.pdf
>>>> <https://github.com/twitter/twemoji/files/727077/Proposal.for.CURLY.LIPS.Emoji.pdf>
>>>>
>>>> I hope this can be considered in the near future as possible extra face.
>>>>
>>>> Thanks in advance for any sort of outcome.
>>>>
>>>> Best Regards
>>>>
>>>
>>>
>>
>


Re: Curly Lips Code Point Proposal

2017-01-24 Thread Leonardo Boiko
I find it curious that this community defines the ":3" emoji as "" or
"om nom nom".  In my circles it's quite the frequent emoticon/emoji, but
I've never seen it used this way.  Instead, they usually employ it as "cat
mouth" or "cat face", implying the mood of cuteness, perkiness or
mischievousness. (This is distinct from U+1F431 CAT FACE in that it
represents a human making a cat-like mouth, not an actual cat.) Here are a
few images found through a web search for "cat face":

[inline images not preserved in the archive]

Here's the relevant TVTropes article:
http://tvtropes.org/pmwiki/pmwiki.php/Main/CatSmile

(TVTropes, incidentally, is one of the many web forums which has a :3
textual emoji.)

And the KnowYourMeme page:
http://knowyourmeme.com/memes/3-cat-face






2017-01-24 14:39 GMT-02:00 Andrea Giammarchi:

> I'd like to bring to your attention a request, about a common emoticon,
> that has apparently no equivalent yet in the Emoji standard.
>
> This was a PR to the Twemoji project:
> https://github.com/twitter/twemoji/issues/199
>
> The author also created a proper PDF explaining all the reasons:
> Proposal for CURLY LIPS Emoji.pdf
> <https://github.com/twitter/twemoji/files/727077/Proposal.for.CURLY.LIPS.Emoji.pdf>
>
> I hope this can be considered in the near future as possible extra face.
>
> Thanks in advance for any sort of outcome.
>
> Best Regards
>


Re: On the upcoming LATIN LETTER SMALL CAPITAL Q

2016-12-26 Thread Leonardo Boiko
2016-12-26 13:45 GMT-02:00 Yifán Wáng <747.neut...@gmail.com>:

> You may be under impression that the letter has something to do with
> morphology, but my argument is that the original "Letter for
> representation of morpheme in Japanese" is a misnomer and this letter
> is totally unrelated to morphological context.
>

I agree, and already said I agreed in my first email.  I know how Japanese
is represented in IPA and how /Q/ and /N/ are used.  My point is that I
don't think phonologists' small-caps Q has *more* justification to be in
Unicode than morphologists’ small-caps everything.


> For example, when you write ᴀᴅᴠ (all small
> capital), the letters still stand for ordinary A, D and V, for this is
> obviously the abbreviation of "adverb".  It's more like the whole
> sequence ADV made shrunken in "small caps" mode or style, which is a
> parallel operation to italicization or boldification.


Which is parallel to how bold and italics are used in mathematics, which
was the argument for getting them into Unicode, as I also pointed out earlier.


Re: On the upcoming LATIN LETTER SMALL CAPITAL Q

2016-12-26 Thread Leonardo Boiko
I meant that morphological glosses (such as the Leipzig standard) style
tags in small-caps. Like this:

yukkuri-ni yom-i-mas-i-ta
carefully-ADV read-CON-POL-CON-PRF

These are traditionally set in small-caps, not capitals. If the
phonologists are getting small-caps into plain text, why not the
morphologists? If the only argument for Q is that there is an /ʀ/, why not
the full set, and then you can write any morphological tag? The chance of
confusing "CON" with a word is greater than that of /Q/ or [Q], if anything.
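
To show how close plain text already is to "the full set", here is a sketch
that converts the capitals of a gloss tag to the existing LATIN LETTER SMALL
CAPITAL characters, looked up by name through Python's unicodedata module.
The function names are mine, and which letters are found depends on the
Unicode version your Python ships with; this is an illustration of the
argument, not a recommendation to encode glosses this way:

```python
import unicodedata
from typing import Optional


def small_capital(letter: str) -> Optional[str]:
    """The LATIN LETTER SMALL CAPITAL form of an ASCII letter, if one is encoded."""
    try:
        return unicodedata.lookup(f"LATIN LETTER SMALL CAPITAL {letter.upper()}")
    except KeyError:
        return None


def to_small_caps(text: str) -> str:
    """Replace ASCII capitals with small capitals where available; keep the rest."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isupper():
            out.append(small_capital(ch) or ch)
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(to_small_caps("carefully-ADV read-CON-POL-CON-PRF"))
    missing = [c for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ" if small_capital(c) is None]
    print("no small capital encoded for:", missing)
```

Any letter reported as missing stays a regular capital, which is the kind of
patchwork that styling in rich text avoids.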

2016/12/26 3:28 "Yifán Wáng" <747.neut...@gmail.com>:

> Agreed with Yifán Wáng... But I wonder about the need for the character in
> the first place. Are we going to add a full small-caps set, too, given its
> use in morphological glosses? Isn't it enough to use a regular 'Q' in
> plain-text, and style to small caps in rich text?

No, it's not in "morphological glosses" but phonological notations
such as /yuQkuri/. In morphological discussions, phonological details
are usually ignored and they just write down the surface forms.

> I can see the rationale for mathematical bold, given that a regular-weight
> plain-text character would stand for a different thing in mathematical
> formulæ. But there's no way a capital Q would ever be confused as anything
> other than the phoneme, in a Japanese phonological transcription.

I don't think Q is, but it should be in unison with its fellows /ɴ/,
/ʀ/, /ʜ/ etc. Some books make all of them capitals, but others all
small capitals.
Making into small capitals avoids possible confusions with variables
like /C/ or /V/.

2016-12-26 5:03 GMT+09:00 Leonardo Boiko <leobo...@gmail.com>:
> Agreed with Yifán Wáng... But I wonder about the need for the character in
> the first place. Are we going to add a full small-caps set, too, given its
> use in morphological glosses? Isn't it enough to use a regular 'Q' in
> plain-text, and style to small caps in rich text?
>
> I can see the rationale for mathematical bold, given that a regular-weight
> plain-text character would stand for a different thing in mathematical
> formulæ. But there's no way a capital Q would ever be confused as anything
> other than the phoneme, in a Japanese phonological transcription.
>
> 2016/12/25 17:56 "Yifán Wáng" <747.neut...@gmail.com>:
>
> Please excuse my serial posting.
>
> I recently noticed the subhead given to the LATIN LETTER SMALL CAPITAL
> Q in the following document (at A7AF) is "Letter for representation of
> morpheme in Japanese".
> http://www.unicode.org/L2/L2016/16381-n4778r-pdam1-2-charts.pdf
>
> However, to my knowledge, the letter is required for describing a
> "phoneme" of Japanese that isn't tied to specific "morphemes" (~
> "words"). I have contacted the original writer of the proposal:
> http://www.unicode.org/L2/L2015/15241-small-cap-q.pdf
> and he agrees with me in this regard.
>
> Thus I suppose "Letter for Japanese phonology" would be more desired a
> heading for this character, though subheads are not normative. What
> are your thoughts?
>
>


Re: On the upcoming LATIN LETTER SMALL CAPITAL Q

2016-12-25 Thread Leonardo Boiko
Agreed with Yifán Wáng... But I wonder about the need for the character in
the first place. Are we going to add a full small-caps set, too, given its
use in morphological glosses? Isn't it enough to use a regular 'Q' in
plain-text, and style to small caps in rich text?

I can see the rationale for mathematical bold, given that a regular-weight
plain-text character would stand for a different thing in mathematical
formulæ. But there's no way a capital Q would ever be confused as anything
other than the phoneme, in a Japanese phonological transcription.

2016/12/25 17:56 "Yifán Wáng" <747.neut...@gmail.com>:

Please excuse my serial posting.

I recently noticed the subhead given to the LATIN LETTER SMALL CAPITAL
Q in the following document (at A7AF) is "Letter for representation of
morpheme in Japanese".
http://www.unicode.org/L2/L2016/16381-n4778r-pdam1-2-charts.pdf

However, to my knowledge, the letter is required for describing a
"phoneme" of Japanese that isn't tied to specific "morphemes" (~
"words"). I have contacted the original writer of the proposal:
http://www.unicode.org/L2/L2015/15241-small-cap-q.pdf
and he agrees with me in this regard.

Thus I suppose "Letter for Japanese phonology" would be more desired a
heading for this character, though subheads are not normative. What
are your thoughts?


Re: Manatee emoji?

2016-11-23 Thread Leonardo Boiko
I support the creation of manatee emoji, but only if it’s accompanied
by a new modifier for emoji size, coming in the varieties: TINY,
SMALL, LARGE, HUGE.

This would allow us to say "oh, the [HUGE MANATEE]" in emoji.

2016-11-23 13:15 GMT-02:00 James Kass:
> http://patch.com/florida/southtampa/petition-drive-aims-raise-manatee-awareness-adorable-way
>
> If enough people sign the petition, will Unicode add a manatee emoji?
> And, how about wolverines and lemmings?  Are any petitions underway
> for them?  How many signatures on a petition would be needed before
> Unicode would consider adding a non-existent character to the
> repertoire?
>
> Best regards,
>
> James Kass



Re: Emoji end goal

2016-10-12 Thread Leonardo Boiko
Yes, the end goal of the Unicode Consortium is media attention by way of
virtue signaling. For every online article about emoji modifiers, each
individual member of the Consortium earns a fifty-Euro bonus from our
masters, the global feminist cultural-Marxist Jewish conspiracy, for our
support in propagating political correctness and ultimately implementing
the UN's One World Government. In fact, the end goal for emoji (as originally
planned by Gramsci and Adorno in UAX #1922) is to be the mandatory
Newspeak-style writing system of the NWO, so as to brainwash citizens away
from scientific truths like race realism or the sociobiology of gender. As
soon as WOMAN + ZWJ + President Hillary finishes assassinating the last
remaining ASCII reactionaries, full emoji deployment will be in order, and
we'll indoctrinate every child to internalize standard Communist dogma such
as "all ethnicities deserve equal representation in media" or "all
combinations of genders and professions should be considered equally
valid". The lead experiments at Tumblr and Instagram were very successful,
proving that emoji have great potential as tools of indoctrination.

2016/10/12 10:02 "zelpa":

> So what exactly is the end goal for emoji? First we had the fitzpatrick
> skin modifiers, now there's the proposal for gendered emoji sequences using
> ZWJ. There was even the proposal for the hair colour modifier in TR 53. So
> what is the true end goal? Will we one day be able to display our Fallout 4
> character with a single emoji and 60 modifiers? And honestly, who is asking
> for these additions? Does anybody WANT a hair colour modifier? Seems to me
> like the consortium might just be pandering to a few silly requests (by
> people who have no actual idea what unicode is) to get media attention.
>


Re: Noto unified font

2016-10-08 Thread Leonardo Boiko
That's not "his" definition of non-free.  Restrictions on selling copies
commercially make a work non-free under the Free Software Foundation's
definition:
https://www.gnu.org/philosophy/free-sw.html
https://www.gnu.org/licenses/license-list.html#NonFreeSoftwareLicenses

And also under the Open Source Initiative's definition:
https://opensource.org/osd-annotated
https://opensource.org/faq#commercial

And also under the Debian project's guidelines:
https://www.debian.org/social_contract#guidelines

In short, every single major free software organization requires free
software to allow the user complete freedom of redistribution, commercial
or otherwise.  Otherwise the software isn't free in the sense of giving the
user freedom; it is merely free of charge.


2016-10-08 21:16 GMT-03:00 Shriramana Sharma:

> That's your definition of non-free then... If I were a font developer and
> of mind to release my font for use without charge, I wouldn't want anyone
> else to make money out of selling it when I myself - who put the effort
> into preparing it - don't make money from selling it. So it protects the
> moral rights of the developer.
>


Re: What happened to Unicode CLDR's site?

2016-10-04 Thread Leonardo Boiko
The Google error message felt a bit too harsh for a webhosting client who
merely exceeded their allotted bandwidth.  It made it sound like the
website was hosting something illegal.

2016-10-04 13:00 GMT-03:00 Philippe Verdy:

> It looks like an automated bot run by Google detected excessive use of
> bandwidth and launched the block, pending another subscription or payment,
> even if the site was (possibly) donated by Google itself. That bot probably
> does not know what it is doing and treats this like any other hosted site.
> (Google's own usage policy is probably more strictly enforced now: you can
> host free websites, but above some threshold they will be blocked.)
>
> Note also that it is the web hosting that is blocked, not the domain
> name (hosted by Apple, who probably offered it to the Consortium).
>
> There has probably been a lack of communication somewhere in Google, or an
> administrator error that removed an exception for a site that should
> first have been handled specially, internally, by a human chain of command.
>
> If the usage limit was exhausted, maybe this is because the site was
> harvested by some malware, and I think it's reasonable to block it first
> before scanning, cleaning, restoring damaged parts from a safe backup, and
> investigating which protection measures were missing or should be taken.
>
> There are certainly people looking into what precisely happened. I hope this
> is just an administrative measure that can be easily reversed and that no
> damage happened to CLDR data (or to private data there about CLDR surveyors
> or user authentication databases). I don't think there's damage to the
> released CLDR data, but there could be losses in some recent ongoing work.
>
> 2016-10-04 15:53 GMT+02:00 Steven R. Loomis :
>
>> Yes, the web content is hosted by google sites, a web hosting provider.
>>
>> As to it being down, I understand this is being looked into.
>>
>> Sent from our iPhone.
>>
>> On Oct. 4, 2016, at 5:51 AM, Cristian Secară wrote:
>>
>> On Tue, 4 Oct 2016 19:50:05 +0800, gfb hjjhjh wrote:
>>
>> Why is the site suspended by Google and how to access it now?
>>
>>
>> Just curious: Unicode = Google ? (physically)
>>
>> I am asking this because by entering directly http://cldr.unicode.org
>> the error result belongs to Google and not to unicode.org.
>>
>> ?
>>
>> Cristi
>>
>> --
>> Cristian Secară
>> http://www.secărică.ro 
>>
>>
>


Re: Why incomplete subscript/superscript alphabet ?

2016-10-03 Thread Leonardo Boiko
2016-10-03 14:51 GMT-03:00 Jukka K. Korpela:

> They are not control or formatting characters. They are markup used at
> higher protocol levels – in different markup systems
>
>
That's exactly the point, yes.


Re: Why incomplete subscript/superscript alphabet ?

2016-10-03 Thread Leonardo Boiko
Besides, there are already control/formatting characters for such purposes
– several ones, even.  They look like this: , ^{}, \textsuperscript{},
\*{ \*} …

What's more, these powerful control/formatting characters allow one to
apply not only super/subscript and blackletter, but many more features to
any character as long as the font supports them, including bold, italics,
small-caps, optical size changes and countless others.  I heartily
recommend using these special control/formatting characters, as they can
considerably *enrich* any text.

2016-10-03 14:14 GMT-03:00 Doug Ewell:

> a.lukyanov wrote:
>
> > I think that the right thing to do would be to create several new
> > control/formatting characters, like this:
> >
> > "previous character is superscript"
> > "previous character is subscript"
> > "previous character is small caps (for use in phonetic transcription
> > only)"
> > "previous character is mathematical blackletter"
> > etc
> >
> > Then people will be able to apply this features on any character as
> > long as their font supports it.
>
> I happen to think this would be exactly the wrong thing to do,
> completely contrary to the principles of plain text that Unicode was
> founded upon. But you never know what might gain traction, so stay
> tuned.
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>


Re: Why incomplete subscript/superscript alphabet ?

2016-09-30 Thread Leonardo Boiko
The Unicode codepoints are not intended as a place to store typographically
variant glyphs (much like the Unicode "italic" characters aren't designed
as a way of encoding italic faces). The correct thing here is that the
markup and the font-rendering systems *should* automatically work together
to choose the proper face—as they already do with italics or optical sizes,
and as they should do with true small-caps etc.

I agree that our current systems are typographically atrocious and an
abomination before the God of good taste, and I don't blame anyone for
resorting to Unicode tricks to work around that. But that's a crummy
stopgap at best, and legitimizing it would be counterproductive in the long
run—not to mention ethnocentric (unless you want Unicode sub- and
superscript codepoints for every single existing character ever, including
the full Han set).
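
One way to see that the existing super- and subscript code points are a
compatibility stopgap rather than a formatting mechanism: the Unicode Character
Database records them as styled forms of ordinary characters, and NFKC
normalization folds them back.  A small sketch with Python's standard
unicodedata module (the helper names are mine):

```python
import unicodedata


def compatibility_tag(ch: str) -> str:
    """The decomposition entry for a character, e.g. '<super> 0032', or ''."""
    return unicodedata.decomposition(ch)


def fold_styling(text: str) -> str:
    """Apply NFKC, which replaces compatibility characters with their base forms."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    for ch in ("\u00b2", "\u2099", "2"):  # superscript two, subscript n, plain 2
        print(f"U+{ord(ch):04X}", repr(compatibility_tag(ch)))
    print(fold_styling("m\u00b2"))
```

Any process that normalizes text this way silently turns m² into m2, so the
distinction only survives where the markup or the font feature carries it.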

Rather, let's bug the authors of font rendering systems, user interface
libraries, text editors, web browsers etc. for halfway decent typography.

2016/09/30 12:56 "Jukka K. Korpela":

> 30.9.2016, 18:19, Philippe Verdy wrote:
>
>> Note also that many tools generating documentation from source code
>> allow you to insert HTML comments, so you could as well use ,
>>
>
> Yes, but there’s a serious typographic pitfall with this, as well as with
> using e.g. subscript or superscript formatting in a word processor. The
> problem is that the rendering is almost always simplistic: letters (or
> other characters) of the current font are used in reduced size and in
> lowered or raised position. The result is that the glyphs have reduced
> stroke width too, and the position change very often causes line spacing to
> be uneven.
>
> The typographically correct implementation of such formatting or markup
> would use subscript or superscript glyphs from the font, designed by the
> font creator to match the style of the font. This is more difficult than
> the simplistic approach, and of course it is possible only when using a
> font that contains such glyphs.
>
> Using HTML, for example, the way to achieve that at present would be to
> use markup like ... (to avoid the problems caused
> by the default formatting of  and ) and to use a CSS style sheet
> that sets font-family suitably and uses OpenType font feature settings to
> select subscript or superscript glyphs. In practice, you would need to use
> @font-face to embed a suitable OpenType font. So it’s doable, but not
> trivial like just slapping  and  around some text.
>
> A practical conclusion is that if you need only e.g. 2 and 3 as
> superscripts (a rather general situation in general texts, where you just
> need m² or m³), it is much simpler to use the relevant Unicode superscript
> characters (instead of e.g. m2). This means using
> typographer-designed superscript glyphs in a simple and reliable way.
>
> Yucca
>
>
>


Emoji semantic drift

2016-09-02 Thread Leonardo Boiko
This isn't news, but I find it interesting how some emoji are being used in
ways that differ from their Unicode names, reflecting alternative
interpretations of common glyphs. I'll compare data from the Unicode chart
with interpretations taken from Emojipedia, which I think do reflect
real-world usage:

U+1F617 KISSING FACE 
Keywords: face | kiss
→ whistle (= nonchalance ; happiness)
http://emojipedia.org/kissing-face/

U+1F481 INFORMATION DESK PERSON 
≊ person tipping hand
Keywords: hand | help | information | sassy | tipping
→ sassy ; hair flick
http://emojipedia.org/information-desk-person/

U+1F601 GRINNING FACE WITH SMILING EYES 
Keywords: face | grin
→ grimace (discomfort, pain)
http://emojipedia.org/grinning-face-with-smiling-eyes/

U+1F624 FACE WITH LOOK OF TRIUMPH 
≊ face with steam from nose
Keywords: face | triumph | won
→ angry; frustration; contemptuous
http://emojipedia.org/face-with-look-of-triumph/

I see that *some* of those alternative readings are registered in the
Unicode table under ≊, while others are present in the keywords, and still
others are absent.  Are there any criteria for that? Is someone trying to keep
track of how emoji are actually used?

I think distributional methods are promising, as shown by Thomas Dimson:
http://instagram-engineering.tumblr.com/post/117889701472/emojineering-part-1-machine-learning-for-emoji

By this we find that, for example, U+1F46F WOMAN WITH BUNNY EARS, also
marked as *≊ people partying*, has additional connotations of sisterhood,
specifically female friendship and loyalty (#sistasista,
#sistersforlife, #sistersister, #bestiesforlife, yearsoffriendship,
#sisterfromanothermister, #morelikesisters, #bffl, #bestiesfortheresties,
#bestfriendsforever). U+1F647 PERSON BOWING DEEPLY is seeing use as a
marker of worry or shame ("late night thoughts", "deleting later", "in my
feelings", "laughing but very serious"), probably due to the emotion lines
drawn on most fontsets.



Re: I'm excited about the proposal to add a brontosaurus emoji codepoint

2016-08-29 Thread Leonardo Boiko
We obviously need an emoji for every species name listed within The
Official Registry of Zoological Nomenclature.

I propose a new set of Basic Latin characters, the Zoological Nomenclature
Indicator Symbols, to be used for spelling scientific names, which are then
rendered as cutesy colorful icons used as mood indicators.  A Zoological
Nomenclature Indicator Symbol Space must be included to separate name
components; sequences including one such separator are assumed to be
binomens, and two, trinomens.  For example, a cat emoji can be encoded with
the Zoological Nomenclature Indicator Symbols corresponding to
[FELIS␣CATUS] or, following modern practice, [FELIS␣SILVESTRIS␣CATUS]
(biological homonyms are to be treated as alternative encodings of the same
abstract emoji).

Notice that the current emoji set includes such characters as CRYING CAT
FACE (U+1F63F) and KISSING CAT FACE WITH CLOSED EYES (U+1F63D), in
addition to the default human (or, in a certain vendor, disgusting yellow
amoebæ) faces; but no such equivalents for, say, dogs or bunnies, which can
be a very dangerous political slight towards dog-people and bunny-people.
With some adjustment, Zoological Nomenclature Indicator Symbols can solve
the issue once and for all, with perfect neutrality.  All of the current face
expression emoji are to be decomposed as FACE plus abstract combining
characters; for example, U+1F642 SLIGHTLY SMILING FACE will be considered a
compatibility variant of FACE + COMBINING SMILE + COMBINING SLIGHT FACIAL
EXPRESSION.  This would allow a dog version of U+1F63D encoded as:
[CANIS␣LUPUS␣FAMILIARIS] + COMBINING FACE + COMBINING KISSING FACIAL
EXPRESSION + COMBINING CLOSED EYES, and similarly for any species and
expression combination, like, say, a ring-tailed lemur rolling on the floor
laughing, or an okapi with tears of joy.  (Drawing all possible glyphs is
of course not Unicode's problem.)


2016-08-29 16:22 GMT-03:00 Leo Broukhis :

> It's new. Let's not tell Randall about the "completing the set" argument.
>
> Leo
>
> On Mon, Aug 29, 2016 at 12:08 PM, Karl Williamson wrote:
>
>> "I'm excited about the proposal to add a brontosaurus emoji codepoint
>> because it has the potential to bring together a half-dozen different
>> groups of pedantic people together"
>>
>> From http://xkcd.com/1726/
>>
>> I don't know if this is new, or I just never saw it before.
>>
>>
>


Re: Whitespace characters in Unicode

2016-08-04 Thread Leonardo Boiko
I'm sorry; I thought that, when you wanted to separate identifiers, it
might be interesting to follow existing regexps definitions; this way your
syntax would play along with already-existing tools (e.g. you'd be making
it easy for someone to pipe your language into grep -P "\p{Whitespace}").

But I was talking out of my depth; I've never worked with defining Unicode
identifiers, so I'm not really qualified to answer.  I'm sure Davis and the
others can give better answers to your questions.  Meanwhile, I see that
UAX #31 goes further into Unicode identifiers. It says that
Pattern_White_Space is stable (unlike Whitespace, perhaps?), and intended
for use in regexp-like "patterns" which mix literal characters, whitespace,
and syntax (special characters), where the latter two would e.g. require
quoting.  For example, Perl has a "/x" flag which makes unquoted
Pattern_White_Space characters be ignored in regexpes (so that you can make
them less illegible).
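
For reference, a small sketch comparing the two properties. The code point lists are typed in by hand from my reading of PropList.txt, so please check them against the current UCD before relying on them; `expand` and `fmt` are helper names invented here.

```python
def expand(*ranges):
    """Expand inclusive (first, last) code point ranges into a set."""
    return {cp for first, last in ranges for cp in range(first, last + 1)}

# White_Space, hand-copied from PropList.txt (assumption: no later changes).
WHITE_SPACE = expand(
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085), (0x00A0, 0x00A0),
    (0x1680, 0x1680), (0x2000, 0x200A), (0x2028, 0x2029), (0x202F, 0x202F),
    (0x205F, 0x205F), (0x3000, 0x3000),
)

# Pattern_White_Space, hand-copied from the same file.
PATTERN_WHITE_SPACE = expand(
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085),
    (0x200E, 0x200F), (0x2028, 0x2029),
)

def fmt(code_points):
    """Format a set of code points as sorted U+XXXX labels."""
    return " ".join(f"U+{cp:04X}" for cp in sorted(code_points))

print("in both:                 ", fmt(WHITE_SPACE & PATTERN_WHITE_SPACE))
print("White_Space only:        ", fmt(WHITE_SPACE - PATTERN_WHITE_SPACE))
print("Pattern_White_Space only:", fmt(PATTERN_WHITE_SPACE - WHITE_SPACE))
```

If the lists are right, the only Pattern_White_Space characters outside White_Space are the two directional marks, while the typographic spaces (U+00A0, U+2000-200A and so on) are White_Space only.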

However, UAX #31 also gives a Default Identifier Syntax, which bounds
identifiers not by Whitespace but by their start characters, identified by
ID_Start, defined like this:

> ID_Start characters are derived from the Unicode General_Category of
uppercase letters, lowercase letters, titlecase letters, modifier letters,
other letters, letter numbers, plus Other_ID_Start, minus Pattern_Syntax
and Pattern_White_Space code points.

So it makes reference only to Pattern_White_Space and not Whitespace.  On
the other hand, I guess the listing above will exclude Whitespace
characters, since they don't count as any of letters, numbers, or
Other_ID_Start?
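
One way to check that guess, as a rough sketch: look up the General_Category of every White_Space character and see whether any falls in the categories that the ID_Start definition starts from. This ignores Other_ID_Start and the Pattern_* subtraction, the White_Space list is hand-copied, and the result depends on the Unicode version bundled with your Python; `could_start_identifier` is a name made up here.

```python
import unicodedata

# General_Category values named in the ID_Start definition quoted above.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

# White_Space code points, hand-copied from PropList.txt (assumption).
WHITE_SPACE = (
    list(range(0x0009, 0x000E))
    + [0x0020, 0x0085, 0x00A0, 0x1680]
    + list(range(0x2000, 0x200B))
    + [0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
)

def could_start_identifier(cp):
    """True if General_Category alone would put this code point in ID_Start."""
    return unicodedata.category(chr(cp)) in ID_START_CATEGORIES

# Any White_Space character that a category-based ID_Start would accept.
offenders = [f"U+{cp:04X}" for cp in WHITE_SPACE if could_start_identifier(cp)]
print(offenders)

# The categories actually found among White_Space characters.
print(sorted({unicodedata.category(chr(cp)) for cp in WHITE_SPACE}))
```

The whitespace characters should all come out as control or separator categories, so the offenders list should be empty.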

None of that is guaranteed to be stable, though.  UAX #31 includes a
separate definition for "Immutable identifiers", which are, and suggests
various compromises between them.


2016-08-04 17:44 GMT-03:00 Sean Leonard <lists+unic...@seantek.com>:

> I read through TR18...it mainly says that  == \s == \p{Whitespace}
> == property White_Space is true. Does it say anything else or more
> significant than that, that I'm missing?
>
> Sean
>
>
> On 8/4/2016 1:17 PM, Leonardo Boiko wrote:
>
> What Mark Davis said; also, depending on what you need, consider taking a
> look at the definitions used by Unicode regexpes, at
> http://unicode.org/reports/tr18/ .
>
> 2016-08-04 16:37 GMT-03:00 Sean Leonard <lists+unic...@seantek.com>:
>
>> Hi Unicode Folks:
>>
>> I am trying to come up with a sensible sets of characters that are
>> considered whitespace or newlines in Unicode, and to understand the
>> relative stability policy with respect to them. (This is for a formal
>> syntax where the definition of "whitespace" matters, e.g., to separate
>> identifiers, and I want to be as conservative as possible.) Please let me
>> know if the stuff below is correct, or needs work.
>>
>> The following characters / sequences are considered line breaking
>> characters, per UAX #14 and Section 5.8 of UNICODE:
>>
>> CRLF CR LF FF VT NEL LS PS
>>
>> So, basically: U+000A-U+000D, U+0085, U+2028, U+2029, and the combination
>> U+000D U+000A (treated as one line break). These characters / sequences are
>> called "newlines".
>>
>> There will not be any additional code points that are assigned to be line
>> breaks. (Correct?)
>>
>> CRLF, CR, LF, and NEL are also considered "newline functions" or NLF.
>> These are distinguished from other codes (above) that also mean line
>> breaks, mainly because of historical and widespread use of them.
>>
>> There are several formatting characters that affect word wrapping and
>> line breaking, as discussed in those documents...but they are not line
>> breaking characters.
>>
>> 
>>
>> The following characters are whitespaces: characters (code points) with
>> the property WSpace=Y (or White_Space). This is:
>>
>> newlines
>> U+0020 U+00A0 U+1680 U+2000-200A U+202F U+205F U+3000
>>
>> Assigned characters that are not listed above, can never be whitespace
>> (according to Unicode). However, the set is not closed, so unassigned code
>> points *could* be assigned to whitespace. It is (unlikely? very unlikely?
>> Pretty much never going to happen?) that additional code points will be
>> assigned to whitespace.
>>
>> 
>>
>> There are some other characters that Unicode does not consider
>> whitespace, but deserve discussion:
>> U+180E MONGOLIAN VOWEL SEPARATOR:
>> <https://codeblog.jonskeet.uk/2014/12/01/when-is-an-identifier-not-an-identifier-attack-of-the-mongolian-vowel-separator/>
>> <https://codeblog.jonskeet.uk/2014/12/01/when-is-an-identifier-not-an-identifier-

Re: Whitespace characters in Unicode

2016-08-04 Thread Leonardo Boiko
What Mark Davis said; also, depending on what you need, consider taking a
look at the definitions used by Unicode regexpes, at
http://unicode.org/reports/tr18/ .

2016-08-04 16:37 GMT-03:00 Sean Leonard :

> Hi Unicode Folks:
>
> I am trying to come up with a sensible sets of characters that are
> considered whitespace or newlines in Unicode, and to understand the
> relative stability policy with respect to them. (This is for a formal
> syntax where the definition of "whitespace" matters, e.g., to separate
> identifiers, and I want to be as conservative as possible.) Please let me
> know if the stuff below is correct, or needs work.
>
> The following characters / sequences are considered line breaking
> characters, per UAX #14 and Section 5.8 of UNICODE:
>
> CRLF CR LF FF VT NEL LS PS
>
> So, basically: U+000A-U+000D, U+0085, U+2028, U+2029, and the combination
> U+000D U+000A (treated as one line break). These characters / sequences are
> called "newlines".
>
> There will not be any additional code points that are assigned to be line
> breaks. (Correct?)
>
> CRLF, CR, LF, and NEL are also considered "newline functions" or NLF.
> These are distinguished from other codes (above) that also mean line
> breaks, mainly because of historical and widespread use of them.
>
> There are several formatting characters that affect word wrapping and line
> breaking, as discussed in those documents...but they are not line breaking
> characters.
>
> 
>
> The following characters are whitespaces: characters (code points) with
> the property WSpace=Y (or White_Space). This is:
>
> newlines
> U+0020 U+00A0 U+1680 U+2000-200A U+202F U+205F U+3000
>
> Assigned characters that are not listed above, can never be whitespace
> (according to Unicode). However, the set is not closed, so unassigned code
> points *could* be assigned to whitespace. It is (unlikely? very unlikely?
> Pretty much never going to happen?) that additional code points will be
> assigned to whitespace.
>
> 
>
> There are some other characters that Unicode does not consider whitespace,
> but deserve discussion:
> U+180E MONGOLIAN VOWEL SEPARATOR:
> <https://codeblog.jonskeet.uk/2014/12/01/when-is-an-identifier-not-an-identifier-attack-of-the-mongolian-vowel-separator/>
> 
> U+200B ZERO WIDTH SPACE
> U+200C ZERO WIDTH NON-JOINER
> U+200D ZERO WIDTH JOINER
> U+200E LEFT-TO-RIGHT MARK*
> U+200F RIGHT-TO-LEFT MARK*
> U+2060 WORD JOINER
> U+FEFF ZERO WIDTH NON-BREAKING SPACE
>
> *These appear in Pattern_White_Space, but Pattern_White_Space excludes
> U+2000-200A characters, which are obviously spaces. This is confusing and I
> would appreciate clarification *why* Pattern_White_Space is significantly
> disjoint from White_Space.
>
> 
> The borderline characters above are not considered WSpace=Y, but sometimes
> might have space-like properties. ZWP and ZWNBP are obviously "space"
> characters, but they never generate whitespace. I suppose that conversely
> LTRM and RTLM are obviously "not space" characters, but they could generate
> whitespace under certain circumstances. Ditto for other formatting
> characters in general (for which the class is much larger).
>
> Therefore I guess a Unicode definition of "whitespace" (or "space
> characters") is: an assigned code point that *always* (is supposed to)
> generates white space (empty space between graphemes).
>
> 
>
> Are there other standards that Unicode people recommend, that have
> addressed whether certain borderline characters are considered whitespace
> vs. non-whitespace (e.g., possibly acceptable as an identifier or syntax
> component)?
>
> Regards,
>
> Sean
>


Re: Implementation of ideographic description characters

2016-08-04 Thread Leonardo Boiko
Hi,

the IDS provide too little information for rendering kanji properly. Take a
look into
https://en.m.wikipedia.org/wiki/Chinese_character_description_languages .
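
To illustrate how little an IDS carries, here is a minimal sketch of a parser that turns a sequence into a nested tuple. It only knows the twelve description characters U+2FF0..U+2FFB (later additions are not handled), and `parse_ids` and `IDC_ARITY` are names invented for this example.

```python
# Number of operands for each Ideographic Description Character.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2FF2"] = 3  # LEFT TO MIDDLE AND RIGHT
IDC_ARITY["\u2FF3"] = 3  # ABOVE TO MIDDLE AND BELOW

def parse_ids(text):
    """Parse a prefix-notation IDS into (operator, operand, ...) tuples."""
    def parse(pos):
        if pos >= len(text):
            raise ValueError("truncated IDS")
        ch = text[pos]
        pos += 1
        if ch not in IDC_ARITY:
            return ch, pos  # a plain component
        parts = []
        for _ in range(IDC_ARITY[ch]):
            part, pos = parse(pos)
            parts.append(part)
        return (ch, *parts), pos

    tree, end = parse(0)
    if end != len(text):
        raise ValueError("trailing characters after IDS")
    return tree

print(parse_ids("\u2FF0\u6C35\u53EF"))              # left-to-right: 氵 + 可
print(parse_ids("\u2FF1\u8279\u2FF0\u6C35\u53EF"))  # nested description
```

The tree records only which component sits where (left/right, above/below, enclosing); it says nothing about proportions, stroke adjustment or which variant shape a component takes, which is what a renderer would need.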

> Hello,
> I read that an implementation of Unicode could render a sequence of
> ideographic description characters as the kanji it describes, but is there
> any known system, font or implementation that actually does this?


Re: Re: Adding half-star to Unicode?

2016-06-24 Thread Leonardo Boiko
> My bet is that they'll prefer using whatever code they want, hacking
fonts as necessary to overtake another political symbol when they'll want.


They could liberate a code point from the private use area.


2016-06-24 14:10 GMT-03:00 Philippe Verdy :

> My bet is that they'll prefer using whatever code they want, hacking fonts
> as necessary to overtake another political symbol when they'll want. They
> could do that easily with Webfonts today (by designing a tiny webfont with
> just one glyph mapped to any code point, including some ASCII symbol such
> as the DOLLAR sign). They would even refuse any normalization and would not
> even use the codepoint proposed for them, or by remapping some ASCII-art
> string (the classic emoticons of Usenet; if we even attempt to define
> standard colors, or glyph design, they'll invent another incompatible
> design, will change colors, will rotate it, will change it into an
> exploding star...). However the historic anarchists symbol that was seen on
> walls and painted banners in Europe in the 19th and early 20th century was
> only black.
>
> And it was not really a star, but derived from the A letter in a circle,
> with the horizontal bar frequently replaced by some fire arm, or slanted
> and looking more like a thin arrow head slightly pointing upward (Various
> decorations could be added on top: a striker throwing a Molotov... or
> flowers; a plus sign; a "V" on top to mean "victory"). The strokes were
> most often very irregular, as if they were brushed very rapidly on a wall.
> More polished forms have been used where it is a standard A in an circle
> open at the bottom and a small curved leg. Not all of them want flags with
> colors. Other groups just use a red-filled standard 5-pointed star, over a
> plain black  background.
>
> In London still today, there's most often no star, just a red and black
> flag (color cut on the diagonal). The red side or black side may be
> attached on the hanging stem, but generally a black side is below the right
> side. The red color varies also (green, dark purple, pink, orange,
> white...) but the black color seems to be always there (even if it's
> just the classic circle A, that black may be used to fill the glyph, or the
> background. There's no dedicated support, the symbols may be used
> everywhere, integrated in all sort of graphics, made with various materials.
>
> The flag may be raised in all positions. In Australia, this is a vertical
> rainbow over a black area.
>
> Other symbols of anarchism include a closed hand (fist) raised upward (in
> a sign of protest) with a venom snake. The anarchist movements have always
> been inventive and protecting against all sort of political regimes,
> democratic or not, in fact they protest against all forms of state
> government, and their official symbols.
>
> 2016-06-24 17:55 GMT+02:00 Garth Wallace :
>
>> But would anarchists even want their symbol to be encoded?
>>
>> On Fri, Jun 24, 2016 at 7:04 AM, "Jörg Knappen"  wrote:
>>
>>> Talking about fancy five stars, besides the vertically split ones there
>>> is the "Anarchist star" (a symbol for anarcho-syndicalism)
>>> with a diagonal split in a upper left red half and a lower left black
>>> half. Since there are political and ideological symbols encoded
>>> in Unicode, maybe this one is worth encoding as well (probably twice,
>>> once as a black and white plain symbol and once as a colourful Emoji).
>>>
>>> See here:
>>> https://commons.wikimedia.org/wiki/Category:Anarcho-Syndicalism#/media/File:Anarchist_star.svg
>>>
>>> FIVE POINTED STAR WITH BLACK LOWER RIGHT HALF = anarchist star
>>> ANARCHIST STAR EMOJI
>>>
>>> --Jörg Knappen
>>>
>>> *Gesendet:* Freitag, 24. Juni 2016 um 14:12 Uhr
>>> *Von:* "Frédéric Grosshans" 
>>> *An:* unicode@unicode.org
>>> *Betreff:* Re: Adding half-star to Unicode?
>>> Le 24/06/2016 00:37, Leo Broukhis a écrit :
>>> > For a previous discussion on the topic, please see
>>> > the thread "Missing geometric shapes" around 11/12/12
>>> The thread starts here :
>>> http://www.unicode.org/mail-arch/unicode-ml/y2012-m11/0008.html
>>>
>>> It contains an example of half-filled star used in RTL (Hebrew) context,
>>> in an advertisement in Haaretz here
>>> http://www.unicode.org/mail-arch/unicode-ml/y2012-m11/0024.html
>>>
>>>
>>>
>>
>>
>


Re: non-breaking snakes

2016-05-04 Thread Leonardo Boiko
2016-05-04 4:14 GMT-03:00 Shriramana Sharma :
> Isn't there some Japanese orthography feature that already does
> something like this?

Japanese (and Chinese) vertical calligraphy can do arbitrary-length
stretching of lines (like the Arabic kashida under discussion, and
like most cursive scripts in the world, I guess). Notice e.g. the long
lines here: https://www.instagram.com/seiichirou_uemura/ . The
hiragana letter し、 in particular, often becomes a long vertical line.

However, traditionally this is used for æsthetic rhythm, not for
justification.  In fact, most kinds of Japanese calligraphy prize
variation in line length, not uniformity. And when uniformity is
sought (e.g. certain sutras), they don't use stretched lines, but just
fill a grid with non-cursive, block (kaisho) characters.

I'm not aware of similar features for typography. Because the script
doesn't separate words, justification is comparatively simple–you just
break lines mid-word, mostly wherever (with a few restrictions to
avoid hanging punctuation and so on.)



Re: Joined "ti" coded as "Ɵ" in PDF

2016-03-19 Thread Leonardo Boiko
Yeah, I've stumbled upon this a lot in academic Japanese/Chinese
texts.  I try to copy some Chinese character, only to find out that
it's really a string of random ASCII characters.

Is there only one of those crap PDF pseudo-encodings? If so, I'll use
a converter next time...

2016-03-17 14:57 GMT-03:00 "Jörg Knappen" :
> I inspected the pdf file, and its font encoding is termed "Identity-H". I
> couldn't reveal much about this encoding, but it seems to be a private
> encoding of Adobe used especially for Asian fonts.
>
> --Jörg Knappen
>
> Gesendet: Donnerstag, 17. März 2016 um 17:43 Uhr
> Von: "Don Osborn" 
> An: unicode@unicode.org
> Betreff: Joined "ti" coded as "Ɵ" in PDF
> Odd result when copy/pasting text from a PDF: For some reason "ti" in
> the (English) text of the document at
> http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf
> is coded as "Ɵ". Looking more closely at the original text, it does
> appear that the glyph is a "ti" ligature (which afaik is not coded as
> such in Unicode).
>
> Out of curiosity, did a web search on "internaƟonal" and got over 11k
> hits, apparently all PDFs.
>
> Anyone have any idea what's going on? Am assuming this is not a
> deliberate choice by diverse people creating PDFs and wanting "ti"
> ligatures for stylistic reasons. Note the document linked above is
> current, so this is not (just) an issue with older documents.
>
> Don Osborn



Re: Joined "ti" coded as "Ɵ" in PDF

2016-03-19 Thread Leonardo Boiko
The PDF *displays* correctly.  But try copying the string 'ti' from
the text into another application outside of your PDF viewer, and you'll
see that the thing that *displays* as 'ti' is *coded* as Ɵ, as Don
Osborn said.
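
A quick sketch of what the copied text contains, plus a crude workaround. The blanket replacement is only an assumption that holds for this particular document; it would damage any text that legitimately uses Ɵ, and `repair_ti` is a hypothetical helper, not part of any PDF tool.

```python
import unicodedata

# The character found in the copied text where the page shows "ti".
MISCODED = "\u019F"

def repair_ti(text):
    """Replace the miscoded character with the letters it stands for here."""
    return text.replace(MISCODED, "ti")

print(f"U+{ord(MISCODED):04X}", unicodedata.name(MISCODED))
print(repair_ti("interna\u019Fonal"))
```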


2016-03-17 14:26 GMT-03:00 Pierpaolo Bernardi :
> That document displays correctly for me using both the pdf viewer
> built into chrome and the standalone Acrobat reader v.11.  The problem
> could be in your PDF viewer?  What are you viewing the document with?
>
> On Thu, Mar 17, 2016 at 5:43 PM, Don Osborn  wrote:
>> Odd result when copy/pasting text from a PDF: For some reason "ti" in the
>> (English) text of the document at
>> http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf
>> is coded as "Ɵ". Looking more closely at the original text, it does appear
>> that the glyph is a "ti" ligature (which afaik is not coded as such in
>> Unicode).
>>
>> Out of curiosity, did a web search on "internaƟonal" and got over 11k hits,
>> apparently all PDFs.
>>
>> Anyone have any idea what's going on? Am assuming this is not a deliberate
>> choice by diverse people creating PDFs and wanting "ti" ligatures for
>> stylistic reasons. Note the document linked above is current, so this is not
>> (just) an issue with older documents.
>>
>> Don Osborn
>



Re: Purpose of and rationale behind Go Markers U+2686 to U+2689

2016-03-10 Thread Leonardo Boiko
Isn't it better to use some sort of COMBINING ENCLOSING CIRCLE?
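
Roughly what I have in mind, as a sketch: use the precomposed circled numbers where they exist and fall back to the digits followed by U+20DD COMBINING ENCLOSING CIRCLE. The helper name `circled` is made up, and whether a renderer draws one circle around a multi-digit number is an assumption I would not count on; normally the mark applies to the preceding character only.

```python
COMBINING_ENCLOSING_CIRCLE = "\u20DD"

def circled(n):
    """Return a circled form of n: precomposed for 1..50, combining otherwise."""
    if 1 <= n <= 20:
        return chr(0x2460 + n - 1)   # CIRCLED DIGIT ONE .. CIRCLED NUMBER TWENTY
    if 21 <= n <= 35:
        return chr(0x3251 + n - 21)  # CIRCLED NUMBER TWENTY ONE .. THIRTY FIVE
    if 36 <= n <= 50:
        return chr(0x32B1 + n - 36)  # CIRCLED NUMBER THIRTY SIX .. FIFTY
    return str(n) + COMBINING_ENCLOSING_CIRCLE

print(" ".join(circled(n) for n in (1, 20, 21, 50, 51, 200)))
```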
2016/03/10 8:30 "Andrew West" :

> On 10 March 2016 at 07:00, Martin J. Dürst  wrote:
> >
> > because these numbers can go up to the 200s, it doesn't make sense to
> > register them all as characters (one would need over 500!).
>
> I don't get why that would make no sense.  We already have CIRCLED
> NUMBER 1 through 50, and NEGATIVE CIRCLED NUMBER 1 through 20, and
> these characters are widely used (in East Asian contexts, at least)
> for representing note numbers in text.  In my opinion it would be
> eminently sensible to extend both series up to 999, which would cover
> the needs of Go notation and as well as note numbering for the vast
> majority of users.
>
> Andrew
>
>


Re: Girl, 12, charged for threatening her school with emojis

2016-03-01 Thread Leonardo Boiko
Ah but that is a "majority" by a dictionary/type count.  Due to Zipf's Law,
in language matters we should always distinguish dictionary counts from
actual usage.  E.g. Twitter is very popular in Japan, and I think we'll all
agree that the top used emoji are predominantly modal:
http://emojitracker.com/
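
To spell out the type/token distinction, a toy sketch. The usage numbers are invented purely for illustration (they are not taken from emojitracker or any other source), and the "modal"/"object" labels are my own classification.

```python
from collections import Counter

# Invented toy sample, NOT real data: emoji -> (kind, times used).
SAMPLE = {
    "\U0001F602": ("modal", 50),   # FACE WITH TEARS OF JOY
    "\u2764": ("modal", 30),       # HEAVY BLACK HEART
    "\U0001F360": ("object", 1),   # ROASTED SWEET POTATO
    "\U0001F5FC": ("object", 1),   # TOKYO TOWER
    "\U0001F359": ("object", 1),   # RICE BALL
}

def count_by_kind(sample):
    """Count each kind once per emoji (types) and once per use (tokens)."""
    types, tokens = Counter(), Counter()
    for kind, uses in sample.values():
        types[kind] += 1
        tokens[kind] += uses
    return types, tokens

types, tokens = count_by_kind(SAMPLE)
print("by type: ", dict(types))
print("by token:", dict(tokens))
```

In this made-up sample the object emoji are the majority of the inventory but a small minority of actual uses, which is the pattern a Zipf-like distribution tends to produce.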

Thomas Dimson's great distributional analysis for Instagram gives us
hashtags that are equivalent to emoji; again, I think it's clear that their
primary use is for modality.
http://instagram-engineering.tumblr.com/post/117889701472/emojineering-part-1-machine-learning-for-emoji
.

What's more, a lot of emoji which seem to have no "clear emotional
referent" are appropriated for modal purposes.  For example, this thread's
gun, knife and bomb emoji are graphical depictions of objects, but I think you'll all agree
that the girl was expressing a mood; she wasn't saying "gun, knife, bomb".
I'm told that U+1F481 INFORMATION DESK PERSON was taken to be a "sassy
girl" or "hair flick", and from that it became a modality indicator for
sassiness, sarcasm, fabulousness etc.

(I suspect that another major use of emoji, besides modality, is deictic:
"I'm at Tokyo Tower" + Tokyo Tower emoji, "Merry Christmas" +
Christmas-related emoji.  Emotional mood still seems to me to be clearly
the dominant use.)

2016-02-29 21:25 GMT-03:00 Garth Wallace :

> Some are used to express emotions but many are not: food items,
> animals, landmarks, activities, etc. I think the majority do not have
> clear emotional referents. The original set introduced in Unicode 6.0
> included things like ROASTED SWEET POTATO and TOKYO TOWER.
>
> On Mon, Feb 29, 2016 at 4:04 PM, Philippe Verdy 
> wrote:
> > Today's Japanese emojis are (for most of them) recent inventions; may be
> > there are some earlier tracks in Japanese comics, but you may as well
> find
> > them in comics of America or Europe since the about the 1940's.
> >
> > All these icons were *later* renamed emojis in English and Unicode, but
> > there's a long history of using icons for such emotions Look at the
> little
> > heart drawn near the signature on an handwritten letter or discrete
> > messages, or similar symbols carved by lovers on walls and trees. Or long
> > before as a sign of recognition such as the fish for the first
> Christians in
> > the Roman Empire, or even before in some hieroglyphic inscriptions in
> antic
> > Egyptian, Mayan, and Chinese civilizations since Bronze Age or before.
> >
> > In fact you could also add all the symbols (not necessarily with
> religious
> > meaning) found on graves for expressing that the remaining family of
> friend
> > is missing the defunct.
> > You could also add the similar symbols on jewelry for showing we love
> > someone, or warrior paintings on faces.
> >
> > The modern Japanese Emojis were not the first pictograpic signs to
> express
> > emotions (even if now they have been extended to many other things and
> they
> > are now widespreading the rest of the world with these extensions). Still
> > their main usage remains for emotions ; starting in the 1970's these were
> > ASCII art symbols such as the famous :-)
> >
> >
> >
> > 2016-02-29 23:24 GMT+01:00 Asmus Freytag (t) :
> >>
> >> On 2/29/2016 1:55 PM, Philippe Verdy wrote:
> >>
> >> . Well emojis were initially designed to track emotions and form a sort
> of
> >> new language,
> >>
> >>
> >> E-moji means "picture-character" in Japanese, has nothing to do (at
> first)
> >> with emotions.
> >>
> >> A./
> >
> >
>


Re: Girl, 12, charged for threatening her school with emojis

2016-02-29 Thread Leonardo Boiko
It's a picture-character, sure; but I'd think that, like kaomoji before
them, they've been used since the beginning to express the attitude of the
writer, a kind of "emotion" (in linguistic terms, the "mood" of the
utterance).  For example, consider the ubiquitous ♥ sign, which also
predates cellphone emoji; it's long been used in manga to denote a mood of
flirtatiousness, fondness, cuteness, playfulness and so on. Likewise, the
"veins popping" sign in manga (
http://tvtropes.org/pmwiki/pmwiki.php/Main/CrossPoppingVeins ) may be a
drawing of veins; but it's used quite abstractly to denote an angry mood,
and can even be used among text, in speech balloons.



2016-02-29 19:24 GMT-03:00 Asmus Freytag (t) :

> On 2/29/2016 1:55 PM, Philippe Verdy wrote:
>
> . Well emojis were initially designed to track emotions and form a sort of
> new language,
>
>
> E-moji means "picture-character" in Japanese, has nothing to do (at first)
> with emotions.
>
> A./
>


Re: Hentaigana proposal

2015-12-16 Thread Leonardo Boiko
I like the more descriptive names, but I'd like to have this data available
in some supplementary table anyway, regardless of the naming
scheme.

2015-12-16 16:17 GMT-02:00 Garth Wallace :

> On Wed, Dec 9, 2015 at 7:55 AM, Nicolas Tranter
>  wrote:
> > I comment as a western Japanologist who teaches and researches using
> > hentaigana. I have published with hentaigana using image files
> (resulting in
> > two publisher errors) and will publish next year with hentaigana using
> the
> > Koin Hentaigana font (Koin変体仮名外字明朝.tte), and anticipate typesetting
> > problems. I refer to the 2015 proposal L2/15-239 to include hentaigana,
> > including the appended paper by Takada Tomokazu, Yada Tsutomu and Saito
> > Tatsuya ('The past, present and future of Hentaigana Standardization for
> > Information Interchange'). I also refer to Yada Tsutomu's support of the
> > proposal ('About the inclusion of standardized codepoints for
> Hentaigana',
> > L2/15-318). As the names and numbering of proposed characters is an
> issue I
> > deal with below, I also refer to individual hentaigana in the proposal by
> > their MJ-codes as used in the proposers' own websites (e.g.
> > http://mojikiban.ipa.go.jp/xb164/).
> >
> >
> >
> > SELECTION: The selection is good, consisting of 286 forms, although this
> > would be realised as 299 characters. The earlier 2009 proposal referred
> to
> > was based on the Mojikyo M113.ttf font, which has 213 hentaigana
> characters
> > and includes a few major basic gaps. The Koin Hentaigana font has 549
> > characters, which excluding separate forms with voicing and
> 'half-voicing'
> > diacritics consists of 330 hentaigana, but includes some very rare forms,
> > including ones that do not occur in late period texts.
> >
> >
> >
> > The selection of 'academic' hentaigana is appropriate and lacks major
> gaps.
> > On the other hand, the Ministry of Justice hentaigana requirements are
> ones
> > that have been decided by the Ministry of Justice in 2004 for name
> > registration purposes, and so, although one could argue easily with their
> > 2004 decision (and I would), the fact that they are already official
> means
> > it is pointless to argue with their inclusion in Unicode.
> >
> >
> >
> > It's been noted that a few hentaigana are almost identical to normal
> > hiragana, especially e HENTAIGANA LETTER E VARIANT 4 = MJ090017 (cf. え),
> shi
> > HENTAIGANA LETTER SI VARIANT 2 = MJ090072 (cf. HIRAGANA LETTER SI し) and
> nu
> > HENTAIGANA LETTER NU VARIANT 2 = MJ090149 (cf. HIRAGANA LETTER NU ぬ):
> their
> > differences are solely that the 'brush' is removed from the paper on a
> > downward rather than a rightward flourish, reflecting vertical
> handwriting.
> > Ordinarily I would argue against including them, but since the MoJ has
> > recognised them as official variants they need to be included.
> >
> >
> >
> > The decision to propose in most cases one codepoint for the hentaigana
> > derived from a single Chinese character is sensible, as also is the
> decision
> > to allow multiple codepoints in certain cases where manuscripts use
> > side-by-side significantly distinct forms derived from the same Chinese
> > character and with the same value. An example of the latter is HENTAIGANA
> > LETTER KA VARIANT 3 = MJ090025and KA VARIANT 4 = MJ090026, both
> pronounced
> > ka and both derived from the Chinese character 可, but which are routinely
> > both found in the same manuscript by the same hand as if they were
> separate
> > graphemes from the Heian to the Meiji periods.
> >
> >
> >
> > POLYPHONY. Several hentaigana are truly polyphonous (e.g. the 子-derived
> > hentaigana = ne MJ090151 or MJ090059 ko, or the 馬-derived hentaigana = me
> > MJ090222 or ma MJ090205). In particular, those hentaigana derived from 无
> and
> > associated with n (MJ090298, MJ090299) historically (also the source of
> > HIRAGANA LETTER N ん)  are also used for mu (MJ090214, MJ090215) and mo
> > (MJ090224, MJ090223). Diachronically, n in native Japanese words is
> usually
> > derived from an earlier mu. Takada et al. includes a list of 10 kanji
> > sources that this applies to in the proposed repertoire. (Strictly, this
> > affects 11 hentaigana, because the proposal has two forms for 无-derived
> > characters.) The proposal's solution is to assign different identifiers,
> > e.g. 子 = HENTAIGANA LETTER NE VARIANT 1 and HENTAIGANA LETTER KO VARIANT
> 2,
> > 馬 = HENTAIGANA LETTER ME VARIANT 3 and HENTAIGANA LETTER MA VARIANT 7,
> and
> > the two derived from 无 = HENTAIGANA LETTER N VARIANT 1, N VARIANT 2, MU
> > VARIANT 1, MU VARIANT 2, MO VARIANT 1 and MO VARIANT 2. This means that
> > there would be characters that are given more than one codepoint and
> > identifier but are formally and etymologically identical, adding 13
> > unnecessary repetitions to the character set. I would favour Yada's
> naming
> > system, where the polyphonous characters are given a single codepoint 

Re: Stationary vs. waving flags (was: Re: Adding RAINBOW FLAG to Unicode)

2015-07-06 Thread Leonardo Boiko
2015-07-06 17:11 GMT-03:00 Doug Ewell d...@ewellic.org:
> Is it your belief that users who wish to display an emoji flag care
> whether the flag is shown stationary versus flapping in the wind?

I think a waving white flag is an emoji symbol for
truce/surrender/come in peace, whereas a white rectangle doesn't
easily transmit the same idea.


Re: Bunny hill symbol, used in America for signaling ski pistes for novices

2015-05-28 Thread Leonardo Boiko
You could use U+1F407 RABBIT combined with U+20E4 COMBINING ENCLOSING
UPWARD POINTING TRIANGLE, and pretend the triangle is a hill.  🐇⃤

If only we had a combining rabbit, we could add rabbits to U+1F3D4 SNOW
CAPPED MOUNTAIN.  Or anything else.
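
For anyone who wants to try the sequence, a small sketch that builds it and lists what it contains, using only Python's standard `unicodedata`. `describe` is a helper name made up here, and whether any font actually draws the triangle around the rabbit is not something this code can tell you.

```python
import unicodedata

RABBIT = "\U0001F407"
ENCLOSING_TRIANGLE = "\u20E4"

def describe(sequence):
    """List (label, name, General_Category) for each character in the sequence."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in sequence
    ]

bunny_hill = RABBIT + ENCLOSING_TRIANGLE
for entry in describe(bunny_hill):
    print(entry)
```

The second character has General_Category Me (enclosing mark), so in principle it attaches to the rabbit before it.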


2015-05-28 16:46 GMT-03:00 Philippe Verdy verd...@wanadoo.fr:

 Is there a symbol that can represent the Bunny hill symbol used in North
 America and some other American territories with mountains, to designate
 the ski pistes open to novice skiers (those pistes are signaled with green
 signs in Europe).

 I'm looking for the symbol itself, not the color, or the form of the sign.

 For example blue pistes in Europe are designed with a green circle in
 America, but we have a symbol for the circle; red pistes in Europe are
 signaled by a blue square in America, but we have a symbol for the square;
 black pistes in Europe are signaled by a black diamond in America, but we
 also have such black diamond in Unicode.

 But I can't find an equivalent to the American Bunny hill signal,
 equivalent to green pistes in Europe (this is a problem for webpages
 related to skiing: do we have to embed an image ?).




Re: Bunny hill symbol, used in America for signaling ski pistes for novices

2015-05-28 Thread Leonardo Boiko
Serious question: Has someone discussed a generic combining mechanism? I
mean, characters with an effect like combine the last two.  Say, '!' +
'?' + COMBINING OVERLAY = '‽'.  '!' + '!' + COMBINING SIDE BY SIDE = '‼',
and so on.  Similar in spirit to the Ideographic Description Characters,
but meant to actually tell the rendering system to combine stuff.

2015-05-28 17:25 GMT-03:00 Shervin Afshar shervinafs...@gmail.com:

 Makes sense. But it doesn't seem like we need any new symbols. I think one
 of these should do for hard and extra-hard slopes:


 http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3Aname%3D%2FDIAMOND%2F%3A%5D&g=

 Also, I'm not at all against making use of the actual [image: ]we have.
 I will not hold my breath for a combining rabbit symbol though.

 ↪ Shervin

 On Thu, May 28, 2015 at 1:16 PM, Philippe Verdy verd...@wanadoo.fr
 wrote:

 I said it: there's no symbol in Europe for pistes, just colors. The
 American Bunny hill maps to green pistes in Europe.
 (the European piste colors are used also for drawing their ways on maps,
 not just found in signages).
 Piste signs are typically all the same shape in the same station (most
 often discs) and the text on it (if present) shows the name or number of
 the piste in the station, or just an arrow showing the direction to follow.

 2015-05-28 22:11 GMT+02:00 Shervin Afshar shervinafs...@gmail.com:

 Well...to pick the nit, these shapes are rhombi; known colloquially as
 diamonds.

 So what's the symbol for bunny hill in Europe?

 ↪ Shervin

 On Thu, May 28, 2015 at 1:03 PM, Philippe Verdy verd...@wanadoo.fr
 wrote:

 Well also these symbols, if you want (these are not really diamonds),
 but the wordpress page forgets the bunny hill. It starts only with the
 green circle (in fact a black disc colored in green) which maps to blue
 pistes in Europe.

 2015-05-28 21:59 GMT+02:00 Shervin Afshar shervinafs...@gmail.com:

 Single and double diamond?

 https://bbliss176.files.wordpress.com/2011/02/symbols2_jpg.jpg

 http://1.bp.blogspot.com/_2Rc9ifOGLYg/TO5fF0XNTSI/IxE/RJPvVDD6gLM/s1600/caution-double-black-diamond.jpg

 http://thumbs.dreamstime.com/z/double-black-diamond-sign-legend-ski-slopes-map-40955860.jpg


 ↪ Shervin

 On Thu, May 28, 2015 at 12:46 PM, Philippe Verdy verd...@wanadoo.fr
 wrote:

 Is there a symbol that can represent the Bunny hill symbol used in
 North America and some other American territories with mountains, to
 designate the ski pistes open to novice skiers (those pistes are signaled
 with green signs in Europe).

 I'm looking for the symbol itself, not the color, or the form of the
 sign.

 For example blue pistes in Europe are designed with a green circle in
 America, but we have a symbol for the circle; red pistes in Europe are
 signaled by a blue square in America, but we have a symbol for the 
 square;
 black pistes in Europe are signaled by a black diamond in America, but we
 also have such black diamond in Unicode.

 But I can't find an equivalent to the American Bunny hill signal,
 equivalent to green pistes in Europe (this is a problem for webpages
 related to skiing: do we have to embed an image ?).









Re: (R), (c) and ™

2014-12-18 Thread Leonardo Boiko
For the record, the emoji selection issue is also affecting the Google
Talk/Hangouts web client, where U+2122 (trademark, ™), U+00AE (registered,
®), U+00A9 (copyright, ©), and U+2194 (left right arrow, ↔) seem to be
treated as emoji and displayed in funky blue:

http://namakajiri.net/pics/screenshots/gmail_emouni.png

There are probably more I haven't discovered.
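
As a rough illustration of the mechanism under discussion, the Python sketch below appends a variation selector to each of the four characters mentioned above: U+FE0E asks for text presentation, U+FE0F for emoji presentation. The helper name `with_presentation` is made up for this example, and it is an assumption that a given client honours these sequences for these characters at all.

```python
import unicodedata

VS15 = "\uFE0E"  # VARIATION SELECTOR-15: request text presentation
VS16 = "\uFE0F"  # VARIATION SELECTOR-16: request emoji presentation

def with_presentation(char: str, emoji: bool) -> str:
    """Return char followed by the selector for the requested presentation."""
    return char + (VS16 if emoji else VS15)

# The four characters reported as being rendered as emoji.
for cp in (0x2122, 0x00AE, 0x00A9, 0x2194):
    ch = chr(cp)
    text_form = with_presentation(ch, emoji=False)
    print(f"U+{cp:04X} {unicodedata.name(ch)}: {text_form!r}")
```

Pasting the text-presentation forms into an affected client is a quick way to see whether it respects the selector or colors the character anyway.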



2014-12-18 8:31 GMT-02:00 Andrea Giammarchi andrea.giammar...@gmail.com:

 Hello there,
   I wonder if it's by accident that 00AE, 00A9, and 2122 are not listed as
 standard variant sensitive chars.

 OSX seems to treat them as such, so adding FE0F will force them to be an
 image, but I know there are a few quirks in this behavior and I wonder if
 there should be an exception.

 Thanks for any clarification on this.

 Best Regards

 ___
 Unicode mailing list
 Unicode@unicode.org
 http://unicode.org/mailman/listinfo/unicode




Re: The rapid … erosion of definition ability

2014-11-17 Thread Leonardo Boiko
“Sign” is too general.  The word has no fewer than 12 meanings, and can
refer e.g. to many Unicode characters that are not emojis (the sharp
sign, the less-than sign).[1]

It's useful to have a specialized word referring specifically to the new
pictograms used to color electronic messages with emotional inflection.
Borrowing is a perfectly adequate and natural strategy to get such a word
into a language – as indeed English did with the word “sign”, from Old
French *signe* < Latin *signum*; and as Japanese did with the English
word *emotion*, from which the *emo-* in *emoji*, and with Chinese, from
which *-ji* “written character”.

If borrowing words when they're useful is ridiculous, then all languages
are ridiculous, and when everything is ridiculous nothing is.


[1] http://en.wiktionary.org/wiki/sign



2014-11-17 8:09 GMT-02:00 Andreas Stötzner a...@signographie.de:


 Am 17.11.2014 um 08:35 schrieb Mark Davis ☕️:

 IT’S EASY TO DISMISS EMOJI. They are, at first glance, ridiculous


 The only ridiculous thing is to name them “Emoji” outside Japan.
 They’re just signs and that’s it.


 Regards,
 Andreas Stötzner.





 ___

 Andreas Stötzner  Gestaltung Signographie Fontentwicklung

 Haus des Buches
 Gerichtsweg 28, Raum 434
 04103 Leipzig
 0176-86823396

 http://stoetzner-gestaltung.prosite.com




Re: The rapid … erosion of definition ability

2014-11-17 Thread Leonardo Boiko
2014-11-17 9:08 GMT-02:00 Magnus Bodin ☀ mag...@bodin.org:

 Just to clarify. The transcribed form ji in the japanese emoji word
 絵文字 is probably not from mandarin, since 字 is pronounced zi in mandarin.
 Is it pronounced ji in an other chinese language?


Japanese doesn't usually borrow from Mandarin.  Rather, a large amount of
its vocabulary (about 60%) was borrowed from classical and medieval Chinese
(much like the way that 58% of English words were borrowed from Latin and
French).  These words of Chinese origin are called *kango* in Japanese, and
*ji* is one of them – quite naturally, as the concept of “written character”
itself was acquired from China.

There are three main layers of Chinese loans into Japanese: a stratum they
call *go-on*, which came from Late Old Chinese and Early Middle Chinese
(with a Korean flavor); the *kan-on* stratum, from the Chang'an dialect
of Late Middle Chinese; and a bit of Song/Yuan Late Middle Chinese as
*tōsō-on* [1].

The Japanese word *ji* “character” is from *go-on* Chinese, likely
developing from Old Chinese *tsəʔ/*dzəh [2] or *dzə [3].  字 may also be
pronounced *shi*, which is from the *kan-on* layer.

Notice that the Mandarin sound written as ‹z› (in 字 *zì*) doesn’t denote
the [z] consonant but rather [ts] (Mandarin has no voiced consonants like
[z] or [d]); and also that the Jap. ‹j› isn't English ‹j› but the same
phoneme as a voiced /ti/ → /di/ → [(d)ʑi].  But this similarity isn't
because Japanese borrowed from Mandarin; rather, they're cousins to the
same ancestor.

[1] Miyake, *Old Japanese: A Phonetic Reconstruction*.
[2] Schuessler, *ABC Etymological Dictionary of Old Chinese*.
[3] Baxter-Sagart Old Chinese reconstruction.


Re: The rapid … erosion of definition ability

2014-11-17 Thread Leonardo Boiko
2014-11-17 9:10 GMT-02:00 Andreas Stötzner a...@signographie.de:
 [sign] in its generality it is just perfect. […] At least, we should (in
English) speak of Emoticons and not Emoji. […] if precise terming is tricky
I find it better to generalize

These are your opinions.  I find them to be perfectly valid (exactly as
valid as anyone else’s, mine included).  However, no single individual's
opinion has any special power about what goes into the vocabulary of a
language; rather, the lexicon is determined collectively by whatever the
community of speakers finds to be useful.  Clearly English speakers found
sign to be too imprecise, and as of now, they seem to prefer emoji to
emoticon (probably because emoticon was already in use to denote
multi-character pictographs built from non-pictographs, such as :-) – the
original use of the coinage).  If speakers want a word referring
specifically to these new modal pictograms, they will have one and that's
it.

You're entitled to find linguistic borrowing to be ridiculous; but I'm
equally entitled to find your moral judgment to be condescending and
historically uninformed (unless you want to restrict yourself to
Anglo-Saxon words, in which case say goodbye to generality (< Lat.
*generalis*), emotion (< Fr. *émotion*), icon (< Greek *eikon*) etc.);
and at any rate neither of our opinions will have any effect on what words
the speakers adopt.


Re: Quasiquotation marks

2014-06-10 Thread Leonardo Boiko
What about using U+0331 combining macron below or U+0320 combining
minus below?  Here are some samples:

U+0331

̱tesṯ
“̱test”̱

U+0320

̠test̠
“̠test”̠


2014-06-10 9:39 GMT-03:00 Philippe Verdy verd...@wanadoo.fr:
 (overstriking with del or s in HTML)

Modern HTML phased out <s>, and <del> has semantic meanings
inappropriate for this case.  It would be better to use CSS
text-decoration: line-through.  This point has been raised in the
comments of the original post.

 How are they different to quoting multiple personalities, each one with their 
 own color (red, green, blue, black for the author, grey for side remarks...)

That could be bad for people with color blindness (which may reach up
to some 10% of the genetically male population).



Re: Swift

2014-06-04 Thread Leonardo Boiko
Even Ruby could do it for years, despite having notoriously bad Unicode
string support back then:

irb> 日本語 = 'むらさき'
=> むらさき

irb> íslenska = 'fjólublár'
=> fjólublár

irb> 日本語 + ' ' + íslenska
=> むらさき fjólublár

I don't think this feature saw much use, since programmers in a global
world can't assume that everyone will have easy access to their input
methods, and so tend to restrict code tokens to the ASCII set to encourage
participation.
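
Python, one of the languages named in the message quoted below, allows the same thing in Python 3. The following is just the irb session above transcribed as a sketch; the extra name `combined` is an addition for illustration.

```python
# Non-ASCII identifiers are ordinary variable names in Python 3.
日本語 = 'むらさき'
íslenska = 'fjólublár'

# Concatenate the two strings, as in the irb session.
combined = 日本語 + ' ' + íslenska
print(combined)

# str.isidentifier() reports whether a string is a valid identifier.
print('日本語'.isidentifier(), 'íslenska'.isidentifier())
```

Whether using such names is a good idea in shared code is a separate question, as the paragraph above points out.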



2014-06-04 8:45 GMT-03:00 David Starner prosfil...@gmail.com:

 On Wed, Jun 4, 2014 at 2:28 AM, Andre Schappo a.scha...@lboro.ac.uk
 wrote:
  I think this a huge step forward for i18n and Unicode.

 Could you not do that in Objective-C? If no, then it's a step forward
 for Apple, but the rest of us--Ada, C, C++, C#, Java, Python--have had
 this feature for years. 20 years in 2015 in the case of Ada.

 --
 Kie ekzistas vivo, ekzistas espero.



Re: CJK stroke order data: kRSUnicode v. kRSKangXi

2014-03-09 Thread Leonardo Boiko
I don't know about the points you raise, but I wish it was easier to help
proofread Unihan data.  Back in 2012 I compared kKangXi to kIRGKangXi and
found 252 conflicts, besides the cases where a character only has one or
the other.  I even put together a simple tool to help fix this, with
links to the relevant pages at the online Kang Xi[1].  I had no replies…

[1] http://namakajiri.net/misc/unihan_kangxi/compare_existing.html for
characters in Kang Xi, and for the others,
http://namakajiri.net/misc/unihan_kangxi/compare_nonexisting.html


2014-03-09 9:39 GMT-03:00 Adam Nohejl a...@nohejl.name:

 Hello again,

 I would be really grateful for any reply or at least pointers to relevant
 information about this topic (stroke-order data in Unihan, see my previous
 message below).

 Or is there any other appropriate place to discuss this?

 Thank you,

 --
 Adam

 On 2014/02/28, at 19:56, Adam Nohejl a...@nohejl.name wrote:
 
  Hello,
 
  I am comparing radical data for CJK characters from different sources,
 including the Unihan database. According to the Unihan documentation* the
 kRSUnicode radical should correspond to kRSKangXi radical, which in turn
 should be based on the Kang Xi dictionary.
 
  Is there any explanation for the following discrepancies? Did I miss any
 other rules or reasoning behind the content of these two fields?
 
  Examples of the discrepancies:
 
  (1) A very common character for most, maximum.
  U+6700  kRSKangXi   73.8
  U+6700  kRSUnicode  13.10
 
  (2) A funny character for autumn containing the turtle component.
  U+9F9D  kRSKangXi   115.16
  U+9F9D  kRSKanWa    115.16
  U+9F9D  kRSUnicode  213.5
 
  There are also characters that actually are not included in the Kang Xi
 dictionary**, but the Unihan data contain both a purported Kang Xi radical
 and in addition to that a _different_ Unicode radical.
 
  (3) The simplified turtle character (commonly assigned to the
 traditional radical #213):
  U+4E80  kRSKangXi   213.0
  U+4E80  kRSUnicode  5.10
 
  (4) Character with the radical #72/73 at the top, i.e. IMHO an arbitrary
 decision, but unexpectedly the fields differ:
  U+66FB  kRSKangXi   72.7
  U+66FB  kRSUnicode  73.7
 
  - - -
 
  [*] http://www.unicode.org/reports/tr38/tr38-8.html: Property:
 kRSUnicode // Description: (...) The first value is intended to reflect the
 same radical as the kRSKangXi field and the stroke count of the glyph used
 to print the character within the Unicode Standard.
 
  [**] The two characters are missing from the '89 edition of Kang Xi
 (which should be the same as used for Unihan) according to search on this
 site: http://ctext.org/dictionary.pl






Re: problem with combining diacritcs in HTML5

2012-10-07 Thread Leonardo Boiko
On 7 October 2012 04:37, Jukka K. Korpela jkorp...@cs.tut.fi wrote:
 Inspecting the Courier New font, version 5.11, I noticed that the advance
 width of the glyph for U+0332 (glyph uni0331) is 1129 units. I think this
 explains it all. The advance width should be 0.

 And other fonts have the same problem, at least the following: Courier,
 DejaVu Sans Mono, Fixedsys, Meiryo, Meiryo UI, Modern, Sun-ExtA, Terminal,
 VL PGothic.

I found some DejaVu bug reports where a developer called Ben Laenen
suggests the nonzero advance width is intentional:

 - https://bugs.freedesktop.org/show_bug.cgi?id=18614
 - https://bugs.freedesktop.org/show_bug.cgi?id=26941

He says that they use OpenType trickery to remove the extra spacing
and position the combining mark correctly, but some renderers don’t
play nice with that.

(just pointing; I'm a layman and have no idea which one’s the proper way.)




kKangXi and kIRGKangXi fields in Unihan

2012-05-23 Thread Leonardo Boiko
Hello,

As you know, the Unihan database has two fields listing indexes for
the Kāngxī Zìdiǎn dictionary, kKangXi and kIRGKangXi (where IRG is the
Ideographic Rapporteur Group).  If I’m counting correctly, 106
characters have values only in kKangXi, while 49300 have a value only
in kIRGKangXi.  The remaining usually have the same value for the two
fields, but they differ in 252 cases.
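
A count like the one above can be sketched in a few lines of Python. This is not the Emacs-macro procedure actually used; it is a minimal sketch assuming the usual Unihan text layout (tab-separated code point, field name, value, with `#` comment lines) and a plain string comparison of the two values. The function names and the sample lines are made up for illustration and are not real Unihan data.

```python
from collections import defaultdict

def load_fields(lines, wanted):
    """Collect {codepoint: {field: value}} for the wanted field names."""
    table = defaultdict(dict)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        codepoint, field, value = line.split("\t", 2)
        if field in wanted:
            table[codepoint][field] = value
    return table

def compare(lines, a="kKangXi", b="kIRGKangXi"):
    """Return (only in a, only in b, present in both but different)."""
    table = load_fields(lines, {a, b})
    only_a = sorted(cp for cp, f in table.items() if a in f and b not in f)
    only_b = sorted(cp for cp, f in table.items() if b in f and a not in f)
    conflicts = sorted(cp for cp, f in table.items()
                       if a in f and b in f and f[a] != f[b])
    return only_a, only_b, conflicts

# Made-up sample values, NOT real Unihan data.
sample = [
    "# hypothetical excerpt",
    "U+4E00\tkKangXi\t0075.010",
    "U+4E00\tkIRGKangXi\t0075.010",
    "U+4E01\tkKangXi\t0075.020",
    "U+4E01\tkIRGKangXi\t0075.021",
    "U+4E02\tkIRGKangXi\t0075.030",
]
only_a, only_b, conflicts = compare(sample)
print(len(only_a), len(only_b), len(conflicts))
```

With a real Unihan file, passing the open file object as `lines` should work the same way, since the function only iterates over lines.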

Earlier[1] someone asked about which field was correct when there’s a
conflict.  John H. Jenkins replied that “whichever one has the correct
data is the correct one. :-) ”, and invited help in finding errors.

Well I wanted to help, but I can’t read Chinese properly so I have
trouble validating the characters in the Kāngxī (I can recognize them
visually, but without understanding the definitions I might mistake
some Z-variant or something).  However, after a few Emacs macros, I
came up with this simple HTML form to help check which one is correct:

http://namakajiri.net/misc/unihan_kangxi/compare_existing.html
http://namakajiri.net/misc/unihan_kangxi/compare_nonexisting.html

The first link lists conflicting pairs where at least one of the
indexes claims the character is actually present in the Kāngxī, while
the other lists the remaining “virtual” indexes.  Each pair is listed
with links to the relevant Kāngxī pages (courtesy of the online
edition[2]), and a link to Unihan.  Once the form is submitted, it
makes a list of the entries chosen as correct by the user.  The
results are shown in plain text, and it should be simple to compare
several tries for double-checking.

I don’t know if there’s interest in such a thing at the moment, but if
so, there you go.  All values apply to Unihan data downloaded a week
ago or so.

--
Leonardo Boiko
http://namakajiri.net/nikki

References:
[1] http://unicode.org/mail-arch/unicode-ml/y2007-m03/0014.html
[2] http://www.kangxizidian.com/




Re: Best smart phones apps for diverse scripts?

2010-10-29 Thread Leonardo Boiko
On Fri, Oct 29, 2010 at 22:27, Deborah Goldsmith golds...@apple.com wrote:
 iPhone 4 supports Unicode in SMS messages. Furthermore, the SMS standard 
 provides for Unicode in messages:

Only UTF-16 though, which brings SMS’s already appallingly low 160/140
character limit to a measly 70.  Not a problem if you’re writing
Chinese or Japanese, but if you’re writing, say, Spanish, or English
with a single symbol requiring you to engage Unicode mode, you’re back
to the telegram age.  I don’t know about your countries, but here the price
per SMS really bites…
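
The numbers above fall out of simple arithmetic on a 140-octet payload: 7 bits per character gives 160 characters, 16 bits gives 70. The Python sketch below shows that calculation. It is a deliberate simplification: it treats "pure ASCII" as a stand-in for the 7-bit SMS alphabet, which is an assumption of this example (the real alphabet is not identical to ASCII), and the function names are made up.

```python
SMS_PAYLOAD_OCTETS = 140  # payload of a single message, in octets

def sms_capacity(bits_per_char: int) -> int:
    """Number of characters that fit in one message at a given width."""
    return SMS_PAYLOAD_OCTETS * 8 // bits_per_char

def single_sms_limit(text: str) -> int:
    """Character limit for one message carrying this text.

    Simplification: ASCII-only text is assumed to fit the 7-bit
    alphabet; anything else forces 16-bit units for the whole message.
    """
    if text.isascii():
        return sms_capacity(7)
    return sms_capacity(16)

print(sms_capacity(7), sms_capacity(16))
print(single_sms_limit("good morning"), single_sms_limit("buen día"))
```

A single character outside the 7-bit set is enough to switch the whole message to the smaller limit, which is the complaint made above.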

-- 
Leonardo Boiko




Re: Creative people on Twitter

2010-10-12 Thread Leonardo Boiko
I guess it’s only a matter of 𝐭𝐢𝐦𝐞 before people start doing
things like 𝖙𝖍𝖎𝖘 (notice this email is plain-text).
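
Assuming the styled words were typed with the Mathematical Alphanumeric Symbols block, the trick can be sketched in Python as a fixed offset from ASCII lowercase letters. The helper name `restyle` is made up, and only lowercase a-z is handled; NFKC normalization maps the result back to plain letters.

```python
import unicodedata

MATH_BOLD_SMALL_A = 0x1D41A      # MATHEMATICAL BOLD SMALL A
BOLD_FRAKTUR_SMALL_A = 0x1D586   # MATHEMATICAL BOLD FRAKTUR SMALL A

def restyle(text: str, small_a: int) -> str:
    """Shift ASCII lowercase letters into a styled alphabet starting at small_a."""
    return "".join(
        chr(small_a + ord(c) - ord("a")) if "a" <= c <= "z" else c
        for c in text
    )

styled = restyle("time", MATH_BOLD_SMALL_A)
fraktur = restyle("this", BOLD_FRAKTUR_SMALL_A)
print(styled, fraktur)

# Compatibility normalization folds the styled letters back to ASCII.
print(unicodedata.normalize("NFKC", styled))
```

This is also why such text survives in a plain-text email: the styling lives in the code points, not in any markup.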
-- 
Leonardo Boiko




Re: ,,semi-virgula''

2010-08-31 Thread Leonardo Boiko
2010/8/31 Janusz S. Bień jsb...@mimuw.edu.pl:
 First, is semivirgula a good name? Google shows that it often refers
 to semicolon.

I’m no specialist and I have no idea about what’s the original name of
that diacritic, but “virgula” is the name of the comma in medieval
manuscripts[1].  To this day it’s the word for “comma” in Portuguese.
We also call the semicolon a “ponto e vírgula” – period and comma,
dot-and-comma.

[1] http://www.etymonline.com/index.php?search=virgula

-- 
Leonardo Boiko




Re: TeX: insert Unicode character

2010-08-24 Thread Leonardo Boiko
I’d tell him to start using XeTeX+fontspec, if he isn’t already
(link: http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=xetex
).  Solves all your Unicode problems.

Then, all he has to do is to insert the character directly in the .tex
file and select a system font (using fontspec) that has its glyph.

-- 
Leonardo Boiko




Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters (was Re: long s (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters))

2010-08-04 Thread Leonardo Boiko
On Wed, Aug 4, 2010 at 05:19, William_J_G Overington
 Long s was used with ordinary Roman type in England for English text in at 
 least part of the 17th and 18th centuries.

More on that by babelstone:
http://babelstone.blogspot.com/2006/06/rules-for-long-s.html

(Sorry for the duplicate email William, my mistake.)

-- 
Leonardo Boiko



Re: Most complete (free) Chinese font?

2010-08-02 Thread Leonardo Boiko
Emphasis on “the only font _I know_”.  I didn’t know Andron nor
Everson Mono.  Besides, while quality, both seem to be non-free, which
is something I’m not interested in as a Debian user (nothing against
it, it just isn’t my thing).

On Mon, Aug 2, 2010 at 05:48, Michael Everson ever...@evertype.com wrote:
 On 2 Aug 2010, at 08:52, Andreas Stötzner wrote:

 Am 01.08.2010 um 13:03 schrieb Leonardo Boiko:

 And it’s the only font I know with U+2E19 PALM BRANCH ⸙

 It is not. Andron has it.

 As does Everson Mono.

 Michael Everson * http://www.evertype.com/





-- 
Leonardo Boiko
http://namakajiri.net




Re: Most complete (free) Chinese font?

2010-08-02 Thread Leonardo Boiko
When did I say there was something shameful about non-freeness? I only
said, and I quote, that it’s not my thing.  Since I run a free
operating system, it can automatically download and manage free
content, so it’s more convenient for _me_ to keep using free content.
I manage about a thousand computers in a public university in Brazil,
with little funding and plenty of bureaucracy.  Dealing with custom
licensing terms and ad-hoc downloading and manual installation is
simply too inconvenient.  It’s much simpler, for me, to stick to an
automated system that guarantees freedom.

As an author, you’re entitled to license your work to your heart’s
content.  Don’t take this as an accusation.  As a sysadmin, I’m also
entitled to not care about non-free stuff.  I don’t think it’s
shameful, I simply don’t use it.

On Mon, Aug 2, 2010 at 08:11, Michael Everson ever...@evertype.com wrote:
 On 2 Aug 2010, at 11:57, Leonardo Boiko wrote:

 Emphasis on “the only font _I know_”.  I didn’t know Andron nor Everson 
 Mono.  Besides, while quality, both seem to be non-free, which is something 
 I’m not interested in as a Debian user (nothing against it, it just isn’t my 
 thing).

 Huh.

 Well, Leonardo, when I am independently wealthy, I'll be happy to give 
 everything I do away for free. In the meantime, I find the extremely 
 occasional shareware fee I get to be a welcome affirmation that Everson Mono 
 is appreciated.

 There is nothing shameful, or a shame, about non-free fonts.

 Michael Everson * http://www.evertype.com/







-- 
Leonardo Boiko
http://namakajiri.net




Re: Most complete (free) Chinese font?

2010-08-01 Thread Leonardo Boiko
Oh, it _is_ totally blocky, and will look terrible if scaled to
anything other than its natural 16-pixel size. My point is, this is
how it’s supposed to be, cause it’s a bitmapped, monospace terminal
font.  Like Terminus or xorg’s “fixed”; you use it for computer code,
not books.  And it’s the only font I know with U+2E19 PALM BRANCH ⸙ ;)

I hope the other fonts mentioned were useful.  From a quick search in
my debian system I found, other than WQY, only the Arphic family of
fonts, with AR PL Ukai (kǎitǐ) and AR PL UMing (míngtǐ) being their
Unicode representatives.  I’m kind of surprised at how few free
Chinese fonts there seem to be; probably you’ll have to scavenge the
native web for more, as I had to do for Japanese.

On Sun, Aug 1, 2010 at 04:05, jander...@talentex.co.uk
jander...@talentex.co.uk wrote:
 I didn't mean it unkindly, though :-) It's just that it looks rather blocky.
 Also I think the developers themselves declare it to be ugly, but
 complete, if I remember correctly.

 /jan

 Leonardo Boiko wrote:

 Unifont is not ugly for its intended purpose: a bitmapped, fixed-width
 16-pixel font.  It’s great for terminals or Emacs IMHO, as long as
 your monitor resolution isn’t too high…

 I don’t know Chinese so I can’t vouch for coverage, but Wen Quan Yi
 seems to be the most popular open-source Chinese font (the hànzì in
 Unifont are actually based on it, IIRC).  The website is
 http://wenq.org/enindex.cgi , but it’s pre-packaged for all major
 distros.








-- 
Leonardo Boiko
http://namakajiri.net