I'm trying to get a sense of the range and variance of the Unicode set in the same way I have with hypertext on the web. For every HTML or XHTML document URL, for example, there is going to be a minimum count, greater than zero, of "<" and ">" characters. Depending on which markup set and schema(s) you are using, per-character minimums (char-MINs) and, eventually, maximums (char-MAXs) are useful to have.
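To make the idea concrete, here is a minimal sketch, in Python, of how a per-code-point tally like the one Doug sent could be produced for a single document, along with a count of the "<" and ">" delimiters. The function names `code_point_frequencies` and `delimiter_counts` and the sample string are my own illustrations, not anything taken from this thread or from an existing tool, and the percentages only describe whatever text is passed in.

```python
from collections import Counter


def code_point_frequencies(text):
    """Return (code point label, percentage) pairs, most frequent first.

    Hypothetical helper: counts Python str elements, i.e. Unicode code
    points, not bytes and not grapheme clusters.
    """
    counts = Counter(text)
    total = len(text)
    return [
        (f"U+{ord(ch):04X}", 100.0 * n / total)
        for ch, n in counts.most_common()
    ]


def delimiter_counts(text):
    """Count the "<" and ">" characters in one document (hypothetical helper)."""
    return {"<": text.count("<"), ">": text.count(">")}


# Illustrative sample only; a real corpus would be read from files or a crawl.
sample = "<html><body>Hello, Unicode list</body></html>"

for label, pct in code_point_frequencies(sample)[:5]:
    print(f"{label} - {pct:.2f}%")

print(delimiter_counts(sample))
```

Running the same two functions over every document in a corpus and keeping the smallest and largest count per character would give the char-MIN and char-MAX figures, though the results would still depend entirely on how the corpus was chosen.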
On Fri, Mar 27, 2015 at 5:03 PM, Michael Norton <[email protected]> wrote:

> Doug Ewell's getting it. He sent this back to me, so I asked him if he
> could provide the same dataset drawn from his written reply to me:
>
> For example, your original e-mail (327 characters) consists of:
>
> U+0020 - 14.07%
> U+0065 - 10.09%
> U+0061 - 7.03%
> U+0074 - 6.73%
> U+006F - 5.81%
>
> This is good because when the volumes of traffic begin to exponentially
> increase over a space, if there are predominant formulations of Unicode for
> each, they need to be recognized for a number of reasons depending on which
> sector or, as you say, corpus, you're in.
>
> In the above example, I think it's safe to say U+0020 online, though I
> would like to compare with the other 30 "space" characters you mentioned
> Markus. If I know traffic figures for where the other space characters
> are used, I can draw a pretty good estimation and correlation of it.
>
> On Fri, Mar 27, 2015 at 4:56 PM, Markus Scherer <[email protected]>
> wrote:
>
>> On Fri, Mar 27, 2015 at 1:27 PM, Michael Norton <
>> [email protected]> wrote:
>>
>>> Easy example: what's the code for [blank space] U+020 across all
>>> language sets of Unicode? Is it the same ie: 100%?
>>
>> I don't understand what you are asking, and I have a hunch you haven't
>> said it in a way that anyone else understands it either.
>>
>> The code point value that the Unicode Standard assigns to the normal
>> space is U+0020, but
>> - not every language uses spaces
>> - not every language that uses spaces uses them for the same purpose as
>> English
>> - there are some 30 other "space" characters in Unicode
>>
>> Statistics of character frequencies vary by corpus, as others have said.
>> Even if you "only" look "on the web", that's undefined until you specify a
>> crawling strategy. Dynamically generated content means that there is an
>> infinite number of "web pages". Every crawler will come up with a different
>> set.
>>
>> Maybe you are asking about statistics of character encodings? On the web?
>> Such as, Unicode vs. Shift-JIS vs. ISO 8859-2 etc.?
>>
>> markus
>
> --
> Michael A. Norton, B.A. Cinema, M.P.A.
> My Cinema Home: http://www.NortonsNook.com
>
> "All great actors are mere mathematical masters of speech and the human
> body."

--
Michael A. Norton, B.A. Cinema, M.P.A.
My Cinema Home: http://www.NortonsNook.com

"All great actors are mere mathematical masters of speech and the human
body."
_______________________________________________
Unicode mailing list
[email protected]
http://unicode.org/mailman/listinfo/unicode

