Re: Usage stats?

Michael Norton Fri, 27 Mar 2015 14:25:40 -0700

I'm trying to get a sense of the range and variance of the Unicode set in
the same way I have with hypertext on the web: for every HTML or XHTML
document URL, for example, there is going to be a minimum (> 0) of "<" and
">" characters. Depending on which markup set and schema(s) you are
using, per-character minimums (char-MINs) and, eventually, maximums
(char-MAXs) are useful to have.
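
To make the kind of tally discussed in this thread concrete, here is a
minimal sketch of a per-code-point count for a single document. It is only
an illustration under my own assumptions: the function name
code_point_frequencies and the sample string are hypothetical, and it
counts code points in one in-memory string rather than across a crawled
corpus.

```python
from collections import Counter


def code_point_frequencies(text):
    """Return (label, count, percent) tuples, most frequent first.

    Hypothetical helper for illustration; counts Unicode code points,
    not grapheme clusters or bytes.
    """
    counts = Counter(text)  # iterating a str yields one item per code point
    total = len(text)
    rows = []
    # Sort by descending count, then by code point so ties are deterministic.
    for ch, n in sorted(counts.items(), key=lambda kv: (-kv[1], ord(kv[0]))):
        rows.append((f"U+{ord(ch):04X}", n, 100.0 * n / total))
    return rows


if __name__ == "__main__":
    sample = "<p>Hello, world</p>"  # hypothetical input, not real traffic data
    for label, n, pct in code_point_frequencies(sample):
        print(f"{label} - {n:3d} - {pct:5.2f}%")
```

Running the same function over every document in a collection and keeping
the smallest and largest count seen for a given code point (for example
U+003C and U+003E) would give the per-character minimums and maximums
mentioned above, for whatever collection was chosen.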


On Fri, Mar 27, 2015 at 5:03 PM, Michael Norton <
[email protected]> wrote:

> Doug Ewell's getting it.   He sent this back to me, so I asked him if he
> could provide the same dataset drawn from his written reply to me:
>
> For example, your original e-mail (327 characters) consists of:
>
> U+0020 - 14.07%
> U+0065 - 10.09%
> U+0061 -  7.03%
> U+0074 -  6.73%
> U+006F -  5.81%
>
> This is good because when the volumes of traffic begin to exponentially
> increase over a space, if there are predominant formulations of Unicode for
> each, they need to be recognized for a number of reasons depending on which
> sector or, as you say, corpus, you're in.
>
> In the above example, I think it's safe to say U+0020 predominates online,
> though I would like to compare it with the other 30 "space" characters you
> mentioned, Markus. If I know traffic figures for where the other space
> characters are used, I can draw a pretty good estimate and correlation.
>
> On Fri, Mar 27, 2015 at 4:56 PM, Markus Scherer <[email protected]>
> wrote:
>
>> On Fri, Mar 27, 2015 at 1:27 PM, Michael Norton <
>> [email protected]> wrote:
>>
>>> Easy example: what's the code for [blank space] U+0020 across all
>>> language sets of Unicode?  Is it the same, i.e. 100%?
>>>
>>
>> I don't understand what you are asking, and I have a hunch you haven't
>> said it in a way that anyone else understands it either.
>>
>> The code point value that the Unicode Standard assigns to the normal
>> space is U+0020, but
>> - not every language uses spaces
>> - not every language that uses spaces uses them for the same purpose as
>> English
>> - there are some 30 other "space" characters in Unicode
>>
>> Statistics of character frequencies vary by corpus, as others have said.
>> Even if you "only" look "on the web", that's undefined until you specify a
>> crawling strategy. Dynamically generated content means that there is an
>> infinite number of "web pages". Every crawler will come up with a different
>> set.
>>
>> Maybe you are asking about statistics of character encodings? On the web?
>> Such as, Unicode vs. Shift-JIS vs. ISO 8859-2 etc.?
>>
>> markus
>>
>


-- 

Michael A. Norton, B.A. Cinema, M.P.A.
My Cinema Home: http://www.NortonsNook.com

"All great actors are mere mathematical masters of speech and the human
body."

_______________________________________________
Unicode mailing list
[email protected]
http://unicode.org/mailman/listinfo/unicode
