Re: Usage stats?

2015-03-28 Thread Richard Wordingham
On Sat, 28 Mar 2015 00:59:56 +
Richard Wordingham  wrote:

> On Fri, 27 Mar 2015 16:27:26 -0400
> Michael Norton  wrote:
> 
> > Easy example: what's the code for [blank space] U+0020 across all
> > language sets of Unicode?  Is it the same, i.e. 100%?

I've seen a claim from a normally reliable source that U+0020 is
extremely rare in Thai or Japanese text.  It does occur in Japanese
text, though quite possibly as an error for IDEOGRAPHIC SPACE.

In Thai, U+0020 is an extremely common and prescribed punctuation
mark. It is reliably used as a clause and sentence separator, and is
also used to delimit names and also numbers composed of digits.  In
newspaper columns, it occurs in most lines, and in books there are
usually several to the line.  The other common punctuation marks in
serious material are the abbreviation mark U+002E FULL STOP (especially
for initialisms) and the list item separator U+002C COMMA.  Quotation
marks, exclamation marks and ellipses occur in fictional dialogue with
pretty much the same meaning as in English.

Richard.
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Usage stats?

2015-03-28 Thread Michael Norton
& another:  the universe is the concert hall; its elements the instruments;
we're but the resonance of its winds, strings, and voices' echoes.

On Sat, Mar 28, 2015 at 12:21 AM, Steven R. Loomis 
wrote:

> Here's an analogy: it's more of a piano factory than a concert hall.
>
> S
>
> Sent from our iPhone.
>
> On Mar 27, 2015, at 1:16 PM, Michael Norton <
> michaelanortons...@gmail.com> wrote:
>
> (I know this is way too simplistic a response but it is kind of like
> giving everyone an invisible cloak and an invisible dagger and not telling
> them what a cloak and dagger is for [cutting butter & keeping warm]).
>
> On Fri, Mar 27, 2015 at 3:57 PM, Michael Norton <
> michaelanortons...@gmail.com> wrote:
>
> Why wouldn't Unicode itself have it?
>>
>


-- 

Michael A. Norton, B.A. Cinema, M.P.A.
My Cinema Home: http://www.NortonsNook.com

"All great actors are mere mathematical masters of speech and the human
body."


Re: Usage stats?

2015-03-28 Thread Michael Norton
You needn't teach me about English sir, I am a writer most of the time.
Science, however, has no shortage in the need for great teachers.   There
are even English people here in America short on the science out of
England!

Anyway, the pattern checks out on a web page I ran this morning in a
Unicode counter this membership provided me.   Equilibrium is of great
importance in electromagnetism when considering whether or not work needs
to be done in a given scenario.

On Sat, Mar 28, 2015 at 12:52 PM, Doug Ewell  wrote:

> Michael Norton wrote:
>
>> Thanks Doug.  I did not know there exists a representative sample of
>> the world's text. :)
>>
>
> There is not, which was the point.
>
> Thanks for reposting a private message back to the list, by the way. 💢
>
>> Your frequency chart is great.   The average char appearance is 2.91%.
>> Only 34% from your list exceed 10% of it.  Therefore, U+0020 is the
>> elephant in the room (i.e. 15.05% is far > 2.91%).   In fact, it's
>> almost >50% greater than the next most-appearing character.
>>
>
> Words in English are separated by spaces, and the average English word is
> about 5 letters long. It follows that English text will contain a lot of
> spaces. You can eyeball this.
>
>> Only 34% from your list exceed 10% of the average percentile (2.9%).
>>
>> This is serendipitously common (e.g. the Earth:Moon albedo ratio is
>> .36).   A relationship about motion and other natural properties and
>> characteristics among the local texts begins to emerge.
>>
>
> Right.
>
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸
>



-- 

Michael A. Norton, B.A. Cinema, M.P.A.
My Cinema Home: http://www.NortonsNook.com

"All great actors are mere mathematical masters of speech and the human
body."


Re: Usage stats?

2015-03-28 Thread Doug Ewell

Michael Norton wrote:


> Thanks Doug.  I did not know there exists a representative sample of
> the world's text. :)


There is not, which was the point.

Thanks for reposting a private message back to the list, by the way. 💢


> Your frequency chart is great.   The average char appearance is 2.91%.
> Only 34% from your list exceed 10% of it.  Therefore, U+0020 is the
> elephant in the room (i.e. 15.05% is far > 2.91%).   In fact, it's
> almost >50% greater than the next most-appearing character.


Words in English are separated by spaces, and the average English word 
is about 5 letters long. It follows that English text will contain a lot 
of spaces. You can eyeball this.
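
A rough back-of-the-envelope version of that estimate, as a minimal sketch: it assumes every word is followed by exactly one space and ignores punctuation, digits and line breaks. The function name is invented for illustration.

```python
def expected_space_share(avg_word_len: float) -> float:
    """Estimate the fraction of characters that are spaces.

    Assumes each word is followed by exactly one space and that
    nothing else (punctuation, digits, newlines) is counted.
    """
    return 1.0 / (avg_word_len + 1.0)


# With an average word length of about 5 letters, roughly one
# character in six is a space.
share = expected_space_share(5)
print(f"{share:.2%}")
```

With 5-letter words this comes to about 16.7%, in the same range as the 15.05% figure for U+0020 discussed in this thread.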



> Only 34% from your list exceed 10% of the average percentile (2.9%).
>
> This is serendipitously common (e.g. the Earth:Moon albedo ratio is
> .36).   A relationship about motion and other natural properties and
> characteristics among the local texts begins to emerge.


Right.

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸 




Re: Usage stats?

2015-03-28 Thread Michael Norton
Thanks Doug.  I did not know there exists a *representative* sample of the
world's text. :)  I do know that 400 years ago there were about 10,000
languages; now there are about 6,500.  Time flies!

Your frequency chart is great.   The average char appearance is 2.91%.
Only 34% from your list exceed 10% of it.  Therefore, U+0020 is the
elephant in the room (i.e. 15.05% is far > 2.91%).   In fact, it's
almost >50% greater than the next most-appearing character.

So from the two frequency lists you've given me (my email and yours) we
begin to see some patterns emerge.  Provided prior data and observation,
most useful patterns prevail over other more obscure ones and present a
provocative opportunity for webbers out there.  While this is probably out
of context for most of the 700 Unicode members, I can report that it's good
news.

On Fri, Mar 27, 2015 at 5:31 PM, Doug Ewell  wrote:

> Here is a frequency chart for my previous message. I used the Character
> Frequency tool in Andrew West's BabelPad editor (
> http://www.babelstone.co.uk/Software/BabelPad.html) and sent the output
> to Excel to calculate the percentages. To make Excel happy, I had to
> manually add a single quote ' before the double quotation mark " .
>
> This is still *nowhere near* a realistic sample of which Unicode
> characters are used with what frequency in the entire world. There are
> still only 69 discrete characters, less than the printable ASCII set. And
> according to this sample, Regional Indicator Symbols occur as often as
> capital A, and capital R never occurs at all.
>
> In Japanese or Thai text you will have almost no instances of U+0020.
>
> If you search a non-representative sample of the world's text, you will
> get non-representative statistics.
>
>
> Code point  Character  Character Name                       Count  Percent
> U+0020                 SPACE                                  177   15.05%
> U+0065      e          LATIN SMALL LETTER E                    92    7.82%
> U+0074      t          LATIN SMALL LETTER T                    86    7.31%
> U+006F      o          LATIN SMALL LETTER O                    76    6.46%
> U+0061      a          LATIN SMALL LETTER A                    63    5.36%
> U+0069      i          LATIN SMALL LETTER I                    62    5.27%
> U+006E      n          LATIN SMALL LETTER N                    54    4.59%
> U+0072      r          LATIN SMALL LETTER R                    50    4.25%
> U+0073      s          LATIN SMALL LETTER S                    47    4.00%
> U+006C      l          LATIN SMALL LETTER L                    44    3.74%
> U+0063      c          LATIN SMALL LETTER C                    38    3.23%
> U+0068      h          LATIN SMALL LETTER H                    34    2.89%
> U+0075      u          LATIN SMALL LETTER U                    33    2.81%
> U+0064      d          LATIN SMALL LETTER D                    27    2.30%
> U+0079      y          LATIN SMALL LETTER Y                    25    2.13%
> U+0067      g          LATIN SMALL LETTER G                    18    1.53%
> U+002E      .          FULL STOP                               16    1.36%
> U+0030      0          DIGIT ZERO                              15    1.28%
> U+0062      b          LATIN SMALL LETTER B                    15    1.28%
> U+0066      f          LATIN SMALL LETTER F                    15    1.28%
> U+003E      >          GREATER-THAN SIGN                       13    1.11%
> U+0070      p          LATIN SMALL LETTER P                    13    1.11%
> U+0077      w          LATIN SMALL LETTER W                    12    1.02%
> U+002C      ,          COMMA                                   11    0.94%
> U+006D      m          LATIN SMALL LETTER M                    11    0.94%
> U+0055      U          LATIN CAPITAL LETTER U                   9    0.77%
> U+002D      -          HYPHEN-MINUS                             8    0.68%
> U+0076      v          LATIN SMALL LETTER V                     7    0.60%
> U+0078      x          LATIN SMALL LETTER X                     7    0.60%
> U+0027      '          APOSTROPHE                               6    0.51%
> U+0025      %          PERCENT SIGN                             5    0.43%
> U+002B      +          PLUS SIGN                                5    0.43%
> U+0037      7          DIGIT SEVEN                              5    0.43%
> U+006B      k          LATIN SMALL LETTER K                     5    0.43%
> U+0022      '"         QUOTATION MARK                           4    0.34%
> U+0031      1          DIGIT ONE                                4    0.34%
> U+0036      6          DIGIT SIX                                4    0.34%
> U+003A      :          COLON                                    4    0.34%
> U+003F      ?          QUESTION MARK                            4    0.34%
> U+0032      2          DIGIT TWO                                3    0.26%
> U+0033      3          DIGIT THREE                              3    0.26%
> U+0034      4          DIGIT FOUR                               3    0.26%
> U+0042      B          LATIN CAPITAL LETTER B                   3    0.26%
> U+004C      L          LATIN CAPITAL LETTER L                   3    0.26%
> U+004F      O          LATIN CAPITAL LETTER O                   3    0.26%
> U+0057      W          LATIN CAPITAL LETTER W                   3    0.26%
> U+002F      /          SOLIDUS                                  2    0.17%
> U+0035      5          DIGIT FIVE                               2    0.17%
> U+0043      C          LATIN CAPITAL LETTER C                   2    0.17%
> U+0045      E          LATIN CAPITAL LETTER E                   2    0.17%
> U+0046      F          LATIN CAPITAL LETTER F                   2    0.17%
> U+0049      I          LATIN CAPITAL LETTER I                   2    0.17%
> U+004E      N          LATIN CAPITAL LETTER N                   2    0.17%
> U+007C      |          VERTICAL LINE                            2    0.17%
> U+0028      (          LEFT PARENTHESIS                         1    0.09%
> U+0029      )          RIGHT PARENTHESIS                        1    0.09%
> U+0038      8          DIGIT EIGHT                              1    0.09%
> U+0039      9          DIGIT NINE                               1    0.09%
> U+003B      ;          SEMICOLON                                1    0.09%
> U+003C      <          LESS-THAN SIGN                           1    0.09%
> U+0041      A          LATIN CAPITAL LETTER A                   1    0.09%
> U+0044      D          LATIN CAPITAL LETTER D                   1    0.09%
> U+004A      J          LATIN CAPITAL LETTER J                   1    0.09%
> U+004D      M          LATIN CAPITAL LETTER M                   1    0.09%
> U+0050      P          LATIN CAPITAL LETTER P                   1    0.09%
> U+0054      T          LATIN CAPITAL LETTER T                   1    0.09%
> U+0059      Y          LATIN CAPITAL LETTER Y                   1    0.09%
> U+1F1F8     🇸          REGIONAL INDICATOR SYMBOL LETTER S       1    0.09%
> U+1F1FA     🇺          REGIONAL INDICATOR SYMBOL LETTER U       1    0.09%
> Total                                                        1176  100.00%
>
>  --
> Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸
>



-- 

Michael A. Norton, B.A. Cinema, M.P.A.
My Cinema Home: http://www.NortonsNook.com

"All great actors are mere mathematical masters of speech and the human
body."


Re: Usage stats?

2015-03-28 Thread Michael Norton
*Important correction from my last sent email*:

Only 34% from your list exceed 10% of *the average percentile (2.9%)*.

This is serendipitously common (e.g. the Earth:Moon albedo ratio is .36).
A relationship about motion and other natural properties and
characteristics among the local texts begins to emerge.

On Sat, Mar 28, 2015 at 7:30 AM, Michael Norton <
michaelanortons...@gmail.com> wrote:

> Thanks Doug.  I did not know there exists a *representative* sample of
> the world's text. :)  I do know that 400 years ago there were about 10,000
> languages; now there are about 6,500.  Time flies!
>
> Your frequency chart is great.   The average char appearance is 2.91%.
> Only 34% from your list exceed 10% of it.  Therefore, U+0020 is the
> elephant in the room (i.e. 15.05% is far > 2.91%).   In fact, it's
> almost >50% greater than the next most-appearing character.
>
> So from the two frequency lists you've given me (my email and yours) we
> begin to see some patterns emerge.  Provided prior data and observation,
> most useful patterns prevail over other more obscure ones and present a
> provocative opportunity for webbers out there.  While this is probably out
> of context for most of the 700 Unicode members, I can report that it's good
> news.
>


Re: Usage stats?

2015-03-27 Thread Steven R. Loomis
Here's an analogy: it's more of a piano factory than a concert hall. 

S

Sent from our iPhone.

> On Mar 27, 2015, at 1:16 PM, Michael Norton wrote:
> 
> (I know this is way too simplistic a response but it is kind of like giving 
> everyone an invisible cloak and an invisible dagger and not telling them what 
> a cloak and dagger is for [cutting butter & keeping warm]).
> 
>> On Fri, Mar 27, 2015 at 3:57 PM, Michael Norton 
>>  wrote:
>> Why wouldn't Unicode itself have it?


Re: Usage stats?

2015-03-27 Thread Clive Hohberger
Interesting that you should bring up the ^ and tilde. Their OS independence
and IBM mainframe compatibility is the reason in 1985 I chose them as the
command prefixes for the now widely used ZPL label design programming
language. ISO 646 IRV was chosen as the programming character set.

On Friday, March 27, 2015, David Starner  wrote:

> On Fri, Mar 27, 2015 at 2:03 PM, Michael Norton
> > wrote:
> >
> > This is good because when the volumes of traffic begin to exponentially
> increase over a space, if there are predominant formulations of Unicode for
> each, they need to be recognized for a number of reasons depending on which
> sector or, as you say, corpus, you're in.
>
> Huh?
>
> > In the above example, I think it's safe to say U+0020 online, though I
> would like to compare with the other 30 "space" characters you mentioned
> Markus.   If I know traffic figures for where the other space characters
> are used, I can draw a pretty good estimation and correlation of it.
>
> ASCII characters are the safest to use online (everyone supports
> them), except when they are the most dangerous (characters found
> outside ASCII can rarely be used for tag/SQL/code injection). If you
> want to know what people can display, look at the fonts that come with
> the OSes that you're interested in. There's interesting things you can
> do with this data, but if you want to know what's safe online, it's
> way more important to be familiar with the basic preexisting character
> sets than to know what the distribution of characters is. ~ and ^ will
> work about everywhere, whereas á won't and ę is even worse, and that
> has nothing to do with their frequency online.
>
>
> --
> Kie ekzistas vivo, ekzistas espero.
>
>


-- 
Clive P. Hohberger, PhD MBA
Managing Director
Clive Hohberger, LLC
+1 847 910 8794
cp...@case.edu


Re: Usage stats?

2015-03-27 Thread David Starner
On Fri, Mar 27, 2015 at 2:03 PM, Michael Norton
 wrote:
>
> This is good because when the volumes of traffic begin to exponentially 
> increase over a space, if there are predominant formulations of Unicode for 
> each, they need to be recognized for a number of reasons depending on which 
> sector or, as you say, corpus, you're in.

Huh?

> In the above example, I think it's safe to say U+0020 online, though I would 
> like to compare with the other 30 "space" characters you mentioned Markus.   
> If I know traffic figures for where the other space characters are used, I 
> can draw a pretty good estimation and correlation of it.

ASCII characters are the safest to use online (everyone supports
them), except when they are the most dangerous (characters found
outside ASCII can rarely be used for tag/SQL/code injection). If you
want to know what people can display, look at the fonts that come with
the OSes that you're interested in. There's interesting things you can
do with this data, but if you want to know what's safe online, it's
way more important to be familiar with the basic preexisting character
sets than to know what the distribution of characters is. ~ and ^ will
work about everywhere, whereas á won't and ę is even worse, and that
has nothing to do with their frequency online.
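
One way to make the point about preexisting character sets concrete is to ask which legacy encodings can represent a given character at all. The following is only a sketch: the list of codecs is an arbitrary assumption chosen for illustration (it is not from this thread), and the function name is made up.

```python
# Legacy codecs chosen only as examples; the list is an assumption,
# not something taken from the thread.
LEGACY_CODECS = ["ascii", "latin_1", "iso8859_2", "koi8_r", "cp437"]


def codecs_supporting(ch: str) -> list[str]:
    """Return the codecs from LEGACY_CODECS that can encode ch."""
    supported = []
    for name in LEGACY_CODECS:
        try:
            ch.encode(name)
        except UnicodeEncodeError:
            continue
        supported.append(name)
    return supported


# The four characters mentioned above: ~ ^ á ę
for ch in "~^\u00e1\u0119":
    print(f"U+{ord(ch):04X} {ch}: {', '.join(codecs_supporting(ch))}")
```

With this particular list, ~ and ^ are encodable everywhere, á only in some of the codecs, and ę in fewer still, which matches the ordering described above regardless of how often each character appears online.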


-- 
Kie ekzistas vivo, ekzistas espero.



Re: Usage stats?

2015-03-27 Thread Richard Wordingham
On Fri, 27 Mar 2015 16:27:26 -0400
Michael Norton  wrote:

> Easy example: what's the code for [blank space] U+0020 across all
> language sets of Unicode?  Is it the same, i.e. 100%?

No.  In China, U+3000 IDEOGRAPHIC SPACE, which is the appropriate
ordinary intra-line white space character for use with ideographs, is
also commonly used in scripts which elsewhere would use U+0020 SPACE.
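
To see which space characters a given text actually uses, a minimal sketch along these lines could tally them. It assumes Python's unicodedata module and treats General_Category Zs as the working definition of "space"; the sample string and the function name are made up.

```python
import unicodedata
from collections import Counter


def space_tally(text: str) -> Counter:
    """Count each space-separator character (category Zs) in text."""
    return Counter(ch for ch in text if unicodedata.category(ch) == "Zs")


# Made-up sample mixing U+0020 and U+3000.
sample = "Unicode 標準\u3000第 7 版"
for ch, n in space_tally(sample).most_common():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {n}")
```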

Richard.


Re: Usage stats?

2015-03-27 Thread Eric Muller
Would a corpus like Wikipedia or Project Gutenberg be appropriate for
your purpose? Both are freely and easily accessible.
 and 
.


Eric.



Re: Usage stats?

2015-03-27 Thread Michael Norton
I'm trying to get a sense of the range and variance of the Unicode set in
the same way I have with hypertext on the web: for every HTML or XHTML
document URL, for example, there is going to be a *>0* minimum of *"<"* and
*">"* characters.   Depending on which markup set and schema(s) you are
using, char-MINs and (eventually) char-MAXs are useful to have.
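
A per-character minimum and maximum of this kind could be computed as in the sketch below. It assumes the documents are already available as strings; the sample pages and the function name are hypothetical.

```python
def char_min_max(documents: list[str], ch: str) -> tuple[int, int]:
    """Return the smallest and largest count of ch across documents."""
    counts = [doc.count(ch) for doc in documents]
    return min(counts), max(counts)


# Tiny made-up documents standing in for fetched pages.
pages = [
    "<html><body><p>Hello</p></body></html>",
    "<p>a &lt; b</p>",
]
for ch in "<>":
    lo, hi = char_min_max(pages, ch)
    print(f"{ch!r}: min={lo}, max={hi}")
```

Note that the result describes only the documents passed in; an empty list raises ValueError because there is nothing to take a minimum of.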

On Fri, Mar 27, 2015 at 5:03 PM, Michael Norton <
michaelanortons...@gmail.com> wrote:

> Doug Ewell's getting it.   He sent this back to me, so I asked him if he
> could provide the same dataset drawn from his written reply to me:
>
> For example, your original e-mail (327 characters) consists of:
>   U+0020 - 14.07%
>   U+0065 - 10.09%
>   U+0061 -  7.03%
>   U+0074 -  6.73%
>   U+006F -  5.81%
>
> This is good because when the volumes of traffic begin to exponentially
> increase over a space, if there are predominant formulations of Unicode for
> each, they need to be recognized for a number of reasons depending on which
> sector or, as you say, corpus, you're in.
>
> In the above example, I think it's safe to say U+0020 online, though I
> would like to compare with the other 30 "space" characters you mentioned
> Markus.   If I know traffic figures for where the other space characters
> are used, I can draw a pretty good estimation and correlation of it.
>
> On Fri, Mar 27, 2015 at 4:56 PM, Markus Scherer 
> wrote:
>
>> On Fri, Mar 27, 2015 at 1:27 PM, Michael Norton <
>> michaelanortons...@gmail.com> wrote:
>>
>>> Easy example: what's the code for [blank space] U+0020 across all
>>> language sets of Unicode?  Is it the same, i.e. 100%?
>>>
>>
>> I don't understand what you are asking, and I have a hunch you haven't
>> said it in a way that anyone else understands it either.
>>
>> The code point value that the Unicode Standard assigns to the normal
>> space is U+0020, but
>> - not every language uses spaces
>> - not every language that uses spaces uses them for the same purpose as
>> English
>> - there are some 30 other "space" characters in Unicode
>>
>> Statistics of character frequencies vary by corpus, as others have said.
>> Even if you "only" look "on the web", that's undefined until you specify a
>> crawling strategy. Dynamically generated content means that there is an
>> infinite number of "web pages". Every crawler will come up with a different
>> set.
>>
>> Maybe you are asking about statistics of character encodings? On the web?
>> Such as, Unicode vs. Shift-JIS vs. ISO 8859-2 etc.?
>>
>> markus
>>
>
>
>
> --
>
> Michael A. Norton, B.A. Cinema, M.P.A.
> My Cinema Home: http://www.NortonsNook.com
>
> "All great actors are mere mathematical masters of speech and the human
> body."
>
>
>
>
>


-- 

Michael A. Norton, B.A. Cinema, M.P.A.
My Cinema Home: http://www.NortonsNook.com

"All great actors are mere mathematical masters of speech and the human
body."


Re: Usage stats?

2015-03-27 Thread Michael Norton
Doug Ewell's getting it.   He sent this back to me, so I asked him if he
could provide the same dataset drawn from his written reply to me:

For example, your original e-mail (327 characters) consists of:
  U+0020 - 14.07%
  U+0065 - 10.09%
  U+0061 -  7.03%
  U+0074 -  6.73%
  U+006F -  5.81%
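
Percentages like the ones quoted above can be reproduced for any piece of text with a few lines of code. This is a minimal sketch, not the tool that was actually used; the function name and the sample string are made up, and each Unicode code point is counted as one character.

```python
from collections import Counter


def char_frequencies(text: str) -> list[tuple[str, int, float]]:
    """Return (code point label, count, percentage) sorted by count."""
    counts = Counter(text)
    total = len(text)
    return [
        (f"U+{ord(ch):04X}", n, 100.0 * n / total)
        for ch, n in counts.most_common()
    ]


# Print the five most frequent characters of a made-up sample.
for label, n, pct in char_frequencies("a quick test of the tally")[:5]:
    print(f"{label} - {pct:5.2f}%")
```

The figures only describe the text passed in, so a different message will give a different ranking.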

This is good because when the volumes of traffic begin to exponentially
increase over a space, if there are predominant formulations of Unicode for
each, they need to be recognized for a number of reasons depending on which
sector or, as you say, corpus, you're in.

In the above example, I think it's safe to say U+0020 online, though I
would like to compare with the other 30 "space" characters you mentioned
Markus.   If I know traffic figures for where the other space characters
are used, I can draw a pretty good estimation and correlation of it.

On Fri, Mar 27, 2015 at 4:56 PM, Markus Scherer 
wrote:

> On Fri, Mar 27, 2015 at 1:27 PM, Michael Norton <
> michaelanortons...@gmail.com> wrote:
>
>> Easy example: what's the code for [blank space] U+0020 across all language
>> sets of Unicode?  Is it the same, i.e. 100%?
>>
>
> I don't understand what you are asking, and I have a hunch you haven't
> said it in a way that anyone else understands it either.
>
> The code point value that the Unicode Standard assigns to the normal space
> is U+0020, but
> - not every language uses spaces
> - not every language that uses spaces uses them for the same purpose as
> English
> - there are some 30 other "space" characters in Unicode
>
> Statistics of character frequencies vary by corpus, as others have said.
> Even if you "only" look "on the web", that's undefined until you specify a
> crawling strategy. Dynamically generated content means that there is an
> infinite number of "web pages". Every crawler will come up with a different
> set.
>
> Maybe you are asking about statistics of character encodings? On the web?
> Such as, Unicode vs. Shift-JIS vs. ISO 8859-2 etc.?
>
> markus
>



-- 

Michael A. Norton, B.A. Cinema, M.P.A.
My Cinema Home: http://www.NortonsNook.com

"All great actors are mere mathematical masters of speech and the human
body."


Re: Usage stats?

2015-03-27 Thread Markus Scherer
On Fri, Mar 27, 2015 at 1:27 PM, Michael Norton <
michaelanortons...@gmail.com> wrote:

> Easy example: what's the code for [blank space] U+0020 across all language
> sets of Unicode?  Is it the same, i.e. 100%?
>

I don't understand what you are asking, and I have a hunch you haven't said
it in a way that anyone else understands it either.

The code point value that the Unicode Standard assigns to the normal space
is U+0020, but
- not every language uses spaces
- not every language that uses spaces uses them for the same purpose as
English
- there are some 30 other "space" characters in Unicode
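
For a quick look at the other space characters, the sketch below lists every code point in General_Category Zs using Python's unicodedata module. This is an assumption about what counts as a "space": Zs is a narrower set than the loose sense used above, and the exact list depends on the Unicode version bundled with the interpreter. The function name is made up.

```python
import sys
import unicodedata


def space_separators() -> list[str]:
    """List every code point whose General_Category is Zs."""
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Zs"
    ]


for ch in space_separators():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```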

Statistics of character frequencies vary by corpus, as others have said.
Even if you "only" look "on the web", that's undefined until you specify a
crawling strategy. Dynamically generated content means that there is an
infinite number of "web pages". Every crawler will come up with a different
set.

Maybe you are asking about statistics of character encodings? On the web?
Such as, Unicode vs. Shift-JIS vs. ISO 8859-2 etc.?

markus


Re: Usage stats?

2015-03-27 Thread Michael Norton
Thank you.   What's the count for "universal characters" at this time?  Eg:
[SP]

On Fri, Mar 27, 2015 at 4:40 PM, Phillips, Addison 
wrote:

>  What you might be looking for would be the CLDR project's "exemplar
> sets" (see for example [1]), which describes which characters are
> customarily used for a given language and which are sometimes used.
> However, this is not the same thing as statistical distribution. One of the
> points of Unicode is that any character can be used at any time in any
> document--regardless of language.
>
>
>
>
>
> [1]
> http://www.unicode.org/cldr/charts/27/by_type/core_data.alphabetic_information.main.html
>
>
>
> *From:* Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of *Michael
> Norton
> *Sent:* Friday, March 27, 2015 1:25 PM
> *To:* John D. Burger
> *Cc:* Vint Cerf; Unicode@unicode.org
> *Subject:* Re: Usage stats?
>
>
>
> Just using the tools and formulations we have at present ought to allow
> Unicode to produce a usage set without indexing the entire web which would
> provide implementors with an indication of variances for traffic, overflow,
> and override purposes relative to users of the standard.  If the figure
> varies significantly from page:website, website:region, region:language,
> for example, it simplifies our ability to standardize the set.
>
>
>
> I have particular concerns, but, like Google, they are proprietary.
>
>
>
> On Fri, Mar 27, 2015 at 4:23 PM, John D. Burger  wrote:
>
>  On Mar 27, 2015, at 15:57 , Michael Norton 
> wrote:
>
>
>
>  Why wouldn't Unicode itself have it?
>
>
>
> Because as Ken explained, acquiring (and constantly updating) such
> statistics would require roughly the effort that Google puts into its
> crawler. And it wouldn't include all the printed material that isn't on the
> web.
>
>
>
> Turning your question around, why would Unicode have this information?
> What would be the value, and how would it be worth the (considerable)
> effort required?
>
>
>
> - John Burger
>
>   MITRE
>
>
>
>
>
> On Fri, Mar 27, 2015 at 1:07 PM, Ken Whistler  wrote:
>
> Search engine companies (and in particular, Google) have such
> information squirreled away in their index databases, at least as
> far as usage stats for Unicode characters on the web go -- but it
> is proprietary information, and they generally don't publish
> information about such statistics.
>
> Perhaps there are researchers out there who have set web crawlers
> on a mission to generate such web statistics for publication, and maybe
> somebody on this list knows of such research -- but it would be
> virtually impossible to generate such information for the much
> wider collection of documents and data that are not easily accessible
> for web indexing. (Behind password walls, in pdf document archives,
> in proprietary databases, ... ) As an example of why this is a problem,
> consider the fact that there are *peta*bytes of information picked up
> and stored in databases from scanners and other devices used at
> tens of millions of retail points of sale. Such data, by its nature,
> would tend to skew heavily towards use of ASCII a-z and digits 0-9 in its
> character data. How would you end up weighting such (mostly
> publicly inaccessible) data in trying to count up for overall statistics
> on character use?
>
> There are more traditional usage count studies that focus on
> counts of character frequency within single language orthographies
> in single scripts (e.g., letter frequencies for French text), but I don't
> think that is what you were asking about.
>
> Here is some discussion of a similar question posted on stackoverflow:
>
>
> http://stackoverflow.com/questions/22184624/unicode-character-usage-statistics
>
> --Ken
>
> On 3/27/2015 9:31 AM, Michael Norton wrote:
>
> Hello and thank you for an incredible service (just joining the list).
>  Is there a list of usage statistics per character of the Unicode set
> available somewhere?
>
>
>
>
>
>
>
> --
>
>
> Michael A. Norton, B.A. Cinema, M.P.A.
>
> My Cinema Home: http://www.NortonsNook.com
>
>
>
> "All great actors are mere mathematical masters of speech and the human
> body."
>
>
>
>

RE: Usage stats?

2015-03-27 Thread Phillips, Addison
What you might be looking for would be the CLDR project’s “exemplar sets” (see 
for example [1]), which describes which characters are customarily used for a 
given language and which are sometimes used. However, this is not the same 
thing as statistical distribution. One of the points of Unicode is that any 
character can be used at any time in any document—regardless of language.
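
The difference between an exemplar set and a frequency distribution can be illustrated with a small membership check. This is only a sketch: the exemplar set below is a hypothetical stand-in typed by hand, not data copied from CLDR, and the function name is invented.

```python
# Hypothetical stand-in for an exemplar set; NOT copied from CLDR.
HYPOTHETICAL_EXEMPLARS = set("abcdefghijklmnopqrstuvwxyz")


def outside_exemplars(text: str, exemplars: set[str]) -> set[str]:
    """Return the letters in text that are not in the exemplar set."""
    return {ch for ch in text.lower() if ch.isalpha() and ch not in exemplars}


# Sample text "a naïve café", written with escapes for the accented letters.
print(sorted(outside_exemplars("a na\u00efve caf\u00e9", HYPOTHETICAL_EXEMPLARS)))
```

A check like this says only whether a character is customary for a language, not how often it occurs, and characters outside the set can still appear in real documents.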


[1] 
http://www.unicode.org/cldr/charts/27/by_type/core_data.alphabetic_information.main.html

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Michael Norton
Sent: Friday, March 27, 2015 1:25 PM
To: John D. Burger
Cc: Vint Cerf; Unicode@unicode.org
Subject: Re: Usage stats?

Just using the tools and formulations we have at present ought to allow Unicode 
to produce a usage set without indexing the entire web which would provide 
implementors with an indication of variances for traffic, overflow, and 
override purposes relative to users of the standard.  If the figure varies 
significantly from page:website, website:region, region:language, for example, 
it simplifies our ability to standardize the set.

I have particular concerns, but, like Google, they are proprietary.

On Fri, Mar 27, 2015 at 4:23 PM, John D. Burger <j...@mitre.org> wrote:
On Mar 27, 2015, at 15:57, Michael Norton <michaelanortons...@gmail.com> wrote:

Why wouldn't Unicode itself have it?

Because as Ken explained, acquiring (and constantly updating) such statistics 
would require roughly the effort that Google puts into its crawler. And it 
wouldn't include all the printed material that isn't on the web.

Turning your question around, why would Unicode have this information? What 
would be the value, and how would it be worth the (considerable) effort 
required?

- John Burger
  MITRE


On Fri, Mar 27, 2015 at 1:07 PM, Ken Whistler <kenwhist...@att.net> wrote:
Search engine companies (and in particular, Google) have such
information squirreled away in their index databases, at least as
far as usage stats for Unicode characters on the web go -- but it
is proprietary information, and they generally don't publish
information about such statistics.

Perhaps there are researchers out there who have set web crawlers
on a mission to generate such web statistics for publication, and maybe
somebody on this list knows of such research -- but it would be
virtually impossible to generate such information for the much
wider collection of documents and data that are not easily accessible
for web indexing. (Behind password walls, in pdf document archives,
in proprietary databases, ... ) As an example of why this is a problem,
consider the fact that there are *peta*bytes of information picked up
and stored in databases from scanners and other devices used at
tens of millions of retail points of sale. Such data, by its nature, would tend
to skew heavily towards use of ASCII a-z and digits 0-9 in its
character data. How would you end up weighting such (mostly
publicly inaccessible) data in trying to count up for overall statistics
on character use?

There are more traditional usage count studies that focus on
counts of character frequency within single language orthographies
in single scripts (e.g., letter frequencies for French text), but I don't
think that is what you were asking about.

Here is some discussion of a similar question posted on stackoverflow:

http://stackoverflow.com/questions/22184624/unicode-character-usage-statistics

--Ken

On 3/27/2015 9:31 AM, Michael Norton wrote:
Hello and thank you for an incredible service (just joining the list).   Is 
there a list of usage statistics per character of the Unicode set available 
somewhere?





--

Michael A. Norton, B.A. Cinema, M.P.A.
My Cinema Home: http://www.NortonsNook.com

"All great actors are mere mathematical masters of speech and the human body."






--

Michael A. Norton, B.A. Cinema, M.P.A.
My Cinema Home: http://www.NortonsNook.com

"All great actors are mere mathematical masters of speech and the human body."




Re: Usage stats?

2015-03-27 Thread Michael Norton
Easy example: what's the code for [blank space] U+0020 across all language
sets of Unicode?  Is it the same, i.e., 100%?

On Fri, Mar 27, 2015 at 4:24 PM, Michael Norton <
michaelanortons...@gmail.com> wrote:

> Just using the tools and formulations we have at present ought to allow
> Unicode to produce a usage set without indexing the entire web which would
> provide implementors with an indication of variances for traffic, overflow,
> and override purposes relative to users of the standard.  If the figure
> varies significantly from page:website, website:region, region:language,
> for example, it simplifies our ability to standardize the set.
>
> I have particular concerns, but, like Google, they are proprietary.
>
> On Fri, Mar 27, 2015 at 4:23 PM, John D. Burger  wrote:
>
>> On Mar 27, 2015, at 15:57 , Michael Norton 
>> wrote:
>>
>> Why wouldn't Unicode itself have it?
>>
>>
>> Because as Ken explained, acquiring (and constantly updating) such
>> statistics would require roughly the effort that Google puts into its
>> crawler. And it wouldn't include all the printed material that isn't on the
>> web.
>>
>> Turning your question around, why would Unicode have this information?
>> What would be the value, and how would it be worth the (considerable)
>> effort required?
>>
>> - John Burger
>>   MITRE
>>
>>
>> On Fri, Mar 27, 2015 at 1:07 PM, Ken Whistler 
>> wrote:
>>
>>> Search engine companies (and in particular, Google) have such
>>> information squirreled away in their index databases, at least as
>>> far as usage stats for Unicode characters on the web go -- but it
>>> is proprietary information, and they generally don't publish
>>> information about such statistics.
>>>
>>> Perhaps there are researchers out there who have set web crawlers
>>> on a mission to generate such web statistics for publication, and maybe
>>> somebody on this list knows of such research -- but it would be
>>> virtually impossible to generate such information for the much
>>> wider collection of documents and data that are not easily accessible
>>> for web indexing. (Behind password walls, in pdf document archives,
>>> in proprietary databases, ... ) As an example of why this is a problem,
>>> consider the fact that there are *peta*bytes of information picked up
>>> and stored in databases from scanners and other devices used at
>>> tens of millions of retail points of sale. Such data, by its nature,
>>> would tend to skew heavily towards use of ASCII a-z and digits 0-9 in its
>>> character data. How would you end up weighting such (mostly
>>> publicly inaccessible) data in trying to count up for overall statistics
>>> on character use?
>>>
>>> There are more traditional usage count studies that focus on
>>> counts of character frequency within single language orthographies
>>> in single scripts (e.g., letter frequencies for French text), but I don't
>>> think that is what you were asking about.
>>>
>>> Here is some discussion of a similar question posted on stackoverflow:
>>>
>>> http://stackoverflow.com/questions/22184624/unicode-character-usage-statistics
>>>
>>> --Ken
>>>
>>> On 3/27/2015 9:31 AM, Michael Norton wrote:
>>>
>>>> Hello and thank you for an incredible service (just joining the list).
>>>> Is there a list of usage statistics per character of the Unicode set
>>>> available somewhere?





Re: Usage stats?

2015-03-27 Thread Michael Norton
Just using the tools and formulations we have at present ought to allow
Unicode to produce a usage set without indexing the entire web which would
provide implementors with an indication of variances for traffic, overflow,
and override purposes relative to users of the standard.  If the figure
varies significantly from page:website, website:region, region:language,
for example, it simplifies our ability to standardize the set.

I have particular concerns, but, like Google, they are proprietary.

On Fri, Mar 27, 2015 at 4:23 PM, John D. Burger  wrote:

> On Mar 27, 2015, at 15:57 , Michael Norton 
> wrote:
>
> Why wouldn't Unicode itself have it?
>
>
> Because as Ken explained, acquiring (and constantly updating) such
> statistics would require roughly the effort that Google puts into its
> crawler. And it wouldn't include all the printed material that isn't on the
> web.
>
> Turning your question around, why would Unicode have this information?
> What would be the value, and how would it be worth the (considerable)
> effort required?
>
> - John Burger
>   MITRE
>
>
> On Fri, Mar 27, 2015 at 1:07 PM, Ken Whistler  wrote:
>
>> Search engine companies (and in particular, Google) have such
>> information squirreled away in their index databases, at least as
>> far as usage stats for Unicode characters on the web go -- but it
>> is proprietary information, and they generally don't publish
>> information about such statistics.
>>
>> Perhaps there are researchers out there who have set web crawlers
>> on a mission to generate such web statistics for publication, and maybe
>> somebody on this list knows of such research -- but it would be
>> virtually impossible to generate such information for the much
>> wider collection of documents and data that are not easily accessible
>> for web indexing. (Behind password walls, in pdf document archives,
>> in proprietary databases, ... ) As an example of why this is a problem,
>> consider the fact that there are *peta*bytes of information picked up
>> and stored in databases from scanners and other devices used at
>> tens of millions of retail points of sale. Such data, by its nature,
>> would tend to skew heavily towards use of ASCII a-z and digits 0-9 in its
>> character data. How would you end up weighting such (mostly
>> publicly inaccessible) data in trying to count up for overall statistics
>> on character use?
>>
>> There are more traditional usage count studies that focus on
>> counts of character frequency within single language orthographies
>> in single scripts (e.g., letter frequencies for French text), but I don't
>> think that is what you were asking about.
>>
>> Here is some discussion of a similar question posted on stackoverflow:
>>
>> http://stackoverflow.com/questions/22184624/unicode-character-usage-statistics
>>
>> --Ken
>>
>> On 3/27/2015 9:31 AM, Michael Norton wrote:
>>
>>> Hello and thank you for an incredible service (just joining the list).
>>>  Is there a list of usage statistics per character of the Unicode set
>>> available somewhere?
>>>
>>>
>>>


Re: Usage stats?

2015-03-27 Thread John D. Burger
On Mar 27, 2015, at 15:57, Michael Norton <michaelanortons...@gmail.com> wrote:

> Why wouldn't Unicode itself have it?

Because as Ken explained, acquiring (and constantly updating) such statistics 
would require roughly the effort that Google puts into its crawler. And it 
wouldn't include all the printed material that isn't on the web.

Turning your question around, why would Unicode have this information? What 
would be the value, and how would it be worth the (considerable) effort 
required?

- John Burger
  MITRE

> 
> On Fri, Mar 27, 2015 at 1:07 PM, Ken Whistler wrote:
> Search engine companies (and in particular, Google) have such
> information squirreled away in their index databases, at least as
> far as usage stats for Unicode characters on the web go -- but it
> is proprietary information, and they generally don't publish
> information about such statistics.
> 
> Perhaps there are researchers out there who have set web crawlers
> on a mission to generate such web statistics for publication, and maybe
> somebody on this list knows of such research -- but it would be
> virtually impossible to generate such information for the much
> wider collection of documents and data that are not easily accessible
> for web indexing. (Behind password walls, in pdf document archives,
> in proprietary databases, ... ) As an example of why this is a problem,
> consider the fact that there are *peta*bytes of information picked up
> and stored in databases from scanners and other devices used at
> tens of millions of retail points of sale. Such data, by its nature,
> would tend to skew heavily towards use of ASCII a-z and digits 0-9 in its
> character data. How would you end up weighting such (mostly
> publicly inaccessible) data in trying to count up for overall statistics
> on character use?
> 
> There are more traditional usage count studies that focus on
> counts of character frequency within single language orthographies
> in single scripts (e.g., letter frequencies for French text), but I don't
> think that is what you were asking about.
> 
> Here is some discussion of a similar question posted on stackoverflow:
> 
> http://stackoverflow.com/questions/22184624/unicode-character-usage-statistics
>  
> 
> 
> --Ken
> 
> On 3/27/2015 9:31 AM, Michael Norton wrote:
> Hello and thank you for an incredible service (just joining the list).   Is 
> there a list of usage statistics per character of the Unicode set available 
> somewhere?
> 
> 
> 


Re: Usage stats?

2015-03-27 Thread Michael Norton
(I know this is way too simplistic a response but it is kind of like giving
everyone an invisible cloak and an invisible dagger and not telling them
what a cloak and dagger is for [cutting butter & keeping warm]).

On Fri, Mar 27, 2015 at 3:57 PM, Michael Norton <
michaelanortons...@gmail.com> wrote:

> Why wouldn't Unicode itself have it?
>
> On Fri, Mar 27, 2015 at 1:07 PM, Ken Whistler  wrote:
>
>> Search engine companies (and in particular, Google) have such
>> information squirreled away in their index databases, at least as
>> far as usage stats for Unicode characters on the web go -- but it
>> is proprietary information, and they generally don't publish
>> information about such statistics.
>>
>> Perhaps there are researchers out there who have set web crawlers
>> on a mission to generate such web statistics for publication, and maybe
>> somebody on this list knows of such research -- but it would be
>> virtually impossible to generate such information for the much
>> wider collection of documents and data that are not easily accessible
>> for web indexing. (Behind password walls, in pdf document archives,
>> in proprietary databases, ... ) As an example of why this is a problem,
>> consider the fact that there are *peta*bytes of information picked up
>> and stored in databases from scanners and other devices used at
>> tens of millions of retail points of sale. Such data, by its nature,
>> would tend to skew heavily towards use of ASCII a-z and digits 0-9 in its
>> character data. How would you end up weighting such (mostly
>> publicly inaccessible) data in trying to count up for overall statistics
>> on character use?
>>
>> There are more traditional usage count studies that focus on
>> counts of character frequency within single language orthographies
>> in single scripts (e.g., letter frequencies for French text), but I don't
>> think that is what you were asking about.
>>
>> Here is some discussion of a similar question posted on stackoverflow:
>>
>> http://stackoverflow.com/questions/22184624/unicode-character-usage-statistics
>>
>> --Ken
>>
>> On 3/27/2015 9:31 AM, Michael Norton wrote:
>>
>>> Hello and thank you for an incredible service (just joining the list).
>>>  Is there a list of usage statistics per character of the Unicode set
>>> available somewhere?
>>>
>>>
>>>


Re: Usage stats?

2015-03-27 Thread Doug Ewell
Michael Norton  wrote:

> Why wouldn't Unicode itself have it?

Probably because the Unicode Consortium isn't responsible for indexing
the entire web. Would you expect it to be?

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸



Re: Usage stats?

2015-03-27 Thread Michael Norton
Why wouldn't Unicode itself have it?

On Fri, Mar 27, 2015 at 1:07 PM, Ken Whistler  wrote:

> Search engine companies (and in particular, Google) have such
> information squirreled away in their index databases, at least as
> far as usage stats for Unicode characters on the web go -- but it
> is proprietary information, and they generally don't publish
> information about such statistics.
>
> Perhaps there are researchers out there who have set web crawlers
> on a mission to generate such web statistics for publication, and maybe
> somebody on this list knows of such research -- but it would be
> virtually impossible to generate such information for the much
> wider collection of documents and data that are not easily accessible
> for web indexing. (Behind password walls, in pdf document archives,
> in proprietary databases, ... ) As an example of why this is a problem,
> consider the fact that there are *peta*bytes of information picked up
> and stored in databases from scanners and other devices used at
> tens of millions of retail points of sale. Such data, by its nature,
> would tend to skew heavily towards use of ASCII a-z and digits 0-9 in its
> character data. How would you end up weighting such (mostly
> publicly inaccessible) data in trying to count up for overall statistics
> on character use?
>
> There are more traditional usage count studies that focus on
> counts of character frequency within single language orthographies
> in single scripts (e.g., letter frequencies for French text), but I don't
> think that is what you were asking about.
>
> Here is some discussion of a similar question posted on stackoverflow:
>
> http://stackoverflow.com/questions/22184624/unicode-character-usage-statistics
>
> --Ken
>
> On 3/27/2015 9:31 AM, Michael Norton wrote:
>
>> Hello and thank you for an incredible service (just joining the list).
>>  Is there a list of usage statistics per character of the Unicode set
>> available somewhere?
>>
>>
>>


Re: Usage stats?

2015-03-27 Thread Ken Whistler

Search engine companies (and in particular, Google) have such
information squirreled away in their index databases, at least as
far as usage stats for Unicode characters on the web go -- but it
is proprietary information, and they generally don't publish
information about such statistics.

Perhaps there are researchers out there who have set web crawlers
on a mission to generate such web statistics for publication, and maybe
somebody on this list knows of such research -- but it would be
virtually impossible to generate such information for the much
wider collection of documents and data that are not easily accessible
for web indexing. (Behind password walls, in pdf document archives,
in proprietary databases, ... ) As an example of why this is a problem,
consider the fact that there are *peta*bytes of information picked up
and stored in databases from scanners and other devices used at
tens of millions of retail points of sale. Such data, by its nature, 
would tend to skew heavily towards use of ASCII a-z and digits 0-9 in its
character data. How would you end up weighting such (mostly
publicly inaccessible) data in trying to count up for overall statistics
on character use?
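
As a rough illustration of why the weighting matters (a sketch only: the
two sample strings, the corpus names, and the weights are invented for the
example, and `weighted_share` is a hypothetical helper, not anything the
Unicode Consortium publishes), the overall share of a single character can
swing widely depending on how much weight the inaccessible data is given:

```python
from collections import Counter


def weighted_share(corpora, weights, char):
    """Overall share of `char`, scaling each corpus's own share by its weight.

    Assumes every corpus text is non-empty.
    """
    total_weight = sum(weights[name] for name in corpora)
    share = 0.0
    for name, text in corpora.items():
        counts = Counter(text)
        share += weights[name] * counts[char] / len(text)
    return share / total_weight


# Hypothetical samples: one crawled web page, one point-of-sale record.
corpora = {
    "web_page": "naïve café menu",
    "pos_scanner": "SKU 0042 QTY 0003",
}

equal = weighted_share(corpora, {"web_page": 1, "pos_scanner": 1}, "0")
pos_heavy = weighted_share(corpora, {"web_page": 1, "pos_scanner": 99}, "0")
print(f"share of U+0030: equal weights {equal:.3f}, POS-heavy {pos_heavy:.3f}")
```

Neither weighting is obviously the right one, which is the difficulty.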

There are more traditional usage count studies that focus on
counts of character frequency within single language orthographies
in single scripts (e.g., letter frequencies for French text), but I don't
think that is what you were asking about.
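
For what it is worth, a single-language count of that traditional kind is
easy to sketch. The following is a minimal example, assuming the text is
already available as a Python string; the French sample sentence and the
helper name `char_frequencies` are made up for illustration, and a real
study would of course use a large corpus:

```python
from collections import Counter
import unicodedata


def char_frequencies(text):
    """Return {character: relative frequency} for the letters in `text`."""
    # Normalize to NFC so that precomposed and decomposed forms of the
    # same accented letter are counted together.
    letters = [ch for ch in unicodedata.normalize("NFC", text) if ch.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


sample = "Été comme hiver, le café reste ouvert."  # invented sample text
freqs = char_frequencies(sample.lower())

# Print the five most frequent letters with their code points and names.
for ch, share in sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))[:5]:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {share:.3f}")
```

Such a count only describes the sample it was run on; it says nothing about
usage across all of Unicode.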

Here is some discussion of a similar question posted on stackoverflow:

http://stackoverflow.com/questions/22184624/unicode-character-usage-statistics

--Ken

On 3/27/2015 9:31 AM, Michael Norton wrote:
> Hello and thank you for an incredible service (just joining the list).
> Is there a list of usage statistics per character of the Unicode set
> available somewhere?




