Re: Usage stats?

Michael Norton Fri, 27 Mar 2015 13:47:00 -0700

Thank you. What's the count for "universal characters" at this time? E.g.:
[SP]


On Fri, Mar 27, 2015 at 4:40 PM, Phillips, Addison <[email protected]>
wrote:

>  What you might be looking for would be the CLDR project's "exemplar
> sets" (see for example [1]), which describe which characters are
> customarily used for a given language and which are sometimes used.
> However, this is not the same thing as statistical distribution. One of the
> points of Unicode is that any character can be used at any time in any
> document--regardless of language.
>
>
>
>
>
> [1]
> http://www.unicode.org/cldr/charts/27/by_type/core_data.alphabetic_information.main.html
>
>
>
> *From:* Unicode [mailto:[email protected]] *On Behalf Of *Michael
> Norton
> *Sent:* Friday, March 27, 2015 1:25 PM
> *To:* John D. Burger
> *Cc:* Vint Cerf; [email protected]
> *Subject:* Re: Usage stats?
>
>
>
> Just using the tools and formulations we have at present ought to allow
> Unicode to produce a usage set without indexing the entire web which would
> provide implementors with an indication of variances for traffic, overflow,
> and override purposes relative to users of the standard.  If the figure
> varies significantly from page:website, website:region, region:language,
> for example, it simplifies our ability to standardize the set.
>
>
>
> I have particular concerns, but, like Google, they are proprietary.
>
>
>
> On Fri, Mar 27, 2015 at 4:23 PM, John D. Burger <[email protected]> wrote:
>
>  On Mar 27, 2015, at 15:57 , Michael Norton <[email protected]>
> wrote:
>
>
>
>  Why wouldn't Unicode itself have it?
>
>
>
> Because as Ken explained, acquiring (and constantly updating) such
> statistics would require roughly the effort that Google puts into its
> crawler. And it wouldn't include all the printed material that isn't on the
> web.
>
>
>
> Turning your question around, why would Unicode have this information?
> What would be the value, and how would it be worth the (considerable)
> effort required?
>
>
>
> - John Burger
>
>   MITRE
>
>
>
>
>
> On Fri, Mar 27, 2015 at 1:07 PM, Ken Whistler <[email protected]> wrote:
>
> Search engine companies (and in particular, Google) have such
> information squirreled away in their index databases, at least as
> far as usage stats for Unicode characters on the web go -- but it
> is proprietary information, and they generally don't publish
> information about such statistics.
>
> Perhaps there are researchers out there who have set web crawlers
> on a mission to generate such web statistics for publication, and maybe
> somebody on this list knows of such research -- but it would be
> virtually impossible to generate such information for the much
> wider collection of documents and data that are not easily accessible
> for web indexing. (Behind password walls, in pdf document archives,
> in proprietary databases, ... ) As an example of why this is a problem,
> consider the fact that there are *peta*bytes of information picked up
> and stored in databases from scanners and other devices used at
> tens of millions of retail points of sale. Such data, by its nature, would
> tend to skew heavily towards use of ASCII a-z and digits 0-9 in its
> character data. How would you end up weighting such (mostly
> publicly inaccessible) data in trying to count up for overall statistics
> on character use?
>
> There are more traditional usage count studies that focus on
> counts of character frequency within single language orthographies
> in single scripts (e.g., letter frequencies for French text), but I don't
> think that is what you were asking about.
>
> Here is some discussion of a similar question posted on stackoverflow:
>
>
> http://stackoverflow.com/questions/22184624/unicode-character-usage-statistics
>
> --Ken
>
> On 3/27/2015 9:31 AM, Michael Norton wrote:
>
> Hello and thank you for an incredible service (just joining the list).
>  Is there a list of usage statistics per character of the Unicode set
> available somewhere?
>
>
> _______________________________________________
> Unicode mailing list
> [email protected]
> http://unicode.org/mailman/listinfo/unicode
>
>
>
>
>
> --
>
>
> Michael A. Norton, B.A. Cinema, M.P.A.
>
> My Cinema Home: http://www.NortonsNook.com <http://www.nortonsnook.com/>
>
>
>
> "All great actors are mere mathematical masters of speech and the human
> body."
>
>
>
>
>
>
>
>
>
>
>
> --
>
>
> Michael A. Norton, B.A. Cinema, M.P.A.
>
> My Cinema Home: http://www.NortonsNook.com
>
>
>
> "All great actors are mere mathematical masters of speech and the human
> body."
>
>
>
>
>


-- 

Michael A. Norton, B.A. Cinema, M.P.A.
My Cinema Home: http://www.NortonsNook.com

"All great actors are mere mathematical masters of speech and the human
body."

