Thank you. What's the count for "universal characters" at this time? E.g.: [SP]
On Fri, Mar 27, 2015 at 4:40 PM, Phillips, Addison <[email protected]> wrote:

> What you might be looking for would be the CLDR project's "exemplar
> sets" (see for example [1]), which describe which characters are
> customarily used for a given language and which are sometimes used.
> However, this is not the same thing as statistical distribution. One of the
> points of Unicode is that any character can be used at any time in any
> document--regardless of language.
>
> [1] http://www.unicode.org/cldr/charts/27/by_type/core_data.alphabetic_information.main.html
>
> *From:* Unicode [mailto:[email protected]] *On Behalf Of* Michael Norton
> *Sent:* Friday, March 27, 2015 1:25 PM
> *To:* John D. Burger
> *Cc:* Vint Cerf; [email protected]
> *Subject:* Re: Usage stats?
>
> Just using the tools and formulations we have at present ought to allow
> Unicode to produce a usage set without indexing the entire web which would
> provide implementors with an indication of variances for traffic, overflow,
> and override purposes relative to users of the standard. If the figure
> varies significantly from page:website, website:region, region:language,
> for example, it simplifies our ability to standardize the set.
>
> I have particular concerns, but, like Google, they are proprietary.
>
> On Fri, Mar 27, 2015 at 4:23 PM, John D. Burger <[email protected]> wrote:
>
> On Mar 27, 2015, at 15:57, Michael Norton <[email protected]> wrote:
>
> Why wouldn't Unicode itself have it?
>
> Because as Ken explained, acquiring (and constantly updating) such
> statistics would require roughly the effort that Google puts into its
> crawler. And it wouldn't include all the printed material that isn't on the
> web.
>
> Turning your question around, why would Unicode have this information?
> What would be the value, and how would it be worth the (considerable)
> effort required?
>
> - John Burger
> MITRE
>
> On Fri, Mar 27, 2015 at 1:07 PM, Ken Whistler <[email protected]> wrote:
>
> Search engine companies (and in particular, Google) have such
> information squirreled away in their index databases, at least as
> far as usage stats for Unicode characters on the web go -- but it
> is proprietary information, and they generally don't publish
> information about such statistics.
>
> Perhaps there are researchers out there who have set web crawlers
> on a mission to generate such web statistics for publication, and maybe
> somebody on this list knows of such research -- but it would be
> virtually impossible to generate such information for the much
> wider collection of documents and data that are not easily accessible
> for web indexing. (Behind password walls, in pdf document archives,
> in proprietary databases, ... ) As an example of why this is a problem,
> consider the fact that there are *peta*bytes of information picked up
> and stored in databases from scanners and other devices used at
> tens of millions of retail points of sale. Such data, by its nature, would
> tend to skew heavily towards use of ASCII a-z and digits 0-9 in its
> character data. How would you end up weighting such (mostly
> publicly inaccessible) data in trying to count up for overall statistics
> on character use?
>
> There are more traditional usage count studies that focus on
> counts of character frequency within single language orthographies
> in single scripts (e.g., letter frequencies for French text), but I don't
> think that is what you were asking about.
>
> Here is some discussion of a similar question posted on stackoverflow:
>
> http://stackoverflow.com/questions/22184624/unicode-character-usage-statistics
>
> --Ken
>
> On 3/27/2015 9:31 AM, Michael Norton wrote:
>
> Hello and thank you for an incredible service (just joining the list).
> Is there a list of usage statistics per character of the Unicode set
> available somewhere?

--
Michael A. Norton, B.A. Cinema, M.P.A.
My Cinema Home: http://www.NortonsNook.com
"All great actors are mere mathematical masters of speech and the human body."
_______________________________________________
Unicode mailing list
[email protected]
http://unicode.org/mailman/listinfo/unicode

