Easy example: what's the code for [blank space] U+0020 across all language sets of Unicode? Is it the same, i.e., 100%?
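
To make the question concrete, here is a minimal sketch in Python 3. The sample strings and the helper name `space_share` are invented for illustration and are not real usage data. It shows only that U+0020 is the same code point whatever the language of the surrounding text, while its share of the characters differs from sample to sample.

```python
from collections import Counter

# Hypothetical sample strings, made up for this sketch; not real usage data.
SAMPLES = {
    "English": "the quick brown fox",
    "French": "le renard brun rapide",
    "Greek": "η γρήγορη καφέ αλεπού",
}

SPACE = "\u0020"  # U+0020 SPACE


def space_share(text: str) -> float:
    """Return the fraction of code points in text that are U+0020."""
    counts = Counter(text)
    return counts[SPACE] / len(text) if text else 0.0


for language, text in SAMPLES.items():
    # The code point is identical in every sample; only the share differs.
    print(f"{language}: U+{ord(SPACE):04X} share = {space_share(text):.1%}")
```

A count like this over a handful of strings says nothing about usage across the web or in private data, which is the harder problem discussed in the thread below.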

On Fri, Mar 27, 2015 at 4:24 PM, Michael Norton <[email protected]> wrote:

> Just using the tools and formulations we have at present ought to allow
> Unicode to produce a usage set without indexing the entire web which would
> provide implementors with an indication of variances for traffic, overflow,
> and override purposes relative to users of the standard. If the figure
> varies significantly from page:website, website:region, region:language,
> for example, it simplifies our ability to standardize the set.
>
> I have particular concerns, but, like Google, they are proprietary.
>
> On Fri, Mar 27, 2015 at 4:23 PM, John D. Burger <[email protected]> wrote:
>
>> On Mar 27, 2015, at 15:57, Michael Norton <[email protected]>
>> wrote:
>>
>> Why wouldn't Unicode itself have it?
>>
>>
>> Because as Ken explained, acquiring (and constantly updating) such
>> statistics would require roughly the effort that Google puts into its
>> crawler. And it wouldn't include all the printed material that isn't on the
>> web.
>>
>> Turning your question around, why would Unicode have this information?
>> What would be the value, and how would it be worth the (considerable)
>> effort required?
>>
>> - John Burger
>> MITRE
>>
>>
>> On Fri, Mar 27, 2015 at 1:07 PM, Ken Whistler <[email protected]>
>> wrote:
>>
>>> Search engine companies (and in particular, Google) have such
>>> information squirreled away in their index databases, at least as
>>> far as usage stats for Unicode characters on the web go -- but it
>>> is proprietary information, and they generally don't publish
>>> information about such statistics.
>>>
>>> Perhaps there are researchers out there who have set web crawlers
>>> on a mission to generate such web statistics for publication, and maybe
>>> somebody on this list knows of such research -- but it would be
>>> virtually impossible to generate such information for the much
>>> wider collection of documents and data that are not easily accessible
>>> for web indexing. (Behind password walls, in pdf document archives,
>>> in proprietary databases, ... ) As an example of why this is a problem,
>>> consider the fact that there are *peta*bytes of information picked up
>>> and stored in databases from scanners and other devices used at
>>> tens of millions of retail points of sale. Such data, by its nature,
>>> would tend to skew heavily towards use of ASCII a-z and digits 0-9
>>> in its character data. How would you end up weighting such (mostly
>>> publicly inaccessible) data in trying to count up for overall statistics
>>> on character use?
>>>
>>> There are more traditional usage count studies that focus on
>>> counts of character frequency within single language orthographies
>>> in single scripts (e.g., letter frequencies for French text), but I don't
>>> think that is what you were asking about.
>>>
>>> Here is some discussion of a similar question posted on stackoverflow:
>>>
>>> http://stackoverflow.com/questions/22184624/unicode-character-usage-statistics
>>>
>>> --Ken
>>>
>>> On 3/27/2015 9:31 AM, Michael Norton wrote:
>>>
>>>> Hello and thank you for an incredible service (just joining the list).
>>>> Is there a list of usage statistics per character of the Unicode set
>>>> available somewhere?

--
Michael A. Norton, B.A. Cinema, M.P.A.
My Cinema Home: http://www.NortonsNook.com

"All great actors are mere mathematical masters of speech and the human
body."
_______________________________________________ Unicode mailing list [email protected] http://unicode.org/mailman/listinfo/unicode

