BartC <b...@freeuk.com>:

> On 12/03/2016 12:13, Marko Rauhamaa wrote:
>> BartC <b...@freeuk.com>:
>>
>>> If you're looking at fast processing of language source code (in a
>>> thread partly about efficiency), then you cannot ignore the fact
>>> that the vast majority of characters being processed are going to
>>> have ASCII codes.
>>
>> I don't know why you would optimize for inputting program source
>> code. Text in general has left ASCII behind a long time ago. Just go
>> to Wikipedia and click on any of the other languages.
>>
>> Why, look at the *English* page on Hillary Clinton:
>>
>>    Hillary Diane Rodham Clinton /ˈhɪləri daɪˈæn ˈrɒdəm ˈklɪntən/
>>    (born October 26, 1947) is an American politician. <URL:
>>    https://en.wikipedia.org/wiki/Hillary_Clinton>
>>
>> You couldn't get past the first sentence in ASCII.
>
> I saved that page locally as a .htm file in UTF-8 encoding. I ran a
> modified version of my benchmark, and it appeared that 99.7% of the
> bytes had ASCII codes. The other 0.3% presumably were multi-byte
> sequences, so that the actual proportion of Unicode characters would
> be even less.
>
> I then saved the Arabic version of the page, which visually, when
> rendered, consists of 99% Arabic script. But the .htm file was still
> 80% ASCII!
>
> So what were you saying about ASCII being practically obsolete ... ?
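(BartC's actual benchmark isn't shown; here is a minimal sketch of the
kind of measurement he describes. It relies on a property of UTF-8:
every byte of a multi-byte sequence has its high bit set, so bytes
below 0x80 are exactly the ASCII characters.)

```python
def ascii_ratio(data: bytes) -> float:
    """Return the fraction of bytes that are ASCII (0x00-0x7F)."""
    if not data:
        return 0.0
    # In UTF-8, continuation and lead bytes of multi-byte sequences
    # are all >= 0x80, so this counts exactly the ASCII characters.
    ascii_count = sum(1 for b in data if b < 0x80)
    return ascii_count / len(data)

if __name__ == "__main__":
    # E.g. run against a locally saved page (path is hypothetical):
    #   with open("Hillary_Clinton.htm", "rb") as f:
    #       data = f.read()
    sample = "Hillary Diane Rodham Clinton /ˈhɪləri daɪˈæn/".encode("utf-8")
    print(f"{ascii_ratio(sample):.1%} of bytes are ASCII")
```

Counting bytes slightly understates the share of non-ASCII *characters*,
since each non-ASCII character contributes two or more bytes.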
Yes, HTML markup is all ASCII. However, as you say, the text content is
often anything but.

What I'm saying is that if you are designing a new programming language
and associated ecosystem, you are well advised to take Unicode into
account from the start. Take advantage of hindsight; Python, Linux, C,
Java and Windows were not so lucky.


Marko
--
https://mail.python.org/mailman/listinfo/python-list