On Thu 29 Apr 2021 at 11:33:29 (-0400), rhkra...@gmail.com wrote:
> On Thursday, April 29, 2021 09:03:59 AM Albretch Mueller wrote:

> > > What is "alpha-offset format"?
> >
> > we, corpora research kinds of folks, need to process thousand of
> > files as other people process bytes. UTF8 was basically an
> > Americanizierung of alle alphabets.

"Americanizierung" merely seems an odd way of saying that UTF8 favours the Roman alphabet in terms of the length of encoded strings.

> > UTF is great to describe an
> > alphabet but not for text files.

Are we now discussing non-alphabetic scripts? Beats me.

> > UTF8 turned all files into streams not good for questions such as
> > what is the charatcer/string sequence starting on the nth addressable
> > unit of a file ...

It's very good for such questions, because it is unambiguous whether the nth addressable unit (let's say byte) is the start of a character or not. When it /is/ the start of a character, the number of bytes in the character is given by the leading one to five bits of that first byte, so you don't have to hunt for some kind of terminating byte.

> > Doing that with utF8 is from way too complicated to impossible.

OTOH it's not straightforward to determine at which byte the nth character starts. Perhaps that is what you meant to say.

> > Also
> > alpha offset nicely splits the files segments into its different
> > parts: ALPHABETICAL text, js, css, ...

I'd be interested to see examples of JavaScript and Cascading Style Sheets written in non-ALPHABETICAL text.

> Ok, but what does it look like? (What is the format?)
>
> A google search shows only links to this thread and some page about its
> relevance to aiming a telescope -- I strongly suspect that is not relevant to
> your use case.

Whatever it is is not well described by that post, but perhaps better by (e.1.2) immediately following it.

From the context, alpha-offset format sounds like jargon for what html2text does with a web page, which is to separate the text you want to read from all the interspersed code that controls how it would be displayed by a browser. That text¹, in whatever encoding, and now called "corpora", could be analysed as part of someone's corpus linguistics research.
The major problem that seems to be exercising this thread at this point is that the OP needs to swallow scads of documents, some, many, perhaps most of which are on the web, but process them on a computer that has never been connected to the internet. Some of us, decades ago, were coping with this problem, as it applies to the software, in the days before Debian's tools had evolved to the functionality they now have. These tools seem to have been overlooked.

And so we have discussions of removing networking from the kernel, redesigning apt using java, and de-americanising unicode, all closely monitored, keystroke by keystroke, by NSAs around the world. Typical debian-user, eh?

Did we ever decide which meaning of MSI applied for explaining the quirk in the subject line? Perhaps Randolph Quirk would coincidentally have held an opinion.

¹ I assume a similar treatment is meted out to PDFs, and the cache of videos is fed into a speech analyser, all to add to the mountain of text.

Cheers,
David.