On Thu, 29 Apr 2021 Albretch Mueller wrote:
What is "alpha-offset format"?

We corpus-research kinds of folks need to process thousands of
files the way other people process bytes.

That was a helpful clue, that it could be a term of art in corpus

After some searches in that direction, I found a couple of papers
using the term "character offset format", which sounds a little more

UTF8 was basically an Americanizierung of alle alphabets.

All roads lead to [DEL: Rome :DEL] American Standard Code for
Information Interchange.

UTF is great to describe an alphabet but not for text files.

UTF8 turned all files into streams, which are not good for questions
such as: what is the character/string sequence starting at the nth
addressable unit of a file ...

Variable-width encoding is a complication if you need random access to
the nth character. I see.
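
To make the complication concrete for myself, here is a minimal Python
sketch. It assumes "character" means a Unicode code point (not a
grapheme), and the function names are my own, hypothetical ones. With
UTF-8 the nth code point has to be found by scanning lead bytes from the
start of the data; with a fixed-width encoding such as UTF-32 it sits at
byte offset 4*n.

```python
def nth_char_utf8(data: bytes, n: int) -> str:
    """Return the n-th code point (0-based) of UTF-8 bytes by scanning."""
    if n < 0:
        raise IndexError(n)
    count = -1
    start = None
    for i, b in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; anything else starts
        # a new code point.
        if b & 0xC0 != 0x80:
            count += 1
            if count == n:
                start = i
            elif count == n + 1:
                return data[start:i].decode("utf-8")
    if start is None:
        raise IndexError(n)
    # The n-th code point was the last one in the data.
    return data[start:].decode("utf-8")


def nth_char_utf32(data: bytes, n: int) -> str:
    """Return the n-th code point of UTF-32-LE bytes (no BOM) directly."""
    if n < 0 or 4 * n + 4 > len(data):
        raise IndexError(n)
    return data[4 * n : 4 * n + 4].decode("utf-32-le")


text = "naïve café, 日本語"
u8 = text.encode("utf-8")       # variable width: 1 to 4 bytes per code point
u32 = text.encode("utf-32-le")  # fixed width: always 4 bytes per code point
```

The UTF-8 lookup costs time proportional to the offset, the UTF-32 one
is constant, at the price of four bytes for every code point.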

Doing that with UTF8 ranges from way too complicated to impossible.

The solution, clearly, is for everyone to use UTF-32 instead.

Also, alpha offset nicely splits a file's segments into their
different parts: ALPHABETICAL text, js, css, ...

Ah. It sets *character data* apart from markup, annotations, etc.

 Definition: All text that is not markup constitutes the "character
 data" of the document. ( https://www.w3.org/TR/REC-xml/#dt-chardata )

An alpha/character offset format separates the data from things said
*about* the data.

If I understand you correctly, that is.
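
To check my own reading, here is a rough Python sketch of that
separation. It is only my interpretation, not a definition taken from
the papers, and the names StandoffExtractor and to_standoff are
hypothetical. It uses the standard library's html.parser to pull the
character data out of a small marked-up string and to record each
element as a (start, end, tag) triple of character offsets into that
data.

```python
from html.parser import HTMLParser


class StandoffExtractor(HTMLParser):
    """Collect character data and record tags as character offsets."""

    def __init__(self):
        super().__init__()
        self.chars = []        # pieces of character data, in order
        self.length = 0        # number of characters collected so far
        self.open_tags = []    # stack of (tag, start_offset)
        self.annotations = []  # finished (start, end, tag) triples

    def handle_starttag(self, tag, attrs):
        self.open_tags.append((tag, self.length))

    def handle_endtag(self, tag):
        # Close the most recently opened element with this tag name.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                _, start = self.open_tags.pop(i)
                self.annotations.append((start, self.length, tag))
                break

    def handle_data(self, data):
        self.chars.append(data)
        self.length += len(data)


def to_standoff(markup: str):
    """Return (character data, sorted list of (start, end, tag))."""
    parser = StandoffExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chars), sorted(parser.annotations)


data, annotations = to_standoff("<p>Hello <b>world</b>!</p>")
```

The character data comes out as one plain string, and everything said
*about* it lives in a separate list that only refers to it by offsets
(end offsets are exclusive here, so data[start:end] gives the span).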

Now I wonder how this might enable random access to the nth
character. I will keep looking around.

Anyways, thank you for taking the time to elaborate.

And good luck with your projects. Many of the points in your outline,
particularly the ones under section

 e) acting as a squid-like application proxy

are on my wishlist, because I think a web browser should resemble more
a pair of binoculars than the eyelid retainers in Stanley Kubrick's A
Clockwork Orange.

What is important is seldom urgent,
and what is urgent is seldom important
-- Dwight David Eisenhower
