On Thu 29 Apr 2021 at 11:33:29 (-0400), rhkra...@gmail.com wrote:
> On Thursday, April 29, 2021 09:03:59 AM Albretch Mueller wrote:
> > > What is "alpha-offset format"?
> > 
> >  we, corpora research kinds of folks, need to process thousands of
> > files as other people process bytes. UTF8 was basically an
> > Americanizierung of all alphabets.

"Americanizierung" merely seems an odd way of saying that UTF8 favours
the Roman alphabet in terms of the length of encoded strings.
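
A quick back-of-the-envelope in Python 3 shows the disparity (the
strings are just examples I picked):

    for s in ("abcde",        # Latin letters: 1 byte each in UTF8
              "αβγδε",        # Greek letters: 2 bytes each
              "あいうえお"):  # Japanese kana: 3 bytes each
        print(len(s), "characters ->", len(s.encode("utf-8")), "bytes")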

> > UTF is great to describe an
> > alphabet but not for text files.

Are we now discussing non-alphabetic scripts? Beats me.

> >  UTF8 turned all files into streams not good for questions such as
> > what is the character/string sequence starting on the nth addressable
> > unit of a file ...

It's very good for such questions, because it is unambiguous whether
the nth addressable unit (let's say byte) is the start of a character
or not: continuation bytes always carry the bit pattern 10xxxxxx, so
any byte can be classified in isolation. When it /is/ the start of a
character, the number of bytes in the character is given by the
leading bits of that first byte (at most five of them), so you don't
have to hunt for some kind of terminating byte.
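
To make that concrete, here is a rough sketch in Python 3 (assuming a
well-formed byte stream; the function name is my own, not anything
from a library):

    def describe_byte(b):
        # Continuation bytes are always 10xxxxxx, so any byte can be
        # classified in isolation; a lead byte's high bits give the
        # length of the character it starts.
        if b & 0b10000000 == 0:
            return "lead byte of a 1-byte character (ASCII)"
        if b & 0b11000000 == 0b10000000:
            return "continuation byte"
        if b & 0b11100000 == 0b11000000:
            return "lead byte of a 2-byte character"
        if b & 0b11110000 == 0b11100000:
            return "lead byte of a 3-byte character"
        if b & 0b11111000 == 0b11110000:
            return "lead byte of a 4-byte character"
        return "not valid in UTF-8"

    for i, b in enumerate("héllo".encode("utf-8")):
        print(i, hex(b), describe_byte(b))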

> >  Doing that with UTF8 is from way too complicated to impossible.

OTOH it's not straightforward to determine at which byte the nth
character starts: characters occupy anything from one to four bytes,
so you have to walk the stream from the beginning (or maintain an
index). Perhaps that is what you meant to say.
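
Something along these lines is about the best one can do without
building an index first (again only a sketch, names my own):

    def byte_offset_of_char(data, n):
        # Walk the bytes, counting lead bytes (anything that is not
        # a 10xxxxxx continuation byte) until the nth character.
        seen = 0
        for i, b in enumerate(data):
            if b & 0b11000000 != 0b10000000:
                if seen == n:
                    return i
                seen += 1
        raise IndexError("fewer than %d characters" % (n + 1))

    data = "naïve café".encode("utf-8")
    print(byte_offset_of_char(data, 3))   # 'v' starts at byte 4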

> > Also
> > alpha offset nicely splits the files segments into its different
> > parts: ALPHABETICAL text, js, css, ...

I'd be interested to see examples of JavaScript and Cascading Style
Sheets written in non-ALPHABETICAL text.

> Ok, but what does it look like?  (What is the format?)
> 
> A google search shows only links to this thread and some page about its 
> relevance to aiming a telescope -- I strongly suspect that is not relevant to 
> your use case.

Whatever it is, it is not well described by that post, but perhaps
better by (e.1.2) immediately following it.

From the context, alpha-offset format sounds like jargon for what
html2text does with a web page, which is to separate the text you
want to read from all the interspersed code that controls how it
would be displayed by a browser.
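
A toy version of that idea (not how html2text is actually
implemented, just the gist of it, using Python's stock html.parser):

    from html.parser import HTMLParser

    class TextOnly(HTMLParser):
        # Keep the text nodes, drop the markup and anything living
        # inside <script> or <style> elements.
        def __init__(self):
            super().__init__()
            self.skip = 0
            self.parts = []
        def handle_starttag(self, tag, attrs):
            if tag in ("script", "style"):
                self.skip += 1
        def handle_endtag(self, tag):
            if tag in ("script", "style") and self.skip:
                self.skip -= 1
        def handle_data(self, data):
            if not self.skip and data.strip():
                self.parts.append(data.strip())

    p = TextOnly()
    p.feed("<p>Hello</p><script>var x = 1;</script><p>world</p>")
    print(" ".join(p.parts))    # -> Hello world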

That text¹, in whatever encoding, and now called "corpora", could be
analysed as part of someone's corpus linguistics research.

The major problem that seems to be exercising this thread at this
point is that the OP needs to swallow scads of documents, some, many,
perhaps most of which are on the web, but process them on a computer
that has never been connected to the internet.

Some of us, decades ago, were coping with this problem, as it applies
to the software, in the days before Debian's tools had evolved to the
functionality they now have. These tools seem to have been overlooked.

And so we have discussions of removing networking from the kernel,
redesigning apt using java, and de-americanising unicode, all
closely monitored, keystroke by keystroke, by NSAs around the world.
Typical debian-user, eh? Did we ever decide which meaning of MSI
applied for explaining the quirk in the subject line? Perhaps
Randolph Quirk would coincidentally have held an opinion.

¹ I assume a similar treatment is meted out to PDFs, and the cache
of videos is fed into a speech analyser, all to add to the mountain
of text.

Cheers,
David.
