Re: [fpc-pascal] Parse unicode scalar

Mattias Gaertner via fpc-pascal Sun, 02 Jul 2023 21:43:51 -0700

On Mon, 3 Jul 2023 08:29:11 +0700
Hairy Pixels via fpc-pascal <fpc-pascal@lists.freepascal.org> wrote:


> > On Jul 2, 2023, at 11:16 PM, Jer Haan <jdehaan2...@gmail.com> wrote:
> > 
> > This table is copied from Wikipedia. <uencoding.pas> Hope it’s useful
> > for you. If you improve the code, please let me know.
> 
> This is perfect, thanks! Much more complicated than I thought.
> 
> I'm curious now, if you were going the other direction and parsing a
> string of different unicode characters with different code point
> sequence lengths, how would you know which length it was? For example,
> I started off knowing which unicode scalar to use by looking at a
> table, but what if I had to find the character in a stream of text?
> 
> I think UTF8 can have 1-4 byte characters so you could encounter 1
> byte character followed by 4 byte characters interleaved and there's
> no header or terminator for each character. How is this solved?

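There is a header byte: the first byte of each sequence tells you how
many bytes the codepoint occupies.

How you read it depends on whether you also want to check for invalid
UTF-8 sequences. The fast variant below does no checking.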

From LazUTF8:

function UTF8CodepointSizeFast(p: PChar): integer;
begin
  case p^ of
    #0..#191   : Result := 1;
    #192..#223 : Result := 2;
    #224..#239 : Result := 3;
    #240..#247 : Result := 4;
    else Result := 1; // An optimization; also prevents the compiler warning
                      // about an uninitialized Result.
  end;
end;
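
To walk a string you can simply hop from header byte to header byte.
The following is only a minimal sketch, assuming the input is already
valid UTF-8. CountCodepoints is a hypothetical helper written for this
example, not something from LazUTF8, and the routine above is repeated
so that the program compiles on its own:

```pascal
program Utf8Walk;
{$mode objfpc}{$H+}

// Copy of the LazUTF8 routine quoted above.
function UTF8CodepointSizeFast(p: PChar): integer;
begin
  case p^ of
    #0..#191   : Result := 1;
    #192..#223 : Result := 2;
    #224..#239 : Result := 3;
    #240..#247 : Result := 4;
    else Result := 1;
  end;
end;

// Hypothetical helper (not part of LazUTF8): counts codepoints by
// advancing the pointer by the size read from each header byte.
// No validation is done, so invalid input gives a meaningless count.
function CountCodepoints(const s: string): integer;
var
  p, pEnd: PChar;
begin
  Result := 0;
  if s = '' then exit;
  p := PChar(s);
  pEnd := p + Length(s);
  while p < pEnd do
  begin
    inc(p, UTF8CodepointSizeFast(p));
    inc(Result);
  end;
end;

begin
  // 'a' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes), U+1F600 (4 bytes)
  writeln(CountCodepoints('a'#$C3#$A9#$E2#$82#$AC#$F0#$9F#$98#$80));
end.
```

The test string is 10 bytes long but holds 4 codepoints. If the input
may be malformed, you would additionally have to check that each
following byte is in the range #128..#191 and that the sequence does
not run past the end of the string.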

Mattias
_______________________________________________
fpc-pascal maillist  -  fpc-pascal@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
