utf-8 bom frequency of bytes

John Little Thu, 19 Jan 2012 16:06:16 -0800

Hi all

I'm revising the function f_readfile in eval.c to speed it up when
processing very long lines.  (At present it grows a string every 200
bytes by allocating a new one 200 bytes longer, copying the old to the
new, and deallocating the old.  For example, for a 1 MB line, such as
may be used by the yank ring plugin, there are 5000 allocations and
deallocations and about 5 GB of data copies.)
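
The cost described above can be estimated with a small model.  This is
only a sketch: the helper names are hypothetical and are not taken from
eval.c, and the model assumes the buffer is regrown each time it fills.
Counting each copied byte once, it gives roughly 2.5 GB for a
1,000,000-byte line; the total is about 5 GB if the read and the write
of every copied byte are counted separately.  Doubling growth is shown
only for comparison; the post does not say which strategy the revision
uses.

```c
#include <stddef.h>

/* Bytes copied when a buffer is regrown in fixed steps: each time it
 * fills, a new buffer `step` bytes longer is allocated and the old
 * contents are copied across.  Hypothetical helper, not Vim code. */
unsigned long long copied_fixed_step(unsigned long long total,
                                     unsigned long long step)
{
    unsigned long long copied = 0;
    unsigned long long cap = step;  /* first allocation: nothing to copy */

    while (cap < total) {
        copied += cap;              /* old contents move to the new buffer */
        cap += step;
    }
    return copied;
}

/* Same model, but the capacity doubles on every regrowth. */
unsigned long long copied_doubling(unsigned long long total,
                                   unsigned long long initial)
{
    unsigned long long copied = 0;
    unsigned long long cap = initial;

    while (cap < total) {
        copied += cap;
        cap *= 2;
    }
    return copied;
}
```

For a 1,000,000-byte line and a 200-byte starting size, the fixed-step
model copies about 2.5 billion bytes, while the doubling model copies
under 2 million.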


I've also noted that its handling of CR and BOM removal currently
fails if the characters are read in different calls to fread, so I'm
fixing that.  One can only decide that the UTF-8 BOM sequence EF BB BF
is present once all three bytes have been read, so I was about to code
a check for when the BF is encountered, but it occurred to me that if
BF is common in UTF-8 text, the previous bytes would be checked a lot.
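
One way to make the detection independent of where fread splits the
data is to carry a small match state from one chunk to the next.  The
sketch below is an alternative to looking back from each BF, not the
approach proposed in this post; the type and function names are
hypothetical, and it assumes every complete EF BB BF sequence should be
dropped.  It costs one comparison per byte however common BF is.

```c
#include <stddef.h>

/* Hypothetical filter state: how many bytes of EF BB BF have been
 * matched so far (0, 1 or 2).  It survives between chunks. */
typedef struct {
    int matched;
} bom_filter;

static const unsigned char BOM[3] = { 0xEF, 0xBB, 0xBF };

/* Copy `len` bytes from `in` to `out`, dropping every complete
 * EF BB BF sequence, even one that straddles two calls.  `out` needs
 * room for len + 2 bytes, because up to two bytes held back from the
 * previous chunk may be released.  Returns the bytes written. */
size_t bom_filter_feed(bom_filter *f, const unsigned char *in, size_t len,
                       unsigned char *out)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = in[i];

        if (c == BOM[f->matched]) {
            if (++f->matched == 3)
                f->matched = 0;     /* full BOM seen: drop it */
            continue;
        }
        /* Mismatch: the held-back prefix was ordinary data after all. */
        for (int k = 0; k < f->matched; k++)
            out[n++] = BOM[k];
        f->matched = 0;
        if (c == BOM[0])
            f->matched = 1;         /* c may start a new BOM */
        else
            out[n++] = c;
    }
    return n;
}

/* At end of input, release any bytes still held back (at most two). */
size_t bom_filter_finish(bom_filter *f, unsigned char *out)
{
    size_t n = 0;

    for (int k = 0; k < f->matched; k++)
        out[n++] = BOM[k];
    f->matched = 0;
    return n;
}
```

The state would be initialised to zero before the first chunk, fed each
buffer as it comes back from fread, and flushed with bom_filter_finish
once the end of the file is reached.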

So, how common is the byte BF in UTF-8 text?  How common are EF and
BB?  I've little idea.  Perhaps someone on vim_dev has a better idea.
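
Some of this follows from the encoding itself: BB and BF fall in the
continuation-byte range 80..BF, so they can appear as a trailing byte of
any multi-byte character, while EF only appears as the lead byte of
characters in U+F000..U+FFFF.  Pure ASCII text contains none of the
three.  For real figures one could simply count the bytes in sample
files; the helper below is a hypothetical sketch for doing that, not
part of Vim.

```c
#include <stddef.h>

/* Fill hist[256] with the number of occurrences of each byte value
 * in buf[0..len-1].  Hypothetical helper for measuring sample text. */
void byte_histogram(const unsigned char *buf, size_t len, size_t hist[256])
{
    for (int b = 0; b < 256; b++)
        hist[b] = 0;
    for (size_t i = 0; i < len; i++)
        hist[buf[i]]++;
}
```

After running it over a representative file, hist[0xEF], hist[0xBB] and
hist[0xBF] divided by the file length give the frequencies asked about.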

Regards, John

-- 
You received this message from the "vim_dev" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php
