Re: utf-8 bom frequency of bytes

John Little Fri, 20 Jan 2012 00:49:53 -0800

On Jan 20, 5:42 pm, "Benjamin R. Haskell" <[email protected]> wrote:
>
> I don't know the background of 'f_readfile', but why would the BOM be
> removed in positions other than at the start of the string?  Isn't it
> only meaningful as an encoding detection when it's the first thing being
> read?  Anywhere else U+FEFF is a zero-width, non-breaking space.


I had similar misgivings, but I'm following the documentation in :help
readfile().

(From my point of view boms are a kludge to fix a kludge, if not a
third or higher level kludge.  If someone had had the wit to specify
that UTF-16's ancestor was big-endian from the beginning we wouldn't
have this mess.)

In a *nix utf-8 environment, boms can proliferate as files are
concatenated or interpolated, so I can imagine use cases where their
removal is a good idea.  As well, the use of U+FEFF as a zero-width
no-break space is deprecated in favour of U+2060 (word joiner).

> I'm also not sure checking for it is expensive enough to worry about the
> inefficiency of looking backward two buffer positions.  The cost of
> disk access is so much greater than comparing it once in memory that one
> or two extra occasional comparisons seems insignificant.  Seems like
> premature optimization.

I'm focussed on this sort of thing, sorry, and don't want to make the
performance worse than the previous version.  When I used the current
version, the cost of its allocations and memory copies far outweighed
the disk access.
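
For illustration only, here is a rough sketch of removing BOMs in a
single pass with no extra allocation.  This is not Vim's actual
f_readfile code: strip_utf8_boms is a made-up name, and the sketch
assumes the whole buffer is already in memory, so a BOM split across
two read chunks is not handled.  It looks forward from each 0xEF byte
rather than backward, which is an arbitrary choice for the example.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of Vim): remove every UTF-8 BOM
 * (bytes EF BB BF) from buf[0..len) in place and return the new length.
 * One forward pass, no allocation; each kept byte is copied at most once.
 */
size_t strip_utf8_boms(unsigned char *buf, size_t len)
{
    size_t r = 0;   /* read position */
    size_t w = 0;   /* write position, never ahead of r */

    while (r < len) {
        if (buf[r] == 0xEF && len - r >= 3
                && buf[r + 1] == 0xBB && buf[r + 2] == 0xBF) {
            r += 3;                 /* skip the three BOM bytes */
        } else {
            buf[w++] = buf[r++];    /* keep this byte */
        }
    }
    return w;
}
```

In well-formed UTF-8 the byte sequence EF BB BF can only be the
encoding of U+FEFF, so the test cannot match in the middle of some
other character.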

> They all seem to appear throughout script blocks, so it's really
> data-dependent.

Thank you for your informative answer.  I was afraid of stepping into
something egregious.

Regards, John

-- 
You received this message from the "vim_dev" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php
