On Sun, 9 Sep 2018 at 10:10, Richard Wordingham via Unicode <[email protected]> wrote:
> On Sat, 8 Sep 2018 18:36:00 +0200
> Mark Davis ☕️ via Unicode <[email protected]> wrote:
>
> > I recently did some extensive revisions of a paper on Unicode string
> > models (APIs). Comments are welcome.
> >
> > https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>
> Theoretically at least, the cost of indexing a big string by codepoint
> is negligible. For example, cost of accessing the middle character is
> O(1)*, not O(n), where n is the length of the string. The trick is to
> use a proportionately small amount of memory to store and maintain a
> partial conversion table from character index to byte index. For
> example, Emacs claims to offer O(1) access to a UTF-8 buffer by
> character number, and I can't significantly fault the claim.

I fully agree, as long as the "middle" character is **approximated** by the middle of the **encoded** length. But if it has to be the exact middle (by code point number), you have to count the code points exactly by parsing the whole string, which is O(n), then compute the middle from that count and parse again from the beginning to locate the encoded position of that code point index, which is O(n/2), so the final cost is O(n*3/2). The trick of using only a "small amount" of memory merely avoids the second parsing, for an O(n) result; you get O(1)* only if you keep that "small memory" holding the indexes. But the claim that it is "small" is wrong if the string is large (big value of n), and it brings no benefit if the string is indexed only once.

In practice, we prepare that "small memory" while instantiating a new iterator that will process the whole string. The string may not be fully loaded in memory, in which case that "small memory" will need reallocation as we read the whole string; we do not necessarily keep the string itself in memory if it is a very long text file, but the index buffer will still remain in memory even if we no longer need to come back to the start of the string. That "small memory" is just a local helper, and its cost must be evaluated.

In practice, however, long texts come from I/O. The text will be interfaced from files, in which case you benefit from the OS filesystem cache to save I/O, or from the network, in which case you need to store the network data in a local temporary file if you don't want to keep it fully in memory and want to allow some parts to be paged out by the OS. But Emacs only works with files: network texts are necessarily backed at least by a local temporary file. So that "small memory" for the index is not even needed (Emacs maintains an in-memory index only to locate line numbers).
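To make the cost comparison concrete, here is a minimal sketch in Rust (my own illustration; the names, the CHECKPOINT spacing, and the data layout are assumptions, not Emacs's actual implementation). middle_naive pays the O(n*3/2) two-pass cost for the exact middle code point, while CharIndex is the "partial conversion table" idea: one O(n) scan records the byte offset of every CHECKPOINT-th code point, after which any lookup by code point index costs O(1) for a fixed spacing:

    // Sketch only: names and the CHECKPOINT spacing are my own
    // assumptions, not Emacs's actual data structures.

    const CHECKPOINT: usize = 64; // code points between recorded byte offsets

    /// Exact middle by code point number, the naive way:
    /// one O(n) pass to count, then an O(n/2) pass to locate -> O(n*3/2).
    fn middle_naive(s: &str) -> Option<(usize, char)> {
        let count = s.chars().count(); // first full parse: O(n)
        if count == 0 {
            return None;
        }
        s.char_indices().nth(count / 2) // second parse from the start: O(n/2)
    }

    /// The "small memory": a partial conversion table from code point
    /// index to byte index, recorded every CHECKPOINT code points.
    struct CharIndex {
        offsets: Vec<usize>, // offsets[k] = byte offset of code point k * CHECKPOINT
        len: usize,          // total code points, known after one scan
    }

    impl CharIndex {
        /// One O(n) scan builds the table (ideally while the text is
        /// being read from I/O anyway, as argued above).
        fn build(s: &str) -> Self {
            let mut offsets = Vec::new();
            let mut len = 0;
            for (i, (byte, _)) in s.char_indices().enumerate() {
                if i % CHECKPOINT == 0 {
                    offsets.push(byte);
                }
                len = i + 1;
            }
            CharIndex { offsets, len }
        }

        /// Byte offset of code point `idx`: jump to the nearest
        /// checkpoint, then rescan at most CHECKPOINT-1 code points,
        /// i.e. O(1) per lookup for a fixed spacing.
        fn byte_of(&self, s: &str, idx: usize) -> Option<usize> {
            if idx >= self.len {
                return None;
            }
            let start = self.offsets[idx / CHECKPOINT];
            s[start..]
                .char_indices()
                .nth(idx % CHECKPOINT)
                .map(|(b, _)| start + b)
        }
    }

    fn main() {
        let text = "héllo wörld, with a 😀 and an ASCII tail";
        println!("{:?}", middle_naive(text));
        let ix = CharIndex::build(text);
        println!("{:?}", ix.byte_of(text, ix.len / 2));
    }

The objection above is visible in the offsets vector: it is "small" only relative to the string, it still grows as O(n/CHECKPOINT), and building it is only worthwhile if the string is indexed more than once.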
It has no need to do that for column numbers, as it is just faster to rescan the line (extremely long lines of text are exceptional, and such files are rarely edited with Emacs, unless you use it to load a binary file, whose representation on screen will be very different, notably for controls, which are expanded into another cached form). The column index for display, which is different from the code point index and specific to the Emacs representation for display/editing, is built only line by line, separately from the line index kept for the whole edited file. It is also independent of the effective encoding: it would still be needed even if the encoding of the backing buffer were UTF-32 with only one code point per code unit, because the actual display will still expand code points to other forms using visible escaping mechanisms, and it is even needed when the file is pure 7-bit ASCII kept with one byte per code point. Choosing between the Unicode encoding forms has no impact at all on what is really needed for display in text editors. A sketch of this line index follows below.

Text editors always use various indexing caches to manage memory and I/O and to allow working on large texts even on systems with little memory available. As much as possible they attempt to use the OS-level filesystem caches. And in all cases, they don't work directly on their text buffer (whose internal representation in the backing store is not just a single string, but a structured collection of buffers built on top of an interface masking the details: the effective text will then be re-encoded and saved from that object, using complex serialization schemes; the text buffer is "virtualized"). Only very basic text editors (such as Notepad) use a native single text buffer, but they are very slow when editing very large files, as they constantly need to copy/move large blocks of memory to perform insertions/deletions, and they also put too much pressure on the memory allocator. Even vi(m) or (s)ed on Unix/Linux now use another internal encoded form, with a backing store in temporary files created automatically as soon as you start modifying the content. The final consolidation and serialization occur only when saving the result.
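For the line-number case, that kind of index reduces to a vector of line-start byte offsets; a column is then found by rescanning only the (usually short) line. A minimal sketch under the same assumptions (hypothetical names, my own illustration rather than Emacs's code):

    /// Sketch of an editor-style line index over a UTF-8 buffer: one
    /// O(n) pass records where each line starts; a column is found by
    /// rescanning only that line, as argued above.
    struct LineIndex {
        line_starts: Vec<usize>, // byte offset of the first byte of each line
    }

    impl LineIndex {
        fn build(text: &str) -> Self {
            let mut line_starts = vec![0];
            for (i, b) in text.bytes().enumerate() {
                if b == b'\n' {
                    line_starts.push(i + 1); // next line begins after '\n'
                }
            }
            LineIndex { line_starts }
        }

        /// Byte offset of code point `col` on 0-based line `line`:
        /// O(1) to reach the line, then a rescan bounded by line length.
        fn byte_of(&self, text: &str, line: usize, col: usize) -> Option<usize> {
            let start = *self.line_starts.get(line)?;
            text[start..]
                .char_indices()
                .take_while(|(_, c)| *c != '\n')
                .nth(col)
                .map(|(b, _)| start + b)
        }
    }

    fn main() {
        let text = "first line\nsécond líne\nthird";
        let ix = LineIndex::build(text);
        println!("{:?}", ix.byte_of(text, 1, 3)); // 4th code point of line 2
    }

Note that scanning raw bytes for '\n' is safe in UTF-8, since no multi-byte sequence can contain that byte; the rescan per line is what makes a persistent column index unnecessary.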

