On Sun, 9 Sep 2018 at 17:53, Eli Zaretskii <e...@gnu.org> wrote:
> > Text editors use various indexing caches always, to manage memory, I/O,
> > and allow working on large texts even on systems with low memory
> > available. As much as possible they attempt to use the OS-level caches
> > of the filesystem. And in all cases, they don't work directly on their
> > text buffer (whose internal representation in their backing store is
> > not just a single string, but a structured collection of buffers, built
> > on top of an interface masking the details: the effective text will
> > then be reencoded and saved from that object, using complex
> > serialization schemes; the text buffer is "virtualized").
>
> In Emacs, buffer text is a character string with a gap, actually.

A text buffer with gaps is a complex structure, not just a plain string. Gaps are one way to manage memory more efficiently and get reasonable performance when editing, without constantly moving large blocks. Such a "string" with gaps may well be just a byte buffer used as a backing store, but that buffer alone does not represent only the current text: a process still has to serialize it and clean it up before it can be used to save the edited text.

Emacs may not necessarily deallocate the end of the buffer, but I doubt it constantly keeps a single gap at the end: insertions and deletions in the middle would constantly move large blocks and use excessive CPU and memory bandwidth, with very slow response. Users do not want to see what they type appear on the screen at one keystroke every few seconds because each typed key causes massive block moves and excessive memory paging to and from disk while the move is performed.
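To make the gap idea concrete, here is a minimal sketch in Python of a buffer with a single movable gap. It is only an illustration under my own assumptions, not the Emacs implementation: the class name `GapBuffer`, its methods and the `gap_size` parameter are all hypothetical. It shows the two points made above: the gap is moved to the edit position, so repeated edits at the same place copy nothing, and the raw backing store is not the text itself, which has to be assembled by skipping the gap.

```python
class GapBuffer:
    """Toy gap buffer over code points (hypothetical sketch, not Emacs code)."""

    def __init__(self, text="", gap_size=16):
        # Backing store: the text followed by an unused region (the gap).
        self._buf = list(text) + [None] * gap_size
        self._gap_start = len(text)
        self._gap_end = len(self._buf)

    def __len__(self):
        # Logical length excludes the gap.
        return len(self._buf) - (self._gap_end - self._gap_start)

    def _move_gap(self, pos):
        """Slide the gap so that it starts at logical position pos."""
        if not 0 <= pos <= len(self):
            raise IndexError(pos)
        while self._gap_start > pos:   # move gap left, one item at a time
            self._gap_start -= 1
            self._gap_end -= 1
            self._buf[self._gap_end] = self._buf[self._gap_start]
        while self._gap_start < pos:   # move gap right
            self._buf[self._gap_start] = self._buf[self._gap_end]
            self._gap_start += 1
            self._gap_end += 1

    def _grow(self, needed):
        # Enlarge the gap in place; at least doubles the backing store.
        extra = max(needed, len(self._buf))
        self._buf[self._gap_end:self._gap_end] = [None] * extra
        self._gap_end += extra

    def insert(self, pos, text):
        self._move_gap(pos)
        if self._gap_end - self._gap_start < len(text):
            self._grow(len(text))
        for ch in text:                # fill the gap from its left edge
            self._buf[self._gap_start] = ch
            self._gap_start += 1

    def delete(self, pos, count):
        # Deleting just widens the gap over the removed items.
        self._move_gap(pos)
        self._gap_end = min(self._gap_end + count, len(self._buf))

    def text(self):
        # "Serialization": the stored text is the two halves around the gap.
        return "".join(self._buf[:self._gap_start] + self._buf[self._gap_end:])


buf = GapBuffer("hello world", gap_size=4)
buf.insert(5, ",")
print(buf.text())  # hello, world
```

The cost of an edit here is proportional to the distance the gap has to travel, which is why a single gap works well for localized typing and badly for edits that jump between distant positions in a very large text.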
All editors I have seen treat the text as ordered collections of small buffers (these small buffers may still have small gaps), which are occasionally merged or split when needed (merging does not cause any reallocation but may free one of the buffers), some of them being paged out to temporary files when memory is under pressure. The editor's code uses heuristics to decide when maintenance of the collection is really needed and useful for performance.

Beyond this, the performance cost of UTF indexing of the code points is invisible: each buffer only needs to avoid splitting text in the middle of a code point, if the current encoding of the edited text is a UTF. An editor may also avoid breaking buffers in the middle of clusters if it renders clusters (including ligatures, if they are supported): clusters are still small in every encoding, and buffers of reasonable size can hold at least hundreds of clusters (even the largest ones, which occur rarely). How editors manage clusters to make them editable depends on the implementation, but even the UTF or code point boundaries are not enough to handle that.

In all cases the logical text buffer is structured with a complex backing store, where parts may be paged out (and which also includes more than just the current text: notably parts of the indexes, possibly in another temporary working file).
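As an illustration of the boundary constraint described above, here is a small Python sketch that cuts UTF-8 bytes into bounded chunks without ever splitting a code point. The function name `split_on_codepoints` and the `chunk_size` parameter are hypothetical, and this is not taken from any particular editor. It relies only on the fact that UTF-8 continuation bytes have the bit pattern `10xxxxxx`. Cluster boundaries are not handled; that would need the Unicode segmentation rules on top of this.

```python
def split_on_codepoints(data, chunk_size):
    """Split UTF-8 bytes into chunks of at most chunk_size bytes,
    cutting only at code point boundaries (hypothetical helper)."""
    if chunk_size < 4:
        # A UTF-8 sequence is at most 4 bytes long.
        raise ValueError("chunk_size must be at least 4")
    chunks = []
    start = 0
    while start < len(data):
        end = min(start + chunk_size, len(data))
        # If the next chunk would begin with a continuation byte
        # (10xxxxxx), back up to the lead byte of that sequence.
        while start < end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        if end == start:
            # Only possible with malformed input: fall back to a hard cut.
            end = min(start + chunk_size, len(data))
        chunks.append(data[start:end])
        start = end
    return chunks


sample = "naïve café, 日本語, 😀".encode("utf-8")
for chunk in split_on_codepoints(sample, 5):
    # Every chunk decodes on its own, so none ends mid code point.
    print(chunk.decode("utf-8"))
```

Backing up costs at most three byte inspections per chunk, which is consistent with the point that code point boundaries add no visible cost to managing the collection of buffers.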