Another concept considered is the ability to display different regions of a file with different formatting. For example, where endianness varies across a file, applying a different rendering to each region could assist schema developers in reasoning about the data.
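As a rough illustration of what such region-aware rendering could look like, here is a minimal sketch in TypeScript; the type and field names are hypothetical, not an existing API of the extension:

```typescript
// Minimal sketch (hypothetical types): each region of a file carries its own
// rendering settings, so a viewer can switch presentation at region boundaries
// (for example, where endianness changes).
type Endianness = "big" | "little";

interface Region {
  startByte: number;      // offset of the region from the start of the file
  lengthBytes: number;    // size of the region
  endianness: Endianness; // rendering to apply within this region
}

// Example: a file whose endianness changes partway through.
const regions: Region[] = [
  { startByte: 0, lengthBytes: 32, endianness: "big" },
  { startByte: 32, lengthBytes: 96, endianness: "little" },
];

// A renderer would look up the enclosing region to decide how to format a given offset.
function regionAt(offset: number): Region | undefined {
  return regions.find(r => offset >= r.startByte && offset < r.startByte + r.lengthBytes);
}

console.log(regionAt(40)?.endianness); // -> "little"
```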
The following mailing-list exchange captures the requirements and ideas discussed so far:

> File size limits for edit are acceptable.

We have found that to provide an interactive hex editor via VS Code we would
use an HTML view. Wrapping the bit/byte representation in HTML tags
significantly increases the size for non-editing functions as well, so if we
need to limit file size it may need to be a general limit rather than one
isolated to editing. The scalable approach here will likely be to provide
viewport-type functionality that limits the amount of data loaded into VS Code
to what is needed for display. Viewports combined with editing present further
challenges; we are researching ways to approach that as well.

On Fri, Oct 15, 2021 at 3:21 PM Mike Beckerle <mbecke...@apache.org> wrote:

> Some requirements that came up in a non-email discussion:
>
> * ability to edit and save data - editing at the hex and bit level. File
>   size limits for edit are acceptable.
> * display of bits
> * right-to-left byte order
> * word-size-sensitive display (e.g., the user can set the width to 70 bits)
> * support for the parse & unparse cycle (comparison of the output data from
>   unparse to the input data in the hex/bits display)
>
> The rest of this email is a bunch of random pointers/ideas about binary
> data display/edit. Hopefully useful, not TL;DR.
>
> This is what some least-significant-bit-first data lines look like. They
> are 70 bits wide, because that is the word size of the format. The byte
> order is right to left:
>
>     00 1100 0011 1000 0000 0000 0000 0000 0100 0000 0101 0000 0000 1000 0000 1110 1000 1100
>     00 0000 0000 0000 0000 0001 0101 1001 1111 1110 1010 1000 0101 1011 0011 1001 1010 0010
>     11 1111 1111 1000 0000 0000 1101 0110 0000 0000 0000 0000 0000 0000 0000 0000 0000 0101
>     00 0000 0000 0000 0001 1000 0000 0000 0111 1111 1000 0000 0000 0000 0000 0000 0000 1101
>
> I have highlighted the first fields of the first word, just to show how
> non-byte-aligned this sort of data is.
>
> This same kind of data is sometimes padded with extra bits, which would
> show up on the left. Two extra bits is pretty common, as then each "word"
> is an even 9 bytes, which would make a hex representation potentially
> useful. But I've also seen 5 bits of padding, and 75-bit words are no
> help. So a user needs to be able to say how wide they want the
> presentation, in bits.
>
> The above 4 lines of bits... that data format is often preceded by a
> 32-byte-long, big-endian, mostSignificantBitFirst header, all of which is
> byte-oriented, byte-aligned data and is most easily understood by looking
> at an ordinary L-to-R hex dump.
>
> Hence, users need to be able to examine a file of this sort of data and
> break the data at byte 32, so that from byte 33 (base-1 numbering) onwards,
> for the next 35 bytes (70 bits x 4 = 280 bits = 35 bytes), the bit-oriented
> 70-bit-wide display is used. A typical data file will have many such
> header+message pairs, suggesting one must switch back and forth between
> presentations of the data.
>
> You should also look at this bit-order tutorial, which also discusses
> R-to-L byte display:
> https://daffodil.apache.org/tutorials/bitorder.tutorial.tdml.xml
> This tutorial should convince you there is no need to reorder the bits,
> only the bytes. I.e., in the above 70-bit words, the first byte is
> "1000 1100", regardless of whether the presentation is L-to-R or R-to-L.
> The Daffodil CLI debugger has a "data dump" utility that creates R-to-L
> dump displays like this:
>
>     fedcba9876543210 ffee ddcc bbaa 9988 7766 5544 3322 1100 87654321
>     cø€␀␀␀wü␚’gU€␀gä 63f8 8000 0000 77fc 1a92 6755 8000 67e4 :00000000
>     ␀␀␁›¶þ␐HD 00 0001 9bb6 fe10 4844 :00000010
>
> That example is in
> daffodil-io/src/test/scala/org/apache/daffodil/io/TestDump.scala. The chars
> on the left are iso-8859-1 code points, except for the control-pictures
> characters used to represent the control code points.
>
> (Email isn't lining up these characters correctly because the
> control-pictures characters (like those for NUL, DLE, SUB, etc.) are not
> fixed width in this font. I don't think there is a fixed-width font in the
> world with every Unicode code point in it.)
>
> There are also examples there of L-to-R dumps for utf-8, utf-16, and utf-32
> data. E.g., this is utf-8 with some 3-byte Kanji chars:
>
>     87654321  0011 2233 4455 6677 8899 aabb ccdd eeff  0~1~2~3~4~5~6~7~8~9~a~b~c~d~e~f~
>     00000000: 4461 7465 20e5 b9b4 e69c 88e6 97a5 3d32  D~a~t~e~␣~年~~~~月~~~~日~~~~=~2~
>     00000010: 3030 33e5 b9b4 3038 e69c 8832 37e6 97a5  0~0~3~年~~~~0~8~月~~~~2~7~日~~~~
>
> Character sets are in general quite problematic, as there are some that
> include shift characters which change which character subsequent bytes
> correspond to. Mojibake <https://en.wikipedia.org/wiki/Mojibake> is
> sometimes unavoidable. Defaulting to just iso-8859-1 (where every byte is
> a valid character) is perfectly reasonable in many situations and is
> probably fine for a first cut.
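To make the quoted presentation rules concrete, here is a minimal sketch of a bit-row renderer in TypeScript. The function names are hypothetical (this is neither Daffodil nor extension code); it assumes that the bits of the last byte lying beyond the word width (padding, or the start of the next word) are simply dropped, and it mirrors byte placement only, never bit order within a byte:

```typescript
// Minimal sketch (hypothetical helpers): render one word of `wordBits` bits with
// bytes placed right-to-left. Bits within a byte are NOT reordered, per the
// bit-order tutorial; only the byte placement is mirrored.
function toBits(byte: number): string {
  return byte.toString(2).padStart(8, "0");
}

function renderWordRtl(bytes: Uint8Array, wordBits: number): string {
  // Concatenate each byte's bits, placing the FIRST byte at the right end.
  let bits = "";
  for (const b of bytes) {
    bits = toBits(b) + bits; // later bytes move further to the left
  }
  // Drop the excess high-order bits of the last byte (padding or next word).
  bits = bits.slice(bits.length - wordBits);
  // Group into nibbles from the right, e.g. "00 1100 ... 1000 1100".
  const groups: string[] = [];
  for (let end = bits.length; end > 0; end -= 4) {
    groups.unshift(bits.slice(Math.max(0, end - 4), end));
  }
  return groups.join(" ");
}

// Bytes reconstructed from the first quoted 70-bit row (the ninth byte's two
// high-order bits are not shown in the dump; zero is assumed here). The first
// byte, 0x8c ("1000 1100"), lands at the far right of the rendered row.
const word = new Uint8Array([0x8c, 0x0e, 0x08, 0x50, 0x40, 0x00, 0x00, 0x38, 0x0c]);
console.log(renderWordRtl(word, 70));
```

Similarly, the character column described for the data-dump utility can be approximated with a small byte-to-glyph mapping. This is an illustration of the idea, not the TestDump.scala implementation:

```typescript
// Minimal sketch (not the Daffodil implementation): treat each byte as an
// ISO-8859-1 code point and substitute Unicode "Control Pictures"
// (U+2400..U+2421) for the C0 controls and DEL, as the dumps above do.
function byteToGlyph(b: number): string {
  if (b < 0x20) return String.fromCodePoint(0x2400 + b); // ␀, ␁, ... for C0 controls
  if (b === 0x7f) return String.fromCodePoint(0x2421);   // ␡ for DEL
  // In ISO-8859-1 every byte value equals its Unicode code point, so every byte
  // maps to some character (C1 controls 0x80..0x9f have no pictures; left as-is).
  return String.fromCodePoint(b);
}

// Example: 0x00 -> "␀", 0x48 -> "H", 0xf8 -> "ø"
console.log([0x00, 0x48, 0xf8].map(byteToGlyph).join(""));
```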