Now my proposal for the other use cases:

> 2) There should be some way to store mostly ascii-compatible strings in a
> single byte-per-character array -- so as not to waste space on "typical
> european-language-oriented data". Note: this should ALSO be compatible with
> Python's character-oriented string model, i.e. a Python string of length
> N will fit into a dtype of size N.
>
> arr = np.array(("this", "that",), dtype=np.single_byte_string)
>
> (name TBD)
>
> and arr[1] would return a Python string.
>
> Attempting to put in a string that is not compatible with the encoding
> would raise an EncodingError.
>
> This is also a use case primarily for "casual" users -- but ones who are
> concerned with the size of the data storage and know that they are using
> european text.
>
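To illustrate why a one-byte-per-character encoding lines up with Python's character-oriented model, here is a minimal sketch in plain Python. It is not the proposed dtype, and the helper name fits_in_n_bytes is made up for this example; it only checks whether a string of N characters encodes to exactly N bytes.

```python
def fits_in_n_bytes(text, encoding):
    """Return True if `text` encodes to exactly len(text) bytes.

    Hypothetical helper for illustration only; not part of numpy.
    """
    try:
        return len(text.encode(encoding)) == len(text)
    except UnicodeEncodeError:
        # The text contains characters the encoding cannot represent.
        return False


# "caf\u00e9" is 4 characters: 4 bytes in latin-1 ...
print(fits_in_n_bytes("caf\u00e9", "latin-1"))            # True
# ... but 5 bytes in utf-8, where "\u00e9" takes two bytes.
print(fits_in_n_bytes("caf\u00e9", "utf-8"))              # False
# Japanese text cannot be encoded in latin-1 at all; str.encode raises
# UnicodeEncodeError, the stdlib analogue of the "EncodingError" above.
print(fits_in_n_bytes("\u65e5\u672c\u8a9e", "latin-1"))   # False

# latin-1 maps every byte value to a character, so any bytes round-trip.
raw = bytes(range(256))
assert raw.decode("latin-1").encode("latin-1") == raw
# ascii needs the surrogateescape error handler to do the same.
assert raw.decode("ascii", "surrogateescape").encode("ascii", "surrogateescape") == raw
```

The last two assertions show the practical difference between the two candidates: both can carry arbitrary bytes through a decode/encode cycle, but ascii only does so when the error handler is spelled out.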
more detail elsewhere -- but either ascii with surrogateescape or latin-1
are always good options here. I prefer latin-1 (I really see no downside),
but others disagree...

But then we get to:

> 3) dtypes that support storage in particular encodings:
>

We need utf-8. We may need others. We may need a 1-byte-per-char compact
encoding that is too far from ascii or latin-1 for those to be useful (say,
shift-jis). And I don't think we are going to come to a consensus on what
"single" encoding to use for 1-byte-per-char.

So really -- going back to Julian's earlier proposal:

dtype with an encoding specified
"size" in bytes

Once defined, numpy would encode/decode to/from Python strings "correctly".

We might need "null-terminated utf-8" as a special case.

That would support all the other use cases, even the one-byte-per-char
encoding. I'd like to see a clean alias to a latin-1 encoding, but it's not
a big deal.

That leaves a few decisions:

- error out or truncate if the passed-in string is too long?

- error out or surrogateescape if there are invalid bytes in the data?

- error out or something else if there are characters that can't be
  encoded in the specified encoding?

And we still need a proper bytes type:

> 4) a fixed length bytes dtype -- pretty much what 'S' is now under python
> three -- settable from a bytes or bytearray object (or other memoryview?),
> and returns a bytes object.
>
> You could use astype() to convert between bytes and a specified encoding
> with no change in binary representation. This could be used to store any
> binary data, including encoded text or anything else. This should map
> directly to the Python bytes model -- thus NOT null-terminated.
>
> This is a little different than 'S' behaviour on py3 -- it appears that
> with 'S', if ALL the trailing bytes are null, then it is truncated, but
> if there is a null byte in the middle, then it is preserved. I suspect that
> this is a legacy from Py2's use of "strings" as both text and binary data.
> But in py3, a "bytes" type should be about bytes, and not text, and thus
> null-valued bytes are simply another value a byte can hold.
>

--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
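To make the three open decisions for the encoding-plus-size dtype concrete, here is a rough sketch in plain Python. It is only an illustration under my own assumptions: encode_fixed and decode_fixed are hypothetical helpers, not numpy functions, and null padding is just one possible way to fill a fixed-size field.

```python
def encode_fixed(text, encoding, size, errors="strict", truncate=False):
    """Encode `text` into a field of exactly `size` bytes (null-padded).

    Hypothetical helper, for illustration only; not a numpy API.
    """
    # Decision 3: `errors` controls what happens to unencodable characters
    # ("strict" raises UnicodeEncodeError, "replace" substitutes "?").
    data = text.encode(encoding, errors)
    if len(data) > size:
        # Decision 1: error out or truncate when the string is too long.
        if not truncate:
            raise ValueError(f"{len(data)} encoded bytes do not fit in {size}")
        # Naive byte truncation; it can cut a multi-byte character in half.
        data = data[:size]
    return data.ljust(size, b"\x00")


def decode_fixed(field, encoding, errors="strict"):
    """Decode a fixed-size field, ignoring trailing null padding only."""
    # Decision 2: `errors` controls invalid bytes ("strict" raises
    # UnicodeDecodeError, "surrogateescape" passes them through).
    return field.rstrip(b"\x00").decode(encoding, errors)


field = encode_fixed("caf\u00e9", "utf-8", 8)
print(field)                                   # b'caf\xc3\xa9\x00\x00\x00'
assert decode_fixed(field, "utf-8") == "caf\u00e9"

# Truncating to 4 bytes splits the two-byte "\u00e9", leaving invalid utf-8.
cut = encode_fixed("caf\u00e9", "utf-8", 4, truncate=True)
print(cut)                                     # b'caf\xc3'
print(ascii(decode_fixed(cut, "utf-8", "surrogateescape")))  # 'caf\udcc3'
```

Note how the first two decisions interact: truncating at a byte boundary can itself create the invalid bytes that the decoder then has to reject or escape. Stripping trailing nulls in decode_fixed also means a string that really ends in a null character would not round-trip, which is the same trade-off described for 'S' above.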
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion