On Wed, Dec 11, 2013 at 09:05:16AM +0100, John Darrington wrote:
> On Tue, Dec 10, 2013 at 12:38:04PM -0800, Ben Pfaff wrote:
>
>      I understand now.  However, in other places in PSPP, and in particular
>      in syntax and the output engine, we tend to convert everything we
>      receive externally into UTF-8 for internal processing, and then convert
>      back to other encodings as necessary.  It would be convenient for some
>      purposes to do this for filenames also (e.g. to include file names in
>      output), and it would avoid needing to keep around two pieces of
>      information (file name plus encoding) when one (UTF-8 file name) would
>      do.
>
>      Do you think that storing file name plus encoding is superior?
>
> Both solutions have advantages and disadvantages.
>
> The converting-all-filenames-to-utf8 solution has two disadvantages that I
> can see:
>
> *. Unnecessary recoding - often it will be necessary to convert from
>    "filename encoding" to utf8 and then, back to "filename encoding".

Is the concern here about performance, or something else?  I doubt that
there is a real performance problem with doing one or two conversions of
a file name, once per file open.  Also, on GNU/Linux the filename
encoding is UTF-8 anyway, so there is no actual conversion.

> *. The bigger disadvantage, is that it will be very easy simply to forget
>    to do the necessary conversion.  If the programmer forgets - the
>    compiler won't complain - it is just a char * - Passing a
>    struct file_handle * one cannot forget - there'll be a compiler error.

That's true.  In data, we use uint8_t instead of char to remind
ourselves that the data is in the dictionary encoding.  We could use
int8_t for UTF-8 data, but that doesn't match either libunistring or
glib practice so it would probably cause a lot of friction at
interfaces.
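
The "compiler error if you forget" point can be illustrated with a small
C sketch.  This is only an illustration under stated assumptions, not
PSPP code: the types utf8_name and fs_name and the functions
utf8_name_from_cstr, utf8_to_fs and open_fs_name are hypothetical names
invented here, and the conversion is a plain copy because the sketch
assumes the file system encoding is already UTF-8.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical wrapper: a file name known to be encoded in UTF-8. */
struct utf8_name { char *s; };

/* Hypothetical wrapper: a file name in the file system's encoding. */
struct fs_name { char *s; };

/* Returns a heap-allocated copy of S; aborts on allocation failure. */
static char *copy_string(const char *s)
{
    size_t n = strlen(s) + 1;
    char *p = malloc(n);
    if (p == NULL)
        abort();
    memcpy(p, s, n);
    return p;
}

/* Wraps a C string that the caller asserts is UTF-8. */
static struct utf8_name utf8_name_from_cstr(const char *s)
{
    struct utf8_name u = { copy_string(s) };
    return u;
}

/* The only route from utf8_name to fs_name.  ASSUMPTION: the file
   system encoding is UTF-8, so this is a copy; a real version would
   recode here. */
static struct fs_name utf8_to_fs(struct utf8_name u)
{
    assert(u.s != NULL);
    struct fs_name f = { copy_string(u.s) };
    return f;
}

/* Accepts only an fs_name.  Passing a utf8_name or a bare char *
   is a type error at compile time, so the conversion cannot be
   skipped by accident. */
static FILE *open_fs_name(struct fs_name f, const char *mode)
{
    return fopen(f.s, mode);
}
```

With these definitions, open_fs_name(utf8_to_fs(u), "r") compiles, while
open_fs_name(u, "r") or open_fs_name("data.sav", "r") is rejected by the
compiler.  The cost is the same friction at interfaces mentioned above:
every call into a library that expects a plain char * has to unwrap the
struct.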
