John Darrington <[email protected]> writes: [data-encoding branch] > Comments about this branch are invited before I merge with master.
At a high level, it bothers me how much allocation and deallocation and copying this adds to simple operations. I guess that it's probably not worth worrying about without profiler output showing that it's significant, though. In "Store variable names, labels and value labels as UTF8": The function recode_strings() in sys-file-reader.c will assert-fail if renaming variables causes a duplicate variable name, even if the duplication would only be temporary, that is, if variables named A and B will be renamed B and C in the process of recoding. This is because dict_rename_var assert-fails if a duplicate would be created. I can think of two solutions: either check whether a variable by the recoded name exists before calling dict_rename_var, or use the dict_rename_vars function to bulk-rename all of the variables in the dictionary at once, which doesn't have this problem. Probably the former solution makes sense if you think the possibility of recoding causing this problem is absurdly unlikely, and the latter otherwise. In "data_out function to dynamically allocate return value": The changes to table_value_missing are unsafe because they allocate a fixed-size buffer and then strcpy the data_out return value into that, even though data_out might return arbitrarily large data. Instead, it should not allocate its buffers (via tab_alloc) until it has received its formatted data from data_out, because only then does it know how much data there is. Similarly in format_cell_entry. Actually, tab_alloc just allocates from a pool, so these functions could use data_out_pool. The loop in tab_double is wrong, since there might be more bytes in the string than fmt->w. data_out_pool always null-terminates its output (presumably?) so the "&& cp < s + fmt->w" test can just be deleted. I think that the call to g_string_append_len in data_out_g_string is now wrong, because fs->w is not necessarily the length of s. Probably this should just be g_string_append now? In "Convert to utf8 in data_out function.": There should be a comment on data_out and data_out_pool explaining the meaning of the 'encoding' argument. There is a tendency to assuming that it is the encoding output by the function, whereas I believe it is actually the encoding of the function's input. As I read "Recode strings when writing system files." it occurred to me that a sensible way to deal with recoding on system file read and write would be to write a dict_recode function that both sys-file-reader and sys-file-writer could use. Then neither file would have to worry about recoding much; they could just call this function. Does that make sense? I didn't have time to build and test this at all. If you feel confident about it, go ahead and merge into master. -- Ben Pfaff http://benpfaff.org _______________________________________________ pspp-dev mailing list [email protected] http://lists.gnu.org/mailman/listinfo/pspp-dev
