> Here's what I came up with, please see if it looks better now.

It looks okay as far as I can tell without testing it, except for this
addition:

>        else
>          {
>            utf8_char_ptr = utf8_char;
>            /* i is width of UTF-8 character */
>            degrade_utf8 (&utf8_char_ptr, &i);
> +          /* If we are done, make sure iconv flushes the last character.  */
> +          if (bytes_left <= 0)
> +            {
> +              utf8_char_ptr = utf8_char;
> +              i = 4;
> +              iconv (iconv_to_utf8, NULL, NULL,
> +                     &utf8_char_ptr, &utf8_char_free);
> +              if (utf8_char_ptr > utf8_char)
> +                {
> +                  utf8_char_ptr = utf8_char;
> +                  degrade_utf8 (&utf8_char_ptr, &i);
> +                }
> +            }
>          }

That's okay for that code path, but I wonder if we should also call
iconv to flush the last character after the main loop exits because of
this condition:

      if (iconv_ret != (size_t) -1)
        /* Success: all of input converted. */
        break;

I'm trying to read the libc manual closely and, actually, it's probably
not necessary:

     If all input from the input buffer is successfully converted and
     stored in the output buffer, the function returns the number of
     non-reversible conversions performed.  In all other cases the
     return value is `(size_t) -1' and `errno' is set appropriately.

So if there's one character held back waiting for a following combining
character, there won't be a return value indicating success.

But if that interpretation is correct, then why should the following be
necessary?

+  /* Make sure libiconv flushes out the last converted character.
+     This is required when the conversion is stateful, in which
+     case libiconv might not output the last character, waiting to
+     see whether it should be combined with the next one.  */
+  if (iconv_ret != (size_t) -1
+      && text_buffer_iconv (&output_buf, iconv_to_output,
+                            NULL, NULL) != (size_t) -1)

So maybe it is necessary after exiting the main loop, and the wording in
the manual is misleading.

Re this:

> So there's a ping-pong between 2 separate conversions, and the
> assumption seems to be that each conversion advances the input pointer
> and the bytes-left variable according to what it produced.

I put the following comment in the code because I wasn't sure about this
point:

      /* If file is not in UTF-8, we degrade to ASCII in two steps:
         first convert the character to UTF-8, then look up a replacement
         string.  Note that mixing iconv_to_output and iconv_to_utf8
         on the same input may not work well if the input encoding
         is stateful.  We could deal with this by always converting to
         UTF-8 first; then we could mix conversions on the UTF-8 stream. */

> Having played with this code, I must say that I feel it's based on
> somewhat fragile assumptions whose validity is not clear to me.

It will take me some more time to respond to this.  If you find code
that you think is correct and works, by all means please go ahead and
commit it.
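
On the flushing question, here is a minimal standalone sketch of the
pattern under discussion: convert the whole input, then call iconv once
more with a NULL input buffer so that anything the converter is holding
back gets written out. It assumes a POSIX-style iconv, and
convert_and_flush is a hypothetical helper made up for illustration, not
a function from the patch. It only shows the order of the calls and does
not settle whether the flush is strictly required after a successful
return.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the patch): convert the NUL-terminated
   string IN from encoding FROM to encoding TO, storing at most OUT_SIZE
   bytes in OUT.  Returns the number of bytes stored, or (size_t) -1 if
   the descriptor cannot be opened or a conversion call fails.  */
size_t
convert_and_flush (const char *from, const char *to,
                   const char *in, char *out, size_t out_size)
{
  iconv_t cd = iconv_open (to, from);
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  char *inptr = (char *) in;    /* iconv takes char **, but only reads.  */
  size_t inleft = strlen (in);
  char *outptr = out;
  size_t outleft = out_size;

  /* Main conversion: consumes the input and advances both pointers.  */
  size_t ret = iconv (cd, &inptr, &inleft, &outptr, &outleft);

  /* Flush: a NULL input buffer with a non-NULL output buffer asks iconv
     to return to the initial state and store whatever output that
     needs.  For a stateless conversion this writes nothing.  */
  if (ret != (size_t) -1)
    ret = iconv (cd, NULL, NULL, &outptr, &outleft);

  iconv_close (cd);

  if (ret == (size_t) -1)
    return (size_t) -1;
  return out_size - outleft;
}
```

Doing the flush unconditionally after a successful conversion costs
nothing when the conversion is stateless, which is one argument for
adding it after the main loop regardless of how the manual is read.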