Re: [PATCH v2] utf8.c: print warning about iconv errors
Jeff King writes:

> On Fri, Aug 14, 2015 at 03:35:58PM -0700, Junio C Hamano wrote:
>
>> Max Kirillov writes:
>>
>> > * do not limit number of warnings - not worth complicating the code
>>
>> Unless the warning leads to a quick "die()", wouldn't this make Git
>> unusable by spewing a "falling back to verbatim copy" for each and
>> every line of the message of a commit that has 'encoding' element in
>> its header in the "git log" output, no?
>
> We only do the reencode once per commit. So it would be once per commit
> rather than once per line. Which still sounds kind of annoying, if you
> are using "git log --oneline" or similar.
>
> I think I'd favor a single warning in general, along the lines of
> "some encodings could not be converted". But of course if you are trying
> to figure out _which_ encodings your system doesn't have, that's not
> very helpful. Maybe we could have an advice.encodingFailure config flag
> with a tristate:
>
>   - false: don't spew any warnings
>
>   - true: give a generic warning once per program
>
>   - all: give a specific warning for each case, like "unable to convert
>     EUC-JP to UTF-8: iconv_open: Invalid argument". (Sadly EINVAL is
>     what iconv_open seems to return when it doesn't know about a
>     particular encoding; it may be nicer to translate to something more
>     reasonable than what strerror() provides).

Sounds sensible.

>> > +char *reencode_string_len(const char *in, int insz,
>> > +			  const char *out_encoding, const char *in_encoding,
>> > +			  int *outsz)
>> > +{
>> > +	if (!same_encoding(in_encoding, out_encoding))
>> > +		warning("Iconv support is disabled at compile time. It is likely that\nincorrect data will be printed or stored in repository.\nConsider using other build for this task.");
>> > +	return NULL;
>> > +}
>>
>> Hmmm, I suspect this may be seen as regression by those who build
>> Git without ICONV for performance, knowing that there is nothing in
>> their data that requires character set conversion.
>
> I don't think it matters that much.

Yeah, I think I agree.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
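For readers following the tristate idea, here is a minimal sketch of how such a setting could behave. It is not code from Git or from this patch: the enum, parse_encoding_advice() and report_encoding_failure() are hypothetical names, advice.encodingFailure is only the setting floated in this thread, and the sketch writes to stderr directly rather than using Git's warning().

```c
#include <stdio.h>
#include <strings.h>

/* Hypothetical tristate for an advice.encodingFailure-style setting. */
enum encoding_advice {
	ENCODING_ADVICE_NONE,	/* "false": stay silent */
	ENCODING_ADVICE_ONCE,	/* "true": one generic warning per program */
	ENCODING_ADVICE_ALL	/* "all": one specific warning per failure */
};

/* Map a config string to the tristate; unset or unknown values mean ONCE. */
static enum encoding_advice parse_encoding_advice(const char *value)
{
	if (!value || !strcasecmp(value, "true"))
		return ENCODING_ADVICE_ONCE;
	if (!strcasecmp(value, "false"))
		return ENCODING_ADVICE_NONE;
	if (!strcasecmp(value, "all"))
		return ENCODING_ADVICE_ALL;
	return ENCODING_ADVICE_ONCE;
}

/*
 * Report a failed conversion according to the mode.
 * Returns 1 if a warning was printed, 0 if it was suppressed.
 */
static int report_encoding_failure(enum encoding_advice mode,
				   const char *from, const char *to,
				   const char *reason)
{
	static int warned_once;

	switch (mode) {
	case ENCODING_ADVICE_ALL:
		fprintf(stderr, "warning: unable to convert %s to %s: %s\n",
			from, to, reason);
		return 1;
	case ENCODING_ADVICE_ONCE:
		if (warned_once)
			return 0;
		warned_once = 1;
		fprintf(stderr, "warning: some encodings could not be converted\n");
		return 1;
	default:
		return 0;
	}
}
```

With the "true" mode, a "git log --oneline" run over many commits would print the generic message at most once, while "all" keeps the per-commit detail for people hunting down which encodings their system lacks.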
Re: [PATCH v2] utf8.c: print warning about iconv errors
On Fri, Aug 14, 2015 at 03:35:58PM -0700, Junio C Hamano wrote:

> Max Kirillov writes:
>
> > * do not limit number of warnings - not worth complicating the code
>
> Unless the warning leads to a quick "die()", wouldn't this make Git
> unusable by spewing a "falling back to verbatim copy" for each and
> every line of the message of a commit that has 'encoding' element in
> its header in the "git log" output, no?

We only do the reencode once per commit. So it would be once per commit
rather than once per line. Which still sounds kind of annoying, if you
are using "git log --oneline" or similar.

I think I'd favor a single warning in general, along the lines of
"some encodings could not be converted". But of course if you are trying
to figure out _which_ encodings your system doesn't have, that's not
very helpful. Maybe we could have an advice.encodingFailure config flag
with a tristate:

  - false: don't spew any warnings

  - true: give a generic warning once per program

  - all: give a specific warning for each case, like "unable to convert
    EUC-JP to UTF-8: iconv_open: Invalid argument". (Sadly EINVAL is
    what iconv_open seems to return when it doesn't know about a
    particular encoding; it may be nicer to translate to something more
    reasonable than what strerror() provides).

> > +char *reencode_string_len(const char *in, int insz,
> > +			  const char *out_encoding, const char *in_encoding,
> > +			  int *outsz)
> > +{
> > +	if (!same_encoding(in_encoding, out_encoding))
> > +		warning("Iconv support is disabled at compile time. It is likely that\nincorrect data will be printed or stored in repository.\nConsider using other build for this task.");
> > +	return NULL;
> > +}
>
> Hmmm, I suspect this may be seen as regression by those who build
> Git without ICONV for performance, knowing that there is nothing in
> their data that requires character set conversion.

I don't think it matters that much. The obvious tight loop is
logmsg_reencode, and it already checks same_encoding (because it really
wants to avoid reallocation in the first place if it can). So anybody
who cares about the performance of reencode_string_len would do better
to optimize out any calls to it. :)

If anything, we could make same_encoding faster by memo-izing its ptrs,
like:

diff --git a/utf8.c b/utf8.c
index 28e6d76..50a8ac0 100644
--- a/utf8.c
+++ b/utf8.c
@@ -409,13 +409,26 @@ int is_encoding_utf8(const char *name)
 	return 0;
 }

-int same_encoding(const char *src, const char *dst)
+static int same_encoding_1(const char *src, const char *dst)
 {
+	warning("actually checking same_encoding(%s, %s)", src, dst);
 	if (is_encoding_utf8(src) && is_encoding_utf8(dst))
 		return 1;
 	return !strcasecmp(src, dst);
 }

+int same_encoding(const char *src, const char *dst)
+{
+	static const char *cached_src, *cached_dst;
+	static int cached_ret = -1;
+
+	if (src == cached_src && dst == cached_dst && cached_ret >= 0)
+		return cached_ret;
+	cached_src = src;
+	cached_dst = dst;
+	return cached_ret = same_encoding_1(src, dst);
+}
+
 /*
  * Wrapper for fprintf and returns the total number of columns required
  * for the printed string, assuming that the string is utf8.

But I couldn't measure any real speedup on "git log --oneline" from
doing so. It's also kind of gross (it will yield the wrong answer if you
write a different encoding to the same buffer; I don't think we do that,
but it's quite a gotcha).

Another approach would be to preserve NULL encodings (which we treat as
utf8) through the code base more. The common case of utf8 should be a
quick check for two NULLs, then. Unfortunately just teaching
get_commit_encoding and get_log_output_encoding to return NULL isn't
enough. Some parts of the code want to output the actual value (e.g.,
format-patch for a charset header), and would need to be adjusted.

Given that I couldn't measure any speedup, I don't think it's worth
pursuing, though.

-Peff
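The gotcha with a cache keyed on pointer identity can be shown in isolation. The following is a standalone sketch, not Git code: is_utf8_name(), same_encoding_uncached(), same_encoding_cached() and the slow_path_calls counter are hypothetical stand-ins that only mirror the shape of the functions discussed in this message.

```c
#include <string.h>
#include <strings.h>

/* Counts how often the real comparison runs (illustration only). */
static int slow_path_calls;

/* Simplified stand-in: NULL and the usual spellings count as UTF-8. */
static int is_utf8_name(const char *name)
{
	return !name || !strcasecmp(name, "utf-8") || !strcasecmp(name, "utf8");
}

/* The real comparison; callers are assumed to pass non-NULL names. */
static int same_encoding_uncached(const char *src, const char *dst)
{
	slow_path_calls++;
	if (is_utf8_name(src) && is_utf8_name(dst))
		return 1;
	return !strcasecmp(src, dst);
}

/*
 * Memoized on the pointer values only. If the caller rewrites the
 * buffer behind 'src' or 'dst', the cached answer is stale.
 */
static int same_encoding_cached(const char *src, const char *dst)
{
	static const char *cached_src, *cached_dst;
	static int cached_ret = -1;

	if (src == cached_src && dst == cached_dst && cached_ret >= 0)
		return cached_ret;
	cached_src = src;
	cached_dst = dst;
	return cached_ret = same_encoding_uncached(src, dst);
}
```

If a caller passes a writable buffer holding "utf-8", then overwrites that same buffer with "latin1" and calls same_encoding_cached() again with the same pointers, the cache still answers "same" while the uncached function answers "different". That is the "different encoding in the same buffer" case described above.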
Re: [PATCH v2] utf8.c: print warning about iconv errors
Max Kirillov writes:

> * do not limit number of warnings - not worth complicating the code

Unless the warning leads to a quick "die()", wouldn't this make Git
unusable by spewing a "falling back to verbatim copy" for each and
every line of the message of a commit that has 'encoding' element in
its header in the "git log" output, no?

I suspect that this may be a huge mistake.

> +char *reencode_string_len(const char *in, int insz,
> +			  const char *out_encoding, const char *in_encoding,
> +			  int *outsz)
> +{
> +	if (!same_encoding(in_encoding, out_encoding))
> +		warning("Iconv support is disabled at compile time. It is likely that\nincorrect data will be printed or stored in repository.\nConsider using other build for this task.");
> +	return NULL;
> +}

Hmmm, I suspect this may be seen as regression by those who build
Git without ICONV for performance, knowing that there is nothing in
their data that requires character set conversion.

We'd call same_encoding() every time, which would involve a few
strcasecmp() calls. Originally, we didn't even have a function call
overhead.
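The overhead in question can be sketched by putting the two kinds of stub side by side. This is an illustration under assumptions, not the actual Git source: reencode_stub_old stands in for a pre-patch stub that costs no call at all (as the remark about "not even a function call overhead" suggests), reencode_stub_new mirrors the shape of the stub quoted above, and is_utf8_name(), same_encoding_sketch() and stub_warnings are hypothetical helpers.

```c
#include <stddef.h>
#include <stdio.h>
#include <strings.h>

/* Stand-in for the pre-patch behaviour: no call, just a constant. */
#define reencode_stub_old(in, insz, out_enc, in_enc, outsz) NULL

/* Counts emitted warnings (illustration only). */
static int stub_warnings;

/* Simplified stand-in: NULL and the usual spellings count as UTF-8. */
static int is_utf8_name(const char *name)
{
	return !name || !strcasecmp(name, "utf-8") || !strcasecmp(name, "utf8");
}

/* Up to a handful of strcasecmp() calls per invocation. */
static int same_encoding_sketch(const char *src, const char *dst)
{
	if (is_utf8_name(src) && is_utf8_name(dst))
		return 1;
	if (!src || !dst)
		return 0;
	return !strcasecmp(src, dst);
}

/* Post-patch shape: a real call plus the comparison, then NULL. */
static char *reencode_stub_new(const char *in, int insz,
			       const char *out_enc, const char *in_enc,
			       int *outsz)
{
	(void)in;
	(void)insz;
	(void)outsz;
	if (!same_encoding_sketch(in_enc, out_enc)) {
		stub_warnings++;
		fprintf(stderr, "warning: iconv support is disabled at compile time\n");
	}
	return NULL;
}
```

Both variants return NULL to the caller; the difference is only the comparison work and the warning that the new stub performs when the two encodings differ.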
[PATCH v2] utf8.c: print warning about iconv errors
If reencoding text data from one encoding to another fails, the
original version is used instead. Currently there is no warning about
the failed reencoding, so the returned data may be incorrect without
the user being aware of it.

Add a warning when conversion fails.

Also add a test script to assert that the warning is actually printed
and the output is not changed, as expected.

Signed-off-by: Max Kirillov
---
Changes since v1:
* rebase to recent changes
* add handling of runtime errors
* add test
* do not limit number of warnings - not worth complicating the code
* noticed that an incomplete utf8 sequence in input is silently treated
  as latin1, so mark the testcase as expect_failure. Actually, it's
  quite surprising; it would be nice if somebody tried it in various
  environments

Actually, as far as I could grep, all uses of the reencoding happen
only for printing, so probably it is not that important.

 t/t3911-show-reencode.sh | 46 ++
 utf8.c | 24 +++-
 utf8.h | 7 ++-
 3 files changed, 71 insertions(+), 6 deletions(-)
 create mode 100755 t/t3911-show-reencode.sh

diff --git a/t/t3911-show-reencode.sh b/t/t3911-show-reencode.sh
new file mode 100755
index 000..061d820
--- /dev/null
+++ b/t/t3911-show-reencode.sh
@@ -0,0 +1,46 @@
+#!/bin/sh
+
+test_description='reencoding'
+
+. ./test-lib.sh
+
+printf '\304\201\n' >a_macron_utf8
+printf '\303\244\n' >a_diaeresis_utf8
+printf '\303\244\304\n' >incomplete_utf8
+printf '\344\n' >a_diaeresis_latin1
+
+test_expect_success 'setup' '
+	git commit --allow-empty -F a_diaeresis_utf8 &&
+	git tag latin1_utf8 &&
+	git commit --allow-empty -F a_macron_utf8 &&
+	git tag extended_utf8 &&
+	git commit --allow-empty -F incomplete_utf8 &&
+	git tag invalid_utf8
+'
+
+test_expect_success 'encoding to latin1' '
+	git log --encoding=latin1 --pretty=format:%B -1 latin1_utf8 >out 2>err &&
+	test_must_be_empty err &&
+	test_cmp out a_diaeresis_latin1
+'
+
+test_expect_success 'unknown encoding' '
+	git log --encoding=no-encoding --pretty=format:%B -1 latin1_utf8 >out 2>err &&
+	grep -q "not supported" err &&
+	test_cmp out a_diaeresis_utf8
+'
+
+# apparently incomplete UTF8 byte sequences silently treated as latin1
+test_expect_failure 'incomplete utf8' '
+	git log --encoding=latin1 --pretty=format:%B -1 invalid_utf8 >out 2>err &&
+	grep -q "Invalid input" err &&
+	test_cmp out incomplete_utf8
+'
+
+test_expect_success 'does not fit into latin1' '
+	git log --encoding=latin1 --pretty=format:%B -1 extended_utf8 >out 2>err &&
+	grep -q "Invalid input" err &&
+	test_cmp out a_macron_utf8
+'
+
+test_done
diff --git a/utf8.c b/utf8.c
index 28e6d76..d284bb0 100644
--- a/utf8.c
+++ b/utf8.c
@@ -465,7 +465,9 @@ char *reencode_string_iconv(const char *in, size_t insz, iconv_t conv, int *outs
 		if (cnt == (size_t) -1) {
 			size_t sofar;
 			if (errno != E2BIG) {
+				int failure_errno = errno;
 				free(out);
+				errno = failure_errno;
 				return NULL;
 			}
 			/* insz has remaining number of bytes.
@@ -513,14 +515,34 @@ char *reencode_string_len(const char *in, int insz,
 		if (is_encoding_utf8(out_encoding))
 			out_encoding = "UTF-8";
 		conv = iconv_open(out_encoding, in_encoding);
-		if (conv == (iconv_t) -1)
+		if (conv == (iconv_t) -1) {
+			if (errno == EINVAL)
+				warning("Conversion from %s to %s not supported, falling back to verbatim copy", in_encoding, out_encoding);
+			else
+				warning("Conversion from %s to %s failed: %s, falling back to verbatim copy", in_encoding, out_encoding, strerror(errno));
 			return NULL;
+		}
 	}

 	out = reencode_string_iconv(in, insz, conv, outsz);
+	if (out == NULL) {
+		if (errno == EILSEQ || errno == EINVAL)
+			warning("Invalid input for conversion from %s to %s, falling back to verbatim copy", in_encoding, out_encoding);
+		else
+			warning("Conversion from %s to %s failed: %s, falling back to verbatim copy", in_encoding, out_encoding, strerror(errno));
+	}
 	iconv_close(conv);
 	return out;
 }
+#else
+char *reencode_string_len(const char *in, int insz,
+			  const char *out_encoding, const char *in_encoding,
+			  int *outsz)
+{
+	if (!same_encoding(in_encoding, out_encoding))
+		warning("Iconv support is disabled at compile time. It is likely