On Wed, Jun 17, 2015 at 07:07:48PM +0200, Jan-Philip Gehrcke wrote:

> The two-option scenario is totally clear. Although one must stress that the
> "error-out" option can, as discussed, be kept minimally invasive: it is
> sufficient (and common) to just skip those byte sequences (and replace them
> with a replacement symbol) that would be invalid in the requested output
> encoding. This would retain as much information as possible while
> guaranteeing a subsequent decoder to retrieve valid input.

I think "munge into valid UTF-8, even if it means losing data" is a
totally valid and useful option. I'm not completely sure that git should
do that, though.  E.g., you could just as easily do:

  git log --encoding=utf8 | drop_invalid_utf8 | your_script

Or quite possibly, your_script could do the munging itself while reading
the data. I do not know much about Python's input handling, but in Perl,
it is easy to say "the input is utf8, and replace anything bogus with a
substitution character"[1].
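
For the Python side, a minimal sketch of the same idea might look like the
following. It assumes the script reads raw bytes from stdin; the names
decode_lossy and drop_invalid are made up for illustration, and drop_invalid
is only a guess at what a drop_invalid_utf8 filter could do. The only library
feature relied on is the "errors" argument of bytes.decode().

```python
import sys


def decode_lossy(raw: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for anything invalid."""
    return raw.decode("utf-8", errors="replace")


def drop_invalid(raw: bytes) -> bytes:
    """Hypothetical drop_invalid_utf8: silently discard invalid bytes."""
    return raw.decode("utf-8", errors="ignore").encode("utf-8")


if __name__ == "__main__":
    # Read bytes rather than text so that Python does not decode the
    # input with locale defaults before we get a chance to munge it.
    for line in sys.stdin.buffer:
        sys.stdout.buffer.write(decode_lossy(line).encode("utf-8"))
```

Saved as, say, munge.py (a hypothetical name), this could sit in the pipeline
as "git log --encoding=utf8 | python3 munge.py". The "replace" handler keeps a
visible marker where data was lost, while "ignore" drops it without a trace.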

> Should we
> 
> * just make this more clear in the docs and/or
> * should we adjust the behavior of --encoding or
> * should we do something entirely different, like adding a new command line
> option or
> * should we just leave things as they are?

I would vote for a documentation change, perhaps like:

Subject: docs: clarify that --encoding can produce invalid sequences

In the common case that the commit encoding matches the
output encoding, we do not touch the buffer at all, which
makes things much more efficient. But it might be unclear to
a consumer that we will pass through bogus sequences.

Signed-off-by: Jeff King <p...@peff.net>
---
 Documentation/pretty-options.txt | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/Documentation/pretty-options.txt b/Documentation/pretty-options.txt
index 74aa01a..642af6e 100644
--- a/Documentation/pretty-options.txt
+++ b/Documentation/pretty-options.txt
@@ -37,7 +37,10 @@ people using 80-column terminals.
        in their encoding header; this option can be used to tell the
        command to re-code the commit log message in the encoding
        preferred by the user.  For non plumbing commands this
-       defaults to UTF-8.
+       defaults to UTF-8. Note that if an object claims to be encoded
+       in `X` and we are outputting in `X`, we will output the object
+       verbatim; this means that invalid sequences in the original
+       commit may be copied to the output.
 
 --notes[=<ref>]::
        Show the notes (see linkgit:git-notes[1]) that annotate the
-- 
2.4.4.719.g3984bc6
