> On 11.10.2013 17:19, Bob Archer wrote:
> >> On 11.10.2013 16:55, Bob Archer wrote:
> >>>> On 11.10.2013 15:58, Bob Archer wrote:
> >>>>>> On Thu, Oct 10, 2013 at 5:49 PM, Bob Archer <[email protected]> wrote:
> >>>>>> I assume he was asking how to "fix" the blame. Cause, sure, he
> >>>>>> could open the file, convert it back to UTF-8 with CRLF line
> >>>>>> endings... and commit it... of course, now blame is going to show
> >>>>>> him on every line, since he just changed every line.
> >>>>>>
> >>>>>> That's exactly what I meant. You're correct with how the blame
> >>>>>> is handled. I committed the UTF-8 copy to a test branch, diff'd,
> >>>>>> and it showed every line as being changed. Unfortunately it
> >>>>>> looks like this is our best option.
> >>>>> Yep, we have done the same thing. As a matter of fact, I just over
> >>>>> the past few days rescripted all our database scripts to be UTF-8
> >>>>> since merging them just doesn't work correctly when they are
> >>>>> UTF-16 even if you remove the binary mime type.
> >>>>>> On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser <[email protected]> wrote:
> >>>>>> At current blame is not UTF-16 aware.
> >>>>> It's not just blame that isn't... the diff engine, or whatever
> >>>>> detects file types always considers UTF-16 files to be binary.
> >>>>> If you "add" a UTF-16 file you see that svn adds the
> >>>>> application/octet-stream mime type. There is an issue in the bug
> >>>>> database about this from when I reported/complained about it...
> >>>>> however it hasn't been addressed. I'm surprised still at this
> >>>>> time that svn still can't support UTF-16 text files as text wrt
> >>>>> adding, diffing, blaming, etc.
> >>>> It's quite simple: no-one has written the necessary code. While I
> >>>> can understand it's an interesting feature for Windows users, most
> >>>> Subversion developers have other things to do.
> >>>> This being a volunteer project, and most of us do not use
> >>>> Windows, you can hardly expect anyone to spend several weeks on
> >>>> solving a problem that has a perfectly simple workaround. Since
> >>>> UTF-8 and UTF-16 can be interchanged without data loss, there are
> >>>> other, much more important things to do in Subversion.
> >>> I appreciate all that you said. I didn't expect that UTF-16 was so
> >>> uncommon in non-Windows OSes. A large number of dev tools that I
> >>> work with on Windows, especially the Microsoft tools, default to
> >>> creating UTF-16 files.
> >>> I disagree with your "can be converted without data loss". If you
> >>> need UTF-16 then you need it. Also, if you are working in an
> >>> international team and you have developers with other language
> >>> OSes which have different code pages then what you see when you
> >>> look at a UTF-8 file might be different than what I see.
> >>
> >> I don't follow. Both UTF-16 and UTF-8 are complete representations of
> >> the Unicode character set. Exactly the same code sequences can be
> >> represented in both encodings. You can convert from UTF-16 to UTF-8
> >> and back and get exactly the same sequence of bytes.
> >>
> > Ok, I have to back pedal here a bit. You are correct, UTF-8 is a
> > Unicode format and can store all characters. It's not a UTF-8 vs
> > UTF-16 issue (Friday senior moment). What I recall being told by one
> > of the subversion developers was that subversion only supported the
> > ASCII character set and while UTF-8 was compatible with ASCII it
> > didn't truly support Unicode files.
> >
> > However, this blog entry seems to dispute that:
> >
> > http://rhubbarb.wordpress.com/2012/04/28/svn-unicode/
> >
> > Would adding that mime-type to this file fix the blame issues this
> > user is seeing?
>
> I think the user is just very lucky. Subversion does not actually try
> to interpret the svn:mime-type property, other than to determine
> whether to treat a file as text or binary.
> (The comment is correct in that the proper parameter is charset=,
> not encoding=, but that's not important for this discussion).
>
> Subversion's merge algorithm depends on being able to detect line
> endings in the file, and always scans the file as a sequence of bytes.
> There are several ways to represent line endings in a UTF-16 file
> (shown here as hex byte sequences):
>
> * 00 0A (Unix newline, UTF16-BE)
> * 00 0D 00 0A (Windows newline, UTF16-BE)
> * 0A 00 (Unix newline, UTF16-LE)
> * 0D 00 0A 00 (Windows newline, UTF16-LE)
> * 24 24 (Unicode newline, same in LE and BE)
>
> Subversion, however, expects one of the following newline sequences:
>
> * 0A (Unix newline)
> * 0D 0A (Windows newline)
>
> My best guess as to what's happening is that the 0A bytes, a.k.a. the
> ASCII newline character, are interpreted as the end-of-line markers,
> and the zero bytes are treated as part of the text. In most cases, the
> result will be close to correct, as long as there are no conflicts in
> the merge -- because Subversion will not emit conflict markers in
> UTF-16.
>
> Of course, if someone used the U+2424 newline code point instead, then
> in the worst case, the whole file would be interpreted as a single
> line.
>
> -- Brane
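To make the byte-level picture above concrete, here is a small Python sketch. It is not Subversion's code, only an illustration of what a scanner that splits on the 0A byte would see in a UTF-16 file; the sample text and variable names are made up for the example.

```python
# Illustration only: NOT Subversion's implementation, just a byte-level
# view of the newline sequences listed in the quoted message.
text = "ab\ncd\n"

utf8 = text.encode("utf-8")
utf16le = text.encode("utf-16-le")  # explicit byte order, no BOM
utf16be = text.encode("utf-16-be")  # explicit byte order, no BOM

# Round trip: converting between UTF-16 and UTF-8 gives back the same bytes.
assert utf16le.decode("utf-16-le").encode("utf-8") == utf8
assert utf8.decode("utf-8").encode("utf-16-le") == utf16le

# A byte-oriented scanner that splits on 0x0A leaves the NUL halves of the
# 16-bit code units attached to the neighbouring "lines".
lines_le = utf16le.split(b"\n")
lines_be = utf16be.split(b"\n")
print(lines_le)
print(lines_be)

# If U+2424 is the only line terminator, there is no 0x0A byte at all,
# so the scanner would see one single line.
sym = "ab\u2424cd\u2424".encode("utf-16-le")
print(b"\n" in sym)
```

In the little-endian case every "line" after the first starts with a stray 00 byte, which matches the guess that the zero bytes end up treated as part of the text.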
Great information, thanks for that. The bottom line is: use UTF-8 for your text files and svn will be happy and work correctly.

How hard would it be to add a warning on "svn add" when a file looks like UTF-16, saying it should be converted to UTF-8 or else it will be treated as a binary file?

Bob
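For what it's worth, the "looks like UTF-16" check could be a fairly small heuristic. The Python below is only a rough sketch of the idea, not anything that exists in svn; the function name looks_like_utf16, the 4096-byte sample size and the 80%/10% thresholds are all assumptions picked for illustration.

```python
import codecs


def looks_like_utf16(data: bytes) -> bool:
    """Heuristic guess (hypothetical helper): is this probably UTF-16 text?"""
    # A byte order mark is the strongest hint. (UTF-32-LE starts with the
    # same two bytes; this sketch does not try to tell them apart.)
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return True

    # Without a BOM: mostly-ASCII UTF-16 text has a NUL in every other byte.
    sample = data[:4096]  # assumed sample size
    if len(sample) < 4:
        return False
    even_nuls = sample[0::2].count(0)
    odd_nuls = sample[1::2].count(0)
    half = len(sample) // 2

    # Assumed thresholds: NULs dominate one byte position and are nearly
    # absent from the other.
    big_endian = even_nuls >= 0.8 * half and odd_nuls <= 0.1 * half
    little_endian = odd_nuls >= 0.8 * half and even_nuls <= 0.1 * half
    return big_endian or little_endian


if __name__ == "__main__":
    for enc in ("utf-8", "utf-16", "utf-16-le", "utf-16-be"):
        print(enc, looks_like_utf16("hello world\n".encode(enc)))
```

The NUL-pattern test only works for text that is mostly ASCII; a BOM-less UTF-16 file full of non-Latin text would slip past it, so a real warning would probably want to be conservative about what it claims.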
