Re: SVN Blame Returns Corrupt Data
On Thu, Oct 10, 2013 at 5:49 PM, Bob Archer bob.arc...@amsi.com wrote:
I assume he was asking how to fix the blame. Cause, sure, he could open the file, convert it back to UTF-8 with CRLF line endings... and commit it... of course, now blame is going to show him on every line, since he just changed every line.

That's exactly what I meant. You're correct with how the blame is handled. I committed the UTF-8 copy to a test branch, diff'd, and it showed every line as being changed. Unfortunately it looks like this is our best option.

On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser b...@reser.org wrote:
At current blame is not UTF-16 aware. About a year ago there was a patch (actually they just reposted an entire copy of blame.c) posted that helped, but it really didn't go anywhere since the original poster didn't continue following up. Perhaps you'll find the above useful. Patches are of course welcome.

I'll take a look at that this weekend. Thanks for all the input everyone.

On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser b...@reser.org wrote:
On 10/10/13 12:17 PM, T.J. Perovich wrote:
On Thu, Oct 10, 2013 at 2:27 PM, Ryan Schmidt subversion-20...@ryandesign.com wrote:
Sounds like you've converted the file from UTF-8 to UTF-16.

Thanks, you're absolutely right. It changed from UTF-8 to UTF-16LE. Any idea how to go about fixing it elegantly? We have about 3 months of commits since this happened. Diff's in the GUIs have been working fine and we don't blame too often which is why it was never noticed.

At current blame is not UTF-16 aware. About a year ago there was a patch (actually they just reposted an entire copy of blame.c) posted that helped, but it really didn't go anywhere since the original poster didn't continue following up.
https://mail-archives.apache.org/mod_mbox/subversion-dev/201207.mbox/%3CCAAF0CB13B282344AF95AD2DE3D1962215627A3C%40DAG-B.nexon.corp%3E
and the followup
https://mail-archives.apache.org/mod_mbox/subversion-dev/201208.mbox/%3ccab84ubvvrhffqyea5pf5gstmpxz+rh2jkvdvcqscocjv+rq...@mail.gmail.com%3E
Perhaps you'll find the above useful. Patches are of course welcome.
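For anyone wanting to script the workaround discussed above (convert the file back to UTF-8 with CRLF line endings, then commit), here is a minimal Python sketch. It is not part of Subversion; the function name `utf16_to_utf8_crlf` is made up for illustration, and the fallback to UTF-16LE when no BOM is present is an assumption based on what was reported in this thread.

```python
import codecs


def utf16_to_utf8_crlf(data: bytes) -> bytes:
    """Transcode UTF-16 bytes to UTF-8 and normalize line endings to CRLF."""
    if data.startswith(codecs.BOM_UTF16_LE):
        text = data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    elif data.startswith(codecs.BOM_UTF16_BE):
        text = data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    else:
        # Assumption: no BOM means UTF-16LE, as reported in this thread.
        text = data.decode("utf-16-le")
    # Collapse every line-ending style to LF first, then expand to CRLF,
    # so existing CRLF pairs are not doubled.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.replace("\n", "\r\n").encode("utf-8")
```

One would read the working-copy file in binary mode, pass the bytes through this function, and write the result back before committing. As noted above, blame will still attribute every line to that commit.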
RE: SVN Blame Returns Corrupt Data
On Thu, Oct 10, 2013 at 5:49 PM, Bob Archer bob.arc...@amsi.com wrote: I assume he was asking how to fix the blame. Cause, sure, he could open the file, convert it back to UTF-8 with CRLF line endings... and commit it... of course, now blame is going to show him on every line, since he just changed every line. That's exactly what I meant. You're correct with how the blame is handled. I committed the UTF-8 copy to a test branch, diff'd, and it showed every line as being changed. Unfortunately it looks like this is our best option. Yep, we have done the same thing. As a matter of fact, I just over the past few days rescripted all our database scripts to be UTF-8 since merging them just doesn't work correctly when they are UTF-16 even if you remove the binary mime type. On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser b...@reser.org wrote: At current blame is not UTF-16 aware. It's not just blame that isn't... the diff engine, or whatever detects file types always considers UTF-16 files to be binary. If you add a UTF-16 file you see that svn adds the application/octet-stream mime type. There is an issue in the bug database about this from when I reported/complained about it... however it hasn't been addressed. I'm surprised still at this time that svn still can't support UTF-16 text files as text wrt adding, diffing, blaming, etc. BOb About a year ago there was a patch (actually they just reposted an entire copy of blame.c) posted that helped, but it really didn't go anywhere since the original poster didn't continue following up. Perhaps you'll find the above useful. Patches are of course welcome. I'll take a look at that this weekend. Thanks for all the input everyone. On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser b...@reser.org wrote: On 10/10/13 12:17 PM, T.J. Perovich wrote: On Thu, Oct 10, 2013 at 2:27 PM, Ryan Schmidt subversion-20...@ryandesign.com mailto:subversion- 20...@ryandesign.com wrote: Sounds like you've converted the file from UTF-8 to UTF-16. 
Thanks, you're absolutely right. It changed from UTF-8 to UTF-16LE. Any idea how to go about fixing it elegantly? We have about 3 months of commits since this happened. Diff's in the GUIs have been working fine and we don't blame too often which is why it was never noticed. At current blame is not UTF-16 aware. About a year ago there was a patch (actually they just reposted an entire copy of blame.c) posted that helped, but it really didn't go anywhere since the original poster didn't continue following up. https://mail-archives.apache.org/mod_mbox/subversion-dev/201207.mbox/%3CCAAF0CB13B282344AF95AD2DE3D1962215627A3C%40DAG-B.nexon.corp%3E and the followup https://mail-archives.apache.org/mod_mbox/subversion-dev/201208.mbox/%3CCAB84uBVVrHFfQyEA5pF5gStMpXz+RH2jKvdvCQsCOcjv+rq...@mail.gmail.com%3E Perhaps you'll find the above useful. Patches are of course welcome.
Re: SVN Blame Returns Corrupt Data
On 11.10.2013 15:58, Bob Archer wrote:
On Thu, Oct 10, 2013 at 5:49 PM, Bob Archer bob.arc...@amsi.com wrote:
I assume he was asking how to fix the blame. Cause, sure, he could open the file, convert it back to UTF-8 with CRLF line endings... and commit it... of course, now blame is going to show him on every line, since he just changed every line.

That's exactly what I meant. You're correct with how the blame is handled. I committed the UTF-8 copy to a test branch, diff'd, and it showed every line as being changed. Unfortunately it looks like this is our best option.

Yep, we have done the same thing. As a matter of fact, I just over the past few days rescripted all our database scripts to be UTF-8 since merging them just doesn't work correctly when they are UTF-16 even if you remove the binary mime type.

On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser b...@reser.org wrote:
At current blame is not UTF-16 aware.

It's not just blame that isn't... the diff engine, or whatever detects file types always considers UTF-16 files to be binary. If you add a UTF-16 file you see that svn adds the application/octet-stream mime type. There is an issue in the bug database about this from when I reported/complained about it... however it hasn't been addressed. I'm surprised still at this time that svn still can't support UTF-16 text files as text wrt adding, diffing, blaming, etc.

It's quite simple: no-one has written the necessary code. While I can understand it's an interesting feature for Windows users, most Subversion developers have other things to do. This being a volunteer project, and most of us do not use Windows, you can hardly expect anyone to spend several weeks on solving a problem that has a perfectly simple workaround. Since UTF-8 and UTF-16 can be interchanged without data loss, there are other, much more important things to do in Subversion.

To turn your argument around: I'm surprised no Windows user has yet written a patch for Subversion to make it support UTF-16 ...
-- Brane -- Branko Čibej | Director of Subversion WANdisco // Non-Stop Data e. br...@wandisco.com
RE: SVN Blame Returns Corrupt Data
On 11.10.2013 15:58, Bob Archer wrote: On Thu, Oct 10, 2013 at 5:49 PM, Bob Archer bob.arc...@amsi.com wrote: I assume he was asking how to fix the blame. Cause, sure, he could open the file, convert it back to UTF-8 with CRLF line endings... and commit it... of course, now blame is going to show him on every line, since he just changed every line. That's exactly what I meant. You're correct with how the blame is handled. I committed the UTF-8 copy to a test branch, diff'd, and it showed every line as being changed. Unfortunately it looks like this is our best option. Yep, we have done the same thing. As a matter of fact, I just over the past few days rescripted all our database scripts to be UTF-8 since merging them just doesn't work correctly when they are UTF-16 even if you remove the binary mime type. On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser b...@reser.org wrote: At current blame is not UTF-16 aware. It's not just blame that isn't... the diff engine, or whatever detects file types always considers UTF-16 files to be binary. If you add a UTF-16 file you see that svn adds the application/octet-stream mime type. There is an issue in the bug database about this from when I reported/complained about it... however it hasn't been addressed. I'm surprised still at this time that svn still can't support UTF-16 text files as text wrt adding, diffing, blaming, etc. It's quite simple: no-one has written the necessary code. While I can understand it's an interesting feature for Windows users, most Subversion developers have other things to do. This being a volunteer project, and most of us do not use Windows, you can hardly expect anyone to spend several weeks on solving a problem that has a perfectly simple workaround. Since UFT-8 and UTF-16 can be interchanged without data loss, there are other, much more important things to do in Subversion. I appreciate all that you said. I didn't expect that UTF-16 was so uncommon in non-Windows OSes. 
A large number of dev tools that I work with on Windows, especially the Microsoft tools, default to creating UTF-16 files. I disagree with your claim that they can be converted without data loss. If you need UTF-16 then you need it. Also, if you are working in an international team and you have developers with other-language OSes which have different code pages, then what you see when you look at a UTF-8 file might be different than what I see.

So, when I say I'm surprised, I only say that with the knowledge that the internet has made the world very flat, and I'm sure there is much more collaboration among people who use different languages and work on apps that need to deal with international languages, etc. I'm not dissing the devs in any way.

To turn your argument around: I'm surprised no Windows user has yet written a patch for Subversion to make it support UTF-16 ...

If I knew how to, I would. I work with C#, and while I'm sure C is similar, it is probably quite different in practice. If a svn dev would mentor me through it, and perhaps tell me what modules would need to be modified, I would be happy to take a whack at it.

BOb
Re: SVN Blame Returns Corrupt Data
On 11.10.2013 16:55, Bob Archer wrote: On 11.10.2013 15:58, Bob Archer wrote: On Thu, Oct 10, 2013 at 5:49 PM, Bob Archer bob.arc...@amsi.com wrote: I assume he was asking how to fix the blame. Cause, sure, he could open the file, convert it back to UTF-8 with CRLF line endings... and commit it... of course, now blame is going to show him on every line, since he just changed every line. That's exactly what I meant. You're correct with how the blame is handled. I committed the UTF-8 copy to a test branch, diff'd, and it showed every line as being changed. Unfortunately it looks like this is our best option. Yep, we have done the same thing. As a matter of fact, I just over the past few days rescripted all our database scripts to be UTF-8 since merging them just doesn't work correctly when they are UTF-16 even if you remove the binary mime type. On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser b...@reser.org wrote: At current blame is not UTF-16 aware. It's not just blame that isn't... the diff engine, or whatever detects file types always considers UTF-16 files to be binary. If you add a UTF-16 file you see that svn adds the application/octet-stream mime type. There is an issue in the bug database about this from when I reported/complained about it... however it hasn't been addressed. I'm surprised still at this time that svn still can't support UTF-16 text files as text wrt adding, diffing, blaming, etc. It's quite simple: no-one has written the necessary code. While I can understand it's an interesting feature for Windows users, most Subversion developers have other things to do. This being a volunteer project, and most of us do not use Windows, you can hardly expect anyone to spend several weeks on solving a problem that has a perfectly simple workaround. Since UFT-8 and UTF-16 can be interchanged without data loss, there are other, much more important things to do in Subversion. I appreciate all that you said. 
I didn't expect that UTF-16 was so uncommon in non-Windows OSes. A large number of dev tools that I work with on Windows, especially the Microsoft tools default to creating UTF-16 files. I disagree with your can be converted without data loss. If you need UTF-16 then you need it. Also, if you are working in an international team and you have developers with other language Oss which have different code pages then what you see when you look at a UTF-8 file might be different than what I see. I don't follow. Both UTF-16 and UTF-8 are complete representations of the Unicode character set. Exactly the same code sequences can be represented in both encodings. You can convert from UTF-16 to UTF-8 and back and get exactly the same sequence of bytes. -- Brane -- Branko Čibej | Director of Subversion WANdisco // Non-Stop Data e. br...@wandisco.com
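The round-trip claim above is easy to check. The following Python sketch converts a UTF-16LE byte string to UTF-8 and back and compares the bytes. The function name and the sample string are illustrative only, and the check assumes well-formed UTF-16 input (no unpaired surrogates).

```python
def round_trip_utf16le(raw: bytes) -> bytes:
    """Convert UTF-16LE bytes to UTF-8 and back to UTF-16LE."""
    text = raw.decode("utf-16-le")      # UTF-16LE bytes -> code points
    as_utf8 = text.encode("utf-8")      # code points -> UTF-8 bytes
    return as_utf8.decode("utf-8").encode("utf-16-le")


# Sample covering ASCII, a Latin letter with caron, CJK characters,
# a code point outside the BMP (a surrogate pair in UTF-16), and CRLF.
sample = "\u010cibej \u4e2d\u6587 \U0001F600\r\n".encode("utf-16-le")
assert round_trip_utf16le(sample) == sample
```

The only thing a conversion might not preserve is a byte-order mark, which this sketch neither adds nor removes because it uses the explicit `utf-16-le` codec.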
RE: SVN Blame Returns Corrupt Data
On 11.10.2013 16:55, Bob Archer wrote: On 11.10.2013 15:58, Bob Archer wrote: On Thu, Oct 10, 2013 at 5:49 PM, Bob Archer bob.arc...@amsi.com wrote: I assume he was asking how to fix the blame. Cause, sure, he could open the file, convert it back to UTF-8 with CRLF line endings... and commit it... of course, now blame is going to show him on every line, since he just changed every line. That's exactly what I meant. You're correct with how the blame is handled. I committed the UTF-8 copy to a test branch, diff'd, and it showed every line as being changed. Unfortunately it looks like this is our best option. Yep, we have done the same thing. As a matter of fact, I just over the past few days rescripted all our database scripts to be UTF-8 since merging them just doesn't work correctly when they are UTF-16 even if you remove the binary mime type. On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser b...@reser.org wrote: At current blame is not UTF-16 aware. It's not just blame that isn't... the diff engine, or whatever detects file types always considers UTF-16 files to be binary. If you add a UTF-16 file you see that svn adds the application/octet-stream mime type. There is an issue in the bug database about this from when I reported/complained about it... however it hasn't been addressed. I'm surprised still at this time that svn still can't support UTF-16 text files as text wrt adding, diffing, blaming, etc. It's quite simple: no-one has written the necessary code. While I can understand it's an interesting feature for Windows users, most Subversion developers have other things to do. This being a volunteer project, and most of us do not use Windows, you can hardly expect anyone to spend several weeks on solving a problem that has a perfectly simple workaround. Since UFT-8 and UTF-16 can be interchanged without data loss, there are other, much more important things to do in Subversion. I appreciate all that you said. 
I didn't expect that UTF-16 was so uncommon in non-Windows OSes. A large number of dev tools that I work with on Windows, especially the Microsoft tools default to creating UTF-16 files. I disagree with your can be converted without data loss. If you need UTF- 16 then you need it. Also, if you are working in an international team and you have developers with other language Oss which have different code pages then what you see when you look at a UTF-8 file might be different than what I see. I don't follow. Both UTF-16 and UTF-8 are complete representations of the Unicode character set. Exactly the same code sequences can be represented in both encodings. You can convert from UTF-16 to UTF-8 and back and get exactly the same sequence of bytes. Ok, I have to back pedal here a bit. You are correct, UTF-8 is a Unicode format and can store all characters. It's not a UTF-8 vs UTF-16 issue (Friday senior moment). What I recall being told by one of the subversion developers was that subversion only supported the ASCII character set and while UTF-8 was compatible with ASCII it didn't truly support Unicode files. However, this blog entry seems to dispute that: http://rhubbarb.wordpress.com/2012/04/28/svn-unicode/ Would adding that mime-type to this file fix the blame issues this user is seeing? BOb
Re: SVN Blame Returns Corrupt Data
On 11.10.2013 17:19, Bob Archer wrote: On 11.10.2013 16:55, Bob Archer wrote: On 11.10.2013 15:58, Bob Archer wrote: On Thu, Oct 10, 2013 at 5:49 PM, Bob Archer bob.arc...@amsi.com wrote: I assume he was asking how to fix the blame. Cause, sure, he could open the file, convert it back to UTF-8 with CRLF line endings... and commit it... of course, now blame is going to show him on every line, since he just changed every line. That's exactly what I meant. You're correct with how the blame is handled. I committed the UTF-8 copy to a test branch, diff'd, and it showed every line as being changed. Unfortunately it looks like this is our best option. Yep, we have done the same thing. As a matter of fact, I just over the past few days rescripted all our database scripts to be UTF-8 since merging them just doesn't work correctly when they are UTF-16 even if you remove the binary mime type. On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser b...@reser.org wrote: At current blame is not UTF-16 aware. It's not just blame that isn't... the diff engine, or whatever detects file types always considers UTF-16 files to be binary. If you add a UTF-16 file you see that svn adds the application/octet-stream mime type. There is an issue in the bug database about this from when I reported/complained about it... however it hasn't been addressed. I'm surprised still at this time that svn still can't support UTF-16 text files as text wrt adding, diffing, blaming, etc. It's quite simple: no-one has written the necessary code. While I can understand it's an interesting feature for Windows users, most Subversion developers have other things to do. This being a volunteer project, and most of us do not use Windows, you can hardly expect anyone to spend several weeks on solving a problem that has a perfectly simple workaround. Since UFT-8 and UTF-16 can be interchanged without data loss, there are other, much more important things to do in Subversion. I appreciate all that you said. 
I didn't expect that UTF-16 was so uncommon in non-Windows OSes. A large number of dev tools that I work with on Windows, especially the Microsoft tools default to creating UTF-16 files. I disagree with your can be converted without data loss. If you need UTF- 16 then you need it. Also, if you are working in an international team and you have developers with other language Oss which have different code pages then what you see when you look at a UTF-8 file might be different than what I see. I don't follow. Both UTF-16 and UTF-8 are complete representations of the Unicode character set. Exactly the same code sequences can be represented in both encodings. You can convert from UTF-16 to UTF-8 and back and get exactly the same sequence of bytes. Ok, I have to back pedal here a bit. You are correct, UTF-8 is a Unicode format and can store all characters. It's not a UTF-8 vs UTF-16 issue (Friday senior moment). What I recall being told by one of the subversion developers was that subversion only supported the ASCII character set and while UTF-8 was compatible with ASCII it didn't truly support Unicode files. However, this blog entry seems to dispute that: http://rhubbarb.wordpress.com/2012/04/28/svn-unicode/ Would adding that mime-type to this file fix the blame issues this user is seeing? I think the user is just very lucky. Subversion does not actually try to interpret the svn:mime-type property, other than to determine whether to treat a file as text or binary. (The comment is correct in that the proper parameter is charset=, not encoding=, but that's not important for this discussion). Subversion's merge algorithm depends on being able to detect line endings in the file, and always scans the file as a sequence of bytes. 
There are several ways to represent line endings in a UTF-16 file (shown here as hex byte sequences):

* 00 0A (Unix newline, UTF16-BE)
* 00 0D 00 0A (Windows newline, UTF16-BE)
* 0A 00 (Unix newline, UTF16-LE)
* 0D 00 0A 00 (Windows newline, UTF16-LE)
* 24 24 (Unicode newline, same in LE and BE)

Subversion, however, expects one of the following newline sequences:

* 0A (Unix newline)
* 0D 0A (Windows newline)

My best guess as to what's happening is that the 0A bytes, a.k.a. the ASCII newline character, are interpreted as the end-of-line markers, and the zero bytes are treated as part of the text. In most cases, the result will be close to correct, as long as there are no conflicts in the merge -- because Subversion will not emit conflict markers in UTF-16.

Of course, if someone used the U+2424 newline code point instead, then in the worst case, the whole file would be interpreted as a single line.

-- Brane

--
Branko Čibej | Director of Subversion
WANdisco // Non-Stop Data
e. br...@wandisco.com
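To make the guess above concrete, here is a small Python sketch of what splitting UTF-16LE data on the single byte 0A looks like. This only imitates the byte-wise behaviour described in the message; it is not Subversion's code, and `split_on_lf_byte` is a hypothetical helper name.

```python
from typing import List


def split_on_lf_byte(data: bytes) -> List[bytes]:
    """Split raw bytes on 0x0A, ignoring the text encoding entirely."""
    return data.split(b"\x0a")


# "ab\r\ncd\r\n" in UTF-16LE: each character takes two bytes, so the
# Windows newline is the four bytes 0D 00 0A 00.
raw = "ab\r\ncd\r\n".encode("utf-16-le")
pieces = split_on_lf_byte(raw)

# The first "line" keeps the 0D 00 half of the newline at its end, and
# every following piece starts with the stray 00 that belonged to 0A 00.
assert pieces[0] == b"a\x00b\x00\r\x00"
assert pieces[1] == b"\x00c\x00d\x00\r\x00"
assert pieces[2] == b"\x00"

# With U+2424 as the separator there is no 0A byte at all, so the
# whole buffer comes back as a single piece.
assert len(split_on_lf_byte("ab\u2424cd".encode("utf-16-le"))) == 1
```

The line boundaries land in roughly the right places, which fits the "close to correct" observation, but each piece is misaligned by one byte relative to the UTF-16 code units.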
RE: SVN Blame Returns Corrupt Data
On 11.10.2013 17:19, Bob Archer wrote: On 11.10.2013 16:55, Bob Archer wrote: On 11.10.2013 15:58, Bob Archer wrote: On Thu, Oct 10, 2013 at 5:49 PM, Bob Archer bob.arc...@amsi.com wrote: I assume he was asking how to fix the blame. Cause, sure, he could open the file, convert it back to UTF-8 with CRLF line endings... and commit it... of course, now blame is going to show him on every line, since he just changed every line. That's exactly what I meant. You're correct with how the blame is handled. I committed the UTF-8 copy to a test branch, diff'd, and it showed every line as being changed. Unfortunately it looks like this is our best option. Yep, we have done the same thing. As a matter of fact, I just over the past few days rescripted all our database scripts to be UTF-8 since merging them just doesn't work correctly when they are UTF-16 even if you remove the binary mime type. On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser b...@reser.org wrote: At current blame is not UTF-16 aware. It's not just blame that isn't... the diff engine, or whatever detects file types always considers UTF-16 files to be binary. If you add a UTF-16 file you see that svn adds the application/octet-stream mime type. There is an issue in the bug database about this from when I reported/complained about it... however it hasn't been addressed. I'm surprised still at this time that svn still can't support UTF-16 text files as text wrt adding, diffing, blaming, etc. It's quite simple: no-one has written the necessary code. While I can understand it's an interesting feature for Windows users, most Subversion developers have other things to do. This being a volunteer project, and most of us do not use Windows, you can hardly expect anyone to spend several weeks on solving a problem that has a perfectly simple workaround. Since UFT-8 and UTF-16 can be interchanged without data loss, there are other, much more important things to do in Subversion. I appreciate all that you said. 
I didn't expect that UTF-16 was so uncommon in non-Windows OSes. A large number of dev tools that I work with on Windows, especially the Microsoft tools default to creating UTF-16 files. I disagree with your can be converted without data loss. If you need UTF- 16 then you need it. Also, if you are working in an international team and you have developers with other language Oss which have different code pages then what you see when you look at a UTF-8 file might be different than what I see. I don't follow. Both UTF-16 and UTF-8 are complete representations of the Unicode character set. Exactly the same code sequences can be represented in both encodings. You can convert from UTF-16 to UTF-8 and back and get exactly the same sequence of bytes. Ok, I have to back pedal here a bit. You are correct, UTF-8 is a Unicode format and can store all characters. It's not a UTF-8 vs UTF-16 issue (Friday senior moment). What I recall being told by one of the subversion developers was that subversion only supported the ASCII character set and while UTF-8 was compatible with ASCII it didn't truly support Unicode files. However, this blog entry seems to dispute that: http://rhubbarb.wordpress.com/2012/04/28/svn-unicode/ Would adding that mime-type to this file fix the blame issues this user is seeing? I think the user is just very lucky. Subversion does not actually try to interpret the svn:mime-type property, other than to determine whether to treat a file as text or binary. (The comment is correct in that the proper parameter is charset=, not encoding=, but that's not important for this discussion). Subversion's merge algorithm depends on being able to detect line endings in the file, and always scans the file as a sequence of bytes. 
There are several ways to represent line endings in a UTF-16 file (shown here as hex byte sequences): * 00 0A (Unix newline, UTF16-BE) * 00 0D 00 0A (Windows newline, UTF16-BE) * 0A 00 (Unix newline, UTF16-LE) * 0D 00 0A 00 (Windows newline, UTF16-LE) * 24 24 (Unicode newline, same in LE and BE) Subversion, however, expects one of the following newline sequences: * 0A (Unix newline) * 0D 0A (Windows newline) My best guess as to what's happening is that the 0A bytes, a.k.a. the ASCII newline character, are interpreted as the end-of-line markers, and the zero bytes are treated as part of the text. In most cases, the result will be close to correct, as long as there are no conflicts in the merge -- because Subversion will not emit conflict markers in UTF-16. Of course, if someone used the U+2424 newline code point instead, then in the worst case, the whole file would be interpreted as a single line. -- Brane Great information.. thanks for that. Bottom line is use UTF-8 for your text files and svn will be happy and work correctly. How hard would it be to create
Re: SVN Blame Returns Corrupt Data
On 11.10.2013 18:12, Bob Archer wrote: On 11.10.2013 17:19, Bob Archer wrote: On 11.10.2013 16:55, Bob Archer wrote: On 11.10.2013 15:58, Bob Archer wrote: On Thu, Oct 10, 2013 at 5:49 PM, Bob Archer bob.arc...@amsi.com wrote: I assume he was asking how to fix the blame. Cause, sure, he could open the file, convert it back to UTF-8 with CRLF line endings... and commit it... of course, now blame is going to show him on every line, since he just changed every line. That's exactly what I meant. You're correct with how the blame is handled. I committed the UTF-8 copy to a test branch, diff'd, and it showed every line as being changed. Unfortunately it looks like this is our best option. Yep, we have done the same thing. As a matter of fact, I just over the past few days rescripted all our database scripts to be UTF-8 since merging them just doesn't work correctly when they are UTF-16 even if you remove the binary mime type. On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser b...@reser.org wrote: At current blame is not UTF-16 aware. It's not just blame that isn't... the diff engine, or whatever detects file types always considers UTF-16 files to be binary. If you add a UTF-16 file you see that svn adds the application/octet-stream mime type. There is an issue in the bug database about this from when I reported/complained about it... however it hasn't been addressed. I'm surprised still at this time that svn still can't support UTF-16 text files as text wrt adding, diffing, blaming, etc. It's quite simple: no-one has written the necessary code. While I can understand it's an interesting feature for Windows users, most Subversion developers have other things to do. This being a volunteer project, and most of us do not use Windows, you can hardly expect anyone to spend several weeks on solving a problem that has a perfectly simple workaround. Since UFT-8 and UTF-16 can be interchanged without data loss, there are other, much more important things to do in Subversion. 
I appreciate all that you said. I didn't expect that UTF-16 was so uncommon in non-Windows OSes. A large number of dev tools that I work with on Windows, especially the Microsoft tools default to creating UTF-16 files. I disagree with your can be converted without data loss. If you need UTF- 16 then you need it. Also, if you are working in an international team and you have developers with other language Oss which have different code pages then what you see when you look at a UTF-8 file might be different than what I see. I don't follow. Both UTF-16 and UTF-8 are complete representations of the Unicode character set. Exactly the same code sequences can be represented in both encodings. You can convert from UTF-16 to UTF-8 and back and get exactly the same sequence of bytes. Ok, I have to back pedal here a bit. You are correct, UTF-8 is a Unicode format and can store all characters. It's not a UTF-8 vs UTF-16 issue (Friday senior moment). What I recall being told by one of the subversion developers was that subversion only supported the ASCII character set and while UTF-8 was compatible with ASCII it didn't truly support Unicode files. However, this blog entry seems to dispute that: http://rhubbarb.wordpress.com/2012/04/28/svn-unicode/ Would adding that mime-type to this file fix the blame issues this user is seeing? I think the user is just very lucky. Subversion does not actually try to interpret the svn:mime-type property, other than to determine whether to treat a file as text or binary. (The comment is correct in that the proper parameter is charset=, not encoding=, but that's not important for this discussion). Subversion's merge algorithm depends on being able to detect line endings in the file, and always scans the file as a sequence of bytes. 
There are several ways to represent line endings in a UTF-16 file (shown here as hex byte sequences): * 00 0A (Unix newline, UTF16-BE) * 00 0D 00 0A (Windows newline, UTF16-BE) * 0A 00 (Unix newline, UTF16-LE) * 0D 00 0A 00 (Windows newline, UTF16-LE) * 24 24 (Unicode newline, same in LE and BE) Subversion, however, expects one of the following newline sequences: * 0A (Unix newline) * 0D 0A (Windows newline) My best guess as to what's happening is that the 0A bytes, a.k.a. the ASCII newline character, are interpreted as the end-of-line markers, and the zero bytes are treated as part of the text. In most cases, the result will be close to correct, as long as there are no conflicts in the merge -- because Subversion will not emit conflict markers in UTF-16. Of course, if someone used the U+2424 newline code point instead, then in the worst case, the whole file would be interpreted as a single line. -- Brane Great information.. thanks for that. Bottom line is use UTF-8 for your text files and svn will be happy and work correctly. How hard would it be to create a warning on an add that a
Re: SVN Blame Returns Corrupt Data
On Fri, 11 Oct 2013 17:43:30 +, Branko Čibej wrote:
... Of course, if someone used the U+2424 newline code point instead, then in the worst case, the whole file would be interpreted as a single line.

And SVN would be right, as U+2424 is 'SYMBOL FOR NEWLINE', which is actually a printable character, not a control character.

Andreas

-- Totally trivial. Famous last words. From: Linus Torvalds torvalds@*.org Date: Fri, 22 Jan 2010 07:29:21 -0800
Re: SVN Blame Returns Corrupt Data
On 10/11/13 9:22 AM, Branko Čibej wrote: You'd have to extend Subversion's file type detection to detect UTF-16. See svn_io_detect_mimetype2 in line in this file: http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_subr/io.c?view=markup Subversion currently only looks at the first 1k Bytes of a file. It may be enough to check that this initial part of the file contains only valid UTF-16 (BE or LE) codes. Even if all we looked for is the BOM it might be helpful enough. I suspect the development tools producing UTF-16 are including BOMs. Windows seems to be fond of including them, Notepad puts one even on UTF-8.
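The BOM-only check suggested above could look roughly like the following Python sketch. It is an illustration of the idea, not a patch for svn_io_detect_mimetype2 (which is C code). The helper name `sniff_utf16_bom` and the return values are assumptions; only the 1k sample size comes from the message.

```python
import codecs
from typing import Optional

SAMPLE_SIZE = 1024  # the message says only the first 1k bytes are examined


def sniff_utf16_bom(head: bytes) -> Optional[str]:
    """Return a codec name if the sample starts with a UTF-16 BOM, else None."""
    head = head[:SAMPLE_SIZE]
    # The UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE),
    # so rule it out first. This means a UTF-16LE file whose first
    # character is U+0000 is treated as ambiguous and not detected.
    if head.startswith(codecs.BOM_UTF32_LE):
        return None
    if head.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if head.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None
```

A file without a BOM yields `None` here, so a check like this would only help with tools that write one.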
Re: SVN Blame Returns Corrupt Data
On Fri, Oct 11, 2013 at 09:52:31AM -0700, Ben Reser wrote: On 10/11/13 9:22 AM, Branko Čibej wrote: You'd have to extend Subversion's file type detection to detect UTF-16. See svn_io_detect_mimetype2 in line in this file: http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_subr/io.c?view=markup Subversion currently only looks at the first 1k Bytes of a file. It may be enough to check that this initial part of the file contains only valid UTF-16 (BE or LE) codes. Even if all we looked for is the BOM it might be helpful enough. I suspect the development tools producing UTF-16 are including BOMs. Windows seems to be fond of including them, Notepad puts one even on UTF-8. Couldn't Subversion automatically convert UTF-16 files to UTF-8 before processing them for diff/merge/blame, and convert output written to the original files back to UTF-16? That would require some work because existing streams, strings, and files passed around in the code would need to be wrapped so that translation to/from the internal from/to the external encoding is seamless. But I don't see why such an approach couldn't be made to work in principle. It might even result in some spring cleaning in the code base and pave the way for improved handling of file formats such as XML for diff and merge. What do you think? Is it worth adding this to our project ideas page?
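To make the "convert to an internal encoding first" idea concrete, here is a small Python sketch. It is only an analogy for the proposal, not how Subversion's C code would do it: diff_utf16_as_text is a hypothetical helper, and difflib stands in for Subversion's own diff engine. It decodes both versions to text and diffs lines of characters rather than raw bytes.

```python
import difflib


def diff_utf16_as_text(old: bytes, new: bytes, encoding: str = "utf-16") -> str:
    """Decode two UTF-16 blobs and return a unified diff of their text lines."""
    # The "utf-16" codec honours and strips a leading BOM; without one it
    # assumes the platform's native byte order.
    old_lines = old.decode(encoding).splitlines(keepends=True)
    new_lines = new.decode(encoding).splitlines(keepends=True)
    return "".join(difflib.unified_diff(old_lines, new_lines, "old", "new"))
```

Note that str.splitlines() also breaks on U+0085, U+2028 and U+2029, which is one possible answer to the Unicode line-separator question, not necessarily the one Subversion would want. Writing results back in the original encoding is left out of the sketch.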
Re: SVN Blame Returns Corrupt Data
On 11.10.2013 18:52, Ben Reser wrote: On 10/11/13 9:22 AM, Branko Čibej wrote: You'd have to extend Subversion's file type detection to detect UTF-16. See svn_io_detect_mimetype2 in line in this file: http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_subr/io.c?view=markup Subversion currently only looks at the first 1k Bytes of a file. It may be enough to check that this initial part of the file contains only valid UTF-16 (BE or LE) codes. Even if all we looked for is the BOM it might be helpful enough. I suspect the development tools producing UTF-16 are including BOMs. Windows seems to be fond of including them, Notepad puts one even on UTF-8. That would work only on Windows. On other platforms, you typically don't get a BOM (actually, a zero-width non-breaking space) at the beginning of a file. Granted, other platforms most likely use UTF-8 in any case. -- Brane -- Branko Čibej | Director of Subversion WANdisco // Non-Stop Data e. br...@wandisco.com
Re: SVN Blame Returns Corrupt Data
On 11.10.2013 19:25, Stefan Sperling wrote: On Fri, Oct 11, 2013 at 09:52:31AM -0700, Ben Reser wrote: On 10/11/13 9:22 AM, Branko Čibej wrote: You'd have to extend Subversion's file type detection to detect UTF-16. See svn_io_detect_mimetype2 in line in this file: http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_subr/io.c?view=markup Subversion currently only looks at the first 1k Bytes of a file. It may be enough to check that this initial part of the file contains only valid UTF-16 (BE or LE) codes. Even if all we looked for is the BOM it might be helpful enough. I suspect the development tools producing UTF-16 are including BOMs. Windows seems to be fond of including them, Notepad puts one even on UTF-8. Couldn't Subversion automatically convert UTF-16 files to UTF-8 before processing them for diff/merge/blame, and convert output written to the original files back to UTF-16? That would be less work than supporting whitespace compression, etc. in UTF-16, but we'd still have to detect U+2424 as an end-of-line marker in UTF-8 text. Still, we'd actually have to correctly identify UTF-16 content first, and handle invalid byte sequences. That would require some work because existing streams, strings, and files passed around in the code would need to be wrapped so that translation to/from the internal from/to the external encoding is seamless. But I don't see why such an approach couldn't be made to work in principle. It might even result in some spring cleaning in the code base and pave the way for improved handling of file formats such as XML for diff and merge. Can't see what XML has to do with it. The diff algorithm already uses a tokenizer; and for XML, that should be good enough most of the time. What do you think? Is it worth adding this to our project ideas page? It's already here: http://subversion.tigris.org/issues/show_bug.cgi?id=2194 -- Brane -- Branko Čibej | Director of Subversion WANdisco // Non-Stop Data e. br...@wandisco.com
Re: SVN Blame Returns Corrupt Data
On 11.10.2013 18:23, Andreas Krey wrote: On Fri, 11 Oct 2013 17:43:30 +, Branko Čibej wrote: ... Of course, if someone used the U+2424 newline code point instead, then in the worst case, the whole file would be interpreted as a single line. And SVN would be right, as U+2424 is 'SYMBOL FOR NEWLINE', which is actually a printable character, not a control character. Meh, you're right ... it's U+0085 (next line), U+2028 (line separator) or U+2029 (paragraph separator). I don't know what came over me; sorry for misleading everyone. -- Brane -- Branko Čibej | Director of Subversion WANdisco // Non-Stop Data e. br...@wandisco.com
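For reference, the code points discussed above can be inspected with Python's unicodedata module. This is just a lookup sketch; the helper name describe is made up for the example. It returns each code point's Unicode general category and its UTF-16LE and UTF-16BE byte sequences.

```python
import unicodedata
from typing import Tuple


def describe(cp: int) -> Tuple[str, bytes, bytes]:
    """Return (general category, UTF-16LE bytes, UTF-16BE bytes) for a code point."""
    ch = chr(cp)
    return unicodedata.category(ch), ch.encode("utf-16-le"), ch.encode("utf-16-be")


if __name__ == "__main__":
    for cp in (0x000A, 0x000D, 0x0085, 0x2028, 0x2029, 0x2424):
        category, le, be = describe(cp)
        print(f"U+{cp:04X}  {category}  LE={le.hex(' ')}  BE={be.hex(' ')}")
```

U+2424 comes back with category "So" (a symbol, i.e. printable), while U+2028 and U+2029 are "Zl" and "Zp" and U+0085 is a control character ("Cc"), which matches the correction in this message.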
Re: SVN Blame Returns Corrupt Data
On 10/11/13 10:25 AM, Stefan Sperling wrote: Couldn't Subversion automatically convert UTF-16 files to UTF-8 before processing them for diff/merge/blame, and convert output written to the original files back to UTF-16? That's what the patch I pointed out did. Nobody seemed to object to the idea at the time it was posted, though I think Branko does bring up some interesting questions about handling the unicode control characters.
SVN Blame Returns Corrupt Data
I'm having trouble running svn blame on a particular file. It's returning garbage.

In TortoiseBlame:
3341 TJP ÿþO
3341 TJP

In the command line:
3341TJP ■O
3341TJP

The file is 10.1k lines, not 2. If I run the blame from revision 0 to 3341 it returns the correct information.

In WinMerge and TortoiseMerge, diffing the files shows about 10 lines changing between 3340 and 3341 (it was a merge). However, the command line diff shows the entire contents being changed, with spaces between every character. So End Class reads E n d C l a s s, etc. Diffing a merge after rev 3341 shows the same spaces between every letter.

svn diff -r 3341:3489 svn://...
@@ -20032,7 +20058,7 @@
F i l l _ d d l L o c a t i o n ( )
F i l l _ d d l C o u n t r y ( )

Another strange thing is that it's marking these as lines 20,032 and 20,058, but in Notepad++ they are lines 10,026 and 10,031. The line numbers in diffs before rev 3341 match up fine between Notepad++ and the command line.

Any ideas on what happened or how to fix it? I'm most concerned about getting the blame working again. Please let me know if you need more information or if I was unclear with anything. Thanks in advance!
Re: SVN Blame Returns Corrupt Data
On Oct 10, 2013, at 11:29, T.J. Perovich tjperov...@gmail.com wrote: I'm having trouble running svn blame on a particular file. It's returning garbage. In TortoiseBlame: 3341 TJP ÿþO 3341 TJP In the command line: 3341TJP ■O 3341TJP The file is 10.1k lines, not 2. If I run the blame from revision 0 to 3341 it returns the correct information. In WinMerge and TortoiseMerge, diffing the files shows about 10 lines changing between 3340 and 3341 (it was merge). However, the command line diff shows the entire contents being changed with spaces between every character. So End Class reads E n d C l a s s, etc.. Diffing a merge post-rev# 3341 show the same spaces between every letter. svn diff -r 3341:3489 svn://... @@ -20032,7 +20058,7 @@ F i l l _ d d l L o c a t i o n ( ) F i l l _ d d l C o u n t r y ( ) Sounds like you've converted the file from UTF-8 to UTF-16. Another strange thing is it's marking these as lines 20,032 and 20,058. But in Notepad++ they are lines 10,026 and 10,031. The line numbers in pre-rev# 3341 diffs match up between the Notepad++ and command line fine. Sounds like the line endings changed as well.
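One possible explanation for the roughly doubled line numbers, sketched in Python. The assumption here is a byte-oriented splitter that treats a lone CR, a lone LF and CRLF as line endings; whether svn diff behaves exactly like this is not confirmed in this thread. Python's bytes.splitlines() has that behaviour and stands in for it. In UTF-16LE a Windows newline is 0D 00 0A 00, so the 0D and the 0A are no longer adjacent and each one ends a "line" by itself.

```python
def count_lines_bytewise(data: bytes) -> int:
    """Count lines the way a byte-oriented tool would: split on CR, LF or CRLF."""
    return len(data.splitlines())


# Two lines of text with Windows line endings, as in the reported file.
text = "End Class\r\nFill_ddlLocation()\r\n"
utf8_count = count_lines_bytewise(text.encode("utf-8"))
utf16_count = count_lines_bytewise(text.encode("utf-16-le"))

if __name__ == "__main__":
    print(utf8_count, utf16_count)
```

With this two-line sample the UTF-8 bytes split into 2 lines and the UTF-16LE bytes into 5 (two per real line plus a trailing NUL byte), which is in the same ballpark as 10,026 turning into 20,032. The NUL bytes left in each line would also account for the apparent spaces between letters.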
RE: SVN Blame Returns Corrupt Data
On Oct 10, 2013, at 11:29, T.J. Perovich tjperov...@gmail.com wrote: I'm having trouble running svn blame on a particular file. It's returning garbage. In TortoiseBlame: 3341 TJP ÿþO 3341 TJP In the command line: 3341TJP ■O 3341TJP The file is 10.1k lines, not 2. If I run the blame from revision 0 to 3341 it returns the correct information. In WinMerge and TortoiseMerge, diffing the files shows about 10 lines changing between 3340 and 3341 (it was merge). However, the command line diff shows the entire contents being changed with spaces between every character. So End Class reads E n d C l a s s, etc.. Diffing a merge post-rev# 3341 show the same spaces between every letter. svn diff -r 3341:3489 svn://... @@ -20032,7 +20058,7 @@ F i l l _ d d l L o c a t i o n ( ) F i l l _ d d l C o u n t r y ( ) Sounds like you've converted the file from UTF-8 to UTF-16. Another strange thing is it's marking these as lines 20,032 and 20,058. But in Notepad++ they are lines 10,026 and 10,031. The line numbers in pre-rev# 3341 diffs match up between the Notepad++ and command line fine. Sounds like the line endings changed as well. Sigh... if only svn would support Unicode encodings. BOb
Re: SVN Blame Returns Corrupt Data
On Thu, Oct 10, 2013 at 2:27 PM, Ryan Schmidt subversion-20...@ryandesign.com wrote: Sounds like you've converted the file from UTF-8 to UTF-16. Thanks, you're absolutely right. It changed from UTF-8 to UTF-16LE. Any idea how to go about fixing it elegantly? We have about 3 months of commits since this happened. Diff's in the GUIs have been working fine and we don't blame too often which is why it was never noticed.
Re: SVN Blame Returns Corrupt Data
Hello T.J. Perovich, on Thursday, 10 October 2013 at 21:17 you wrote: Any idea how to go about fixing it elegantly? Simply convert it back using your method of choice, Notepad++ should be able to handle this. Kind regards, Thorsten Schöning -- Thorsten Schöning E-Mail:thorsten.schoen...@am-soft.de AM-SoFT IT-Systeme http://www.AM-SoFT.de/ Phone...05151- 9468- 55 Fax...05151- 9468- 88 Mobile..0178-8 9468- 04 AM-SoFT GmbH IT-Systeme, Brandenburger Str. 7c, 31789 Hameln AG Hannover HRB 207 694 - Managing Director: Andreas Muchow
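If an editor is not handy, the conversion back can also be scripted. A minimal Python sketch follows, assuming the file carries a BOM (as the "ÿþ" in the blame output suggests); the function names and the file path are hypothetical. It works on raw bytes so the existing CRLF line endings pass through untouched.

```python
from pathlib import Path


def utf16_bytes_to_utf8(raw: bytes) -> bytes:
    """Re-encode UTF-16 content as UTF-8 without a BOM, leaving line endings as they are."""
    # The "utf-16" codec reads the BOM to pick the byte order and drops it.
    return raw.decode("utf-16").encode("utf-8")


def convert_file_in_place(path: Path) -> None:
    """Rewrite the file at `path` (hypothetical example path below) as UTF-8."""
    path.write_bytes(utf16_bytes_to_utf8(path.read_bytes()))


if __name__ == "__main__":
    convert_file_in_place(Path("SomeForm.vb"))  # placeholder file name
```

This only fixes the working copy going forward; committing the result still touches every line of the file as far as blame is concerned.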
RE: SVN Blame Returns Corrupt Data
Hello T.J. Perovich, on Thursday, 10 October 2013 at 21:17 you wrote: Any idea how to go about fixing it elegantly? Simply convert it back using your method of choice, Notepad++ should be able to handle this. Kind regards, Thorsten Schöning I assume he was asking how to fix the blame. 'Cause, sure, he could open the file, convert it back to UTF-8 with CRLF line endings... and commit it... of course, now blame is going to show him on every line, since he just changed every line. However, at this point blame is probably wrong anyway, because it is showing every line as having been changed by whoever changed all the line endings. Bottom line, I think he's stuck with it. BOb
Re: SVN Blame Returns Corrupt Data
On 10/10/13 12:17 PM, T.J. Perovich wrote: On Thu, Oct 10, 2013 at 2:27 PM, Ryan Schmidt subversion-20...@ryandesign.com wrote: Sounds like you've converted the file from UTF-8 to UTF-16. Thanks, you're absolutely right. It changed from UTF-8 to UTF-16LE. Any idea how to go about fixing it elegantly? We have about 3 months of commits since this happened. Diff's in the GUIs have been working fine and we don't blame too often which is why it was never noticed. At present, blame is not UTF-16 aware. About a year ago a patch was posted that helped (actually they just reposted an entire copy of blame.c), but it really didn't go anywhere since the original poster didn't continue following up. https://mail-archives.apache.org/mod_mbox/subversion-dev/201207.mbox/%3CCAAF0CB13B282344AF95AD2DE3D1962215627A3C%40DAG-B.nexon.corp%3E and the followup https://mail-archives.apache.org/mod_mbox/subversion-dev/201208.mbox/%3ccab84ubvvrhffqyea5pf5gstmpxz+rh2jkvdvcqscocjv+rq...@mail.gmail.com%3E Perhaps you'll find the above useful. Patches are of course welcome.