Fresh checkout shows modified files
Hi all,

for some directories I get modified working copies right after a fresh checkout. This seems very strange to me, and I do not get any error message. My platform is Windows 7 64-bit. I am using TortoiseSVN 1.8.1, Build 24570 - 64 Bit as the client and VisualSVN as the server. The command-line tools that ship with TortoiseSVN produce the same results. How can I start to debug this? The network doesn't seem to be the issue. Am I right?

Best regards,
Thomas Stümpfig
Siemens Industry Software GmbH Co. KG
Re: Fresh checkout shows modified files
On Fri, Oct 11, 2013 at 09:41:33AM +, Stümpfig, Thomas wrote:
> for some directories I get modified working copies right after a fresh checkout. [...] How can I start to debug? Network doesn't seem to be the issue. Am I right?

Does 'svn cleanup' fix it?

If so, my guess would be that the timestamps of the on-disk files differ from what was recorded in .svn/wc.db. This difference would cause the modification check logic to flag the files as modified. 'svn cleanup' brings the on-disk timestamps and the ones stored in the metadata back in sync.
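The guess above can be illustrated with a small model. This is only a sketch of a timestamp-based modification check, not Subversion's actual code: `RecordedStat`, `record` and `looks_modified` are hypothetical names, and the real working copy stores its recorded values in .svn/wc.db rather than in a Python object. It shows how a file whose bytes never changed can still become a suspect once its on-disk timestamp drifts away from the recorded one.

```python
import os
from dataclasses import dataclass


@dataclass
class RecordedStat:
    """Size and mtime noted when the file was last known to be unmodified."""
    size: int
    mtime_ns: int


def record(path: str) -> RecordedStat:
    """Snapshot the current on-disk size and timestamp of a file."""
    st = os.stat(path)
    return RecordedStat(size=st.st_size, mtime_ns=st.st_mtime_ns)


def looks_modified(path: str, recorded: RecordedStat) -> bool:
    """Cheap check: any size or timestamp drift makes the file a suspect."""
    st = os.stat(path)
    return st.st_size != recorded.size or st.st_mtime_ns != recorded.mtime_ns
```

In this model, calling `record` again after the drift plays the role that the reply attributes to 'svn cleanup': the recorded and on-disk values agree again and the file stops looking modified.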
Re: SVN Blame Returns Corrupt Data
On Thu, Oct 10, 2013 at 5:49 PM, Bob Archer bob.arc...@amsi.com wrote:
> I assume he was asking how to fix the blame. Sure, he could open the file, convert it back to UTF-8 with CRLF line endings, and commit it. Of course, blame is then going to show him on every line, since he just changed every line.

That's exactly what I meant, and you're correct about how blame is handled. I committed the UTF-8 copy to a test branch, diffed it, and it showed every line as changed. Unfortunately, it looks like this is our best option.

On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser b...@reser.org wrote:
> At present blame is not UTF-16 aware. [...] Patches are of course welcome.

I'll take a look at that this weekend. Thanks for all the input, everyone.

On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser b...@reser.org wrote:
> On 10/10/13 12:17 PM, T.J. Perovich wrote:
>> On Thu, Oct 10, 2013 at 2:27 PM, Ryan Schmidt subversion-20...@ryandesign.com wrote:
>>> Sounds like you've converted the file from UTF-8 to UTF-16.
>>
>> Thanks, you're absolutely right. It changed from UTF-8 to UTF-16LE. Any idea how to go about fixing it elegantly? We have about 3 months of commits since this happened. Diffs in the GUIs have been working fine, and we don't blame too often, which is why it was never noticed.
>
> At present blame is not UTF-16 aware. About a year ago a patch was posted that helped (actually they just reposted an entire copy of blame.c), but it really didn't go anywhere since the original poster didn't continue following up.
>
> https://mail-archives.apache.org/mod_mbox/subversion-dev/201207.mbox/%3CCAAF0CB13B282344AF95AD2DE3D1962215627A3C%40DAG-B.nexon.corp%3E
>
> and the followup
>
> https://mail-archives.apache.org/mod_mbox/subversion-dev/201208.mbox/%3CCAB84uBVVrHFfQyEA5pF5gStMpXz+RH2jKvdvCQsCOcjv+rq...@mail.gmail.com%3E
>
> Perhaps you'll find the above useful. Patches are of course welcome.
RE: SVN Blame Returns Corrupt Data
> On Thu, Oct 10, 2013 at 5:49 PM, Bob Archer bob.arc...@amsi.com wrote:
>> I assume he was asking how to fix the blame. [...]
>
> That's exactly what I meant. You're correct with how the blame is handled. I committed the UTF-8 copy to a test branch, diff'd, and it showed every line as being changed. Unfortunately it looks like this is our best option.

Yep, we have done the same thing. As a matter of fact, over the past few days I rescripted all our database scripts to be UTF-8, since merging them just doesn't work correctly when they are UTF-16, even if you remove the binary MIME type.

> On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser b...@reser.org wrote:
>> At current blame is not UTF-16 aware.

It's not just blame that isn't UTF-16 aware: the diff engine, or whatever detects file types, always considers UTF-16 files to be binary. If you add a UTF-16 file, you see that svn sets the application/octet-stream MIME type. There is an issue in the bug database about this from when I reported it; however, it hasn't been addressed. I'm still surprised that, at this point, svn can't handle UTF-16 text files as text with respect to adding, diffing, blaming, etc.

BOb
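The workaround discussed here (convert the file back to UTF-8 and commit it) can be sketched as follows. This is a minimal example under stated assumptions: the function names are made up for illustration, and the input is assumed to start with a UTF-16 byte order mark. Without a BOM the generic "utf-16" codec has to assume a byte order, so "utf-16-le" or "utf-16-be" would need to be passed explicitly. Working on bytes rather than text-mode files keeps the CRLF line endings untouched.

```python
def utf16_to_utf8(data: bytes, encoding: str = "utf-16") -> bytes:
    """Transcode UTF-16 bytes to UTF-8 without altering line endings."""
    # With the generic "utf-16" codec the BOM selects LE or BE and is consumed.
    text = data.decode(encoding)
    return text.encode("utf-8")


def convert_file(src: str, dst: str, encoding: str = "utf-16") -> None:
    """Read src as UTF-16 and write dst as UTF-8 (hypothetical helper)."""
    with open(src, "rb") as fin:
        converted = utf16_to_utf8(fin.read(), encoding)
    with open(dst, "wb") as fout:
        fout.write(converted)
```

As noted in the thread, committing the converted file still changes every line as far as blame is concerned; the conversion only makes later diffs, merges and blames line-oriented again.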
Re: SVN Blame Returns Corrupt Data
On 11.10.2013 15:58, Bob Archer wrote:
> It's not just blame that isn't... the diff engine, or whatever detects file types always considers UTF-16 files to be binary. [...] I'm surprised still at this time that svn still can't support UTF-16 text files as text wrt adding, diffing, blaming, etc.

It's quite simple: no one has written the necessary code. While I can understand that it's an interesting feature for Windows users, most Subversion developers have other things to do. This is a volunteer project and most of us do not use Windows, so you can hardly expect anyone to spend several weeks solving a problem that has a perfectly simple workaround. Since UTF-8 and UTF-16 can be interchanged without data loss, there are other, much more important things to do in Subversion.

To turn your argument around: I'm surprised no Windows user has yet written a patch for Subversion to make it support UTF-16 ...

-- Brane

--
Branko Čibej | Director of Subversion
WANdisco // Non-Stop Data
e. br...@wandisco.com
RE: SVN Blame Returns Corrupt Data
Branko Čibej wrote:
> It's quite simple: no-one has written the necessary code. [...] Since UTF-8 and UTF-16 can be interchanged without data loss, there are other, much more important things to do in Subversion.

I appreciate all that you said. I didn't expect UTF-16 to be so uncommon on non-Windows OSes. A large number of the dev tools I work with on Windows, especially the Microsoft tools, default to creating UTF-16 files.

I disagree with your claim that the files can be converted without data loss. If you need UTF-16, then you need it. Also, if you are working in an international team and you have developers whose OSes are set to other languages and use different code pages, then what you see when you look at a UTF-8 file might be different from what I see.

So, when I say I'm surprised, I only say that knowing that the internet has made the world very flat, and I'm sure there is much more collaboration among people who use different languages and work on apps that need to deal with international languages, etc. I'm not dissing the devs in any way.

> To turn your argument around: I'm surprised no Windows user has yet written a patch for Subversion to make it support UTF-16 ...

If I knew how to, I would. I work with C#, and while I'm sure C is similar, it is probably quite different. If an svn dev would mentor me through it, and perhaps tell me which modules would need to be modified, I would be happy to take a whack at it.

BOb
Re: SVN Blame Returns Corrupt Data
On 11.10.2013 16:55, Bob Archer wrote:
> I disagree with your can be converted without data loss. If you need UTF-16 then you need it. Also, if you are working in an international team and you have developers with other language OSes which have different code pages then what you see when you look at a UTF-8 file might be different than what I see.

I don't follow. Both UTF-16 and UTF-8 are complete representations of the Unicode character set. Exactly the same sequences of code points can be represented in both encodings. You can convert from UTF-16 to UTF-8 and back and get exactly the same sequence of bytes.

-- Brane
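The round-trip claim is easy to check for well-formed input. The sketch below is only an illustration with a made-up function name. It uses the explicit "utf-16-le" codec so that a leading byte order mark is carried through as the ordinary code point U+FEFF instead of being consumed, which is what makes the result byte-identical. Input containing unpaired surrogates is not valid UTF-16 and is rejected by the decoder rather than round-tripped.

```python
def roundtrip_via_utf8(data: bytes, encoding: str = "utf-16-le") -> bytes:
    """Convert UTF-16 bytes to UTF-8 and back again."""
    as_utf8 = data.decode(encoding).encode("utf-8")
    return as_utf8.decode("utf-8").encode(encoding)


# A UTF-16LE sample with BOM, a non-ASCII character and a CRLF line ending.
sample = b"\xff\xfe" + "SELECT 'Čibej';\r\n".encode("utf-16-le")
print(roundtrip_via_utf8(sample) == sample)
```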
RE: SVN Blame Returns Corrupt Data
Branko Čibej wrote:
> I don't follow. Both UTF-16 and UTF-8 are complete representations of the Unicode character set. Exactly the same code sequences can be represented in both encodings. You can convert from UTF-16 to UTF-8 and back and get exactly the same sequence of bytes.

OK, I have to backpedal a bit here. You are correct: UTF-8 is a Unicode format and can store all characters. It's not a UTF-8 vs. UTF-16 issue (Friday senior moment). What I recall being told by one of the Subversion developers was that Subversion only supported the ASCII character set, and that while UTF-8 was compatible with ASCII, Subversion didn't truly support Unicode files. However, this blog entry seems to dispute that:

http://rhubbarb.wordpress.com/2012/04/28/svn-unicode/

Would adding that MIME type to this file fix the blame issues this user is seeing?

BOb
Re: SVN Blame Returns Corrupt Data
On 11.10.2013 17:19, Bob Archer wrote:
> However, this blog entry seems to dispute that: http://rhubbarb.wordpress.com/2012/04/28/svn-unicode/ Would adding that mime-type to this file fix the blame issues this user is seeing?

I think the user is just very lucky. Subversion does not actually try to interpret the svn:mime-type property, other than to determine whether to treat a file as text or binary. (The comment is correct in that the proper parameter is charset=, not encoding=, but that's not important for this discussion.)

Subversion's merge algorithm depends on being able to detect line endings in the file, and always scans the file as a sequence of bytes. There are several ways to represent line endings in a UTF-16 file (shown here as hex byte sequences):

* 00 0A (Unix newline, UTF-16BE)
* 00 0D 00 0A (Windows newline, UTF-16BE)
* 0A 00 (Unix newline, UTF-16LE)
* 0D 00 0A 00 (Windows newline, UTF-16LE)
* 24 24 (Unicode newline, same in LE and BE)

Subversion, however, expects one of the following newline sequences:

* 0A (Unix newline)
* 0D 0A (Windows newline)

My best guess as to what's happening is that the 0A bytes, a.k.a. the ASCII newline character, are interpreted as the end-of-line markers, and the zero bytes are treated as part of the text. In most cases, the result will be close to correct, as long as there are no conflicts in the merge -- because Subversion will not emit conflict markers in UTF-16.

Of course, if someone used the U+2424 newline code point instead, then in the worst case, the whole file would be interpreted as a single line.

-- Brane
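The guess about 0A bytes can be made concrete with a few lines of Python. This is a sketch of what any byte-oriented line scanner would see when handed UTF-16 data, not a reproduction of Subversion's code; `split_on_lf_byte` is a hypothetical stand-in for such a scanner.

```python
def split_on_lf_byte(data: bytes) -> list[bytes]:
    """Split raw bytes at every 0x0A byte, ignoring the text encoding."""
    return data.split(b"\n")


# Two CRLF-terminated lines encoded as UTF-16LE (no BOM, for brevity).
utf16le = "ab\r\ncd\r\n".encode("utf-16-le")
for piece in split_on_lf_byte(utf16le):
    # Every piece after the first begins with the stray 00 byte that
    # belonged to the preceding 0A 00 code unit.
    print(piece.hex(" "))
```

The split points do land where the line breaks are, which fits the "close to correct" observation, but each following piece starts one byte out of step with the 16-bit code units. A code unit that merely contains a 0A byte, such as U+010A (bytes 0A 01 in UTF-16LE), would also be taken for a line break, while text that uses U+2424 contains no 0A byte and stays in one piece.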
RE: SVN Blame Returns Corrupt Data
Branko Čibej wrote:
> My best guess as to what's happening is that the 0A bytes, a.k.a. the ASCII newline character, are interpreted as the end-of-line markers, and the zero bytes are treated as part of the text. [...] Of course, if someone used the U+2424 newline code point instead, then in the worst case, the whole file would be interpreted as a single line.

Great information, thanks for that. The bottom line is: use UTF-8 for your text files, and svn will be happy and work correctly.

How hard would it be to create
Re: SVN Blame Returns Corrupt Data
On 11.10.2013 18:12, Bob Archer wrote:
> Great information.. thanks for that. Bottom line is use UTF-8 for your text files and svn will be happy and work correctly. How hard would it be to create a warning on an add that a
Re: SVN Blame Returns Corrupt Data
On Fri, 11 Oct 2013 17:43:30 +, Branko Čibej wrote: ... Of course, if someone used the U+2424 newline code point instead, then in the worst case, the whole file would be interpreted as a single line.

And SVN would be right, as U+2424 is 'SYMBOL FOR NEWLINE', which is actually a printable character, not a control character.

Andreas

-- Totally trivial. Famous last words. From: Linus Torvalds torvalds@*.org Date: Fri, 22 Jan 2010 07:29:21 -0800
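To make the byte-level behaviour discussed in this exchange concrete, here is a small Python sketch. It is only an illustration of a scanner that treats the single byte 0A as the line terminator, not Subversion's actual code; the helper name split_on_lf_bytes is made up for this example.

```python
import unicodedata

def split_on_lf_bytes(data):
    """Split raw bytes at every 0A byte, ignoring the file's real encoding.

    Hypothetical helper: it mimics a byte-oriented scanner, nothing more.
    """
    return data.split(b"\x0a")

# UTF-16LE text with ordinary newlines: each "\n" is stored as 0A 00.
utf16le = "first\nsecond\n".encode("utf-16-le")
pieces = split_on_lf_bytes(utf16le)

# The same text using U+2424 (bytes 24 24) where a newline was intended.
symbol_text = "first\u2424second\u2424".encode("utf-16-le")
symbol_pieces = split_on_lf_bytes(symbol_text)

print(pieces)              # the second piece begins with a stray 00 byte
print(len(symbol_pieces))  # no 0A byte anywhere, so nothing is split
print(unicodedata.name("\u2424"), unicodedata.category("\u2424"))
```

The split still lands near the right places for ordinary newlines, but the leftover zero byte moves to the start of the following "line". With U+2424 the scanner finds no line terminator at all, and the Unicode database confirms that this code point is a symbol rather than a control character.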
Re: SVN Blame Returns Corrupt Data
On 10/11/13 9:22 AM, Branko Čibej wrote:

You'd have to extend Subversion's file type detection to detect UTF-16. See svn_io_detect_mimetype2 in line in this file: http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_subr/io.c?view=markup Subversion currently only looks at the first 1k Bytes of a file. It may be enough to check that this initial part of the file contains only valid UTF-16 (BE or LE) codes.

Even if all we looked for is the BOM it might be helpful enough. I suspect the development tools producing UTF-16 are including BOMs. Windows seems to be fond of including them, Notepad puts one even on UTF-8.
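A BOM-only check of the kind suggested here could look roughly like the following Python sketch. It is an assumption-laden illustration, not the logic of svn_io_detect_mimetype2: the function name sniff_utf16_bom and the SNIFF_BYTES constant are invented for the example, and the 1024-byte window simply mirrors the "first 1k Bytes" mentioned above.

```python
import codecs

SNIFF_BYTES = 1024  # mirrors the "first 1k Bytes" mentioned in the thread

def sniff_utf16_bom(data):
    """Return 'utf-16-le', 'utf-16-be', or None, based only on a leading BOM.

    Hypothetical helper for illustration; it does not validate the rest
    of the content.
    """
    head = data[:SNIFF_BYTES]
    # FF FE 00 00 is the UTF-32LE BOM and also starts with the UTF-16LE BOM,
    # so this sketch rules it out first and treats that prefix as "not UTF-16".
    if head.startswith(codecs.BOM_UTF32_LE):
        return None
    if head.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if head.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None

sample = codecs.BOM_UTF16_LE + "hello\r\n".encode("utf-16-le")
print(sniff_utf16_bom(sample))
print(sniff_utf16_bom(codecs.BOM_UTF8 + b"hello\r\n"))
```

A UTF-8 file that carries a BOM, as Notepad writes, is not misreported as UTF-16 because its BOM bytes differ. UTF-16 content without any BOM goes undetected by this approach.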
Re: SVN Blame Returns Corrupt Data
On Fri, Oct 11, 2013 at 09:52:31AM -0700, Ben Reser wrote: On 10/11/13 9:22 AM, Branko Čibej wrote:

You'd have to extend Subversion's file type detection to detect UTF-16. See svn_io_detect_mimetype2 in line in this file: http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_subr/io.c?view=markup Subversion currently only looks at the first 1k Bytes of a file. It may be enough to check that this initial part of the file contains only valid UTF-16 (BE or LE) codes.

Even if all we looked for is the BOM it might be helpful enough. I suspect the development tools producing UTF-16 are including BOMs. Windows seems to be fond of including them, Notepad puts one even on UTF-8.

Couldn't Subversion automatically convert UTF-16 files to UTF-8 before processing them for diff/merge/blame, and convert output written to the original files back to UTF-16?

That would require some work because existing streams, strings, and files passed around in the code would need to be wrapped so that translation to/from the internal from/to the external encoding is seamless. But I don't see why such an approach couldn't be made to work in principle. It might even result in some spring cleaning in the code base and pave the way for improved handling of file formats such as XML for diff and merge.

What do you think? Is it worth adding this to our project ideas page?
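The convert, process, convert-back idea can be sketched in a few lines of Python. This is only a toy model of the proposal, assuming BOM-less UTF-16LE input; the function names are hypothetical and the line comparison uses Python's difflib rather than anything from Subversion.

```python
import difflib

def utf16_to_utf8(data, encoding="utf-16-le"):
    """Transcode external UTF-16 bytes to internal UTF-8 bytes."""
    return data.decode(encoding).encode("utf-8")

def utf8_to_utf16(data, encoding="utf-16-le"):
    """Transcode internal UTF-8 bytes back to the external encoding."""
    return data.decode("utf-8").encode(encoding)

def changed_lines(old_utf16, new_utf16):
    """Return the added or replaced lines, re-encoded as UTF-16.

    All line handling happens on UTF-8 bytes, where 0A and 0D 0A are
    unambiguous line terminators.
    """
    old_lines = utf16_to_utf8(old_utf16).splitlines(keepends=True)
    new_lines = utf16_to_utf8(new_utf16).splitlines(keepends=True)
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            added.extend(new_lines[j1:j2])
    return [utf8_to_utf16(line) for line in added]

old = "alpha\r\nbeta\r\n".encode("utf-16-le")
new = "alpha\r\ngamma\r\n".encode("utf-16-le")
print(changed_lines(old, new))
print(utf8_to_utf16(utf16_to_utf8(old)) == old)  # round trip preserves bytes
```

The sketch sidesteps the hard parts raised in this thread: reliably identifying UTF-16 input in the first place, handling invalid byte sequences, and preserving a BOM if one was present.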
Re: SVN Blame Returns Corrupt Data
On 11.10.2013 18:52, Ben Reser wrote: On 10/11/13 9:22 AM, Branko Čibej wrote:

You'd have to extend Subversion's file type detection to detect UTF-16. See svn_io_detect_mimetype2 in line in this file: http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_subr/io.c?view=markup Subversion currently only looks at the first 1k Bytes of a file. It may be enough to check that this initial part of the file contains only valid UTF-16 (BE or LE) codes.

Even if all we looked for is the BOM it might be helpful enough. I suspect the development tools producing UTF-16 are including BOMs. Windows seems to be fond of including them, Notepad puts one even on UTF-8.

That would work only on Windows. On other platforms, you typically don't get a BOM (actually, a zero-width non-breaking space) at the beginning of a file. Granted, other platforms most likely use UTF-8 in any case.

-- Brane
-- Branko Čibej | Director of Subversion WANdisco // Non-Stop Data e. br...@wandisco.com
Re: SVN Blame Returns Corrupt Data
On 11.10.2013 19:25, Stefan Sperling wrote: On Fri, Oct 11, 2013 at 09:52:31AM -0700, Ben Reser wrote: On 10/11/13 9:22 AM, Branko Čibej wrote:

You'd have to extend Subversion's file type detection to detect UTF-16. See svn_io_detect_mimetype2 in line in this file: http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_subr/io.c?view=markup Subversion currently only looks at the first 1k Bytes of a file. It may be enough to check that this initial part of the file contains only valid UTF-16 (BE or LE) codes.

Even if all we looked for is the BOM it might be helpful enough. I suspect the development tools producing UTF-16 are including BOMs. Windows seems to be fond of including them, Notepad puts one even on UTF-8.

Couldn't Subversion automatically convert UTF-16 files to UTF-8 before processing them for diff/merge/blame, and convert output written to the original files back to UTF-16?

That would be less work than supporting whitespace compression, etc. in UTF-16, but we'd still have to detect U+2424 as an end-of-line marker in UTF-8 text. Still, we'd actually have to correctly identify UTF-16 content first, and handle invalid byte sequences.

That would require some work because existing streams, strings, and files passed around in the code would need to be wrapped so that translation to/from the internal from/to the external encoding is seamless. But I don't see why such an approach couldn't be made to work in principle. It might even result in some spring cleaning in the code base and pave the way for improved handling of file formats such as XML for diff and merge.

Can't see what XML has to do with it. The diff algorithm already uses a tokenizer; and for XML, that should be good enough most of the time.

What do you think? Is it worth adding this to our project ideas page?

It's already here: http://subversion.tigris.org/issues/show_bug.cgi?id=2194

-- Brane
Re: SVN Blame Returns Corrupt Data
On 11.10.2013 18:23, Andreas Krey wrote: On Fri, 11 Oct 2013 17:43:30 +, Branko Čibej wrote: ... Of course, if someone used the U+2424 newline code point instead, then in the worst case, the whole file would be interpreted as a single line.

And SVN would be right, as U+2424 is 'SYMBOL FOR NEWLINE', which is actually a printable character, not a control character.

Meh, you're right ... it's U+0085 (next line), U+2028 (line separator) or U+2029 (paragraph separator). I don't know what came over me; sorry for misleading everyone.

-- Brane
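For reference, the three code points named in this correction can be inspected with Python's standard library. The sketch below is illustrative only; the describe helper is a made-up name. It shows their general categories and encoded bytes, and that none of them contains a 0A byte in UTF-8, so a scanner looking only for 0A or 0D 0A would not treat them as line endings.

```python
import unicodedata

LINE_BREAK_CODE_POINTS = ["\u0085", "\u2028", "\u2029"]

def describe(ch):
    """Summarise one code point: category and encoded byte sequences."""
    utf8 = ch.encode("utf-8")
    return {
        "code_point": "U+%04X" % ord(ch),
        "category": unicodedata.category(ch),
        "utf8_hex": utf8.hex(),
        "utf16be_hex": ch.encode("utf-16-be").hex(),
        "contains_0a": b"\x0a" in utf8,
    }

for ch in LINE_BREAK_CODE_POINTS:
    print(describe(ch))

# Python's text-level splitlines() honours U+2028; its byte-level
# counterpart only knows the ASCII line terminators.
sample = "one\u2028two"
print(sample.splitlines())
print(sample.encode("utf-8").splitlines())
```

The difference between the last two lines of output is the same gap discussed in this thread: text-aware code can recognise Unicode line separators, while byte-oriented code sees one unbroken line.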
Re: SVN Blame Returns Corrupt Data
On 10/11/13 10:25 AM, Stefan Sperling wrote: Couldn't Subversion automatically convert UTF-16 files to UTF-8 before processing them for diff/merge/blame, and convert output written to the original files back to UTF-16?

That's what the patch I pointed out did. Nobody seemed to object to the idea at the time it was posted, though I think Branko does bring up some interesting questions about handling the Unicode control characters.