subject:"SVN Blame Returns Corrupt Data"

Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread T.J. Perovich

On Thu, Oct 10, 2013 at 5:49 PM, Bob Archer bob.arc...@amsi.com wrote:

I assume he was asking how to fix the blame. Cause, sure, he could open
the file, convert it back to UTF-8 with CRLF line endings... and commit
it... of course, now blame is going to show him on every line, since he
just changed every line.

That's exactly what I meant. You're correct with how the blame is handled.
I committed the UTF-8 copy to a test branch, diff'd, and it showed every
line as being changed. Unfortunately it looks like this is our best option.
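
For reference, the conversion described above (UTF-16LE back to UTF-8 with CRLF line endings) can be sketched in a few lines of Python. This is only an illustration and was not posted in the thread: the function name `utf16le_to_utf8_crlf` is made up, and the sketch assumes the input is well-formed UTF-16LE, with or without a BOM.

```python
import codecs


def utf16le_to_utf8_crlf(data: bytes) -> bytes:
    """Convert UTF-16LE bytes to UTF-8 bytes with CRLF line endings.

    Hypothetical helper for illustration; assumes well-formed UTF-16LE input.
    """
    # Drop the byte order mark if the editor wrote one.
    if data.startswith(codecs.BOM_UTF16_LE):
        data = data[len(codecs.BOM_UTF16_LE):]
    text = data.decode("utf-16-le")
    # Normalize every line-ending style to LF first, then expand to CRLF.
    text = text.replace("\r\n", "\n").replace("\r", "\n").replace("\n", "\r\n")
    return text.encode("utf-8")


if __name__ == "__main__":
    sample = codecs.BOM_UTF16_LE + "SELECT 1;\r\nGO\n".encode("utf-16-le")
    print(utf16le_to_utf8_crlf(sample))
```

Committing the converted file still touches every line, so blame will attribute the whole file to that commit, as noted above.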

On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser b...@reser.org wrote:

At current blame is not UTF-16 aware.

About a year ago there was a patch (actually they just reposted an entire
copy of blame.c) posted that helped, but it really didn't go anywhere since
the original poster didn't continue following up.

Perhaps you'll find the above useful. Patches are of course welcome.

I'll take a look at that this weekend. Thanks for all the input everyone.

On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser b...@reser.org wrote:

On 10/10/13 12:17 PM, T.J. Perovich wrote:
On Thu, Oct 10, 2013 at 2:27 PM, Ryan Schmidt
subversion-20...@ryandesign.com wrote:
Sounds like you've converted the file from UTF-8 to UTF-16.

Thanks, you're absolutely right. It changed from UTF-8 to UTF-16LE.

Any idea how to go about fixing it elegantly? We have about 3 months of
commits since this happened. Diffs in the GUIs have been working fine, and
we don't blame too often, which is why it was never noticed.

At current blame is not UTF-16 aware.

About a year ago there was a patch (actually they just reposted an entire
copy of blame.c) posted that helped, but it really didn't go anywhere since
the original poster didn't continue following up.

https://mail-archives.apache.org/mod_mbox/subversion-dev/201207.mbox/%3CCAAF0CB13B282344AF95AD2DE3D1962215627A3C%40DAG-B.nexon.corp%3E

and the followup

https://mail-archives.apache.org/mod_mbox/subversion-dev/201208.mbox/%3ccab84ubvvrhffqyea5pf5gstmpxz+rh2jkvdvcqscocjv+rq...@mail.gmail.com%3E

Perhaps you'll find the above useful. Patches are of course welcome.

RE: SVN Blame Returns Corrupt Data

2013-10-11 Thread Bob Archer

 On Thu, Oct 10, 2013 at 5:49 PM, Bob Archer bob.arc...@amsi.com wrote:
 I assume he was asking how to fix the blame. Cause, sure, he could open
 the file, convert it back to UTF-8 with CRLF line endings... and commit it... 
 of
 course, now blame is going to show him on every line, since he just changed
 every line.
 
 That's exactly what I meant.  You're correct with how the blame is handled.  I
 committed the UTF-8 copy to a test branch, diff'd, and it showed every line
 as being changed.  Unfortunately it looks like this is our best option.

Yep, we have done the same thing. As a matter of fact, over the past few days
I re-scripted all our database scripts as UTF-8, since merging them just
doesn't work correctly when they are UTF-16, even if you remove the binary
mime type.

 
 
 On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser b...@reser.org wrote:
 At current blame is not UTF-16 aware.

It's not just blame that isn't... the diff engine, or whatever detects file
types, always considers UTF-16 files to be binary. If you add a UTF-16 file
you see that svn adds the application/octet-stream mime type. There is an
issue in the bug database about this from when I reported/complained about
it... however it hasn't been addressed. I'm surprised that, at this point,
svn still can't support UTF-16 text files as text wrt adding, diffing,
blaming, etc.

BOb


 

Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Branko Čibej

On 11.10.2013 15:58, Bob Archer wrote:
 On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser b...@reser.org wrote:
 At current blame is not UTF-16 aware.
 It's not just blame that isn't... the diff engine, or whatever detects file
 types, always considers UTF-16 files to be binary. If you add a UTF-16 file
 you see that svn adds the application/octet-stream mime type. There is an
 issue in the bug database about this from when I reported/complained about
 it... however it hasn't been addressed. I'm surprised that, at this point,
 svn still can't support UTF-16 text files as text wrt adding, diffing,
 blaming, etc.

It's quite simple: no-one has written the necessary code. While I can
understand it's an interesting feature for Windows users, most
Subversion developers have other things to do. This being a volunteer
project, and most of us do not use Windows, you can hardly expect anyone
to spend several weeks on solving a problem that has a perfectly simple
workaround. Since UTF-8 and UTF-16 can be interchanged without data
loss, there are other, much more important things to do in Subversion.

To turn your argument around: I'm surprised no Windows user has yet
written a patch for Subversion to make it support UTF-16 ...

-- Brane


-- 
Branko Čibej | Director of Subversion
WANdisco // Non-Stop Data
e. br...@wandisco.com

RE: SVN Blame Returns Corrupt Data

2013-10-11 Thread Bob Archer

 It's quite simple: no-one has written the necessary code. While I can
 understand it's an interesting feature for Windows users, most Subversion
 developers have other things to do. This being a volunteer project, and most
 of us do not use Windows, you can hardly expect anyone to spend several
 weeks on solving a problem that has a perfectly simple workaround. Since
 UTF-8 and UTF-16 can be interchanged without data loss, there are other,
 much more important things to do in Subversion.

I appreciate all that you said. I didn't expect that UTF-16 was so uncommon on
non-Windows OSes. A large number of dev tools that I work with on Windows,
especially the Microsoft tools, default to creating UTF-16 files.

I disagree with your claim that they can be converted without data loss. If you
need UTF-16 then you need it. Also, if you are working in an international team
and you have developers with other-language OSes which have different code
pages, then what you see when you look at a UTF-8 file might be different than
what I see.

So, when I say I'm surprised, I only say that with the knowledge that the
internet has made the world very flat, and I'm sure there is much more
collaboration among people that use different languages and work on apps that
need to deal with international languages, etc. I'm not dissing the devs in any
way.

 To turn your argument around: I'm surprised no Windows user has yet
 written a patch for Subversion to make it support UTF-16 ...

If I knew how to, I would. I work with C#, and while I'm sure C is similar, it
is probably quite different in practice. If an svn dev would mentor me through
it, and perhaps tell me which modules would need to be modified, I would be
happy to take a whack at it.

BOb

Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Branko Čibej

On 11.10.2013 16:55, Bob Archer wrote:
 I appreciate all that you said. I didn't expect that UTF-16 was so uncommon
 on non-Windows OSes. A large number of dev tools that I work with on Windows,
 especially the Microsoft tools, default to creating UTF-16 files.

 I disagree with your claim that they can be converted without data loss. If
 you need UTF-16 then you need it. Also, if you are working in an
 international team and you have developers with other-language OSes which
 have different code pages, then what you see when you look at a UTF-8 file
 might be different than what I see.

I don't follow. Both UTF-16 and UTF-8 are complete representations of
the Unicode character set. Exactly the same code sequences can be
represented in both encodings. You can convert from UTF-16 to UTF-8 and
back and get exactly the same sequence of bytes.
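
The round trip described here is easy to check with a small Python sketch. It is an illustration only, not part of the original message: the helper name `utf16le_via_utf8` is invented, and it assumes well-formed UTF-16LE input with a fixed byte order. Ill-formed input, such as an unpaired surrogate, is rejected by Python's decoder rather than converted.

```python
def utf16le_via_utf8(data: bytes) -> bytes:
    """Decode UTF-16LE, pass the text through UTF-8, and re-encode as UTF-16LE.

    Hypothetical helper for illustration; raises UnicodeDecodeError on
    ill-formed input.
    """
    text = data.decode("utf-16-le")
    as_utf8 = text.encode("utf-8")
    return as_utf8.decode("utf-8").encode("utf-16-le")


if __name__ == "__main__":
    # Mixed sample: Latin with diacritics, CJK, a non-BMP code point, CRLF.
    original = "Čibej, naïve, 日本語, \U0001F600\r\n".encode("utf-16-le")
    assert utf16le_via_utf8(original) == original
    print("round trip preserved", len(original), "bytes")
```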

-- Brane



RE: SVN Blame Returns Corrupt Data

2013-10-11 Thread Bob Archer

 I don't follow. Both UTF-16 and UTF-8 are complete representations of the
 Unicode character set. Exactly the same code sequences can be represented
 in both encodings. You can convert from UTF-16 to UTF-8 and back and get
 exactly the same sequence of bytes.
 

Ok, I have to backpedal here a bit. You are correct, UTF-8 is a Unicode
format and can store all characters. It's not a UTF-8 vs UTF-16 issue (Friday
senior moment). What I recall being told by one of the Subversion developers
was that Subversion only supported the ASCII character set, and that while
UTF-8 was compatible with ASCII, Subversion didn't truly support Unicode files.

However, this blog entry seems to dispute that:

http://rhubbarb.wordpress.com/2012/04/28/svn-unicode/

Would adding that mime-type to this file fix the blame issues this user is 
seeing?

BOb

Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Branko Čibej

On 11.10.2013 17:19, Bob Archer wrote:
 Ok, I have to backpedal here a bit. You are correct, UTF-8 is a Unicode
 format and can store all characters. It's not a UTF-8 vs UTF-16 issue (Friday
 senior moment). What I recall being told by one of the Subversion developers
 was that Subversion only supported the ASCII character set, and that while
 UTF-8 was compatible with ASCII, Subversion didn't truly support Unicode files.

 However, this blog entry seems to dispute that:

 http://rhubbarb.wordpress.com/2012/04/28/svn-unicode/

 Would adding that mime-type to this file fix the blame issues this user is 
 seeing?

I think the user is just very lucky. Subversion does not actually try to
interpret the svn:mime-type property, other than to determine whether to
treat a file as text or binary. (The comment is correct in that the
proper parameter is charset=, not encoding=, but that's not important
for this discussion).

Subversion's merge algorithm depends on being able to detect line
endings in the file, and always scans the file as a sequence of bytes.
There are several ways to represent line endings in a UTF-16 file (shown
here as hex byte sequences):

  * 00 0A (Unix newline, UTF-16BE)
  * 00 0D 00 0A (Windows newline, UTF-16BE)
  * 0A 00 (Unix newline, UTF-16LE)
  * 0D 00 0A 00 (Windows newline, UTF-16LE)
  * 24 24 (Unicode newline, same in LE and BE)

Subversion, however, expects one of the following newline sequences:

  * 0A (Unix newline)
  * 0D 0A (Windows newline)

My best guess as to what's happening is that the 0A bytes, a.k.a. the
ASCII newline character, are interpreted as the end-of-line markers, and
the zero bytes are treated as part of the text. In most cases, the
result will be close to correct, as long as there are no conflicts in
the merge -- because Subversion will not emit conflict markers in UTF-16.

Of course, if someone used the U+2424 newline code point instead, then
in the worst case, the whole file would be interpreted as a single line.
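
The effect guessed at above can be reproduced with a short Python sketch. It is not Subversion's code: `split_bytewise` is a made-up stand-in for any scanner that treats the single byte 0A as the end of a line, and the sample strings are invented.

```python
from typing import List


def split_bytewise(data: bytes) -> List[bytes]:
    """Split on the single byte 0A, as a byte-oriented line scanner would.

    Hypothetical stand-in for illustration; it knows nothing about encodings.
    """
    return data.split(b"\n")


if __name__ == "__main__":
    text = "ab\r\ncd\r\n"

    # UTF-8: each piece is a clean line ending in 0D.
    print(split_bytewise(text.encode("utf-8")))

    # UTF-16LE: the 00 that belongs to the newline code unit is pushed onto
    # the start of the next "line", so later lines are misaligned by a byte.
    print(split_bytewise(text.encode("utf-16-le")))

    # U+2424 encodes as 24 24, which contains no 0A byte at all, so the
    # scanner sees the whole buffer as a single line.
    print(split_bytewise("ab\u2424cd\u2424".encode("utf-16-le")))

    # Conversely, an ordinary letter such as U+010A contains a 0A byte in
    # UTF-16LE and produces a line break that is not in the text.
    print(split_bytewise("\u010a".encode("utf-16-le")))
```

The second case shows why the result is only "close to correct": line boundaries are found, but every line after the first starts with a stray zero byte.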

-- Brane


RE: SVN Blame Returns Corrupt Data

2013-10-11 Thread Bob Archer

Great information.. thanks for that.

Bottom line is use UTF-8 for your text files and svn will be happy and work 
correctly. How hard would it be to create

Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Branko Čibej

On 11.10.2013 18:12, Bob Archer wrote:
 Great information.. thanks for that.

 Bottom line is use UTF-8 for your text files and svn will be happy and work 
 correctly. How hard would it be to create a warning on an add that a

Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Andreas Krey

On Fri, 11 Oct 2013 17:43:30 +, Branko Čibej wrote:
...
 Of course, if someone used the U+2424 newline code point instead, then
 in the worst case, the whole file would be interpreted as a single line.

And SVN would be right, as U+2424 is 'SYMBOL FOR NEWLINE', which is
actually a printable character, not a control character.

Andreas

-- 
Totally trivial. Famous last words.
From: Linus Torvalds torvalds@*.org
Date: Fri, 22 Jan 2010 07:29:21 -0800

Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Ben Reser

On 10/11/13 9:22 AM, Branko Čibej wrote:
 You'd have to extend Subversion's file type detection to detect UTF-16.
 See svn_io_detect_mimetype2 in line  in this file:
 
 http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_subr/io.c?view=markup
 Subversion currently only looks at the first 1k Bytes of a file. It may
 be enough to check that this initial part of the file contains only
 valid UTF-16 (BE or LE) codes.

Even if all we looked for is the BOM it might be helpful enough.  I suspect the
development tools producing UTF-16 are including BOMs.  Windows seems to be
fond of including them, Notepad puts one even on UTF-8.
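
A BOM-only check could look roughly like the Python sketch below. It is an illustration, not a proposal for io.c; the function name sniff_bom is hypothetical, and it only inspects the first bytes of the sample.

```python
import codecs

# Longer BOMs are listed first because the UTF-32LE BOM (FF FE 00 00)
# begins with the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(head):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None


print(sniff_bom(b"\xff\xfeO\x00"))  # the first bytes of a UTF-16LE file starting with "O"
```

The bytes FF FE are what Windows-1252 displays as "ÿþ", so the "ÿþO" in the TortoiseBlame output of the original report is consistent with a UTF-16LE BOM followed by the letter O.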

Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Stefan Sperling

On Fri, Oct 11, 2013 at 09:52:31AM -0700, Ben Reser wrote:
 On 10/11/13 9:22 AM, Branko Čibej wrote:
  You'd have to extend Subversion's file type detection to detect UTF-16.
  See svn_io_detect_mimetype2 in line  in this file:
  
  http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_subr/io.c?view=markup
  Subversion currently only looks at the first 1k Bytes of a file. It may
  be enough to check that this initial part of the file contains only
  valid UTF-16 (BE or LE) codes.
 
 Even if all we looked for is the BOM it might be helpful enough.  I suspect 
 the
 development tools producing UTF-16 are including BOMs.  Windows seems to be
 fond of including them, Notepad puts one even on UTF-8.

Couldn't Subversion automatically convert UTF-16 files to UTF-8 before
processing them for diff/merge/blame, and convert output written to
the original files back to UTF-16?

That would require some work because existing streams, strings, and files
passed around in the code would need to be wrapped so that translation
between the internal and the external encoding is seamless.

But I don't see why such an approach couldn't be made to work in principle.
It might even result in some spring cleaning in the code base and pave the
way for improved handling of file formats such as XML for diff and merge.
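
As a rough illustration of the idea (in Python rather than Subversion's C, and using the standard difflib module in place of libsvn_diff), decoding before diffing could look like this. The function name diff_utf16 and the sample contents are assumptions made for the example.

```python
import difflib


def diff_utf16(old_bytes, new_bytes, encoding="utf-16"):
    """Decode two UTF-16 byte strings and return a unified diff of their text."""
    old_lines = old_bytes.decode(encoding).splitlines(keepends=True)
    new_lines = new_bytes.decode(encoding).splitlines(keepends=True)
    return list(difflib.unified_diff(old_lines, new_lines, "old", "new"))


# str.encode("utf-16") writes a BOM; decode("utf-16") reads and drops it.
old = "Fill_ddlLocation()\r\nEnd Class\r\n".encode("utf-16")
new = "Fill_ddlLocation()\r\nFill_ddlCountry()\r\nEnd Class\r\n".encode("utf-16")

for line in diff_utf16(old, new):
    print(repr(line))
```

Because the comparison runs on decoded text, only the inserted line shows up as a change. Note that Python's str.splitlines also breaks on code points such as U+0085, U+2028 and U+2029, which may or may not be what a real implementation wants.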

What do you think? Is it worth adding this to our project ideas page?

Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Branko Čibej

On 11.10.2013 18:52, Ben Reser wrote:
 On 10/11/13 9:22 AM, Branko Čibej wrote:
 You'd have to extend Subversion's file type detection to detect UTF-16.
 See svn_io_detect_mimetype2 in line  in this file:

 http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_subr/io.c?view=markup
 Subversion currently only looks at the first 1k Bytes of a file. It may
 be enough to check that this initial part of the file contains only
 valid UTF-16 (BE or LE) codes.
 Even if all we looked for is the BOM it might be helpful enough.  I suspect 
 the
 development tools producing UTF-16 are including BOMs.  Windows seems to be
 fond of including them, Notepad puts one even on UTF-8.

That would work only on Windows. On other platforms, you typically don't
get a BOM (actually, a zero-width non-breaking space) at the beginning
of a file. Granted, other platforms most likely use UTF-8 in any case.

-- Brane


-- 
Branko Čibej | Director of Subversion
WANdisco // Non-Stop Data
e. br...@wandisco.com

Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Branko Čibej

On 11.10.2013 19:25, Stefan Sperling wrote:
 On Fri, Oct 11, 2013 at 09:52:31AM -0700, Ben Reser wrote:
 On 10/11/13 9:22 AM, Branko Čibej wrote:
 You'd have to extend Subversion's file type detection to detect UTF-16.
 See svn_io_detect_mimetype2 in line  in this file:

 http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_subr/io.c?view=markup
 Subversion currently only looks at the first 1k Bytes of a file. It may
 be enough to check that this initial part of the file contains only
 valid UTF-16 (BE or LE) codes.
 Even if all we looked for is the BOM it might be helpful enough.  I suspect 
 the
 development tools producing UTF-16 are including BOMs.  Windows seems to be
 fond of including them, Notepad puts one even on UTF-8.
 Couldn't Subversion automatically convert UTF-16 files to UTF-8 before
 processing them for diff/merge/blame, and convert output written to
 the original files back to UTF-16?

That would be less work than supporting whitespace compression, etc. in
UTF-16, but we'd still have to detect U+2424 as an end-of-line marker in
UTF-8 text.

Still, we'd actually have to correctly identify UTF-16 content first,
and handle invalid byte sequences.
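
A minimal Python sketch of such a check on the first part of a file, assuming the byte order is already known or is tried both ways. The function name decodes_as_utf16 is hypothetical.

```python
import codecs


def decodes_as_utf16(head, encoding="utf-16-le"):
    """Return True if `head` decodes strictly as UTF-16 in the given byte order."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    try:
        # final=False tolerates a code unit or surrogate pair that was cut
        # off at the end of the sample.
        decoder.decode(head, final=False)
    except UnicodeDecodeError:
        return False
    return True


sample = "End Class\r\n".encode("utf-16-le")[:1024]
print(decodes_as_utf16(sample))
```

Only unpaired surrogates make UTF-16 decoding fail, so most byte strings pass, including plain ASCII text. Validity on its own is therefore a weak signal and would probably need to be combined with something else, such as a BOM or the frequency of zero bytes.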

 That would require some work because existing streams, strings, and files
 passed around in the code would need to be wrapped so that translation
 to/from the internal from/to the external encoding is seamless.

 But I don't see why such an approach couldn't be made to work in principle.
 It might even result in some spring cleaning in the code base and pave the
 way for improved handling of file formats such as XML for diff and merge.

Can't see what XML has to do with it. The diff algorithm already uses a
tokenizer; and for XML, that should be good enough most of the time.

 What do you think? Is it worth adding this to our project ideas page?

It's already here: http://subversion.tigris.org/issues/show_bug.cgi?id=2194

-- Brane

-- 
Branko Čibej | Director of Subversion
WANdisco // Non-Stop Data
e. br...@wandisco.com

Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Branko Čibej

On 11.10.2013 18:23, Andreas Krey wrote:
 On Fri, 11 Oct 2013 17:43:30 +, Branko Čibej wrote:
 ...
 Of course, if someone used the U+2424 newline code point instead, then
 in the worst case, the whole file would be interpreted as a single line.
 And SVN would be right, as U+2424 is 'SYMBOL FOR NEWLINE', which is
 actually a printable character, not a control character.

Meh, you're right ... it's U+0085 (next line), U+2028 (line separator)
or U+2029 (paragraph separator). I don't know what came over me; sorry
for misleading everyone.
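
For reference, the code points mentioned in this exchange can be inspected with a few lines of Python using the standard unicodedata module. The helper name describe and the "<control>" placeholder for unnamed control characters are choices made for this sketch.

```python
import unicodedata

CANDIDATES = ["\u000a", "\u0085", "\u2028", "\u2029", "\u2424"]


def hex_bytes(data):
    """Format bytes as space-separated lowercase hex pairs."""
    return " ".join(f"{b:02x}" for b in data)


def describe(ch):
    """Collect name, general category and byte encodings of one character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<control>"),  # control characters have no name
        "category": unicodedata.category(ch),
        "utf8": hex_bytes(ch.encode("utf-8")),
        "utf16le": hex_bytes(ch.encode("utf-16-le")),
    }


for ch in CANDIDATES:
    print(describe(ch))
```

U+2424 comes out with general category So (a symbol), while U+0085 is Cc (control), U+2028 is Zl (line separator) and U+2029 is Zp (paragraph separator), which matches the correction above.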

-- Brane


-- 
Branko Čibej | Director of Subversion
WANdisco // Non-Stop Data
e. br...@wandisco.com

Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Ben Reser

On 10/11/13 10:25 AM, Stefan Sperling wrote:
 Couldn't Subversion automatically convert UTF-16 files to UTF-8 before
 processing them for diff/merge/blame, and convert output written to
 the original files back to UTF-16?

That's what the patch I pointed out did.  Nobody seemed to object to the idea
at the time it was posted, though I think Branko does bring up some interesting
questions about handling the unicode control characters.

SVN Blame Returns Corrupt Data

2013-10-10 Thread T.J. Perovich

I'm having trouble running svn blame on a particular file.  It's returning
garbage.

In TortoiseBlame:
3341  TJP  ÿþO
3341  TJP

In the command line:
3341TJP  ■O
3341TJP


The file is 10.1k lines, not 2.  If I run the blame from revision 0 to 3341
it returns the correct information.

In WinMerge and TortoiseMerge, diffing the files shows about 10 lines
changing between 3340 and 3341 (it was a merge).  However, the command line
diff shows the entire contents being changed with spaces between every
character. So End Class reads E n d   C l a s s, etc.  Diffing a merge
post-rev# 3341 shows the same spaces between every letter.

svn diff -r 3341:3489 svn://...

@@ -20032,7 +20058,7 @@

  F i l l _ d d l L o c a t i o n ( )
  F i l l _ d d l C o u n t r y ( )


Another strange thing is it's marking these as lines 20,032 and 20,058.
 But in Notepad++ they are lines 10,026 and 10,031.  The line numbers in
pre-rev# 3341 diffs match up between the Notepad++ and command line fine.

Any ideas on what happened or how to fix it?  I'm most concerned about
getting the blame working again.  Please let me know if you need more
information or if I was unclear with anything.  Thanks in advance!

Re: SVN Blame Returns Corrupt Data

2013-10-10 Thread Ryan Schmidt


On Oct 10, 2013, at 11:29, T.J. Perovich tjperov...@gmail.com wrote:

 I'm having trouble running svn blame on a particular file.  It's returning 
 garbage.
 
 In TortoiseBlame:
 3341  TJP  ÿþO
 3341  TJP  
 
 In the command line:
 3341TJP  ■O
 3341TJP
 
 
 The file is 10.1k lines, not 2.  If I run the blame from revision 0 to 3341 
 it returns the correct information. 
 
 In WinMerge and TortoiseMerge, diffing the files shows about 10 lines 
 changing between 3340 and 3341 (it was merge).  However, the command line 
 diff shows the entire contents being changed with spaces between every 
 character. So End Class reads E n d   C l a s s, etc..  Diffing a merge 
 post-rev# 3341 show the same spaces between every letter.  
 
 svn diff -r 3341:3489 svn://...
 
 @@ -20032,7 +20058,7 @@
 
   F i l l _ d d l L o c a t i o n ( )
   F i l l _ d d l C o u n t r y ( )  

Sounds like you've converted the file from UTF-8 to UTF-16.


 Another strange thing is it's marking these as lines 20,032 and 20,058.  But 
 in Notepad++ they are lines 10,026 and 10,031.  The line numbers in pre-rev# 
 3341 diffs match up between the Notepad++ and command line fine.

Sounds like the line endings changed as well.
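
One other possible reading of the doubled line numbers, offered only as a hedged sketch: if a byte-oriented scanner treats a lone 0D, a lone 0A and the pair 0D 0A each as a line ending (an assumption about the scanner, not something confirmed in this thread), then a UTF-16LE file with CRLF endings yields about twice as many lines, because the 00 byte between 0D and 0A stops them from being seen as a pair. Python's bytes.splitlines follows exactly those rules, so it can stand in for such a scanner.

```python
def count_lines(data):
    """Count pieces when CR, LF and CRLF are each treated as one line ending."""
    return len(data.splitlines())


n = 5  # number of real lines in the sample
text = "".join(f"line {i}\r\n" for i in range(n))

print(count_lines(text.encode("utf-8")))
print(count_lines(text.encode("utf-16-le")))
```

The UTF-8 encoding gives n lines, while the UTF-16LE encoding gives 2n + 1 pieces, which is in the same ballpark as the 20,032 versus 10,026 reported above.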

RE: SVN Blame Returns Corrupt Data

2013-10-10 Thread Bob Archer

 On Oct 10, 2013, at 11:29, T.J. Perovich tjperov...@gmail.com wrote:
 
  I'm having trouble running svn blame on a particular file.  It's returning
 garbage.
 
  In TortoiseBlame:
  3341  TJP  ÿþO
  3341  TJP
 
  In the command line:
  3341TJP  ■O
  3341TJP
 
 
  The file is 10.1k lines, not 2.  If I run the blame from revision 0 to 3341 
  it
 returns the correct information.
 
  In WinMerge and TortoiseMerge, diffing the files shows about 10 lines
 changing between 3340 and 3341 (it was merge).  However, the command
 line diff shows the entire contents being changed with spaces between
 every character. So End Class reads E n d   C l a s s, etc..  Diffing a 
 merge
 post-rev# 3341 show the same spaces between every letter.
 
  svn diff -r 3341:3489 svn://...
 
  @@ -20032,7 +20058,7 @@
 
F i l l _ d d l L o c a t i o n ( )
F i l l _ d d l C o u n t r y ( )
 
 Sounds like you've converted the file from UTF-8 to UTF-16.
 
 
  Another strange thing is it's marking these as lines 20,032 and 20,058.  
  But in
 Notepad++ they are lines 10,026 and 10,031.  The line numbers in pre-rev#
 3341 diffs match up between the Notepad++ and command line fine.
 
 Sounds like the line endings changed as well.
 

Sigh... if only svn would support Unicode encodings.

BOb

Re: SVN Blame Returns Corrupt Data

2013-10-10 Thread T.J. Perovich

On Thu, Oct 10, 2013 at 2:27 PM, Ryan Schmidt 
subversion-20...@ryandesign.com wrote:
Sounds like you've converted the file from UTF-8 to UTF-16.

Thanks, you're absolutely right.  It changed from UTF-8 to UTF-16LE.

Any idea how to go about fixing it elegantly?  We have about 3 months of
commits since this happened.  Diff's in the GUIs have been working fine and
we don't blame too often which is why it was never noticed.


Re: SVN Blame Returns Corrupt Data

2013-10-10 Thread Thorsten Schöning

Hello T.J. Perovich,
on Thursday, 10 October 2013 at 21:17 you wrote:

 Any idea how to go about fixing it elegantly?

Simply convert it back using your method of choice, Notepad++ should
be able to handle this.
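
If an editor is not at hand, the conversion can also be scripted. The following Python sketch assumes the file carries a BOM (as files saved by Windows tools usually do); the function name utf16_to_utf8 and the paths are placeholders.

```python
from pathlib import Path


def utf16_to_utf8(src, dst):
    """Rewrite a UTF-16 file (with BOM) as UTF-8, leaving line endings untouched."""
    # The "utf-16" codec uses the BOM to pick the byte order and drops it.
    text = Path(src).read_bytes().decode("utf-16")
    # Reading and writing bytes avoids any newline translation, so CRLF stays CRLF.
    Path(dst).write_bytes(text.encode("utf-8"))
```

Calling utf16_to_utf8("Form1.vb", "Form1.utf8.vb") with hypothetical file names would write a BOM-less UTF-8 copy next to the original. This only converts the working copy; it does nothing for the history already committed in UTF-16.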

Kind regards,

Thorsten Schöning

-- 
Thorsten Schöning   E-Mail:thorsten.schoen...@am-soft.de
AM-SoFT IT-Systeme  http://www.AM-SoFT.de/

Telefon...05151-  9468- 55
Fax...05151-  9468- 88
Mobil..0178-8 9468- 04

AM-SoFT GmbH IT-Systeme, Brandenburger Str. 7c, 31789 Hameln
AG Hannover HRB 207 694 - Geschäftsführer: Andreas Muchow

RE: SVN Blame Returns Corrupt Data

2013-10-10 Thread Bob Archer

 Hello T.J. Perovich,
 on Thursday, 10 October 2013 at 21:17 you wrote:
 
  Any idea how to go about fixing it elegantly?
 
 Simply convert it back using your method of choice, Notepad++ should be
 able to handle this.
 
 Kind regards,
 
 Thorsten Schöning

I assume he was asking how to fix the blame. Cause, sure, he could open the 
file, convert it back to UTF-8 with CRLF line endings... and commit it... of 
course, now blame is going to show him on every line, since he just changed 
every line. 

However, at this point blame is probably wrong anyway, because it is showing
every line as changed by whoever changed all the line endings.

Bottom line, I think he's stuck with it.

BOb

Re: SVN Blame Returns Corrupt Data

2013-10-10 Thread Ben Reser

On 10/10/13 12:17 PM, T.J. Perovich wrote:
 On Thu, Oct 10, 2013 at 2:27 PM, Ryan Schmidt subversion-20...@ryandesign.com wrote:
Sounds like you've converted the file from UTF-8 to UTF-16.
 
 Thanks, you're absolutely right.  It changed from UTF-8 to UTF-16LE.  
 
 Any idea how to go about fixing it elegantly?  We have about 3 months of
 commits since this happened.  Diff's in the GUIs have been working fine and we
 don't blame too often which is why it was never noticed.

At present, blame is not UTF-16 aware.

About a year ago someone posted a patch (actually an entire copy of blame.c)
that helped, but it didn't go anywhere because the original poster didn't
follow up.

https://mail-archives.apache.org/mod_mbox/subversion-dev/201207.mbox/%3CCAAF0CB13B282344AF95AD2DE3D1962215627A3C%40DAG-B.nexon.corp%3E

and the followup

https://mail-archives.apache.org/mod_mbox/subversion-dev/201208.mbox/%3ccab84ubvvrhffqyea5pf5gstmpxz+rh2jkvdvcqscocjv+rq...@mail.gmail.com%3E

Perhaps you'll find the above useful.  Patches are of course welcome.
