Re: SVN Blame Returns Corrupt Data

2013-10-10 Thread Ryan Schmidt

On Oct 10, 2013, at 11:29, T.J. Perovich  wrote:

> I'm having trouble running svn blame on a particular file.  It's returning 
> garbage.
> 
> In TortoiseBlame:
> 3341  TJP  ÿþO
> 3341  TJP  
> 
> In the command line:
> 3341TJP  ■O
> 3341TJP
> 
> 
> The file is 10.1k lines, not 2.  If I run the blame from revision 0 to 3341 
> it returns the correct information. 
> 
> In WinMerge and TortoiseMerge, diffing the files shows about 10 lines 
> changing between 3340 and 3341 (it was a merge).  However, the command line 
> diff shows the entire contents being changed with spaces between every 
> character. So "End Class" reads "E n d   C l a s s", etc.  Diffing a merge 
> post-rev# 3341 shows the same spaces between every letter.  
> 
> svn diff -r 3341:3489 svn://...
> 
> @@ -20032,7 +20058,7 @@
> 
>   F i l l _ d d l L o c a t i o n ( )
>   F i l l _ d d l C o u n t r y ( )  

Sounds like you've converted the file from UTF-8 to UTF-16.
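
A minimal Python sketch of why a UTF-16 file looks like this to a byte-oriented tool. It assumes the file is UTF-16LE with a byte order mark and that the tool shows one character per byte; the helper name `as_seen_by_byte_tool` is made up for illustration.

```python
import codecs

def as_seen_by_byte_tool(text: str) -> str:
    """Encode text as UTF-16LE with a BOM, then misread it one byte per character."""
    raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    # Latin-1 maps every byte to exactly one character, like a byte-oriented viewer.
    return raw.decode("latin-1")

garbled = as_seen_by_byte_tool("End Class")
print(repr(garbled[:2]))                 # the two BOM bytes (FF FE) shown as characters
print(garbled[2:].replace("\x00", " "))  # NUL bytes rendered as blanks
```

The leading two characters correspond to the garbage at the start of the blame output, and the NUL byte after each ASCII character accounts for the apparent space between every letter.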


> Another strange thing is it's marking these as lines 20,032 and 20,058.  But 
> in Notepad++ they are lines 10,026 and 10,031.  The line numbers in pre-rev# 
> 3341 diffs match up between the Notepad++ and command line fine.

Sounds like the line endings changed as well.




RE: SVN Blame Returns Corrupt Data

2013-10-10 Thread Bob Archer

Sigh... if only svn would support Unicode encodings.

BOb



Re: SVN Blame Returns Corrupt Data

2013-10-10 Thread T.J. Perovich
On Thu, Oct 10, 2013 at 2:27 PM, Ryan Schmidt <
subversion-20...@ryandesign.com> wrote:
>Sounds like you've converted the file from UTF-8 to UTF-16.

Thanks, you're absolutely right.  It changed from UTF-8 to UTF-16LE.

Any idea how to go about fixing it elegantly?  We have about 3 months of
commits since this happened.  Diffs in the GUIs have been working fine and
we don't run blame very often, which is why it was never noticed.




Re: SVN Blame Returns Corrupt Data

2013-10-10 Thread Thorsten Schöning
Hello T.J. Perovich,
on Thursday, October 10, 2013 at 21:17 you wrote:

> Any idea how to go about fixing it elegantly?

Simply convert it back using your method of choice, Notepad++ should
be able to handle this.
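
For a scripted alternative to an editor, a minimal Python sketch of the conversion follows. It assumes the file carries a UTF-16 byte order mark; the function name `utf16_to_utf8` and the file names are hypothetical.

```python
import codecs
import tempfile
from pathlib import Path

def utf16_to_utf8(src: Path, dst: Path) -> None:
    """Rewrite a UTF-16 file (with BOM) as UTF-8 without a BOM, keeping line endings."""
    # The "utf-16" codec reads the BOM to pick the byte order and then drops it.
    text = src.read_bytes().decode("utf-16")
    # Working on bytes avoids any newline translation, so CRLF stays CRLF.
    dst.write_bytes(text.encode("utf-8"))

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "Example.vb"        # hypothetical file name
    dst = Path(tmp) / "Example.utf8.vb"   # hypothetical file name
    src.write_bytes(codecs.BOM_UTF16_LE + "End Class\r\n".encode("utf-16-le"))
    utf16_to_utf8(src, dst)
    converted = dst.read_bytes()

print(converted)
```

Reading and writing raw bytes is a deliberate choice here: it keeps the existing line endings untouched, so the only change in the file is the encoding.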

Kind regards,

Thorsten Schöning

-- 
Thorsten Schöning   E-Mail:thorsten.schoen...@am-soft.de
AM-SoFT IT-Systeme  http://www.AM-SoFT.de/

Phone...05151-  9468- 55
Fax...05151-  9468- 88
Mobile..0178-8 9468- 04

AM-SoFT GmbH IT-Systeme, Brandenburger Str. 7c, 31789 Hameln
AG Hannover HRB 207 694 - Managing Director: Andreas Muchow



RE: SVN Blame Returns Corrupt Data

2013-10-10 Thread Bob Archer
> Hello T.J. Perovich,
> on Thursday, October 10, 2013 at 21:17 you wrote:
> 
> > Any idea how to go about fixing it elegantly?
> 
> Simply convert it back using your method of choice, Notepad++ should be
> able to handle this.
> 
> Kind regards,
> 
> Thorsten Schöning

I assume he was asking how to "fix" the blame. Sure, he could open the 
file, convert it back to UTF-8 with CRLF line endings, and commit it. Of 
course, blame is then going to show him on every line, since he just changed 
every line. 

However, at this point blame is probably wrong anyway, because it shows 
every line as changed by whoever changed all the line endings.

Bottom line, I think he's stuck with it.

BOb



Re: SVN Blame Returns Corrupt Data

2013-10-10 Thread Ben Reser
On 10/10/13 12:17 PM, T.J. Perovich wrote:
> On Thu, Oct 10, 2013 at 2:27 PM, Ryan Schmidt wrote:
>>Sounds like you've converted the file from UTF-8 to UTF-16.
> 
> Thanks, you're absolutely right.  It changed from UTF-8 to UTF-16LE.  
> 
> Any idea how to go about fixing it elegantly?  We have about 3 months of
> commits since this happened.  Diff's in the GUIs have been working fine and we
> don't blame too often which is why it was never noticed.

At present, blame is not UTF-16 aware.

About a year ago someone posted a "patch" (actually an entire reposted copy
of blame.c) that helped, but it didn't go anywhere because the original
poster didn't follow up.

https://mail-archives.apache.org/mod_mbox/subversion-dev/201207.mbox/%3CCAAF0CB13B282344AF95AD2DE3D1962215627A3C%40DAG-B.nexon.corp%3E

and the followup

https://mail-archives.apache.org/mod_mbox/subversion-dev/201208.mbox/%3ccab84ubvvrhffqyea5pf5gstmpxz+rh2jkvdvcqscocjv+rq...@mail.gmail.com%3E

Perhaps you'll find the above useful.  Patches are of course welcome.


Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread T.J. Perovich
On Thu, Oct 10, 2013 at 5:49 PM, Bob Archer  wrote:

> I assume he was asking how to "fix" the blame. Cause, sure, he could open
> the file, convert it back to UTF-8 with CRLF line endings... and commit
> it... of course, now blame is going to show him on every line, since he
> just changed every line.
>
>
That's exactly what I meant.  You're correct about how the blame is handled.
I committed the UTF-8 copy to a test branch, diff'd, and it showed every
line as being changed.  Unfortunately it looks like this is our best option.


On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser  wrote:

> At current blame is not UTF-16 aware.
>
> About a year ago there was a "patch" (actually they just reposted an entire
> copy of blame.c) posted that helped, but it really didn't go anywhere
> since the original poster didn't continue following up.
>
> Perhaps you'll find the above useful.  Patches are of course welcome.
>

I'll take a look at that this weekend.  Thanks for all the input everyone.





RE: SVN Blame Returns Corrupt Data

2013-10-11 Thread Bob Archer
> On Thu, Oct 10, 2013 at 5:49 PM, Bob Archer wrote:
> > I assume he was asking how to "fix" the blame. Cause, sure, he could open
> > the file, convert it back to UTF-8 with CRLF line endings... and commit
> > it... of course, now blame is going to show him on every line, since he
> > just changed every line.
> 
> That's exactly what I meant.  You're correct with how the blame is handled.  I
> committed the UTF-8 copy to a test branch, diff'd, and it showed every line
> as being changed.  Unfortunately it looks like this is our best option.

Yep, we have done the same thing. As a matter of fact, over the past few 
days I rescripted all our database scripts to be UTF-8, since merging them 
just doesn't work correctly when they are UTF-16, even if you remove the 
binary MIME type.

> 
> 
> On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser  wrote:
> At current blame is not UTF-16 aware.

It's not just blame that isn't... the diff engine, or whatever detects file 
types, always considers UTF-16 files to be binary. If you "add" a UTF-16 file, 
you see that svn sets the application/octet-stream MIME type.  There is an 
issue in the bug database about this from when I reported/complained about 
it; however, it hasn't been addressed. I'm surprised that, at this point, svn 
still can't treat UTF-16 text files as text with respect to adding, diffing, 
blaming, etc.
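
A rough Python sketch of why a byte-oriented text/binary check would trip on UTF-16. The `looks_binary` heuristic is a hypothetical stand-in, not Subversion's actual detection code: it assumes that any NUL byte in the leading sample marks the data as binary.

```python
def looks_binary(data: bytes, sample_size: int = 1024) -> bool:
    """Hypothetical heuristic: a NUL byte in the leading sample means 'binary'."""
    return b"\x00" in data[:sample_size]

source = "End Class\r\n"
print(looks_binary(source.encode("utf-8")))      # ASCII-range text has no NUL bytes
print(looks_binary(source.encode("utf-16-le")))  # every ASCII character carries a 00 byte
```

Under that assumption, the same source line is classified as text in UTF-8 and as binary in UTF-16, which matches the behavior described above.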

BOb





Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Branko Čibej
On 11.10.2013 15:58, Bob Archer wrote:
>> On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser  wrote:
>> At current blame is not UTF-16 aware.
> It's not just blame that isn't... the diff engine, or whatever detects file 
> types always considers UTF-16 files to be binary. If you "add" a UTF-16 file 
> you see that svn adds the application/octet-stream mime type.  There is an 
> issue in the bug database about this from when I reported/complained about 
> it... however it hasn't been addressed. I'm surprised still at this time that 
> svn still can't support UTF-16 text files as text wrt adding, diffing, 
> blaming, etc.

It's quite simple: no one has written the necessary code. While I can
understand it's an interesting feature for Windows users, most
Subversion developers have other things to do. Since this is a volunteer
project and most of us do not use Windows, you can hardly expect anyone
to spend several weeks solving a problem that has a perfectly simple
workaround. Because UTF-8 and UTF-16 can be interchanged without data
loss, there are other, much more important things to do in Subversion.

To turn your argument around: I'm surprised no Windows user has yet
written a patch for Subversion to make it support UTF-16 ...

-- Brane


-- 
Branko Čibej | Director of Subversion
WANdisco // Non-Stop Data
e. br...@wandisco.com


RE: SVN Blame Returns Corrupt Data

2013-10-11 Thread Bob Archer
> It's quite simple: no-one has written the necessary code. While I can
> understand it's an interesting feature for Windows users, most Subversion
> developers have other things to do. This being a volunteer project, and most
> of us do not use Windows, you can hardly expect anyone to spend several
> weeks on solving a problem that has a perfectly simple workaround. Since
> UTF-8 and UTF-16 can be interchanged without data loss, there are other,
> much more important things to do in Subversion.

I appreciate all that you said. I didn't realize that UTF-16 was so uncommon 
on non-Windows OSes. A large number of dev tools that I work with on Windows, 
especially the Microsoft tools, default to creating UTF-16 files.  

I disagree with your "can be converted without data loss". If you need UTF-16, 
then you need it. Also, if you are working in an international team and you 
have developers with other-language OSes that use different code pages, then 
what you see when you look at a UTF-8 file might be different from what I see. 

So, when I say "I'm surprised", I only say that with the knowledge that the 
internet has made the world very flat, and I'm sure there is much more 
collaboration among people who use different languages and work on apps that 
need to deal with international languages, etc. I'm not dissing the devs in 
any way.

> To turn your argument around: I'm surprised no Windows user has yet
> written a patch for Subversion to make it support UTF-16 ...

If I knew how to, I would. I work with C#, and while I'm sure C is similar, it 
is probably quite different in practice. If an svn dev would mentor me through 
it, and perhaps tell me which modules would need to be modified, I would be 
happy to take a whack at it. 

BOb



Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Branko Čibej
On 11.10.2013 16:55, Bob Archer wrote:
> I appreciate all that you said. I didn't expect that UTF-16 was so uncommon 
> in non-Windows OSes. A large number of dev tools that I work with on Windows, 
> especially the Microsoft tools default to creating UTF-16 files.  
>
> I disagree with your "can be converted without data loss". If you need UTF-16 
> then you need it. Also, if you are working in an international team and you 
> have developers with other language OSes which have different code pages then 
> what you see when you look at a UTF-8 file might be different than what I see.

I don't follow. Both UTF-16 and UTF-8 are complete representations of
the Unicode character set. Exactly the same code sequences can be
represented in both encodings. You can convert from UTF-16 to UTF-8 and
back and get exactly the same sequence of bytes.
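
A small Python check of the round-trip claim, as a sketch under one assumption: the input is well-formed UTF-16LE (no unpaired surrogates). The sample text and the helper names `utf16le_to_utf8` and `utf8_to_utf16le` are made up for illustration.

```python
import codecs

def utf16le_to_utf8(data: bytes) -> bytes:
    """Re-encode UTF-16LE bytes as UTF-8; a leading BOM is carried along as U+FEFF."""
    return data.decode("utf-16-le").encode("utf-8")

def utf8_to_utf16le(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as UTF-16LE."""
    return data.decode("utf-8").encode("utf-16-le")

sample_text = "Fill_ddlLocation()\r\nČibej Grüße\r\n"
original = codecs.BOM_UTF16_LE + sample_text.encode("utf-16-le")

round_tripped = utf8_to_utf16le(utf16le_to_utf8(original))
print(round_tripped == original)
```

The explicit `utf-16-le` codec is used so that the byte order mark survives as an ordinary character in both directions; the generic `utf-16` codec would strip it on decode.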

-- Brane




RE: SVN Blame Returns Corrupt Data

2013-10-11 Thread Bob Archer
> I don't follow. Both UTF-16 and UTF-8 are complete representations of the
> Unicode character set. Exactly the same code sequences can be represented
> in both encodings. You can convert from UTF-16 to UTF-8 and back and get
> exactly the same sequence of bytes.
> 

OK, I have to backpedal a bit here.  You are correct: UTF-8 is a Unicode 
encoding and can store all characters. It's not a UTF-8 vs. UTF-16 issue 
(Friday senior moment). What I recall being told by one of the Subversion 
developers was that Subversion only supported the ASCII character set and, 
while UTF-8 was compatible with ASCII, it didn't truly support Unicode files. 

However, this blog entry seems to dispute that:

http://rhubbarb.wordpress.com/2012/04/28/svn-unicode/

Would adding that mime-type to this file fix the blame issues this user is 
seeing?

BOb



Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Branko Čibej
On 11.10.2013 17:19, Bob Archer wrote:
> Ok, I have to back pedal here a bit.  You are correct, UTF-8 is a Unicode 
> format and can store all characters. It's not a UTF-8 vs UTF-16 issue (Friday 
> senior moment). What I recall being told by one of the subversion developers 
> was that subversion only supported the ASCII character set and while UTF-8 
> was compatible with ASCII it didn't truly support Unicode files. 
>
> However, this blog entry seems to dispute that:
>
> http://rhubbarb.wordpress.com/2012/04/28/svn-unicode/
>
> Would adding that mime-type to this file fix the blame issues this user is 
> seeing?

I think the user is just very lucky. Subversion does not actually try to
interpret the svn:mime-type property, other than to determine whether to
treat a file as text or binary. (The comment is correct in that the
proper parameter is charset=, not encoding=, but that's not important
for this discussion).

Subversion's merge algorithm depends on being able to detect line
endings in the file, and always scans the file as a sequence of bytes.
There are several ways to represent line endings in a UTF-16 file (shown
here as hex byte sequences):

  * 00 0A (Unix newline, UTF16-BE)
  * 00 0D 00 0A (Windows newline, UTF16-BE)
  * 0A 00 (Unix newline, UTF16-LE)
  * 0D 00 0A 00 (Windows newline, UTF16-LE)
  * 24 24 (Unicode newline, same in LE and BE)

Subversion, however, expects one of the following newline sequences:

  * 0A (Unix newline)
  * 0D 0A (Windows newline)
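
The byte sequences listed above can be reproduced with a few lines of Python. This is only an illustrative sketch; the helper name `hex_bytes` is made up, and `bytes.hex` with a separator needs Python 3.8 or later.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Return the encoded bytes of text as space-separated uppercase hex."""
    return text.encode(encoding).hex(" ").upper()

for label, newline in (("Unix", "\n"), ("Windows", "\r\n")):
    for encoding in ("utf-16-be", "utf-16-le", "utf-8"):
        print(f"{label:8}{encoding:10}{hex_bytes(newline, encoding)}")
```

Only the UTF-8 rows produce the bare `0A` and `0D 0A` sequences; in UTF-16 each of those bytes is paired with a `00` byte.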

My best guess as to what's happening is that the 0A bytes, a.k.a. the
ASCII newline character, are interpreted as the end-of-line markers, and
the zero bytes are treated as part of the text. In most cases, the
result will be close to correct, as long as there are no conflicts in
the merge -- because Subversion will not emit conflict markers in UTF-16.

Of course, if someone used the U+2424 newline code point instead, then
in the worst case, the whole file would be interpreted as a single line.
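
A minimal Python sketch of the byte-wise scan guessed at above. It is not Subversion's code: `split_on_lf_bytes` is a hypothetical helper that splits on the single byte 0A only, and the two-line sample is invented. For reference, Unicode names U+2424 "SYMBOL FOR NEWLINE"; the dedicated line separator, U+2028, likewise contains no 0A byte in UTF-16.

```python
import unicodedata

def split_on_lf_bytes(data: bytes) -> list:
    """Split on the single byte 0A, the way a byte-oriented scanner would."""
    return data.split(b"\x0a")

two_lines = "End Class\nEnd Module\n"

# UTF-16LE with LF: the scan still finds line breaks, but the 00 byte that
# belongs to each LF ends up at the start of the following piece.
lf_pieces = split_on_lf_bytes(two_lines.encode("utf-16-le"))
print(len(lf_pieces), lf_pieces[1][:1])

# Same text with U+2424 instead of LF: no 0A byte exists anywhere.
symbol_pieces = split_on_lf_bytes(two_lines.replace("\n", "\u2424").encode("utf-16-le"))
print(len(symbol_pieces))
print(unicodedata.name("\u2424"))
```

The stray leading 00 byte on each piece is consistent with the guess that the zero bytes are treated as part of the text, and the second case shows the single-line worst case.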

-- Brane


RE: SVN Blame Returns Corrupt Data

2013-10-11 Thread Bob Archer
> I think the user is just very lucky. Subversion does not actually try to 
> interpret
> the svn:mime-type property, other than to determine whether to treat a file
> as text or binary. (The comment is correct in that the proper parameter is
> charset=, not encoding=, but that's not important for this discussion).
> 
> Subversion's merge algorithm depends on being able to detect line endings
> in the file, and always scans the file as a sequence of bytes.
> There are several ways to represent line endings in a UTF-16 file (shown here
> as hex byte sequences):
> 
>   * 00 0A (Unix newline, UTF16-BE)
>   * 00 0D 00 0A (Windows newline, UTF16-BE)
>   * 0A 00 (Unix newline, UTF16-LE)
>   * 0D 00 0A 00 (Windows newline, UTF16-LE)
>   * 24 24 (Unicode newline, same in LE and BE)
> 
> Subversion, however, expects one of the following newline sequences:
> 
>   * 0A (Unix newline)
>   * 0D 0A (Windows newline)
> 
> My best guess as to what's happening is that the 0A bytes, a.k.a. the ASCII
> newline character, are interpreted as the end-of-line markers, and the zero
> bytes are treated as part of the text. In most cases, the result will be 
> close to
> correct, as long as there are no conflicts in the merge -- because Subversion
> will not emit conflict markers in UTF-16.
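
The byte-level effect guessed at above can be reproduced with a small sketch. It assumes that a scanner treats only the single byte 0A as end-of-line; the sample text and the helper name are made up for illustration, and this is not Subversion's actual code.

```python
# Sketch: split UTF-16LE data the way a byte-oriented scanner might.
# Assumption: only the single byte 0x0A counts as an end-of-line marker.
text = "End Class\r\nEnd Sub\r\n"      # hypothetical sample content
data = text.encode("utf-16-le")        # each ASCII character gains a 00 byte

naive_lines = data.split(b"\n")        # split on the raw 0A byte only


def show(raw):
    """Render NUL bytes as spaces, roughly as a terminal might display them."""
    return raw.replace(b"\x00", b" ").decode("ascii")


for raw in naive_lines:
    print(repr(show(raw)))
```

Each piece keeps its zero bytes, and every piece after the first starts with the 00 byte left over from the preceding 0A 00 pair, which is consistent with the "E n d   C l a s s" output in the original report.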
> 
> Of course, if someone use

Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Branko Čibej
On 11.10.2013 18:12, Bob Archer wrote:
>> On 11.10.2013 17:19, Bob Archer wrote:
 On 11.10.2013 16:55, Bob Archer wrote:
>> On 11.10.2013 15:58, Bob Archer wrote:
 On Thu, Oct 10, 2013 at 5:49 PM, Bob Archer
>> 
>> wrote:
 I assume he was asking how to "fix" the blame. Cause, sure, he
 could open the file, convert it back to UTF-8 with CRLF line
 endings... and commit it... of course, now blame is going to show
 him on every line, since he just changed every line.

 That's exactly what I meant.  You're correct with how the blame
 is handled.  I committed the UTF-8 copy to a test branch, diff'd,
 and it showed every line as being changed.  Unfortunately it
 looks like this is our
>> best option.
>>> Yep, we have done the same thing. As a matter of fact, I just over
>>> the past
>> few days rescripted all our database scripts to be UTF-8 since
>> merging them just doesn't work correctly when they are UTF-16 even
>> if you remove the binary mime type.
 On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser  wrote:
 At current blame is not UTF-16 aware.
>>> It's not just blame that isn't... the diff engine, or whatever
>>> detects file
>> types always considers UTF-16 files to be binary. If you "add" a
>> UTF-16 file you see that svn adds the application/octet-stream mime
>> type.  There is an issue in the bug database about this from when I
>> reported/complained about it... however it hasn't been addressed.
>> I'm surprised still at this time that svn still can't support
>> UTF-16 text files as
 text wrt adding, diffing, blaming, etc.
>> It's quite simple: no-one has written the necessary code. While I
>> can understand it's an interesting feature for Windows users, most
>> Subversion developers have other things to do. This being a
>> volunteer project, and most of us do not use Windows, you can
>> hardly expect anyone to spend several weeks on solving a problem
>> that has a perfectly simple workaround. Since
>> UTF-8 and UTF-16 can be interchanged without data loss, there are
>> other, much more important things to do in Subversion.
> I appreciate all that you said. I didn't expect that UTF-16 was so
> uncommon
 in non-Windows OSes. A large number of dev tools that I work with on
 Windows, especially the Microsoft tools default to creating UTF-16 files.
> I disagree with your "can be converted without data loss". If you
> need UTF-
 16 then you need it. Also, if you are working in an international
 team and you have developers with other language OSes which have
 different code pages then what you see when you look at a UTF-8 file
 might be different than what I see.

 I don't follow. Both UTF-16 and UTF-8 are complete representations of
 the Unicode character set. Exactly the same code sequences can be
 represented in both encodings. You can convert from UTF-16 to UTF-8
 and back and get exactly the same sequence of bytes.

>>> Ok, I have to back pedal here a bit.  You are correct, UTF-8 is a Unicode
>> format and can store all characters. It's not a UTF-8 vs UTF-16 issue (Friday
>> senior moment). What I recall being told by one of the subversion
>> developers was that subversion only supported the ASCII character set and
>> while UTF-8 was compatible with ASCII it didn't truly support Unicode files.
>>> However, this blog entry seems to dispute that:
>>>
>>> http://rhubbarb.wordpress.com/2012/04/28/svn-unicode/
>>>
>>> Would adding that mime-type to this file fix the blame issues this user is
>> seeing?
>>
>> I think the user is just very lucky. Subversion does not actually try to 
>> interpret
>> the svn:mime-type property, other than to determine whether to treat a file
>> as text or binary. (The comment is correct in that the proper parameter is
>> charset=, not encoding=, but that's not important for this discussion).
>>
>> Subversion's merge algorithm depends on being able to detect line endings
>> in the file, and always scans the file as a sequence of bytes.
>> There are several ways to represent line endings in a UTF-16 file (shown here
>> as hex byte sequences):
>>
>>   * 00 0A (Unix newline, UTF16-BE)
>>   * 00 0D 00 0A (Windows newline, UTF16-BE)
>>   * 0A 00 (Unix newline, UTF16-LE)
>>   * 0D 00 0A 00 (Windows newline, UTF16-LE)
>>   * 24 24 (Unicode newline, same in LE and BE)
>>
>> Subversion, however, expects one of the following newline sequences:
>>
>>   * 0A (Unix newline)
>>   * 0D 0A (Windows newline)
>>
>> My best guess as to what's happening is that the 0A bytes, a.k.a. the ASCII
>> newline character, are interpreted as the end-of-line markers, and the zero
>> bytes are treated as part of the text. In most cases, the result will be 
>> close to
>> correct, as long as there are no conflicts in the merge -- because Subversion
>> will no

Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Andreas Krey
On Fri, 11 Oct 2013 17:43:30 +, Branko Čibej wrote:
...
> Of course, if someone used the U+2424 newline code point instead, then
> in the worst case, the whole file would be interpreted as a single line.

And SVN would be right, as U+2424 is 'SYMBOL FOR NEWLINE', which is
actually a printable character, not a control character.

Andreas

-- 
"Totally trivial. Famous last words."
From: Linus Torvalds 
Date: Fri, 22 Jan 2010 07:29:21 -0800


Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Ben Reser
On 10/11/13 9:22 AM, Branko Čibej wrote:
> You'd have to extend Subversion's file type detection to detect UTF-16.
> See svn_io_detect_mimetype2 in line  in this file:
> 
> http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_subr/io.c?view=markup
> Subversion currently only looks at the first 1k Bytes of a file. It may
> be enough to check that this initial part of the file contains only
> valid UTF-16 (BE or LE) codes.

Even if all we looked for is the BOM it might be helpful enough.  I suspect the
development tools producing UTF-16 are including BOMs.  Windows seems to be
fond of including them, Notepad puts one even on UTF-8.
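
A minimal sketch of what BOM-only detection could look like, written in Python rather than Subversion's C. The names sniff_bom and sniff_file are hypothetical, and the 1024-byte limit simply mirrors the 1k window mentioned above. For reference, the "ÿþ" at the start of the reported blame output is how the bytes FF FE display in a Latin-1 style code page, and FF FE is the UTF-16LE BOM. The sketch ignores UTF-32, whose little-endian BOM also begins with FF FE.

```python
import codecs

# Sketch only: classify a file's leading bytes by byte order mark.
# sniff_bom and sniff_file are hypothetical helpers, not Subversion APIs.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(head):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None


def sniff_file(path, limit=1024):
    """Look at only the first `limit` bytes of the file at `path`."""
    with open(path, "rb") as fh:
        return sniff_bom(fh.read(limit))
```

A file without a BOM yields None here, so a check like this can only ever be a positive hint, never proof that a file is not UTF-16.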


Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Stefan Sperling
On Fri, Oct 11, 2013 at 09:52:31AM -0700, Ben Reser wrote:
> On 10/11/13 9:22 AM, Branko Čibej wrote:
> > You'd have to extend Subversion's file type detection to detect UTF-16.
> > See svn_io_detect_mimetype2 in line  in this file:
> > 
> > http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_subr/io.c?view=markup
> > Subversion currently only looks at the first 1k Bytes of a file. It may
> > be enough to check that this initial part of the file contains only
> > valid UTF-16 (BE or LE) codes.
> 
> Even if all we looked for is the BOM it might be helpful enough.  I suspect 
> the
> development tools producing UTF-16 are including BOMs.  Windows seems to be
> fond of including them, Notepad puts one even on UTF-8.

Couldn't Subversion automatically convert UTF-16 files to UTF-8 before
processing them for diff/merge/blame, and convert output written to
the original files back to UTF-16?

That would require some work because existing streams, strings, and files
passed around in the code would need to be wrapped so that translation
to/from the internal from/to the external encoding is seamless.

But I don't see why such an approach couldn't be made to work in principle.
It might even result in some spring cleaning in the code base and pave the
way for improved handling of file formats such as XML for diff and merge.

What do you think? Is it worth adding this to our project ideas page?
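
As a rough illustration of the idea (not of how Subversion would implement it), the sketch below decodes two UTF-16 inputs to text before diffing them. It uses Python's difflib as a stand-in for the diff engine; diff_utf16 and the sample contents are assumptions made up for the example.

```python
import difflib


def diff_utf16(old, new):
    """Diff two UTF-16 byte strings by decoding them to text first."""
    # The "utf-16" codec reads a leading BOM to pick the byte order.
    old_lines = old.decode("utf-16").splitlines(keepends=True)
    new_lines = new.decode("utf-16").splitlines(keepends=True)
    return list(difflib.unified_diff(old_lines, new_lines, "old", "new"))


# Hypothetical file contents, encoded with a BOM as Windows tools tend to do.
old = "Fill_ddlLocation()\r\nFill_ddlCountry()\r\nEnd Class\r\n".encode("utf-16")
new = "Fill_ddlLocation()\r\nFill_ddlRegion()\r\nEnd Class\r\n".encode("utf-16")

# Keep only the added and removed lines, skipping the ---/+++ file headers.
changed = [
    line for line in diff_utf16(old, new)
    if line[:1] in ("+", "-") and line[:3] not in ("+++", "---")
]
print(changed)
```

Only the one modified line shows up as changed, instead of the whole file. Writing results back would need the reverse step (encoding the text as UTF-16 again, with the original byte order and BOM), which the sketch leaves out.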


Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Branko Čibej
On 11.10.2013 18:52, Ben Reser wrote:
> On 10/11/13 9:22 AM, Branko Čibej wrote:
>> You'd have to extend Subversion's file type detection to detect UTF-16.
>> See svn_io_detect_mimetype2 in line  in this file:
>>
>> http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_subr/io.c?view=markup
>> Subversion currently only looks at the first 1k Bytes of a file. It may
>> be enough to check that this initial part of the file contains only
>> valid UTF-16 (BE or LE) codes.
> Even if all we looked for is the BOM it might be helpful enough.  I suspect 
> the
> development tools producing UTF-16 are including BOMs.  Windows seems to be
> fond of including them, Notepad puts one even on UTF-8.

That would work only on Windows. On other platforms, you typically don't
get a BOM (actually, a zero-width non-breaking space) at the beginning
of a file. Granted, other platforms most likely use UTF-8 in any case.

-- Brane


-- 
Branko Čibej | Director of Subversion
WANdisco // Non-Stop Data
e. br...@wandisco.com


Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Branko Čibej
On 11.10.2013 19:25, Stefan Sperling wrote:
> On Fri, Oct 11, 2013 at 09:52:31AM -0700, Ben Reser wrote:
>> On 10/11/13 9:22 AM, Branko Čibej wrote:
>>> You'd have to extend Subversion's file type detection to detect UTF-16.
>>> See svn_io_detect_mimetype2 in line  in this file:
>>>
>>> http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_subr/io.c?view=markup
>>> Subversion currently only looks at the first 1k Bytes of a file. It may
>>> be enough to check that this initial part of the file contains only
>>> valid UTF-16 (BE or LE) codes.
>> Even if all we looked for is the BOM it might be helpful enough.  I suspect 
>> the
>> development tools producing UTF-16 are including BOMs.  Windows seems to be
>> fond of including them, Notepad puts one even on UTF-8.
> Couldn't Subversion automatically convert UTF-16 files to UTF-8 before
> processing them for diff/merge/blame, and convert output written to
> the original files back to UTF-16?

That would be less work than supporting whitespace compression, etc. in
UTF-16, but we'd still have to detect U+2424 as an end-of-line marker in
UTF-8 text.

Still, we'd actually have to correctly identify UTF-16 content first,
and handle invalid byte sequences.

> That would require some work because existing streams, strings, and files
> passed around in the code would need to be wrapped so that translation
> to/from the internal from/to the external encoding is seamless.
>
> But I don't see why such an approach couldn't be made to work in principle.
> It might even result in some spring cleaning in the code base and pave the
> way for improved handling of file formats such as XML for diff and merge.

Can't see what XML has to do with it. The diff algorithm already uses a
tokenizer; and for XML, that should be good enough most of the time.

> What do you think? Is it worth adding this to our project ideas page?

It's already here: http://subversion.tigris.org/issues/show_bug.cgi?id=2194

-- Brane



Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Branko Čibej
On 11.10.2013 18:23, Andreas Krey wrote:
> On Fri, 11 Oct 2013 17:43:30 +, Branko Čibej wrote:
> ...
>> Of course, if someone used the U+2424 newline code point instead, then
>> in the worst case, the whole file would be interpreted as a single line.
> And SVN would be right, as U+2424 is 'SYMBOL FOR NEWLINE', which is
> actually a printable character, not a control character.

Meh, you're right ... it's U+0085 (next line), U+2028 (line separator)
or U+2029 (paragraph separator). I don't know what came over me; sorry
for misleading everyone.
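
For reference, the properties of these code points can be checked against the Unicode database. The sketch below uses Python's unicodedata module; describe is a hypothetical helper and the placeholder name for unnamed characters is an assumption of the example.

```python
import unicodedata


def describe(cp):
    """Return (category, name, UTF-8 bytes, UTF-16LE bytes) for a code point."""
    ch = chr(cp)
    # Control characters such as U+0085 have no formal name, hence the default.
    name = unicodedata.name(ch, "<unnamed control>")
    return (unicodedata.category(ch), name,
            ch.encode("utf-8"), ch.encode("utf-16-le"))


# The code points mentioned in this thread.
for cp in (0x2424, 0x0085, 0x2028, 0x2029):
    category, name, utf8, utf16le = describe(cp)
    print(f"U+{cp:04X} {category} {name:20} "
          f"utf-8={utf8.hex(' ')} utf-16-le={utf16le.hex(' ')}")
```

None of the four encodes to a sequence containing the byte 0A in either encoding, so a scanner that looks only for 0A or 0D 0A would not treat any of them as a line break.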

-- Brane




Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Ben Reser
On 10/11/13 10:25 AM, Stefan Sperling wrote:
> Couldn't Subversion automatically convert UTF-16 files to UTF-8 before
> processing them for diff/merge/blame, and convert output written to
> the original files back to UTF-16?

That's what the patch I pointed out did.  Nobody seemed to object to the idea
at the time it was posted, though I think Branko does bring up some interesting
questions about handling the unicode control characters.