Fresh checkout shows modified files

2013-10-11 Thread Stümpfig , Thomas
Hi all,
for some directories I get modified working copies right after a fresh checkout.
This seems very strange to me. I do not get any error message. My platform is
Windows 7 64-bit. I am using TortoiseSVN 1.8.1, Build 24570 - 64 Bit, as the
client and VisualSVN as the server. The TortoiseSVN command-line tools produce
the same results.

How can I start to debug? Network doesn't seem to be the issue. Am I right?

Best regards

Thomas Stümpfig
Senior Presales Consultant
Global Sales & Services
Product Lifecycle Management

Siemens Industry Sector
Siemens Industry Software GmbH & Co. KG
Franz-Geuer-Str. 10
50823  Cologne, Germany
Tel.  :+49 (2153) 9107117
Fax  :+49 (221) 20802 699
Mobile :+49 (175) 2205712
thomas.stuemp...@siemens.com
www.siemens.com/plm




Re: Fresh checkout shows modified files

2013-10-11 Thread Stefan Sperling
On Fri, Oct 11, 2013 at 09:41:33AM +, Stümpfig, Thomas wrote:
 Hi all,
 for some directories I get modified working copies right after a fresh 
 checkout.
 This seems very strange to me. I do not get any error message. My Platform is 
 Windows 7 64 Bit. I am using TortoiseSVN 1.8.1, Build 24570 - 64 Bit as 
 client. Visual SVN as server. But also tortoise svn command line tools are 
 producing the same results.
 
 How can I start to debug? Network doesn't seem to be the issue. Am I right?

Does 'svn cleanup' fix it?

If so, my guess would be that timestamps of the on-disk files differ
from what was recorded in .svn/wc.db. This difference would cause the
modification check logic to flag files as modified.

'svn cleanup' re-syncs the timestamps recorded in the metadata with the
on-disk ones.
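
The timestamp theory above can be illustrated with a minimal sketch. It assumes a client that records each file's size and modification time at checkout and uses them as a cheap first-pass check; `RecordedInfo`, `record`, and `looks_modified` are hypothetical names, and this is not Subversion's actual code or the wc.db schema.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class RecordedInfo:
    """Hypothetical stand-in for the size and mtime a client might record."""
    size: int
    mtime_ns: int


def record(path):
    """Capture the size and modification time of a file as it is on disk now."""
    st = os.stat(path)
    return RecordedInfo(size=st.st_size, mtime_ns=st.st_mtime_ns)


def looks_modified(path, recorded):
    """Cheap check: flag the file if its size or mtime differs from the record."""
    st = os.stat(path)
    return st.st_size != recorded.size or st.st_mtime_ns != recorded.mtime_ns


def demo():
    """Return (before, after) results of the check around a timestamp-only change."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "file.txt")
        with open(path, "w") as f:
            f.write("unchanged content\n")
        recorded = record(path)
        before = looks_modified(path, recorded)
        # Shift only the mtime by two seconds; the content is untouched.
        st = os.stat(path)
        os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns + 2_000_000_000))
        after = looks_modified(path, recorded)
        return before, after
```

In this model, a file whose content never changed is still flagged once its on-disk timestamp drifts from the recorded one, and re-recording the on-disk values makes the check pass again.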


Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread T.J. Perovich
On Thu, Oct 10, 2013 at 5:49 PM, Bob Archer bob.arc...@amsi.com wrote:

 I assume he was asking how to fix the blame. Cause, sure, he could open
 the file, convert it back to UTF-8 with CRLF line endings... and commit
 it... of course, now blame is going to show him on every line, since he
 just changed every line.


That's exactly what I meant.  You're correct about how the blame is handled.
I committed the UTF-8 copy to a test branch, diffed it, and it showed every
line as changed.  Unfortunately, it looks like this is our best option.
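
For reference, the conversion step itself could look roughly like the sketch below. It assumes the UTF-16 file starts with a byte order mark (so Python's `utf-16` codec can pick the byte order) and that the CRLF line endings should be kept as they are; `convert_to_utf8` and the file names are hypothetical. It does nothing about blame attributing every line to the converting commit.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding="utf-16"):
    """Re-encode a text file as UTF-8 while leaving its line endings untouched."""
    # newline="" disables newline translation, so CRLF stays CRLF.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


def demo():
    """Run a small CRLF file through the conversion and return the UTF-8 bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "script_utf16.sql")
        dst = os.path.join(tmp, "script_utf8.sql")
        with open(src, "wb") as f:
            # Byte order mark followed by UTF-16LE text.
            f.write(b"\xff\xfe" + "SELECT 1;\r\nGO\r\n".encode("utf-16-le"))
        convert_to_utf8(src, dst)
        with open(dst, "rb") as f:
            return f.read()
```

For a file without a byte order mark, passing `src_encoding="utf-16-le"` explicitly would be the safer choice.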


On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser b...@reser.org wrote:

 Currently, blame is not UTF-16 aware.

 About a year ago a patch was posted (actually, an entire reposted copy of
 blame.c) that helped, but it didn't go anywhere since the original poster
 didn't continue following up.

 Perhaps you'll find the above useful.  Patches are of course welcome.


I'll take a look at that this weekend.  Thanks for all the input everyone.



On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser b...@reser.org wrote:

 On 10/10/13 12:17 PM, T.J. Perovich wrote:
  On Thu, Oct 10, 2013 at 2:27 PM, Ryan Schmidt
  subversion-20...@ryandesign.com wrote:
   Sounds like you've converted the file from UTF-8 to UTF-16.
 
  Thanks, you're absolutely right.  It changed from UTF-8 to UTF-16LE.
 
  Any idea how to go about fixing it elegantly?  We have about 3 months of
  commits since this happened.  Diffs in the GUIs have been working fine
  and we don't blame too often, which is why it was never noticed.

 Currently, blame is not UTF-16 aware.

 About a year ago a patch was posted (actually, an entire reposted copy of
 blame.c) that helped, but it didn't go anywhere since the original poster
 didn't continue following up.


 https://mail-archives.apache.org/mod_mbox/subversion-dev/201207.mbox/%3CCAAF0CB13B282344AF95AD2DE3D1962215627A3C%40DAG-B.nexon.corp%3E

 and the followup


 https://mail-archives.apache.org/mod_mbox/subversion-dev/201208.mbox/%3ccab84ubvvrhffqyea5pf5gstmpxz+rh2jkvdvcqscocjv+rq...@mail.gmail.com%3E

 Perhaps you'll find the above useful.  Patches are of course welcome.



RE: SVN Blame Returns Corrupt Data

2013-10-11 Thread Bob Archer
 On Thu, Oct 10, 2013 at 5:49 PM, Bob Archer bob.arc...@amsi.com wrote:
  I assume he was asking how to fix the blame. Cause, sure, he could open
  the file, convert it back to UTF-8 with CRLF line endings... and commit
  it... of course, now blame is going to show him on every line, since he
  just changed every line.
 
 That's exactly what I meant.  You're correct with how the blame is handled.
 I committed the UTF-8 copy to a test branch, diff'd, and it showed every
 line as being changed.  Unfortunately it looks like this is our best option.

Yep, we have done the same thing. As a matter of fact, over the past few
days I re-scripted all our database scripts as UTF-8, since merging them just
doesn't work correctly when they are UTF-16, even if you remove the binary
MIME type.

 
 
 On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser b...@reser.org wrote:
 At current blame is not UTF-16 aware.

It's not just blame that isn't... the diff engine, or whatever detects file
types, always considers UTF-16 files to be binary. If you add a UTF-16 file,
you will see that svn sets the application/octet-stream MIME type.  There is an
issue in the bug database about this from when I reported/complained about
it... however, it hasn't been addressed. I'm surprised that svn still can't
treat UTF-16 text files as text with respect to adding, diffing, blaming,
etc.
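
One plausible explanation, stated here as an assumption rather than a description of Subversion's actual detection code, is that byte-oriented text/binary heuristics commonly treat zero bytes as a sign of binary data, and UTF-16 encodes every ASCII character with a zero byte next to it. The sketch below only demonstrates that property of the encodings; `contains_nul` is a hypothetical helper.

```python
def contains_nul(data):
    """Return True if the byte string contains a zero byte."""
    return b"\x00" in data


SAMPLE = "SELECT 1;\r\n"

# The same text in both encodings, without a byte order mark.
utf8_bytes = SAMPLE.encode("utf-8")
utf16_bytes = SAMPLE.encode("utf-16-le")
```

Plain ASCII text stored as UTF-8 has no zero bytes at all, while the UTF-16LE form alternates text bytes with zero bytes, which is exactly what such a heuristic would trip over.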

BOb


 



Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Branko Čibej
On 11.10.2013 15:58, Bob Archer wrote:
 On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser b...@reser.org wrote:
 At current blame is not UTF-16 aware.
 It's not just blame that isn't... the diff engine, or whatever detects file 
 types always considers UTF-16 files to be binary. If you add a UTF-16 file 
 you see that svn adds the application/octet-stream mime type.  There is an 
 issue in the bug database about this from when I reported/complained about 
 it... however it hasn't been addressed. I'm surprised still at this time that 
 svn still can't support UTF-16 text files as text wrt adding, diffing, 
 blaming, etc.

It's quite simple: no one has written the necessary code. While I can
understand that it's an interesting feature for Windows users, most
Subversion developers have other things to do. This is a volunteer
project and most of us do not use Windows, so you can hardly expect anyone
to spend several weeks solving a problem that has a perfectly simple
workaround. Since UTF-8 and UTF-16 can be interchanged without data
loss, there are other, much more important things to do in Subversion.

To turn your argument around: I'm surprised no Windows user has yet
written a patch for Subversion to make it support UTF-16 ...

-- Brane


-- 
Branko Čibej | Director of Subversion
WANdisco // Non-Stop Data
e. br...@wandisco.com


RE: SVN Blame Returns Corrupt Data

2013-10-11 Thread Bob Archer
 It's quite simple: no-one has written the necessary code. While I can
 understand it's an interesting feature for Windows users, most Subversion
 developers have other things to do. This being a volunteer project, and most
 of us do not use Windows, you can hardly expect anyone to spend several
 weeks on solving a problem that has a perfectly simple workaround. Since
 UTF-8 and UTF-16 can be interchanged without data loss, there are other,
 much more important things to do in Subversion.

I appreciate all that you said. I didn't expect UTF-16 to be so uncommon on
non-Windows OSes. A large number of the dev tools that I work with on Windows,
especially the Microsoft tools, default to creating UTF-16 files.

I disagree with your "can be converted without data loss." If you need UTF-16
then you need it. Also, if you are working in an international team and you
have developers whose OSes are in other languages and use different code pages,
then what you see when you look at a UTF-8 file might be different from what I see.

So, when I say I'm surprised, I say it only with the knowledge that the
internet has made the world very flat, and I'm sure there is much more
collaboration among people who use different languages and work on apps that
need to deal with international languages, etc. I'm not dissing the devs in any
way.

 To turn your argument around: I'm surprised no Windows user has yet
 written a patch for Subversion to make it support UTF-16 ...

If I knew how to, I would. I work with C#, and while I'm sure C is similar, it is
probably different enough. If an svn dev would mentor me through it, and perhaps
tell me which modules would need to be modified, I would be happy to take a whack
at it.

BOb



Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Branko Čibej
On 11.10.2013 16:55, Bob Archer wrote:
 I appreciate all that you said. I didn't expect that UTF-16 was so uncommon 
 in non-Windows OSes. A large number of dev tools that I work with on Windows, 
 especially the Microsoft tools default to creating UTF-16 files.  

 I disagree with your "can be converted without data loss." If you need UTF-16
 then you need it. Also, if you are working in an international team and you
 have developers whose OSes are in other languages and use different code pages,
 then what you see when you look at a UTF-8 file might be different from what I see.

I don't follow. Both UTF-16 and UTF-8 are complete representations of
the Unicode character set. Exactly the same code point sequences can be
represented in both encodings. You can convert from UTF-16 to UTF-8 and
back and get exactly the same sequence of bytes.
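
The round-trip claim is easy to check with a short sketch. It assumes well-formed UTF-16 input with a known byte order and no byte order mark; the helper names `utf16_to_utf8` and `utf8_to_utf16` are hypothetical.

```python
def utf16_to_utf8(data, byteorder="le"):
    """Decode UTF-16 bytes of the given byte order and re-encode them as UTF-8."""
    return data.decode("utf-16-" + byteorder).encode("utf-8")


def utf8_to_utf16(data, byteorder="le"):
    """Decode UTF-8 bytes and re-encode them as UTF-16 of the given byte order."""
    return data.decode("utf-8").encode("utf-16-" + byteorder)


# Sample text with Latin-1, CJK, and non-BMP characters plus a CRLF line ending.
SAMPLE_TEXT = "Gr\u00fc\u00dfe, \u4e16\u754c \U0001F600\r\n"
ORIGINAL = SAMPLE_TEXT.encode("utf-16-le")

as_utf8 = utf16_to_utf8(ORIGINAL)
round_tripped = utf8_to_utf16(as_utf8)
```

Here `round_tripped` comes out byte-for-byte equal to `ORIGINAL`. A byte order mark, if present, would need to be handled consistently in both directions for the bytes to match.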

-- Brane




RE: SVN Blame Returns Corrupt Data

2013-10-11 Thread Bob Archer
 I don't follow. Both UTF-16 and UTF-8 are complete representations of the
 Unicode character set. Exactly the same code sequences can be represented
 in both encodings. You can convert from UTF-16 to UTF-8 and back and get
 exactly the same sequence of bytes.
 

OK, I have to backpedal here a bit.  You are correct, UTF-8 is a Unicode
format and can store all characters. It's not a UTF-8 vs UTF-16 issue (Friday
senior moment). What I recall being told by one of the Subversion developers
was that Subversion only supported the ASCII character set and that, while UTF-8
was compatible with ASCII, Subversion didn't truly support Unicode files.

However, this blog entry seems to dispute that:

http://rhubbarb.wordpress.com/2012/04/28/svn-unicode/

Would adding that mime-type to this file fix the blame issues this user is 
seeing?

BOb



Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Branko Čibej
On 11.10.2013 17:19, Bob Archer wrote:
 Ok, I have to back pedal here a bit.  You are correct, UTF-8 is a Unicode 
 format and can store all characters. It's not a UTF-8 vs UTF-16 issue (Friday 
 senior moment). What I recall being told by one of the subversion developers 
 was that subversion only supported the ASCII character set and while UTF-8 
 was compatible with ASCII it didn't truly support Unicode files. 

 However, this blog entry seems to dispute that:

 http://rhubbarb.wordpress.com/2012/04/28/svn-unicode/

 Would adding that mime-type to this file fix the blame issues this user is 
 seeing?

I think the user is just very lucky. Subversion does not actually try to
interpret the svn:mime-type property, other than to determine whether to
treat a file as text or binary. (The comment is correct in that the
proper parameter is charset=, not encoding=, but that's not important
for this discussion).

Subversion's merge algorithm depends on being able to detect line
endings in the file, and always scans the file as a sequence of bytes.
There are several ways to represent line endings in a UTF-16 file (shown
here as hex byte sequences):

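  * 00 0A (Unix newline, UTF-16BE)
  * 00 0D 00 0A (Windows newline, UTF-16BE)
  * 0A 00 (Unix newline, UTF-16LE)
  * 0D 00 0A 00 (Windows newline, UTF-16LE)
  * 24 24 (U+2424, same in LE and BE; strictly speaking this is the visible
    SYMBOL FOR NEWLINE glyph, and the dedicated Unicode line separator is
    U+2028, i.e. 20 28 in BE and 28 20 in LE)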

Subversion, however, expects one of the following newline sequences:

  * 0A (Unix newline)
  * 0D 0A (Windows newline)

My best guess as to what's happening is that the 0A bytes, a.k.a. the
ASCII newline character, are interpreted as the end-of-line markers, and
the zero bytes are treated as part of the text. In most cases, the
result will be close to correct, as long as there are no conflicts in
the merge -- because Subversion will not emit conflict markers in UTF-16.

Of course, if someone used the U+2424 newline code point instead, then
in the worst case, the whole file would be interpreted as a single line.
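
The guess above can be illustrated with a small sketch that splits UTF-16 data on the single byte 0A, the way a byte-oriented scanner would. `split_lines_bytewise` is a hypothetical helper, not Subversion's line scanner.

```python
def split_lines_bytewise(data):
    """Split on the single byte 0A, ignoring the text encoding entirely."""
    return data.split(b"\n")


TEXT = "ab\r\ncd\r\n"

# Two-line text with Windows newlines, in both byte orders.
utf16le_lines = split_lines_bytewise(TEXT.encode("utf-16-le"))
utf16be_lines = split_lines_bytewise(TEXT.encode("utf-16-be"))

# Text that uses U+2424 between "ab" and "cd" contains no 0A byte at all.
u2424_lines = split_lines_bytewise("ab\u2424cd".encode("utf-16-le"))
```

In the little-endian case the zero byte that belongs to the newline ends up at the start of the next "line", so each piece is no longer valid UTF-16 on its own, yet the line boundaries are roughly where a reader would expect them. The U+2424 sample comes back as a single piece.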

-- Brane



RE: SVN Blame Returns Corrupt Data

2013-10-11 Thread Bob Archer
 My best guess as to what's happening is that the 0A bytes, a.k.a. the ASCII
 newline character, are interpreted as the end-of-line markers, and the zero
 bytes are treated as part of the text. In most cases, the result will be
 close to correct, as long as there are no conflicts in the merge -- because
 Subversion will not emit conflict markers in UTF-16.
 
 Of course, if someone used the U+2424 newline code point instead, then in
 the worst case, the whole file would be interpreted as a single line.
 
 -- Brane

Great information, thanks for that.

Bottom line: use UTF-8 for your text files and svn will be happy and work
correctly. How hard would it be to create

Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Branko Čibej
On 11.10.2013 18:12, Bob Archer wrote:
 On 11.10.2013 17:19, Bob Archer wrote:
 On 11.10.2013 16:55, Bob Archer wrote:
 On 11.10.2013 15:58, Bob Archer wrote:
 On Thu, Oct 10, 2013 at 5:49 PM, Bob Archer
 bob.arc...@amsi.com
 wrote:
 I assume he was asking how to fix the blame. Cause, sure, he
 could open the file, convert it back to UTF-8 with CRLF line
 endings... and commit it... of course, now blame is going to show
 him on every line, since he just changed every line.

 That's exactly what I meant.  You're correct with how the blame
 is handled.  I committed the UTF-8 copy to a test branch, diff'd,
 and it showed every line as being changed.  Unfortunately it
 looks like this is our
 best option.
 Yep, we have done the same thing. As a matter of fact, I just over
 the past
 few days rescripted all our database scripts to be UTF-8 since
 merging them just doesn't work correctly when they are UTF-16 even
 if you remove the binary mime type.
 On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser b...@reser.org wrote:
 At current blame is not UTF-16 aware.
 It's not just blame that isn't... the diff engine, or whatever
 detects file
 types always considers UTF-16 files to be binary. If you add a
 UTF-16 file you see that svn adds the application/octet-stream mime
 type.  There is an issue in the bug database about this from when I
 reported/complained about it... however it hasn't been addressed.
 I'm surprised still at this time that svn still can't support
 UTF-16 text files as
 text wrt adding, diffing, blaming, etc.
 It's quite simple: no-one has written the necessary code. While I
 can understand it's an interesting feature for Windows users, most
 Subversion developers have other things to do. This being a
 volunteer project, and most of us do not use Windows, you can
 hardly expect anyone to spend several weeks on solving a problem
 that has a perfectly simple workaround. Since
 UFT-8 and UTF-16 can be interchanged without data loss, there are
 other, much more important things to do in Subversion.
 I appreciate all that you said. I didn't expect that UTF-16 was so
 uncommon
 in non-Windows OSes. A large number of dev tools that I work with on
 Windows, especially the Microsoft tools default to creating UTF-16 files.
 I disagree with your can be converted without data loss. If you
 need UTF-
 16 then you need it. Also, if you are working in an international
 team and you have developers with other language Oss which have
 different code pages then what you see when you look at a UTF-8 file
 might be different than what I see.

 I don't follow. Both UTF-16 and UTF-8 are complete representations of
 the Unicode character set. Exactly the same code sequences can be
 represented in both encodings. You can convert from UTF-16 to UTF-8
 and back and get exactly the same sequence of bytes.
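
The round-trip claim above can be checked with a short Python sketch. The
sample string is made up for illustration, and the check assumes well-formed
UTF-16 input with consistent BOM handling (the sketch writes no BOM at all);
input containing unpaired surrogates would fail to decode rather than
round-trip.

```python
# Round-trip check: UTF-16 -> UTF-8 -> UTF-16 on a hypothetical sample string.
# It mixes ASCII, Latin, CJK, a non-BMP character and a Windows line ending.
sample = "Gr\u00fc\u00dfe, \u010cibej, \u65e5\u672c\u8a9e, \U0001F600\r\n"

# Encode without a BOM so both sides of the comparison are treated alike.
utf16_original = sample.encode("utf-16-le")

# Transcode to UTF-8 and back again.
as_utf8 = utf16_original.decode("utf-16-le").encode("utf-8")
utf16_again = as_utf8.decode("utf-8").encode("utf-16-le")

round_trip_ok = utf16_again == utf16_original
print("identical bytes after round trip:", round_trip_ok)
```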

 Ok, I have to backpedal here a bit.  You are correct, UTF-8 is a Unicode
 format and can store all characters. It's not a UTF-8 vs UTF-16 issue (Friday
 senior moment). What I recall being told by one of the Subversion
 developers was that Subversion only supported the ASCII character set and,
 while UTF-8 was compatible with ASCII, it didn't truly support Unicode files.
 However, this blog entry seems to dispute that:

 http://rhubbarb.wordpress.com/2012/04/28/svn-unicode/

 Would adding that mime-type to this file fix the blame issues this user is
 seeing?

 I think the user is just very lucky. Subversion does not actually try to
 interpret the svn:mime-type property, other than to determine whether to
 treat a file as text or binary. (The comment is correct in that the proper
 parameter is charset=, not encoding=, but that's not important for this
 discussion.)

 Subversion's merge algorithm depends on being able to detect line endings
 in the file, and always scans the file as a sequence of bytes.
 There are several ways to represent line endings in a UTF-16 file (shown here
 as hex byte sequences):

   * 00 0A (Unix newline, UTF16-BE)
   * 00 0D 00 0A (Windows newline, UTF16-BE)
   * 0A 00 (Unix newline, UTF16-LE)
   * 0D 00 0A 00 (Windows newline, UTF16-LE)
   * 24 24 (Unicode newline, same in LE and BE)

 Subversion, however, expects one of the following newline sequences:

   * 0A (Unix newline)
   * 0D 0A (Windows newline)

 My best guess as to what's happening is that the 0A bytes, a.k.a. the ASCII
 newline character, are interpreted as the end-of-line markers, and the zero
 bytes are treated as part of the text. In most cases, the result will be
 close to correct, as long as there are no conflicts in the merge -- because
 Subversion will not emit conflict markers in UTF-16.
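
The guess above can be illustrated with a small Python sketch. This is not
Subversion's code; it only mimics a scanner that splits on the single byte 0A,
and the sample text is made up. It shows the zero byte of each UTF-16LE
newline ending up at the start of the following "line", and it also shows that
a 0A byte can occur inside an unrelated code unit.

```python
# Hypothetical two-line file with one Windows and one Unix line ending.
text = "first\r\nsecond\n"
data = text.encode("utf-16-le")

# Split the way a byte-oriented scanner would: on the single byte 0x0A.
byte_lines = data.split(b"\x0a")
for piece in byte_lines:
    # Every piece after the first begins with the orphaned 00 byte that
    # belonged to the preceding UTF-16LE newline (0A 00).
    print(piece)

# A 0A byte is not always a newline in UTF-16: U+010A encodes as 0A 01 in LE,
# so a byte scanner would see a line break in the middle of that character.
false_positive = "\u010a".encode("utf-16-le")
contains_lf_byte = b"\x0a" in false_positive
print(false_positive.hex(), contains_lf_byte)
```

Because the split pieces are consistent from one revision to the next, a
line-based comparison can still come out "close to correct", which matches the
behaviour described above.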

 Of course, if someone used the U+2424 newline code point instead, then in
 the worst case, the whole file would be interpreted as a single line.

 -- Brane
 Great information.. thanks for that.

 Bottom line is use UTF-8 for your text files and svn will be happy and work 
 correctly. How hard would it be to create a warning on an add that a 

Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Andreas Krey
On Fri, 11 Oct 2013 17:43:30 +, Branko Čibej wrote:
...
 Of course, if someone used the U+2424 newline code point instead, then
 in the worst case, the whole file would be interpreted as a single line.

And SVN would be right, as U+2424 is 'SYMBOL FOR NEWLINE', which is
actually a printable character, not a control character.

Andreas

-- 
Totally trivial. Famous last words.
From: Linus Torvalds torvalds@*.org
Date: Fri, 22 Jan 2010 07:29:21 -0800


Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Ben Reser
On 10/11/13 9:22 AM, Branko Čibej wrote:
 You'd have to extend Subversion's file type detection to detect UTF-16.
 See svn_io_detect_mimetype2 in line  in this file:
 
 http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_subr/io.c?view=markup
 Subversion currently only looks at the first 1k Bytes of a file. It may
 be enough to check that this initial part of the file contains only
 valid UTF-16 (BE or LE) codes.

Even if all we looked for is the BOM it might be helpful enough.  I suspect the
development tools producing UTF-16 are including BOMs.  Windows seems to be
fond of including them, Notepad puts one even on UTF-8.
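
A BOM-only check along the lines suggested here could look roughly like the
Python sketch below. The function name sniff_utf16_bom is hypothetical and
this is not how svn_io_detect_mimetype2 works; it only inspects the leading
bytes of a buffer such as the first 1k of a file.

```python
import codecs


def sniff_utf16_bom(head: bytes):
    """Return 'utf-16-le', 'utf-16-be', or None, based only on a leading BOM.

    `head` stands for the first bytes of a file (for example the first 1024).
    """
    # The UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE),
    # so rule out UTF-32 first. The cost is that a UTF-16LE file whose first
    # character is U+0000 is not recognised, which should be rare in text.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return None
    if head.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if head.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None


# Python's generic "utf-16" codec writes a BOM in the machine's byte order.
print(sniff_utf16_bom("hello".encode("utf-16")))
print(sniff_utf16_bom(b"plain ASCII text\n"))
```

Files without a BOM return None here, so a check like this would only catch
UTF-16 produced by tools that write one.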


Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Stefan Sperling
On Fri, Oct 11, 2013 at 09:52:31AM -0700, Ben Reser wrote:
 On 10/11/13 9:22 AM, Branko Čibej wrote:
  You'd have to extend Subversion's file type detection to detect UTF-16.
  See svn_io_detect_mimetype2 in line  in this file:
  
  http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_subr/io.c?view=markup
  Subversion currently only looks at the first 1k Bytes of a file. It may
  be enough to check that this initial part of the file contains only
  valid UTF-16 (BE or LE) codes.
 
 Even if all we looked for is the BOM it might be helpful enough.  I suspect
 the development tools producing UTF-16 are including BOMs.  Windows seems
 to be fond of including them, Notepad puts one even on UTF-8.

Couldn't Subversion automatically convert UTF-16 files to UTF-8 before
processing them for diff/merge/blame, and convert output written to
the original files back to UTF-16?

That would require some work because existing streams, strings, and files
passed around in the code would need to be wrapped so that translation
between the internal and the external encoding is seamless in both directions.
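
As a rough illustration of the wrapping idea, the Python sketch below
transcodes UTF-16 input to UTF-8, applies a byte-oriented, line-based
operation, and transcodes the result back. The helper name process_as_utf8 and
the line_op callback are hypothetical; nothing here reflects Subversion's
actual stream API.

```python
def process_as_utf8(data: bytes, encoding: str, line_op) -> bytes:
    """Run a byte-oriented line operation on UTF-16 data via UTF-8.

    `encoding` is the external encoding (for example "utf-16-le") and
    `line_op` is any function mapping one UTF-8 line (bytes) to bytes.
    """
    # External -> internal: in UTF-8 the ASCII line endings are single bytes.
    utf8 = data.decode(encoding).encode("utf-8")

    # Byte-level line handling now works, because 0A and 0D never occur
    # inside a multi-byte UTF-8 sequence.
    lines = utf8.splitlines(keepends=True)
    result = b"".join(line_op(line) for line in lines)

    # Internal -> external: write the result back in the original encoding.
    return result.decode("utf-8").encode(encoding)


original = "alpha\r\nbeta\r\n".encode("utf-16-le")
edited = process_as_utf8(original, "utf-16-le",
                         lambda line: line.replace(b"beta", b"gamma"))
print(edited.decode("utf-16-le"))
```

With an explicit byte order such as "utf-16-le", a leading BOM is carried
through as an ordinary U+FEFF character, so it survives the round trip
unchanged.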

But I don't see why such an approach couldn't be made to work in principle.
It might even result in some spring cleaning in the code base and pave the
way for improved handling of file formats such as XML for diff and merge.

What do you think? Is it worth adding this to our project ideas page?


Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Branko Čibej
On 11.10.2013 18:52, Ben Reser wrote:
 On 10/11/13 9:22 AM, Branko Čibej wrote:
 You'd have to extend Subversion's file type detection to detect UTF-16.
 See svn_io_detect_mimetype2 in line  in this file:

 http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_subr/io.c?view=markup
 Subversion currently only looks at the first 1k Bytes of a file. It may
 be enough to check that this initial part of the file contains only
 valid UTF-16 (BE or LE) codes.
 Even if all we looked for is the BOM it might be helpful enough.  I suspect
 the development tools producing UTF-16 are including BOMs.  Windows seems
 to be fond of including them, Notepad puts one even on UTF-8.

That would work only on Windows. On other platforms, you typically don't
get a BOM (actually, a zero-width non-breaking space) at the beginning
of a file. Granted, other platforms most likely use UTF-8 in any case.

-- Brane


-- 
Branko Čibej | Director of Subversion
WANdisco // Non-Stop Data
e. br...@wandisco.com


Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Branko Čibej
On 11.10.2013 19:25, Stefan Sperling wrote:
 On Fri, Oct 11, 2013 at 09:52:31AM -0700, Ben Reser wrote:
 On 10/11/13 9:22 AM, Branko Čibej wrote:
 You'd have to extend Subversion's file type detection to detect UTF-16.
 See svn_io_detect_mimetype2 in line  in this file:

 http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_subr/io.c?view=markup
 Subversion currently only looks at the first 1k Bytes of a file. It may
 be enough to check that this initial part of the file contains only
 valid UTF-16 (BE or LE) codes.
 Even if all we looked for is the BOM it might be helpful enough.  I suspect
 the development tools producing UTF-16 are including BOMs.  Windows seems
 to be fond of including them, Notepad puts one even on UTF-8.
 Couldn't Subversion automatically convert UTF-16 files to UTF-8 before
 processing them for diff/merge/blame, and convert output written to
 the original files back to UTF-16?

That would be less work than supporting whitespace compression, etc. in
UTF-16, but we'd still have to detect U+2424 as an end-of-line marker in
UTF-8 text.

Still, we'd actually have to correctly identify UTF-16 content first,
and handle invalid byte sequences.

 That would require some work because existing streams, strings, and files
 passed around in the code would need to be wrapped so that translation
 between the internal and the external encoding is seamless in both directions.

 But I don't see why such an approach couldn't be made to work in principle.
 It might even result in some spring cleaning in the code base and pave the
 way for improved handling of file formats such as XML for diff and merge.

Can't see what XML has to do with it. The diff algorithm already uses a
tokenizer; and for XML, that should be good enough most of the time.

 What do you think? Is it worth adding this to our project ideas page?

It's already here: http://subversion.tigris.org/issues/show_bug.cgi?id=2194

-- Brane



Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Branko Čibej
On 11.10.2013 18:23, Andreas Krey wrote:
 On Fri, 11 Oct 2013 17:43:30 +, Branko Čibej wrote:
 ...
 Of course, if someone used the U+2424 newline code point instead, then
 in the worst case, the whole file would be interpreted as a single line.
 And SVN would be right, as U+2424 is 'SYMBOL FOR NEWLINE', which is
 actually a printable character, not a control character.

Meh, you're right ... it's U+0085 (next line), U+2028 (line separator)
or U+2029 (paragraph separator). I don't know what came over me; sorry
for misleading everyone.
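
The distinction can be seen with Python's str.splitlines, which treats U+0085,
U+2028 and U+2029 as line boundaries but not U+2424. This is only an
illustration of the Unicode semantics using Python's behaviour; it says
nothing about what Subversion does with these code points.

```python
# Code points mentioned in the thread, keyed by their U+ notation.
candidates = {
    "U+0085": "\u0085",  # NEXT LINE
    "U+2028": "\u2028",  # LINE SEPARATOR
    "U+2029": "\u2029",  # PARAGRAPH SEPARATOR
    "U+2424": "\u2424",  # SYMBOL FOR NEWLINE (a printable glyph)
}

# True if the code point splits "a?b" into two lines, False otherwise.
breaks_line = {
    name: len(f"a{ch}b".splitlines()) == 2
    for name, ch in candidates.items()
}

for name, ch in candidates.items():
    kind = "line break" if breaks_line[name] else "ordinary character"
    # Also show the UTF-8 bytes, since a byte-oriented scanner would need
    # to recognise these multi-byte sequences as end-of-line markers.
    print(name, kind, ch.encode("utf-8").hex())
```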

-- Brane




Re: SVN Blame Returns Corrupt Data

2013-10-11 Thread Ben Reser
On 10/11/13 10:25 AM, Stefan Sperling wrote:
 Couldn't Subversion automatically convert UTF-16 files to UTF-8 before
 processing them for diff/merge/blame, and convert output written to
 the original files back to UTF-16?

That's what the patch I pointed out did.  Nobody seemed to object to the idea
at the time it was posted, though I think Branko does bring up some interesting
questions about handling the Unicode control characters.