Re: Evil UTF-8 Character in filename in repo causing issues on my wc

2011-06-22 Thread Vincent Lefevre
On 2011-06-22 19:34:08 +0200, Stefan Sperling wrote:
> On Wed, Jun 22, 2011 at 07:09:22PM +0200, Andreas Krey wrote:
> > In my opinion it would be saner nowadays to assume file names to
> > be in utf8 and warn if they are not, and use the setting in LANG
> > for console I/O only.
> 
> This strategy may work well for applications starting out today,
> but it won't work for Subversion.
> 
> Not all operating systems have switched to UTF-8 as the default
> character set yet. ASCII is still the only encoding that works
> everywhere out of the box (especially on the console!).
> E.g. Debian switched to UTF-8 by default for the Etch release in 2007.
> http://www.debian.org/releases/etch/i386/release-notes/ch-whats-new.en.html
> Many unixy systems that aren't Linux have not switched to UTF-8 by
> default, and it is possible that some never will.

Debian still supports non-UTF-8 locales. That's useful when one
connects from a non-UTF-8 terminal with SSH. And that's precisely
why the user may need to use different locales on some machine
(just for consistent terminal I/O).

> Subversion is supposed to be portable across all these platforms and
> more.

Tracking the filename encoding or letting the user choose the filename
encoding wouldn't be against portability.

Also portable scripts need to use LC_ALL=C. And again, this breaks
svn as soon as a filename has non-ASCII characters (even though such
a filename doesn't appear anywhere in the svn arguments).

> I agree that locales aren't the ideal solution to this problem.
> But at least they are standardized by POSIX and can be expected
> to behave the same way everywhere.

Not everything is standardized (e.g. locale names and what they provide
are system specific). And under POSIX, each process can have its own
locale (and change it).

> And they allow Subversion users to say "yes, my system supports
> UTF-8, please use it".

But what if the system supports UTF-8, but the terminal doesn't?

> The best solution would be a standardised way of specifying
> filename encoding that works the same on all filesystem types in
> all operating systems. Alas, that doesn't exist :(

When there's no standard, let the user choose (with a good compromise
for the default behavior).

> I don't think the current solution is perfect. But it's a good
> compromise given the circumstances.

It really isn't. Tracking the filename encoding is a must.

-- 
Vincent Lefèvre  - Web: 
100% accessible validated (X)HTML - Blog: 
Work: CR INRIA - computer arithmetic / Arénaire project (LIP, ENS-Lyon)


Re: Evil UTF-8 Character in filename in repo causing issues on my wc

2011-06-22 Thread Vincent Lefevre
On 2011-06-22 16:28:31 +0200, Stefan Sperling wrote:
> On Wed, Jun 22, 2011 at 03:42:42PM +0200, Vincent Lefevre wrote:
> > On 2011-06-15 12:29:37 +0200, Stefan Sperling wrote:
> > > Unicode, and its quirk of allowing the *same* character to be encoded
> > > in *different* ways, came much later.
> > > 
> > > I think it is unfortunate that Apple broke with the concept that a
> > > filename is just a string of bytes.
> > 
> > It's also unfortunate that Subversion breaks this concept too. :)
> > 
> > I mean: do a checkout of a repository containing non-ASCII characters
> > under Linux. Then change the locales (e.g. ISO-8859-1 -> UTF-8). Do
> > an update. And see the errors...
> 
> I don't agree that this is the same problem. It's a different problem.

I'm not saying that's the same problem, but that Subversion doesn't
regard a filename as a string of bytes.

> Subversion is internally converting path names from the native encoding

If you regard a filename as a string of bytes, there isn't a concept
of native encoding.

> into UTF-8 and sends them to the repository because they are UTF-8-encoded
> there. This way, all encodings used on client systems can be represented
> in the repository. It works fine with client systems that do not support
> UTF-8 natively at all, as long as they use some encoding that iconv
> understands. And this is all happening *within* the application.
> The rules that svn uses to create filenames are clear and consistent.

They aren't consistent, because svn doesn't track the encoding used
to create the filenames. GNOME rules are consistent: the encoding is
always UTF-8.

> They require users not to flip locales willy-nilly, but that's the
> tradeoff with relying on the locale. Locales only support one encoding
> at a time.

Yes, but different processes can use different locales, and this breaks
svn. There's a good reason why locales are set via environment variables
(on POSIX systems) and not globally.

> What Apple does is transform the byte sequence behind the
> application's back.

This is not behind the application's back, because this is documented in
the API. The application writer should follow the API.

What's more important is that both Mac OS X and svn (e.g. under Linux)
can transform the byte sequence behind the *user's* back. For Mac OS X,
this is related to the normalization form, and for svn, this is related
to the locales.

> So the application itself cannot rely on its *own* rules it was using to
> create filenames when it runs again and reads the names back from disk
> because the OS is interfering with these rules. 

I think it's great to have standards, system-wide conventions and things
like that to avoid applications choosing their own rules. So, I wouldn't
blame Mac OS X for that.

Because there are drawbacks to any choice regarding the filenames,
it would be better to make things configurable at the user level,
but hardcoding choices in applications is bad, IMHO.



Re: Evil UTF-8 Character in filename in repo causing issues on my wc

2011-06-22 Thread Stefan Sperling
On Wed, Jun 22, 2011 at 07:09:22PM +0200, Andreas Krey wrote:
> In my opinion it would be saner nowadays to assume file names to
> be in utf8 and warn if they are not, and use the setting in LANG
> for console I/O only.

This strategy may work well for applications starting out today,
but it won't work for Subversion.

Not all operating systems have switched to UTF-8 as the default
character set yet. ASCII is still the only encoding that works
everywhere out of the box (especially on the console!).
E.g. Debian switched to UTF-8 by default for the Etch release in 2007.
http://www.debian.org/releases/etch/i386/release-notes/ch-whats-new.en.html
Many unixy systems that aren't Linux have not switched to UTF-8 by
default, and it is possible that some never will.
Subversion is supposed to be portable across all these platforms and
more.

Subversion was founded in 2000 and the first 1.0 release was in 2004,
which all subsequent 1.x releases need to stay compatible with (so
we cannot just change the default behaviour now).
UTF-8 is older than 2003, but received its most recent update in
RFC 3629 which is from 2003. It was fairly new stuff when Subversion
was originally developed, far from being the default on many systems.
Back in 2003 you couldn't just write UTF-8 to the screen or create
UTF-8 filenames and expect this to work well by default on all platforms.

I agree that locales aren't the ideal solution to this problem.
But at least they are standardized by POSIX and can be expected
to behave the same way everywhere. And they allow Subversion users
to say "yes, my system supports UTF-8, please use it".

The best solution would be a standardised way of specifying
filename encoding that works the same on all filesystem types in
all operating systems. Alas, that doesn't exist :(

I don't think the current solution is perfect. But it's a good
compromise given the circumstances.


Re: Evil UTF-8 Character in filename in repo causing issues on my wc

2011-06-22 Thread Andreas Krey
On Wed, 22 Jun 2011 16:28:31 +0200, Stefan Sperling wrote:
...
> Subversion is internally converting path names from the native encoding

Except that LANG isn't *the* native encoding. It is at least debatable
whether it should be used to interpret file system name strings.
(And it's rather hacked, since it is 'LANGuage' and not encoding, even
though there are locales like en_us.utf8.)

In my opinion it would be saner nowadays to assume file names to
be in utf8 and warn if they are not, and use the setting in LANG
for console I/O only.
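
To make the "assume UTF-8 and warn otherwise" idea concrete, here is a
minimal Python sketch. It is an illustration only, not anything svn does;
the function name decode_filename is hypothetical, and the fallback to
surrogateescape is an assumption chosen so that the original bytes stay
recoverable.

```python
import warnings

def decode_filename(raw: bytes) -> str:
    """Hypothetical helper: treat a file name as UTF-8 and warn if it is not."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        warnings.warn(f"file name {raw!r} is not valid UTF-8")
        # Assumption: keep the undecodable bytes as lone surrogates so that
        # encoding with the same error handler gives back the original bytes.
        return raw.decode("utf-8", errors="surrogateescape")

good = decode_filename("Vers\u00e3o.pdf".encode("utf-8"))
bad = decode_filename(b"Vers\xe3o.pdf")  # ISO-8859-1 bytes, warns

# The non-UTF-8 name still round-trips to the exact bytes found on disk.
print(bad.encode("utf-8", errors="surrogateescape") == b"Vers\xe3o.pdf")
```

The locale would then only matter when the name is printed to the console.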

Andreas

-- 
"Totally trivial. Famous last words."
From: Linus Torvalds 
Date: Fri, 22 Jan 2010 07:29:21 -0800


Re: Evil UTF-8 Character in filename in repo causing issues on my wc

2011-06-22 Thread Stefan Sperling
On Wed, Jun 22, 2011 at 03:42:42PM +0200, Vincent Lefevre wrote:
> On 2011-06-15 12:29:37 +0200, Stefan Sperling wrote:
> > Unicode, and its quirk of allowing the *same* character to be encoded
> > in *different* ways, came much later.
> > 
> > I think it is unfortunate that Apple broke with the concept that a
> > filename is just a string of bytes.
> 
> It's also unfortunate that Subversion breaks this concept too. :)
> 
> I mean: do a checkout of a repository containing non-ASCII characters
> under Linux. Then change the locales (e.g. ISO-8859-1 -> UTF-8). Do
> an update. And see the errors...

I don't agree that this is the same problem. It's a different problem.

Subversion is internally converting path names from the native encoding
into UTF-8 and sends them to the repository because they are UTF-8-encoded
there. This way, all encodings used on client systems can be represented
in the repository. It works fine with client systems that do not support
UTF-8 natively at all, as long as they use some encoding that iconv
understands. And this is all happening *within* the application.
The rules that svn uses to create filenames are clear and consistent.
They require users not to flip locales willy-nilly, but that's the
tradeoff with relying on the locale. Locales only support one encoding
at a time.
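
A minimal Python sketch of why flipping the locale breaks this scheme. It
only models the conversion step; to_repo_utf8 is a hypothetical stand-in,
not an svn function, and the file name is made up for the example.

```python
name = "Integra\u00e7\u00e3o.pdf"
latin1_bytes = name.encode("iso-8859-1")  # name as created under an ISO-8859-1 locale
utf8_bytes = name.encode("utf-8")         # name as created under a UTF-8 locale

def to_repo_utf8(raw: bytes, locale_encoding: str) -> bytes:
    """Hypothetical stand-in for the 'native encoding -> UTF-8' step."""
    return raw.decode(locale_encoding).encode("utf-8")

# Locale unchanged since checkout: the conversion round-trips.
print(to_repo_utf8(latin1_bytes, "iso-8859-1") == utf8_bytes)

# Locale flipped to UTF-8 while the on-disk bytes are still ISO-8859-1.
try:
    to_repo_utf8(latin1_bytes, "utf-8")
except UnicodeDecodeError as err:
    print("cannot convert from native encoding:", err.reason)

# Flipped the other way nothing fails, but the name silently changes.
print(to_repo_utf8(utf8_bytes, "iso-8859-1") == utf8_bytes)
```

The bytes on disk never changed in either failing case; only the encoding
used to interpret them did.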

What Apple does is transform the byte sequence behind the application's back.
So the application itself cannot rely on its *own* rules it was using to
create filenames when it runs again and reads the names back from disk
because the OS is interfering with these rules. 

> > When they made this decision they probably considered that it might break
> > applications and decided that the applications would have to adjust.
> 
> One problem is that different applications encode accented characters
> (typed on the keyboard) differently: some of them use NFC, others use
> NFD. If they aren't regarded as equivalent, you get problems. And
> since Unicode doesn't standardize which one to use, one cannot blame
> the applications.

Yes, I fully agree here.


Re: Evil UTF-8 Character in filename in repo causing issues on my wc

2011-06-22 Thread Vincent Lefevre
On 2011-06-15 12:29:37 +0200, Stefan Sperling wrote:
> Unicode, and its quirk of allowing the *same* character to be encoded
> in *different* ways, came much later.
> 
> I think it is unfortunate that Apple broke with the concept that a
> filename is just a string of bytes.

It's also unfortunate that Subversion breaks this concept too. :)

I mean: do a checkout of a repository containing non-ASCII characters
under Linux. Then change the locales (e.g. ISO-8859-1 -> UTF-8). Do
an update. And see the errors...

> When they made this decision they probably considered that it might break
> applications and decided that the applications would have to adjust.

One problem is that different applications encode accented characters
(typed on the keyboard) differently: some of them use NFC, others use
NFD. If they aren't regarded as equivalent, you get problems. And
since Unicode doesn't standardize which one to use, one cannot blame
the applications.



Re: Evil UTF-8 Character in filename in repo causing issues on my wc

2011-06-16 Thread Geoff Hoffman
On Thu, Jun 16, 2011 at 11:07 AM, B Smith-Mannschott
wrote:

> On Thu, Jun 16, 2011 at 18:24, Geoff Hoffman wrote:
> >
> >
> > On Wed, Jun 15, 2011 at 11:19 PM, Markus Schaber <m.scha...@3s-software.com>
> > wrote:
> >>
> >> Hi, Geoff,
> >>
> >> From: Geoff Hoffman [mailto:ghoff...@cardinalpath.com]
> >> >>> I have a file with some (I believe) Portuguese characters in the
> >> >>> filename that someone managed to store in the repo without any
> >> >>> problem,
> >> >>> and I checked it out without issues, too. However, now on my working
> >> >>> copy, it thinks that file is locally new.
> >> >> Maybe it helps if you use a repo browser to rename the file to an
> >> >> ASCII-Only name directly in the repository?
> >>
> >> > That's all I ever really wanted to do, but I cannot, at least, I don't
> >> > know how to type the characters in the
> >> > filename of the file in svn without copy-paste from the svn ls terminal
> >> > output on Mac OS X, which I think has
> >> > already converted the filename it just printed, so I get a file not
> >> > found error when I try to rename or delete
> >> > it. It may have worked if I had ssh'd into the RHEL server, not sure.
> >> > It's a bit unclear.
> >>
> >> I thought of some graphical repository browser (like the one built into
> >> TortoiseSVN for example, I guess such things also exist for MacOS), it lets
> >> you browse the repository and select the file to rename directly in the
> >> repository, without the need of a local checkout / working copy.
> >>
> >
> >
> > Yeah, if I had more time I probably should fiddle with it. Our one guy here
> > on Windows using Tortoise has no issues with the same file, so it is indeed
> > a problem specific to Mac, as Stefan pointed out. Given that the issue
> > presents itself in Terminal and NetBeans IDE, it's safe to say any other
> > graphical SVN client on Mac would complain, too, but I didn't test it. IIRC
> > the graphical clients are using the command line under the hood.
>
>
> Yes, any graphical client working on a *working copy* on the Mac would
> complain too.  But, a hypothetical graphical repo browser that
> operates directly on the repository isn't affected by HFS+'s Unicode
> normalization.
>
> // ben
>



Ben, you're right.

SVNx doesn't have a rename feature that I could find, but I tried in
Versions (demo - w00t) and it worked.

Transcript log for repository "client" [svn+ssh://geoff@mycompany/svn/repos/sites/client/trunk].
Subversion libraries version: 1.6.17

[Jun 16, 12:05:48] Renaming
"/branches/other/external_docs/Manual_Integração_A_T_2.0.pdf" to
"Manual_Integracao_A_T_2.0.pdf"...
Committed revision 696 by user "geoff".
[Jun 16, 12:06:01] Finished operation.


Re: Evil UTF-8 Character in filename in repo causing issues on my wc

2011-06-16 Thread B Smith-Mannschott
On Thu, Jun 16, 2011 at 18:24, Geoff Hoffman  wrote:
>
>
> On Wed, Jun 15, 2011 at 11:19 PM, Markus Schaber 
> wrote:
>>
>> Hi, Geoff,
>>
>> From: Geoff Hoffman [mailto:ghoff...@cardinalpath.com]
>> >>> I have a file with some (I believe) Portuguese characters in the
>> >>> filename that someone managed to store in the repo without any
>> >>> problem,
>> >>> and I checked it out without issues, too. However, now on my working
>> >>> copy, it thinks that file is locally new.
>> >> Maybe it helps if you use a repo browser to rename the file to an
>> >> ASCII-Only name directly in the repository?
>>
>> > That's all I ever really wanted to do, but I cannot, at least, I don't
>> > know how to type the characters in the
>> > filename of the file in svn without copy-paste from the svn ls terminal
>> > output on Mac OS X, which I think has
>> > already converted the filename it just printed, so I get a file not
>> > found error when I try to rename or delete
>> > it. It may have worked if I had ssh'd into the RHEL server, not sure.
>> > It's a bit unclear.
>>
>> I thought of some graphical repository browser (like the one built into
>> TortoiseSVN for example, I guess such things also exist for MacOS), it lets
>> you browse the repository and select the file to rename directly in the
>> repository, without the need of a local checkout / working copy.
>>
>
>
> Yeah, if I had more time I probably should fiddle with it. Our one guy here
> on Windows using Tortoise has no issues with the same file, so it is indeed
> a problem specific to Mac, as Stefan pointed out. Given that the issue
> presents itself in Terminal and NetBeans IDE, it's safe to say any other
> graphical SVN client on Mac would complain, too, but I didn't test it. IIRC
> the graphical clients are using the command line under the hood.



Yes, any graphical client working on a *working copy* on the Mac would
complain too.  But, a hypothetical graphical repo browser that
operates directly on the repository isn't affected by HFS+'s Unicode
normalization.

// ben


Re: Evil UTF-8 Character in filename in repo causing issues on my wc

2011-06-16 Thread Geoff Hoffman
On Wed, Jun 15, 2011 at 11:19 PM, Markus Schaber
wrote:

> Hi, Geoff,
>
> From: Geoff Hoffman [mailto:ghoff...@cardinalpath.com]
> >>> I have a file with some (I believe) Portuguese characters in the
> >>> filename that someone managed to store in the repo without any problem,
> >>> and I checked it out without issues, too. However, now on my working
> >>> copy, it thinks that file is locally new.
> >> Maybe it helps if you use a repo browser to rename the file to an
> >> ASCII-Only name directly in the repository?
>
> > That's all I ever really wanted to do, but I cannot, at least, I don't
> > know how to type the characters in the filename of the file in svn without
> > copy-paste from the svn ls terminal output on Mac OS X, which I think has
> > already converted the filename it just printed, so I get a file not found
> > error when I try to rename or delete it. It may have worked if I had ssh'd
> > into the RHEL server, not sure. It's a bit unclear.
>
> I thought of some graphical repository browser (like the one built into
> TortoiseSVN for example, I guess such things also exist for MacOS), it lets
> you browse the repository and select the file to rename directly in the
> repository, without the need of a local checkout / working copy.
>
>

Yeah, if I had more time I probably should fiddle with it. Our one guy here
on Windows using Tortoise has no issues with the same file, so it is indeed
a problem specific to Mac, as Stefan pointed out. Given that the issue
presents itself in Terminal and NetBeans IDE, it's safe to say any other
graphical SVN client on Mac would complain, too, but I didn't test it. IIRC
the graphical clients are using the command line under the hood.


Re: Evil UTF-8 Character in filename in repo causing issues on my wc

2011-06-15 Thread Markus Schaber
Hi, Geoff,

From: Geoff Hoffman [mailto:ghoff...@cardinalpath.com]
>>> I have a file with some (I believe) Portuguese characters in the
>>> filename that someone managed to store in the repo without any problem,
>>> and I checked it out without issues, too. However, now on my working
>>> copy, it thinks that file is locally new.
>> Maybe it helps if you use a repo browser to rename the file to an
>> ASCII-Only name directly in the repository?

> That's all I ever really wanted to do, but I cannot, at least, I don't know
> how to type the characters in the filename of the file in svn without
> copy-paste from the svn ls terminal output on Mac OS X, which I think has
> already converted the filename it just printed, so I get a file not found
> error when I try to rename or delete it. It may have worked if I had ssh'd
> into the RHEL server, not sure. It's a bit unclear.

I thought of some graphical repository browser (like the one built into 
TortoiseSVN for example, I guess such things also exist for MacOS), it lets you 
browse the repository and select the file to rename directly in the repository, 
without the need of a local checkout / working copy.

Best regards

Markus Schaber
-- 
___
We software Automation.

3S-Smart Software Solutions GmbH
Markus Schaber | Developer
Memminger Str. 151 | 87439 Kempten | Germany | Tel. +49-831-54031-0 | Fax 
+49-831-54031-50

Email: m.scha...@3s-software.com | Web: http://www.3s-software.com 
CoDeSys internet forum: http://forum.3s-software.com
Download CoDeSys sample projects: 
http://www.3s-software.com/index.shtml?sample_projects

Managing Directors: Dipl.Inf. Dieter Hess, Dipl.Inf. Manfred Werner | Trade 
register: Kempten HRB 6186 | Tax ID No.: DE 167014915 



Re: Evil UTF-8 Character in filename in repo causing issues on my wc

2011-06-15 Thread Geoff Hoffman
On Tue, Jun 14, 2011 at 11:36 PM, Markus Schaber
wrote:

> Hi, Geoff,
>
> From: Geoff Hoffman [mailto:ghoff...@cardinalpath.com]
>
> > I have a file with some (I believe) Portuguese characters in the
> > filename that someone managed to store in the repo without any problem,
> > and I checked it out without issues, too. However, now on my working
> > copy, it thinks that file is locally new.
>
> Maybe it helps if you use a repo browser to rename the file to an
> ASCII-Only name directly in the repository?
>
> Regards,
> Markus Schaber
>


That's all I ever really wanted to do, but I cannot, at least, I don't know
how to type the characters in the filename of the file in svn without
copy-paste from the svn ls terminal output on Mac OS X, which I think has
already converted the filename it just printed, so I get a file not found
error when I try to rename or delete it. It may have worked if I had ssh'd
into the RHEL server, not sure. It's a bit unclear.

I was able to simply export the files, rename the files to ascii filenames
outside of SVN, trash the containing folder from the repo, recreate it, check
it out empty, add the ascii named files, and re-commit.
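
For anyone who has to do the same rename step on many files, here is a rough
Python sketch of one way to derive an ASCII-only name. The helper name
ascii_name is hypothetical, and replacing leftover non-ASCII characters with
"?" is an assumption made for the example, not a recommendation.

```python
import unicodedata

def ascii_name(name: str) -> str:
    """Hypothetical helper: strip accents so that only ASCII remains."""
    # Decompose first, so "ç" becomes "c" followed by a combining cedilla.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks and keep the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Characters with no ASCII base letter (e.g. "ø") become "?" here.
    return stripped.encode("ascii", errors="replace").decode("ascii")

print(ascii_name("Manual_Integra\u00e7\u00e3o_A_T_2.0.pdf"))
```

Because the name is decomposed before the marks are dropped, the result is
the same whether the input was in composed or decomposed form.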


Re: Evil UTF-8 Character in filename in repo causing issues on my wc

2011-06-15 Thread Stefan Sperling
On Wed, Jun 15, 2011 at 01:39:30AM -0500, Ryan Schmidt wrote:
> I would clarify this by saying the problem is that Subversion assumes
> that a filename submitted in one version of UTF-8 encoding will always
> stay in that version of UTF-8 encoding, and on the HFS+ filesystem,
> used by Mac OS X, that assumption is not necessarily true. (It
> normalizes all UTF-8 filenames to decomposed form.) Subversion would
> happily allow you to create two filenames that humans would consider
> identical (one with UTF-8 entities composed, one with UTF-8 entities
> decomposed). So clearly that's a bug in Subversion (or possibly apr or
> apr-util); it should normalize UTF-8 strings before running
> comparisons. It also seems like a bug in Windows and Linux
> filesystems; I assume they also let you create multiple files whose
> names look identical (but differ only in the composition of their
> UTF-8 characters). Mac OS X's is the only filesystem I know of that
> has fixed this bug -- which therefore exposes the problem when
> collaborating between Mac OS X systems (which have the fix) and other
> systems (which do not).

Traditionally there was no encoding information associated with filenames
on UNIX systems. The OS was supposed to store the filename under whatever
name the application passes in. This of course stems from the fact that
the only encoding on original UNIX was ASCII, so there was no problem with
this approach back then.

Unicode, and its quirk of allowing the *same* character to be encoded
in *different* ways, came much later.

I think it is unfortunate that Apple broke with the concept that a
filename is just a string of bytes.
When they made this decision they probably considered that it might break
applications and decided that the applications would have to adjust.
But that is very, very hard for applications like Subversion which
need to guarantee backwards compatibility to a point where individual
bytes matter.

So what if two filenames look identical to the user?
As long as nobody was changing the underlying byte string, things
worked just fine.

However, I also agree that we would be in a much better spot now if
Subversion had been normalising UTF-8 strings from the start.
This was an oversight made when the project started out.
But I doubt Subversion is the only project that missed these subtle
details of the Unicode standard. From a software engineer's perspective,
it is *very* unnatural for an encoding standard to contain ambiguous
representations of the same data. So I would not outright blame folks
for this oversight and call it a "bug" in the application or the OS.
There are many ways to point fingers here, including the standards committee.


Re: Evil UTF-8 Character in filename in repo causing issues on my wc

2011-06-14 Thread Ryan Schmidt

On Jun 14, 2011, at 18:59, Stefan Sperling wrote:

> On Tue, Jun 14, 2011 at 04:24:46PM -0700, Geoff Hoffman wrote:
>> I have a file with some (I believe) Portuguese characters in the filename
>> that someone managed to store in the repo without any problem, and I checked
>> it out without issues, too. However, now on my working copy, it thinks that
>> file is locally new.
> 
>> MacbookPro:ClearSale geoffh$ ls -la
>  ^^^
> 
> It's a Mac, so please see this issue:
> http://subversion.tigris.org/issues/show_bug.cgi?id=2464
> and make sure to read the notes in this file:
> http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames
> 
> Short summary:
> Do not use anything but ASCII in your filenames if you need things
> to work between Macs and other systems. The problem is that the Mac
> changes the filename in a subtle way.


I would clarify this by saying the problem is that Subversion assumes that a 
filename submitted in one version of UTF-8 encoding will always stay in that 
version of UTF-8 encoding, and on the HFS+ filesystem, used by Mac OS X, that 
assumption is not necessarily true. (It normalizes all UTF-8 filenames to 
decomposed form.) Subversion would happily allow you to create two filenames 
that humans would consider identical (one with UTF-8 entities composed, one 
with UTF-8 entities decomposed). So clearly that's a bug in Subversion (or 
possibly apr or apr-util); it should normalize UTF-8 strings before running 
comparisons. It also seems like a bug in Windows and Linux filesystems; I 
assume they also let you create multiple files whose names look identical (but 
differ only in the composition of their UTF-8 characters). Mac OS X's is the 
only filesystem I know of that has fixed this bug -- which therefore exposes 
the problem when collaborating between Mac OS X systems (which have the fix) 
and other systems (which do not).
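
To make the composed/decomposed distinction concrete, here is a small Python
sketch using a name from this thread. The helper same_name is hypothetical;
it only shows what "normalize before comparing" would mean and is not how
Subversion is implemented.

```python
import unicodedata

# Name fragment from this thread, written with precomposed "ç" and "ã".
name = "Manual_Integra\u00e7\u00e3o"

nfc = unicodedata.normalize("NFC", name)  # composed form
nfd = unicodedata.normalize("NFD", name)  # decomposed form, as described above for HFS+

def same_name(a: str, b: str) -> bool:
    """Hypothetical helper: compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(nfc.encode("utf-8"))  # "ç" and "ã" take two bytes each
print(nfd.encode("utf-8"))  # base letter plus a two-byte combining mark
print(nfc == nfd)           # a plain comparison sees two different names
print(same_name(nfc, nfd))  # a normalizing comparison sees one name
```

Both strings display identically, which is why the mismatch is so hard to
spot in a terminal.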


Using only ASCII characters in your filenames is one way to combat the problem. 
This strategy works fine for me, but users not using primarily English might 
find that harder. If you want to continue using UTF-8 characters in filenames, 
you can get a version of Subversion for Mac OS X that attempts to work around 
this problem, by installing MacPorts and then running:

sudo port install subversion +unicode_path

The patch the +unicode_path variant applies is of course not officially 
supported.





Re: Evil UTF-8 Character in filename in repo causing issues on my wc

2011-06-14 Thread Markus Schaber
Hi, Geoff,

From: Geoff Hoffman [mailto:ghoff...@cardinalpath.com]

> I have a file with some (I believe) Portuguese characters in the
> filename that someone managed to store in the repo without any problem,
> and I checked it out without issues, too. However, now on my working
> copy, it thinks that file is locally new.

Maybe it helps if you use a repo browser to rename the file to an
ASCII-Only name directly in the repository?

Regards,
Markus Schaber


Re: Evil UTF-8 Character in filename in repo causing issues on my wc

2011-06-14 Thread Daniel Shahaf
Stefan Sperling wrote on Wed, Jun 15, 2011 at 01:59:18 +0200:
> On Tue, Jun 14, 2011 at 04:24:46PM -0700, Geoff Hoffman wrote:
> > I have a file with some (I believe) Portuguese characters in the filename
> > that someone managed to store in the repo without any problem, and I checked
> > it out without issues, too. However, now on my working copy, it thinks that
> > file is locally new.
> 
> > MacbookPro:ClearSale geoffh$ ls -la
>   ^^^
> 
> It's a Mac, so please see this issue:
> http://subversion.tigris.org/issues/show_bug.cgi?id=2464
> and make sure to read the notes in this file:
> http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames
> 
> Short summary:
> Do not use anything but ASCII in your filenames if you need things
> to work between Macs and other systems. The problem is that the Mac
> changes the filename in a subtle way.

IIRC things work (with the then-current state of other OS's) if the file
is added on a mac.


Re: Evil UTF-8 Character in filename in repo causing issues on my wc

2011-06-14 Thread Stefan Sperling
On Tue, Jun 14, 2011 at 04:24:46PM -0700, Geoff Hoffman wrote:
> I have a file with some (I believe) Portuguese characters in the filename
> that someone managed to store in the repo without any problem, and I checked
> it out without issues, too. However, now on my working copy, it thinks that
> file is locally new.

> MacbookPro:ClearSale geoffh$ ls -la
  ^^^

It's a Mac, so please see this issue:
http://subversion.tigris.org/issues/show_bug.cgi?id=2464
and make sure to read the notes in this file:
http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames

Short summary:
Do not use anything but ASCII in your filenames if you need things
to work between Macs and other systems. The problem is that the Mac
changes the filename in a subtle way.
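
A toy Python model of the symptom, assuming the repository holds the
composed name and the Mac filesystem hands back the decomposed one. The
status function is hypothetical and is not svn's algorithm; it just compares
two sets of names.

```python
import unicodedata

def status(repo_names, disk_names, normalize=False):
    """Toy model: return (missing, unversioned) lists of names."""
    if normalize:
        repo_set = {unicodedata.normalize("NFC", n) for n in repo_names}
        disk_set = {unicodedata.normalize("NFC", n) for n in disk_names}
    else:
        repo_set, disk_set = set(repo_names), set(disk_names)
    missing = sorted(repo_set - disk_set)      # shown as "!" by svn status
    unversioned = sorted(disk_set - repo_set)  # looks "locally new"
    return missing, unversioned

# Composed name as committed from another system.
repo = ["Manual_Integra\u00e7\u00e3o_A_T-ClearSale_2.0.pdf"]
# The same name as read back from disk in decomposed form.
disk = [unicodedata.normalize("NFD", repo[0])]

print(status(repo, disk))                  # one missing and one unversioned
print(status(repo, disk, normalize=True))  # no differences
```

The single file shows up twice in the first result, once as missing and once
as new, which matches what was reported.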


Evil UTF-8 Character in filename in repo causing issues on my wc

2011-06-14 Thread Geoff Hoffman
I have a file with some (I believe) Portuguese characters in the filename
that someone managed to store in the repo without any problem, and I checked
it out without issues, too. However, now on my working copy, it thinks that
file is locally new.

I did an svn copy ok, but I can't seem to delete the evil one.

Thinking I can probably whack the directory completely and rebuild it, but
thought I'd mention it because I'm not sure if it's a bug or a misconfigured
SVN server / client.


Netbeans is barfing on another one in a different directory:

Can't convert string from native encoding to 'UTF-8':
Brazil Air Plus XML Vers?\139o 2.0 .pdf
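
One possible reading of the "?\139" in that message, offered as an assumption
and not confirmed anywhere in this thread: if each byte that could not be
converted is printed as "?\" plus its decimal value, then byte 139 (0x8B) is
the one that failed. In the Mac OS Roman encoding that byte is "ã", which
would fit "Versão". The helper name unescape is hypothetical.

```python
import re

def unescape(msg: str, encoding: str) -> str:
    """Hypothetical helper: undo '?\\ddd' decimal escapes, then decode the bytes."""
    raw = re.sub(
        rb"\?\\(\d{3})",                      # "?", a backslash, three digits
        lambda m: bytes([int(m.group(1))]),   # decimal value back to one byte
        msg.encode("ascii"),
    )
    return raw.decode(encoding)

# Assumption: the original file name was stored in Mac OS Roman.
print(unescape(r"Brazil Air Plus XML Vers?\139o 2.0 .pdf", "mac_roman"))
```

A lone 0x8B byte is not valid UTF-8, so a client running in a UTF-8 locale
would reject such a name.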

Even Terminal is having trouble:

MacbookPro:ClearSale geoffh$ ls -la
total 2160
drwxr-xr-x  8 geoffh  staff 272 Jun 14 16:09 .
drwxr-xr-x  6 geoffh  staff 204 Jun 13 21:26 ..
drwxr-xr-x  7 geoffh  staff 238 Jun 14 16:09 .svn
-rw-r--r--  1 geoffh  staff  705463 Jun 13 21:26 Integration_Manual_2.3.pdf
-rw-r--r--  1 geoffh  staff  127377 Jun 14 16:09 Manual_Integração_A_T-ClearSale_2.0.pdf
MacbookPro:ClearSale geoffh$ svn delete --force Manual_Integração_A_T-ClearSale_2.0.pdf
MacbookPro:ClearSale geoffh$ svn status
!   Manual_Integração_A_T-ClearSale_2.0.pdf
MacbookPro:ClearSale geoffh$ ls -la
total 1904
drwxr-xr-x  7 geoffh  staff 238 Jun 14 16:11 .
drwxr-xr-x  6 geoffh  staff 204 Jun 13 21:26 ..
drwxr-xr-x  7 geoffh  staff 238 Jun 14 16:11 .svn
-rw-r--r--  1 geoffh  staff  705463 Jun 13 21:26 Integration_Manual_2.3.pdf
-rw-r--r--  1 geoffh  staff  127377 Jun 14 16:04 Manual_Integracao_A_T-ClearSale_2.0.pdf
MacbookPro:ClearSale geoffh$ svn status
!   Manual_Integração_A_T-ClearSale_2.0.pdf
MacbookPro:ClearSale geoffh$ svn commit . -m "Fixing evil PDF"
MacbookPro:ClearSale geoffh$ svn status
!   Manual_Integração_A_T-ClearSale_2.0.pdf
MacbookPro:ClearSale geoffh$


Now in NetBeans I get:

/Users/geoffh/Sites/zupper.ghoffman/external_docs/ClearSale/Manual_Integrac?a?o_A_T-ClearSale_2.0.pdf:
 (Not a versioned resource)

A problem occurred; see other errors for details

If I svn up it restores the evil file:

MacbookPro:ClearSale geoffh$ svn up
Restored 'Manual_Integração_A_T-ClearSale_2.0.pdf'
At revision 681.


Basically I can't delete or svn delete the offending file and successfully
commit.

Running Subversion 1.6.11 on RHEL 5.6