Re: Proposals for Arabic honorifics

2015-10-06 Thread Naz Gassiep
If there are no comments on this specific issue, would someone care to 
comment on the idea of writing a proposal that extends an existing 
proposal? Is this considered bad form, or is it OK so long as it doesn't 
unnecessarily raise conflicting proposals?

- Naz.


On 5/10/2015 6:39 PM, Naz Gassiep wrote:

Hi all,
We are considering writing a proposal for Arabic honorifics which are 
missing from Unicode. There are already a few in there, notably U+FDFA 
and U+FDFB.


There are two existing proposals, L2/14-147 and L2/14-152, which each 
propose additions. L2/14-147 proposes seventeen new characters and 
L2/14-152 proposes a further two.


There are a few other characters that are not included in these 
proposals, and I was considering preparing a proposal of my own. I 
will work with a team of people who are willing to contribute time to 
this work. We are considering two options:


1. Prepare an additional proposal for the characters that were missing 
from the existing spec and also from the two proposals mentioned above.
2. Prepare a consolidated proposal that rolls the two existing proposals, 
as well as the other characters we feel are missing, into a single proposal.


Currently, we favour the second option. We would ensure that full 
descriptions, names, character properties, and detailed examples are 
provided for each character to substantiate its use in modern plain 
text. We would also suggest code points in line with the existing 
proposal L2/14-147.


We don't want to step on the toes of the original submitters, Roozbeh 
Pournader or Lateef Sagar Shaikh. We wish to be clear that we will 
draw on their existing proposals to the maximum extent possible to 
ensure that we do not submit a conflicting proposal, but a superset 
proposal that incorporates their proposals as well as the additional 
characters we have identified. We have evaluated both existing proposals, 
and a true superset proposal is possible, so that no conflicts will arise 
between those proposals and our own.


Are there any issues that we may face in preparing and submitting our 
proposal?

Any guidance from this mailing list would be highly valued.
Many thanks,
- Naz.




Re: Proposals for Arabic honorifics

2015-10-06 Thread Lisa Moore
Hello Naz,

Thank you for discussing your proposal on the unicode list.  Not all 
experts monitor that list.  That said, feel free to submit a proposal to 
"docsub...@unicode.org". 

Look forward to seeing your proposal.


Lisa 




From:   Naz Gassiep 
To: unicode@unicode.org
Date:   10/06/2015 08:50 PM
Subject:Re: Proposals for Arabic honorifics
Sent by:"Unicode" 



If there are no comments on this specific issue, would someone care to 
comment on the idea of writing a proposal that extends an existing 
proposal? Is this considered bad form, or is it OK so long as it doesn't 
unnecessarily raise conflicting proposals?
- Naz.


On 5/10/2015 6:39 PM, Naz Gassiep wrote:
> Hi all,
> We are considering writing a proposal for Arabic honorifics which are 
> missing from Unicode. There are already a few in there, notably U+FDFA 
> and U+FDFB.
>
> There are two existing proposals, L2/14-147 and L2/14-152, which each 
> propose additions. L2/14-147 proposes seventeen new characters and 
> L2/14-152 proposes a further two.
>
> There are a few other characters that are not included in these 
> proposals, and I was considering preparing a proposal of my own. I 
> will work with a team of people who are willing to contribute time to 
> this work. We are considering two options:
>
> 1. Prepare an additional proposal for the characters that were missing 
> from the existing spec and also from the two proposals mentioned above.
> 2. Prepare a consolidated proposal that rolls the two existing proposals, 
> as well as the other characters we feel are missing, into a single proposal.
>
> Currently, we favour the second option. We would ensure that full 
> descriptions, names, character properties, and detailed examples are 
> provided for each character to substantiate its use in modern plain 
> text. We would also suggest code points in line with the existing 
> proposal L2/14-147.
>
> We don't want to step on the toes of the original submitters, Roozbeh 
> Pournader or Lateef Sagar Shaikh. We wish to be clear that we will 
> draw on their existing proposals to the maximum extent possible to 
> ensure that we do not submit a conflicting proposal, but a superset 
> proposal that incorporates their proposals as well as the additional 
> characters we have identified. We have evaluated both existing proposals, 
> and a true superset proposal is possible, so that no conflicts will arise 
> between those proposals and our own.
>
> Are there any issues that we may face in preparing and submitting our 
> proposal?
> Any guidance from this mailing list would be highly valued.
> Many thanks,
> - Naz.






Re: Unicode in passwords

2015-10-06 Thread Richard Wordingham
On Tue, 6 Oct 2015 11:21:42 +0200
Mark Davis ☕️  wrote:

> While I think that RFC is useful, it has been interesting just how
> many of the problems recounted on this list go far beyond it, often
> having to do with UI issues. It would be useful to have a paper
> somewhere that organizes all of the problems presented here, and
> maybe makes a stab at describing techniques for handling them.

Indeed, there are several different scenarios.  The most prototypical
are:

1) Initial access to a stand-alone computing device, the conventional
logging on. In this case, it is usually risky to use anything but
printable ASCII.

2) Internet passwords used to protect privacy.  Basically any non-trivial
combination of characters should be acceptable, provided it will not be
mangled in transmission.  Under the rules of Unicode, this means that
the text should be normalised before becoming a mere sequence of bytes.

Note that in the second scenario, there is normally an 'administrator'
who can put things right.

Richard.



Re: Why Nothing Ever Goes Away

2015-10-06 Thread Richard Wordingham
On Tue, 6 Oct 2015 15:57:37 +0200
Philippe Verdy  wrote:

> My opinion of UTF-7 is that
> it was just a temporary and experimental solution to help system
> admins and developers adopt the new UCS, including for their old
> 7-bit environments.

If you have a human controlling the interpretation, UTF-7 was a good
way of avoiding data being mangled by interfaces that insisted that
unlabelled (indeed, sometimes, unlabellable) 8-bit text was UTF-8 or
conversely, Latin-1 or code page 1252.  The old Yahoo groups web
interface for senders was pretty much restricted to 8-bit ISO-2022
encodings without it.  C1 characters would be converted to Latin-1 on
the assumption that they were Windows 1252.  Browsers dropping UTF-7
support was a major inconvenience.

Richard.


Re: Unicode in passwords

2015-10-06 Thread Philippe Verdy
2015-10-06 21:57 GMT+02:00 Richard Wordingham <
richard.wording...@ntlworld.com>:

> It's an interesting issue for a password that one can't type.  It's by
> no means a guarantee, either.  I once specified a new password that
> changed case in the middle, not realising that I had started with caps
> lock on.  Consequently, both copies had the wrong capitalisation.  I
> was using a wireless keyboard, which to conserve battery power doesn't
> have a caps lock indicator.  (In the old days, caps lock would have
> physically locked, but that's not how keyboard drivers work nowadays.)
> It took a little while before it occurred to me that I might have had a
> problem with caps lock.
>

This demonstrates that using case differences to add more combinations to
short passwords is a bad design. Hiding typed input is not a good idea
either: we need at least a button we can press to reveal and confirm what
we are typing.

Instead of lettercase combinations limited to ASCII, it is highly
preferable to extend the character repertoire to Unicode and accept letters
in NFKC form, unified by case folding (NOT by conversion to lowercase or
uppercase, which is not stable across Unicode versions).

So we should define here the usable set of characters (and define
characters that should be ignored and discarded if present on input). This
should be a profile in UAX #31 (and we should issue a strong warning
against the recent RFC that overlooked the issue: its case-insensitive profile
based on NFC and conversion to lowercase is definitely broken!)
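
A minimal sketch of such a preparation step in Java, assuming ICU4J is
available (Normalizer2.getNFKCCasefoldInstance() applies NFKC together with
case folding in one pass); the rejection of unpaired surrogates and the
class/method names here are illustrative assumptions, not part of any
published profile:

    import com.ibm.icu.text.Normalizer2;

    public final class PasswordPrep {
        // NFKC_Casefold: compatibility normalization plus case folding,
        // both of which are stable across Unicode versions.
        private static final Normalizer2 NFKC_CF =
                Normalizer2.getNFKCCasefoldInstance();

        static String prepare(String password) {
            int i = 0;
            while (i < password.length()) {
                int cp = password.codePointAt(i);
                // Reject unpaired surrogates: they have no meaningful
                // normalization and would not survive a UTF-8 round trip.
                if (cp >= 0xD800 && cp <= 0xDFFF) {
                    throw new IllegalArgumentException(
                            "unpaired surrogate at index " + i);
                }
                i += Character.charCount(cp);
            }
            return NFKC_CF.normalize(password);
        }
    }

A real profile would still have to say which characters are ignored or
rejected before this step, which is exactly the part described as missing.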


Re: Unicode in passwords

2015-10-06 Thread Richard Wordingham
On Tue,  6 Oct 2015 20:13:12 +0100 (BST)
Julian Bradfield  wrote:

> On 2015-10-06, Asmus Freytag (t)  wrote:
> > All browsers I use display spaces in input boxes, and put blobs for
> > hidden fields. Do you have evidence for broken input fields?
> > 
> > 
> > Network keys. That interface seems to consistently give people a
> > choice to reveal the key.
> 
> ? That's not broken in the way Philippe was discussing.

No, but if you make the password up as you type it, you might not notice
that you accidentally typed a double space.

> > Copy-paste works on all my systems, too - do you have evidence of
> > broken copy-paste in this way?
> > 
> > 
> > I've seen input fields where sites don't allow paste on the
> > second copy (the confirmation copy).
> > 
> > Even for non-password things.
> 
> That's not relevantly broken, either - it's a design feature, to make
> sure you can type the password again (from finger memory!).

It's an interesting issue for a password that one can't type.  It's by
no means a guarantee, either.  I once specified a new password that
changed case in the middle, not realising that I had started with caps
lock on.  Consequently, both copies had the wrong capitalisation.  I
was using a wireless keyboard, which to conserve battery power doesn't
have a caps lock indicator.  (In the old days, caps lock would have
physically locked, but that's not how keyboard drivers work nowadays.)
It took a little while before it occurred to me that I might have had a
problem with caps lock.

Richard.


Re: Unicode in passwords

2015-10-06 Thread Philippe Verdy
2015-10-06 16:31 GMT+02:00 Julian Bradfield :

> On 2015-10-06, Philippe Verdy  wrote:
> > I don't think it is a good idea for textual passwords to make differences
> > based on the number of spaces. Being plain text, they are likely to be
> > displayed in user interfaces in a way that the user will not see.
> Without
>
> This is true of all passwords. Passwords have to be typed by finger
> memory, not by looking at them (unless you're the type who puts them
> on sticky notes, in which case you type by looking at the text on the
> note). One doesn't normally see the characters, at best a count of
> characters.
>
> > trimming, users won't see the initial or final space, and the password
> > input method may not display them as well (e.g. in an HTML input form or
>
> All browsers I use display spaces in input boxes, and put blobs for
> hidden fields. Do you have evidence for broken input fields?
>

I was speaking of OUTPUT fields: you want to display passwords that are
stored somewhere (including in a text document kept in some safe place,
such as an external flash drive). People can't remember many passwords.
Hiding them on screen is fake security; what we need is complex passwords
(difficult to memorize, so we need a wallet to store them, though people
will also print them rather than keep them in electronic form) and many
passwords (one for each site or application requiring one). But users also
want to be able to type them correctly: long passwords hidden on screen
will not help much (hiding passwords in input forms only protects against
spying eyes on your screen, but people can still spy on your keystrokes...)

If people are concerned about prying eyes, they'll need to hide their
keyboard input (notably on touch screens!) as well as their screen, by
first making sure there's nobody around to watch what they do. If there's
a camera, hiding the password on screen will not help either, as it will
still be easy to see the keystrokes.

Biometric identification is also another fake security (because it is
immutable, whereas passwords can and should be changed regularly), and it is
extremely easy to duplicate a biometric data record. To be more effective,
the physical sensor device should be internally secured and its internal
data instantly flushed in case of intrusion, and the device should be
securely authenticated in addition to performing the biometric check; the
biometric data should not be transmitted, but instead used to compute a
secure hash from the hidden biometric data and negotiated, checked, unique
randomized data from the source requesting access. It should use public-key
encryption with a pair of public/private key pairs, not symmetric keys, or
three key pairs if using another independent third party: the private keys
are never exchanged or duplicated. But at some point you'll need to reset
those keys, and the only tool you'll have will be cleartext pass phrases,
even if there's physical device identification, encryption with key pairs,
and the extremely private biometric data.

Unfortunately, biometric data is now shared with governmental third parties,
and even exchanged internationally (it is present on passports, and biometric
passports are now mandatory for anyone taking a plane to/from/via the United
States and now in many European countries as well; DNA traces are also very
easy to capture. Biometric data is no longer private property, so it cannot
be used as a secret for access authentication or signatures). There is still
nothing to replace pass phrases, and those need to be user friendly for
their legitimate owners.


Re: Unicode in passwords

2015-10-06 Thread Stephane Bortzmeyer
On Tue, Oct 06, 2015 at 12:57:51PM +0900,
 Yoriyuki Yamagata  wrote 
 a message of 33 lines which said:

> FYI, IETF is working on this issue.  See Internet Draft
> https://tools.ietf.org/html/draft-ietf-precis-saslprepbis-17 based
> on PRECIS framework RFC 7564 https://tools.ietf.org/html/rfc7564

As already mentioned on that list, the draft is no longer a draft; it
was published as an RFC, RFC 7613, two months ago.



Re: Unicode in passwords

2015-10-06 Thread Julian Bradfield
On 2015-10-06, Asmus Freytag (t)  wrote:
> All browsers I use display spaces in input boxes, and put blobs for
> hidden fields. Do you have evidence for broken input fields?
> 
> 
> Network keys. That interface seems to consistently give people a
> choice to reveal the key.

? That's not broken in the way Philippe was discussing.

> Copy-paste works on all my systems, too - do you have evidence of
> broken copy-paste in this way?
> 
> 
> I've seen input fields where sites don't allow paste on the second
> copy (the confirmation copy).
> 
> Even for non-password things.

That's not relevantly broken, either - it's a design feature, to make
sure you can type the password again (from finger memory!).

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



Re: Unicode in passwords

2015-10-06 Thread Mark Davis ☕️
While I think that RFC is useful, it has been interesting just how many of
the problems recounted on this list go far beyond it, often having to do
with UI issues. It would be useful to have a paper somewhere that organizes
all of the problems presented here, and maybe makes a stab at describing
techniques for handling them.


Mark 

*— Il meglio è l’inimico del bene —*

On Tue, Oct 6, 2015 at 10:48 AM, Stephane Bortzmeyer 
wrote:

> On Tue, Oct 06, 2015 at 12:57:51PM +0900,
>  Yoriyuki Yamagata  wrote
>  a message of 33 lines which said:
>
> > FYI, IETF is working on this issue.  See Internet Draft
> > https://tools.ietf.org/html/draft-ietf-precis-saslprepbis-17 based
> > on PRECIS framework RFC 7564 https://tools.ietf.org/html/rfc7564
>
> As already mentioned on that list, the draft is no longer a draft; it
> was published as an RFC, RFC 7613, two months ago.
> 
>


Re: Unicode in passwords

2015-10-06 Thread Julian Bradfield
On 2015-10-06, Philippe Verdy  wrote:
> Finally note that passwords are not necessarily single identifiers
> (whitespaces and word separators are accepted), but whitespaces should
> require special handling, with trimming (at both ends) and compression of
> multiple occurrences.

Why would you trim or compress whitespace? Using multiple spaces seems a
perfectly legitimate way of making a password harder to guess.

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



Re: Unicode in passwords

2015-10-06 Thread Philippe Verdy
I don't think it is a good idea for textual passwords to make differences
based on the number of spaces. Being plain text, they are likely to be
displayed in user interfaces in a way that the user will not see. Without
trimming, users won't see the initial or final space, and the password
input method may not display them as well (e.g. in an HTML input form or
when using a button to generate passphrases that users must then copy-paste
to their password manager or to some private text document). Some password
storages also will implicitly trim and compress those strings (e.g. in a
fixed-width column of a table in a database). There's also frequently no
visual hint when entering or displaying those spaces and compression occurs
implicitly, or pass phrases may be line wrapped in the middle where you
won't see the number of spaces.
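
For illustration only, here is a small Java sketch of the trimming and
compression being debated (the regex and the choice of a single SPACE as
the replacement are assumptions); whichever behaviour is chosen, it has to
be applied identically when the password is set and when it is checked:

    /**
     * Collapses every run of Unicode whitespace to a single SPACE and
     * drops leading/trailing spaces before the password is hashed.
     * (?U) makes \s match Unicode whitespace, not just ASCII.
     */
    static String normalizeSpaces(String password) {
        return password.replaceAll("(?U)\\s+", " ")
                       .replaceAll("^ +| +$", "");
    }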

2015-10-06 12:25 GMT+02:00 Julian Bradfield :

> On 2015-10-06, Philippe Verdy  wrote:
> > Finally note that passwords are not necessarily single identifiers
> > (whitespaces and word separators are accepted), but whitespaces should
> > require special handling, with trimming (at both ends) and compression of
> > multiple occurrences.
>
> Why would you trim or compress whitespace? Using multiple spaces seems a
> perfectly legitimate way of making a password harder to guess.
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
>


Re: Unicode in passwords

2015-10-06 Thread Philippe Verdy
And there are severe issues in this RFC for its case mapping profile: it
requires converting "uppercase" characters to "lowercase", but these
properties are not stable (see for example the history of Cherokee letters,
changed from gc=Lo to gc=Lu when lowercase letters were added and with case
pairs added at the same time, see also the addition of the capital sharp S
for German).

That RFC should have used the Unicode "Case Folding" algorithm, which is
stable (case-folded strings are NOT necessarily all lowercase; they are
just guaranteed to keep a single case variant, and case folding implies the
use of compatibility normalization forms, i.e. NFKC or NFKD, to get the
correct closure: the standard Unicode normalizations are also stable)!
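
The difference is easy to see with the sharp s mentioned above; a small
Java sketch, assuming ICU4J (UCharacter.foldCase) is available:

    import com.ibm.icu.lang.UCharacter;

    public final class FoldVsLower {
        public static void main(String[] args) {
            String a = "\u00DF";  // LATIN SMALL LETTER SHARP S
            String b = "SS";

            // Conversion to lowercase does NOT unify the two spellings:
            System.out.println(a.toLowerCase(java.util.Locale.ROOT));  // ß
            System.out.println(b.toLowerCase(java.util.Locale.ROOT));  // ss

            // Full case folding unifies them, and folding is the stable
            // operation (folded strings are not necessarily lowercase):
            System.out.println(UCharacter.foldCase(a, true));  // ss
            System.out.println(UCharacter.foldCase(b, true));  // ss
        }
    }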

2015-10-06 10:48 GMT+02:00 Stephane Bortzmeyer :

> On Tue, Oct 06, 2015 at 12:57:51PM +0900,
>  Yoriyuki Yamagata  wrote
>  a message of 33 lines which said:
>
> > FYI, IETF is working on this issue.  See Internet Draft
> > https://tools.ietf.org/html/draft-ietf-precis-saslprepbis-17 based
> > on PRECIS framework RFC 7564 https://tools.ietf.org/html/rfc7564
>
> As already mentioned on that list, the draft is no longer a draft; it
> was published as an RFC, RFC 7613, two months ago.
> 
>


Re: Why Nothing Ever Goes Away

2015-10-06 Thread Philippe Verdy
2015-10-06 14:24 GMT+02:00 Sean Leonard :

> 2. The Unicode code charts are (deliberately) vague about U+0080, U+0081,
>> and U+0099. All other C1 control codes have aliases to the ISO 6429
>> set of control functions, but in ISO 6429, those three control codes don't
>> have any assigned functions (or names).
>>
>
> On 10/5/2015 3:57 PM, Philippe Verdy wrote:
>
>> Also the aliases for C1 controls were formally registered in 1983 only
>> for the two ranges U+0084..U+0097 and U+009B..U+009F for ISO 6429.
>>
>
> If I may, I would appreciate another history lesson:
> In ISO 2022 / 6429 land, it is apparent that the C1 controls are mainly
> aliases for ESC 4/0 - 5/15. ( @ through _ ) This might vary depending on
> what is loaded into the C1 register, but overall, it just seems like saving
> one byte.
>
> Why was C1 invented in the first place?
>

Look at the history of EBCDIC and its adaptation/conversion to
ASCII-compatible encodings: round-trip conversion was needed (using only a
simple reordering of byte values, with no duplicates). EBCDIC used many
controls that were not part of C0, and these were kept in the C1 set.
Ignore the 7-bit compatibility encoding using ESC pairs; they were only
needed for ISO 2022, but ISO 6429 defines a profile where those longer
sequences are not needed and are even forbidden in 8-bit contexts or in
contexts where aliases are undesirable and invalidated, such as security
environments.

By that reasoning, I would conclude that assigning characters in the G1
set was also a duplication, because it is reachable with a C0 "shifting"
control plus a position of the G0 set. In that case ISO 8859-1 or Windows
1252 was also an unneeded duplication! And we would live today in a
7-bit-only world.

C1 controls have their own identity. The 7-bit encoding using ESC is just a
hack to make them fit in 7 bits, and it only works where the ESC control is
assumed to play this function according to ISO 2022, ISO 6429, or other
similar old 7-bit protocols such as Videotex (which was widely used in
France with the free "Minitel" terminal, long before the introduction of
the Internet to the general public around 1992-1995).

Today Videotex is definitely dead (the old call numbers for this slow
service are now defunct; the Minitels have been recycled as waste, they
stopped being distributed and were replaced by applications on PCs connected
to the Internet, and now all the old services are directly on the Internet
and none of them use 7-bit encodings for their HTML pages or their mobile
applications). France has also definitively abandoned its old French version
of ISO 646; there are no longer any printers supporting versions of ISO 646
other than ASCII, though they still support various 8-bit encodings.

7-bit encodings are things of the past (they were only justified at a time
when communication links were slow and generated lots of transmission
errors, and the only implemented mechanism to check them was a single
parity bit per character). Today we transmit long datagrams and prefer
check codes over the whole datagram (such as CRCs or error-correcting
codes). 8-bit encodings are much easier and faster to process for
transmitting not just text but also binary data.

Let's forget the 7-bit world for good. We have also abandoned the old
UTF-7 in Unicode! I've not seen it used anywhere except in a few old
emails sent at the end of the '90s, because many mail servers were still not
8-bit clean and silently transformed non-ASCII bytes in unpredictable ways
or using unspecified encodings, or just silently dropped the high bit,
assuming it was just a parity bit: at that time, emails were not sent with
SMTP, but with the old UUCP protocol, and could take weeks to be delivered
to the final recipient, as there was still no global routing infrastructure
and many hops were necessary via non-permanent modem links. My opinion of
UTF-7 is that it was just a temporary and experimental solution to help
system admins and developers adopt the new UCS, including for their old
7-bit environments.


Re: Why Nothing Ever Goes Away

2015-10-06 Thread Sean Leonard

2. The Unicode code charts are (deliberately) vague about U+0080, U+0081,
and U+0099. All other C1 control codes have aliases to the ISO 6429 set of
control functions, but in ISO 6429, those three control codes don't have
any assigned functions (or names).


On 10/5/2015 3:57 PM, Philippe Verdy wrote:
Also the aliases for C1 controls were formally registered in 1983 only 
for the two ranges U+0084..U+0097 and U+009B..U+009F for ISO 6429.


If I may, I would appreciate another history lesson:
In ISO 2022 / 6429 land, it is apparent that the C1 controls are mainly 
aliases for ESC 4/0 - 5/15. ( @ through _ ) This might vary depending on 
what is loaded into the C1 register, but overall, it just seems like 
saving one byte.


Why was C1 invented in the first place?

And, why did Unicode deem it necessary to replicate the C1 block at 
0x80-0x9F, when all of the control characters (codes) were equally 
reachable via ESC 4/0 - 5/15? I understand why it is desirable to align 
U+0000 - U+007F with ASCII, and maybe even U+0000 - U+00FF with Latin-1 
(ISO-8859-1). But maybe Windows-1252, MacRoman, and all the other 
non-ISO-standardized 8-bit encodings got this much right: duplicating 
control codes is basically a waste of very precious character code real 
estate.


Sean

PS I was not able to turn up ISO 6429:1983, but I did find ECMA-48, 4th 
Ed., December 1986, which has the following text:

***
5.4 Elements of the C1 Set
These control functions are represented:
- In a 7-bit code by 2-character escape sequences of the form ESC Fe, 
where ESC is represented by bit combination 01/11 and Fe is represented 
by a bit combination from 04/00 to 05/15.

- In an 8-bit code by bit combinations from 08/00 to 09/15.
***

This text is seemingly repeated in many analogous standards ca. ~1974 - 
~1992.
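
In other words, each C1 control byte 08/00 - 09/15 (0x80-0x9F) corresponds
to ESC followed by the byte 0x40 lower (04/00 - 05/15). A tiny illustrative
sketch of that correspondence in Java (the method name is assumed):

    /**
     * Converts an 8-bit C1 control (0x80..0x9F) to its 7-bit ESC Fe form
     * per the ECMA-48 text quoted above: Fe is the C1 byte minus 0x40.
     * For example, CSI (0x9B) becomes ESC [ (0x1B 0x5B).
     */
    static byte[] toEscFe(int c1) {
        if (c1 < 0x80 || c1 > 0x9F) {
            throw new IllegalArgumentException(
                    "not a C1 control: 0x" + Integer.toHexString(c1));
        }
        return new byte[] { 0x1B, (byte) (c1 - 0x40) };
    }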


PPS I happen to have a copy of ANSI X3.41-1974 "American National 
Standard Code Extension Techniques for Use with the 7-Bit Coded 
Character Set of [ASCII]". The invention/existence of C1 goes back to 
this time, as does the use of ESC Fe to invoke C1 characters in a 7-bit 
code, and 0x80-0x9F to invoke C1 characters in an 8-bit code. (See, in 
particular, Clauses 5.3.3.1 and 5.3.6). In particular, Clause 7.3.1.2 
says: "The use of ESC Fe sequence in an 8-bit environment is contrary to 
the intention of this standard but, should they occur, their meaning is 
the same as in the 7-bit environment."


I can appreciate why it was desirable to "fold" C1 in an 8-bit 
environment into a 7-bit environment with ESC Fe. (If, in fact, that was 
the direction of standardization: invent a new thing and then devise a 
coding to express the new thing in the old thing.) It is less obvious 
why Unicode adopted C1, however, when the trend was to jettison the 
94-character Tetris block assignments in favor of a wide-open field for 
character assignment. Except for the trend in Unicode to "avoid 
assigning characters when explicitly asked, unless someone implements 
them without asking, and the implementation catches on, and then just 
assign the whole lot of them, even when they overlap with existing 
assignments, and then invent composite characters, which further 
compound the possible overlapping combinations". 


Re: Unicode in passwords

2015-10-06 Thread Philippe Verdy
Note that Java strings DO allow the presence of lone surrogates, as well as
non-characters, because Java strings are unrestricted vectors of 16-bit
code units (non-BMP characters are handled as pairs of surrogates).

In those conditions, normalizing the Java string will leave those lone
surrogates (and non-characters) as is, or will throw an exception,
depending on the API used. Java strings do not have any implied encoding
(their "char" members are also unrestricted 16-bit code units, they have
some basic properties, but only in the BMP, defined in the built-in Character
class API: properties for non-BMP characters require using a library to
provide them, such as ICU4J).

This is essentially the same kind of thing as C/C++ "wide" strings using
16-bit wchar_t, except that:
- C/C++ wide strings do not allow the inclusion of U+0000, which is a
terminator, unless you use a string class that keeps the actual string
length (and not just the allocated buffer length, which may be larger).
- Java strings, including literals, are immutable, and optionally interned
into a global dictionary, which includes all string literals, to share the
storage space of multiple instances with equal contents, including across
distinct classes from distinct packages.
- This is also true for string literals (which are all immutable and
interned, and initialized from the compiled bytecode of classes using a
modified version of UTF-8 that preserves all 16-bit code units, including
lone surrogates and non-characters, but also stores U+0000 as <0xC0,0x80>;
see the sketch after this list). This modified UTF-8 encoding is also what
you get if you use the JNI interface version with 8-bit strings (this
internally requires a conversion by JNI, using a temporary buffer); if you
use the JNI interface version with 16-bit strings, you work directly with
the internal 16-bit Java strings and there's no conversion: you'll also get
the lone surrogates and all non-characters, and you are not restricted to
only valid UTF-16.
- Java strings are commonly used for fast initialization of large immutable
binary arrays, because the conversion from Modified UTF-8 to 16-bit strings
does not require running any compiled bytecode (this is not true for other
static arrays, which require large code for array literals and are not
guaranteed to be immutable: the alternative to this large compiled code is
to initialize those large static arrays by I/O from an external stream,
such as a file beside the class in the same package, and possibly packed in
the same JAR).
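
A small sketch of the Modified-UTF-8 point above, using
DataOutputStream.writeUTF (which writes the same modified encoding, after a
two-byte length prefix); the example string is of course just an assumption:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    public final class ModifiedUtf8Demo {
        public static void main(String[] args) throws IOException {
            String s = "a\u0000b";

            // Standard UTF-8: U+0000 becomes a single 0x00 byte.
            for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
                System.out.printf("%02X ", b & 0xFF);       // 61 00 62
            }
            System.out.println();

            // Modified UTF-8 (writeUTF, class files): U+0000 becomes the
            // pair C0 80, so the encoded form never contains a NUL byte.
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeUTF(s);
            byte[] out = bos.toByteArray();
            for (int i = 2; i < out.length; i++) {           // skip the length prefix
                System.out.printf("%02X ", out[i] & 0xFF);   // 61 C0 80 62
            }
            System.out.println();
        }
    }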

Java passwords are "strings", but these still allow arbitrary 16-bit code
units, even if they violate UTF-16 restrictions. You will not get much
difference if you use byte arrays, the only change being the size of the
code units. Between those two representations you are free to convert with
ANY pair of encodings, not just assuming UTF-8<>UTF-16.

However, for security reasons, it's best to avoid string literals for
passwords, because they can be enumerated from the global dictionary of
interned strings, or directly by reading the bytecode of the compiled
class, where they are stored in Modified UTF-8 but loaded and used as
arbitrary 16-bit strings (but the same is true if you use a byte array
literal! You can just parse the initialization bytecode to get the list of
bytes). If passwords or authorization keys are stored somewhere (as strings
or as byte arrays) they should be encrypted into safe storage, not placed
in static string literals or byte array initializers (they will BOTH be
clear text in the bytecode of the compiled class).

In both cases, there is NO normalization applied implicitly or
checked/enforced by the API (the only check that occurs is at class loading
time, for the Modified UTF-8 encoding of string literals: if it is wrong
the class will not load at all, and you'll get an invalid class exception;
there's no such check at all for the encoding of byte array initializers,
the only checks being the validity of the Java initializer bytecode and the
bounds of array indexes used by the initializer code).



2015-10-06 5:39 GMT+02:00 Martin J. Dürst :

> On 2015/10/01 13:11, Jonathan Rosenne wrote:
>
>> For languages such as Java, passwords should be handled as byte arrays
>> rather than strings. This may make it difficult to apply normalization.
>>
>
> Well, they should be received from the user interface as strings, then
> normalized, then converted to byte arrays using a well-defined single
> encoding. Somewhat tedious, but hopefully not difficult.
>
> Regards,   Martin.
>


Re: Unicode in passwords

2015-10-06 Thread Norbert Lindenberg

> On Oct 6, 2015, at 6:04 , Philippe Verdy  wrote:
> 
> In those conditions, normalizing the Java string will leave those lone 
> surrogates (and non-characters) as is, or will throw an exception, depending 
> on the API used. Java strings do not have any implied encoding (their "char" 
> members are also unrestricted 16-bit code units, they have some basic 
> properties but only in BMP, defined in the builtin Character class API: 
> properties for non-BMP characters require using a library to provide them, 
> such as ICU4J).

The Java Character class was enhanced in J2SE 5.0 to support supplementary 
characters. The String class was specified to be based on UTF-16, and string 
processing throughout the platform was updated to support supplementary 
characters based on UTF-16. These changes have been available to the public 
since 2004. For a summary, see
http://www.oracle.com/technetwork/articles/java/supplementary-142654.html
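
For example (a small sketch, not taken from the article), the code-point
APIs added in J2SE 5.0 let string processing treat a supplementary
character as one code point even though it occupies two chars:

    public final class SupplementaryDemo {
        public static void main(String[] args) {
            // U+1D49C MATHEMATICAL SCRIPT CAPITAL A, a supplementary character
            String s = "\uD835\uDC9C";
            System.out.println(s.length());                       // 2 UTF-16 code units
            System.out.println(s.codePointCount(0, s.length()));  // 1 code point
            int cp = s.codePointAt(0);
            System.out.println(Integer.toHexString(cp));          // 1d49c
            System.out.println(Character.isLetter(cp));           // true (int overload)
        }
    }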

Norbert


Re: Why Nothing Ever Goes Away

2015-10-06 Thread Doug Ewell
Asmus Freytag (t) wrote:

> Nobody wanted to follow the IBM code page 437 (then still the most
> widely used single byte vendor standard).

Although to this day, the UN/LOCODE manual [1] still refers to 437 as
"the standard United States character set" and claims that it "conforms
to these ISO standards" (8859-1:1987 and 10646-1:1993).

[1]
http://www.unece.org/fileadmin/DAM/cefact/locode/2015-1_UNLOCODE_SecretariatNotes.pdf

> Also, the overloading of 0x80-0xFF by Windows did not happen all at
> once, earlier versions had left much of that space open,

And it's still not completely filled, in any of the 125x code pages
except for the quirky 1256.

--
Doug Ewell | http://ewellic.org | Thornton, CO 




Re: Unicode in passwords

2015-10-06 Thread Asmus Freytag (t)

  
  
On 10/6/2015 7:31 AM, Julian Bradfield wrote:

> All browsers I use display spaces in input boxes, and put blobs for
> hidden fields. Do you have evidence for broken input fields?

Network keys. That interface seems to consistently give people a
choice to reveal the key.

>> when using a button to generate passphrases that users must then copy-paste
>> to their password manager or to some private text document).
>
> Copy-paste works on all my systems, too - do you have evidence of
> broken copy-paste in this way?

I've seen input fields where sites don't allow paste on the second
copy (the confirmation copy).

Even for non-password things.

A./

  



Re: Unicode in passwords

2015-10-06 Thread Julian Bradfield
On 2015-10-06, Philippe Verdy  wrote:
> I don't think it is a good idea for textual passwords to make differences
> based on the number of spaces. Being plain text, they are likely to be
> displayed in user interfaces in a way that the user will not see. Without

This is true of all passwords. Passwords have to be typed by finger
memory, not by looking at them (unless you're the type who puts them
on sticky notes, in which case you type by looking at the text on the
note). One doesn't normally see the characters, at best a count of
characters.

> trimming, users won't see the initial or final space, and the password
> input method may not display them as well (e.g. in an HTML input form or

All browsers I use display spaces in input boxes, and put blobs for
hidden fields. Do you have evidence for broken input fields?

> when using a button to generate passphrases that users must then copy-paste
> to their password manager or to some private text document).

Copy-paste works on all my systems, too - do you have evidence of
broken copy-paste in this way?

> Some password
> storages also will implicitly trim and compress those strings (e.g. in a

If it compresses it on setting, but doesn't compress it on testing, or
vice versa, then that's a bug. If it does the same for setting and
testing, it doesn't matter (except to compromise the crack-resistance
of the password).

> fixed-width column of a table in a database). There's also frequently no
> visual hint when entering or displaying those spaces and compression occurs

Evidence? Maybe if you're typing a password into a Word document it's
hard to count spaces, but why would you be doing that?

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



Re: Why Nothing Ever Goes Away

2015-10-06 Thread Asmus Freytag (t)

  
  
On 10/6/2015 5:24 AM, Sean Leonard wrote:

> And, why did Unicode deem it necessary to replicate the C1 block at
> 0x80-0x9F, when all of the control characters (codes) were equally
> reachable via ESC 4/0 - 5/15? I understand why it is desirable to
> align U+0000 - U+007F with ASCII, and maybe even U+0000 - U+00FF with
> Latin-1 (ISO-8859-1). But maybe Windows-1252, MacRoman, and all the
> other non-ISO-standardized 8-bit encodings got this much right:
> duplicating control codes is basically a waste of very precious
> character code real estate.

Because Unicode aligns with ISO 8859-1, so that transcoding from that
was a simple zero-fill to 16 bits.

8859-1 was the most widely used single byte (full 8-bit) ISO standard
at the time, and making that transition easy was beneficial, both
practically and politically.

Vendor standards all disagreed on the upper range, and it would not
have been feasible to single out any of them. Nobody wanted to follow
the IBM code page 437 (then still the most widely used single byte
vendor standard).

Note that by "then" I refer to dates earlier than the dates of the
final drafts, because many of those decisions date back to earlier
periods when the drafts were first developed. Also, the overloading of
0x80-0xFF by Windows did not happen all at once, earlier versions had
left much of that space open, but then people realized that as long as
you were still limited to 8 bits, throwing away 32 codes was an issue.

Now, for Unicode, 32 out of 64K values (initially) or 1,114,112 (now)
don't matter, so being "clean" didn't cost much. (Note that even for
UTF-8, there's no special benefit to a value being inside that second
range of 128 codes.)

Finally, even if the range had not been dedicated to C1, the 32 codes
would have had to be given space, because the translation into ESC
sequences is not universal, so in transcoding data you needed a way to
retain the difference between the raw code and the ESC sequence, or
your round-trip would not be lossless.

A./
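
That "simple zero-fill" is literally a widening conversion; a two-line
illustrative sketch in Java:

    /** Transcodes an ISO 8859-1 byte to the corresponding Unicode code point:
     *  the scalar value is identical, so it is just a zero-extension to 16 bits. */
    static char latin1ToChar(byte b) {
        return (char) (b & 0xFF);
    }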