Re: Best practice of using regex on identify none-ASCII email address

2013-11-02 Thread Daode
Philippe Verdy  wrote:
 |2013/11/2 Philippe Verdy 
 |> Note also that the SMTP server configured in ISO-8859-1 will not accept
 |>To: 
 |> but may accept
 |>To: 
 |> or only the raw form
 |>To: http://xn--glac-epa.example.net>>
 |> or could accept the last two as *distinct* addresses (each one needing its
 |> own way to escape them in envelope formats).
 |
 |it should always accept the domain part "@xn--glac-epa.example.net"
 |everywhere "@glacé.example.net " is not
 |supported (when the SMTP server does not know IDN)

btw. i never liked IDNA, also because it is terrible to look at
for users unless you're using some software which actually knows
that it is looking at IDNA and is capable of decoding it.
So yes, user-readability would be the thing i'd worry about.

(When i implemented our caching resolver back in 2004 (latest
RFC read was RFC 3845) i wondered why the unused bit wasn't used
to switch to an extended protocol with longer labels and names.
If that had been done once the widespread use of UTF-8 was
clear (rather than foreseeable), there would have been more
options.  It's messy, but M$ had an ugly solution for the short
forms when they added long filenames, and that wouldn't have
been impossible for DNS either.  But that ship has sailed, too.)

In the end software will have to hide even more complexity,
whatever this will end up with (thus most likely percent
encoding).

--steffen


Re: Best practice of using regex on identify none-ASCII email address

2013-11-02 Thread Daode
Philippe Verdy  wrote:
 |and all this is about interacting with SMTP servers in the SMTP protocol
 |(or related protocols). Nothing correctly describes the embedding in a text
 |document (plain-text, or even HTML, XML) that can itself be fully reencoded
 |(and that may not even accept all raw UTF-8 byte values) ! That's what
 |you've forgotten.

yup, that train has left the station long ago.

 |Using the %-escapes used only in the SMTP protocol makes the embedding in
 |documents really unreadable: this defeats completely the work done to
 |support native names in IDNA for the domain name part of the address, if
 |the local part can only be represented using %nn-escaped bytes of UTF-8
 |sequences.

Just like Bruce Schneier wrote in his September letter, though in
a slightly different context:

  Two, we can design. We need to figure out how to re-engineer the
  Internet

Nice prospects!

--steffen


Re: Best practice of using regex on identify none-ASCII email address

2013-11-02 Thread Philippe Verdy
2013/11/2 Philippe Verdy 

> Note also that the SMTP server configured in ISO-8859-1 will not accept
>To: 
> but may accept
>To: 
> or only the raw form
>To: http://xn--glac-epa.example.net>>
> or could accept the last two as *distinct* addresses (each one needing its
> own way to escape them in envelope formats).
>

it should always accept the domain part "@xn--glac-epa.example.net"
everywhere "@glacé.example.net " is not
supported (when the SMTP server does not know IDN)
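
For illustration, the mapping between the two domain forms can be sketched with Python's built-in "idna" codec. This is only a sketch: that codec implements the older IDNA 2003 rules, and the helper names are made up for this example.

```python
def to_ascii_domain(domain: str) -> str:
    """Convert a Unicode domain name to its ASCII (xn--) form."""
    # Python's built-in "idna" codec follows IDNA 2003 (RFC 3490).
    return domain.encode("idna").decode("ascii")


def to_unicode_domain(domain: str) -> str:
    """Convert an ASCII (xn--) domain name back to Unicode for display."""
    return domain.encode("ascii").decode("idna")


ascii_form = to_ascii_domain("glacé.example.net")
display_form = to_unicode_domain(ascii_form)
print(ascii_form)
print(display_form)
```

A server that does not know IDN only ever sees the ASCII form; the conversion back is purely a display matter.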


Re: Best practice of using regex on identify none-ASCII email address

2013-11-02 Thread Philippe Verdy
2013/11/2 Steffen Daode 

> There is RFC 5322 which specifies the format of internet messages,
> and then there were the 3+ RFCs (RFC 6530-32) which simply
> redefine that format to be UTF-8 aware and its limits to deal with
> characters not octets (multiply line lengths etc. with 4).
> These UTF-8 extensions can only be used when directly interacting
> with a SMTP / (POP3, IMAP; RFCs 6856 and 6855 i think belong)
> server.
>

and all this is about interacting with SMTP servers in the SMTP protocol
(or related protocols). Nothing correctly describes the embedding in a text
document (plain-text, or even HTML, XML) that can itself be fully reencoded
(and that may not even accept all raw UTF-8 byte values) ! That's what
you've forgotten.

Using the %-escapes used only in the SMTP protocol makes the embedding in
documents really unreadable: this defeats completely the work done to
support native names in IDNA for the domain name part of the address, if
the local part can only be represented using %nn-escaped bytes of UTF-8
sequences.

So instead of reading and typing
http://xn--glac-epa.example.net>>
users will have to decipher (or type):

Really poor ! Things would be better with:
http://xn--glac-epa.example.net>>

which should work as long as all non-ASCII characters exist in both the
specified target encoding (UTF-8 here) and the envelope encoding (it could
be windows-1252 here, e.g. for this email from me, or the encoding chosen
during the transport to reach you in your mailbox).

Note also that the SMTP server configured in ISO-8859-1 will not accept
   To: 
but may accept
   To: 
or only the raw form
   To: http://xn--glac-epa.example.net>>
or could accept the last two as *distinct* addresses (each one needing its
own way to escape them in envelope formats).

However, SMTP servers are supposed to understand MIME conventions in MIME
headers, and these conventions should be used here to solve the issue in
other text documents, outside SMTP itself. MIME has long offered
quoted-printable (it also offers base-64), with clear identification of
the encoding in each protocol field.


Note also that the user interface of email agents does not have this
limitation: they can directly display the first form in their forms because
they know they are speaking SMTP, so they decipher these themselves,
independently of the encoding of envelope formats. Here we're speaking
about situations where addresses are exchanged outside of SMTP, for example
in word processor documents, readme files...

In addition these newer RFCs are not followed on many SMTP servers that
absolutely don't understand these escapes or that have never accepted the
UTF-8 encoding, but still accept their own local 8-bit encoding **only** in
raw form.


Re: Best practice of using regex on identify none-ASCII email address

2013-11-02 Thread Daode
There is RFC 5322 which specifies the format of internet messages,
and then there were the 3+ RFCs (RFC 6530-32) which simply
redefine that format to be UTF-8 aware and its limits to deal with
characters not octets (multiply line lengths etc. with 4).
These UTF-8 extensions can only be used when directly interacting
with a SMTP / (POP3, IMAP; RFCs 6856 and 6855 i think belong)
server.  And then there are

  rfc6857.txt Post-Delivery Message Downgrading for
  Internationalized Email Messages
  rfc6858.txt Simplified POP and IMAP Downgrading for
  Internationalized Email

which describe the "necessary limitations" of the entire RFC
6530-32 and RFC 6855-58 complex.
Thus, either a message conforms to RFC 5322 (possibly including
"downgraded" headers, in case someone (already) cares about
those), or both sides agree to use the UTF-8 extension.


So, today.  Since RFC 5322 states very clearly:

  addr-spec       =   local-part "@" domain
  local-part      =   dot-atom / quoted-string / obs-local-part

  dot-atom        =   [CFWS] dot-atom-text [CFWS]
  dot-atom-text   =   1*atext *("." 1*atext)
  qcontent        =   qtext / quoted-pair
  qtext           =   %d33 /             ; Printable US-ASCII
                      %d35-91 /          ;  characters not including
                      %d93-126 /         ;  "\" or the quote character
                      obs-qtext
  quoted-string   =   [CFWS]
                      DQUOTE *([FWS] qcontent) [FWS] DQUOTE
                      [CFWS]

any octet with a high bit set is not allowed in the local part.
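
As a rough sketch of what that grammar means in practice, the dot-atom-text rule can be written as a regular expression. It covers only the unquoted dot-atom form (no CFWS, quoted-string or obsolete syntax), and the function name is made up for this example.

```python
import re

# atext from RFC 5322: ASCII letters, digits and a fixed set of symbols.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
# dot-atom-text = 1*atext *("." 1*atext)
DOT_ATOM_TEXT = re.compile(rf"{ATEXT}+(?:\.{ATEXT}+)*")


def is_dot_atom_local_part(local_part: str) -> bool:
    """Return True if local_part matches RFC 5322 dot-atom-text exactly."""
    return DOT_ATOM_TEXT.fullmatch(local_part) is not None


print(is_dot_atom_local_part("john.doe"))
print(is_dot_atom_local_part("café"))
```

Anything with the high bit set, such as "é", falls outside atext and is rejected.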

 |>>> That being true, I wish that industry could come to consensus about
 |>>> requiring everything outside of a well-defined, backwards-compatible \
 |>>> set of
 |>>> characters to be expressed as UTF-8 percent-escaped characters in these
 |>>> fields when they are expressed as plaintext.
 |>>>
 |>>
 |>> If there is not already a convention for percent-escaped UTF-8 in email
 |>> addresses, then please let's not add one like that but rather escape *code
 |>> points*.

What about UTF-7 (RFC 2152):

   We also feel that UTF-8 in Base64 has high expansion for non-
   Western-European users, and is less desirable because it cannot
   be read directly, even when the content is largely US-ASCII.
   The base encoding of UTF-7 gives competitive results and is
   readable for ASCII text.
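
To make the readability argument concrete, here is a small sketch comparing UTF-7 with UTF-8 in Base64, using only Python's standard codecs. The sample string is an arbitrary choice for illustration.

```python
import base64

text = "café"

# UTF-7 leaves the ASCII letters as they are and only shifts into
# a base64 run for the non-ASCII character.
utf7 = text.encode("utf-7")

# UTF-8 wrapped in Base64 hides the ASCII letters as well.
utf8_b64 = base64.b64encode(text.encode("utf-8"))

print(utf7)
print(utf8_b64)
```

Both results are pure ASCII, but only the UTF-7 form keeps "caf" legible.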

Due to lack of possibility to use MIME encoding in the local-part
(most likely due to RFC 3986, if that matters), the following from
[1] will possibly be rethought:

  It should be noted that the Unicode Standard also defines the
  UTF-7 charset, which was intended for Internet mail. However, MIME
  is quite capable of carrying UTF-8, and UTF-8 is expected to be
  used in many protocols, not just Internet mail. Fortunately, very
  few vendors implemented UTF-7, and its use is strongly discouraged
  in Internet mail.

  [1] 

--steffen

Re: Best practice of using regex on identify none-ASCII email address

2013-11-02 Thread Philippe Verdy
Me too... only raw bytes are accepted by the SMTP or POP3 protocols.
This does not mean that within URLs they cannot (or should not) be escaped!

Of course they should be escaped, because raw bytes can't be used reliably
if they can be transformed depending on how the URL (or IRI, if the domain
name part is internationalized and written in possibly unescaped form using
IDNA) is reencoded. Note that IDNA is also NOT usable at all for the local
part.

However, this is still not specified in any standard for URLs, meaning that
you cannot safely embed any email address in **any** plain-text document if
the local part contains non-ASCII byte values (I say "byte values" and not
"characters" because we absolutely don't know if these bytes represent
characters or not, and can't break them into elementary subsequences
representing a single abstract character).

For such an application, where these byte values (from 0x80 to 0xFF
inclusive) are used in the local part of an email address for which the
binary encoding must be preserved (even if the container plain-text
document is reencoded), I see no other solution than using escaping. Note
that no escaping is needed for printable ASCII bytes, even if they are
reencoded by the container document (e.g. in EBCDIC): to get back the
correct ASCII encoding expected by SMTP and POP3, you have to reconvert
this container encoding back to ASCII (this will preserve the escaping of
other byte values).

Another way to allow the encoding to be preserved, while still allowing
the local part to be readable, would be to use "quoted-printable"
encoding with a prefix specifying the encoding expected by the target SMTP
server.

E.g. suppose you want to write to "café@example.net", whose SMTP server
expects the non-ASCII "é" to be encoded with 1 byte = 0xE9 (because it was
expecting usernames to have been created in ISO-8859-1 or windows-1252).

Then in an URL or in any plaintext document it should be escaped:
 or mailto:?Q?ISO-8859-1?ca
fé?@example.net">
**even** if the container document is encoded in the same specified encoding.

If the text document is reencoded to some UTF, the "é" will be preserved,
just like the quoted-printable prefix indicator specifying the expected
target encoding. In that document the "é" may be in UTF-8 as well in the
URL, but converting that URL back to an address usable in SMTP will require
reconverting this UTF-8 encoding back to the original encoding.

If the text document is converted to ASCII-only, quoted-printable will need
to be replaced by base-64, but the encoding will remain in the prefix
"?B?ISO-8859-1?".

A mailto URL or embedded email address that does not specify the target
encoding (in quoted-printable or base-64, like in MIME) is NOT safe to use
if it contains ANY non-ASCII character.
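
The ambiguity can be shown with a short sketch: the same visible local part yields different bytes, and so different percent-escapes, depending on the encoding the target server expects. This only uses urllib.parse from the standard library; it does not implement the prefix syntax proposed above.

```python
from urllib.parse import quote, unquote_to_bytes

local_part = "café"

# One byte (0xE9) for a server that created the account in ISO-8859-1 ...
latin1_escaped = quote(local_part, encoding="iso-8859-1")
# ... but two bytes (0xC3 0xA9) for a server that expects UTF-8.
utf8_escaped = quote(local_part, encoding="utf-8")

print(latin1_escaped, unquote_to_bytes(latin1_escaped))
print(utf8_escaped, unquote_to_bytes(utf8_escaped))
```

Without knowing which of the two the server wants, a reader of the document cannot rebuild the right byte sequence.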


2013/11/2 Buck Golemon 

>
>
>
> On Fri, Nov 1, 2013 at 8:40 AM, Markus Scherer wrote:
>
>> On Fri, Nov 1, 2013 at 1:37 AM, Mark Davis ☕  wrote:
>>
>>> That being true, I wish that industry could come to consensus about
>>> requiring everything outside of a well-defined, backwards-compatible set of
>>> characters to be expressed as UTF-8 percent-escaped characters in these
>>> fields when they are expressed as plaintext.
>>>
>>
>> If there is not already a convention for percent-escaped UTF-8 in email
>> addresses, then please let's not add one like that but rather escape *code
>> points*.
>>
>> markus
>>
>
> In my own trials, percent-escaped utf-8 does not work for the local part
> of the email.
> I found that only raw bytes (utf8 in my case) work acceptably.
>


Re: Best practice of using regex on identify none-ASCII email address

2013-11-01 Thread Buck Golemon
On Fri, Nov 1, 2013 at 8:40 AM, Markus Scherer  wrote:

> On Fri, Nov 1, 2013 at 1:37 AM, Mark Davis ☕  wrote:
>
>> That being true, I wish that industry could come to consensus about
>> requiring everything outside of a well-defined, backwards-compatible set of
>> characters to be expressed as UTF-8 percent-escaped characters in these
>> fields when they are expressed as plaintext.
>>
>
> If there is not already a convention for percent-escaped UTF-8 in email
> addresses, then please let's not add one like that but rather escape *code
> points*.
>
> markus
>

In my own trials, percent-escaped utf-8 does not work for the local part of
the email.
I found that only raw bytes (utf8 in my case) work acceptably.


Re: Best practice of using regex on identify none-ASCII email address

2013-11-01 Thread Markus Scherer
On Fri, Nov 1, 2013 at 1:37 AM, Mark Davis ☕  wrote:

> That being true, I wish that industry could come to consensus about
> requiring everything outside of a well-defined, backwards-compatible set of
> characters to be expressed as UTF-8 percent-escaped characters in these
> fields when they are expressed as plaintext.
>

If there is not already a convention for percent-escaped UTF-8 in email
addresses, then please let's not add one like that but rather escape *code
points*.

markus


Re: Best practice of using regex on identify none-ASCII email address

2013-11-01 Thread Mark Davis ☕
I'm not saying that what is sent to the server has to be those bytes; I'm
saying that if we use the convention that punctuation, whitespace, etc gets
escaped, it would allow us to recognize the boundaries of the local part in
plain text.

I think what you mention is part of a more general problem. Let's suppose
that I have an email address where the bytes that the server recognizes for
the local part are <61 B3>@foo.com. I convert that using Latin-14 to
aġ@foo.com. I send it in an email to you, and you receive it as UTF-8. You
see aġ@foo.com, but underneath the covers it is bytes <61 C4 A1>. But then
you send to the server <61 C4 A1>@foo.com, and it fails. Or worse yet,
reaches someone whose email is aġ@foo.com. (Ok, I could have poked around
and found a more compelling example, but you see the point).
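
That round trip can be replayed in a few lines of Python, as a sketch. It assumes "Latin-14" means ISO-8859-14, which is the charset where byte B3 is "ġ".

```python
# Bytes the server recognizes for the local part.
server_bytes = bytes([0x61, 0xB3])

# The sender displays them using ISO-8859-14 (assumed meaning of "Latin-14").
displayed = server_bytes.decode("iso8859_14")

# The recipient's mail is UTF-8, so the same text becomes different bytes.
resent = displayed.encode("utf-8")

print(displayed)
print(server_bytes.hex(" "), "->", resent.hex(" "))
```

The text looks identical on both sides, yet the bytes sent back to the server no longer match the account name.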

If I really wanted to be absolutely certain that my email wouldn't be
munged by a conversion, I'd never convert from bytes: we'd never see "
m...@foo.com", we'd always see the equivalent of %6d%61%72...@foo.com.
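
A sketch of that "never convert from bytes" extreme, where every byte of the local part is percent-escaped so no charset conversion can touch it. The function name is made up for this example.

```python
from urllib.parse import unquote_to_bytes


def escape_every_byte(local_part: bytes) -> str:
    """Percent-escape every byte, including plain ASCII letters."""
    return "".join(f"%{b:02x}" for b in local_part)


escaped = escape_every_byte(b"mar")
print(escaped)
# The original bytes come back unchanged, whatever the document encoding.
print(unquote_to_bytes(escaped))
```

It is safe, and of course unreadable for humans.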






Mark 
*— Il meglio è l’inimico del bene —*


On Fri, Nov 1, 2013 at 1:36 PM, Philippe Verdy  wrote:

>
>
> 2013/11/1 Mark Davis ☕ 
>
>> These are two well-known serious flaws in EAI and URLs; there is no
>> useful syntactic limit on what is in the query part of a URL or on the
>> local part of an email address that would allow their boundaries to be
>> detected in plaintext.
>>
>> No use complaining about them, because people are concerned with
>> backwards compatibility, and wouldn't change the underlying specs.
>>
>> That being true, I wish that industry could come to consensus about
>> requiring everything outside of a well-defined, backwards-compatible set of
>> characters to be expressed as UTF-8 percent-escaped characters in these
>> fields when they are expressed as plaintext. (Something like XID_Continue ±
>> exceptions.) That would allow for unambiguous parsing in plaintext.
>>
>
> Why "UTF-8" only? There already exist email accounts created with
> various ISO8859-* or windows codepages, or KOI8-R (or U). And none of these
> addresses are aliased with a UTF-8 encoded account name reaching the same
> mailbox (creating these aliases would help users having such accounts
> to protect their privacy, however there may exist rare cases where these
> aliases would conflict with distinct mail accounts).
>


Re: Best practice of using regex on identify none-ASCII email address

2013-11-01 Thread Philippe Verdy
2013/11/1 Mark Davis ☕ 

> These are two well-known serious flaws in EAI and URLs; there is no useful
> syntactic limit on what is in the query part of a URL or on the local part
> of an email address that would allow their boundaries to be detected in
> plaintext.
>
> No use complaining about them, because people are concerned with backwards
> compatibility, and wouldn't change the underlying specs.
>
> That being true, I wish that industry could come to consensus about
> requiring everything outside of a well-defined, backwards-compatible set of
> characters to be expressed as UTF-8 percent-escaped characters in these
> fields when they are expressed as plaintext. (Something like XID_Continue ±
> exceptions.) That would allow for unambiguous parsing in plaintext.
>

Why "UTF-8" only? There already exist email accounts created with various
ISO8859-* or windows codepages, or KOI8-R (or U). And none of these
addresses are aliased with a UTF-8 encoded account name reaching the same
mailbox (creating these aliases would help users having such accounts
to protect their privacy, however there may exist rare cases where these
aliases would conflict with distinct mail accounts).


Re: Best practice of using regex on identify none-ASCII email address

2013-11-01 Thread Mark Davis ☕
These are two well-known serious flaws in EAI and URLs; there is no useful
syntactic limit on what is in the query part of a URL or on the local part
of an email address that would allow their boundaries to be detected in
plaintext.

No use complaining about them, because people are concerned with backwards
compatibility, and wouldn't change the underlying specs.

That being true, I wish that industry could come to consensus about
requiring everything outside of a well-defined, backwards-compatible set of
characters to be expressed as UTF-8 percent-escaped characters in these
fields when they are expressed as plaintext. (Something like XID_Continue ±
exceptions.) That would allow for unambiguous parsing in plaintext.
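
A rough sketch of what such a convention could look like: keep XID_Continue characters plus a small exception set, and percent-escape everything else as UTF-8. The exception set (".", "-") and the function names are assumptions for illustration, not part of any spec.

```python
# Hypothetical exceptions kept unescaped in addition to XID_Continue.
ALLOWED_EXTRAS = frozenset(".-")


def is_xid_continue(ch: str) -> bool:
    """True if ch may continue an identifier (XID_Continue)."""
    # "a" + ch is a valid Python identifier only if ch is XID_Continue.
    return ("a" + ch).isidentifier()


def escape_for_plaintext(local_part: str) -> str:
    """Percent-escape (as UTF-8) everything outside the allowed set."""
    out = []
    for ch in local_part:
        if is_xid_continue(ch) or ch in ALLOWED_EXTRAS:
            out.append(ch)
        else:
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


print(escape_for_plaintext("jo hn"))
print(escape_for_plaintext("café"))
```

With whitespace and punctuation escaped, a plain-text scanner could find the boundaries of the local part by looking for the first character outside the allowed set.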


Mark 
*— Il meglio è l’inimico del bene —*


On Thu, Oct 31, 2013 at 8:37 PM, Philippe Verdy  wrote:

> How can it "surprisingly work" if you need to safely embed an
> email address as a URI in a plain text document? Yes, there's a way to work
> with the IDNA part, but the local part is a challenge, that will require
> (to make it work) that the mail server will accept several aliased account
> names, depending on the document in which the address was embedded and
> encoded before being dereferenced and used to send mails.
>
> There's no easy way to embed the local part in plain-text when it can be
> arbitrary sequences of bytes in the non-ASCII range, whose encoding in the
> target domain name is unpredictable without first querying the MX server
> for that domain for this info, or without retrying sending mails with
> several guesses: these guesses with retries may cause privacy issues for
> the legitimate owner of non-ASCII email accounts (another reasons for using
> email of verification/confirmation of the owner, before sending him private
> messages).
>
> 2013/10/31 Shawn Steele 
>
>>  I think that’s true for non-ASCII non-EAI local parts as well.  It’s
>> so inconsistent it’s surprising when it works?
>>
>
>


Re: Best practice of using regex on identify none-ASCII email address

2013-10-31 Thread Philippe Verdy
How can it "surprisingly work" if you need to safely embed an
email address as a URI in a plain text document? Yes, there's a way to work
with the IDNA part, but the local part is a challenge that will require
(to make it work) that the mail server accept several aliased account
names, depending on the document in which the address was embedded and
encoded before being dereferenced and used to send mails.

There's no easy way to embed the local part in plain-text when it can be
arbitrary sequences of bytes in the non-ASCII range, whose encoding in the
target domain name is unpredictable without first querying the MX server
for that domain for this info, or without retrying sending mails with
several guesses: these guesses with retries may cause privacy issues for
the legitimate owner of non-ASCII email accounts (another reason for using
an email for verification/confirmation of the owner, before sending him
private messages).

2013/10/31 Shawn Steele 

>  I think that’s true for non-ASCII non-EAI local parts as well.  It’s so
> inconsistent it’s surprising when it works?
>


RE: Best practice of using regex on identify none-ASCII email address

2013-10-30 Thread Shawn Steele
I think that's true for non-ASCII non-EAI local parts as well.  It's so
inconsistent it's surprising when it works?

From: ver...@gmail.com [mailto:ver...@gmail.com] On Behalf Of Philippe Verdy
Sent: Wednesday, October 30, 2013 6:30 PM
To: Shawn Steele
Cc: James Lin; Paweł Dyda; cldr-us...@unicode.org; unicode@unicode.org
Subject: Re: Best practice of using regex on identify none-ASCII email address

2013/10/31 Shawn Steele
For EAI (the question being asked), the entire address, local part and domain, 
are encoded in UTF-8.

No. The question being asked (by James Lin) did NOT include this restriction:

> "does anyone has the best practice or guideline on how to validate none-ASCII 
> email address by using regular expression?"

In his 2 replies, he did not add this restriction to EAI only (which is just
a possible option on the Internet, not mandatory and frequently not followed in
many domains).



Re: Best practice of using regex on identify none-ASCII email address

2013-10-30 Thread Philippe Verdy
2013/10/31 Shawn Steele 

>  For EAI (the question being asked), the entire address, local part and
> domain, are encoded in UTF-8.
>

No. The question being asked (by James Lin) did NOT include this
restriction:

> "does anyone has the best practice or guideline on how to validate
none-ASCII email address by using regular expression?"

In his 2 replies, he did not add this restriction to EAI only (which is
just a possible option on the Internet, not mandatory and frequently not
followed in many domains).


RE: Best practice of using regex on identify none-ASCII email address

2013-10-30 Thread Shawn Steele
For EAI (the question being asked), the entire address, local part and domain, 
are encoded in UTF-8.

-Shawn

From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Philippe Verdy
Sent: Wednesday, October 30, 2013 4:08 PM
To: James Lin
Cc: Paweł Dyda; cldr-us...@unicode.org; unicode@unicode.org
Subject: Re: Best practice of using regex on identify none-ASCII email address

You should not attempt to detect scripts or even assume that they are encoded
based on Unicode, in the username part; all you can do is to break at the
first "@" to split it between the user name part and the domain name, then use
the IDN specs to validate the domain name part.
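
As a minimal sketch of that split-then-validate approach, assuming Python's built-in "idna" codec (IDNA 2003) is good enough as the IDN check. The function names are made up, and note that a quoted local part may itself contain "@", which this simple split does not handle.

```python
def split_address(address: str) -> tuple[str, str]:
    """Split at the first "@" into (user name part, domain name part)."""
    local, sep, domain = address.partition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not a user@domain address: {address!r}")
    return local, domain


def domain_is_valid_idn(domain: str) -> bool:
    """Check the domain with the idna codec; the local part is left alone."""
    try:
        domain.encode("idna")
    except UnicodeError:
        return False
    return True


local, domain = split_address("café@glacé.example.net")
print(local, domain, domain_is_valid_idn(domain))
```

No script detection is attempted on the user name part; it is passed through untouched.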

* 1. Domain name part:

You may want to restrict only to internet domains (that must contain a dot
before the TLD), and validate the TLD label in a list that you do not restrict
for local usage only (such as .local or .localnet), or only for your own
domain, but I suggest that you validate all these domains only by performing an
MX request on your DNS server (this could take time to reply, unless you just
check the TLD part, which should be cached most often, or use the DNS request
only for domains not in a well-known list of gTLDs, plus all 2-letter ccTLDs
which are not in the private-use range of ISO 3166-1).

Note that to send a mail, you need an MX resolution on DNS to get the address of
a mail server, but it does not mean it will be immediately and constantly
reachable: the IP you get may be temporarily unreachable (due to your ISP or
local routing problems, or because the remote mail server is temporarily offline
or overloaded). Performing an MX request however is much faster than trying to
send a mail to it, because MX resolution will use your local DNS server cache
and the caches of the upstream DNS servers of your ISP (you normally don't need
to perform authoritative MX requests, which require a recursive search from the
root, bypassing all caches and the scalability of the DNS system, so it's not
a good policy to do it by default).

If you need security, authoritative DNS queries should be replaced by secure
emails based on direct authentication with the mail server at the start of the
SMTP session. Authoritative DNS queries should be performed only if this
authentication fails (in order to bypass incorrect data in DNS caches), but not
automatically (this could be caused by problems on your own site), so delay
these unchecked email addresses in your database (the problem may be solved
without doing anything when your server retries several minutes or hours
later, when it will have succeeded in sending the validation email to your
subscribers).

Do not insert in your database any email addresses coming from any source you
don't trust for having received the approval by the mail address owner, or not
obeying the same explicit approval policy seen by that user, or that is not
in a domain in your own control; otherwise you risk being flagged as spamming
and having your site blocked on various mail servers: you need to send the
validation email without sending any other kind of advertising, except your own
identity.

Note that instead of a domain, you *may* accept a host name with an IPv4 
address (in decimal dotted format), or an IPv6 address (within [brackets], and 
in hexadecimal with colons), or some other host name formats for specific 
mail/messaging transport protocols you accept, for example 
"username@[irc:ircservernname:port:channelname]", or "username@{uuid}" using 
other punctuation not valid in domain names.
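
A sketch of how the part after "@" might be classified before any DNS lookup, using the standard ipaddress module. The category labels and the function name are made up for this example.

```python
import ipaddress


def classify_host(host: str) -> str:
    """Classify the host part of an address (labels are illustrative)."""
    if host.startswith("[") and host.endswith("]"):
        inner = host[1:-1]
        try:
            # Bracketed literal, e.g. an IPv6 address in hex with colons.
            return f"ipv{ipaddress.ip_address(inner).version}-literal"
        except ValueError:
            # Some other bracketed format for a specific transport.
            return "other-bracketed"
    try:
        ipaddress.IPv4Address(host)
        return "ipv4"
    except ValueError:
        return "domain"


for host in ("192.0.2.1", "[2001:db8::1]", "example.net"):
    print(host, classify_host(host))
```

Only the "domain" case would then go on to IDN validation and an MX request.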


* 2. User name part:

There is no standard encoding there.

- Do not assume any encoding (unless you know the encoding used on each 
specific domain!). This part never obeys IDNA.
- Every unrestricted byte in the printable 7-bit ASCII range, and all bytes in 
0x80..0xFF, are valid in any sequence.
- Only a few punctuation characters of the ASCII range need to be checked 
according to the RFCs.
- Never "canonicalise" user names by forcing the capitalisation, not even for 
the basic Latin letters (user names could be encoded with Base-64, for example, 
where letter case is significant), even if you can do it for the domain name 
part.
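
The points above can be sketched as follows. This is only an illustration, 
working on raw bytes so that no encoding is assumed for the user name. The 
helper name split_address is hypothetical. It splits at the last "@", leaves 
the user name bytes untouched, and lowercases only the ASCII letters of the 
domain part.

```python
def split_address(raw: bytes):
    """Split raw address bytes into (local_part, domain) at the last '@'."""
    local, sep, domain = raw.rpartition(b"@")
    if not sep or not local or not domain:
        raise ValueError("expected local-part@domain")
    # bytes.lower() only changes ASCII letters; the local part is never folded.
    return local, domain.lower()
```

Two addresses would then be compared as the pair (local, domain), so that 
"QmFzZTY0@Example.NET" and "qmfzzty0@example.net" stay distinct.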





Re: Best practice of using regex on identify none-ASCII email address

2013-10-30 Thread Philippe Verdy
> [...] By focusing on each scripts to derive a regular expression, I was
> wondering if such validation can be accomplished here.
>
> Of course, RFC 3696 describes the email formatting rules, and we can use
> those rules to validate the format before checking the scripts for validity.
>
> Warm Regards,
> -James Lin
>
>
>
> From: Paweł Dyda 
> Date: Wednesday, October 30, 2013 at 2:19 PM
> To: James Lin 
> Cc: "cldr-us...@unicode.org" , Unicode List <
> unicode@unicode.org>
>
> Subject: Re: Best practice of using regex on identify none-ASCII email
> address
>
> Hi James,
>
> I am not sure if you have seen my email, but... I believe Regular
> Expressions are not a valid tool for that job (that is validating Int'l
> email address format).
>
> In the internal email I especially gave one specific example, where to my
> knowledge it is (nearly) impossible to use Regular Expression to validate
> email address.
>
> The reason I gave was mixed-script scenario.
>
> How can we ensure that we allow a mixture of Hiragana, Katakana and Latin,
> while basically disallowing any other combinations with Latin (especially
> Latin + Cyrillic or Latin + Greek)?
> I am really curious to know...
>
> And of course there are several single-script (homographs and alike)
> attacks that we might want to prevent. I don't think it is even remotely
> possible with Regular Expressions. Please correct me if I am wrong.
>
> Cheers,
> Paweł.
>
>
> 2013/10/30 James Lin 
>
>> Let me include the unicode alias as well for a wider audience, since this
>> topic has come up a few times in the past.
>>
>> From: James Lin 
>> Date: Wednesday, October 30, 2013 at 1:11 PM
>> To: "cldr-us...@unicode.org" 
>> Subject: Best practice of using regex on identify none-ASCII email
>> address
>>
>> Hi
>> Does anyone have a best practice or guideline on how to validate
>> non-ASCII email addresses by using regular expressions?
>>
>> I looked through RFC 6531 and the CLDR repository, and nothing has a solid
>> example of how to validate non-ASCII email addresses.
>>
>> thanks everyone.
>> -James
>>
>
>


RE: Best practice of using regex on identify none-ASCII email address

2013-10-30 Thread Shawn Steele
Mixed-script considerations are all supposed to be handled by the mailbox 
administrator.  It's perfectly valid for a domain to assign Latin addresses and 
also Cyrillic ones.  Indeed, for Cyrillic EAI one would almost certainly 
require ASCII (i.e. Latin) aliases during whatever the transition period is.

A German mailbox admin may allow only German letters and no other Latin 
characters in mailbox names.  Other admins may want to allow Latin characters 
together with other scripts (CJK locales come to mind).  And a Russian admin 
may provide all-Cyrillic mailboxes with all-Latin aliases for those names.  
(Hopefully that admin is being careful about homographs, but the standards 
still let the admin make the decisions.)
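
To make the idea of an administrator's policy concrete, here is a hedged 
Python sketch. Everything in it is an assumption for illustration: the 
ALLOWED_MIXES table is a hypothetical per-domain policy, and the script of a 
character is only approximated from the first word of its Unicode character 
name, because the standard library does not expose the Script property. It 
does nothing about homographs within a single script.

```python
import unicodedata

# Hypothetical policy: each entry is a set of scripts that may appear together.
ALLOWED_MIXES = [
    {"LATIN"},
    {"CYRILLIC"},
    {"LATIN", "HIRAGANA", "KATAKANA", "CJK"},
]


def scripts_of(text):
    """Approximate the scripts used in text from Unicode character names."""
    found = set()
    for ch in text:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER A" gives "CYRILLIC"
            found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found


def mix_is_allowed(mailbox_name):
    """True if all letters fit inside one of the allowed script sets."""
    found = scripts_of(mailbox_name)
    return any(found <= allowed for allowed in ALLOWED_MIXES)
```

With this table a mailbox mixing Latin and kana passes, while a Latin name 
containing a single Cyrillic letter is rejected. A different admin would 
simply use a different table.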

The PUA isn't even forbidden (I'm hoping for a pIqaD alias some day).

-Shawn




Re: Best practice of using regex on identify none-ASCII email address

2013-10-30 Thread James Lin
Hi
I am not expecting a single regular expression to solve all possible 
combinations of scripts.  What I am looking for (which may not be possible 
because of combined and mixed scripts) is something along the lines of 
validating each script with its own regular expression.  I am still wondering 
whether it is possible to have regular expressions for individual scripts only, 
with no mixing and matching (for the time being), such as (I am being very 
high level here):

 *   Phags-pa scripts
     *   Chinese: Traditional/Simplified
     *   Mongolian
     *   Sanskrit
     *   ...
 *   Kana scripts
     *   Japanese: Hiragana/Katakana
     *   ...
 *   Hebrew scripts
     *   Yiddish
     *   Hebrew
     *   Bukhori
     *   ...
 *   Latin scripts
     *   English
     *   Italian
     *   ...
 *   Hangul scripts
     *   Korean
 *   Cyrillic scripts
     *   Russian
     *   Bulgarian
     *   Ukrainian
     *   ...

By focusing on each script to derive a regular expression, I was wondering if 
such validation can be accomplished here.

Of course, RFC 3696 describes the email formatting rules, and we can use those 
rules to validate the format before checking the scripts for validity.
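
For illustration only, a per-script check of this kind might look like the 
Python sketch below. The names SCRIPT_PATTERNS and single_script are 
hypothetical, and the character ranges are a deliberately small assumption 
(one or two Unicode blocks per script), not a complete definition of any 
script. Python's built-in re module has no script property, so explicit 
ranges are used.

```python
import re

# Illustrative block ranges only; real scripts span more blocks than these.
SCRIPT_PATTERNS = {
    "latin": re.compile(r"[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F]+"),
    "cyrillic": re.compile(r"[\u0400-\u04FF]+"),
    "hebrew": re.compile(r"[\u05D0-\u05EA]+"),
    "kana": re.compile(r"[\u3041-\u3096\u30A1-\u30FA\u30FC]+"),
    "hangul": re.compile(r"[\uAC00-\uD7A3]+"),
}


def single_script(text):
    """Return the name of the one script covering all of text, or None."""
    for name, pattern in SCRIPT_PATTERNS.items():
        if pattern.fullmatch(text):
            return name
    return None
```

A name that mixes two of these scripts returns None, which is exactly the 
"no mix-match" restriction described above.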

Warm Regards,
-James Lin






RE: Best practice of using regex on identify none-ASCII email address

2013-10-30 Thread Shawn Steele
EAI doesn't really specify anything "more" than the older SMTP about validating 
email addresses.  Everything in the local part >= U+0080 is permissible and up 
to the server to sort out what characters it wants to allow, how it wants to 
map things like Turkish I, etc.  Some code points are clearly really unhelpful 
in an email local part, but the EAI RFCs leave it up to the servers how they 
want to assign mailboxes.

Obviously you could check the domain name to make sure it's a valid domain 
name, and the ASCII range of the local part to make sure it respects the 
earlier RFCs, and the lengths, but you won't really know if it's a legal name 
until the mail does/doesn't get accepted by the server.  AFAIK there isn't a 
published regex for doing the limited validation that is possible.
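
As a sketch of what that limited validation could look like, here is a small 
Python function. It is an assumption-laden illustration, not a published 
rule set: the name plausible_address is hypothetical, quoted local parts are 
not handled, the length limits (64 octets for the local part, 255 for the 
domain) are the usual SMTP ones, and the domain is converted with Python's 
built-in "idna" codec, which implements the older IDNA 2003 rules.

```python
import re

# Unquoted ASCII characters allowed in a local part, plus the dot.
ASCII_LOCAL_OK = re.compile(r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~.-]")
LABEL_OK = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def plausible_address(address):
    """Cheap syntactic check; only the mail server knows if the mailbox exists."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        return False
    # Length limits are counted in octets of the UTF-8 form.
    if len(local.encode("utf-8")) > 64 or len(domain.encode("utf-8")) > 255:
        return False
    if local.startswith(".") or local.endswith(".") or ".." in local:
        return False
    for ch in local:
        # Everything >= U+0080 is left for the receiving server to sort out.
        if ord(ch) < 0x80 and not ASCII_LOCAL_OK.fullmatch(ch):
            return False
    try:
        ascii_domain = domain.encode("idna").decode("ascii")
    except UnicodeError:
        return False
    return all(LABEL_OK.fullmatch(label) for label in ascii_domain.split("."))
```

A True result only means the address is worth trying; acceptance or rejection 
by the server remains the real test.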

-Shawn



Re: Best practice of using regex on identify none-ASCII email address

2013-10-30 Thread James Lin
Let me include the unicode alias as well for a wider audience, since this topic 
has come up a few times in the past.

From: James Lin <james_...@symantec.com>
Date: Wednesday, October 30, 2013 at 1:11 PM
To: "cldr-us...@unicode.org" <cldr-us...@unicode.org>
Subject: Best practice of using regex on identify none-ASCII email address

Hi
Does anyone have a best practice or guideline on how to validate non-ASCII 
email addresses by using regular expressions?

I looked through RFC 6531 and the CLDR repository, and nothing has a solid 
example of how to validate non-ASCII email addresses.

thanks everyone.
-James