RE: Best practice of using regex on identify none-ASCII email address

Shawn Steele Wed, 30 Oct 2013 15:50:28 -0700

Mixed-script considerations are all supposed to be handled by the mailbox 
administrator.  It's perfectly valid for a domain to assign Latin addresses and 
also Cyrillic ones.  Indeed, for Cyrillic EAI, one would almost certainly 
require ASCII (i.e., Latin) aliases during whatever the transition period is.


A German mailbox admin may allow only German letters and no other Latin 
characters in their mailbox names.  Other admins may want to allow Latin 
characters alongside other scripts (CJK locales come to mind).  And a Russian 
admin may provide all-Cyrillic mailboxes with all-Latin aliases to those names.  
(Hopefully that admin is being careful about homographs, but the standards 
still let the admin make the decisions.)

The PUA isn't even forbidden (I'm hoping for a pIqaD alias some day).

-Shawn

From: [email protected] [mailto:[email protected]] On Behalf 
Of James Lin
Sent: Wednesday, October 30, 2013 2:58 PM
To: Paweł Dyda
Cc: [email protected]; [email protected]
Subject: Re: Best practice of using regex on identify none-ASCII email address

Hi,
I am not expecting a single regular expression to handle every possible 
combination of scripts.  What I am looking for (which may not be possible, 
given script combinations and mixed scripts) is something along the lines of 
validating each script with its own regular expression.  I am still considering 
whether it is possible to have regular expressions for individual scripts only, 
without mixing them (for the time being), such as (I am being very high level 
here):

  *    Phags-pa scripts

     *   Chinese: Traditional/Simplified
     *   Mongolian
     *   Sanskrit
     *   ...

  *   Kana scripts

     *   Japanese: Hiragana/Katakana
     *   ...

  *   Hebrew scripts

     *   Yiddish
     *   Hebrew
     *   Bukhori
     *   ...

  *   Latin scripts

     *   English
     *   Italian
     *   ....

  *   Hangul scripts

     *   Korean

  *   Cyrillic Scripts

     *   Russian
     *   Bulgarian
     *   Ukrainian
     *   ...
By focusing on one script at a time to derive a regular expression, I was 
wondering whether such validation can be accomplished.

Of course, RFC 3696 describes the email formatting rules, and we can use those 
rules to validate the format before checking the scripts for validity.
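
A minimal sketch of the per-script idea, assuming Python's standard `re` 
module.  The names `SCRIPT_PATTERNS` and `single_script` are hypothetical, and 
the ranges are Unicode block ranges that only approximate each script (they 
include a few non-letters and miss extension blocks), so this is an 
illustration rather than a complete validator:

```python
import re

# Hypothetical per-script patterns built from Unicode block ranges.
# Blocks only approximate scripts; this table is illustrative, not exhaustive.
SCRIPT_PATTERNS = {
    "latin": re.compile(r"[A-Za-z\u00C0-\u024F]+"),
    "cyrillic": re.compile(r"[\u0400-\u04FF]+"),
    "hebrew": re.compile(r"[\u0590-\u05FF]+"),
    "kana": re.compile(r"[\u3040-\u30FF]+"),
    "hangul": re.compile(r"[\uAC00-\uD7A3]+"),
}


def single_script(local_part):
    """Return the name of the one pattern that covers the whole string, or None."""
    for name, pattern in SCRIPT_PATTERNS.items():
        if pattern.fullmatch(local_part):
            return name
    return None
```

The sketch looks only at letters in the local part; digits, dots, and the 
general address format would still need a separate check, and a mixed-script 
string simply matches no pattern.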

Warm Regards,
-James Lin



From: Paweł Dyda <[email protected]>
Date: Wednesday, October 30, 2013 at 2:19 PM
To: James Lin <[email protected]>
Cc: [email protected], Unicode List <[email protected]>
Subject: Re: Best practice of using regex on identify none-ASCII email address

Hi James,
I am not sure if you have seen my email, but I believe regular expressions 
are not a valid tool for this job (that is, validating international email 
address formats).

In the internal email I gave one specific example where, to my knowledge, it 
is (nearly) impossible to use a regular expression to validate an email 
address.

The reason I gave was the mixed-script scenario.

How can we ensure that we allow a mixture of Hiragana, Katakana and Latin, 
while disallowing any other combination with Latin (especially Latin + 
Cyrillic or Latin + Greek)?
I am really curious to know...
And of course there are several single-script attacks (homographs and the 
like) that we might want to prevent. I don't think that is even remotely 
possible with regular expressions. Please correct me if I am wrong.
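
To make the mixed-script question concrete, here is a rough sketch in Python 
that checks the set of scripts in a string against an allow-list instead of 
using one regular expression.  Everything in it is an assumption for 
illustration: the policy in `ALLOWED_COMBINATIONS` is invented, and the script 
label is taken heuristically from the first word of each character's Unicode 
name, which is not the real Unicode Script property:

```python
import unicodedata

# Assumed policy: sets of scripts that may appear together in one local part.
ALLOWED_COMBINATIONS = [
    {"LATIN", "HIRAGANA", "KATAKANA"},
    {"CYRILLIC"},
    {"GREEK"},
]


def scripts_used(text):
    """Collect a rough script label for each letter from its Unicode name."""
    found = set()
    for ch in text:
        if ch.isalpha():
            # Heuristic: first word of the name, e.g. "CYRILLIC SMALL LETTER A".
            found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found


def combination_allowed(text):
    """True if every script found fits inside one allowed combination."""
    used = scripts_used(text)
    return any(used <= allowed for allowed in ALLOWED_COMBINATIONS)
```

Under this assumed policy a Latin + Kana string passes and a Latin string with 
one Cyrillic letter is rejected.  The check does nothing about single-script 
homographs, and the name-based labels break down for some characters (for 
example, marks shared between Hiragana and Katakana).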
Cheers,
Paweł.

2013/10/30 James Lin <[email protected]>
Let me include the Unicode alias as well for a wider audience, since this 
topic has come up a few times in the past.

From: James Lin <[email protected]>
Date: Wednesday, October 30, 2013 at 1:11 PM
To: [email protected]
Subject: Best practice of using regex on identify none-ASCII email address

Hi,
Does anyone have a best practice or guideline on how to validate non-ASCII 
email addresses using regular expressions?

I looked through RFC 6531 and the CLDR repository, and neither has a solid 
example of how to validate non-ASCII email addresses.

Thanks, everyone.
-James
