[ 
https://issues.apache.org/jira/browse/AVRO-1022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13206546#comment-13206546
 ] 

Scott Carey commented on AVRO-1022:
-----------------------------------

I see the wisdom in restricting names to a simple set of ASCII characters.  
Until just a few minutes ago, the arguments above were convincing me that the 
[A-Za-z_][A-Za-z0-9_]*
name format was a very useful simplification.  

But now I think names should be almost entirely open.  Defining names in terms 
of "isLetter() or isDigit()" is problematic, as pointed out above, so don't 
even bother with that.  How about defining the rule only with respect to ASCII?  
The naming rule in the spec would apply to ASCII characters only; all other 
code points would be allowed.  Unlike some notion of isLetter(), this does not 
imply that C or C++ needs a big library like ICU.  All implementations must 
already support UTF-8 in order to support JSON.  Languages can define 
internally how they map messy names to variables, types, or enum symbols. 
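
To make the ASCII-only proposal concrete, here is a minimal sketch in Python.  
It is not the Java validateName code and not wording from the spec; the 
function name is_valid_name is hypothetical.  It assumes the existing 
first-character and later-character rules are applied to ASCII characters 
only, and that every code point above 127 is accepted in any position.

```python
import string

# ASCII characters the existing rule allows in each position.
ASCII_START = set(string.ascii_letters + "_")
ASCII_REST = ASCII_START | set(string.digits)


def is_valid_name(name: str) -> bool:
    """Hypothetical check: the naming rule constrains ASCII characters only."""
    if not name:
        return False
    for index, ch in enumerate(name):
        if ord(ch) > 127:
            # Non-ASCII code points are always allowed in this sketch,
            # so no Unicode character tables (isLetter, ICU) are needed.
            continue
        allowed = ASCII_START if index == 0 else ASCII_REST
        if ch not in allowed:
            return False
    return True
```

Under these assumptions a name such as "naïve_field" is accepted, while 
"9lives" and "a-b" are still rejected because the offending characters are 
ASCII.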

If Avro restricts valid names, then schemas from other systems cannot always 
be converted into Avro schemas.

For example, how does this relate to 
https://issues.apache.org/jira/browse/PIG-1339 ?

If names are restricted, then consuming schemas from other systems will be 
difficult.  Fewer restrictions in Avro make it more compatible and capable.  

If there are stringent naming rules in the spec, the spec should also 
standardize how names from external sources are mangled into Avro names.

So I see two options that make sense:
* Enforce the restriction in the current spec, add flexibility for reading 
schemas that do not comply (that may have already been persisted into permanent 
storage), and add to the spec standardized name mangling for translating 
schemas from other systems to Avro and back.  
* Open up the spec for naming to be significantly more flexible.  At minimum, 
also allow all code points above 127.  Consider allowing more ASCII characters 
in names as well.

There are two kinds of mangling to consider.

* "External system" to and from Avro.  For example, a valid name in an external 
system might start with a number.  If translated into Avro and Avro does not 
allow this, it would be very useful if all languages could look at the 
resulting name and convert it back if required.  This should be standardized 
across Avro.  The fewer restrictions in Avro, the easier this translation 
process is.

* Avro to and from language identifiers in an implementation.  This is a 
different issue that is language local.  Because it is language local and up to 
the Avro implementation, this is less of a concern to me than translation from 
external schema sources.  Most languages don't allow a newline in an 
identifier, but should Avro disallow that?  Language implementations need to be 
prepared to mangle disallowed characters and strings regardless of what Avro 
specifies.
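
As an illustration of the first kind of mangling (external system to Avro and 
back), here is a minimal Python sketch of one reversible scheme.  The escape 
format is an assumption made for this example, not something in the spec or in 
any Avro implementation: a literal underscore becomes "__", and any character 
the current rule disallows (including a leading digit) becomes "_u<hex>_".  
The function names mangle and unmangle are hypothetical.

```python
import string

# Characters that pass through unchanged ("_" is handled separately).
PLAIN = set(string.ascii_letters + string.digits)


def mangle(name: str) -> str:
    """Encode an external name using only [A-Za-z0-9_] (hypothetical scheme)."""
    out = []
    for index, ch in enumerate(name):
        if ch == "_":
            out.append("__")  # escape the escape character itself
        elif ch in PLAIN and not (index == 0 and ch in string.digits):
            out.append(ch)
        else:
            out.append(f"_u{ord(ch):x}_")  # code point in hex
    return "".join(out)


def unmangle(mangled: str) -> str:
    """Invert mangle(); raises ValueError on malformed input."""
    out = []
    i = 0
    while i < len(mangled):
        if mangled[i] != "_":
            out.append(mangled[i])
            i += 1
        elif mangled.startswith("__", i):
            out.append("_")
            i += 2
        else:
            # Expect "_u<hex>_": find the closing underscore after the hex.
            end = mangled.index("_", i + 2)
            out.append(chr(int(mangled[i + 2:end], 16)))
            i = end + 1
    return "".join(out)
```

With this scheme, mangle("9lives") gives "_u39_lives" and unmangle() restores 
the original.  If every language applied the same rule, any implementation 
could recover the external name, which is the point of standardizing it.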

                
> Error in validate name
> ----------------------
>
>                 Key: AVRO-1022
>                 URL: https://issues.apache.org/jira/browse/AVRO-1022
>             Project: Avro
>          Issue Type: Bug
>          Components: java
>            Reporter: Raymie Stata
>            Priority: Minor
>         Attachments: AVRO-1022.patch, AVRO-1022.patch
>
>
> Fix schema.validateName to allow only ASCII letters, not Unicode letters.

