[ 
https://issues.apache.org/jira/browse/AVRO-2952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17293940#comment-17293940
 ] 

Werner Daehn commented on AVRO-2952:
------------------------------------

Phew, how should I respond...

Let me start with the "I am not a huge fan of" items:
 * The name {{AvroDatatype}}: I don't care. It was the first name that came to 
mind.
 * {{AvroNCLOB}} is a String with an underlying String: Let me defer that until 
I address our main difference in perception in the paragraph at the end.
 * implied "validation constraints": I have no problem with enforcing them. 
Essentially it is a design decision. Should existing code break at runtime, 
should it handle the violation gracefully, or is it just informational? Say we 
have an NVARCHAR(8) and the payload is a string with 10 chars. We could fail in 
the setter/conversion logic. We could truncate the text to 8 chars. Or we could 
retain the 10 chars - we don't know how the user wants to deal with it - and 
treat the information that the string *should* be 8 chars long as purely 
advisory. I wanted to avoid creating trouble, so I decided on the latter. 
I tried to replicate what Avro does today: you can put any value into any 
field - no error. At serialization it might fail or it might not. Hence I did 
not modify the values either. Example: put the value "Hello World" into a URI 
logical type. It does not raise an error anywhere.
 * conversion rules were surprising: My thought was that it is better to allow 
as many conversions as possible than to throw an error, and to document them. I 
actually documented all of them in an MD file but had to remove it due to the 
build validations. If you point me to the right place, I will write the 
conversion table.
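The three design options from the NVARCHAR(8) example above can be sketched in plain Java. This is only an illustration of the trade-off; the class, method, and enum names are hypothetical and not part of the patch or the Avro API:

```java
// Illustrative sketch of the three design options when a field declared as
// NVARCHAR(8) receives a 10-char payload. Names are hypothetical, not Avro API.
public class LengthPolicyDemo {

    enum Policy { FAIL, TRUNCATE, RETAIN }

    static String apply(String value, int declaredLength, Policy policy) {
        if (value.length() <= declaredLength) {
            return value; // within the declared length, nothing to do
        }
        switch (policy) {
            case FAIL:
                // break at runtime
                throw new IllegalArgumentException(
                        "value exceeds declared length " + declaredLength);
            case TRUNCATE:
                // handle it "gracefully" by cutting the payload
                return value.substring(0, declaredLength);
            case RETAIN:
            default:
                // keep the data untouched; the declared length stays advisory
                return value;
        }
    }

    public static void main(String[] args) {
        String payload = "HelloWorld"; // 10 chars against NVARCHAR(8)
        System.out.println(apply(payload, 8, Policy.TRUNCATE)); // prints "HelloWor"
        System.out.println(apply(payload, 8, Policy.RETAIN));   // prints "HelloWorld"
    }
}
```

The behavior described in the bullet above corresponds to RETAIN: the data is kept as-is and the declared length is purely informational, matching what Avro already does for, e.g., the URI logical type.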

Yes, I agree that more open logical type support would be better, but that 
probably means rewriting the existing LogicalType classes into something 
incompatible. I was not brave enough to do that.

But adding that logic as it is right now into the put/get methods, or into a 
second put/get method pair, is no big deal either.


And now for the elephant in the room: *carry as much metadata as possible*

It really depends on the point of view and use case. There are definitely use 
cases where you define the bare Avro schema as the gold standard and everybody 
has to align with it. Say Avro data ends up in Parquet files. Why would you 
care about an NVARCHAR(10)? It's a string, period. But then you do not need 
logical datatypes at all. We all know what a URI looks like, so why an 
additional logical datatype? For a good reason: to set a standard.

The nice thing about Avro, however, is that you can have both. You can augment 
the metadata, so whoever uses it knows that it is an NVARCHAR(10). And for 
those not using this additional information, it is an Avro String. You can use 
the additional metadata, but you do not have to. User's choice. That's a 
comfortable position and one of the many reasons I like Avro so much.
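As a sketch of what such an augmented schema could look like - the {{NVARCHAR}} logical type name and the {{length}} property here are illustrative only, not an accepted part of the spec:

```json
{
  "type": "record",
  "name": "Customer",
  "fields": [
    {
      "name": "LAST_NAME",
      "type": {
        "type": "string",
        "logicalType": "NVARCHAR",
        "length": 10
      }
    }
  ]
}
```

A consumer that understands the annotation can create an NVARCHAR(10) column; every other consumer simply sees a plain Avro string, because unknown logical types and extra properties are ignored.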

I am coming from the data integration space. The source is a database, the 
data is converted to Avro, streamed through Kafka, and there can be many 
consumers. Parquet is one consumer, but another database can be another. 
Ideally the target database should load the source column of type NVARCHAR(10) 
into an NVARCHAR(10) column, shouldn't it? But if the Avro schema does not 
provide that information, all it tells us is "string", and the only feasible 
target datatype is either NVARCHAR(4000) - and pray it is enough - or NCLOB 
for everything. Either way the result is unusable. NCLOBs are really slow in 
databases and have lots of limitations (no indexing, not all functions, no 
primary key, ...). Tables with all columns declared NVARCHAR(4000), even 
fields like GENDER_CODE, do not look good either.
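The consumer-side decision described above boils down to a trivial mapping once the metadata exists. A minimal sketch, assuming the proposed {{NVARCHAR}}/{{NCLOB}} logical type names; the class and method are hypothetical, not part of the patch:

```java
// Hypothetical mapping from the proposed Avro metadata to a target column type.
// With no metadata at all, the consumer is stuck with a worst-case guess.
public class ColumnTypeMapper {

    static String toColumnType(String logicalType, Integer length) {
        if ("NVARCHAR".equals(logicalType) && length != null) {
            return "NVARCHAR(" + length + ")"; // exact match to the source column
        }
        if ("NCLOB".equals(logicalType)) {
            return "NCLOB"; // known to be large, so a LOB column is justified
        }
        // plain Avro "string" with no extra metadata: guess and pray
        return "NVARCHAR(4000)";
    }

    public static void main(String[] args) {
        System.out.println(toColumnType("NVARCHAR", 10)); // prints "NVARCHAR(10)"
        System.out.println(toColumnType(null, null));     // prints "NVARCHAR(4000)"
    }
}
```

The last branch is exactly the unusable situation described above: every string column, including GENDER_CODE, becomes NVARCHAR(4000).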

The only option is to provide additional metadata. As long as Avro does not 
standardize it, everybody has to invent their own custom properties, custom 
logical types, or documented rules, and in the end the tools are inconsistent. 
Hence my suggestion to standardize that, at least for the JDBC-based data 
types.

In other words, what is the harm in carrying more metadata? More information is 
always better, especially if it is optional to provide and optional to use, 
costs nothing, and breaks nothing!

And that also answers the difference between an Avro String primitive and an 
NCLOB logical type backed by an Avro String: the native Avro string says "I 
have no clue what data it will contain; it can be anything." The NCLOB as 
additional metadata says "I don't know how big it will be, but it will be 
large - otherwise the datatype would have been an NVARCHAR(10) or similar."
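In schema terms the two differ only in the annotation; again, the {{NCLOB}} logical type name is the proposal, not an accepted standard:

```json
[
  {"name": "ANY_TEXT", "type": "string"},
  {"name": "LARGE_TEXT",
   "type": {"type": "string", "logicalType": "NCLOB"}}
]
```

Both fields serialize identically; only a consumer that understands the annotation treats LARGE_TEXT differently, e.g. by choosing a LOB column.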


> Logical Types and Conversions enhancements
> ------------------------------------------
>
>                 Key: AVRO-2952
>                 URL: https://issues.apache.org/jira/browse/AVRO-2952
>             Project: Apache Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.10.1
>            Reporter: Werner Daehn
>            Priority: Critical
>             Fix For: 1.10.2
>
>
> *Summary*:
>  * Added a method *Field.getDataType()* which returns an object with common 
> data type related methods. Most important are methods to call the converters.
>  * Added trivial LogicalTypes to allow better database integration, e.g. the 
> LogicalType VARCHAR(10) is a STRING that carries the information that only 
> ASCII chars are in the payload and up to 10 chars max.
> *Example*:
> The user has a record with one ENUM field and he has a Java enum class. 
> Instead of manually converting the Java string into a GenericEnumSymbol he 
> can use the convertToRawType of the AvroDataType class.
> f...Field to be set
> {{testRecord.put(f.name(), 
> f.getDataType().convertToRawType(myEnum.male.name()));}}
> Using {{f.getDataType().convertToRawType()}} does all the conversion. I 
> considered adding that conversion into the put() method itself but feared 
> side effects. So the user has to invoke convertToRawType().
> *Reasoning*:
> I am working with Avro (Kafka) for two years now and have implemented 
> improvements around Logical Types. These I merged into the Avro code with 
> zero side effects - pure additions. No breaking changes for other Avro users 
> but a great help for them.
> Imagine you connect two databases via Kafka using Avro as the message payload.
>  # The first problem you will be facing is that RawTypes and LogicalTypes are 
> handled differently. For LogicalTypes there are conversion functions that 
> provide metadata (e.g. getConvertedType returns that a Java Instant is the 
> best data type for a timestamp-millis, plus conversion logic). For raw types 
> there is no such thing. A Boolean can be provided as true, "TRUE", 1, ...
>  # Second problem will be the lack of getObject()/setObject() methods similar 
> to JDBC. The result are endless switch-case lists to call the correct 
> methods. In every single project for every user.
>  # Number three is the usage of the Converters as such. The intended usage is 
> to add converters to the GenericData and the reader/writer uses the best 
> suited converter. What I have seen most people do however is to use the 
> converters manually and assign the raw value directly. While adding 
> converters is possible still, the conversion at GenericRecord.put() and 
> GenericRecord.get() is easy now.
>  # For a data exchange format like Avro, it is important to carry as much 
> metadata as possible. For example, purely seen from Avro a STRING data type 
> is just fine. 99% of the string data types in a database are VARCHAR(length) 
> and NVARCHAR(length). While putting an ASCII String of length 10 into a 
> STRING is no problem, on the consumer side the only matching data type is a 
> NCLOB - the worst for a database. The LogicalTypes provide such nice methods 
> to carry such metadata, e.g. a LogicalType VARCHAR(10) backed by a String. 
> These Logical Types do not have any conversion, they just exist for the 
> metadata. You have such a thing already with the UUID LogicalType.
>  
> *Changes*:
>  * A new package logicaltypes created. It includes all new LogicalTypes and 
> the AvroDataType implementations for the various raw data types.
>  * The existing LogicalTypes are unchanged. The corresponding classes in the 
> logicaltype package just extend them.
>  * For that some LogicalType fields needed to be made public.
>  * The LogicalTypes return the more detailed logicaltype.* classes.
>  * A test class created.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
