JFinis commented on PR #184:
URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1587367949

   > > It isn't clear to me if this should be a logical type or a physical 
type. We would need to understand whether there is different handling for 
forward compatibility purposes (what do we want the desired behavior to be?). I 
think C++ might be lenient here, but I don't know about parquet-mr. @gszadovszky, 
thoughts?
   > 
   > I think the basic idea behind having physical and logical types is to 
support forward compatibility since we can always represent (somehow) a 
long-existing physical type while logical types are getting extended. 
Parquet-mr should work fine with "unknown" logical types by reading them back as 
un-annotated physical values (a `Binary` with two bytes in this case). So, if 
the community supports having a half-precision floating point type I would vote 
on specifying it as a logical type.
   > 
   > The tricky thing will be the implementations. Even though parquet-mr does 
not really care about converting the values according to their logical types, we 
still need to care about the logical types for ordering (min/max values in 
the statistics). It would not be too easy to implement the half-precision 
floating point comparison logic since Java does not have such a primitive type. 
(BTW the sort order of floating point numbers is still an open issue: 
[PARQUET-1222](https://issues.apache.org/jira/browse/PARQUET-1222))
   
   FWIW, I rather think it should be a physical type for the following reasons:
   
   * Encodings are currently defined only on the physical type, not the logical 
one. So allowing BYTE_STREAM_SPLIT for this type would break that rule if 
it is a logical type.
   * Having this be a logical type while float and double are physical types 
seems inconsistent.
   * There might eventually be hardware support or native language support 
for this type. In that case, having it as a physical type would make it easier 
to leverage this hardware / language support, as most libraries instantiate 
encoders/decoders based on the physical type. Again, having one exception 
where you would need a decoder based on a *logical* type would break this 
pattern and require additional effort. If Java and C++ had a float16 type, I 
guess more people would agree that it should be a physical type. So is the 
intuition of this being a logical type just based on the still-missing language 
support?
   * IMHO, the basic idea behind physical and logical types is not to support 
forward compatibility; that is just a byproduct. Otherwise, there should just 
be one or two physical types in the first place (FIXED_LEN_BYTE_ARRAY and 
BYTE_ARRAY). The basic idea is rather to make a distinction between physical 
representation and what the values logically mean. In my mental model it is 
rather a layered approach: There are layers that only care about the physical 
types (e.g., the encoders/decoders) and then further layers that also care 
about the logical type (e.g. the statistics maintenance code). And here again, 
a logical float16 type that needs its own decoder would break this layering.
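
To make the layering argument concrete, here is a minimal sketch, not taken from any Parquet implementation. The function names `decode_plain` and `min_max` are hypothetical, and it assumes a float16 value is stored as a 2-byte little-endian `FIXED_LEN_BYTE_ARRAY` carrying a hypothetical `FLOAT16` annotation. The physical layer picks a decoder from the physical type alone, while the statistics layer needs the logical type to order values correctly. NaN and signed-zero handling (the PARQUET-1222 question) is deliberately ignored.

```python
import struct


def decode_plain(physical_type, raw, type_length=None):
    """Physical layer: choose a decoder from the physical type alone."""
    if physical_type == "FLOAT":
        return [v for (v,) in struct.iter_unpack("<f", raw)]
    if physical_type == "DOUBLE":
        return [v for (v,) in struct.iter_unpack("<d", raw)]
    if physical_type == "FIXED_LEN_BYTE_ARRAY":
        # Values stay opaque byte strings at this layer.
        return [raw[i:i + type_length] for i in range(0, len(raw), type_length)]
    raise ValueError(f"unsupported physical type: {physical_type}")


def min_max(values, logical_type=None):
    """Logical layer: ordering depends on what the values mean."""
    if logical_type == "FLOAT16":  # hypothetical annotation name
        # Interpret each 2-byte value as an IEEE 754 half-precision float.
        def as_float(b):
            return struct.unpack("<e", b)[0]
        return min(values, key=as_float), max(values, key=as_float)
    # Without the annotation, byte strings compare lexicographically.
    return min(values), max(values)


raw = struct.pack("<3e", 1.5, -2.0, 0.25)
values = decode_plain("FIXED_LEN_BYTE_ARRAY", raw, type_length=2)
lo, hi = min_max(values, logical_type="FLOAT16")
bytewise_lo, bytewise_hi = min_max(values)
```

In this sketch a reader that does not know the annotation still decodes the column, but its byte-wise min/max no longer matches the numeric order, which is why the statistics code has to be aware of the logical type.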
   
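On the BYTE_STREAM_SPLIT point in the first bullet, the following sketch shows why the encoding is naturally a function of the physical value width only. The helper names are hypothetical and the code is an illustration of the byte-scattering idea, not the parquet-mr or Arrow implementation: byte `k` of every value goes into stream `k`, and the streams are concatenated.

```python
import struct


def byte_stream_split_encode(raw, width):
    """Scatter byte k of each fixed-width value into stream k, then concatenate."""
    return b"".join(raw[k::width] for k in range(width))


def byte_stream_split_decode(encoded, width):
    """Inverse of the encoder: interleave the streams back into values."""
    count = len(encoded) // width
    out = bytearray(len(encoded))
    for k in range(width):
        out[k::width] = encoded[k * count:(k + 1) * count]
    return bytes(out)


# Three half-precision values, assumed to be stored as 2-byte little-endian.
raw = struct.pack("<3e", 1.5, -2.0, 0.25)
encoded = byte_stream_split_encode(raw, width=2)
decoded = byte_stream_split_decode(encoded, width=2)
```

Nothing here needs to know that the two bytes mean "float16"; the encoder only needs the width, which is information the physical type carries.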

