[jira] [Commented] (PARQUET-758) [Format] HALF precision FLOAT Logical type

ASF GitHub Bot (Jira) Mon, 12 Jun 2023 06:42:04 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731621#comment-17731621
 ]


ASF GitHub Bot commented on PARQUET-758:
----------------------------------------

JFinis commented on PR #184:
URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1587367949

   > > It isn't clear to me if this should be a logical type or a physical 
type. We would need to understand if there is different handling for forward 
compatibility purposes (what do we want the desired behavior to be). I think 
C++ might be lenient here, but don't know about parquet-mr @gszadovszky 
thoughts?
   > 
   > I think the basic idea behind having physical and logical types is to 
support forward compatibility since we can always represent (somehow) a 
long-existing physical type while logical types are getting extended. 
Parquet-mr should work fine with "unknown" logical types by reading them back as 
un-annotated physical values (a `Binary` with two bytes in this case). So, if 
the community supports having a half-precision floating point type I would vote 
on specifying it as a logical type.
   > 
   > The tricky thing will be the implementations. Even though parquet-mr does 
not really care about converting the values according to their logical types we 
still need to care about the logical types at the ordering (min/max values in 
the statistics). It would not be too easy to implement the half-precision 
floating point comparison logic since java does not have such a primitive type. 
(BTW the sorting order of floating point numbers is still an open issue: 
[PARQUET-1222](https://issues.apache.org/jira/browse/PARQUET-1222))
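
To illustrate the ordering concern quoted above, here is a minimal Python sketch of min/max statistics over raw two-byte half-precision values. It assumes the two bytes are an IEEE 754 binary16 value stored little-endian, and it simply ignores NaNs; both are assumptions for illustration, and the NaN and signed-zero behavior is exactly what PARQUET-1222 leaves open. The function names are hypothetical and not taken from parquet-mr or any other Parquet library.

```python
import math
import struct


def decode_half(raw: bytes) -> float:
    """Decode a 2-byte IEEE 754 binary16 value (little-endian is an assumption)."""
    if len(raw) != 2:
        raise ValueError("expected exactly 2 bytes")
    return struct.unpack("<e", raw)[0]


def half_min_max(values):
    """Return (min, max) as raw 2-byte values, skipping NaNs (an assumption).

    Returns None when there is no non-NaN value to compare.
    """
    decoded = [(decode_half(v), v) for v in values]
    ordered = [(f, v) for f, v in decoded if not math.isnan(f)]
    if not ordered:
        return None
    lo = min(ordered, key=lambda pair: pair[0])[1]
    hi = max(ordered, key=lambda pair: pair[0])[1]
    return lo, hi
```

Comparing the raw bytes directly would give a different answer: under this layout -2.0 is stored as `00 c0` and 1.5 as `00 3e`, so an unsigned byte-wise comparison would rank -2.0 above 1.5. A reader that treats the column as un-annotated binary therefore cannot produce or trust these statistics without decoding.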
   
   FWIW, I rather think it should be a physical type for the following reasons:
   
   * Encodings are currently only defined on the physical type, not the logical 
one. So allowing BYTE_STREAM_SPLIT for this type would actually break that 
rule if it is a logical type.
   * Having this be a logical type while float and double are physical types 
seems inconsistent.
   * There might eventually be hardware support or native language support for 
this type. In this case, having it as a physical type would make it easier to 
leverage this hardware / language support, as most libraries instantiate 
encoders/decoders based on the physical type. Again, having now one exception 
where you would need a decoder based on a *logical* type would break this 
pattern and require additional effort. If Java and C++ had a float16 type, I 
guess more people would agree that it should be a physical type. So is the 
intuition of this being a logical type just based on the yet missing language 
support for this?
   * IMHO, the basic idea behind physical and logical types is not to support 
forward compatibility; that is just a byproduct. Otherwise, there should just 
be one or two physical types in the first place (FIXED_LEN_BYTE_ARRAY and 
BYTE_ARRAY). The basic idea is rather to make a distinction between physical 
representation and what the values logically mean. In my mental model it is 
rather a layered approach: There are layers that only care about the physical 
types (e.g., the encoders/decoders) and then further layers that also care 
about the logical type (e.g. the statistics maintenance code). And here again, 
this would break this layering.
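
As a rough illustration of the encoding point in the first bullet, the sketch below applies the BYTE_STREAM_SPLIT idea to two-byte values: byte `i` of every value goes into stream `i`, and the streams are concatenated. It only needs the value width, not the logical meaning of the bytes, which is why the encoder choice is normally keyed on the physical type. The function names are hypothetical and this is not the code of any Parquet implementation.

```python
def byte_stream_split_encode(data: bytes, width: int) -> bytes:
    """Scatter byte i of each fixed-width value into stream i, then concatenate."""
    if width <= 0 or len(data) % width != 0:
        raise ValueError("data length must be a multiple of the value width")
    return b"".join(data[i::width] for i in range(width))


def byte_stream_split_decode(encoded: bytes, width: int) -> bytes:
    """Inverse of byte_stream_split_encode: interleave the streams again."""
    if width <= 0 or len(encoded) % width != 0:
        raise ValueError("encoded length must be a multiple of the value width")
    count = len(encoded) // width
    out = bytearray(len(encoded))
    for i in range(width):
        # Stream i holds byte i of every value, in value order.
        out[i::width] = encoded[i * count:(i + 1) * count]
    return bytes(out)
```

For three half-precision values 1.0, 1.5 and -2.0 (assumed little-endian, bytes `00 3c 00 3e 00 c0`), encoding with `width=2` yields `00 00 00 3c 3e c0`: the low bytes end up together, followed by the high bytes.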
   




> [Format] HALF precision FLOAT Logical type
> ------------------------------------------
>
>                 Key: PARQUET-758
>                 URL: https://issues.apache.org/jira/browse/PARQUET-758
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Julien Le Dem
>            Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
