[
https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731621#comment-17731621
]
ASF GitHub Bot commented on PARQUET-758:
----------------------------------------
JFinis commented on PR #184:
URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1587367949
> > It isn't clear to me if this should be a logical type or a physical
type. We would need to understand if there is different handling for forward
compatibility purposes (what do we want the desired behavior to be). I think
C++ might be lenient here, but I don't know about parquet-mr. @gszadovszky
thoughts?
>
> I think the basic idea behind having physical and logical types is to
support forward compatibility since we can always represent (somehow) a
long-existing physical type while logical types are getting extended.
Parquet-mr should work fine with "unknown" logical types by reading it back as
an un-annotated physical value (a `Binary` with two bytes in this case). So, if
the community supports having a half-precision floating point type I would vote
on specifying it as a logical type.
>
> The tricky thing will be the implementations. Even though parquet-mr does
not really care about converting the values according to their logical types, we
still need to care about the logical types for ordering (min/max values in
the statistics). It would not be too easy to implement the half-precision
floating point comparison logic since Java does not have such a primitive type.
(BTW the sorting order of floating point numbers is still an open issue:
[PARQUET-1222](https://issues.apache.org/jira/browse/PARQUET-1222))
FWIW, I rather think it should be a physical type for the following reasons:
* encodings are currently only defined on the physical type, not the logical
one. So allowing BYTE_STREAM_SPLIT for this type would actually break this if
it is a logical type.
* Having this be a logical type while float and double are physical types
seems inconsistent.
* There might eventually be hardware support or native language support for
this type. In this case, having it as a physical type would make it easier
to leverage this hardware / language support, as most libraries instantiate
encoders/decoders based on the physical type. Again, having one exception
where you would need a decoder based on a *logical* type would break this
pattern and require additional effort. If Java and C++ had a float16 type, I
guess more people would agree that it should be a physical type. So is the
intuition of this being a logical type just based on the yet missing language
support for this?
* IMHO, the basic idea behind physical and logical types is not to support
forward compatibility; that is just a byproduct. Otherwise, there should just
be one or two physical types in the first place (FIXED_LEN_BYTE_ARRAY and
BYTE_ARRAY). The basic idea is rather to make a distinction between physical
representation and what the values logically mean. In my mental model it is
rather a layered approach: There are layers that only care about the physical
types (e.g., the encoders/decoders) and then further layers that also care
about the logical type (e.g. the statistics maintenance code). And here again,
making float16 a logical type would break this layering.
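
To make the layering argument concrete, here is a minimal Python sketch (not parquet-mr or parquet-cpp code). All function names are hypothetical, and two things are assumptions made only for illustration: that a half-precision value is stored as 2 little-endian IEEE 754 binary16 bytes, and that BYTE_STREAM_SPLIT is applied to a 2-byte width. The encoder layer only needs the byte width, while the statistics layer has to interpret the bytes numerically; NaN values are simply skipped here because float ordering is still open (PARQUET-1222).

```python
import math
import struct


def byte_stream_split(values, width):
    """Encoder layer: put byte k of every value into stream k.

    Only the physical width matters here, not what the bytes mean.
    """
    assert all(len(v) == width for v in values)
    return b"".join(bytes(v[k] for v in values) for k in range(width))


def byte_stream_join(data, width):
    """Decoder layer: inverse of byte_stream_split."""
    n = len(data) // width
    return [bytes(data[k * n + i] for k in range(width)) for i in range(n)]


def half_to_float(raw):
    """Interpret 2 bytes as a little-endian binary16 value (assumed layout)."""
    return struct.unpack("<e", raw)[0]


def half_min_max(values):
    """Statistics layer: min/max by numeric value, skipping NaN (assumption)."""
    nums = [v for v in values if not math.isnan(half_to_float(v))]
    return min(nums, key=half_to_float), max(nums, key=half_to_float)


# Three values that are exactly representable in half precision.
raw = [struct.pack("<e", x) for x in (1.5, -2.0, 0.25)]

encoded = byte_stream_split(raw, 2)
assert byte_stream_join(encoded, 2) == raw  # round trip needs no float logic

lo, hi = half_min_max(raw)
print(half_to_float(lo), half_to_float(hi))

# A plain byte-wise comparison picks the wrong maximum (-2.0), which is
# why the statistics code cannot treat the values as opaque binary.
print(half_to_float(max(raw)))
```

The round trip through the split encoding never looks at the numeric meaning of the bytes, whereas the min/max computation cannot avoid it; that is the distinction between the two layers described above.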
> [Format] HALF precision FLOAT Logical type
> ------------------------------------------
>
> Key: PARQUET-758
> URL: https://issues.apache.org/jira/browse/PARQUET-758
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Julien Le Dem
> Priority: Minor
>