paleolimbot commented on code in PR #10981:
URL: https://github.com/apache/iceberg/pull/10981#discussion_r1893240333
##########
format/spec.md:
##########
@@ -584,8 +589,8 @@ The schema of a manifest file is a struct called
`manifest_entry` with the follo
| _optional_ | _optional_ | _optional_ | **`110 null_value_counts`** |
`map<121: int, 122: long>` |
Map from column id to number of null values in the column
|
| _optional_ | _optional_ | _optional_ | **`137 nan_value_counts`** |
`map<138: int, 139: long>` |
Map from column id to number of NaN values in the column
|
| _optional_ | _optional_ | _optional_ | **`111 distinct_counts`** |
`map<123: int, 124: long>` |
Map from column id to number of distinct values in the column; distinct counts
must be derived using values in the file by counting or using sketches, but not
using methods like merging existing distinct counts |
-| _optional_ | _optional_ | _optional_ | **`125 lower_bounds`** |
`map<126: int, 127: binary>` |
Map from column id to lower bound in the column serialized as binary [1]. Each
value must be less than or equal to all non-null, non-NaN values in the column
for the file [2] |
-| _optional_ | _optional_ | _optional_ | **`128 upper_bounds`** |
`map<129: int, 130: binary>` |
Map from column id to upper bound in the column serialized as binary [1]. Each
value must be greater than or equal to all non-null, non-Nan values in the
column for the file [2] |
+| _optional_ | _optional_ | _optional_ | **`125 lower_bounds`** |
`map<126: int, 127: binary>` |
Map from column id to lower bound in the column serialized as binary [1]. Each
value must be less than or equal to all non-null, non-NaN values in the column
for the file [2]. See [7] for`geometry` and [8] for `geography`. |
+| _optional_ | _optional_ | _optional_ | **`128 upper_bounds`** |
`map<129: int, 130: binary>` |
Map from column id to upper bound in the column serialized as binary [1]. Each
value must be greater than or equal to all non-null, non-Nan values in the
column for the file [2]. See [9] for `geometry` and [10] for `geography`. |
Review Comment:
If we do want to do this (and it may be a good idea to *not* do this because
it introduces another discussion point and increases the complexity of the
Geometry type that is not strictly necessary), here is an option. It involves
separating the geometry and geography definitions because I can't think of a
concise way to combine them.
```
7. `geometry`, this is a point: X, Y, Z, and M take the min value of all
component points of all geometries in file. For the X value only, this is
permitted to be the westernmost longitude if the CRS is a geographic CRS. See
Appendix D for encoding.
8. `geography`, this is a point: X = westernmost bound of all geometries in
file, Y = northernmost bound of all geometries in file, Z is min value for all
component points of all geometries in the file, M is min value of all component
points of all geometries in the file. See Appendix D for encoding.
9. `geometry`, this is a point: X, Y, Z, and M take the max value of all
component points of all geometries in file. For the X value only, this is
permitted to be the westernmost longitude if the CRS is a geographic CRS. See
Appendix D for encoding.
10. `geography`, this is a point: X = easternmost bound of all geometries in
file, Y = southernmost bound of all geometries in file, Z is max value for all
component points of all geometries in the file, M is max value of all component
points of all geometries in the file. See Appendix D for encoding.
11. `geography` and `geometry`, the concepts of westernmost and easternmost
values are explicitly introduced to address cases involving anti-meridian
crossing, where the `lower_bound` may be greater than `upper_bound`. The
canonical ranges for the bounding box covering all points in the coordinate
system is [-180 180] for the west-east range and [-90 90] for the south-north
range.
```
This does make it harder for the low-level code that has to deal with this
(e.g., something like Parquet in C++), which has to implement a simple min/max
pushdown AND a geographic wrapped bounding box pushdown anyway, but if we add
this exception, the low-level implementation needs to be told if the CRS is
geographic. (I am not sure if in practice this will actually be a problem).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]