gortiz commented on code in PR #11824:
URL: https://github.com/apache/pinot/pull/11824#discussion_r1371372462
##########
pinot-spi/src/main/java/org/apache/pinot/spi/data/FieldSpec.java:
##########
@@ -300,6 +303,27 @@ public void setTransformFunction(@Nullable String
transformFunction) {
_transformFunction = transformFunction;
}
+ /**
+ * Returns whether the column is nullable or not.
+ *
+ * This Java property itself is nullable. In case null is returned, actual
nullability depends on the value of
+ * {@link IndexingConfig#isNullHandlingEnabled()}.
+ *
+ * @return true if the field is nullable, false if it is not and null if
column level nullability should be delegated
+ * to {@link IndexingConfig#isNullHandlingEnabled()}.
+ */
+ @Nullable
+ public Boolean getNullable() {
Review Comment:
> Do you see a clear path to move away from the table level config?
I think it is clear:
1. Phase 1:
* For read path: Use `FieldSpec` nullability if declared, otherwise
delegate on `IndexingConfig.isNullHandlingEnabled()`
* For write path: Use `IndexingConfig.isNullHandlingEnabled()`
2. Phase 2:
* For read path (same as Phase 1): Use `FieldSpec` nullability if
declared, otherwise delegate on `IndexingConfig.isNullHandlingEnabled()`
* For write path: Use `FieldSpec` nullability if declared, otherwise
delegate on `IndexingConfig.isNullHandlingEnabled()`
3. Phase 3:
* Deprecate `IndexingConfig.isNullHandlingEnabled()` (although we still
need to support it for compatibility reasons)
> There are potential big performance impact when a column is registered as
nullable, so I wouldn't register it nullable unless it is explicitly configured
nullable here.
Sure, we need to document that and users should know that. With this change
we expect to reduce the number of columns that are null right now, given what
we are providing (specially in phase 1) is the ability to make columns not
nullable in tables that are nullable by default.
> should we default it to false unless otherwise specified (and thus no need
for non-primitive boolean)
I don't think that is a good idea. Without making this non-primitive we
cannot know whether the value was explicitly defined as false or it has
defaulted to false. We would also be breaking backward compatibility. By using
a primitive boolean with false by default, in order to make a column nullable
users would need to:
- In Phase 1:
- Set `IndexingConfig.isNullHandlingEnabled()` to true. This is required
at write time, given in V1 write path ignores `FieldSpec` nullability.
- Set each, for each nullable column, `FieldSpec` nullability to true.
This is a breaking change. All tables already created would need to be
modified, otherwise they would be suddenly treated as not null.
- In Phase 2:
- Set each, for each nullable column, `FieldSpec` nullability to true.
The current implementation is backward compatible (users don't need to set
`FieldSpec` nullability). The semantics can be read in
`NullValueIndexTypeTest`. Another advantage of using a non-primitive boolean is
that we don't lose information when field specs are serialized and deserialized
without having to do any fancy thing.
For example, a FieldSpec where nullability is not defined will be serialized
to:
```json
{
"name": "col1",
"dataType": "INT"
}
```
while in your proposal it would be serialized by default to:
```json
{
"name": "col1",
"dataType": "INT",
"nullable": false
}
```
Alternatively we could not serialize nullable when its value is false, but
then we would lose information when it was explicitly declared.
> as long as the query flag EnableNullHandling overrides this setting we
should be good for v1 compatibility --> as in v2 we will no use this option
`enableNullHandling` query flag does not override this value. They are
applied at different parts of the code.
- Column level nullability indicates:
- In phase 1: Whether the null value vector should be considered for this
column or not.
- In phase 2: Phase 1 plus also be used to actually decide whether the
null value vector is generated or not.
- `enableNullHandling` query flag indicates whether we use null value vector
during query execution or not. It basically means to change the algorithm used
to evaluate some operators. For example, a `count(col1)` without
`enableNullHandling` can be resolved reading segment metainfo while a the same
count with `enableNullHandling` also requires to read the null value vector to
do substract the number of null values in the column
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]