gortiz commented on code in PR #11824:
URL: https://github.com/apache/pinot/pull/11824#discussion_r1371372462


##########
pinot-spi/src/main/java/org/apache/pinot/spi/data/FieldSpec.java:
##########
@@ -300,6 +303,27 @@ public void setTransformFunction(@Nullable String 
transformFunction) {
     _transformFunction = transformFunction;
   }
 
+  /**
+   * Returns whether the column is nullable or not.
+   *
+   * This Java property itself is nullable. In case null is returned, actual 
nullability depends on the value of
+   * {@link IndexingConfig#isNullHandlingEnabled()}.
+   *
+   * @return true if the field is nullable, false if it is not and null if 
column level nullability should be delegated
+   * to {@link IndexingConfig#isNullHandlingEnabled()}.
+   */
+  @Nullable
+  public Boolean getNullable() {

Review Comment:
   > Do you see a clear path to move away from the table level config?
   I think it is clear:
   1. Phase 1: 
      * For read path: Use `FieldSpec` nullability if declared, otherwise 
delegate on `IndexingConfig.isNullHandlingEnabled()`
      * For write path: Use `IndexingConfig.isNullHandlingEnabled()`
   2. Phase 2:
      * For read path (same as Phase 1): Use `FieldSpec` nullability if 
declared, otherwise delegate on `IndexingConfig.isNullHandlingEnabled()`
      * For write path: Use `FieldSpec` nullability if declared, otherwise 
delegate on `IndexingConfig.isNullHandlingEnabled()`
   3. Phase 3:
     * Deprecate `IndexingConfig.isNullHandlingEnabled()` (although we still 
need to support it for compatibility reasons)
   
   > There are potential big performance impact when a column is registered as 
nullable, so I wouldn't register it nullable unless it is explicitly configured 
nullable here.
   
   Sure, we need to document that and users should know that. With this change 
we expect to reduce the number of columns that are null right now, given what 
we are providing (specially in phase 1) is the ability to make columns not 
nullable in tables that are nullable by default.
   
   > should we default it to false unless otherwise specified (and thus no need 
for non-primitive boolean)
   I don't think that is a good idea. Without making this non-primitive we 
cannot know whether the value was explicitly defined as false or it has 
defaulted to false. We would also be breaking backward compatibility. By using 
a primitive boolean with false by default, in order to make a column nullable 
users would need to:
   
   - In Phase 1:
      - Set `IndexingConfig.isNullHandlingEnabled()` to true. This is required 
at write time, given in V1 write path ignores `FieldSpec` nullability.
      - Set each, for each nullable column, `FieldSpec` nullability to true. 
This is a breaking change. All tables already created would need to be 
modified, otherwise they would be suddenly treated as not null.
   - In Phase 2:
      - Set each, for each nullable column, `FieldSpec` nullability to true.
   
   The current implementation is backward compatible (users don't need to set 
`FieldSpec` nullability). The semantics can be read in 
`NullValueIndexTypeTest`. Another advantage of using a non-primitive boolean is 
that we don't lose information when field specs are serialized and deserialized 
without having to do any fancy thing.
   
   For example, a FieldSpec where nullability is not defined will be serialized 
to:
   ```json
   {
     "name": "col1",
     "dataType": "INT"
   }
   ```
   
   while in your proposal it would be serialized by default to:
   ```json
   {
     "name": "col1",
     "dataType": "INT",
     "nullable": false
   }
   ```
   
   Alternatively we could not serialize nullable when its value is false, but 
then we would lose information when it was explicitly declared.
   
   > as long as the query flag EnableNullHandling overrides this setting we 
should be good for v1 compatibility --> as in v2 we will no use this option
   
   `enableNullHandling` query flag does not override this value. They are 
applied at different parts of the code.
   
   - Column level nullability indicates:
     - In phase 1: Whether the null value vector should be considered for this 
column or not.
     - In phase 2: Phase 1 plus also be used to actually decide whether the 
null value vector is generated or not.
   - `enableNullHandling` query flag indicates whether we use null value vector 
during query execution or not. It basically means to change the algorithm used 
to evaluate some operators. For example, a `count(col1)` without 
`enableNullHandling` can be resolved reading segment metainfo while a the same 
count with `enableNullHandling` also requires to read the null value vector to 
do substract the number of null values in the column



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to