Re: [PR] Add PrePlanTable and PlanTable Endpoints to open api spec [iceberg]

via GitHub Tue, 16 Jul 2024 15:30:15 -0700


syun64 commented on code in PR #9695:
URL: https://github.com/apache/iceberg/pull/9695#discussion_r1680131625



##########
open-api/rest-catalog-open-api.yaml:
##########
@@ -3642,6 +3781,173 @@ components:
             type: integer
           description: "List of equality field IDs"
 
+    PreplanTableRequest:
+      type: object
+      required:
+        - table-scan-context
+      properties:
+        table-scan-context:
+          $ref: '#/components/schemas/TableScanContext'
+
+    PlanTableRequest:
+      type: object
+      required:
+        - table-scan-context
+      properties:
+        table-scan-context:
+          $ref: '#/components/schemas/TableScanContext'
+        plan-task:
+          $ref: '#/components/schemas/PlanTask'
+        stats-fields:
+          description:
+            A list of fields that the client requests the server to send 
statistics
+            in each `FileScanTask` returned in the response
+          type: array
+          items:
+            $ref: '#/components/schemas/FieldName'
+
+    TableScanContext:
+      anyOf:
+        - $ref: '#/components/schemas/SnapshotScanContext'
+        - $ref: '#/components/schemas/IncrementalSnapshotScanContext'
+
+    BaseTableScanContext:
+      discriminator:
+        propertyName: table-scan-type
+        mapping:
+          snapshot-scan: '#/components/schemas/SnapshotScanContext'
+          incremental-snapshot-scan: 
'#/components/schemas/IncrementalSnapshotScanContext'
+      type: object
+      required:
+        - table-scan-type
+      properties:
+        table-scan-type:
+          type: string
+
+    SnapshotScanContext:
+      description: context for scanning data in a specific snapshot
+      type: object
+      allOf:
+        - $ref: '#/components/schemas/BaseTableScanContext'
+      required:
+        - table-scan-type
+      properties:
+        table-scan-type:
+          type: string
+          enum: ["snapshot-scan"]
+        select:
+          $ref: '#/components/schemas/SelectedFieldNames'
+        filter:
+          $ref: '#/components/schemas/Filter'
+        case-sensitive:
+          description: If field selection and filtering should be case 
sensitive
+          type: boolean
+          default: true
+        snapshot-id:
+          description:
+            The ID of the snapshot to use for the table scan.
+            If not specified, the snapshot at the main branch head will be 
used.
+          type: integer
+          format: int64
+        use-snapshot-schema:
+          description:
+            If the schema of the specific snapshot should be used instead of 
the table schema.
+          type: boolean
+          default: false
+
+    IncrementalSnapshotScanContext:
+      description:
+        Context for scanning data appended in a range of snapshots.
+        The scan always follows the schema of the snapshot at the main branch 
head.
+      type: object
+      allOf:
+        - $ref: '#/components/schemas/BaseTableScanContext'
+      required:
+        - table-scan-type
+        - start-snapshot-id
+      properties:
+        table-scan-type:
+          type: string
+          enum: ["incremental-snapshot-scan"]
+        select:
+          $ref: '#/components/schemas/SelectedFieldNames'
+        filter:
+          $ref: '#/components/schemas/Filter'
+        case-sensitive:
+          description: If field selection and filtering should be case 
sensitive
+          type: boolean
+          default: true
+        start-snapshot-id:
+          description: The ID of the starting snapshot of the incremental scan
+          type: integer
+          format: int64
+        inclusive-start:
+          description: If the data appended in the start snapshot should be 
included in the scan
+          type: boolean
+          default: false
+        end-snapshot-id:
+          description:
+            The ID of the inclusive ending snapshot of the incremental scan.
+            If not specified, the snapshot at the main branch head will be 
used as the end snapshot.
+          type: integer
+          format: int64
+
+    FieldName:

Review Comment:
   Hi @jackye1995 - first off, thank you very much for the work you are doing 
on this item!
   
   I'm personally incredibly excited at the performance optimization this 
feature will enable Rest Catalogs to bring to data ecosystems everywhere.
   
   I do have a question on the decision to use flat '.' delimited strings to 
represent FieldName here instead of an array.
   
   While working on PRs on PyIceberg, we discovered that flattening the names 
has a potential to cause namespace conflicts in the field names. For example, 
this will cause a field named `a.b` to collide with the child field `b` of 
field `a`. This can be represented by an Iceberg Schema like:
   
   ```
   Schema(
       NestedField("a.b", StringType(), id=1),
       NestedField("a", StructType(
           [NestedField("b", StringType(), id=3)]
       ), id=2),
   )
   ```
   
   Flattening the name actually contrasts our approach documented in `Column 
Projection` section of the docs, where we note that a name may contain '.' but 
that this refers to a literal name, and not a nested field: 
https://iceberg.apache.org/spec/#column-projection
    
   Open API Schema for Namespace, which also represents a nested structure:
   
https://github.com/apache/iceberg/blob/4a0ae22199375e34f9033bed9781da3dc90d53c6/open-api/rest-catalog-open-api.yaml#L1625-L1630
   
   Although it goes against our existing approach, I can see how flattening the 
names would be appealing given that maintaining the logic to create a nested 
structure and implement lookups in all of our libraries may be a more 
challenging task. But given that we already have documentation that says 
otherwise, I think if we decide to use a flat representation here, we should be 
intentional about it, and update our documentations to note that using '.' 
character in column names will result in errors with our field name projection



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Add PrePlanTable and PlanTable Endpoints to open api spec [iceberg]

Reply via email to