317brian commented on code in PR #16288:
URL: https://github.com/apache/druid/pull/16288#discussion_r1571365240


##########
docs/ingestion/input-sources.md:
##########
@@ -1141,7 +1141,85 @@ To use the Delta Lake input source, load the extension [`druid-deltalake-extensi
 You can use the Delta input source to read data stored in a Delta Lake table. For a given table, the input source scans
 the latest snapshot from the configured table. Druid ingests the underlying delta files from the table.
 
-The following is a sample spec:
+ | Property|Description|Required|
+|---------|-----------|--------|
+| type|Set this value to `delta`.|yes|
+| tablePath|The location of the Delta table.|yes|
+| filter|The JSON Object that filters data files within a snapshot.|no|
+
+### Delta filter object
+
+You can use these filters to filter out data files from a snapshot, reducing the number of files Druid has to ingest from
+a Delta table. This input source provides the following filters: `and`, `or`, `not`, `=`, `>`, `>=`, `<`, `<=`.
+
+When a filter is applied on non-partitioned columns, the filtering is best-effort as the Delta Kernel solely relies
+on statistics collected when the non-partitioned table is created. In this scenario, this Druid connector may ingest
+data that doesn't match the filter. For guaranteed filtering behavior, use it only on partitioned columns.
+
+
+`and` Filter:

Review Comment:
   Sentence case for headings/titles
   
   ```suggestion
   `and` filter:
   ```



##########
docs/ingestion/input-sources.md:
##########
@@ -1141,7 +1141,85 @@ To use the Delta Lake input source, load the extension [`druid-deltalake-extensi
 You can use the Delta input source to read data stored in a Delta Lake table. For a given table, the input source scans
 the latest snapshot from the configured table. Druid ingests the underlying delta files from the table.
 
-The following is a sample spec:
+ | Property|Description|Required|
+|---------|-----------|--------|
+| type|Set this value to `delta`.|yes|
+| tablePath|The location of the Delta table.|yes|
+| filter|The JSON Object that filters data files within a snapshot.|no|
+
+### Delta filter object
+
+You can use these filters to filter out data files from a snapshot, reducing the number of files Druid has to ingest from
+a Delta table. This input source provides the following filters: `and`, `or`, `not`, `=`, `>`, `>=`, `<`, `<=`.
+
+When a filter is applied on non-partitioned columns, the filtering is best-effort as the Delta Kernel solely relies
+on statistics collected when the non-partitioned table is created. In this scenario, this Druid connector may ingest
+data that doesn't match the filter. For guaranteed filtering behavior, use it only on partitioned columns.
+
+
+`and` Filter:
+
+| Property | Description                                                                                           | Required |
+|----------|-------------------------------------------------------------------------------------------------------|----------|
+| type     | Set this value to `and`.                                                                              | yes      |
+| filters  | List of Delta filter predicates that needs to be AND-ed. `and` filter requires two filter predicates. | yes      |

Review Comment:
   I get what you're going for with `AND-ed`, so maybe:
   
   ```suggestion
   | filters  | List of Delta filter predicates that get evaluated using AND logic where both conditions need to be true. `and` filter requires two filter predicates. | yes      |
   ```
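For context, a hedged sketch of how such an `and` filter might sit inside a `delta` input source spec, based only on the property tables quoted above. The `column` and `value` property names on the comparison filters, and the path and values shown, are assumptions for illustration and are not taken from this diff:

```json
{
  "type": "delta",
  "tablePath": "/delta-table/path",
  "filter": {
    "type": "and",
    "filters": [
      { "type": "=", "column": "state", "value": "CA" },
      { "type": ">=", "column": "age", "value": "21" }
    ]
  }
}
```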



##########
docs/ingestion/input-sources.md:
##########
@@ -1141,7 +1141,85 @@ To use the Delta Lake input source, load the extension [`druid-deltalake-extensi
 You can use the Delta input source to read data stored in a Delta Lake table. For a given table, the input source scans
 the latest snapshot from the configured table. Druid ingests the underlying delta files from the table.
 
-The following is a sample spec:
+ | Property|Description|Required|
+|---------|-----------|--------|
+| type|Set this value to `delta`.|yes|
+| tablePath|The location of the Delta table.|yes|
+| filter|The JSON Object that filters data files within a snapshot.|no|
+
+### Delta filter object
+
+You can use these filters to filter out data files from a snapshot, reducing the number of files Druid has to ingest from
+a Delta table. This input source provides the following filters: `and`, `or`, `not`, `=`, `>`, `>=`, `<`, `<=`.
+
+When a filter is applied on non-partitioned columns, the filtering is best-effort as the Delta Kernel solely relies
+on statistics collected when the non-partitioned table is created. In this scenario, this Druid connector may ingest
+data that doesn't match the filter. For guaranteed filtering behavior, use it only on partitioned columns.

Review Comment:
   It's unclear what `it` refers to in `use it only on...`?
   
   Is it referring to filters?
   
   ```suggestion
   data that doesn't match the filter. For guaranteed filtering behavior, only use filters on partitioned columns.
   ```



##########
docs/ingestion/input-sources.md:
##########
@@ -1141,7 +1141,85 @@ To use the Delta Lake input source, load the extension [`druid-deltalake-extensi
 You can use the Delta input source to read data stored in a Delta Lake table. For a given table, the input source scans
 the latest snapshot from the configured table. Druid ingests the underlying delta files from the table.
 
-The following is a sample spec:
+ | Property|Description|Required|
+|---------|-----------|--------|
+| type|Set this value to `delta`.|yes|
+| tablePath|The location of the Delta table.|yes|
+| filter|The JSON Object that filters data files within a snapshot.|no|
+
+### Delta filter object
+
+You can use these filters to filter out data files from a snapshot, reducing the number of files Druid has to ingest from
+a Delta table. This input source provides the following filters: `and`, `or`, `not`, `=`, `>`, `>=`, `<`, `<=`.
+
+When a filter is applied on non-partitioned columns, the filtering is best-effort as the Delta Kernel solely relies
+on statistics collected when the non-partitioned table is created. In this scenario, this Druid connector may ingest
+data that doesn't match the filter. For guaranteed filtering behavior, use it only on partitioned columns.
+
+
+`and` Filter:
+
+| Property | Description                                                                                           | Required |
+|----------|-------------------------------------------------------------------------------------------------------|----------|
+| type     | Set this value to `and`.                                                                              | yes      |
+| filters  | List of Delta filter predicates that needs to be AND-ed. `and` filter requires two filter predicates. | yes      |
+
+`or` Filter:
+
+| Property | Description                                                                                         | Required |
+|----------|-----------------------------------------------------------------------------------------------------|----------|
+| type     | Set this value to `or`.                                                                             | yes      |
+| filters  | List of Delta filter predicates that needs to be OR-ed. `or` filter requires two filter predicates. | yes      |

Review Comment:
   ```suggestion
   | filters  | List of Delta filter predicates that get evaluated using OR logic where only one condition needs to be true. `or` filter requires two filter predicates. | yes      |
   ```
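For context, a hedged sketch of what an `or` filter object might look like, built only from the property tables quoted above. The `column` and `value` property names on the comparison filters are assumptions for illustration, as their definitions aren't quoted in this hunk:

```json
{
  "type": "or",
  "filters": [
    { "type": "=", "column": "state", "value": "CA" },
    { "type": "=", "column": "state", "value": "OR" }
  ]
}
```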



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

