maytasm commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605357874



##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set the `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that
+  start and end within your `granularitySpec`'s intervals, regardless of whether the new data falls in those segments
+  (this applies only if `appendToExisting` is set to false and `intervals` is specified in the `granularitySpec`).
+  
+  Here are some examples of when to set the `dropExisting` flag in the `ioConfig` to true:
+  
+  - Example 1: An existing segment has an interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in a smaller segmentGranularity, MONTH.
+  If the new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would prevent the original YEAR segment
+  from being dropped. By setting the `dropExisting` flag to true, we can drop the original 2020-01-01 to 2021-01-01
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource when the new data does not contain some time intervals that already exist
+   in the datasource. For example, suppose a user has the following MONTH segmentGranularity data: `Jan has 1 record, Feb has 10 records, Mar has 10 records`
+   in the datasource, and is now trying to re-ingest with new data that overwrites all the existing data.
+   The new data has the following data for each month: `Jan has 0 records, Feb has 10 records, Mar has 9 records`.
+   Without setting `dropExisting` to true, the result after ingestion with overwrite (using the same MONTH segmentGranularity) would be:
+   `Jan has 1 record, Feb has 10 records, Mar has 9 records`. However, this is incorrect as the new data has 0 records for Jan
+   and the user would expect to see that Jan has 0 records. By setting the `dropExisting` flag to true, we can drop the original
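
The record counts in Example 2 of the quoted hunk can be traced with a small toy model. This is only a sketch under stated assumptions: each month stands in for one MONTH segment, a plain dict stands in for the datasource, and the `overwrite` function and its parameters are hypothetical names for illustration, not part of Druid.

```python
# Toy model only: each key is a month standing in for one MONTH segment,
# and each value is that segment's record count. Not Druid code.
def overwrite(existing, new_data, interval_months, drop_existing=False):
    """Return the visible segments after overwriting interval_months."""
    result = dict(existing)
    if drop_existing:
        # Drop every existing segment that lies within the interval,
        # whether or not the new data covers it.
        for month in interval_months:
            result.pop(month, None)
    # A new segment is written only for months that actually receive data.
    for month, count in new_data.items():
        if month in interval_months and count > 0:
            result[month] = count
    return result


existing = {"Jan": 1, "Feb": 10, "Mar": 10}
new_data = {"Jan": 0, "Feb": 10, "Mar": 9}
interval = ["Jan", "Feb", "Mar"]

without_drop = overwrite(existing, new_data, interval)
with_drop = overwrite(existing, new_data, interval, drop_existing=True)

print(without_drop)  # {'Jan': 1, 'Feb': 10, 'Mar': 9}
print(with_drop)     # {'Feb': 10, 'Mar': 9}
```

In the first result the stale Jan segment survives because no new Jan segment was written to replace it; in the second, Jan is absent, which corresponds to the 0 records the user expects.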

Review comment:
       LGTM. Done
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
