techdocsmith commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605235412



##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your 
`granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing 
segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new 
segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the 
ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if 
new data are in existing segments or not 

Review comment:
       ```suggestion
     start and end within your `granularitySpec`'s intervals.  This applies 
whether or not the new data covers all existing segments. 
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your 
`granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing 
segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new 
segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the 
ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if 
new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and 
`interval` specified in `granularitySpec`). 

Review comment:
       ```suggestion
   `dropExisting` only applies when `appendToExisting` is false and the  
`granularitySpec` contains an `interval`. 
   ```
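As a rough illustration of the condition in this suggestion, here is a minimal Python sketch. The helper name `drop_existing_applies` is hypothetical, and the key name `intervals` in the `granularitySpec` is an assumption; the sketch only models the rule as described in this thread and is not Druid's implementation.

```python
def drop_existing_applies(io_config: dict, granularity_spec: dict) -> bool:
    """Return True when `dropExisting` would take effect under the rule above.

    Hypothetical helper: it models the documented condition only.
    Both flags default to false, matching the defaults in the ioConfig table.
    """
    drop_existing = io_config.get("dropExisting", False)
    append = io_config.get("appendToExisting", False)
    # Assumption: the interval(s) live under the "intervals" key.
    has_intervals = bool(granularity_spec.get("intervals"))
    return drop_existing and not append and has_intervals


# dropExisting is ignored when appendToExisting is true or no interval is given.
spec = {"intervals": ["2020-01-01/2021-01-01"]}
applies = drop_existing_applies({"dropExisting": True}, spec)
ignored_append = drop_existing_applies(
    {"dropExisting": True, "appendToExisting": True}, spec
)
ignored_no_interval = drop_existing_applies({"dropExisting": True}, {})
```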

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your 
`granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing 
segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new 
segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the 
ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if 
new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and 
`interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` 
to true

Review comment:
       ```suggestion
     The following examples demonstrate when to set the `dropExisting` property 
to true in the `ioConfig`:
   ```

##########
File path: docs/ingestion/compaction.md
##########
@@ -52,7 +52,7 @@ In cases where you require more control over compaction, you 
can manually submit
 See [Setting up a manual compaction task](#setting-up-manual-compaction) for 
more about manual compaction tasks.
 
 ## Data handling with compaction
-During compaction, Druid overwrites the original set of segments with the 
compacted set. During compaction Druid locks the segments for the time interval 
being compacted to ensure data consistency. By default, compaction tasks do not 
modify the underlying data. You can configure the compaction task to change the 
query granularity or add or remove dimensions in the compaction task. This 
means that the only changes to query results should be the result of 
intentional, not automatic, changes.
+During compaction, Druid overwrites the original set of segments with the 
compacted set. During compaction Druid locks the segments for the time interval 
being compacted to ensure data consistency. By default, compaction tasks do not 
modify the underlying data. You can configure the compaction task to change the 
query granularity or add or remove dimensions in the compaction task. This 
means that the only changes to query results should be the result of 
intentional, not automatic, changes. Note that compaction task automatically 
set `dropExisting` flag of the underlying ingestion task to true. This means 
that compaction task would drop (mark unused) all existing segments that are 
fully contain by the `interval` in the compaction task. This is to handle when 
compaction task changes segmentGranularity of the existing data to a finer 
segmentGranularity and the set of new segments (with the new 
segmentGranularity) does not fully cover the original croaser granularity time 
interval (as 
there may not be data in every time chunk of the new finer segmentGranularity). 

Review comment:
       ```suggestion
   During compaction, Druid overwrites the original set of segments with the 
compacted set. Druid also locks the segments for the time interval being 
compacted to ensure data consistency. By default, compaction tasks do not 
modify the underlying data. You can configure the compaction task to change the 
query granularity or add or remove dimensions in the compaction task. This 
means that the only changes to query results should be the result of 
intentional, not automatic, changes.
   
   For compaction tasks, `dropExisting` for the underlying ingestion tasks is 
`true`. This means that Druid drops (marks unused) all the uncompacted 
segments that fall fully within the interval for the compaction task. For an 
example of why this is important, see the suggestion for reindexing with finer 
granularity under [Implementation 
considerations](native-batch.md#implementation-considerations). 
   ```
   I think it is better not to clutter this section with an example, especially 
if you can't change the value. The customer doesn't need to figure out how to 
set it another way. If they want to understand, they can read the example in 
`native-batch.md`. I had to add the header in because the recommendations don't 
relate to the compression header.

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your 
`granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing 
segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new 
segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the 
ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if 
new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and 
`interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` 
to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 
(YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in 
smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 
2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this 
would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the 
original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not 
contains time intervals that already existed

Review comment:
       ```suggestion
     - Example 2: Consider the case where you want to re-ingest or overwrite a 
datasource and the new data does not contain some time intervals that exist
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your 
`granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing 
segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new 
segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the 
ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if 
new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and 
`interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` 
to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 
(YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in 
smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 
2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this 
would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the 
original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not 
contains time intervals that already existed
+   in the datasource. For example, if a user has the following MONTH 
segmentGranularity data: `Jan has 1 record, Feb has 10 records, Mar has 10 
records` 
+   in the datasource. Now the user is trying to re-ingest with new data that 
overwrites all the existing data. 
+   The new data has the following data for each month: `Jan has 0 record, Feb 
has 10 records, Mar has 9 records`.
+   Without setting `dropExisting` to true, the result after ingestion with 
overwrite (using the same MONTH segmentGranularity) would be:

Review comment:
       ```suggestion
      Unless you set `dropExisting` to true, the result after ingestion with 
overwrite using the same MONTH segmentGranularity would be:
      January: 1 record
      February: 10 records
      March: 9 records
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your 
`granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing 
segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new 
segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the 
ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if 
new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and 
`interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` 
to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 
(YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in 
smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 
2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this 
would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the 
original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not 
contains time intervals that already existed
+   in the datasource. For example, if a user has the following MONTH 
segmentGranularity data: `Jan has 1 record, Feb has 10 records, Mar has 10 
records` 
+   in the datasource. Now the user is trying to re-ingest with new data that 
overwrites all the existing data. 
+   The new data has the following data for each month: `Jan has 0 record, Feb 
has 10 records, Mar has 9 records`.
+   Without setting `dropExisting` to true, the result after ingestion with 
overwrite (using the same MONTH segmentGranularity) would be:
+   `Jan has 1 record, Feb has 10 records, Mar has 9 records`. However, this is 
incorrect as the new data has 0 record for Jan 
+   and the user would expect to see that Jan has 0 record. By setting 
`dropExisting` flag to true, we can drop the original

Review comment:
       ```suggestion
    
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your 
`granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing 
segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new 
segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the 
ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if 
new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and 
`interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` 
to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 
(YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in 
smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 
2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this 
would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the 
original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not 
contains time intervals that already existed
+   in the datasource. For example, if a user has the following MONTH 
segmentGranularity data: `Jan has 1 record, Feb has 10 records, Mar has 10 
records` 
+   in the datasource. Now the user is trying to re-ingest with new data that 
overwrites all the existing data. 
+   The new data has the following data for each month: `Jan has 0 record, Feb 
has 10 records, Mar has 9 records`.
+   Without setting `dropExisting` to true, the result after ingestion with 
overwrite (using the same MONTH segmentGranularity) would be:
+   `Jan has 1 record, Feb has 10 records, Mar has 9 records`. However, this is 
incorrect as the new data has 0 record for Jan 

Review comment:
       ```suggestion
      This is incorrect since the new data has 0 records for January. Set 
`dropExisting` to true to drop the original
      segment for January, which is no longer needed since the newly ingested 
data has no records for January.
   ```
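The January case in Example 2 can be sketched in Python as follows. The function `visible_records` is hypothetical and models only the behavior described in this thread, assuming that a month with zero new records produces no new segment and that every month falls within the `granularitySpec` interval.

```python
def visible_records(existing: dict, new: dict, drop_existing: bool) -> dict:
    """Model per-month record counts after an overwrite at MONTH granularity.

    Assumption: a month with zero new records produces no new segment, so the
    old segment for that month stays visible unless it is dropped.
    """
    result = {}
    for month in existing.keys() | new.keys():
        new_count = new.get(month, 0)
        if new_count > 0:
            result[month] = new_count                # new segment overshadows the old one
        elif drop_existing:
            result[month] = 0                        # old segment dropped, nothing replaces it
        else:
            result[month] = existing.get(month, 0)   # old segment is left alone
    return result


existing = {"January": 1, "February": 10, "March": 10}
new = {"January": 0, "February": 10, "March": 9}

without_drop = visible_records(existing, new, drop_existing=False)
with_drop = visible_records(existing, new, drop_existing=True)
```

In this model, `without_drop` still reports 1 record for January, while `with_drop` reports 0, which is the difference the example is meant to show.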

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your 
`granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing 
segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new 
segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the 
ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if 
new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and 
`interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` 
to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 
(YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in 
smaller segmentGranularity, MONTH. 

Review comment:
       ```suggestion
     overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data 
using the finer segmentGranularity of MONTH. 
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your 
`granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing 
segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new 
segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the 
ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if 
new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and 
`interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` 
to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 
(YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in 
smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 
2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this 
would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the 
original 2020-01-01 to 2021-01-01 

Review comment:
       ```suggestion
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your 
`granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing 
segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new 
segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the 
ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if 
new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and 
`interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` 
to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 
(YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in 
smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 
2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this 
would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the 
original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not 
contains time intervals that already existed
+   in the datasource. For example, if a user has the following MONTH 
segmentGranularity data: `Jan has 1 record, Feb has 10 records, Mar has 10 
records` 
+   in the datasource. Now the user is trying to re-ingest with new data that 
overwrites all the existing data. 
+   The new data has the following data for each month: `Jan has 0 record, Feb 
has 10 records, Mar has 9 records`.

Review comment:
       ```suggestion
     You want to re-ingest and overwrite with new data as follows:
     January: 0 records
     February: 10 records
     March: 9 records
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your 
`granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing 
segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new 
segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the 
ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if 
new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and 
`interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` 
to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 
(YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in 
smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 
2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this 
would then prevent the original YEAR segment 

Review comment:
       ```suggestion
    Druid cannot drop the original YEAR segment, even if the replacement data 
covers every month that has existing data. Set `dropExisting` to true in this 
case to drop the original segment at YEAR `segmentGranularity` since you no 
longer need it.
   ```
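To make the containment rule behind Example 1 concrete, here is a small Python sketch. The function `segments_to_drop` and the tuple representation of segments are assumptions for illustration; the sketch models "start and end within the interval" as described in the doc text, not Druid's actual code.

```python
from datetime import date


def segments_to_drop(existing, interval):
    """Return the existing segments that start and end within `interval`.

    Hypothetical model: each segment and the interval are (start, end) tuples.
    """
    start, end = interval
    return [seg for seg in existing if start <= seg[0] and seg[1] <= end]


year_segment = (date(2020, 1, 1), date(2021, 1, 1))      # YEAR segmentGranularity
straddling = (date(2019, 12, 1), date(2020, 2, 1))       # only partially overlaps
interval = (date(2020, 1, 1), date(2021, 1, 1))

dropped = segments_to_drop([year_segment, straddling], interval)
```

Under this model only the YEAR segment is dropped; the partially overlapping segment is left alone, consistent with the existing note about segments that partially overlap the interval.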

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your 
`granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing 
segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new 
segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the 
ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if 
new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and 
`interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` 
to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 
(YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in 
smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 
2020-01-01 to 2021-01-01

Review comment:
       ```suggestion
     If the replacement data does not have a record within every month from 
2020-01-01 to 2021-01-01
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your 
`granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing 
segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new 
segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the 
ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if 
new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and 
`interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` 
to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 
(YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in 
smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 
2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this 
would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the 
original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not 
contains time intervals that already existed
+   in the datasource. For example, if a user has the following MONTH 
segmentGranularity data: `Jan has 1 record, Feb has 10 records, Mar has 10 
records` 

Review comment:
       ```suggestion
      in the datasource. For example, a datasource contains the following data 
at MONTH segmentGranularity:
      January: 1 record
      February: 10 records
      March: 10 records
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your 
`granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing 
segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new 
segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the 
ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if 
new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and 
`interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` 
to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 
(YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in 
smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 
2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this 
would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the 
original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.

Review comment:
       ```suggestion
   
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -193,6 +213,7 @@ that range if there's some stray data with unexpected 
timestamps.
 |type|The task type, this should always be `index_parallel`.|none|yes|
 |inputFormat|[`inputFormat`](./data-formats.md#input-format) to specify how to 
parse input data.|none|yes|
 |appendToExisting|Creates segments as additional shards of the latest version, 
effectively appending to the segment set instead of replacing it. This means 
that you can append new segments to any datasource regardless of its original 
partitioning scheme. You must use the `dynamic` partitioning type for the 
appended segments. If you specify a different partitioning type, the task fails 
with an error.|false|no|
+|dropExisting|If set to true (and `appendToExisting` is set to false and 
`interval` is specified in `granularitySpec`), then the ingestion task would 
drop (mark unused) all existing segments that are fully contained by the 
`interval` in the `granularitySpec` when the task publishes new segments (no 
segments would be dropped (marked unused) if the ingestion fails). Note that if 
either `appendToExisting` is `true` or `interval` is not specified in 
`granularitySpec` then no segments would be dropped even if `dropExisting` is 
set to `true`.|false|no|

Review comment:
       ```suggestion
   |dropExisting|If `true` and `appendToExisting` is `false` and the 
`granularitySpec` contains an `interval`, then the ingestion task drops (marks 
unused) all existing segments fully contained by the specified `interval` when 
the task publishes new segments. If ingestion fails, Druid does not drop or 
mark unused any segments. In the case of misconfiguration where either 
`appendToExisting` is `true` or `interval` is not specified in 
`granularitySpec`, Druid does not drop any segments even if `dropExisting` is 
`true`.|false|no|
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your 
`granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing 
segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new 
segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the 
ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if 
new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and 
`interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` 
to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 
(YEAR segmentGranularity) and we are trying to 

Review comment:
       ```suggestion
     - Example 1: Consider an existing segment with an interval of 2020-01-01 
to 2021-01-01 and YEAR segmentGranularity. You want to 
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your 
`granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing 
segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new 
segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the 
ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if 
new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and 
`interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` 
to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 
(YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in 
smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 
2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this 
would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the 
original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not 
contains time intervals that already existed
+   in the datasource. For example, if a user has the following MONTH 
segmentGranularity data: `Jan has 1 record, Feb has 10 records, Mar has 10 
records` 
+   in the datasource. Now the user is trying to re-ingest with new data that 
overwrites all the existing data. 

Review comment:
       ```suggestion
    
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -719,6 +741,7 @@ that range if there's some stray data with unexpected 
timestamps.
 |type|The task type, this should always be "index".|none|yes|
 |inputFormat|[`inputFormat`](./data-formats.md#input-format) to specify how to 
parse input data.|none|yes|
 |appendToExisting|Creates segments as additional shards of the latest version, 
effectively appending to the segment set instead of replacing it. This means 
that you can append new segments to any datasource regardless of its original 
partitioning scheme. You must use the `dynamic` partitioning type for the 
appended segments. If you specify a different partitioning type, the task fails 
with an error.|false|no|
+|dropExisting|If set to true (and `appendToExisting` is set to false and 
`interval` is specified in `granularitySpec`), then the ingestion task would 
drop (mark unused) all existing segments that are fully contained by the 
`interval` in the `granularitySpec` when the task publishes new segments (no 
segments would be dropped (marked unused) if the ingestion fails). Note that if 
either `appendToExisting` is `true` or `interval` is not specified in 
`granularitySpec` then no segments would be dropped even if `dropExisting` is 
set to `true`.|false|no|

Review comment:
       same as line 216
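
To make the conditions in this table row easier to check, here is a minimal sketch of the rule as the row describes it. It is a toy model, not Druid code: `segments_to_drop` and its parameters are hypothetical names, and segments are represented only by their `(start, end)` intervals.

```python
from datetime import date

def segments_to_drop(existing, interval, append_to_existing=False, drop_existing=False):
    """Return the existing segment intervals that would be marked unused.

    Hypothetical helper: models only the rule stated in the docs row above.
    """
    # Nothing is dropped unless dropExisting is true, appendToExisting is
    # false, and an interval is specified in the granularitySpec.
    if not drop_existing or append_to_existing or interval is None:
        return []
    start, end = interval
    # Only segments that start and end within the interval qualify.
    return [(s, e) for (s, e) in existing if start <= s and e <= end]

existing = [
    (date(2020, 1, 1), date(2021, 1, 1)),  # fully contained YEAR segment
    (date(2020, 6, 1), date(2021, 6, 1)),  # partially overlapping segment
]
interval = (date(2020, 1, 1), date(2021, 1, 1))

dropped = segments_to_drop(existing, interval, drop_existing=True)
print(dropped)
```

Under these assumptions only the first segment is returned. The partially overlapping segment is left alone, and setting `append_to_existing=True` or omitting the interval returns an empty list.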

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your 
`granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing 
segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new 
segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the 
ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if 
new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and 
`interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` 
to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 
(YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in 
smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 
2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this 
would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the 
original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not 
contains time intervals that already existed
+   in the datasource. For example, if a user has the following MONTH 
segmentGranularity data: `Jan has 1 record, Feb has 10 records, Mar has 10 
records` 
+   in the datasource. Now the user is trying to re-ingest with new data that 
overwrites all the existing data. 
+   The new data has the following data for each month: `Jan has 0 record, Feb 
has 10 records, Mar has 9 records`.
+   Without setting `dropExisting` to true, the result after ingestion with 
overwrite (using the same MONTH segmentGranularity) would be:
+   `Jan has 1 record, Feb has 10 records, Mar has 9 records`. However, this is 
incorrect as the new data has 0 record for Jan 
+   and the user would expect to see that Jan has 0 record. By setting 
`dropExisting` flag to true, we can drop the original
+   segment of January which is no longer needed (as new ingested data does 
not have any data in January).

Review comment:
       ```suggestion
     
   ```
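
For reference, the arithmetic in Example 2 can be reproduced with a small toy model. This is only a sketch under stated assumptions: `overwrite` is a hypothetical function, each month stands for one MONTH segment, the `granularitySpec` interval is assumed to cover Jan through Mar, and a month with zero new records is assumed to produce no new segment.

```python
def overwrite(existing, new, drop_existing=False):
    """Toy model of an overwrite ingestion; not Druid code.

    existing, new: dicts mapping a month name to its record count.
    """
    # With dropExisting, every existing segment inside the interval is
    # dropped first; without it, existing segments stay until overshadowed.
    result = {} if drop_existing else dict(existing)
    # Only months that actually contain new data produce a new segment.
    result.update({month: n for month, n in new.items() if n > 0})
    return result

existing = {"Jan": 1, "Feb": 10, "Mar": 10}
new = {"Jan": 0, "Feb": 10, "Mar": 9}

without_drop = overwrite(existing, new)
with_drop = overwrite(existing, new, drop_existing=True)
print(without_drop)
print(with_drop)
```

Without the flag, the stale January record survives because no new January segment overshadows it. With the flag, January has no segment at all, which matches the expectation of zero records.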




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


