Re: [PR] Enhance Compaction task to be able to write to a different/new datasource (druid)

via GitHub Fri, 17 Oct 2025 17:35:40 -0700


kfaraz commented on PR #18612:
URL: https://github.com/apache/druid/pull/18612#issuecomment-3384543274


   Yes, @maytasm, I agree that we should be able to leverage the auto-discover 
capabilities of the compaction task for re-indexing too. For that reason, it makes 
sense to extend that feature.
   
   > We can have a new task type but I think that's making Druid harder to 
use/understand...just like adding new runtime properties and tuning configs.
   
   Yeah, more runtime properties and tuning configs often make Druid harder to 
use and understand.
   In fact, my concern is the same. Overloading the same feature to satisfy
   completely different use cases makes things confusing and less maintainable.
   The question is one of intent. There is no reason a user trying to move data
   from one datasource to another should have to launch a `compact` task.
   
   Since the use case here is a new capability altogether, there is no harm in 
adding a new task type or a new input source type, whichever seems simpler to 
implement.
   
   > You can already use Compaction task for more than Compacting. There are 
case where users change the schema, drop dimensions, change granularity level, 
etc. You can even give it a finer segmentGranularity and it will be expanding 
the datasource
   
   Absolutely, this is one of the things we have been discussing, that a 
`compact` task should ideally never change the meaning of the data, only how 
it's laid out/partitioned (the change in this PR would only add to that 
discrepancy).
   Since we have already added this capability, we don't want to get rid of it 
right now as users may already be using it.
   In the future, the compaction templates in #18402 will have the capability to
   validate that a template does not change the meaning of the data and performs
   only "compaction".
   
   >  The schema detection, spec-autofill etc here only make sense for only one 
of those input source, the Druid input source. For compaction task, the input 
source is always Druid but for native batch it isn't.
   
   Oh, absolutely. To clarify, I meant that we should try to bring the
   auto-detection capabilities of the `compact` task into native batch + `druid`
   input source, not native batch in general. But I suppose that might be more
   involved to implement, and is perhaps overkill anyway. We might as well just
   extend the `compact` task, as that seems simpler.
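
   For context, here is a rough sketch of what re-indexing through native batch
   with the `druid` input source looks like today. The helper name, datasource
   names, interval, and dimension list are hypothetical; the point is only that
   the caller has to spell out the schema by hand, which is the part the
   `compact` task fills in automatically.

```python
import json


def build_reindex_spec(source, target, interval, dimensions):
    """Build an index_parallel spec that reads `source` and writes `target`.

    Hypothetical helper for illustration. Nothing is auto-discovered here:
    dimensions and granularity must be supplied by the caller.
    """
    return {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": target,
                # Segments read back through the druid input source expose
                # their timestamp as __time in milliseconds.
                "timestampSpec": {"column": "__time", "format": "millis"},
                "dimensionsSpec": {"dimensions": list(dimensions)},
                "granularitySpec": {
                    "segmentGranularity": "day",
                    "queryGranularity": "none",
                    "rollup": False,
                },
            },
            "ioConfig": {
                "type": "index_parallel",
                "inputSource": {
                    "type": "druid",
                    "dataSource": source,
                    "interval": interval,
                },
            },
            "tuningConfig": {"type": "index_parallel"},
        },
    }


if __name__ == "__main__":
    # Assumed example values, not taken from the PR.
    spec = build_reindex_spec(
        "events_src", "events_dst", "2025-01-01/2025-02-01", ["page", "user"]
    )
    print(json.dumps(spec, indent=2))
```

   Metrics, transforms, and partitioning are left out of the sketch and would
   also need to be written manually.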
   
   To add to the suggestions from @clintropolis, you could also consider using
   an MSQ INSERT/REPLACE statement. I am not sure whether it provides all the
   auto-discover niceties, but it is probably worth a shot.
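
   A minimal sketch of that MSQ route, assuming the goal is a straight copy of
   one datasource into another. The helper names and datasource names are
   hypothetical, and whether `SELECT *` gives you everything the `compact` task
   auto-discovers is exactly the open question above. As I understand it, a
   payload like this would be submitted to the SQL task endpoint
   (`/druid/v2/sql/task`).

```python
import json


def quote_identifier(name):
    """Quote a SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_msq_replace_payload(source, target, granularity="DAY"):
    """Build a request body for an MSQ REPLACE that copies `source` into `target`.

    Hypothetical helper for illustration. OVERWRITE ALL replaces everything
    currently in the target datasource.
    """
    query = (
        f"REPLACE INTO {quote_identifier(target)} OVERWRITE ALL "
        f"SELECT * FROM {quote_identifier(source)} "
        f"PARTITIONED BY {granularity}"
    )
    return {"query": query}


if __name__ == "__main__":
    # Assumed example names, not taken from the PR.
    print(json.dumps(build_msq_replace_payload("events_src", "events_dst"), indent=2))
```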


