jadami10 opened a new issue, #9149:
URL: https://github.com/apache/pinot/issues/9149

   We were testing partitioned data in REALTIME tables and found that 
consuming segments were never queried when the filter `where 
<partition_column> = X` was applied. We eventually traced it to the following:
   - we have our own stream ingestion plugin that ingests data from 
multiple Kafka sources
   - to do so, it uses an algorithm we wrote to convert the upstream config 
into an integer Pinot partition id
   - as a result, `murmur(partition_column) % num_partitions` never actually 
equals the partition id our plugin assigns
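
   To illustrate the mismatch, here is a minimal sketch of how a 
partition-aware pruner could end up skipping the consuming segment. It is not 
Pinot's actual code: `hash_partition`, `can_prune`, the partition count, and the 
sample value are all hypothetical, and `zlib.crc32` is only a dependency-free 
stand-in for the murmur hash.

```python
import zlib

NUM_PARTITIONS = 8  # hypothetical partition count


def hash_partition(value, num_partitions=NUM_PARTITIONS):
    """Stand-in for murmur(value) % num_partitions (crc32 is NOT what Pinot uses)."""
    return zlib.crc32(value.encode("utf-8")) % num_partitions


def can_prune(segment_partitions, filter_value):
    """True if the segment would be skipped for `<partition_column> = filter_value`."""
    return hash_partition(filter_value) not in segment_partitions


value = "user-42"
expected = hash_partition(value)
# Hypothetical id assigned by a custom plugin; chosen to differ from the hash.
custom_id = (expected + 1) % NUM_PARTITIONS

consuming_segment = {custom_id}          # only the made-up id is recorded
sealed_segment = {custom_id, expected}   # on seal, the observed partition is added

print(can_prune(consuming_segment, value))  # True: the consuming segment is skipped
print(can_prune(sealed_segment, value))     # False: the sealed segment is queried
```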
   
   However, we saw that after a segment is sealed, Pinot updates the 
partition metadata to include both our made-up id and the actual partition it 
saw in the data. Having the same behavior on the consuming segment would be 
really useful for us.
   
   For redundancy, we actually want to repartition our upstream data into 4 
separate sources, not just 1, so each `<partition_column>` value should exist 
on 4 Pinot partitions. Since Pinot already updates the segment metadata on 
seal, could it do the same in real time for consuming segments? There should 
only be 1 thread touching a consuming segment at a time anyway.
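
   A rough sketch of the requested behavior, under the assumption that a 
consuming segment keeps a set of partitions and adds to it as rows are indexed. 
`ConsumingSegment`, `index_row`, `may_contain`, the partition count, and the 
source ids 0, 4, 8, 12 are hypothetical names and values for illustration, and 
`zlib.crc32` again stands in for the murmur hash.

```python
import zlib

NUM_PARTITIONS = 16  # hypothetical partition count


def hash_partition(value, num_partitions=NUM_PARTITIONS):
    """Stand-in for murmur(value) % num_partitions (crc32 is NOT what Pinot uses)."""
    return zlib.crc32(value.encode("utf-8")) % num_partitions


class ConsumingSegment:
    """Tracks the made-up stream partition id plus every partition seen so far."""

    def __init__(self, stream_partition_id):
        self.stream_partition_id = stream_partition_id
        self.partitions = {stream_partition_id}

    def index_row(self, partition_value):
        # Record the hash-derived partition at ingestion time instead of on seal.
        self.partitions.add(hash_partition(partition_value))

    def may_contain(self, partition_value):
        return hash_partition(partition_value) in self.partitions


segments = [ConsumingSegment(i) for i in range(NUM_PARTITIONS)]

# The same value is replicated into 4 upstream sources (hypothetical ids).
replicas = (0, 4, 8, 12)
for seg_id in replicas:
    segments[seg_id].index_row("user-42")

queried = [s.stream_partition_id for s in segments if s.may_contain("user-42")]

# All 4 replicas are queried. At most one extra segment can match, when its
# made-up id happens to collide with the hash-derived partition.
print(set(replicas) <= set(queried))  # True
print(len(queried) <= len(replicas) + 1)  # True
```

   With this bookkeeping, the filter reaches the 4 segments holding the value 
rather than all 16 consuming segments.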
   
   The other proposal is to offer an option to disable partitioning on the 
consuming segment. But that really defeats the purpose of what we're trying to 
do, since it greatly increases the number of segments we query.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

