[I] Data ingestion CPU efficiency improvements [pinot]

via GitHub Wed, 05 Jun 2024 12:52:35 -0700


lnbest0707-uber opened a new issue, #13319:
URL: https://github.com/apache/pinot/issues/13319

Pinot ingests data from Kafka using one consumer thread per Kafka partition,
so ingestion scales only by increasing the number of partitions on the Kafka
topic. However, because ingestion is computationally heavy, a Kafka broker can
usually sustain far more traffic per partition than Pinot can.
For example, on the same type of hardware, Kafka can handle over
8 MB/s/partition, whereas Pinot, when doing complex transformation and index
building (e.g. SchemaConformingTransformer and text index), can only handle
<2 MB/s/partition. As a result, Kafka partition expansion cannot always keep
pace with Pinot's ingestion load.
In practice, we observe that on a Pinot server with tens of cores, only
20% of the cores are busy with ingestion while the others sit relatively idle.

Hence, there is a need to improve computation efficiency and to parallelize
(at least part of) the message processing for a single partition.
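
To make the mismatch concrete, here is a small back-of-the-envelope sketch. It
treats the 8 MB/s and 2 MB/s figures above as rough per-partition limits (the
issue only states them as bounds), and the 64 MB/s topic rate is a purely
hypothetical number chosen for illustration. `PartitionSizing` is not a Pinot
class.

```java
class PartitionSizing {
    /** Smallest number of partitions (one consumer thread each) that can absorb the given rate. */
    static int requiredPartitions(double topicMBps, double perPartitionMBps) {
        return (int) Math.ceil(topicMBps / perPartitionMBps);
    }

    public static void main(String[] args) {
        double topicMBps = 64.0;   // hypothetical topic throughput
        double kafkaLimit = 8.0;   // assumed Kafka per-partition limit (issue says "over 8 MB/s")
        double pinotLimit = 2.0;   // assumed Pinot per-partition limit (issue says "<2 MB/s")

        int forKafka = requiredPartitions(topicMBps, kafkaLimit);
        int forPinot = requiredPartitions(topicMBps, pinotLimit);

        System.out.println("Partitions Kafka needs: " + forKafka);
        System.out.println("Partitions Pinot needs: " + forPinot);
    }
}
```

Under these assumptions, the topic would need several times more partitions to
satisfy Pinot than Kafka itself requires, which is why scaling purely by
partition count is a poor fit.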

![image](https://github.com/apache/pinot/assets/106711887/a6e6390b-ddfc-48f6-97b7-0959dad88bfc)
The attached picture points to a few components that could be improved:

- gzip compression -> switch to zstd with a proper compression level
- transformers -> use batch and parallel processing
- indexing -> TBD
- Kafka polling -> use batch polling
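
One possible shape for the "batch and parallel" transformer idea, as a minimal
sketch rather than Pinot's actual code: fan the records of one polled batch out
to a thread pool for transformation, then collect the results in their original
order so that indexing still sees records in offset order. The names
`OrderedBatchTransformer` and `transformBatch`, and the upper-casing
transformer, are hypothetical and used only for illustration; none of them are
Pinot APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class OrderedBatchTransformer {
    /**
     * Transforms every message of one batch on the given pool and returns
     * the results in the same order as the input batch.
     */
    static <I, O> List<O> transformBatch(List<I> batch,
                                         Function<I, O> transformer,
                                         ExecutorService pool)
            throws InterruptedException, ExecutionException {
        // Submit one task per message; futures are kept in input order.
        List<Future<O>> futures = new ArrayList<>(batch.size());
        for (I message : batch) {
            Callable<O> task = () -> transformer.apply(message);
            futures.add(pool.submit(task));
        }
        // Collect in submission order, so the output order matches the batch
        // order even if tasks finish out of order.
        List<O> results = new ArrayList<>(batch.size());
        for (Future<O> future : futures) {
            results.add(future.get());
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            // Stand-in for one batch polled from a single Kafka partition.
            List<String> batch = List.of("a", "b", "c", "d");
            Function<String, String> transformer = s -> s.toUpperCase();
            List<String> transformed = transformBatch(batch, transformer, pool);
            System.out.println(transformed);
        } finally {
            pool.shutdown();
        }
    }
}
```

This only works when the transformer is stateless per record. Anything that
depends on order, such as indexing or offset checkpointing, would still run
sequentially on the ordered result.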

