Hi Gyorgy,
comments inline.
On 7/1/24 15:10, Balogh, György wrote:
Hi Jan,
Let me add a few more details to show the full picture. We have live
data streams (video analysis metadata) and we would like to run both
live and historic pipelines on the metadata (e.g. live alerts,
historic video searches).
This should be fine thanks to Beam's unified model. You can write a
PTransform that handles a PCollection<...> without having to worry
whether the PCollection was created from Kafka or from a bounded source.
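As a minimal sketch of what this reuse looks like (the class name
ParseMetadata, the broker address, topic and file path below are all
illustrative assumptions, not details from this thread): the same
PTransform is applied once to an unbounded Kafka source and once to a
bounded file source, with no change to the transform itself.

```java
// Hypothetical sketch: one PTransform reused on both a Kafka-backed
// (unbounded) and a file-backed (bounded) PCollection.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.apache.kafka.common.serialization.StringDeserializer;

class ParseMetadata extends PTransform<PCollection<String>, PCollection<String>> {
  @Override
  public PCollection<String> expand(PCollection<String> input) {
    // Shared analysis logic; runs unchanged on bounded or unbounded input.
    return input.apply(MapElements.into(TypeDescriptors.strings())
        .via(line -> line.trim()));
  }
}

public class UnifiedPipelines {
  public static void main(String[] args) {
    // Live pipeline: unbounded read from Kafka (assumed broker/topic).
    Pipeline live = Pipeline.create();
    live.apply(KafkaIO.<String, String>read()
            .withBootstrapServers("broker:9092")
            .withTopic("video-metadata")
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata())
        .apply(Values.create())
        .apply(new ParseMetadata());   // e.g. feeds live alerts

    // Historic pipeline: bounded read from files (assumed path).
    Pipeline historic = Pipeline.create();
    historic.apply(TextIO.read().from("hdfs:///metadata/*"))
        .apply(new ParseMetadata());   // e.g. feeds historic searches
  }
}
```

Running either pipeline of course still requires a runner (e.g. Flink)
and the Beam Kafka/Hadoop IO dependencies on the classpath.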
We planned to use Kafka to store the streaming data and to run both
types of queries directly on top of it. Are you suggesting we consider
keeping Kafka with small retention to serve the live queries, and
storing the historic data somewhere else that scales better for
historic queries? We need to have on-prem options here. What options
should we consider that scale nicely (in terms of IO parallelization)
with Beam? (e.g. HDFS?)
Yes, though I would not necessarily say "small" retention, but rather
"limited" retention. Running on premise you can choose from HDFS, an
S3-compatible store such as MinIO, or some other distributed storage,
depending on the scale and deployment options (e.g. YARN or k8s).
I also happen to work on a system which targets exactly these
streaming-batch workloads (persisting upserts from a stream to batch
storage for reprocessing), see [1]. Please feel free to contact me
directly if this sounds interesting.
Best,
Jan
[1] https://github.com/O2-Czech-Republic/proxima-platform
Thank you,
Gyorgy
On Mon, Jul 1, 2024 at 9:21 AM Jan Lukavský <je...@seznam.cz> wrote:
Hi Gyorgy,
I don't think it is possible to co-locate tasks as you describe.
Beam has no information about the location of 'splits'. On the
other hand, if batch throughput is the main concern, then reading
from Kafka might not be the optimal choice. Although Kafka
provides tiered storage for offloading historical data, it still
somewhat limits scalability (and thus throughput), because the
data has to be read by a broker and only then passed to a
consumer. The parallelism is therefore limited by the number of
Kafka partitions, not by the parallelism of the Flink job. A more
scalable approach could be to persist the data from Kafka to a
batch storage (e.g. S3 or GCS) and reprocess it from there.
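A rough sketch of that archive-then-reprocess pattern (the topic,
bucket path, window size and shard count below are assumptions for
illustration): one always-on streaming job continuously writes the
Kafka topic out as windowed files, and a separate batch job reads the
files back, so read parallelism is bounded by the number of
files/splits rather than by the number of Kafka partitions.

```java
// Hypothetical sketch of persisting a Kafka topic to batch storage
// and reprocessing it from there.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;

public class ArchiveAndReprocess {
  public static void main(String[] args) {
    // Streaming job: archive the topic to windowed files on S3.
    Pipeline archive = Pipeline.create();
    archive.apply(KafkaIO.<String, String>read()
            .withBootstrapServers("broker:9092")   // assumed address
            .withTopic("video-metadata")           // assumed topic
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata())
        .apply(Values.create())
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(10))))
        .apply(TextIO.write()
            .to("s3://metadata-archive/part")      // assumed path
            .withWindowedWrites()
            .withNumShards(8));  // shard count bounds write parallelism

    // Batch job: reprocess the archive; file/split count, not Kafka
    // partition count, now bounds read parallelism.
    Pipeline reprocess = Pipeline.create();
    reprocess.apply(TextIO.read().from("s3://metadata-archive/part*"));
    // ... historic analysis transforms would follow here
  }
}
```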
Best,
Jan
On 6/29/24 09:12, Balogh, György wrote:
Hi,
I'm planning a distributed system with multiple Kafka brokers
co-located with Flink workers. Data processing throughput for
historic queries is a main KPI, so I want to make sure all Flink
workers read local data and not remote. I'm defining my pipelines
in Beam using Java. Is this possible? What are the critical config
elements to achieve it?
Thank you,
Gyorgy
--
György Balogh
CTO
E gyorgy.bal...@ultinous.com
M +36 30 270 8342
A HU, 1117 Budapest, Budafoki út 209.
W www.ultinous.com