Hi Gyorgy,

comments inline.

On 7/1/24 15:10, Balogh, György wrote:
    Hi Jan,
    Let me add a few more details to show the full picture. We have live data streams (video analysis metadata) and we would like to run both live and historic pipelines on the metadata (e.g. live alerts, historic video searches).
This should be fine thanks to Beam's unified model. You can write a PTransform that handles a PCollection<...> without needing to worry whether the PCollection was created from Kafka or from some bounded source.
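
For illustration, a minimal sketch of such a source-agnostic transform; the CountPerCamera name, the one-minute windows and the KV<String, String> metadata encoding are just illustrative placeholders, not your actual schema:

import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Counts metadata events per camera id in one-minute windows. The windowing
// makes the aggregation valid for both bounded (historic) and unbounded
// (live) inputs, so the same transform serves both pipelines.
public class CountPerCamera
    extends PTransform<PCollection<KV<String, String>>, PCollection<KV<String, Long>>> {

  @Override
  public PCollection<KV<String, Long>> expand(PCollection<KV<String, String>> input) {
    return input
        .apply(Window.<KV<String, String>>into(FixedWindows.of(Duration.standardMinutes(1))))
        .apply(Count.perKey());
  }
}
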
    We planned to use Kafka to store the streaming data and directly run both types of queries on top. You are suggesting to consider having Kafka with small retention to serve the live queries and storing the historic data somewhere else that scales better for historic queries? We need to have on-prem options here. What options should we consider that scale nicely (in terms of IO parallelization) with Beam? (e.g. HDFS?)

Yes, I would not necessarily say "small" retention, but rather "limited" retention. Running on premise you can choose HDFS, an S3-compatible store such as MinIO, or some other distributed storage, depending on the scale and deployment options (e.g. YARN or k8s).
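
As a rough sketch of the archiving side (bootstrap servers, topic, output path, window size and shard count are made-up placeholders; with the corresponding Beam filesystem configured, the same write could target an s3:// path served by MinIO instead of HDFS):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;

// Continuously archives metadata records from Kafka into hourly windowed
// files on distributed storage, so historic pipelines can read them with
// high parallelism and Kafka retention can stay limited.
public class ArchiveMetadata {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(KafkaIO.<String, String>read()
            .withBootstrapServers("kafka:9092")              // placeholder
            .withTopic("video-analysis-metadata")            // placeholder
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata())
        .apply(Values.<String>create())
        .apply(Window.<String>into(FixedWindows.of(Duration.standardHours(1))))
        .apply(TextIO.write()
            .to("hdfs:///archive/metadata/part")             // placeholder path
            .withWindowedWrites()
            .withNumShards(16));                             // placeholder shard count
    p.run();
  }
}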

I also happen to work on a system which targets exactly these streaming-batch workloads (persisting upserts from stream to batch for reprocessing), see [1]. Please feel free to contact me directly if this sounds interesting.

Best,

 Jan

[1] https://github.com/O2-Czech-Republic/proxima-platform

    Thank you,
    Gyorgy

On Mon, Jul 1, 2024 at 9:21 AM Jan Lukavský <je...@seznam.cz> wrote:

    Hi Gyorgy,

    I don't think it is possible to co-locate tasks as you describe.
    Beam has no information about the location of 'splits'. On the
    other hand, if batch throughput is the main concern, then reading
    from Kafka might not be the optimal choice. Although Kafka
    provides tiered storage for offloading historical data, it still
    somewhat limits scalability (and thus throughput), because the
    data have to be read by a broker and only then passed to a
    consumer. The parallelism is therefore limited by the number of
    Kafka partitions, not by the parallelism of the Flink job. A more
    scalable approach could be to persist data from Kafka to batch
    storage (e.g. S3 or GCS) and reprocess it from there.
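
    A rough sketch of the reprocessing side (the path is a made-up
    placeholder); the archived files are read as a bounded PCollection,
    so parallelism is governed by the runner and the file splits rather
    than by the Kafka partition count:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    public class ReprocessArchivedMetadata {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        // Bounded read of the archived metadata; the path is a placeholder.
        PCollection<String> archived =
            p.apply(TextIO.read().from("s3://archive/metadata/part*"));
        // ... apply the same transforms as in the live pipeline ...
        p.run();
      }
    }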

    Best,

     Jan

    On 6/29/24 09:12, Balogh, György wrote:
    Hi,
    I'm planning a distributed system with multiple Kafka brokers
    co-located with Flink workers.
    Data processing throughput for historic queries is a main KPI, so
    I want to make sure all Flink workers read local data and not
    remote data. I'm defining my pipelines in Beam using Java.
    Is this possible? What are the critical config elements to achieve
    this?
    Thank you,
    Gyorgy

--
    György Balogh
    CTO
    E   gyorgy.bal...@ultinous.com
    M   +36 30 270 8342
    A   HU, 1117 Budapest, Budafoki út 209.
    W   www.ultinous.com



--

György Balogh
CTO
E       gyorgy.bal...@ultinous.com
M       +36 30 270 8342
A       HU, 1117 Budapest, Budafoki út 209.
W       www.ultinous.com
