Great, thanks for coming back, and I'm glad it works for you!

Piotrek

On Thu, Jul 8, 2021 at 1:34 PM Yik San Chan <evan.chanyik...@gmail.com>
wrote:

> Hi Piotr,
>
> Thanks! I ended up doing option 1, and that works great.
>
> Best,
> Yik San
>
> On Tue, May 25, 2021 at 11:43 PM Piotr Nowojski <pnowoj...@apache.org>
> wrote:
>
>> Hi,
>>
>> You could always buffer records in your sink function/operator until a
>> large enough batch has accumulated, and then upload the whole batch at
>> once. Note that if you want at-least-once or exactly-once semantics, you
>> need to take care of those buffered records one way or another. For
>> example you could:
>> 1. Buffer records in some in-memory data structure (not Flink's state),
>> and make sure those records are flushed to the underlying sink on
>> `CheckpointedFunction#snapshotState()` calls (see the first sketch below).
>> 2. Buffer records in Flink's state (heap state backend or RocksDB - the
>> heap state backend would be the fastest with little overhead, but you risk
>> running out of memory). That easily gives you exactly-once, and your batch
>> could span multiple checkpoints (see the second sketch below).
>> 3. Buffer/write records to temporary files, but in that case keep in mind
>> that those files need to be persisted and recovered in case of failure and
>> restart.
>> 4. Ignore checkpointing and either always restart the job from scratch or
>> accept some occasional data loss.
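>>
>> Here is a minimal sketch of option 1. The `RedisBatchClient` interface
>> and its `batchPut()` call are placeholders for whatever Redis client you
>> actually use, not a real API:
>>
>> import java.util.ArrayList;
>> import java.util.List;
>>
>> import org.apache.flink.configuration.Configuration;
>> import org.apache.flink.runtime.state.FunctionInitializationContext;
>> import org.apache.flink.runtime.state.FunctionSnapshotContext;
>> import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
>> import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
>>
>> public class BufferingRedisSink extends RichSinkFunction<String>
>>         implements CheckpointedFunction {
>>
>>     /** Placeholder for whatever Redis client the job uses. */
>>     public interface RedisBatchClient {
>>         void batchPut(List<String> records);
>>     }
>>
>>     private static final int BATCH_SIZE = 10_000;
>>     private transient List<String> buffer;
>>     private transient RedisBatchClient client;
>>
>>     @Override
>>     public void open(Configuration parameters) {
>>         buffer = new ArrayList<>(BATCH_SIZE);
>>         // client = ...; // connect your Redis client here
>>     }
>>
>>     @Override
>>     public void invoke(String value, Context context) {
>>         buffer.add(value);
>>         if (buffer.size() >= BATCH_SIZE) {
>>             flush();
>>         }
>>     }
>>
>>     @Override
>>     public void snapshotState(FunctionSnapshotContext context) {
>>         // Flush everything still sitting in memory before the checkpoint
>>         // completes, so a failure and restart cannot lose buffered records.
>>         flush();
>>     }
>>
>>     @Override
>>     public void initializeState(FunctionInitializationContext context) {
>>         // Nothing to restore: the in-memory buffer is always empty at the
>>         // moment a checkpoint is taken.
>>     }
>>
>>     private void flush() {
>>         if (!buffer.isEmpty()) {
>>             client.batchPut(buffer);
>>             buffer.clear();
>>         }
>>     }
>> }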
>>
>> FYI, virtually every connector/sink internally batches writes to some
>> extent, usually via option 1.
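>>
>> And a sketch of the state-backed variant from option 2. The difference is
>> that `snapshotState()` persists the pending records into operator state
>> instead of flushing them, so the batch size is decoupled from the
>> checkpoint interval. Again, the `batchPut()` call is a placeholder for
>> your Redis client:
>>
>> import java.util.ArrayList;
>> import java.util.List;
>>
>> import org.apache.flink.api.common.state.ListState;
>> import org.apache.flink.api.common.state.ListStateDescriptor;
>> import org.apache.flink.runtime.state.FunctionInitializationContext;
>> import org.apache.flink.runtime.state.FunctionSnapshotContext;
>> import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
>> import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
>>
>> public class StateBufferingRedisSink extends RichSinkFunction<String>
>>         implements CheckpointedFunction {
>>
>>     private static final int BATCH_SIZE = 10_000;
>>     private final List<String> buffer = new ArrayList<>();
>>     private transient ListState<String> checkpointedBuffer;
>>
>>     @Override
>>     public void invoke(String value, Context context) {
>>         buffer.add(value);
>>         if (buffer.size() >= BATCH_SIZE) {
>>             // redisClient.batchPut(buffer); // placeholder batch call
>>             buffer.clear();
>>         }
>>     }
>>
>>     @Override
>>     public void snapshotState(FunctionSnapshotContext context) throws Exception {
>>         // Persist the not-yet-written records instead of flushing them,
>>         // so the batch can keep growing across checkpoints.
>>         checkpointedBuffer.update(buffer);
>>     }
>>
>>     @Override
>>     public void initializeState(FunctionInitializationContext context) throws Exception {
>>         checkpointedBuffer = context.getOperatorStateStore().getListState(
>>                 new ListStateDescriptor<>("pending-records", String.class));
>>         if (context.isRestored()) {
>>             for (String record : checkpointedBuffer.get()) {
>>                 buffer.add(record);
>>             }
>>         }
>>     }
>> }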
>>
>> Piotrek
>>
>> On Tue, May 25, 2021 at 2:50 PM Yik San Chan <evan.chanyik...@gmail.com>
>> wrote:
>>
>>> Hi community,
>>>
>>> I have a Hive table that stores tens of millions of rows. In my Flink
>>> job, I want to process the data in a batch manner:
>>>
>>> - Split the data into batches of (maybe) 10,000 rows each.
>>> - For each batch, call a batchPut() API on my Redis client to dump the
>>> batch into Redis.
>>>
>>> Doing this row by row in a streaming manner is not acceptable, as it
>>> would cause too many round trips between the Flink workers and Redis.
>>>
>>> Is there a way to do that? I have found few clues in the Flink docs,
>>> since almost all the APIs seem better suited to streaming processing by
>>> default.
>>>
>>> Thank you!
>>>
>>> Best,
>>> Yik San
>>>
>>
