Hi Amran,

I am also including the public ML, as this may be interesting to
other users as well.

You can always write your input stream to two sinks, as shown below:

DataStream<...> myInputFromKafka =
    env.addSource(new FlinkKafkaConsumerVERSION(...));
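
If you also need the partition/offset of each record (e.g. for the
metadata prefix you mentioned), the consumer can emit them alongside the
payload through a KafkaDeserializationSchema. Below is a minimal sketch;
the class name PartitionOffsetDeserializer is made up for illustration,
and it assumes UTF-8 string payloads and a Tuple3 of (partition, offset,
value) as the element type:

import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema;
import org.apache.kafka.clients.consumer.ConsumerRecord;

// Hypothetical deserializer: emits (partition, offset, value) per record.
public class PartitionOffsetDeserializer
        implements KafkaDeserializationSchema<Tuple3<Integer, Long, String>> {

    @Override
    public Tuple3<Integer, Long, String> deserialize(ConsumerRecord<byte[], byte[]> record) {
        String value = new String(record.value(), StandardCharsets.UTF_8);
        return Tuple3.of(record.partition(), record.offset(), value);
    }

    @Override
    public boolean isEndOfStream(Tuple3<Integer, Long, String> nextElement) {
        return false; // unbounded stream, never stop
    }

    @Override
    public TypeInformation<Tuple3<Integer, Long, String>> getProducedType() {
        return Types.TUPLE(Types.INT, Types.LONG, Types.STRING);
    }
}

You would pass an instance of it to the consumer constructor instead of a
plain SimpleStringSchema, map one branch of the stream to the raw value
for the data sink, and map the other branch to a "partition,offset"
string for the metadata sink.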

final StreamingFileSink<String> sinkA = StreamingFileSink
    .forRowFormat(new Path(outputPathA), new SimpleStringEncoder<>("UTF-8"))
    .withBucketAssigner(assignerA)
    .build();

final StreamingFileSink<String> sinkB = StreamingFileSink
    .forRowFormat(new Path(outputPathB), new SimpleStringEncoder<>("UTF-8"))
    .withBucketAssigner(assignerB)
    .build();

myInputFromKafka.addSink(sinkA);
myInputFromKafka.addSink(sinkB);
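
The bucket assigners control the directory layout under the base path.
As a minimal sketch of what assignerA and assignerB could look like (the
class names below are made up, and it assumes processing-time date
bucketing and that the metadata sink receives "partition,offset" strings
as in the deserializer sketch above):

import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;

// Hypothetical assignerA: raw records go to data/dt=yyyy-MM-dd
public class DataBucketAssigner implements BucketAssigner<String, String> {

    private static final DateTimeFormatter DATE = DateTimeFormatter.ofPattern("yyyy-MM-dd");

    @Override
    public String getBucketId(String element, Context context) {
        String dt = DATE.format(
                Instant.ofEpochMilli(context.currentProcessingTime()).atZone(ZoneOffset.UTC));
        return "data/dt=" + dt;
    }

    @Override
    public SimpleVersionedSerializer<String> getSerializer() {
        return SimpleVersionedStringSerializer.INSTANCE;
    }
}

// Hypothetical assignerB: "partition,offset" lines go to
// metadata/dt=yyyy-MM-dd/partition_x
public class MetadataBucketAssigner implements BucketAssigner<String, String> {

    private static final DateTimeFormatter DATE = DateTimeFormatter.ofPattern("yyyy-MM-dd");

    @Override
    public String getBucketId(String element, Context context) {
        String dt = DATE.format(
                Instant.ofEpochMilli(context.currentProcessingTime()).atZone(ZoneOffset.UTC));
        String partition = element.split(",")[0]; // first field is the Kafka partition
        return "metadata/dt=" + dt + "/partition_" + partition;
    }

    @Override
    public SimpleVersionedSerializer<String> getSerializer() {
        return SimpleVersionedStringSerializer.INSTANCE;
    }
}

If the default date-based layout (yyyy-MM-dd--HH) is acceptable for the
data side, you could also just use the built-in DateTimeBucketAssigner
there instead of a custom one.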

Cheers,
Kostas

---------- Forwarded message ---------
From: amran dean <adfs54545...@gmail.com>
Date: Wed, Oct 16, 2019 at 8:43 PM
Subject: Re: Verifying correctness of StreamingFileSink (Kafka -> S3)
To: Kostas Kloudas <kklou...@apache.org>


Hi Kostas,
Suppose there is a requirement that the current S3 object format cannot
change (to avoid client migration).
Would it be possible to achieve a sort of dual-write, where Kafka
records are written to two separate S3 prefixes:

/bucket/topic/data/dt=2019-10-16/part-x-xx...
- One bucket holds raw record data as clients previously expect

/bucket/topic/metadata/dt=2019-10-16/partition_x/part-x-xx...
- A second bucket holds only offset data, for data integrity
verification purposes via periodic checks (i.e. no "holes")

Does StreamingFileSink permit this?

On Wed, Oct 16, 2019 at 12:34 AM Kostas Kloudas <kklou...@apache.org> wrote:
>
> Hi Amran,
>
> If you want to know which partition your input data comes from,
> you can always have a separate bucket for each partition.
> As described in [1], you can extract the offset/partition/topic
> information for an incoming record and, based on this, decide the
> appropriate bucket in which to put the record.
>
> Cheers,
> Kostas
>
> [1] 
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html
>
> On Wed, Oct 16, 2019 at 4:00 AM amran dean <adfs54545...@gmail.com> wrote:
> >
> > I am evaluating StreamingFileSink (Kafka 0.10.11) as a production-ready 
> > alternative to a current Kafka -> S3 solution.
> >
> > Is there any way to verify the integrity of the data written to S3? I'm
> > confused about how the file names (e.g. part-1-17) map to Kafka partitions,
> > and am further unsure how to ensure that no Kafka records are lost (I know
> > Flink guarantees exactly-once, but this is more of a sanity check).
> >
